2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives
Examples: Archive Now (archivenow) CLI |
$ archivenow --all www.cnn.com
is sufficient for the current CNN homepage to be captured and preserved by all configured archives in this Python library.
Archive Now allows you to accomplish the following major tasks:
- A web page can be pushed into one archive
- A web page can be pushed into multiple archives
- A web page can be pushed into all archives
- Adding new archives
- Removing existing archives
$ pip install archivenow
To install from the source code:
$ git clone git@github.com:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./
"pip", "archivenow", and "docker" may require "sudo"
Archive Now can be used through:
- The CLI (or A Docker Container)
- A Web Service
- Python code
1. The CLI
Usage of sub-commands in archivenow can be accessed through providing the -h or --help flag:
$ archivenow -h
usage: archivenow [-h][--cc][--cc_api_key [CC_API_KEY]]
[--ia][--is][--wc][-v][--all][--server]
[--host [HOST]][--port [PORT]][URI]
positional arguments:
URI URI of a web resource
optional arguments:
-h, --help show this help message and exit
--cc Use The Perma.cc Archive
--cc_api_key [CC_API_KEY]
An API KEY is required by The Perma.cc
Archive
--ia Use The Internet Archive
--is Use The Archive.is
--wc Use The WebCite Archive
-v, --version Report the version of archivenow
--all Use all possible archives
--server Run archiveNow as a Web Service
--host [HOST] A server address
--port [PORT] A port number to run a Web Service
Examples:
To archive the web page (www.foxnews.com) in the Internet Archive:
$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com
By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments provided:
$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com
To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive.is:
$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com http://archive.is/fPVyc
To save the web page (www.foxnews.com) in all configured web archives:
$ archivenow --all www.foxnews.com --cc_api_key $Your-Perma-CC-API-Key
https://perma.cc/8YYC-C7RM
https://web.archive.org/web/20170220074919/http://www.foxnews.com
http://archive.is/jy8B0
http://www.webcitation.org/6o9IKD9FP
Run it as a Docker Container (you need to do "docker pull" first)
$ docker pull maturban/archivenow
$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
...
2. A Web Service
You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 11111)
$ archivenow --server
* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)
To save the web page (www.foxnews.com) in The Internet Archive through the web service:
$ curl -i http://127.0.0.1:12345/ia/www.foxnews.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 95
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:29:23 GMT
{
"results": [
"https://web.archive.org/web/20170209142922/http://www.foxnews.com"
]
}
To save the web page (www.foxnews.com) in all configured archives though the web service:
$ curl -i http://127.0.0.1:12345/all/www.foxnews.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 172
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:33:47 GMT
{
"results": [
"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
"http://archive.is/H2Yfg",
"http://www.webcitation.org/6o9Jubykh",
"Error (The Perma.cc Archive): An API KEY is required"
]
}
you may use the Perma.cc API_Key as following:
$ curl -i http://127.0.0.1:12345/all/www.foxnews.com?cc_api_key=$Your-Perma-CC-API-Key
3. Python Usage
>>> from archivenow import archivenow
To save the web page (www.foxnews.com) in The WebCite Archive:
>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']
To save the web page (www.foxnews.com) in all configured archives:
>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required]
To save the web page (www.foxnews.com) in The Perma.cc:
>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$Your-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']
To start the server from Python do the following. The server/port number can be passed (e.g,
start(port=1111, host='localhost')):
>>> archivenow.start()
* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)
Configuring a new archive or removing existing one
Adding a new archive is as simple as adding a handler file in the folder "handlers". For example, if I want to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write ">>>archivenow.push("www.cnn.com","ma")". In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push" which has one argument. It might be helpful to see how other "*_handler.py" organized.
Removing an archive can be done by one of the following options:
- Removing the archive handler file from the folder "handlers"
- Rename the archive handler file to other name that does not end with "_handler.py"
- Simply, inside the handler file, set the variable "enabled" to "False"
Notes
The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new memento (let's call it M1) of the CNN homepage. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02pm. The Archive.is sets this time gap to five minutes.
Updates and pull requests are welcome: https://github.com/oduwsdl/archivenow
--Mohamed Aturban
2017-11-13 Edit:
We added a new UI page to "archivenow". It will show up when running "archivenow" as a web service. This page allows users to submit a URL to selected archives as shown below:
You can install and run "archivenow" as a web service using the "pip" command or a Docker container (you can change the default IP and/or port number):
As a Docker container:
$ docker run -p 22222:11111 -it --rm maturban/archivenow --server --port 11111 --host 0.0.0.0
Using pip:
$ pip install archivenow
$ archivenow --server
Comments
Post a Comment