2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI

A small part of my research is to ensure that certain web pages are preserved in public web archives so that they remain available and retrievable in the future. As archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as The Internet Archive, WebCite, Perma.cc, and Archive.is. If, for any reason, one archive stops serving temporarily or permanently, it is likely that copies can be fetched from the other archives. With Archive Now, one command like:

$ archivenow --all www.cnn.com

is sufficient for the current CNN homepage to be captured and preserved by all configured archives in this Python library.

Archive Now allows you to accomplish the following major tasks:

Push a web page into one archive
Push a web page into multiple archives
Push a web page into all archives
Add new archives
Remove existing archives

Install Archive Now from PyPI:
$ pip install archivenow

To install from the source code:
$ git clone git@github.com:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./

"pip", "archivenow", and "docker" may require "sudo"

Archive Now can be used through:

The CLI (or A Docker Container)
A Web Service
Python code

1. The CLI

Usage information for archivenow and its options is available through the -h or --help flag:
$ archivenow -h
usage: archivenow [-h] [--cc] [--cc_api_key [CC_API_KEY]]
                  [--ia] [--is] [--wc] [-v] [--all] [--server]
                  [--host [HOST]] [--port [PORT]] [URI]

positional arguments:
  URI                   URI of a web resource

optional arguments:
  -h, --help            show this help message and exit
  --cc                  Use The Perma.cc Archive
  --cc_api_key [CC_API_KEY]
                        An API KEY is required by The Perma.cc
                        Archive
  --ia                  Use The Internet Archive
  --is                  Use The Archive.is
  --wc                  Use The WebCite Archive
  -v, --version         Report the version of archivenow
  --all                 Use all possible archives
  --server              Run archiveNow as a Web Service
  --host [HOST]         A server address
  --port [PORT]         A port number to run a Web Service

Examples:

To archive the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com

By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com

To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive.is:

$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com
http://archive.is/fPVyc

To save the web page (www.foxnews.com) in all configured web archives:

$ archivenow --all www.foxnews.com --cc_api_key $Your-Perma-CC-API-Key
https://perma.cc/8YYC-C7RM
https://web.archive.org/web/20170220074919/http://www.foxnews.com
http://archive.is/jy8B0
http://www.webcitation.org/6o9IKD9FP
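
For scripting, the command-line calls above can also be driven from Python. The following is only a minimal sketch and not part of archivenow itself: the helper names build_command and run_archivenow are hypothetical, and it assumes that the "archivenow" executable is on the PATH and prints one result per line, as in the examples above. Only flags listed in the help output are used.

```python
import subprocess

# Archive identifiers taken from the help output (--ia, --is, --wc, --cc).
KNOWN_ARCHIVES = ("ia", "is", "wc", "cc")


def build_command(uri, archives=("ia",), cc_api_key=None):
    """Build the argument list for one archivenow call (hypothetical helper)."""
    command = ["archivenow"]
    if archives == "all":
        command.append("--all")
    else:
        for archive_id in archives:
            if archive_id not in KNOWN_ARCHIVES:
                raise ValueError("unknown archive identifier: " + archive_id)
            command.append("--" + archive_id)
    if cc_api_key is not None:
        command += ["--cc_api_key", cc_api_key]
    command.append(uri)
    return command


def run_archivenow(uri, archives=("ia",), cc_api_key=None):
    """Run the CLI and return its non-empty output lines.

    Assumes one result (a URI or an error message) per line.
    """
    completed = subprocess.run(
        build_command(uri, archives, cc_api_key),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]
```

For example, build_command("www.foxnews.com", archives=("ia", "is")) produces the same argument list as the "--ia --is" command shown earlier.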

To run it as a Docker container (you need to do "docker pull" first):

$ docker pull maturban/archivenow

$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
...

2. A Web Service

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 11111):

$ archivenow --server
* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://127.0.0.1:12345/ia/www.foxnews.com

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 95
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:29:23 GMT

{
"results": [
"https://web.archive.org/web/20170209142922/http://www.foxnews.com"
]
}

To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 172
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:33:47 GMT

{
"results": [
"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
"http://archive.is/H2Yfg",
"http://www.webcitation.org/6o9Jubykh",
"Error (The Perma.cc Archive): An API KEY is required"
]
}

You may pass the Perma.cc API key as follows:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com?cc_api_key=$Your-Perma-CC-API-Key
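
The same requests can be issued from a small Python client instead of curl. This is a sketch under stated assumptions: the function names build_service_uri, parse_results, and push_via_service are hypothetical, the service is assumed to be running on the default address shown above, and the response is assumed to be the JSON object with a "results" list shown in the examples. The target URI is appended to the path as-is, mirroring the curl calls.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_service_uri(uri, archive_id="ia", cc_api_key=None,
                      host="127.0.0.1", port=12345):
    """Build a request URI of the form http://host:port/<archive>/<uri>."""
    request_uri = "http://{0}:{1}/{2}/{3}".format(host, port, archive_id, uri)
    if cc_api_key is not None:
        # The Perma.cc key is passed as the cc_api_key query parameter.
        request_uri += "?" + urlencode({"cc_api_key": cc_api_key})
    return request_uri


def parse_results(body):
    """Return the "results" list from a JSON response body (str or bytes)."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)["results"]


def push_via_service(uri, archive_id="ia", cc_api_key=None):
    """Ask a locally running archivenow web service to archive uri."""
    request_uri = build_service_uri(uri, archive_id, cc_api_key)
    with urllib.request.urlopen(request_uri) as response:
        return parse_results(response.read())
```

With the service running, push_via_service("www.foxnews.com", "all") would correspond to the second curl call above.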

3. Python Usage

>>> from archivenow import archivenow

To save the web page (www.foxnews.com) in The WebCite Archive:

>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']

To save the web page (www.foxnews.com) in all configured archives:

>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']

To save the web page (www.foxnews.com) in The Perma.cc:

>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$Your-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']
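
Because push returns archived URIs and error messages in the same list, it can be useful to separate them. The helper below is a hypothetical sketch (split_results is not part of archivenow); it assumes that every failure is reported as a string starting with "Error (", as in the outputs above. The sample list is copied from the "all" example rather than produced by a live call.

```python
def split_results(results):
    """Split a push() result list into (archived URIs, error messages)."""
    mementos, errors = [], []
    for item in results:
        # Assumption: failures are strings that start with "Error (".
        if item.startswith("Error ("):
            errors.append(item)
        else:
            mementos.append(item)
    return mementos, errors


# Sample data copied from the example output above.
sample = [
    "https://web.archive.org/web/20170209145930/http://www.foxnews.com",
    "http://archive.is/oAjuM",
    "http://www.webcitation.org/6o9LcQoVV",
    "Error (The Perma.cc Archive): An API KEY is required",
]
mementos, errors = split_results(sample)
print(len(mementos), "copies created;", len(errors), "failed")
# prints: 3 copies created; 1 failed
```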

To start the server from Python, do the following. The host and port number can be passed (e.g., start(port=1111, host='localhost')):
>>> archivenow.start()
* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Adding a new archive is as simple as adding a handler file to the folder "handlers". For example, to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python code, I would write archivenow.push("www.cnn.com","ma"). In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push", which takes one argument. It might be helpful to see how the other "*_handler.py" files are organized.

Removing an archive can be done by one of the following options:

Remove the archive handler file from the folder "handlers"
Rename the archive handler file to a name that does not end with "_handler.py"
Inside the handler file, set the variable "enabled" to "False"
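
As a rough illustration of the handler file described above, a "ma_handler.py" might look like the following. This is a sketch, not a copy of a shipped handler: the save endpoint "https://my-archive.example/save/" is made up, the "name" attribute is an assumption, and placing "enabled" on the instance is a guess at where that variable lives. Only the class name "MA_handler", the "push" function with one argument, and the "enabled" switch come from the description above, and the code is written for Python 3. Check the existing "*_handler.py" files for the exact layout the library expects.

```python
import urllib.request
from urllib.parse import quote


class MA_handler(object):
    """Hypothetical handler for an archive called "My Archive"."""

    def __init__(self):
        # Set to False to disable this archive without deleting the file.
        self.enabled = True
        # Assumed attribute, used here only to label error messages.
        self.name = "My Archive"
        # Made-up endpoint; a real archive defines its own submission URI.
        self.save_endpoint = "https://my-archive.example/save/"

    def push(self, uri_org):
        """Submit uri_org and return the archived URI or an error message."""
        try:
            request_uri = self.save_endpoint + quote(uri_org, safe=":/")
            with urllib.request.urlopen(request_uri, timeout=30) as response:
                # Assumption: the archive redirects to the new copy.
                return response.geturl()
        except Exception as error:
            return "Error ({0}): {1}".format(self.name, error)
```

Returning an "Error (...)" string instead of raising mirrors the error entries that appear in the result lists shown earlier.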

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new memento (let's call it M1) of the CNN homepage. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02pm. The Archive.is sets this time gap to five minutes.
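
The time gap can be checked from the URI that the Internet Archive returns, since it embeds a 14-digit capture timestamp (YYYYMMDDhhmmss, in UTC). The sketch below uses hypothetical helper names (memento_datetime, next_capture_time) and treats the two-minute gap as a lower bound, so the computed time is the earliest moment at which a repeated request could yield a new copy.

```python
import re
from datetime import datetime, timedelta

# Matches URIs such as https://web.archive.org/web/20170209135625/http://...
IA_MEMENTO = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/")
IA_TIME_GAP = timedelta(minutes=2)


def memento_datetime(urim):
    """Return the capture time (naive datetime, UTC) embedded in an IA URI."""
    match = IA_MEMENTO.match(urim)
    if match is None:
        raise ValueError("not an Internet Archive memento URI: " + urim)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def next_capture_time(urim, gap=IA_TIME_GAP):
    """Earliest time at which a repeated request could create a new copy."""
    return memento_datetime(urim) + gap


m1 = "https://web.archive.org/web/20170209135625/http://www.foxnews.com"
print(next_capture_time(m1))
# prints: 2017-02-09 13:58:25
```

For Archive.is, the same idea would use a five-minute gap, but its short URIs (e.g., http://archive.is/fPVyc) do not carry a timestamp, so the capture time would have to be recorded by the caller.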

Updates and pull requests are welcome: https://github.com/oduwsdl/archivenow

--Mohamed Aturban

2017-11-13 Edit:

We added a new UI page to "archivenow". It will show up when running "archivenow" as a web service. This page allows users to submit a URL to selected archives.

You can install and run "archivenow" as a web service using the "pip" command or a Docker container (you can change the default IP and/or port number):

As a Docker container:

$ docker run -p 22222:11111 -it --rm maturban/archivenow --server --port 11111 --host 0.0.0.0

Using pip:

$ pip install archivenow
$ archivenow --server
