Wednesday, February 22, 2017

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI
A small part of my research is to ensure that certain web pages are preserved in public web archives so that they remain available and retrievable in the future. Because archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as The Internet Archive, WebCite, Perma.cc, and Archive.is. If, for any reason, one archive stops serving temporarily or permanently, it is likely that copies can still be fetched from the other archives. With Archive Now, one command like:
   
$ archivenow --all www.cnn.com

is sufficient for the current CNN homepage to be captured and preserved by all configured archives in this Python library.

Archive Now allows you to accomplish the following major tasks:
  • Push a web page into one archive
  • Push a web page into multiple archives
  • Push a web page into all archives
  • Add new archives
  • Remove existing archives
Install Archive Now from PyPI:
    $ pip install archivenow

To install from the source code:
    $ git clone git@github.com:oduwsdl/archivenow.git
    $ cd archivenow
    $ pip install -r requirements.txt
    $ pip install ./


"pip", "archivenow", and "docker" may require "sudo"

Archive Now can be used through:

   1. The CLI

Usage of sub-commands in archivenow can be accessed through providing the -h or --help flag:
   $ archivenow -h
   usage: archivenow [-h][--cc][--cc_api_key [CC_API_KEY]]
                        [--ia][--is][--wc][-v][--all][--server]
                        [--host [HOST]][--port [PORT]][URI]
   positional arguments:
     URI                   URI of a web resource
   optional arguments:
     -h, --help            show this help message and exit
     --cc                  Use The Perma.cc Archive
     --cc_api_key [CC_API_KEY]
                           An API KEY is required by The Perma.cc
                           Archive
     --ia                  Use The Internet Archive
     --is                  Use The Archive.is
     --wc                  Use The WebCite Archive
     -v, --version         Report the version of archivenow
     --all                 Use all possible archives
     --server              Run archiveNow as a Web Service
     --host [HOST]         A server address
     --port [PORT]         A port number to run a Web Service


Examples:
   
To archive the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com
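
The URL printed by the Internet Archive embeds a 14-digit capture timestamp (YYYYMMDDhhmmss, in UTC) followed by the original URI. The following is a minimal sketch of how that output could be taken apart in Python; the helper name parse_wayback_url is made up for this illustration and is not part of archivenow.

```python
from datetime import datetime
import re

# Pattern for Internet Archive memento URLs of the form
# https://web.archive.org/web/<14-digit timestamp>/<original URI>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(memento_url):
    """Return (capture_datetime, original_uri) for a Wayback Machine URL.

    Hypothetical helper for illustration; archivenow does not provide it.
    """
    match = WAYBACK_RE.match(memento_url)
    if match is None:
        raise ValueError("not a Wayback Machine memento URL: %r" % memento_url)
    timestamp, original_uri = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original_uri


captured, original = parse_wayback_url(
    "https://web.archive.org/web/20170209135625/http://www.foxnews.com"
)
print(captured.isoformat(), original)
```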


By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com


To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive.is:


$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com http://archive.is/fPVyc


To save the web page (www.foxnews.com) in all configured web archives:


$ archivenow --all www.foxnews.com --cc_api_key $Your-Perma-CC-API-Key
https://perma.cc/8YYC-C7RM
https://web.archive.org/web/20170220074919/http://www.foxnews.com
http://archive.is/jy8B0
http://www.webcitation.org/6o9IKD9FP
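
If you want to drive the CLI from a script, the flag combinations shown above can be assembled programmatically. This is only a sketch: build_command and ARCHIVE_FLAGS are names invented here, and the flags are the ones listed in the help output (--ia, --is, --wc, --cc, --all, --cc_api_key).

```python
import shlex

# Archive identifiers mapped to the CLI flags listed in the help output.
ARCHIVE_FLAGS = {"ia": "--ia", "is": "--is", "wc": "--wc", "cc": "--cc"}


def build_command(uri, archives=(), cc_api_key=None):
    """Build the argv list for one archivenow invocation (hypothetical helper)."""
    argv = ["archivenow"]
    if archives == "all":
        argv.append("--all")
    else:
        for archive_id in archives:
            argv.append(ARCHIVE_FLAGS[archive_id])  # KeyError on unknown ids
    if cc_api_key is not None:
        argv.extend(["--cc_api_key", cc_api_key])
    argv.append(uri)
    return argv


cmd = build_command("www.foxnews.com", archives=("ia", "is"))
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list could then be handed to subprocess.run(cmd, capture_output=True, text=True), assuming archivenow is installed and on the PATH.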

Run it as a Docker container (you need to run "docker pull" first):

$ docker pull maturban/archivenow

$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
...


   2. A Web Service

You can run archivenow as a web service, optionally specifying the server address and/or the port number (e.g., --host localhost --port 11111):

$ archivenow --server
  * Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://127.0.0.1:12345/ia/www.foxnews.com

     HTTP/1.0 200 OK
     Content-Type: application/json
     Content-Length: 95
     Server: Werkzeug/0.11.15 Python/2.7.10
     Date: Thu, 09 Feb 2017 14:29:23 GMT

    {
      "results": [
        "https://web.archive.org/web/20170209142922/http://www.foxnews.com"
      ]
    }


To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 172
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Thu, 09 Feb 2017 14:33:47 GMT

    {
      "results": [
        "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
        "http://archive.is/H2Yfg",
        "http://www.webcitation.org/6o9Jubykh",
        "Error (The Perma.cc Archive): An API KEY is required"
      ]
    }


You may pass the Perma.cc API key as follows:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com?cc_api_key=$Your-Perma-CC-API-Key
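
A client of the web service needs to build these request URLs and to tell archived copies apart from error messages in the "results" list. Here is a rough sketch of both steps, run against the sample response shown above rather than a live server. The function names service_url and split_results are assumptions made for this example, as is the rule that error entries start with "Error" (it matches the output in this post).

```python
import json
from urllib.parse import urlencode


def service_url(uri, archive_id="ia", host="127.0.0.1", port=12345, cc_api_key=None):
    """Build a request URL for a locally running archivenow web service."""
    url = "http://%s:%d/%s/%s" % (host, port, archive_id, uri)
    if cc_api_key is not None:
        url += "?" + urlencode({"cc_api_key": cc_api_key})
    return url


def split_results(body):
    """Separate archived-copy URLs from error strings in a JSON response body."""
    results = json.loads(body)["results"]
    # Assumption: error entries start with "Error", as in the sample output.
    errors = [item for item in results if item.startswith("Error")]
    mementos = [item for item in results if not item.startswith("Error")]
    return mementos, errors


# Response body copied from the /all/ example above.
sample_body = """
{
  "results": [
    "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
    "http://archive.is/H2Yfg",
    "http://www.webcitation.org/6o9Jubykh",
    "Error (The Perma.cc Archive): An API KEY is required"
  ]
}
"""
mementos, errors = split_results(sample_body)
print(service_url("www.foxnews.com", "all"))
print(len(mementos), "copies,", len(errors), "error(s)")
```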


   3. Python Usage

>>> from archivenow import archivenow

To save the web page (www.foxnews.com) in The WebCite Archive:

>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']


To save the web page (www.foxnews.com) in all configured archives:


>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']


To save the web page (www.foxnews.com) in The Perma.cc:

>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$Your-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']
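
The examples above push to a single archive or to all of them. To push to a chosen subset from Python and keep track of which archive returned what, a small wrapper along these lines might help. It is a sketch, not part of the library: push_to_many is a made-up name, and it takes the push function as a parameter so that a stand-in (fake_push, also made up) can be used here to run the example without network access.

```python
def push_to_many(uri, archive_ids, push):
    """Call push(uri, archive_id) for each identifier and group the results.

    `push` is expected to behave like archivenow.push as shown above:
    it returns a list of strings (archived-copy URLs or "Error ..." messages).
    """
    outcome = {}
    for archive_id in archive_ids:
        try:
            outcome[archive_id] = push(uri, archive_id)
        except Exception as exc:  # keep going if one archive fails outright
            outcome[archive_id] = ["Error (%s): %s" % (archive_id, exc)]
    return outcome


def fake_push(uri, archive_id):
    """Stand-in for archivenow.push so the sketch runs without network access."""
    if archive_id == "wc":
        raise RuntimeError("archive unreachable")
    return ["http://example.org/%s/%s" % (archive_id, uri)]


report = push_to_many("www.foxnews.com", ["ia", "is", "wc"], fake_push)
for archive_id, results in report.items():
    print(archive_id, results)
```

With the library installed, archivenow.push would be passed in place of fake_push.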


To start the server from Python, do the following. The host and port number can be passed (e.g., start(port=1111, host='localhost')):

>>> archivenow.start()

* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Adding a new archive is as simple as adding a handler file to the folder "handlers". For example, to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The prefix "ma" becomes the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python, I would write archivenow.push("www.cnn.com","ma"). In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push", which takes one argument. It might be helpful to see how the other "*_handler.py" files are organized.
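
As a rough illustration, a handler file for "My Archive" might look like the sketch below. Only the class name and the one-argument "push" function come from the description above. The "name" attribute, the placement of "enabled" on the instance, and the placeholder endpoint are assumptions, so compare against an existing handler before relying on them.

```python
class MA_handler(object):
    """Hypothetical handler for an archive called "My Archive" (ma_handler.py)."""

    def __init__(self):
        # This post mentions a variable "enabled" that can be set to False to
        # disable an archive. Placing it on the instance is an assumption.
        self.enabled = True
        self.name = "My Archive"  # assumed display name, not confirmed by the post

    def push(self, uri):
        """Submit `uri` to the archive and return the URL of the archived copy."""
        # A real handler would send an HTTP request to the archive here.
        # The endpoint below is a placeholder, not a real service.
        return "https://my-archive.example/save/" + uri


handler = MA_handler()
if handler.enabled:
    print(handler.push("www.cnn.com"))
```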

Removing an archive can be done in one of the following ways:
  • Removing the archive handler file from the folder "handlers"
  • Renaming the archive handler file to a name that does not end with "_handler.py"
  • Setting the variable "enabled" to "False" inside the handler file
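
The first two options work because an archive is recognized by its file name. The sketch below shows one way such name-based discovery could work; it is not the library's actual code, and discover_archive_ids is a name invented for this example. It runs against a throwaway folder rather than the real "handlers" folder.

```python
import os
import tempfile

SUFFIX = "_handler.py"


def discover_archive_ids(handlers_dir):
    """Return the identifiers of handler files found in `handlers_dir`."""
    ids = []
    for filename in sorted(os.listdir(handlers_dir)):
        if filename.endswith(SUFFIX):
            ids.append(filename[: -len(SUFFIX)])  # "ma_handler.py" -> "ma"
    return ids


# Demonstrate on a temporary folder; the ".bak" file stands for a renamed handler.
with tempfile.TemporaryDirectory() as handlers_dir:
    for filename in ("ia_handler.py", "ma_handler.py", "wc_handler.py.bak"):
        open(os.path.join(handlers_dir, filename), "w").close()
    found = discover_archive_ids(handlers_dir)

print(found)
```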

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new memento (let's call it M1) of the CNN homepage. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02pm. The Archive.is sets this time gap to five minutes.
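
To make the effect of these time gaps concrete, the following sketch simulates which memento would be served for a series of requests for the same resource. It models only the behavior described in the note above (two minutes for the IA, five for Archive.is); the function name mementos_returned and the request times are made up for the example.

```python
from datetime import datetime, timedelta

# Minimum gaps stated in the note above.
MIN_GAP = {"ia": timedelta(minutes=2), "is": timedelta(minutes=5)}


def mementos_returned(request_times, gap):
    """For each request time, return the capture time of the memento served."""
    served = []
    last_capture = None
    for when in sorted(request_times):
        if last_capture is None or when - last_capture >= gap:
            last_capture = when  # far enough apart: a new memento is created
        served.append(last_capture)
    return served


start = datetime(2017, 2, 22, 22, 0)  # 10:00pm
requests_at = [start, start + timedelta(seconds=30), start + timedelta(minutes=3)]
for archive_id in ("ia", "is"):
    captures = mementos_returned(requests_at, MIN_GAP[archive_id])
    print(archive_id, len(set(captures)), "distinct memento(s)")
```

In this simulation the request sent three minutes after the first one yields a second memento at the IA, but still returns the first one at Archive.is.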

Updates and pull requests are welcome: https://github.com/oduwsdl/archivenow

--Mohamed Aturban
