Wednesday, February 22, 2017

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI
A small part of my research is to ensure that certain web pages are preserved in public web archives so that they remain available and retrievable whenever needed in the future. As archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as The Internet Archive, WebCite, Perma.cc, and Archive.is. If, for any reason, one archive stops serving content temporarily or permanently, it is likely that copies can be fetched from the other archives. With Archive Now, one command like:
   
$ archivenow --all www.cnn.com

is sufficient for the current CNN homepage to be captured and preserved by all configured archives in this Python library.

Archive Now allows you to accomplish the following major tasks:
  • A web page can be pushed into one archive
  • A web page can be pushed into multiple archives
  • A web page can be pushed into all archives  
  • New archives can be added
  • Existing archives can be removed
Install Archive Now from PyPI:
    $ pip install archivenow

To install from the source code:
    $ git clone git@github.com:oduwsdl/archivenow.git
    $ cd archivenow
    $ pip install -r requirements.txt
    $ pip install ./


"pip", "archivenow", and "docker" may require "sudo"

Archive Now can be used through:

   1. The CLI

Usage information for archivenow is available through the -h or --help flag:
   $ archivenow -h
   usage: archivenow [-h][--cc][--cc_api_key [CC_API_KEY]]
                        [--ia][--is][--wc][-v][--all][--server]
                        [--host [HOST]][--port [PORT]][URI]
   positional arguments:
     URI                   URI of a web resource
   optional arguments:
     -h, --help            show this help message and exit
     --cc                  Use The Perma.cc Archive
     --cc_api_key [CC_API_KEY]
                           An API KEY is required by The Perma.cc
                           Archive
     --ia                  Use The Internet Archive
     --is                  Use The Archive.is
     --wc                  Use The WebCite Archive
     -v, --version         Report the version of archivenow
     --all                 Use all possible archives
     --server              Run archiveNow as a Web Service
     --host [HOST]         A server address
     --port [PORT]         A port number to run a Web Service


Examples:
   
To archive the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com


By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com


To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive.is:


$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com http://archive.is/fPVyc


To save the web page (www.foxnews.com) in all configured web archives:


$ archivenow --all www.foxnews.com --cc_api_key $Your-Perma-CC-API-Key
https://perma.cc/8YYC-C7RM
https://web.archive.org/web/20170220074919/http://www.foxnews.com
http://archive.is/jy8B0
http://www.webcitation.org/6o9IKD9FP

Run it as a Docker Container (you need to do "docker pull" first)

$ docker pull maturban/archivenow

$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
...


   2. A Web Service

You can run archivenow as a web service, optionally specifying the server address and/or the port number (e.g., --host localhost --port 11111).

$ archivenow --server
  * Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://127.0.0.1:12345/ia/www.foxnews.com

     HTTP/1.0 200 OK
     Content-Type: application/json
     Content-Length: 95
     Server: Werkzeug/0.11.15 Python/2.7.10
     Date: Thu, 09 Feb 2017 14:29:23 GMT

    {
      "results": [
        "https://web.archive.org/web/20170209142922/http://www.foxnews.com"
      ]
    }


To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 172
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Thu, 09 Feb 2017 14:33:47 GMT

    {
      "results": [
        "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
        "http://archive.is/H2Yfg",
        "http://www.webcitation.org/6o9Jubykh",
        "Error (The Perma.cc Archive): An API KEY is required"
      ]
    }


You can pass the Perma.cc API key as follows:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com?cc_api_key=$Your-Perma-CC-API-Key
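The same requests can be issued from a script instead of curl. The following is a minimal Python 3 sketch, assuming the service is running at the default address shown above and answers with the JSON body shown in the examples. The helper names (build_service_url, parse_results, archive_via_service) are hypothetical and are not part of archivenow.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_service_url(uri, archive="ia", host="127.0.0.1", port=12345,
                      cc_api_key=None):
    """Build a request URL of the form http://host:port/<archive>/<uri>."""
    url = "http://{}:{}/{}/{}".format(host, port, archive, uri)
    if cc_api_key is not None:
        # The Perma.cc key travels as a query parameter, as in the curl example.
        url += "?" + urlencode({"cc_api_key": cc_api_key})
    return url


def parse_results(body):
    """Return the "results" list from the JSON body sent back by the service."""
    return json.loads(body)["results"]


def archive_via_service(uri, archive="ia", **kwargs):
    """Send the request to a running archivenow service and return its results."""
    with urlopen(build_service_url(uri, archive, **kwargs)) as response:
        return parse_results(response.read().decode("utf-8"))
```

With the service started via "archivenow --server", calling archive_via_service("www.foxnews.com", "all") would send the same request as the second curl example.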


   3. Python Usage

>>> from archivenow import archivenow

To save the web page (www.foxnews.com) in The WebCite Archive:

>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']


To save the web page (www.foxnews.com) in all configured archives:


>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']


To save the web page (www.foxnews.com) in The Perma.cc:

>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$Your-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']
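
Because push returns successful captures and error messages in the same list, a caller may want to separate them. The sketch below is one way to do that, assuming (as in the outputs above) that failures are reported as strings beginning with "Error". The function name push_and_partition is hypothetical, and the push function is passed in as a parameter so the logic can be tried without contacting any archive.

```python
def push_and_partition(uri, archive_ids, push):
    """Push uri to each archive and split the replies into mementos and errors.

    push is any callable with the shape push(uri, archive_id) -> list of str,
    for example archivenow.push.
    """
    mementos, errors = [], []
    for archive_id in archive_ids:
        for result in push(uri, archive_id):
            # Failures come back as strings such as
            # "Error (The Perma.cc Archive): An API KEY is required".
            if result.startswith("Error"):
                errors.append(result)
            else:
                mementos.append(result)
    return mementos, errors
```

For example, push_and_partition("www.foxnews.com", ["ia", "is", "wc"], archivenow.push) would return the memento URLs in the first list and any error messages in the second.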


To start the server from Python, do the following. The host and port number can be passed (e.g., start(port=1111, host='localhost')):
>>> archivenow.start()

* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Adding a new archive is as simple as adding a handler file to the folder "handlers". For example, if I want to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write ">>> archivenow.push("www.cnn.com","ma")". In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push", which takes one argument. It might be helpful to see how the other "*_handler.py" files are organized.
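
As a rough illustration, a "ma_handler.py" file could look like the skeleton below. Only the class name "MA_handler" and the one-argument "push" function come from the description above; the attributes, the return value, and the endpoint are assumptions made for this sketch, so the existing handler files remain the authoritative template.

```python
class MA_handler(object):
    """Handler for a hypothetical archive called "My Archive" (identifier "ma")."""

    def __init__(self):
        # Assumption: the on/off switch is modelled here as an instance
        # attribute; check an existing handler file for the real layout.
        self.enabled = True
        self.name = "My Archive"  # hypothetical display name

    def push(self, uri):
        """Ask the archive to capture uri and return the resulting memento URL."""
        # Placeholder: a real handler would send an HTTP request to the
        # archive's capture endpoint and read the memento URL from the reply.
        # The endpoint below is made up for illustration.
        return "https://my-archive.example/capture/" + uri
```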

Removing an archive can be done by one of the following options:
  • Removing the archive handler file from the folder "handlers"
  • Renaming the archive handler file to another name that does not end with "_handler.py"
  • Simply, inside the handler file, set the variable "enabled" to "False" 
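
The naming convention explains why renaming a file is enough to remove an archive. The sketch below is not archivenow's loader; it only models the convention described above, and it assumes that the class name is always the upper-cased identifier followed by "_handler", a rule inferred from the single "ma" / "MA_handler" example. The function name discover_handlers is hypothetical.

```python
SUFFIX = "_handler.py"


def discover_handlers(filenames):
    """Map archive identifiers to the class names expected in each handler file."""
    handlers = {}
    for filename in filenames:
        # Only files ending in "_handler.py" (with a non-empty identifier) count.
        if filename.endswith(SUFFIX) and len(filename) > len(SUFFIX):
            identifier = filename[:-len(SUFFIX)]
            handlers[identifier] = identifier.upper() + "_handler"
    return handlers
```

A file renamed to, say, "ma_handler.py.bak" no longer matches the suffix and is therefore skipped.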

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new memento (let's call it M1) of the CNN homepage. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02pm. The Archive.is sets this time gap to five minutes.
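
The effect of the time gap can be modelled in a few lines. This is only a sketch of the behavior described in the note, not of how either archive implements it; the gap values are the ones quoted above and may change, and the function name creates_new_memento is hypothetical.

```python
from datetime import datetime, timedelta

# Gaps as described in this post; the archives may change them at any time.
TIME_GAPS = {
    "ia": timedelta(minutes=2),  # The Internet Archive
    "is": timedelta(minutes=5),  # Archive.is
}


def creates_new_memento(archive_id, last_capture, request_time):
    """Return True if a request at request_time would produce a new capture.

    last_capture is the datetime of the most recent memento of the same
    resource, or None if the resource has not been captured before.
    """
    if last_capture is None:
        return True
    # Requests arriving before the gap has elapsed get the existing memento.
    return request_time - last_capture >= TIME_GAPS[archive_id]
```

In the CNN example, a request at 10:01pm would return M1, while a request at 10:02pm or later would create a new memento.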

Updates and pull requests are welcome: https://github.com/oduwsdl/archivenow

--Mohamed Aturban

Monday, February 13, 2017

2017-02-13: Electric WAILs and Ham

Mat Kelly recently posted Lipstick or Ham: Next Steps For WAIL in which he spoke about the past, present, and potential future for WAIL. Web Archiving Integration Layer (WAIL) is a tool that seeks to address the disparity between institutional and individual archiving tools by providing one-click configuration and utilization of both Heritrix and Wayback from a user's personal computer. I am here to speak on the realization of WAIL's future by introducing WAIL-Electron.

WAIL-Electron



WAIL has been completely rewritten, moving from a Python application to an Electron application built with modern Web technologies. Electron combines a Chromium (Chrome) browser with Node.js, allowing native desktop applications to be created using only HTML, CSS, and JavaScript.

The move to Electron has brought with it many improvements, the most important of which is the ability to update and package WAIL for the three major operating systems: Linux, MacOS, and Windows. Support for these operating systems is easily achieved with the packaging utility used (electron-packager), which allows one to produce the binary for a specific system. Also thanks to this move, the directory structure issue mentioned by Mat in his post has been resolved. Electron applications have their own directory structure inside the OS-specific application directory path, accessible via Electron's API. This is where the packager places the tools WAIL makes available for use.


Electric Ham


The meat of this revision is the addition of new functionality to WAIL alongside the tools it already made available, namely Heritrix and Wayback. This new functionality comes in two parts. First, WAIL is now collection-centric. The previous revision, WAIL Python, added the WARC files created through WAIL to a single archive. This archive was an ambiguous collection of sorts where users had to create their own means of associating the WARCs with each other. Initially, this was a beneficial feature: it allowed users to archive what they saw at any given instance and replay the preserved page immediately. But updates to WAIL could not be justified if they did not build upon the existing functionality, which is why the concept of personal collection-based archiving was introduced.

Collections

WAIL now gives users the ability to curate personalized web archive collections, akin to the subscription service Archive-It, except on their local machines. By default, WAIL comes with an initial collection and allows for the creation of additional collections.


The Collections screen displays the collections created through WAIL. This view displays the collection name along with some summary information about it.

  • Seeds: The number of seeds associated with the collection
  • Last Updated: The last time (date and time) the collection was updated
  • Size: The size of the collection on the file system

Creation of a collection is as simple as clicking the New Collection button available on the Collections (home) screen of WAIL. After doing so, a dialog will appear from which users can specify the name, title, and description for the collection. Once these fields have been filled in, WAIL will create the collection that users can access from the Collections View.


The Collection View displays the information about each seed contained in the collection:

  • Seed URL: The URL
  • Added: The date and time it was added to the Collection
  • Last Archived: The last time it was archived through WAIL
  • Mementos: The number of Mementos for the seed in the collection

along with a link for viewing the seed in Wayback.

Seeds can be added to a collection from either the live web or from WARC files present on the filesystem. To aid in the process of adding a seed from the live web, WAIL provides the user with the ability to "check" the seed before archiving.


The check provides summary information about the seed that includes the HTTP status code and a report on the embedded resources contained in the page. This lets users choose an archive configuration before starting WAIL's archival process to configure and launch a Heritrix crawl.
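
To give a feel for what such a check involves, the sketch below builds a small summary from an HTTP status code and a page's HTML. It is not WAIL's code (WAIL is a JavaScript application); the choice of which tags count as embedded resources and the names EmbeddedResourceCounter and summarize_seed are assumptions made for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class EmbeddedResourceCounter(HTMLParser):
    """Count tags that reference an embedded resource through a URL attribute."""

    # Tag -> attribute carrying the resource URL (an illustrative selection).
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        # Inline scripts and tags without a URL do not count as embedded resources.
        if wanted and dict(attrs).get(wanted):
            self.counts[tag] += 1


def summarize_seed(status_code, html):
    """Return the status code together with a per-tag count of embedded resources."""
    counter = EmbeddedResourceCounter()
    counter.feed(html)
    return {"status": status_code, "embedded": dict(counter.counts)}
```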

To add a seed from the filesystem, all the user has to do is drag and drop the (W)ARC file into the corresponding interface for that functionality. WAIL will process the (W)ARC file and display a list of potential seeds discovered.



WAIL cannot automatically determine the seed due to the nature of (W)ARC files. Rather, WAIL uses heuristics on the contents of the (W)ARC file to determine which entries are valid candidates for the seed URL. From this display, the user chooses the correct one. WAIL will then add the seed to the collection, and it will be available for replay from the Collection View.

Twitter Archiving

The second added functionality is the ability to monitor and archive Twitter content automatically. This was made possible thanks to the scholarship I received for preserving online news. There are two options for the Twitter archival feature implemented in WAIL. The first is monitoring a user’s timeline for tweets which were tweeted after the monitoring has started with the option of selecting only the tweets containing hashtags specified during configuration. The second, a slight variation of the first, will only archive tweets that have specific keywords in the tweet’s body as specified during configuration.
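
The selection rules for the two options can be sketched as a single predicate. This is an illustration of the rules as described above, written in Python rather than WAIL's JavaScript; the Tweet record and the should_archive function are hypothetical and do not reflect Twitter's API or WAIL's internal data model.

```python
import re
from collections import namedtuple
from datetime import datetime

# Hypothetical minimal record for a tweet.
Tweet = namedtuple("Tweet", ["text", "created_at"])


def should_archive(tweet, monitoring_started, hashtags=(), keywords=()):
    """Decide whether a tweet from a monitored timeline should be archived."""
    # Only tweets posted after monitoring began are considered.
    if tweet.created_at <= monitoring_started:
        return False
    if hashtags:
        found = {tag.lower() for tag in re.findall(r"#(\w+)", tweet.text)}
        wanted = {tag.lstrip("#").lower() for tag in hashtags}
        if not found & wanted:
            return False
    if keywords:
        body = tweet.text.lower()
        if not any(keyword.lower() in body for keyword in keywords):
            return False
    return True
```

Leaving both hashtags and keywords empty archives every new tweet on the timeline; supplying one of them narrows the selection as in the two options above.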

What makes this unique is how WAIL preserves this content. Before this addition, WAIL utilized Heritrix as the primary preservation means. Heritrix executes HTTP GET requests to retrieve the target web page and archives the HTTP response headers and the content returned from the server. The embedded JavaScript of the web page is not executed, potentially decreasing the fidelity of the capture. This is problematic when archiving Twitter content since the rendering of tweets is done exclusively through client-side JavaScript.

To address this, WAIL utilizes the native Chromium browser provided by Electron in conjunction with WARCreate. WARCreate was modified so that it works inside of Electron and no longer needs human intervention to decide when to generate the WARC. By integrating WARCreate into WAIL, the archival process of Twitter content has been simplified to loading the URL of the tweet into the browser and waiting until the browser indicates that the page has been rendered in its entirety. Then the archival process through WARCreate is initiated. Once the WARC has been generated, it is added to the collection specified by the user.

Putting on Lipstick

As mentioned in Mat's blog post, the UI for WAIL-Python needed an update not only for its maintainability but also for a cohesive user experience across supported platforms. At the time of starting this revision of WAIL, the choices available for the front-end framework as seen on GitHub were plentiful. It simply boiled down to choosing the one that had the "least" painful setup and deployment process, with a learning curve such that any person taking over the project could be brought up to speed with minimal effort.

With this in mind, React was chosen for WAIL's UI library; it is unopinionated about other technologies which may be used alongside it and features a large, production-tested ecosystem with an active developer community. React is only a view library, which is why WAIL uses Redux and Immutable.js to complete the traditional MVC package. This React, Redux, and Immutable.js stack provides WAIL with a consistent user experience across supported platforms and a much more manageable codebase. On the tools side of making WAIL look and perform beautifully, WAIL is now using Ilya Kreymer's pywb. Pywb is used by WAIL for both replay and to aid in the heavy lifting of managing the collections.

WAIL is now available from the project's release page on GitHub. For more information about how to use WAIL, be sure to visit the wiki.

- John Berlin