Monday, February 13, 2017

2017-02-13: Electric WAILs and Ham

Mat Kelly recently posted Lipstick or Ham: Next Steps For WAIL in which he spoke about the past, present, and potential future for WAIL. Web Archiving Integration Layer (WAIL) is a tool that seeks to address the disparity between institutional and individual archiving tools by providing one-click configuration and utilization of both Heritrix and Wayback from a user's personal computer. I am here to speak on the realization of WAIL's future by introducing WAIL-Electron.

WAIL-Electron



WAIL has been completely revised from a Python application using modern Web technologies into an Electron application. Electron combines a Chromium (Chrome) browser with Node.js allowing for native desktop applications to be created using only HTML, CSS, and JavaScript.

The move to Electron has brought with it many improvements most importantly, of which is the ability to update and package WAIL for the three major operating systems: Linux, MacOS, and Windows. Support for these operating systems is easily achieved by packing utility used (electron-packager) which allows one to produce the binary for a specific system. Also thanks to this move, the directory structure issue mentioned by Mat in his post has been resolved. Electron applications have their own directory structure inside the OS-specific application directory path accessible via their API. Here the packager will place the tools WAIL makes available for use.


Electric Ham


The meat of this revision is adding new functionality to WAIL in addition to the tools already made available through WAIL, namely Heritrix and Wayback. This new functionality comes in two parts. First, WAIL is now collection-centric. The previous revision, WAIL Python, added the WARC files created through WAIL to a single archive. This archive was an ambiguous collection of sorts where users had to create their own means of associating the WARCs to each other. Initially, this beneficial feature allowed users to archive what they saw at any given instance and replay the preserved page immediately. But updates to WAIL could not be justified if they did not build upon the existing functionality; which is why the concept of personal collection-based archiving was introduced.

Collections

WAIL now provides users with the ability for the curation of personalized web archive collections akin to the subscription service Archive-It except on their local machines. By default, WAIL comes with an initial collection and allows for the creation of additional collections.


The Collections screen displays the collections created through WAIL. This view displays the collection name along with some summary information about it.

  • Seeds: How many seeds are associated with the collection
  • Last Updated: the last time (date and time) the collection was updated
  • Size: How large is the collection on the file system

Creation of a collection is as simple as clicking the New Collection button available on the Collections (home) screen of WAIL. After doing so, a dialog will appear from which users can specify the name, title, and description for the collection. Once these fields have been filled in, WAIL will create the collection that users can access from the Collections View.


The Collection View displays the information about each seed contained in the collection

  • Seed URL: The URL
  • Added: The date time it was added to the Collection
  • Last Archived: The last time it was archived through WAIL
  • Mementos: The number of Mementos for the seed in the collection

along with a link for viewing the seed in Wayback.

Seeds can be added to a collection from either the live web or from WARC files present on the filesystem. To aid in the process of adding a seed from the live web, WAIL provides the user will the ability to "check" the seed before archiving.


The check provides summary information about the seed that includes the HTTP status code and a report on the embedded resources contained in the page. This lets users choose an archive configuration before starting WAIL's archival process to configure and launch a Heritrix crawl.

To add a seed from the filesystem all the user has to do is drag and drop the (W)ARC file into the corresponding interface for that functionality. WAIL will process the (W)ARC file and display a list of potential seeds discovered.



WAIL can not automatically determine the seed due to the nature of (W)ARC files. Rather WAIL uses heuristics on the contents of the (W)ARC file to determine which entries are valid candidates for the seed URL. From this display, the user chooses the correct one. WAIL will then add the seed to the collection, and it will be available for replay from the Collection View.

Twitter Archiving

The second added functionality is the ability to monitor and archive Twitter content automatically. This was made possible thanks to the scholarship I received for preserving online news. There are two options for the Twitter archival feature implemented in WAIL. The first is monitoring a user’s timeline for tweets which were tweeted after the monitoring has started with the option of selecting only the tweets containing hashtags specified during configuration. The second, a slight variation of the first, will only archive tweets that have specific keywords in the tweet’s body as specified during configuration.

What makes this unique is how WAIL preserves this content. Before this addition, WAIL utilized Heritrix as the primary preservation means. Heritrix executes HTTP GET requests to retrieve the target web page and archives the HTTP response headers and the content returned from the server. The embedded Javascript of the web page is not executed potentially decreasing the fidelity of the capture. This is problematic when archiving Twitter content since the rendering of tweets is done exclusively through client-side Javascript. 

To address this WAIL utilizes the native Chromium browser provided by Electron in conjunction with WARCreate. Modifications were made to WARCreate in order to integrate it with WAIL to eliminate the need for human intervention to decide when to generate the WARC and to work inside of Electron. By integrating WARCreate into WAIL the archival process of Twitter content has been simplified to loading the URL of the tweet into the browser and waiting until the browser indicates that the page has been rendered in its entirety. Then the archival process through WARCreate is initiated. Once the WARC has been generated, it is added to the collection specified by the user.

Putting on Lipstick

As mentioned in Mat's blog post, the UI for WAIL-Python needed an update not only for its maintainability but also for a cohesive user experience across supported platforms. At the time of starting this revision of WAIL, the choices available for the front-end framework as seen on Github were plentiful. It simply boiled down to choosing the one that had the "least" painful setup and deployment process with a learning curve such that any person taking over the project could be brought up to speed with minimal effort.

With this in mind, React was chosen for WAIL's UI library; it is unopinionated about other technologies which may be used alongside it and features a large production tested ecosystem with an active developer community. React is only a view library, which is why WAIL uses Redux and Immutable.js to complete the traditional MVC package. This React, Redux, and Immutable.js stack provide WAIL a consistent user experience across supported platforms and a much more manageable codebase. On the tools side of making WAIL look and perform beautifully, WAIL is now using Ilya Kreymer's pywb. Pywb is used by WAIL for both replay and to aid in the heavy lifting of managing the collections.

WAIL is now available from the project's release page on Github available.  For more information about how to use WAIL be sure to visit the wiki.

- John Berlin

No comments:

Post a Comment