Friday, June 3, 2016

2016-06-03: Lipstick or Ham: Next Steps for WAIL

The development, state, and future of 🐳 Web Archiving Integration Layer. 💄∨🐷?

Some time ago I created and deployed Web Archiving Integration Layer (frequently abbreviated as WAIL), an application that provides users pre-configured local instances of Heritrix and OpenWayback. This tool was originally created for the Personal Digital Archiving 2013 conference and has gone through a metamorphosis.

The original impetus for creating the application was that the browser-based WARCreate extension required some sort of server-like software to save files locally because of the limitations of the Google Chrome API and JavaScript at the time (2012). WARCreate would perform an HTTP POST to this local server instance, which would then return an HTTP response with an appropriate MIME type that would cause the browser to download the file. I initially used XAMPP for this with a PHP script within the Apache instance. This was unwieldy and a more complex procedure than I wanted for the user.
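As a rough illustration of that round trip (not the original PHP script, which is not reproduced here), the sketch below uses Python's standard `http.server` module to echo a POSTed body back with headers that prompt a browser download. The handler name, the `make_server` helper, the `application/octet-stream` MIME type, and the `capture.warc` filename are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoAsDownloadHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the local server: echo a POST as a download."""

    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)

        self.send_response(200)
        # A generic binary MIME type plus an attachment disposition is what
        # makes the browser save the response instead of rendering it.
        # Both values are assumptions chosen for this sketch.
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Disposition", 'attachment; filename="capture.warc"')
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(port=0):
    """Bind to localhost only; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), EchoAsDownloadHandler)
```

Calling `make_server(8080).serve_forever()` would start the listener; the extension would then POST the generated WARC content to `http://127.0.0.1:8080/` and the browser would offer the response as a file to save.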

With the introduction of the HTML5 File API, this server software was no longer required. The File API, however, is sandboxed to an isolated file system accessible only to the browser. To circumvent this restriction, I utilize the FileSaver.js library, but this, too, limits the size of the file that can be downloaded: 500 MiB (about 524 MB) for Google Chrome.


With Apache no longer being a requirement for WARCreate, I investigated using XAMPP's bundled copy of Apache web server and the additionally bundled Tomcat Java server for other web archiving purposes, namely the engine to run the Java-based OpenWayback. This worked well but still felt heavy for a user's PC, as Java applications do. The added Java requirement also meant that I could include a pre-configured Heritrix, Internet Archive's Java-based archival crawler, within XAMPP. The XAMPP interface, however, was a generic panel for simply controlling services, a UI scheme I wanted to obscure from the target audience.

A locally hosted web-based interface might have been suitable, but as with the WARCreate-to-local-file problems, having a browser launch applications on the user's machine was likely to be problematic. Being already familiar with Python, I created a script using the wxPython library (the Python binding for wxWidgets) that allows a user to specify a URI for Heritrix to crawl (by programmatically creating crawl configurations) and locations for the resulting WARCs to which Heritrix should write and from which OpenWayback should read.
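To give a sense of what "programmatically creating crawl configurations" can look like, here is a minimal sketch that fills a template with the seed URI and WARC directory and writes it into a per-job folder. The template contents, the `job.conf` filename, and the `write_crawl_job` helper are hypothetical placeholders for illustration; they are not Heritrix's actual configuration format or WAIL's actual code.

```python
from pathlib import Path
from string import Template

# Hypothetical, heavily abbreviated template. A real Heritrix job
# configuration is much larger and uses a different format.
JOB_TEMPLATE = Template(
    "# illustrative job file, not Heritrix's actual format\n"
    "seed=$seed_uri\n"
    "warc_dir=$warc_dir\n"
)


def write_crawl_job(jobs_root, job_name, seed_uri, warc_dir):
    """Create a job directory and write a filled-in configuration file.

    Returns the path of the file that was written.
    """
    job_dir = Path(jobs_root) / job_name
    job_dir.mkdir(parents=True, exist_ok=True)

    config_path = job_dir / "job.conf"  # assumed filename for this sketch
    config_text = JOB_TEMPLATE.substitute(seed_uri=seed_uri, warc_dir=warc_dir)
    config_path.write_text(config_text, encoding="utf-8")
    return config_path
```

A GUI callback could then call something like `write_crawl_job("jobs", "my-first-crawl", "http://example.com/", "archives")` with the values the user typed in, before asking the crawler to launch that job.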

This additional Graphical User Interface (GUI) "Layer" for "Integrating" "Web Archive" tools (Heritrix and OpenWayback) spawned the awkwardly named "Web Archiving Integration Layer". The acronym, while descriptive, also reiterated ODU WS-DL's trend of associating produced software with sea creatures (as I referenced once before).

Ceci n'est pas un cochon
Requiring the target user base (digital humanities scholars and amateur web archivists) to go to the command-line to launch a Python script was unacceptable, however, and the remedy to this problem has been partially to blame for the slowdown in further development of WAIL. To "Freeze" code is to create the more familiar "Application" that a user would double click to launch. At the time (2013), PyInstaller provided the best application freezing functionality in that it performed dependency resolution, created cross-platform binaries, and provided a mode to produce a single binary file, which was not initially necessary but became appealing.
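For readers unfamiliar with freezing, the sketch below assembles a PyInstaller command line using its `--onefile`, `--windowed`, and `--name` options and runs it with `subprocess`. The script name `wail.py`, the helper names, and the choice of options are assumptions for illustration; this is not the actual WAIL build procedure.

```python
import subprocess


def build_freeze_command(script, app_name, onefile=True, windowed=True):
    """Assemble a PyInstaller command line as a list of arguments."""
    command = ["pyinstaller"]
    if onefile:
        # Bundle everything into a single executable file.
        command.append("--onefile")
    if windowed:
        # Do not open a console window alongside the GUI.
        command.append("--windowed")
    command += ["--name", app_name, script]
    return command


def freeze(script, app_name):
    """Run PyInstaller; requires the pyinstaller executable on the PATH."""
    subprocess.run(build_freeze_command(script, app_name), check=True)
```

With PyInstaller installed, `freeze("wail.py", "WAIL")` would run the equivalent of `pyinstaller --onefile --windowed --name WAIL wail.py` and leave the resulting application in PyInstaller's `dist` output directory.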

In the beginning, WAIL was compiled for Windows and Mac OS X (nowadays called simply "OS X"). In the latter, single-file applications are very common, as OS X's ".app" faux directory structure allows the application tools and resources to be nicely packaged. Eventually, this was also a useful place to include the OpenWayback and Heritrix binaries. That Windows does not have this abstraction but instead frequently provides a directory of files with the ".exe" being the binary is the reason that WAIL for Windows has not been updated since 2013.

Plagued with Problems

As if the decoupling of the OS X and Windows versions was not bad enough, OS X ceased bundling the Java runtime with the operating system (which required WAIL to install the runtime), Heritrix required an older version of Java (it would break with the latest version), and just generally Java problems all around. These problems persist to this day but ultimately it was these requirements and configuration issues that WAIL was designed to solve or at least mitigate for the user.

The WAIL code that drives the UI is also quite the mess. Although we are researchers, for whom code function should supersede its form, WAIL is publicly available (both the binary and the source), so the quality of its form ought to match that of its function.

Refactor or Is That Fiddly?

I have been maintaining and improving the code but eventually either another WS-DLite will be doing the same or the project will die. I believe there still to be merit in a locally hosted web archive, particularly for the digital humanities scholars that aren't familiar with system interaction via the command-line and manually rewriting configuration files.

We are looking into other routes to make the code more intuitive to maintain but still functionally equivalent to, if not greater than, the Python-based native app in its current state. We have bundled the newly developed Go-based MemGator Memento aggregator (blog post to come) with WAIL as a cross-platform native executable. We also hope to include other tools that personal web archivists would find useful, with the requirement being that they must run natively and include no further non-bundled dependencies. Two tools on our radar are Ilya Kreymer's pywb, part of the replay component that's driving Webrecorder, and the heavily coupled (with pywb) InterPlanetary Wayback (ipwb) system we developed at the Archives Unleashed Hackathon in March.

The question still remains whether to rework the current code or to overhaul the UI in a way that is more extensible and maintainable. The Electron packaging library, as used by the native Slack application, Atom editor, and many other software projects, looks to be the route to take to achieve these goals. Additionally, interfaces written for Electron can be compiled to native applications, a feature that will allow the ethos of WAIL to be retained.

However, rewriting the UI does not a more useful application make and doing so boils down to putting lipstick on a pig. External dependencies should be the primary problem to tackle. From that, including additional functionality and tools to make the application more useful (the "ham" if this simile can be stretched any further) ought to be given priority.

—Mat Kelly (@machawk1)
