2016-07-21: Dockerizing ArchiveSpark - A Tale of Pair Hacking

"Some doctors prescribe application of sandalwood paste to remedy headache, but making the paste and applying it is no less of a headache." -- an Urdu proverb
This is the translation of a couplet from an Urdu poem that is often used as a proverb. The couplet nicely reflects how I felt when Vinay Goel from the Internet Archive was demonstrating how suitable ArchiveSpark was for our IMLS Museums data analysis during the Archives Unleashed 2.0 Datathon at the Library of Congress, Washington DC, on June 14, 2016. ArchiveSpark allows easy data extraction, derivation, and analysis from standard web archive files (such as CDX and WARC). In the back of my mind I was thinking that while it would be nice, cool, and awesome to use ArchiveSpark (or Warcbase) for the task, and certainly a good idea for serious archive data analysis, it was perhaps overkill for a two-day hackathon event. Installing and configuring these tools would have required us to set up a Hadoop cluster, a Jupyter notebook, Spark, and a bunch of configurations for ArchiveSpark itself. After doing all that, we would have had to set up HDFS storage and import a few terabytes of archived data (CDX and WARC files) into it. It could easily have taken a whole day for someone new to these tools, leaving almost no time for the real data analysis. That is why we decided to use standard Unix text processing tools for CDX analysis.

Pair Hacking

Fast-forward to the next week: we were attending JCDL 2016 at Rutgers University, New Jersey. On June 22, during a half-hour coffee break, I asked Helge Holzmann, the developer of ArchiveSpark, to help me understand the requirements and steps involved in a basic ArchiveSpark setup on a Linux machine so that I could create a Docker image to eliminate some friction for new users. We sat down together and discussed the minimal configuration that would make the tool work on a regular file system on a single machine, without the complexities of a Hadoop cluster and HDFS. Based on his instructions, I wrote a Dockerfile that can be used to build a self-contained, pre-configured, and ready-to-spin Docker image. After some tests and polish, I published the ArchiveSpark Docker image publicly. This means that running an ArchiveSpark instance is now as simple as running the following command (assuming Docker is installed on the machine):

$ docker run -p 8888:8888 ibnesayeed/archivespark

This command essentially means: run a Docker container from the ibnesayeed/archivespark image and map the internal container port 8888 to the host port 8888 (to make it accessible from outside the container). This will automatically download the image from Docker Hub if it is not in the local cache (which will be the case for the first run). Once the service is up and running (which will take a few minutes the first time, depending on the download speed, but subsequent runs will take only a couple of seconds), the notebook will be accessible from a web browser at http://localhost:8888/. The default image is pre-loaded with some example files, including a CDX file, a corresponding WARC file, and a notebook file to get started with the system. To work on your own data set, please follow the instructions to mount host directories of CDX, WARC, and notebook files inside the container.
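As a sketch, mounting host directories for your own data might look like the following (the host paths are placeholders, and the container-side mount points shown here are assumptions — consult the image's documentation for the actual paths it expects):

```sh
$ docker run -p 8888:8888 \
    -v /path/to/my/cdx:/data/cdx \
    -v /path/to/my/warc:/data/warc \
    -v /path/to/my/notebooks:/root/notebooks \
    ibnesayeed/archivespark
```

Each `-v host:container` flag overlays a host directory onto a directory inside the container, so the notebook can read your CDX/WARC files and persist any notebooks you create back to the host.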

As I tweeted about this new development, I got immediate encouraging responses from different people using Linux, Mac, and Windows machines.

Under the hood

For those who are interested in knowing what is happening under the hood of this Docker image, I will walk through the Dockerfile itself to explain how it is built.

We used the official jupyter/notebook image as the base image, which means we started with a Docker image that includes all the necessary libraries and binaries to run the Jupyter notebook. Next, I added my name and email address as the maintainer of the image. Then we installed a JRE using the standard apt-get command. Next, we downloaded a Spark binary bundled with Hadoop from a mirror and extracted it into a specific directory (this location is later used in a configuration file). Then we downloaded the ArchiveSpark kernel and extracted it into the location where Jupyter expects kernels to reside. Next, we overwrote the configuration file of the ArchiveSpark kernel with a customized kernel.json file. This custom configuration file replaces some placeholders of the default config file, specifies the Spark directory (where Spark was extracted), and modifies the kernel to run in a non-cluster mode on a single machine. The next three lines add sample files/folders (the example.ipynb file, the cdx folder, and the warc folder, respectively) to the container and create volumes where host files/folders can be mounted at run time to work on real data. Finally, the default command "jupyter notebook --no-browser" is added, which will run when a container instance is spun up without a custom command.
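The walkthrough above can be sketched as a Dockerfile along these lines. This is an illustrative reconstruction, not the published Dockerfile itself: the mirror URLs, version numbers, and container paths are assumptions.

```dockerfile
# Start from the official Jupyter notebook image
FROM jupyter/notebook

# Maintainer's name and email go here (email elided)
MAINTAINER Sawood Alam

# Install a Java runtime for Spark
RUN apt-get update && apt-get install -y default-jre

# Download a Spark binary bundled with Hadoop and extract it
# into a known directory (URL and version are illustrative)
RUN curl -sL http://some-mirror/spark-x.y.z-bin-hadoop.tgz \
    | tar -xz -C /opt

# Download the ArchiveSpark kernel and extract it where Jupyter
# looks for kernels (path is an assumption)
RUN mkdir -p /root/.ipython/kernels \
    && curl -sL http://some-host/archivespark-kernel.tar.gz \
    | tar -xz -C /root/.ipython/kernels

# Overwrite the kernel config with a customized kernel.json that
# points at the Spark directory and selects local, non-cluster mode
COPY kernel.json /root/.ipython/kernels/archivespark/kernel.json

# Add the sample notebook and data, and declare volumes so host
# directories can be mounted over them at run time
COPY example.ipynb /root/notebooks/
COPY cdx /data/cdx/
COPY warc /data/warc/
VOLUME /data/cdx /data/warc /root/notebooks

# Default command when no custom command is given
CMD ["jupyter", "notebook", "--no-browser"]
```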


In conclusion, we see this dockerization of ArchiveSpark as a contribution to the web archiving community that eliminates the setup and getting-started friction from a very useful archive analysis tool. We believe that this simplification will encourage increased usage of the tool in web archive related hackathons, quick personal archive explorations, research projects, demonstrations, and classrooms. We also see a need to dockerize and simplify other web archiving tools (such as Warcbase) to give new users a friction-free way to get started. Going forward, some improvements that can be made to the ArchiveSpark Docker image include (but are not limited to) running the notebook inside the container under a non-root user, adding a handful of ready-to-run sample notebook files for common tasks, and making the image configurable at run time (for example, to allow local or Hadoop cluster mode and HDFS or plain file system storage) while keeping defaults that work well for simple usage.


Sawood Alam