2018-06-11: Web Archiving and Digital Libraries (WADL) Workshop Trip Report from JCDL2018

Mat Kelly reports on the Web Archiving and Digital Libraries (WADL) Workshop 2018 that occurred in Fort Worth, Texas.

On June 6, 2018, after attending JCDL 2018 (trip report), WS-DL members attended the Web Archiving and Digital Libraries 2018 Workshop (#wadl2018) in Fort Worth, Texas (see trip reports from WADL 2017, 2016, 2015, 2013). WS-DL's contributions to the workshop included multiple presentations, including the workshop keynote by my PhD advisor, which I discuss below.

The Project Panel

Martin Klein (@mart1nkle1n) initially welcomed the workshop attendees and had the group of 26-or-so participants give a quick overview of who they were and their interest in attending. He then introduced Zhiwu Xie (@zxie) of Virginia Tech to begin the series of presentations reporting on the kickoff of the IMLS-funded project (as established at WADL 2017) "Continuing Education to Advance Web Archiving". A distinguishing feature of this project compared to others, Zhiwu said, is that it will use project-based problem solving rather than producing only surveys and lectures. He highlighted a collection of curriculum modules that apply existing practice (event archiving) to various Web archiving tools (e.g., Social Feed Manager (SFM), ArchiveSpark, and Archives Unleashed Toolkit) to facilitate understanding of the fundamentals (e.g., web, data science, big data) and to build experience in libraries, archives, and programming. The focus here was on individuals who had some level of prior experience with archives, rather than designing the program as training for those with no experience in the area.

ODU WS-DL's Michael Nelson (@phonedude_mln) continued by explaining that one motivation is to encourage storytelling using Web archives, an effort hampered by the recent closing of Storify. Some recent work of the group (including the in-development project MementoEmbed) would allow this concept to be revitalized despite Storify's demise through systematic "card" generation for mementos, allowing a more persistent (in the preservation sense) version of a story to be extracted and retained.

Justin Littman (@justin_littman) of George Washington University Libraries continued the project description by describing Social Feed Manager and emphasizing that what you get from the Twitter API may well differ from what you get from the Web interface. The purpose of SFM is to be an easy-to-use, self-service Web interface that drives down the barriers to collecting social media data for academic research.

Ian Milligan (@ianmilligan1) continued by giving a quick run-down of his group's Archives Unleashed projects, noting a realization in the project's development that not all historians like working with the command line and Scala. He then briefly described the project's filter-analyze-aggregate-visualize approach to make using large collections of Web archives more effective for research.

Wrapping up the project report, Ed Fox described Virginia Tech's initial attempts at performing crawls with Heritrix via Archive-It and how noisy the results were. He emphasized that a typical crawling approach consisting of starting with seed URIs harvested from tweets does not work well. The event model his group is developing and further evaluating will help guide the crawling procedure.

Ed's presentation completed the series of reports for the IMLS project panel and began a series of individual presentations.

Individual Presentations

John Berlin (@johnaberlin) started off with an abbreviated version of his Master's Thesis titled, "Swimming In A Sea Of JavaScript, Or: How I Learned To Stop Worrying And Love High-Fidelity Replay". While John had recently given his defense in April (see his post for more details), this presentation focused on some of the more problematic aspects of archival replay caused by JavaScript. He highlighted specific instances where the efforts of a replay system to accurately replay JavaScript varied from causing a page to display a completely blank viewport (see CNN.com has been unarchivable since November 1st, 2016) to the representation being hijacked to declare Brian Williams as the originator of "Gin and Juice" long before Snoop Dogg(y Dogg). John has created a Chrome and Firefox extension he dubbed "Wayback Plus Plus" that mitigates JavaScript-based replay issues using client-side redirects. See his presentation for more details.

The workshop participants then took a break to grab a boxed lunch, followed by Ed Fox again presenting "A Study of Historical Short URLs in Event Collections of Tweets". In this work Ed highlighted the number of tweets in their collections that contained URLs, noting that 10% had 2 URLs and less than 0.5% had 3 or more. From this collection, his group analyzed how many of the linked URLs are still accessible in the Internet Archive's Wayback Machine, emphasizing that the Wayback Machine does not cover much of what appears in the Twitter data he has gathered. His group also analyzed the time difference between when a tweet with URLs was posted and when the URLs were archived, finding that 50% were archived within 5 days of the tweet being posted.


The workshop keynote, "Enabling Personal Use of Web Archives", was next, presented by my PhD advisor Dr. Michele C. Weigle (@weiglemc). Her presentation first gave a high-level overview of the needs of those who want to perform personal Web archiving and the tools the WS-DL group has created over the years to address those needs. She highlighted the group's early work in identifying disasters in existing archives, with a segue into a realization that many archive users lack: that there are more archives beyond the Internet Archive.

In her (our) group's tooling to encourage Web users to Archive What They See Now, they created the WARCreate Chrome extension to create WARC files from any Web page. To resolve the issue of what a user is to do with their WARCs, they then created the Web Archiving Integration Layer (WAIL) (and later an Electron version) to allow individuals to control both the preservation and replay processes. To give users a better picture of the archived Web, they created the Chrome extension Mink, which shows users how well-archived (in terms of quantity of mementos) a URI is as they browse the live Web and lets them optionally (and easily) submit the currently viewed URI to one to three Web archives.
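As a rough illustration of what a tool like WARCreate must ultimately produce, a minimal WARC/1.0 response record can be assembled by hand. This is a simplified sketch, not WARCreate's actual implementation; real tools also emit warcinfo and request records and compute payload digests:

```python
import uuid
from datetime import datetime, timezone


def minimal_warc_response(uri: str, http_payload: bytes) -> bytes:
    """Build a single, simplified WARC/1.0 response record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {now}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"  # blank line separates record headers from the block
    )
    # A record is terminated by two CRLFs after its content block
    return headers.encode() + http_payload + b"\r\n\r\n"


record = minimal_warc_response(
    "https://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>",
)
```

The captured HTTP response (status line, headers, and body) is stored verbatim as the record's content block, which is what enables faithful replay later.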

Dr. Weigle also highlighted the work of other WS-DL students of past and present like Yasmin Anwar's (@yasmina_anwar) Dark and Stormy Archives (DSA) and Shawn Jones' (@shawnmjones) upcoming MementoEmbed tool.

Following the tool review, Dr. Weigle asked, "What if browsers could natively interpret and replay WARCs?" She gave a high-level review of what could be possible if the compatibility barriers between the archived and live Web were resolved through live-Web tools that could natively interact with the archived Web. In one example, she provided a screenshot where, in place of the "secure" badge a browser provides, the browser might also be aware that it is viewing an archived page and indicate as much.

Libby Hemphill (@libbyh) presented next with "Developing a Social Media Archive at ICPSR", where her group sought to make data useful for people of the long-distant future who want to understand how we are today. She mentioned how messy it can be to consider the ethical challenges of archiving social media data, and that people have different levels of comfort depending on the sort of research for which their social media content is to be used. She outlined the architecture of their social media archive SOMAR: federating data to follow the terms of service, rehydrating tweets to follow the terms of research, and other aspects of the social-media-to-research-data process.

The workshop then took another break with a simultaneous poster session including a poster by Justin Littman titled, "Supporting social media research at scale" and WS-DL's Sawood Alam's (@ibnesayeed) "A Survey of Archival Replay Banners". Just prior to their poster presentations, each gave a lightning talk as a quick overview to entice attendees into stopping by.

After the break, WS-DL's Mohamed Aturban (@maturban1) presented "It is Hard to Compute Fixity on Archived Web Pages". Mohamed's work highlighted the issue that subtle changes in content may be difficult to detect using conventional hashing methods for computing the fixity of Web pages. He emphasized that computing the fixity of the root HTML page of a memento is not enough, and that fixity must also be computed for all embedded resources. With an approach utilizing Merkle trees, he generates a hash of the composite memento representative of the fixity of all embedded resources. In one example, highlighted in his recent post and tech report, Mohamed showed the manipulation of Climate Change data.
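The core idea of the Merkle-tree approach can be sketched in a few lines: hash each embedded resource, then combine the hashes pairwise until a single root hash remains, so a change to any one resource changes the composite hash. This is only an illustrative sketch of the general technique, not Mohamed's exact algorithm:

```python
import hashlib


def leaf_hash(content: bytes) -> str:
    """SHA-256 hash of one resource's content."""
    return hashlib.sha256(content).hexdigest()


def merkle_root(hashes) -> str:
    """Combine leaf hashes pairwise until one root hash remains."""
    level = list(hashes)
    if not level:
        return leaf_hash(b"")
    while len(level) > 1:
        if len(level) % 2:           # duplicate the last hash if the
            level.append(level[-1])  # level has an odd number of nodes
        level = [
            leaf_hash((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical embedded resources of a composite memento:
# the root HTML, a stylesheet, and an image.
resources = [b"<html>...</html>", b"body { color: red }", b"\x89PNG..."]
root = merkle_root(sorted(leaf_hash(r) for r in resources))
```

Sorting the leaf hashes makes the root independent of resource ordering; altering any single embedded resource yields a different root, which is what makes subtle tampering detectable.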

To wrap up the presentations for the workshop, I (Mat Kelly, @machawk1) presented "Client-Assisted Memento Aggregation Using the Prefer Header". This work highlighted one particular aspect of my presentation the previous day at JCDL 2018 (see blog post): how the framework in the base presentation facilitates specifying which archives are aggregated using Memento. A previous investigation by Jones, Van de Sompel et al. (see "Mementos in the Raw, Take Two") used the HTTP Prefer header to allow a client to request the un-rewritten version of mementos from an archival replay system. In my work, I imagined a more capable Memento aggregator that would expose the archives it aggregates and allow a client, basing its customizations on the aggregator's response, to customize the set of archives aggregated by sending the set as base64-encoded data in the Prefer request header.
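Mechanically, a client following this idea might serialize its desired archive set and carry it in a Prefer header (RFC 7240). The preference name `archives` and the JSON-then-base64 encoding below are illustrative assumptions, not the exact wire format from the paper:

```python
import base64
import json

# Hypothetical set of archives the client wants the aggregator to query.
archives = [
    "https://web.archive.org/web/",
    "https://archive.today/",
]

# Encode the set as base64(JSON) and place it in the Prefer header.
token = base64.b64encode(json.dumps(archives).encode()).decode()
headers = {"Prefer": f"archives={token}"}

# The aggregator can recover the client's requested set by reversing
# the encoding.
decoded = json.loads(base64.b64decode(token))
```

Base64 keeps the structured archive list safe inside a single header value, and the aggregator's response can advertise its default set so the client has a baseline to customize.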


When I finished the final presentation, Ed Fox began the wrap-up of the workshop. This discussion among all attendees opened the floor for comments and recommendations for the future of the workshop. With the discussion finished, the workshop came to a close. As usual, I found this workshop extremely informative, even though I was familiar with much of the participants' previous work. I hope, as other attendees also expressed, to encourage other fields to become involved and present their ongoing work and ideas at this informal workshop. Doing so, from the perspective of both an attendee and a presenter, has proven valuable.

Mat (@machawk1)