2016-06-24: Web Archiving and Digital Libraries (WADL) Workshop Trip Report from JCDL2016

Trip Report for the Web Archiving and Digital Libraries Workshop 2016 in Newark, NJ.                           

Following our recent trip to the Joint Conference on Digital Libraries (JCDL), WS-DL attended the Web Archiving and Digital Libraries (WADL) Workshop (#wadl2016) (see 2015 and 2013 reports), co-located with the conference in Newark, New Jersey. This post documents the presentations and our takeaways from the workshop.

On Wednesday, June 22, 2016 at around 2pm, Ed Fox of Virginia Tech welcomed the crowd of 30+ librarians, researchers, and archivists to the event, spanning two half-days, by first introducing Vinay Goel (@vinaygo) of the Internet Archive (IA) to lead the panel on "Worldwide Activities on Web Archiving".

Vinay described the recent efforts by IA to make the archives more accessible through additional interfaces to the archive's holdings. As occurred at the Archives Unleashed 2.0 Datathon earlier this month, Vinay demoed the then restricted-access beta version of the Wayback Machine interface, now sporting a text input box that searches the contents of the archives. Vinay noted that the additional search functionality was limited to scanning homepages using sparse metadata, but he hopes to improve the interface before publicly deploying it.

Near the end of Vinay's presentation, he asked Ian Milligan (actively tweeting the workshop per above) for a quick summary of the aforementioned Datathon. Ian spoke about the format and, notably, the efforts to sustain the event into the future by registering the Archives Unleashed organization as an LLC.

Following the first presentation, Herbert Van de Sompel (@hvdsomp) presented the first full paper of the workshop, "Devising Affordable and Functional Linked Data Archives". Herbert first gave a primer on the Memento framework, then segued into the analysis he and his co-authors performed of the DBpedia archive. The archive originally used MongoDB to store the archival dumps, which he admitted might not have been a good design decision, and Herbert described the work done to make the archive Memento-compatible.

His group analyzed the availability, bandwidth, and cost of making the archive accessible via a simple data dump, SPARQL endpoints, and subject URIs, initially in terms of each method's expressiveness, potential Memento support, and ability to support cross-time queries. Drawing on recent research performed by his intern, Miel Vander Sande (@Miel_vds), he spoke of Linked Data Fragments (described by selectors, controls, and metadata), particularly Triple Pattern Fragments, and the tradeoffs compared to the other methods of representing the linked data. Using a binary RDF representation of the data in combination with Triple Pattern Fragments yielded an approach that costs less and better facilitates consumption than the previous methods of storage. This led his team to produce a second version of the DBpedia archive with Memento support and the advantages of LDF and the binary RDF representation in place.
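
To give a flavor of what consuming such an archive looks like, here is a minimal sketch (not from the paper) of querying a Triple Pattern Fragments endpoint and, where a TimeGate fronts the archive, negotiating in time via Memento's Accept-Datetime header (RFC 7089). The endpoint URL and query parameter names are illustrative assumptions.

```python
# Minimal sketch: fetch one triple-pattern fragment, optionally at a past datetime.
# The endpoint and parameter names below are assumptions for illustration only.
import requests

TPF_ENDPOINT = "http://fragments.dbpedia.org/2015/en"  # illustrative endpoint

def fetch_fragment(subject=None, predicate=None, obj=None, accept_datetime=None):
    """Request a triple-pattern fragment; accept_datetime enables Memento negotiation."""
    params = {}
    if subject:
        params["subject"] = subject
    if predicate:
        params["predicate"] = predicate
    if obj:
        params["object"] = obj

    headers = {"Accept": "text/turtle"}
    if accept_datetime:
        # Ask the TimeGate for the fragment as it existed at this datetime
        headers["Accept-Datetime"] = accept_datetime

    resp = requests.get(TPF_ENDPOINT, params=params, headers=headers)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    turtle = fetch_fragment(
        subject="http://dbpedia.org/resource/Newark,_New_Jersey",
        accept_datetime="Wed, 22 Jun 2011 00:00:00 GMT",
    )
    print(turtle[:500])
```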

After a short break, a series of 2-3 minute lightning talks began.

Yinlin Chen of Virginia Tech gave the first presentation with "A Library to Manage Web Archive Files in Cloud Storage". In this work, Yinlin spoke about integrating Islandora applications, the Fedora Commons digital repository, and various cloud storage providers like Amazon S3 and Microsoft Azure.

Sunshin Lee (also of VT) presented second in the lightning talks with "Archiving and Analyzing Tweets and Webpages with the DLRL Hadoop Cluster". Sunshin described VT's Integrated Digital Event Archiving and Library (IDEAL) project and its utilization of a 20-node Hadoop cluster for collecting and archiving tweets and web pages for analysis, searching, and visualization.

I (Mat Kelly - @machawk1) presented next with "InterPlanetary Wayback: The Permanent Web Archive". In this work, which Sawood (@ibnesayeed) and I initially prototyped at the Archives Unleashed Hackathon in March, I gave high-level details of integrating Web ARChive (WARC) files with the InterPlanetary File System (IPFS) for inherent de-duplication and dissemination of an archive's holdings. This work was also presented as a poster at JCDL 2016.
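
The core idea can be summarized in a few lines; the sketch below is not the ipwb implementation itself, just an illustration: iterate over WARC response records, push each payload into IPFS, and keep the returned content hashes in an index. Identical payloads hash to the same address, so de-duplication falls out of the content addressing. It assumes a local IPFS daemon plus the warcio and ipfshttpclient packages.

```python
# Illustration only (not the ipwb code): index WARC responses into IPFS.
import ipfshttpclient
from warcio.archiveiterator import ArchiveIterator

def index_warc(warc_path):
    client = ipfshttpclient.connect()  # default: local IPFS daemon on port 5001
    index = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            digest = client.add_bytes(payload)  # content-addressed IPFS hash
            index.append((uri, digest))
    return index

if __name__ == "__main__":
    for uri, digest in index_warc("example.warc.gz"):  # hypothetical WARC file
        print(uri, "->", digest)
```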

Sawood Alam presented his own work after me with "MemGator - A Portable Concurrent Memento Aggregator". His aggregator, written in the Go programming language, allows for concurrent querying of a user-specified set of archives and lets users deploy their own Memento aggregator. This work was also presented as a poster at JCDL 2016.
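
MemGator itself is a Go program; purely to illustrate the aggregation concept (and not as a stand-in for Sawood's code), here is a small Python sketch that queries several archives' TimeMaps in parallel and merges whatever comes back. The TimeMap URL templates are examples, not an endorsed archive list.

```python
# Concept sketch of Memento aggregation: fetch TimeMaps concurrently and merge.
import concurrent.futures
import requests

TIMEMAP_TEMPLATES = [
    "https://web.archive.org/web/timemap/link/{uri}",
    "https://arquivo.pt/wayback/timemap/link/{uri}",  # illustrative
]

def fetch_timemap(template, uri):
    try:
        resp = requests.get(template.format(uri=uri), timeout=10)
        return resp.text if resp.ok else ""
    except requests.RequestException:
        return ""  # an unreachable archive simply contributes nothing

def aggregate(uri):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(TIMEMAP_TEMPLATES)) as pool:
        results = pool.map(lambda t: fetch_timemap(t, uri), TIMEMAP_TEMPLATES)
    return "\n".join(r for r in results if r)

if __name__ == "__main__":
    print(aggregate("http://www.cnn.com/"))
```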

Bela Gipp (@BelaGipp) presented the final lightning talk with "Using the Blockchain of Cryptocurrencies for Timestamping Digital Cultural Heritage". In this work, he spoke of his group's efforts to timestamp data using the blockchain for information integrity. He has a live demo of his system available as well as an API for access.
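
A conceptual sketch of the general approach (not Bela's system or its API): hash the object locally so the content never leaves your machine, then have a service anchor that hash in a cryptocurrency transaction; later, anyone can recompute the hash and check it against the blockchain record to verify the object existed at that time unchanged.

```python
# Conceptual sketch only: compute the fingerprint that would be anchored on-chain.
import hashlib

def document_fingerprint(path):
    """SHA-256 hash of the file; this, not the file, is what gets timestamped."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    fp = document_fingerprint("cultural_heritage_object.tiff")  # hypothetical file
    print("Hash to submit for blockchain timestamping:", fp)
```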

Following Bela's talk, the session broke for a poster session. After the talks resumed, Richard Furuta (@furuta) presented "Evaluating Unexpected Change in a Distributed Collection". In this work, he and his co-authors examined the ACM Digital Library's collection of proceedings to identify the complexity and categories of change. From this analysis of sites on the live web, he ran a user study to classify the results into "Correct", "Kind of Correct", "University page", "Hello World", "Domain for Sale", "Error", and "Deceiving", with the cross-dimension of relevance (from Very Much to Not At All) as well as the percentage of the time the site was correctly identified.

Mark Phillips (@vphill) presented the final paper of the day, "Exploratory Analysis of the End of Term Web Archive: Comparing two collections". In this work, his group examined end-of-(presidential)-term crawls from 2008 and 2012. After the organization that originally performed the 2008 crawl stated that it was not planning to crawl the government (.gov and .mil) sites for the 2012 election, his group planned to do so in its stead. With the unanticipated result of the incumbent winning the 2012 election, he performed the crawls as a learning exercise for when a change of power does occur, as is guaranteed following the 2016 presidential election.

Mark performed an analysis of his 2008 and 2012 archives at UNT, which consisted of CDX files covering over 160 million URIs in the 2008 collection and over 194 million URIs in the 2012 collection. The initial analysis tried to determine when the content was harvested, the goal being to crawl before and after the election as well as after the inauguration. He found distinctly dissimilar patterns in the two crawls and realized that their estimates of when the crawls occurred were way off, a result of the crawler only being able to archive the sites as capacity allowed, given a long queue of other crawls. He also performed a data type analysis, finding significant increases in PNGs and JPEGs and decreases in PDFs and GIFs, among other changes. The take-home from his analysis was that the selection of URIs ought to be more driven by partners and the community. His CDX analysis code is available online.
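
To make the data type analysis concrete, here is a small sketch (not Mark's published code) of the kind of CDX-based tallying involved: count MIME types across a CDX file. It assumes the common space-delimited CDX format in which the MIME type is the fourth field; adjust the index for other CDX flavors.

```python
# Illustration: tally MIME types in a CDX file (field index is an assumption).
from collections import Counter
import sys

def mime_counts(cdx_path, mime_field=3):
    counts = Counter()
    with open(cdx_path) as f:
        for line in f:
            if line.startswith(" CDX"):  # header line naming the fields
                continue
            fields = line.split()
            if len(fields) > mime_field:
                counts[fields[mime_field]] += 1
    return counts

if __name__ == "__main__":
    for mime, n in mime_counts(sys.argv[1]).most_common(20):
        print(f"{n:>12}  {mime}")
```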

WADL Day 2

The second of two days of WADL began with Ed Fox reiterating that most users know little about the Internet beyond the web and noting the role of social media in the realm of web archiving. He then introduced a panel of three speakers, who presented in sequence.

Mark Phillips began by polling the room, asking if anyone else was using Twitter data in their work. Stating that his current work uses four data sets and that he will be using six more in the near future, he noted that after Nelson Mandela died he collected 10 million tweets but made available only the tweet IDs as well as the mapping between various identifiers and the embedded images. He emphasized the importance of documentation, namely a README describing the data sets.

Laura Wrubel (@liblaura) spoke next about the open source Social Feed Manager, a tool that allows users to create collections from social media platforms. She pointed out the need to provide documentation of how datasets are created for both researchers and archivists. With the recent release of version 1.0 of the software, she is hoping for feedback from collaborators.

Ian Milligan (@ianmilligan1) spoke next (slides available) about his open source strategy for documenting events. Based on a collection of more than 318,000 unique users that used the #elxn42 hashtag (for the 42nd Canadian election), he leveraged the Twitter API to fetch the data, citing that time was of the essence since the data becomes mostly inaccessible after 7-9 days without "bags of money".

Using twarc, he was able to create his own archives to analyze the tweets using twarc-report and twarc-utilities and visualize the data. Ian reintroduced the concept of tweet "hydration" and distinguished the legality from the ethics of storing and sharing the data, referencing the Black Twitter project out of USC. Per Twitter's Terms of Service, the full tweet JSON cannot be shared (only the tweet IDs), hence the need for hydration. Despite this, he stated, "We need to be very ethically conscious but if we don't collect it, archives of powerful people will be lost and we'll only have the institutional archives".
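
For readers unfamiliar with hydration, a minimal sketch with the twarc library follows: given only the tweet IDs a dataset is allowed to share, re-fetch the full tweet JSON from the Twitter API. The credential values and file names below are placeholders.

```python
# Minimal hydration sketch with twarc (credentials and file names are placeholders).
import json
from twarc import Twarc

t = Twarc(
    consumer_key="...",
    consumer_secret="...",
    access_token="...",
    access_token_secret="...",
)

with open("elxn42-tweet-ids.txt") as ids, open("elxn42-tweets.jsonl", "w") as out:
    # twarc handles rate limiting; deleted or protected tweets simply do not return
    for tweet in t.hydrate(ids):
        out.write(json.dumps(tweet) + "\n")
```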

Following the panel, Ed Fox suggested the potential of creating a research consortium for data sharing. An effort to build this is in the works.

After the panel, Zhiwu Xie (@zxie) presented the first paper of the day, "Nearline Web Archiving". Citing Masanès' 2006 book Web Archiving, Zhiwu stated that many other types of archiving now exist, some of which straddle the categories Masanès defined (client-side, server-side, and transactional). "The terminology is relatively loose and not mutually exclusive", Zhiwu said.

In his work, he enabled the Apache disk cache and investigated using it as a basis for server-driven web archiving. He reiterated that this model does not fit any of Masanès' three categories of web archiving and provided a new set of definitions:

  • Online: archiving is part of the web transaction and always adds to the server load, but can only archive one response at a time.
  • Offline: archiving is not part of the web transaction but a separate process; it can be used for batch archiving and can handle many responses at a time.
  • Nearline: archiving depends on the accumulation of web transactions but, as a separate process, can be batched at a smaller granularity.

Zhiwu spoke further about the tools provided by Apache for cleaning the cache. His prototype modified the Apache caching module (written in C) to preserve a copy of cache entries prior to their deletion from the system. "As far as I know, there is no strong C support for WARC files," Zhiwu said.
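
Zhiwu's prototype lives inside Apache's C module; as a language-agnostic illustration of the nearline step (and nothing more), the sketch below batches accumulated responses into a WARC file with warcio before they would be evicted from a cache. The cache-entry structure is hypothetical.

```python
# Illustration of the nearline step: batch cached responses into a WARC file.
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def write_batch_to_warc(cache_entries, warc_path):
    """cache_entries: iterable of (uri, header_list, body_bytes) tuples (hypothetical)."""
    with open(warc_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for uri, headers, body in cache_entries:
            http_headers = StatusAndHeaders("200 OK", headers, protocol="HTTP/1.1")
            record = writer.create_warc_record(
                uri, "response", payload=BytesIO(body), http_headers=http_headers
            )
            writer.write_record(record)

if __name__ == "__main__":
    entries = [("http://example.com/", [("Content-Type", "text/html")], b"<html></html>")]
    write_batch_to_warc(entries, "nearline-batch.warc.gz")
```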

Brenda Reyes Ayala (@CamtheWicked) presented next with "We need new names: Applying existing models of Information Quality to web archives". In this work, Brenda stated that much of the work done by web archivists is too technical and that some thought should be given to web archiving concepts independent of the technology. "How do you determine whether this archive is good enough?", she asked.

Brenda spoke of the notion of Information Quality (IQ), usually portrayed as a multi-dimensional construct with facets such as accuracy and validity. Citing Spaniol et al.'s work on data quality in web archiving, namely the notions of archival coherence and blurriness, and WS-DL's own Ainsworth and Nelson on temporal coherence, Brenda stated that there was a lot of talk about data quality without full consideration of IQ. Defining this further, she said, "IQ is data at a very low level that has structure, change, etc. added to it for more information." She finished by asking whether types of coherence other than temporal coherence exist, e.g., topical coherence.

Mohamed Farag presented the last work of the conference with "Which webpage should we crawl first? Social media-based webpage source importance guidance". This work targeted focused crawling. Through experimentation using a tweet data set about the Brussels attacks, he first extracted and sorted all URIs, identified seeds, and ran event-focused crawls to collect 1,000 web pages with consideration for URI uniqueness. His group used the harvest ratio to evaluate the crawler's output. From the experiment he noted biases in the data set, including domain biases, topical biases, genre biases (e.g., news, social media), and political party biases.
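
The harvest ratio is simply the fraction of fetched pages judged relevant to the event; a minimal sketch, assuming a caller-supplied relevance predicate:

```python
# Harvest ratio = relevant pages / total pages crawled.
def harvest_ratio(crawled_pages, is_relevant):
    """crawled_pages: iterable of page objects; is_relevant: page -> bool."""
    pages = list(crawled_pages)
    if not pages:
        return 0.0
    relevant = sum(1 for p in pages if is_relevant(p))
    return relevant / len(pages)

# e.g., harvest_ratio(pages, lambda p: "brussels" in p.text.lower())
```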

Closing

Following Mohamed's presentation, Rick Furuta and Ed Fox spoke about the future of the WADL workshop, stating that, as in recent years, there was an opportunity to collaborate again with IEEE-TCDL to increase the number of venues with which to associate the workshop. Ed suggested moving the workshop from being ad hoc year-to-year to having some sort of institutional stamp on it. "It seems we should have a more official connection," Ed said. "Directly associating with IEEE-TCDL will give more prestige to WADL."

Ed then spoke about the previous call for submissions for the IJDL issue with a focus on web archiving and the state of the submissions for publication. He then had the participants count off, group together, discuss future efforts for the workshop, and present our breakout discussions to the other workshop participants.

Overall, WADL 2016 was a very productive, interesting, and enjoyable experience. The final takeaway was that the workshop will continue to be active throughout the year beyond the annual meeting and hopefully will further facilitate research in the area of web archiving and digital libraries.

—Mat (@machawk1)

EDIT: @liblaura provided a correction to this post following publication.
