Friday, July 6, 2012

2012-07-05: Exploring the WAC: Challenges in Providing Access to the World's Web Archives

The Web Archive Cooperative (WAC) held its 2012 Summer Workshop June 29–30 at Stanford University in Palo Alto, California. The workshop focused on the challenges (and some solutions) of providing easy access to the world's web archives. The WS-DL Research Group had six members in attendance.

Memento and Source Code Repositories — Harihar Shankar (LANL) 

Memento allows temporal access to web resources using datetime. Version-control services such as GitHub also allow temporal access, but by version number rather than datetime. Harihar Shankar of the Los Alamos National Laboratory (LANL) Research Library presented Memento and a Memento/GitHub proxy prototyped at LANL. The proxy enables access to GitHub projects by datetime. For many use cases, a datetime is much simpler than Git's 40-hex-character commit ID.
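To illustrate the datetime-based access Memento defines (RFC 7089), a client sends the desired time to a TimeGate in an `Accept-Datetime` header, formatted as an HTTP-date. A minimal sketch of building that header value in Python:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Format a datetime as the Accept-Datetime value defined by
    RFC 7089: an RFC 1123 HTTP-date, always expressed in GMT."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

# A Memento client sends this header to a TimeGate, which redirects
# to the memento (archived copy) closest to the requested time.
hdr = accept_datetime_header(datetime(2012, 6, 28, tzinfo=timezone.utc))
print(hdr)  # Thu, 28 Jun 2012 00:00:00 GMT
```

The TimeGate's redirect response identifies the selected memento, whose own response carries a `Memento-Datetime` header giving the actual archival time.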

A Research Agenda for “Obsolete Data or Resources” — Michael Nelson (ODU)

Old Dominion University’s Michael Nelson presented WAC’s research agenda for obsolete data and resources. His presentation covered the public’s misconceptions about web archiving, where the web archiving community can improve, the origin of the current notion of time on the web, the gaps bridged by Memento, and some of the progress made to date. Many details and examples are available in the slides.

Building Full Text Indexes of Web Content using Open Source Tools — Erik Hetzner (California Digital Library)

Google knows how to index the Web and allows casual users to discover resources in mere seconds. Add time to the mix, and current indexing and search solutions break down. Erik Hetzner described the challenges of and approaches to temporal search currently being addressed at the California Digital Library (CDL). CDL has 49 public archives, 19 partners, and nearly 1 billion URLs across archives of 3,684 web sites. Nearly 50 TB of archives must be stored, indexed, archived, and searched. CDL's current solutions, which use NutchWAX, do not easily allow for deduplication, metadata indexing, and other optimizations. These and other architectural limitations motivated CDL to begin building anew.

“Tika + HDFS + Pig + Solr = weari”

CDL is now using a combination of open source products for its new WEb ARchiving Indexer (weari). Tika is used for text parsing, Hadoop and HDFS for scalability, Pig for data analysis, and Solr for search.  Erik's slides are now available.
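As a hedged sketch of the final stage of a weari-like pipeline, text that Tika has extracted from an archived page can be posted to Solr's JSON update API for full-text search. The core name and field names below are illustrative assumptions, not weari's actual schema:

```python
import json
from urllib import request

def build_update_request(solr_core_url, docs):
    """Build the HTTP request for Solr's /update endpoint: a JSON
    array of documents, with an immediate commit requested."""
    return request.Request(
        solr_core_url + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_update_request(
    "http://localhost:8983/solr/webarchive",   # assumed core name
    [{"id": "http://example.org/",             # illustrative fields
      "capture_date": "2012-06-29T00:00:00Z",
      "content": "text extracted by Tika"}],
)
# request.urlopen(req) would send it to a running Solr instance.
print(req.full_url)
```

In the full pipeline, Pig scripts running on Hadoop/HDFS would produce these per-capture documents in bulk before they reach Solr.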

Issues in Preserving Scientific and Scholarly Data in Web Archiving — Laura Wynholds (UCLA)

Laura Wynholds studies scientists and what they do with their data. She has been working with scientists from the Center for Embedded Network Sensing (CENS) and the Sloan Digital Sky Survey. At both, she has found a variety of data lifecycles and standards. Data and its associated documentation are shared in many ways, from formal institutional stewardship and repositories to informal means such as email, FedEx, and web sites. Large, well-used data sets tend to have very good preservation arrangements; medium and small data sets do not. However, many medium and small data sets are shared on the web and could be subject to web archiving. The web archive status of two data sets (the VLA FIRST Survey and COMPLETE) was assessed. Neither was well represented in public web archives, and the data that were archived were not in the formats scientists require (e.g., low-resolution images). So web archiving can preserve scientific data, but changes in selection criteria are required for it to be truly effective.

Whose Content is it Anyway? User Perspectives on Archiving Social Media — Cathy Marshall (Microsoft Research)

Cathy Marshall presented her current findings on the public's views on ownership and reuse of visual media. In the web archiving community, we feel the need to preserve the historical Web just as libraries have traditionally preserved copies of books, newspapers, and magazines. Cathy's research addresses the social issues with which we in the web archiving community must contend. Many photographs, blogs, and tweets are publicly accessible on the web, which makes archiving them technically simple. However, when people learn that their pictures and posts are being archived, they are frequently surprised and upset, especially if the archiving organization is a government entity such as the Library of Congress. Much of Cathy's presentation is covered in detail in her JCDL ’12 paper “On the institutional archiving of social media”.

Panel: Legal Opportunities for Web Archiving — Kathy Hashimoto and David Hansen (Berkeley Digital Library Copyright Project)

Another important consideration for web archivists is copyright. The “Legal Opportunities for Web Archiving” panel discussion focused on approaches to ensure web archiving is, and remains, free of legal burden and litigation. In the United States, copyright is derived from Article I, Section 8 of the Constitution and USC Title 17, Chapter 1. There are legal opportunities for web archiving in § 107 (Fair Use), § 108 (Libraries and Archives), § 109 (the “First Sale” Doctrine), and § 110 (Non-profit Performances). The panel discussed the structure of copyright and the issues and problems it raises in the web archiving context. More information is available on the Berkeley Digital Library Copyright Project web site.

ArcSpread: Familiar Concepts Towards Archive Analytics for Social Scientists — Andreas Paepcke (Stanford)

Web archives have been collecting information for nearly two decades, but making this information easily accessible to non-computer scientists continues to be a challenge. Andreas Paepcke is working with social scientists to build tools that allow high-level interaction with archives. The ArcSpread tool (narrated demo) uses the Stanford WebBase as its data source. A spreadsheet metaphor provides a working environment familiar to most computer users.

Text-Entity-Time Analytics in a Temporal Coherent Web Archive — Marc Spaniol (LAWA Project)

Marc Spaniol is a member of the Longitudinal Analytics of Web Archive Data (LAWA) project, where he studies temporal aspects of Web evolution. A detailed description is presented in "Tracking entities in web archives: the LAWA project". Web archives are a gold mine of information, but we lack effective mining tools. Currently, entity tracking is a labor-intensive and tedious process: the relevant URIs must be known, and web archive searching is notoriously difficult. Additionally, following web archive links creates time diffusion, and web archive crawls suffer from temporal incoherence. Text-Entity-Time Analytics focuses on tracking entities (people, places, etc.) over time. The AIDA framework is an online tool for entity detection and disambiguation. Measuring temporal incoherence is key to understanding its sources. Spaniol has developed the SHARC framework, which allows incoherence measurement, and demonstrated that simple changes to crawling strategies can improve temporal coherence.

Archiving Web Pages with Hadoop and Pig — Aaron Binns (Internet Archive)

The Internet Archive (IA) currently holds over 176,000,000,000 resources, requiring nearly 3 petabytes of storage, stored as Web ARChive (WARC), CDX, and Web Archive Transformation (WAT) files. IA processes this mass of resources using Hadoop and Pig. The problem definition, big-data description, and architectural overview Binns presented were excellent. The slides contain many more details and are well worth a look even without Aaron's live explanation.
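The CDX files mentioned above are space-delimited, one-line-per-capture indexes derived from the WARC files, and are the kind of record a Hadoop/Pig job iterates over. A hedged Python sketch of parsing one line, assuming the common 11-field "CDX N b a m s k r M S V g" layout (real CDX files name their fields in a header line, so order can vary):

```python
# Assumed field order for the common 11-field CDX layout.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "redirect", "metatags",
              "length", "offset", "filename"]

def parse_cdx_line(line):
    """Split one CDX line into a dict keyed by field name."""
    return dict(zip(CDX_FIELDS, line.split()))

# An illustrative capture record (digest and filename are made up).
capture = parse_cdx_line(
    "org,example)/ 20120629120000 http://example.org/ text/html "
    "200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 2153 840 part-00001.warc.gz"
)
print(capture["timestamp"], capture["original"])
```

The `offset` and `filename` fields let a reader seek directly to the corresponding record inside a (possibly compressed) WARC file, which is what makes CDX useful as a random-access index over petabytes of captures.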

Beyond BigData: Challenges for Facebook’s Data Infrastructure – Sameet Agarwal (Facebook)

When it comes to big data, few would dispute that Facebook has more data to crunch than nearly anyone else. Sameet Agarwal manages Facebook's 100 PB (yes, petabyte!) Hadoop cluster, the largest Hadoop cluster in the world. Facebook's needs have driven it to contribute to Hadoop and to lead the development of Hive, a peta-scale data warehouse built on Hadoop. This data warehouse has been the source of several interesting studies, including the recently publicized reduction of six degrees of separation to four (actually 4.74). While a 100 PB Hadoop cluster may seem like a solved problem, many issues still need research and resolution. How do you keep a 100 PB cluster running? How do you fairly allocate resources to multiple tenants? How do you coordinate multiple, geographically dispersed clusters? Currently, log data is delivered overnight; how can this latency be reduced or eliminated? Facebook's data is naturally a graph: is a set of tables the best way to represent it, and is converting graph queries into map-reduce jobs the right approach?


Many thanks to Frank McCown (Harding University) for organizing the workshop, Andreas Paepcke and Hector Garcia-Molina at Stanford for hosting, the National Science Foundation (NSF) for their support (1009916), and especially Marianne Siroker for all the time and effort she put into the food and facilities arrangements.

— Scott G. Ainsworth

