2015-10-07: IMLS and NSF fund web archive research for WS-DL

In the spring and summer of 2015, the Web Science and Digital Libraries (WS-DL) group has received a total of $950k of funding from the IMLS and the NSF to study various aspects of web archiving.  Although previously announced on twitter (IMLS: 2015-03-31 & NSF: 2015-08-25), here we provide greater context for how these awards support our vision for the future of web archiving*.

Our IMLS proposal is titled "Combining Social Media Storytelling With Web Archives" and a PDF of the full proposal is available directly from the IMLS.  This proposal is joint with our partners at Archive-It and is informed by our experiences in several areas, such as:
Our most illuminating insight (somewhat obvious in retrospect) is to not try to include all of the collection's holdings in its summarization, but to only surface the exemplary components sufficient to distinguish one collection from the next.  One example we frequently use is "how do we distinguish the many `human rights' collections available in Archive-It?"  They all have different perspectives, but they can be difficult to navigate for those without detailed knowledge of the seed URIs and the collection development policy. 

The IMLS proposal will investigate two main thrusts:
  1. Selecting a small number (e.g., 20) of exemplary pages from a collection (often 100s of archived copies of 1000s of web pages) and loading them in an existing tool such as Storify as a summarization interface (instead of custom & unfamiliar interfaces).  Yasmin AlNoamany has some exciting preliminary work in this area; for example see her TPDL 2015 paper examining what makes a "good" story on Storify, and her presentation "Using Web Archives to Enrich the Live Web Experience Through Storytelling".
  2. Using existing stories to generate seed URIs for collections.  One problem for human-generated web archive collections is that they depend on the domain knowledge of curators.   For example, the image above shows two Storify stories about early riots in Kiev (aka Kyiv) which predated much of the exposure in Western media and then the subsequent escalation of the crisis.  The collection at Archive-It was not begun until the annexation of the Crimea was imminent, possibly missing the URIs that document the early stages of this developing story.  Our idea is to mine social media, especially stories, for semi-automated, early creation of web archive collections. 
The NSF proposal is titled "Increasing the Value of Existing Web Archives" and represents a shift in how we think about web archiving.  One point we've made for a while now (for example, see our 2014 presentation "Accessing the Quality of Web Archives") is that we must shift our current focus of simply piling up bits in the archive to more nuanced questions of how to make the archives more immediately useful (as opposed to just insurance for future loss) and to how to assess & meaningfully convey the quality of the archived page.  This proposal will have three main research thrusts:
  1. Inspired by Martin Klein's PhD research and Hugo Huurdeman et al.'s "Finding Pages on the Unarchived Web" from JCDL 2014, we would like to see archives provide recommendations of related pages in the archive, as well as suggested "replacements" for pages that are not archived.  Web archives now just return a "yes" (200) or "no" (404) when you query for a URI -- they should be able to provide more detailed answers based on their holdings.
  2. We'd like to further investigate the various issues of how well a page is archived.  We have some preliminary work from Justin Brunelle for automatically assessing the impact of missing embedded resources (typically stylesheets and images), as well as from Scott Ainsworth on detecting temporal violations -- combinations of HTML and images that never occurred on the live web (see "Only One Out of Five Archived Web Pages Existed as Presented" from HT 2015).  
  3. Related to #2, we need to find a better way to visualize the temporal & archival makeup of replayed pages.  For example, the LANL Time Travel service does a nice job of showing the various archives that contribute resources to a reconstruction, but questions remain about scale as well as describing temporal violations and their likely semantic impact.  Similarly, we'd like to investigate how to convey the request environment that generated the representation you're viewing now (see our 2013 D-Lib paper "A Method for Identifying Personalized Representations in Web Archives" for preliminary ideas on linking various geoip, mobile vs. desktop, and other related representations). 
We have been very fortunate with respect to funding in 2015 and we look forward to continued progress on the research thrusts outlined above.  We'd like to thank everyone that made these awards possible.  We welcome any feedback or interest on these (and other) projects as we progress.  Watch this blog and @WebSciDL for continued research updates.


* = See also our 2014 award for $324k from the NEH for the study of personal web archiving and our 2014 award for $49k from the IIPC for profiling web archives for a more complete picture of our research vision for web archives.