Friday, August 28, 2015

2015-08-28: Original Header Replay Considered Coherent

Introduction


As web archives have advanced over time, their ability to capture and play back web content has grown. The Memento Protocol, specified in RFC 7089, is an HTTP extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have analyzed the temporal coherence of web archives. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives play back original headers.
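To make "time-based content negotiation" concrete, here is a minimal sketch of the request header a Memento client sends. Accept-Datetime is the request header defined in RFC 7089; the helper name accept_datetime_headers is my own invention for illustration, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build the request header asking for the capture nearest `when`.

    Hypothetical helper for illustration; only the header name comes
    from RFC 7089.
    """
    # Memento datetimes are expressed in GMT using the HTTP date format.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = accept_datetime_headers(
    datetime(2015, 4, 28, 12, 0, 0, tzinfo=timezone.utc)
)
print(headers)
```

A client would send this header to an archive's TimeGate, which responds with (or redirects to) the memento closest to the requested datetime.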

Archive Headers and Original Headers


Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.

HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback

Try to answer the question "Was the representation provided by the web archive valid for Tue, 28 Apr 2015 12:00:00 GMT?" (i.e., the day before). The best answer possible is maybe. Because I have spent many hours using the Computer Science web site, I know the site changes infrequently. Given this knowledge, I might upgrade the answer from maybe to probably. The difficulty arises because the Last-Modified header reflects the date archived instead of the date the image itself was last modified. And, although it is true that the memento (archived copy) was indeed modified Wed, 29 Apr 2015 15:15:23 GMT, this merging of original resource Last-Modified and memento Last-Modified loses valuable information. (Read Memento-Datetime is not Last-Modified for more details.)

Now consider the headers (Figure 2) for another copy that was archived Sun, 14 Mar 2015 22:21:07 GMT. Take special note of the X-Archive-Orig-* headers. These are a playback of the original headers that were included in the response when the logo image was captured by the web archive.

HTTP/1.1 200 OK
Content-Type: image/gif
X-Archive-Orig-etag: "52d202fb-19db"
X-Archive-Orig-last-modified: Sun, 12 Jan 2014 02:50:35 GMT
X-Archive-Orig-expires: Sat, 19 Dec 2015 13:01:55 GMT
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-cache-control: max-age=31104000
X-Archive-Orig-connection: keep-alive
X-Archive-Orig-date: Wed, 24 Dec 2014 13:01:55 GMT
X-Archive-Orig-content-type: image/gif
X-Archive-Orig-server: nginx
X-Archive-Orig-content-length: 6619
Memento-Datetime: Sun, 14 Mar 2015 22:21:07 GMT
Figure 2. Original Header Playback

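Compare the Memento-Datetime (which is the archive datetime) and the X-Archive-Orig-last-modified headers while answering this question: "Was the representation provided by the web archive valid for Tue, 13 Mar 2015 12:00:00 GMT?" Clearly the answer is yes: the original image was last modified in January 2014 and was still unchanged when it was captured, so the requested datetime falls between the two.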
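The reasoning behind the two answers (maybe for Figure 1, yes for Figure 2) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name valid_at and the use of None for "maybe" are made up for this post, and the header values are copied from the figures.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Header values copied from Figure 1 and Figure 2.
figure1 = {"Last-Modified": "Wed, 29 Apr 2015 15:15:23 GMT"}
figure2 = {
    "X-Archive-Orig-last-modified": "Sun, 12 Jan 2014 02:50:35 GMT",
    "Memento-Datetime": "Sun, 14 Mar 2015 22:21:07 GMT",
}


def valid_at(headers, when):
    """Return True, False, or None ("maybe") for validity at `when`."""
    lowered = {name.lower(): value for name, value in headers.items()}
    captured = lowered.get("memento-datetime")
    modified = lowered.get("x-archive-orig-last-modified")
    if captured is None or modified is None:
        return None  # no original header playback: the best answer is "maybe"
    if when > parsedate_to_datetime(captured):
        return None  # the original may have changed after it was captured
    # Valid if the original had already reached this state by `when`.
    return parsedate_to_datetime(modified) <= when


print(valid_at(figure1, datetime(2015, 4, 28, 12, 0, tzinfo=timezone.utc)))
print(valid_at(figure2, datetime(2015, 3, 13, 12, 0, tzinfo=timezone.utc)))
```

The first call cannot do better than "maybe" because the archive's own Last-Modified says nothing about the original resource; the second call can answer yes because the original Last-Modified and the Memento-Datetime surround the requested datetime.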

Why This Matters


For the casual web archive user, the previous example may seem like much nit-picky detail. Still, consider the Weather Underground page archived on Thu, 09 Dec 2004 19:09:26 GMT and shown in Figure 3.

Figure 3. Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
The Weather Underground page (like most) is a composition of many resources including the web page itself, images, style sheets, and JavaScript. Note the conflict between the forecast of light drizzle and the completely clear radar image. Figure 4 shows the relevant headers returned for the radar image:

HTTP/1.1 200 OK
Memento-Datetime: Mon, 12 Sep 2005 22:34:45 GMT
X-Archive-Orig-last-modified: Mon, 12 Sep 2005 22:32:24 GMT
Figure 4. Prima Facie Coherence Headers

Clearly the radar image was captured much later than the web page, over 9 months later in fact! But this alone does not prove the radar image is the incorrect image (perhaps Weather Underground radar images were broken on 09 Dec 2004). However, the Memento-Datetime and X-Archive-Orig-last-modified headers tell the full story, showing that not only was the radar image captured well after the web page was archived, but also that the radar image was modified well after the web page was archived. Thus, together Memento-Datetime and X-Archive-Orig-Last-Modified are prima facie evidence that the radar image is temporally violative with respect to the archived web page in which it is displayed. Figure 5 illustrates this pattern. The black left-to-right arrow is time. The black diamond and text represent the web page; the green represents the radar image. The green line shows that the radar image X-Archive-Orig-Last-Modified and Memento-Datetime bracket the web page archival time. This pattern and others are described in our framework technical report.

Figure 5. Prima Facie Coherence
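The comparison described above can be sketched as a small classifier. This is a rough illustration, not the framework from the technical report: the function name classify_embedded and the "undetermined" label are my own, and I am assuming that an embedded resource whose original Last-Modified is at or before the page's archival time and whose Memento-Datetime is at or after it counts as prima facie coherent. The report defines additional patterns that are not reproduced here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def classify_embedded(root_archived, memento_datetime, orig_last_modified=None):
    """Classify an embedded memento against the archival time of its page.

    Hypothetical helper; the labels follow this post's wording only.
    """
    if orig_last_modified is None:
        return "undetermined"  # no original header playback, no evidence
    modified = parsedate_to_datetime(orig_last_modified)
    captured = parsedate_to_datetime(memento_datetime)
    if modified > root_archived:
        # The resource changed after the page was archived.
        return "prima facie violative"
    if modified <= root_archived <= captured:
        # Unchanged from before the page was archived until its own capture.
        return "prima facie coherent"
    return "undetermined"


# Weather Underground page archive time and radar image headers (Figure 4).
page = datetime(2004, 12, 9, 19, 9, 26, tzinfo=timezone.utc)
print(classify_embedded(page,
                        "Mon, 12 Sep 2005 22:34:45 GMT",
                        "Mon, 12 Sep 2005 22:32:24 GMT"))
```

For the radar image this prints "prima facie violative", because its original Last-Modified falls after the page's archival time.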

But Does Header Playback Really Matter?


Of course, if few archived images and other embedded resources include Last-Modified headers, the overall effect could be inconsequential. However, the results to be published at Hypertext'15 show that using the Last-Modified header makes a significant coherence improvement: using Last-Modified to select embedded resources increased mean prima facie coherence from ~41% to ~55% compared to using just Memento-Datetime. And, at the time the research was conducted, only the Internet Archive included Last-Modified playback. If the other 14 web archives included in the study also implemented header playback, we project that mean prima facie coherence would have been about 80%!

Web Archives Today


When the research leading to the Hypertext'15 paper was conducted, only the Internet Archive included Last-Modified playback. This limited prima facie coherence determination to only embedded resources retrieved from the Internet Archive. As shown in Table 1, additional web archives now play back original headers. The table also shows which archives implement the Memento Protocol (and are therefore RFC 7089 compliant) and which archives use OpenWayback, which already implements header replay. Although header playback is a long way from universal, progress is happening. We look forward to continuing coherence improvement as additional web archives implement header playback and the Memento Protocol.

Table 1. Current Web Archive Status

Web Archive                          Header Playback?  Memento Compliant?  OpenWayback?
Archive-It                           Yes               Yes                 Yes
archive.today                        No                Yes                 No
arXiv                                No                No                  No
Bibliotheca Alexandrina Web Archive  Unknown [1]       Yes                 Yes
Canadian Government Web Archive      No                Proxy               No
Croatian Web Archive                 No                Proxy               No
DBpedia Archive                      No                Yes                 No
Estonian Web Archive                 No                Proxy               No
GitHub                               No                Proxy               No
Icelandic Web Archive                Yes               Yes                 Yes
Internet Archive                     Yes               Yes                 Yes
Library of Congress Web Archive      Yes               Yes                 Yes
NARA Web Archive                     No                Proxy               Yes
Orain                                No                Proxy               No
PastPages Web Archive                No                Yes                 No
Portuguese Web Archive               No                Proxy               No
PRONI Web Archive                    No                Yes                 Yes
Slovenian Web Archive                No                Proxy               No
Stanford Web Archive                 Yes               Yes                 Yes
UK Government Web Archive            No                Yes                 Yes
UK Parliament's Web Archive          No                Yes                 Yes
UK Web Archive                       Yes               Yes                 Yes
Web Archive Singapore                No                Proxy               No
WebCite                              No                Proxy               No
Wikipedia                            No                Proxy               No

[1] Unavailable at the time this post was written.

Wrap Up


Web archives featuring both the capture and replay of original headers show significantly better temporal coherence in recomposed web pages. Currently, web archives using Heritrix and OpenWayback implement these features; no archives using other software are known to do so. Implementing original header capture and replay is highly recommended as it will allow implementation of improved recomposition heuristics (which is a topic for another day and another post).


— Scott G. Ainsworth

Friday, August 21, 2015

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting



Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit the Internet Archive in San Francisco to share our latest work with a group of web archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we briefly introduced ourselves to one another, describing the purpose and nature of our work to IA.

Then, Nejdl introduced the Alexandria project and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that allow users to visualize and collaboratively interact with Archive-It collections by adding new resources in the form of tags and comments. The project also includes a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo of the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.


The off-topic tool aims to automatically detect when an archived page goes off-topic, meaning the page has changed over time and moved away from its initial scope. The tool suggests a list of off-topic pages based on a threshold supplied by the user. Based on our evaluation of the tool, we suggest threshold values for detecting off-topic pages in a research paper*.
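To give a feel for threshold-based detection, here is a simplified sketch of comparing later captures to the first capture with cosine similarity, one of the methods named in the tool's output. It is not the tool's implementation: the function names are hypothetical, it uses plain term counts rather than whatever weighting the tool applies, and the sample texts are made up. The default threshold of 0.15 is the value passed with -th in Example 1.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphanumeric terms in a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def off_topic(first_text, later_texts, threshold=0.15):
    """Return indexes of later captures whose similarity to the first is below threshold."""
    baseline = term_counts(first_text)
    return [i for i, text in enumerate(later_texts)
            if cosine_similarity(baseline, term_counts(text)) < threshold]


# Made-up page texts, for illustration only.
first = "egypt election candidate revolution news"
later = ["egypt election candidate campaign", "this domain has expired renew now"]
print(off_topic(first, later))
```

Here only the second later capture is flagged, since it shares no terms with the first capture.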

A site for one of the candidates in Egypt’s 2012 presidential election. Many of the captures of hamdeensabahy.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples of using the tool:
--------

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list

http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org

50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org

Downloading 4 mementos out of 306
Downloading 14 mementos out of 306

Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/  -m wcount -th -0.85

Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/


Downloading 4 mementos out of 270

Extracting text from the html

Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/

-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/
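The negative scores in Example 2 are consistent with a relative change in word count compared to the first capture, although that is my guess from the numbers and not something taken from the tool's source. A minimal sketch under that assumption follows; the function names and the word counts 389 and 8 are hypothetical, chosen only because they give a score matching the first line of the output.

```python
def word_count_change(first_count, later_count):
    """Relative change in word count versus the first capture."""
    if first_count == 0:
        raise ValueError("first capture has no words to compare against")
    return (later_count - first_count) / first_count


def flag_by_word_count(first_count, later_counts, threshold=-0.85):
    """Return the later word counts whose relative change is below threshold."""
    return [count for count in later_counts
            if word_count_change(first_count, count) < threshold]


# Hypothetical counts: a 389-word first capture shrinking to 8 words.
print(word_count_change(389, 8))
print(flag_by_word_count(389, [8, 13, 350]))
```

With the -0.85 threshold from Example 2, captures that lost more than 85% of their words relative to the first capture would be flagged.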


Nicholas emphasized the importance of the off-topic tool from a QA perspective, while the Internet Archive folks focused on the required computational resources and how the tool can be shared with Archive-It partners. The group discussed some user interface options for displaying the output of the tool.

After the demo, we discussed the importance of the tool, especially for crawl quality assurance practices. During the demo of the ArchiveWeb interface, some of the visualizations of pages from different collections showed off-topic pages. We all agreed that it is important that those pages do not appear to users when they browse the collections.

It was amazing to spend time at IA and learn about the latest trends from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.

----
Yasmin




Tuesday, August 18, 2015

2015-08-18: Three WS-DL Classes Offered for Fall 2015


https://xkcd.com/657/

The Web Science and Digital Libraries Group is offering three classes this fall.  Unfortunately there are no undergraduate offerings this semester, but there are three graduate classes covering the full WS-DL spectrum:

Note that while 891 classes count toward the 24 hours of 800-level class work for the PhD program, they do not count as one of the "four 800-level regular courses" required.  Students looking to satisfy one of the 800-level regular courses should consider CS 834.  Students considering doing research in the broad areas of Web Science should consider taking all three of these classes this semester.

--Michael