Thursday, June 23, 2011

2011-06-23: How Much of the Web is Archived?

There are many questions to ask about web archiving and digital preservation - why is archiving important? what should be archived? what is currently being archived? how often should pages be archived?

The short paper "How Much of the Web is Archived?" (Scott G. Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson), published at JCDL 2011, is our first step at determining to what extent the web is being archived and by which archives.

To address this question, we sampled URIs from four sources to estimate the percentage of archived URIs and the number and frequency of archived versions. We chose 1000 URIs from each of the following sources:
  1. Open Directory Project (DMOZ) - sampled from all URIs (July 2000 - Oct 2010)
  2. Delicious - random URIs from the Recent Bookmarks list
  3. Bitly - random hash values generated and dereferenced
  4. search engine caches (Google, Bing, Yahoo!) - random sample of URIs from queries of 5-grams (using Google's N-gram data)
For each of the sample URIs (4000 in all), we used Memento to discover archived versions, or mementos, of the URI.
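
As a rough illustration of the discovery step, the sketch below counts the mementos listed in a TimeMap serialized in the link format that Memento uses. The sample TimeMap, all of its URIs, and the helper name `parse_mementos` are made up for this example; it is not the code we used in the study.

```python
import re
from datetime import datetime

# A tiny, made-up TimeMap in link format (every URI here is hypothetical).
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example.org/timegate/http://example.com/>; rel="timegate",
<http://archive.example.org/20050101000000/http://example.com/>; rel="first memento"; datetime="Sat, 01 Jan 2005 00:00:00 GMT",
<http://archive.example.org/20070615120000/http://example.com/>; rel="memento"; datetime="Fri, 15 Jun 2007 12:00:00 GMT",
<http://archive.example.org/20100930080000/http://example.com/>; rel="last memento"; datetime="Thu, 30 Sep 2010 08:00:00 GMT"'''

# One link: <uri> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]+)>\s*((?:;\s*[\w-]+="[^"]*"\s*)*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_mementos(timemap_text):
    """Return a sorted list of (datetime, uri) pairs for links whose rel includes 'memento'."""
    mementos = []
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            when = datetime.strptime(attrs["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when, uri))
    return sorted(mementos)

mementos = parse_mementos(SAMPLE_TIMEMAP)
print(len(mementos), "mementos, first observed", mementos[0][0].date())
```

A URI with an empty result would count as "not archived" in a tally like ours.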

We categorize the archives as Internet Archive (using the classic Wayback Machine), search engine caches (Google, Bing, and Yahoo!), and other (e.g., Diigo, Archive-It, UK National Archives, WebCite).

Our first set of graphs shows the mementos discovered for each URI, ordered by the first observation date. These are separated by sample set, so the y-axis on each graph runs from 0-1000. Brown dots indicate mementos discovered at the Internet Archive, blue dots indicate those found in search engine caches, and red dots indicate mementos found at other archives.

[Graph panels: Bitly; Search engines]

There are a few interesting observations:
  • DMOZ URIs are well-represented, especially in the Internet Archive. There are two likely reasons for this: DMOZ is the primary source for seed URIs for the Internet Archive and the DMOZ sample contains more old URIs than the other sources.

  • Bitly URIs are very poorly represented. The majority of Bitly URIs are not found in any archive. This is currently under further investigation.

  • There is a large gap in mementos found in the Internet Archive, starting in 2008. We suspect this is because of the use of the classic version of the Wayback Machine.
The second set of graphs shows the relationship between the density of mementos for a URI and the URI's age. The x-axis is the estimated creation date of the URI, and the y-axis is the number of mementos found for this URI. Large dots indicate that several URIs had similar creation dates and number of mementos. We show density guidelines for 0.5, 1, and 2 mementos created per month.
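
To make the density guidelines concrete, here is a minimal sketch of the arithmetic behind them: the number of mementos divided by the URI's age in months. The function names and the example dates are assumptions for illustration, and the creation date is only an estimate in practice.

```python
from datetime import date

def months_between(start, end):
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def memento_density(num_mementos, created, observed):
    """Mementos per month of the URI's life; ages under one month count as one month."""
    age_months = max(months_between(created, observed), 1)
    return num_mementos / age_months

# Hypothetical URI: created January 2006, observed January 2011, 30 mementos.
density = memento_density(30, date(2006, 1, 1), date(2011, 1, 1))
print(density)  # 0.5, i.e. on the "one memento every 2 months" guideline
```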

[Graph panels: Bitly; Search engines]

From these graphs, we make the following observations:
  • Many of the DMOZ URIs are archived at least once every 2 months.
  • Older Delicious URIs have many mementos.
  • A few Bitly URIs have many mementos, but most URIs have 0-10 mementos.
So, how much of the web is archived? Depends on which "web" you mean.

Ahmed presented this work at JCDL 2011. His presentation slides are below:

This work is supported in part by the Library of Congress and NSF IIS 1009392.


2012-12-27 Update: The long version of this paper has been posted to arXiv, along with the URI datasets described in the paper.  

2011-06-29: OAC Demo of SVG and Constrained Targets

An online annotation service is a tool that lets different authors annotate different resources and gives each annotation its own URI, which can be shared in a Facebook post, blog post, tweet, etc.

Web annotations can be described as a relation between different resources with different media types like text, image, audio, or video. The web annotation service will be able to provide:
  • A unique URI for every annotation.
  • Persistent annotations.
  • Annotation of specific parts of media.
  • Tracking of the related resources.
  • Presentation of annotations in the browser.
  • Compliance with the OAC model requirements (alpha3 release).
Open Annotation Model:
This service will generate annotations that meet the OAC model specification. In an annotation that contains different resources, the OAC will introduce a new resource that describes the relationships between the resources that make the annotation.

A user who is interested in wildlife is browsing a page about elephants in Africa and is interested in the map that shows exactly where the elephants live.

The user relates this image to another image, on another website, that shows how people kill the elephants in order to sell their expensive tusks. Now the tusks picture annotates the map and shows the reason behind the decreasing number of elephants in central Africa.

How does it work?
The process starts on the client side, where the user creates an annotation using SVG-Edit. SVG-Edit is an open source plugin designed for creating SVG graphics on the local desktop. We modified SVG-Edit to meet our requirements, so that the graphic can be edited online and the results sent to our online annotation service.

SVG enables the annotator to annotate specific parts of the image using any shape. This addresses a limitation of the W3C Media Fragments specification, which supports rectangular regions only.
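
The difference between the two ways of selecting a region can be sketched as follows. The first helper builds a media fragment URI with the `xywh` spatial dimension, which can only describe a rectangle; the second builds a small SVG document with a polygon of arbitrary shape. The helper names, the image URI, and the coordinates are made up for this example, and the SVG our service actually stores may look different.

```python
def media_fragment_uri(image_uri, x, y, w, h):
    """Rectangular region as a media fragment: #xywh=x,y,width,height (pixels)."""
    return f"{image_uri}#xywh={x},{y},{w},{h}"

def svg_polygon_selector(points, width, height):
    """Arbitrary region as an SVG polygon drawn over an image of the given size."""
    pts = " ".join(f"{x},{y}" for x, y in points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polygon points="{pts}"/></svg>'
    )

# Hypothetical map image, 640x480 pixels.
print(media_fragment_uri("http://example.org/map.png", 160, 120, 320, 240))
# A five-sided region that no single rectangle can express.
print(svg_polygon_selector([(160, 120), (400, 100), (480, 300), (300, 360), (150, 280)], 640, 480))
```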

After creating the annotation using the SVG-Edit, the annotation data will be sent to our online service that does the following:
  • Pushes all related resources to the WebCite archive. Each resource then has at least two copies with different URIs, one of which is the archived copy.
  • Generates an RDF file describing the relationships between the resources.
  • Creates a resource map for every annotation, using the URIs of the resources and their archived copies. The resource map aggregates all the resources related to the annotation and is referenced by the Link header when the page is dereferenced.
  • Generates a short URI using a URI shortening service, since the generated annotation URI is long. The short URI makes the annotation easy to share in tweets or Facebook posts.
  • At the end of the annotation process, the user gets a simple, short URI that can easily be posted in email, Twitter, or Facebook.
  • When users dereference the URI, they get the annotation back.
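
Our service relies on an existing URI shortening service, so the following is only a sketch of how such a shortener could work in principle, not a description of the one we use. It assigns each long URI a sequential number and encodes it in base 62. The class name, the `http://short.example/` prefix, and the annotation URI are all hypothetical.

```python
import string

# 0-9, a-z, A-Z: 62 symbols in total.
ALPHABET = string.digits + string.ascii_letters

def encode_base62(n):
    """Encode a non-negative integer as a base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

class ShortUriMinter:
    """In-memory shortener: one short URI per distinct long URI."""

    def __init__(self, base="http://short.example/"):
        self.base = base
        self._by_long = {}
        self._by_short = {}

    def mint(self, long_uri):
        if long_uri not in self._by_long:
            short = self.base + encode_base62(len(self._by_long))
            self._by_long[long_uri] = short
            self._by_short[short] = long_uri
        return self._by_long[long_uri]

    def resolve(self, short_uri):
        return self._by_short[short_uri]

minter = ShortUriMinter()
short = minter.mint("http://annotations.example.org/annotation/with/a/very/long/identifier")
print(short, "->", minter.resolve(short))
```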

Pushing the annotation data to the web-service.

Retrieving the annotation using its URI.

You can watch the video for a demonstration of how this service works.

For more details, see the paper "Persistent Annotations Deserve New URIs", published at JCDL 2011. The slides are below:

This service helps with:
  • Minting new URIs for the annotations.
  • Annotating media fragments, made possible by SVG and its media tags.
  • Keeping annotations persistent over time by using web archives.
  • Keeping track of all the related resources using ORE Resource Maps.
This work is supported in part by NSF IIS 1009392.

-- Abdulla

Saturday, June 18, 2011

2011-06-18: Report on the 2011 Digging into Data Challenge Conference

On June 9-10 I attended the 2011 Digging into Data Challenge Conference in Washington DC, which was a status report of the eight projects selected during the initial 2009 Digging into Data Challenge.

Unfortunately, due to traffic challenges to and from the conference, I was able to catch only one half of the sessions. Jennifer Howard of the Chronicle of Higher Education gives a good summary of the sessions (day 1 and day 2).

The highlights of the sessions I attended included the "Data Mining with Criminal Intent" project (whose poster is shown above), which includes the use of the Voyeur Tools for text collection summarization on the "Old Bailey", a corpus of criminal court proceedings in London from 1674 to 1913. Also interesting was the "Mapping the Republic of Letters" project, which is basically social network analysis based on the letter exchanges of prominent scientists and intellectuals during the 18th century. Also of note was Tony Hey's keynote. Although his video/slides are not yet available from the conference website, you can get an idea of his presentation by looking at his OSCON 2009 presentation (slides, video), although his Digging into Data presentation was more recent and expanded. Interesting projects that I learned of included: Digital Narratives, NodeXL, and Zentity.

The conference had a new-to-me format about which I'm not entirely sure how I feel. The project PIs would present the status and highlights of their projects for 45 minutes, and then a respondent not involved with the project would present a rebuttal / evaluation / response / contextualization. The respondents that I saw were gracious and complimentary, but I heard during the breaks that this was not necessarily the case for at least one of the respondents that I missed.

There is a 2011 Digging Into Data Challenge, although with less than a week between the conference and the due date of June 16 it is not clear to me how much the experiences of the previous participants could be incorporated into the 2011 submissions.


Friday, June 17, 2011

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress between 2004 and 2006 for web pages related to the topics of federal elections and terror.

We created the corpus to test the performance of our methods to rediscover missing web pages introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure" published at JCDL 2010. In addition we now thankfully have Synchronicity, a tool that can help overcome the 404 detriment to everyone's browsing experience in real time.

To the best of our knowledge the Book of the Dead is the first corpus of this kind. It is publicly available and we are hopeful that fellow researchers can benefit from it by conducting related work. The corpus can be downloaded at:

And one more thing... not only does the corpus include the missing URIs, it also contains a best guess of what each of the URIs used to be about. We used Amazon's Mechanical Turk and asked workers to guess what the content of the missing pages used to be. We only provided the URIs and the general topics of elections and terror. The workers were supposed to just analyze the URI and draw their conclusions. Sometimes this can be an easy task, for example the URI:

is clearly about an election event in 2004. Maybe one could know that "lp" stands for Libertarian Party and "de" for Delaware. Now this URI makes real sense and most likely "Morris" was a candidate running for office during the elections.

Altogether, the Book of the Dead now offers missing URIs and their estimated "aboutness", which makes it a valuable dataset for retrieval and archival research.

Friday, June 10, 2011

2011-06-10: Launching Synchronicity - A Firefox Add-on for Rediscovering Missing Web Pages in Real Time

Today we introduce Synchronicity, a Firefox extension that supports the user in rediscovering missing web pages. It triggers on the occurrence of 404 "Page not Found" errors, provides archived copies of the missing page as well as five methods to query search engines for the new location of the page (in case it has moved) or to obtain a good enough replacement page (in case the page is really gone).
Synchronicity works in real time and helps to overcome the detriment of link rot in the web.

Download the add-on and follow the installation instructions. After restarting Firefox you will notice Synchronicity's shrimp icon in the right corner of the status bar.

Whenever a 404 "Page not Found" error occurs, the little icon will change color and turn red to notify the user that it has caught the error. Just click once on the red icon and the Synchronicity panel will load up.
Synchronicity utilizes the Memento framework to obtain archived copies of a page. On startup you are in the Archived Version tab where two visualizations of all available archived copies are offered.
The TimeGraph is a static image giving an overview of the number of copies available per year. Three drop down boxes enable you to pick a particular copy by date and have it display in the main browser window.
The TimeLine offers a "zoomable" way to explore the copies according to the time they were archived. Each copy is represented by the icon of its hosting archive. You can click on the icon to receive metadata about the copy and see a link that will display the copy. You can also filter the copies by their archive.
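
The data behind an overview like the TimeGraph is simply the number of archived copies per year. A minimal sketch of that tally, assuming the archive datetimes of the copies are already known (the function name and the sample dates are made up, and this is not Synchronicity's own code):

```python
from collections import Counter
from datetime import datetime

def copies_per_year(memento_datetimes):
    """Map each year to the number of archived copies captured in that year."""
    counts = Counter(dt.year for dt in memento_datetimes)
    return dict(sorted(counts.items()))

# Hypothetical archive datetimes for one missing page.
sample = [
    datetime(2005, 1, 1),
    datetime(2005, 8, 14),
    datetime(2007, 6, 15),
    datetime(2010, 9, 30),
]
print(copies_per_year(sample))  # {2005: 2, 2007: 1, 2010: 1}
```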

Based on these copies Synchronicity provides two content based methods:
  1. the title of the page
  2. the keywords (lexical signature) of the page
both of which can be used as queries against Google, Yahoo! and Bing. The idea is that these queries represent the "aboutness" of the missing page and hence make a good query to discover the page at its new location (URI) or to discover a good enough replacement page that satisfies the user's information need.
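
For readers unfamiliar with lexical signatures, here is a simplified sketch of the idea: score each term of the page by its frequency in the page, weighted by how rare it is in other documents, and keep the top few terms as the query. The stopword list, the tiny background collection, and the function names are assumptions for this example. Synchronicity's actual extraction differs in its details.

```python
import math
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "that", "this", "with", "was", "are", "from"}

def tokenize(text):
    """Lowercase alphabetic tokens, minus stopwords and very short words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if len(t) > 2 and t not in STOPWORDS]

def lexical_signature(page_text, background_docs, k=5):
    """Top-k terms of the page ranked by term frequency times inverse document frequency."""
    tf = Counter(tokenize(page_text))
    doc_terms = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_terms) + 1  # background documents plus the page itself
    scores = {}
    for term, count in tf.items():
        df = 1 + sum(term in terms for terms in doc_terms)
        scores[term] = count * math.log((n_docs + 1) / df)
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]

page = "Libertarian candidate Morris campaigns in Delaware. Morris election rally Delaware."
background = ["Election results and election news", "The election candidate debate"]
print(" ".join(lexical_signature(page, background, k=3)))
```

The resulting terms would then be joined into a single search engine query.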

Synchronicity can further obtain tags from Delicious created by users to annotate the page. Even though tags are sparse, when available they can make a well-performing search engine query. Additionally, Synchronicity will extract the most salient keywords from pages that link to the missing page (link neighborhood lexical signature), which again can be used as a query.
Lastly, Synchronicity offers a convenient way to modify the URL that caused the 404 error and try again. The idea is that maybe shortening the path will get you where you want to go.
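
The path-shortening idea can be sketched as generating candidate URLs by dropping one trailing path segment at a time. In Synchronicity the user edits the URL by hand, so this automated version, its function name, and the example URL are only an illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def shortened_candidates(url):
    """Return URLs obtained by repeatedly removing the last path segment.

    The query string and fragment are dropped, since they usually belong
    to the page that no longer exists.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    while segments:
        segments.pop()
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates

for candidate in shortened_candidates("http://www.example.com/a/b/page.html?x=1"):
    print(candidate)
# http://www.example.com/a/b/
# http://www.example.com/a/
# http://www.example.com/
```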

These last three methods can be applied if no archived copy of the missing page can be found.

Synchronicity provides a straightforward interface but also enables more experienced users to modify all parameters underlying the extraction of titles, keywords, tags and extended keywords. The Expert Interface lets you, for example, show the titles of the last n copies, where you specify the value of n. It also enables you to pick a particular copy to extract the keywords from and to change many more parameters.

Synchronicity is a beta release so do not let it perform open-heart surgery on your mother-in-law!
It was developed within the WS-DL research group in the Computer Science Department at Old Dominion University by Moustafa Aly and Martin Klein under the supervision of Dr. Michael L. Nelson.

Please send your feedback, comments and suggestions for improvement to