Showing posts from June, 2011

2011-06-23: How Much of the Web is Archived?

There are many questions to ask about web archiving and digital preservation: Why is archiving important? What should be archived? What is currently being archived? How often should pages be archived? The short paper "How Much of the Web is Archived?" (Scott G. Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson), published at JCDL 2011, is our first step toward determining to what extent the web is being archived and by which archives. To address this question, we sampled URIs from four sources to estimate the percentage of archived URIs and the number and frequency of archived versions. We chose 1000 URIs from each of the following sources:
- Open Directory Project (DMOZ) - sampled from all URIs (July 2000 - Oct 2010)
- Delicious - random URIs from the Recent Bookmarks list
- Bitly - random hash values generated and dereferenced
- search engine caches (Google, Bing, Yahoo!) - random sample of URIs from queries of 5-grams (using Google
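As an aside, a quick way to check today whether a single URI has an archived copy is the Internet Archive's Wayback Machine availability API. This postdates the paper's methodology and is only a sketch; the helper function names are ours, not the paper's:

```python
import json
import urllib.parse

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

def availability_query(uri):
    """Build a Wayback Machine availability-API query URL for a URI."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": uri})

def is_archived(response_text):
    """True if the availability response reports a closest snapshot."""
    snapshots = json.loads(response_text).get("archived_snapshots", {})
    return bool(snapshots.get("closest", {}).get("available"))
```

Fetching `availability_query(uri)` over HTTP and feeding the body to `is_archived` gives a yes/no answer for one URI; the paper's sampling approach asks this at scale across the four sources above.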

2011-06-29: OAC Demo of SVG and Constrained Targets

An online annotation service is a tool that lets different authors annotate different resources and gives each annotation a separate URI that can be shared via a Facebook post, blog post, tweet, etc. Web annotations can be described as a relation between different resources with different media types, such as text, image, audio, or video. The web annotation service will be able to provide:
- A unique URI for every annotation.
- Persistent annotations.
- Annotation of specific parts of media.
- Tracking of the resources.
- Presentation of annotations in the browser.
- Conformance with the OAC model requirements (alpha3 release).
Open Annotation Model: This service generates annotations that meet the OAC model specification. For an annotation that involves several resources, the OAC model introduces a new resource that describes the relationships between the resources that make up the annotation. Example: A user who is interested in wildlife is browsing a page about elephants in Africa and is interested in the map
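To make the elephant-page example concrete, a minimal OAC-style annotation with a constrained target might be modeled as the structure below. This is only an illustrative sketch: the property names follow the OAC alpha vocabulary, and all URIs are hypothetical placeholders:

```python
# Illustrative sketch of an OAC-style annotation with a constrained target.
# All URIs are hypothetical placeholders; property names follow the OAC
# alpha vocabulary (oac:hasBody, oac:hasTarget, oac:ConstrainedTarget).
annotation = {
    "@id": "http://example.org/annotation/1",        # the annotation's own, shareable URI
    "@type": "oac:Annotation",
    "oac:hasBody": "http://example.org/note.txt",    # the user's comment about the map
    "oac:hasTarget": {
        "@type": "oac:ConstrainedTarget",            # only part of the page is annotated
        "oac:constrains": "http://example.org/elephants.html",
        "oac:constrainedBy": "http://example.org/map-outline.svg",  # SVG region selector
    },
}
```

The extra resource here is the constrained target itself: rather than pointing at the whole page, the annotation points at a node that ties the page to an SVG constraint describing which region (the map) is meant.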

2011-06-18: Report on the 2011 Digging into Data Challenge Conference

On June 9-10 I attended the 2011 Digging into Data Challenge Conference in Washington, DC, which was a status report on the eight projects selected during the initial 2009 Digging into Data Challenge. Unfortunately, due to traffic challenges to and from the conference, I was able to catch only half of the sessions. Jennifer Howard of the Chronicle of Higher Education gives a good summary of the sessions (day 1 and day 2). The highlights of the sessions I attended included the "Data Mining with Criminal Intent" project (whose poster is shown above), which includes the use of the Voyeur Tools for text collection summarization on the "Old Bailey", a corpus of criminal court proceedings in London, 1674-1913. Also interesting was the "Mapping the Republic of Letters" project, which is essentially social network analysis based on the letter exchanges of prominent scientists and intellectuals during the 18th century. Also of note was Tony Hey

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress for web pages related to the topics of federal elections and terror between 2004 and 2006. We created the corpus to test the performance of our methods for rediscovering missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome the detriment of 404s to everyone's browsing experience in real time. To the best of our knowledge, the Book of the Dead is the first corpus of its kind. It is publicly available, and we hope that fellow researchers can benefit from it in related work. The corpus can be downloaded at:
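A corpus like this can decay in the other direction too: a once-dead URI may come back to life. A minimal sketch (ours, not part of the corpus tooling) for auditing that each URI still returns a 404 might look like:

```python
# Sketch: re-verify that corpus URIs are still dead (HTTP 404).
# Requires network access when run against real URIs.
import urllib.error
import urllib.request

def is_dead(uri, timeout=10):
    """Return True only if dereferencing the URI yields a 404 response."""
    try:
        urllib.request.urlopen(uri, timeout=timeout)
        return False          # 2xx: the page is alive after all
    except urllib.error.HTTPError as e:
        return e.code == 404  # dead only if it is specifically a 404
    except urllib.error.URLError:
        return False          # DNS/connection failures are not 404s

def audit(uris):
    """Partition URIs into those still returning 404 and everything else."""
    dead = [u for u in uris if is_dead(u)]
    return dead, [u for u in uris if u not in dead]
```

Running `audit` over the 233 corpus URIs would report which entries remain valid test cases for rediscovery experiments.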

2011-06-10: Launching Synchronicity - A Firefox Add-on for Rediscovering Missing Web Pages in Real Time

Today we introduce Synchronicity, a Firefox extension that supports the user in rediscovering missing web pages. It triggers on the occurrence of 404 "Page not Found" errors and provides archived copies of the missing page as well as five methods to query search engines for the new location of the page (in case it has moved) or for a good enough replacement page (in case the page is really gone). Synchronicity works in real time and helps to overcome the detriment of link rot on the web. Installation: Download the add-on and follow the installation instructions. After restarting Firefox you will notice Synchronicity's shrimp icon in the right corner of the status bar. Usage: Whenever a 404 "Page not Found" error occurs, the little icon will change color and turn red to notify the user that it has caught the error. Just click once on the red icon and the Synchronicity panel will load. Synchroni
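One family of rediscovery methods from our earlier work builds a lexical signature, a handful of terms that characterize the missing page, and submits it as a search-engine query. A simplified, term-frequency-only sketch of that idea follows (the full methods also weigh terms by document frequency, and the stopword list here is a small illustrative stand-in):

```python
# Sketch: derive a crude lexical signature (top-k frequent terms) from the
# text of an archived copy of a missing page, for use as a search query.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def lexical_signature(text, k=5):
    """Return the k most frequent non-stopword terms, space-joined."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower())
             if t not in STOPWORDS]
    counts = Counter(terms)
    return " ".join(term for term, _ in counts.most_common(k))
```

For example, `lexical_signature("elephants in africa africa map elephants elephants", k=2)` yields `"elephants africa"`, which could then be sent to a search engine to locate a moved or replacement page.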