Posts

2011-06-18: Report on the 2011 Digging into Data Challenge Conference

Image
On June 9-10 I attended the 2011 Digging into Data Challenge Conference in Washington DC, which was a status report on the eight projects selected during the initial 2009 Digging into Data Challenge. Unfortunately, due to traffic challenges to and from the conference, I was able to catch only half of the sessions. Jennifer Howard of the Chronicle of Higher Education gives a good summary of the sessions (day 1 and day 2). The highlights of the sessions I attended included the "Data Mining with Criminal Intent" project (whose poster is shown above), which includes the use of the Voyeur Tools for text collection summarization on the "Old Bailey", a corpus of criminal court proceedings in London, 1674-1913. Also interesting was the "Mapping the Republic of Letters" project, which is basically social network analysis based on the letter exchanges of prominent scientists and intellectuals during the 18th century. Also of note was Tony Hey

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress for web pages related to the topics of federal elections and terror between 2004 and 2006. We created the corpus to test the performance of our methods to rediscover missing web pages introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome the 404 detriment to everyone's browsing experience in real time. To the best of our knowledge, the Book of the Dead is the first corpus of this kind. It is publicly available, and we are hopeful that fellow researchers can benefit from it by conducting related work. The corpus can be downloaded a

2011-06-10: Launching Synchronicity - A Firefox Add-on for Rediscovering Missing Web Pages in Real Time

Today we introduce Synchronicity, a Firefox extension that supports the user in rediscovering missing web pages. It triggers on the occurrence of 404 "Page not Found" errors and provides archived copies of the missing page as well as five methods to query search engines for the new location of the page (in case it has moved) or to obtain a good enough replacement page (in case the page is really gone). Synchronicity works in real time and helps to overcome the detriment of link rot in the web. Installation: Download the add-on from https://addons.mozilla.org/en-US/firefox/addon/synchronicity and follow the installation instructions. After restarting Firefox you will notice Synchronicity's shrimp icon in the right corner of the status bar. Usage: Whenever a 404 "Page not Found" error occurs, the little icon will change color and turn red to notify the user that it has caught the error. Just click once on the red icon and the Synchronicity panel will load up. Synchr

2011-05-20: Report on the 2011 IIPC General Assembly

I spent the week of May 9-13 at the KB in The Hague, the Netherlands, for the 2011 IIPC General Assembly. Joining me there was Rob Sanderson of LANL. Rob had attended the 2010 GA in Singapore, but this was my first IIPC and I learned a great deal. The first day was open to the public in a special session entitled "Out of the Box: Building and Using Web Archive Collections", most of which I missed because I was taking a nap after arriving the morning of May 9. Fortunately, Inge Angevaare prepared a comprehensive summary of the first day. I believe presentations and a video of highlights from the first day will be available from the IIPC site shortly. The next three days were spent in the IIPC plenary and working groups. Rob gave a high-level Memento status report on Tuesday, and Rob and I gave a more detailed tutorial later in the day: "Memento: Updated technical details (May 2011)" from Herbert Van de Sompel. Wednesday and Thursday were largely spen

2011-04-13: Implementing Time Travel for the Web

Recent trends in digital libraries are towards integration with the architecture of the World Wide Web. The award-winning Memento Project proposes extending HTTP to provide protocol-level access to mementos (archived previous states) of web resources. Using content negotiation and other protocol operations, rather than archive-specific methods, Memento provides the digital library and preservation community with a standardized method to navigate between the original resource and its mementos. [Figure: Memento Client State Chart] The ODU Web Sciences and Digital Libraries Research Group has partnered with the LANL Research Library to create Memento and develop prototype Memento-compliant client and server implementations. A variety of Memento clients have been created, tested, and co-evolved along with the Memento protocol. There is now a Firefox extension, an Internet Explorer browser helper object, and a WebKit-based Android browser. The design and technical solutions identified during th
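As a rough illustration of the content negotiation described in this post, the sketch below builds (but does not send) an HTTP request that asks for a resource as it existed at a given time, using the `Accept-Datetime` request header that Memento defines. The TimeGate URI in the usage lines and the function names are hypothetical placeholders, not part of any of the clients mentioned above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime_header(when: datetime) -> str:
    """Format an aware datetime as an HTTP-date for Accept-Datetime."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # HTTP-dates are always expressed in GMT.
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def build_timegate_request(timegate_uri: str, when: datetime) -> Request:
    """Build a datetime-negotiation request without sending it."""
    headers = {"Accept-Datetime": accept_datetime_header(when)}
    return Request(timegate_uri, headers=headers)


# Hypothetical TimeGate URI, used purely for illustration.
request = build_timegate_request(
    "http://timegate.example.org/timegate/http://www.example.com/",
    datetime(2011, 4, 13, tzinfo=timezone.utc),
)
print(request.header_items())
```

A Memento-aware server would be expected to answer such a request by pointing the client at the memento closest to the requested datetime; how that response looks is not shown here.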

2011-04-08: Radiation Map of Japan

The devastation wrought by the 11 March earthquake in Japan and the depths of the human misery left in the wake of the massive tsunami have left many people awestruck. The size of the quake itself was enormous, and many people have had a hard time comprehending just how big this earthquake was. Some sites like Japan Quake Map help us to comprehend the magnitude of this event. As a result of the earthquake and tsunami, the nuclear reactor at Dai-ichi was severely damaged and has been leaking radiation. The radiation readings have been made available by WIDE and Japan's Nuclear Safety Division. The idea was to use R to create an informative map of Japan showing the radiation levels of the different prefectures. Python was used to import the data from both of the web sites and insert it into a MySQL database. The format of both of the pages was understandably quite dynamic and resulted in the Python script needing to be tweaked quite often. Sometimes it was easier to just copy

2011-04-07: MITRE Records Expo Trip Report

I have just returned from MITRE's Records Expo on MITRE's campus in McLean, VA. The Records Expo is designed to raise awareness of the archival responsibilities of employees within MITRE, and also to inform our sponsors about the archives and records management work we're doing. I was invited to present some of the research being done in digital preservation at ODU and MITRE. (George Despres and I have recently received funding to perform digital preservation research on the digital objects living within the corporate intranet. Our research was explained at the Expo.) We set up booths in the MITRE 2 building, equipped with big-screen TVs with slide shows about other archival and records management systems being pioneered at MITRE (the slides are For Office Use Only, and cannot be shared in this blog). Several MITRE employees attended and listened to presentations given by the archives team and the records

2011-03-25: OAC Phase II Workshop Trip Report

I've just finished attending the Open Annotation Collaboration (OAC) Phase II Workshop in Chicago, IL (March 24-25, 2011). The quality of the presentations was very high and I was surprised at how much the OAC community has grown in a relatively short time. Although I've served on OAC technical review panels before and my student, Abdulla Alasaadi, has worked on a small prototype (to be presented at JCDL 2011) for using SVG instead of the W3C Media Fragments for specifying an annotation target, I haven't been keeping up with the OAC community as closely as I should. The Workshop has all the presentations online, as well as a wiki that contains various commentary, use cases, etc. (also, the hashtag is "#oacwkshp"). Although all of the presentations generated a lot of discussion from the attendees, the presentations that I learned the most from were: Annotation Supporting Collaborative Development of Scholarly Editions ( Jane Hunter and Anna Gerbe

2011-03-21: Grasshopper, prepare yourself. It is time to speak of graphs and digital libraries and other things.

Announcing the publication of an Old Dominion Computer Science Department technical report and an homage to David Carradine, Keye Luke and the television series Kung Fu. "Grasshopper." "Yes, Master Po?" "Grasshopper, you have passed many tests of strength, agility and stamina. But that is not enough. There are other trials you must pass before you are permitted to attempt to lift the fiery brazier. I will ask you a series of questions. Let us begin. What is a graph?" "Master; a graph is a mathematical construct made of objects that may or may not be connected to each other." "Grasshopper, how does a graph relate to digital libraries and the world where we live?" "Master; a graph is composed of nodes (or vertices) that can be connected in a pairwise manner with edges (or arcs). In the world of Facebook, people take the place of nodes and the connection that is made when one person “friends” another creates an edge. In
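Grasshopper's definition can be made concrete with a small sketch. The class below is only an illustration, not code from the technical report: people are nodes, a "friending" is an undirected edge, and a node may exist without any edges at all. The class name and the people named in the usage lines are hypothetical.

```python
from collections import defaultdict


class FriendGraph:
    """Undirected graph: people are nodes, friendships are edges."""

    def __init__(self):
        # Maps each person to the set of people they are connected to.
        self._adjacent = defaultdict(set)

    def add_person(self, name):
        # Touching the key creates a node that may stay unconnected.
        self._adjacent[name]

    def add_friendship(self, a, b):
        # An edge is pairwise, so record it in both directions.
        self._adjacent[a].add(b)
        self._adjacent[b].add(a)

    def are_friends(self, a, b):
        return b in self._adjacent.get(a, set())

    def friends_of(self, name):
        return sorted(self._adjacent.get(name, set()))


graph = FriendGraph()
graph.add_friendship("alice", "bob")
graph.add_friendship("alice", "carol")
graph.add_person("dave")  # a node with no edges

print(graph.friends_of("alice"))
print(graph.are_friends("bob", "carol"))
print(graph.friends_of("dave"))
```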

2011-03-09: Adventures with the Delicious API

I recently conducted an experiment on tags provided by the bookmarking site delicious.com. The goal was to obtain a decent-sized sample set of URIs and the tags that users have used to annotate them. The website provides a "recent" tool that automatically redirects to a somewhat random URI that was recently annotated by some Delicious user. By parsing the HTTP headers I was able to grab the redirect URI and thereby build a corpus of 5000 unique URIs. The URI for the tool is http://www.delicious.com/recent/?random=1. As the second step I needed to obtain the corresponding tags for each URI. I tried to be a good programmer and used the Delicious API to query for the tags instead of parsing the web interface. In order to use the API (v1) you need an account with Delicious/Yahoo. The request for https://username:pwd@api.del.icio.us/v1/posts/suggest?url=http://www.google.com/ for example returns an XML-formatted response with the top five popular tags: search google search eng
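The first step above, grabbing the redirect URI from the HTTP headers, can be sketched roughly as follows. This is not the script used in the experiment: the function name is hypothetical, and the sample response is a made-up placeholder rather than real output from Delicious. It assumes the redirect target arrives in a standard `Location` header on a 3xx response.

```python
from typing import Optional


def redirect_target(raw_response: str) -> Optional[str]:
    """Return the Location header of a 3xx response, or None."""
    lines = raw_response.splitlines()
    if not lines:
        return None
    # Status line looks like: HTTP/1.1 302 Found
    status_parts = lines[0].split()
    if len(status_parts) < 2 or not status_parts[1].startswith("3"):
        return None
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, separator, value = line.partition(":")
        if separator and name.strip().lower() == "location":
            return value.strip()
    return None


# Placeholder response, for illustration only.
sample = (
    "HTTP/1.1 302 Found\r\n"
    "Location: http://www.example.com/page\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

corpus = set()  # a set keeps only unique URIs
target = redirect_target(sample)
if target is not None:
    corpus.add(target)
print(corpus)
```

Collecting the targets in a set is one simple way to end up with unique URIs when the tool happens to return the same page twice.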

2011-03-04: Personal Digital Archiving Conference 2011

Last week, along with Dr. Nelson, I attended the second annual Personal Digital Archiving conference, held at the Internet Archive in the heart of the foggy city, San Francisco. The weather was not on our side, as the sunny state was facing the worst weather in quite a while. This didn't dampen my spirits, as I was excited to be in a room with experts and passionate geniuses whose collective IQ could cause an integer overflow! The general atmosphere was really nice; participants were very friendly and eager to introduce themselves and get to know you. I got exposed to a ton of ideas, projects and insights, sometimes over coffee and other times just going up and down the stairs. My only regret is that I don't have a contact card, as I got a bunch of them; I've got to get me some of these! So that readers can relive this experience with me, I have divided the conference into two days, each of which is in turn divided into sessions. I will try to highlight a thing or two from eac