Posts

2011-07-26: Universal Access to All Knowledge

On July 26, 2011, the Web Science and Digital Library group at Old Dominion University hosted Kris Carpenter Negulescu, Director of the Web Group at the Internet Archive, who gave a talk entitled “Universal Access to All Knowledge”. The presentation started with an introduction to the Internet Archive; she then summarized the materials it currently holds: Text (+2.9M books), Moving Images (+542,500 items), Audio (+950,000 items), Television broadcast (+1M hours), Web Pages (+150 billion pages). She also gave an overview of some of the special collections, such as K-12 students and NASA images. After that, Kris explained the common collection strategies that the Internet Archive uses to crawl the web. Frequently, they do a broad survey of wide-ranging domains such as .com, .net, .org, etc. They also considered the frequency of change for these websites and gave more support to the sites without

2011-07-28: Web Video Discussing Preservation Disappears After 24 Hours

One week ago (July 21, 2011) I was fortunate enough to be invited to speak about Web Archiving on Canada AM, sort of like the Today Show or Good Morning America in the US. I was asked to appear on the program in part because of the July 17, 2011 article in the Washington Post, which followed a July 6, 2011 blog post for the Chronicle of Higher Education, which was based on a June 23, 2011 blog post about our JCDL 2011 paper "How Much of the Web is Archived?". In other words, the process went like this: step 1 - get lucky & step 2 - let preferential attachment do its thing. I was able to do the appearance in Washington, DC, while attending the NDSA/NDIIPP 2011 Partner Meetup. The morning of July 21, I took a taxi to an ABC studio in DC, did the interview (about 4 minutes) and took a taxi back to the conference in time to make the morning session. I had not been on TV before and was both nervous and excited. The local and Canadian crew made the entire exp

2011-07-25: NDSA/NDIIPP Partner Meetup 2011 Trip Report

The NDSA/NDIIPP (@ndiipp) Partner Meetup took place July 19-21 at the Hyatt Regency Washington on Capitol Hill in Washington, DC. Technical and non-technical attendees came together to form a consortium of archivists, librarians, digital media specialists and concerned parties. Three representatives from the ODU Web Sciences and Digital Libraries group attended to make archivists aware of tools they had developed to accomplish the common goal of web archiving. WS-DL’s Contributions to the NDSA/NDIIPP Meetup: Mat Kelly presented the Mozilla Firefox add-on Archive Facebook in a breakout session of presentations specifically targeting web archiving. The redesigned and re-architected add-on allows a user to archive the content of his/her Facebook account, with the result being truly WYSIWYG versus Facebook’s native offering of a content dump. Vivens Ndatinya showed the workings of a tool he is currently buildin

2011-07-21: Towards a Machine-Actionable Scholarly Communication System

I've told all the members of my research group they should watch this, so I thought I might as well make the same recommendation to the rest of the world... Herbert Van de Sompel presented "Towards a Machine-Actionable Scholarly Communication System" at LIBER 2011 in Barcelona, Spain on June 30, 2011. You really have to simultaneously watch the video and review the slides to get the full impact of the presentation. The first part is a succinct review of various projects, but starting at slide 16 ("nanopublications") things really get interesting. Well worth the 40-minute investment. --Michael

2011-07-05: JCDL 2011 Trip Report

JCDL 2011 (#jcdl2011) was held June 13–16 in Ottawa, Ontario, Canada. The weather was beautiful and the conference sessions wonderful. The ODU Web Sciences and Digital Libraries team was fortunate enough to have six of its members attend, present three short papers, and demonstrate the Synchronicity Firefox extension. Our Contributions to JCDL 2011: Ahmed AlSum presented How Much of the Web is Archived? This paper approximates the amount of the Web that is archived using four URI sources. From this data, we observe significant variation in archival rate among URIs from different sources. So, how much of the web is archived? It depends on which web you mean. (pdf, slides). Martin Klein presented Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures, which details a method for discovering missing web pages (the dreaded 404). Martin also demonstrated Synchronicity, a Firefox

2011-06-23: How Much of the Web is Archived?

There are many questions to ask about web archiving and digital preservation: Why is archiving important? What should be archived? What is currently being archived? How often should pages be archived? The short paper "How Much of the Web is Archived?" (Scott G. Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson), published at JCDL 2011, is our first step at determining to what extent the web is being archived and by which archives. To address this question, we sampled URIs from four sources to estimate the percentage of archived URIs and the number and frequency of archived versions. We chose 1000 URIs from each of the following sources: Open Directory Project (DMOZ) - sampled from all URIs (July 2000 - Oct 2010); Delicious - random URIs from the Recent Bookmarks list; Bitly - random hash values generated and dereferenced; search engine caches (Google, Bing, Yahoo!) - random sample of URIs from queries of 5-grams (using Google&#

2011-06-29: OAC Demo of SVG and Constrained Targets

An online annotation service is a tool that lets different authors annotate different resources and gives each annotation its own URI, which can be shared in a Facebook post, blog post, tweet, etc. A web annotation can be described as a relation between resources of different media types, such as text, image, audio, or video. The web annotation service will be able to: provide a unique URI for every annotation; keep annotations persistent; annotate a specific part of a media resource; keep track of the resources; present annotations in the browser; and meet the OAC model requirements (alpha3 release). Open Annotation Model: This service will generate annotations that meet the OAC model specification. In an annotation that contains different resources, the OAC will introduce a new resource that describes the relationships between the resources that make up the annotation. Example: A user who is interested in wildlife is browsing a page about elephants in Africa, and he was interested in the m

2011-06-18: Report on the 2011 Digging into Data Challenge Conference

Image
On June 9-10 I attended the 2011 Digging into Data Challenge Conference in Washington, DC, which was a status report of the eight projects selected during the initial 2009 Digging into Data Challenge. Unfortunately, due to traffic challenges to and from the conference, I was able to catch only half of the sessions. Jennifer Howard of the Chronicle of Higher Education gives a good summary of the sessions (day 1 and day 2). The highlights of the sessions I attended included the "Data Mining with Criminal Intent" project (whose poster is shown above), which includes the use of the Voyeur Tools for text collection summarization on the "Old Bailey", a corpus of criminal court proceedings in London 1674-1913. Also interesting was the "Mapping the Republic of Letters" project, which is basically social network analysis based on the letter exchanges of prominent scientists and intellectuals during the 18th century. Also of note was Tony Hey