Posts

2011-02-08: An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages

The final project for my master's degree focused on the problem of “missing” web pages, those URIs that return an error result when retrieved.  When a web page is no longer available at a given URI, it may be available at a new URI, and this research proposes and demonstrates a new method for finding the new URI. Prior research has proposed using the lexical signature of a page as a search query to find the same or similar content at a new URI.  A lexical signature (LS) is a few words that are used in that page much more often than they are used in other pages on the Web, and so are thought to describe what the page is about.  That LS is then used as a search query which will hopefully find the target page in its results. Previously-proposed methods for using an LS to find a new URI required either that the page be analyzed before being lost (ref: P&W) or that cached or archived versions of the page be available for analysis.  If the page had not previously been analyzed and
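As a rough illustration of the lexical signature idea described above, here is a minimal sketch that ranks a page's terms by TF-IDF against a small background corpus and keeps the top few. The function names, the tokenizer, and the exact weighting are assumptions made for this example; they are not the method evaluated in the project.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF score.

    `corpus` is a list of other documents standing in for "the Web".
    The weighting (raw term frequency times log inverse document
    frequency) is an assumption chosen for illustration.
    """
    term_counts = Counter(tokenize(page))
    other_docs = [set(tokenize(doc)) for doc in corpus]
    total_docs = len(other_docs) + 1  # the page itself counts as one document
    scores = {}
    for term, tf in term_counts.items():
        df = 1 + sum(term in doc for doc in other_docs)
        scores[term] = tf * math.log(total_docs / df)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


page = "memento memento archive web the the the"
corpus = ["the web is big", "the cat sat on the web", "the archive of the cat"]
signature = lexical_signature(page, corpus, k=2)
```

The resulting terms would then be joined into a search query; common words such as "the" score zero because they appear in every document.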

R Tutorial

As part of Dr. Weigle's CS 796/896 Visual Analytics Seminar I will offer a tutorial on the statistical computing software R. I will give a hands-on introduction to data input/output, simple data manipulation, and (of course) plotting. If, for example, you have always wondered how to fill data vectors, import data from MySQL databases, compute the mean and standard deviation, execute logic, ranking and sorting operations, compute correlations and linear regressions, use loops and write functions, as well as create scatter, bar, box and other plots, all in R, you will enjoy this tutorial. The tutorial is targeted towards course participants who have a natural interest in data visualization, but it also has merit for other MS and Ph.D. students doing research and consequently dealing with and plotting data. While this introduction naturally cannot cover all aspects of R and does not claim to be exhaustive, it will help students get started with the software and the ma

2011-01-11: WS-DL Spring 2011 Classes

We are fortunate to have two classes of interest to WS-DL members this semester: Dr. Nelson's CS 751/851 Introduction to Digital Libraries and Dr. Weigle's CS 796/896 Visual Analytics Seminar. They'll run back-to-back on Tuesdays in ECSB r. 2120, with 796/896 beginning at 1:30pm and 751/851 beginning at 4:20pm. --Michael

2010-12-27: Google Summer Internship, Zürich Switzerland

"Hello Hany!...We are glad to inform you that you have been accepted in the summer internship program this year in Google Zürich GmBH!". Call me a geek, but these were the best words I have ever heard! I now work for Google, well, in one way or another! After struggling with the visa issues I finally got my Swiss Schengen visa and the work permit. The Swiss people are very strict and precise; they thought I was two people, one named Hany Khalil and the other Hany SalahEldeen! Well, I don't blame them (fyi, in Egypt we don't have the concept of a family name; your name is a concatenation of your ancestors' names: my name, then my father's, then his father's...etc). All my life I have been called Hany SalahEldeen, but for some reason the American embassy in Cairo decided that my grandfather's name Khalil suits me better. "Ich spreche kein Deutsch!" or "I don't speak German" was the sentence I was repeating to myself on the plane to Zürich

2010-12-06: Memento Wins the 2010 Digital Preservation Award

The Memento Project won the 2010 Digital Preservation Award in London on December 1, 2010. The DPA is sponsored by the Digital Preservation Coalition, and the Memento Project is sponsored by the Library of Congress (see also: LC's project page). Details about the DPA are provided in several press releases, including ones from the DPC, ODU, LANL and LC. DPC has also posted a short video of an interview with Herbert. And for posterity, the original tweet from William Kilbride announcing the winner (more information from the award ceremony will be announced on #dpa2010). Thanks to the DPC, the DPA judges, the Library of Congress, and everyone on the Memento team! --Michael

2010-12-02: NASA IPCC Data System Workshop

I attended a NASA Intergovernmental Panel on Climate Change (IPCC) Data System Workshop in Greenbelt, Maryland, November 9 - 10. The IPCC is an international committee overseeing the assessment of global climate change. The purpose of this workshop was to discuss technical plans to prepare, incorporate and share IPCC-relevant NASA satellite observational datasets to support the Coupled Model Intercomparison Project Phase 5 (CMIP5). CMIP is a standard protocol and framework for evaluating climate model simulations (hindcasts) and predictions/simulations of future climate change. CMIP5 is the 5th evaluation and is being organized and led by the Program for Climate Model Diagnosis and Intercomparison (PCMDI) mission at Lawrence Livermore National Laboratory. All of this activity will help contribute to the IPCC 5th Assessment Report (IPCC AR5) and beyond. In prior assessments, NASA observational datasets were used very little, if at all. NASA HQ has recognized the richness and importance of

2010-11-15: Memento Presentation at UNC; Memento ID

I recently had a chance to return to the School of Information and Library Science, UNC Chapel Hill, where I had a most enjoyable post-doc during the academic year 2000-2001. Jane Greenberg was nice enough to invite me to speak about Memento in her INLS 520 "Organization of Information" class on Tuesday, November 9th as well as give an invited lecture about Memento to the UNC Scholarly Communications Working Group on Wednesday, November 10th. When I first went to UNC I had the office next to Jane and she was just an assistant professor; now she's a full professor and director of the Metadata Research Center. I enjoyed catching up with her and my many other friends and colleagues at SILS. My slides are available on slideshare.net; they are mostly a combination of slides I've posted before, but with some updates in the HTTP headers. Although the changes are very slight, the recently submitted (11/12/10) Memento Internet Draft takes precedence over all of

2010-11-05: Memento-Datetime is not Last-Modified

One of the key contributions of the Memento Framework is the HTTP response header "Memento-Datetime" (previously called "Content-Datetime" in our earlier publications & slides). Memento-Datetime is the sticky, intended datetime* for the representation returned when a URI is dereferenced. The presence of the Memento-Datetime HTTP response header is how the client realizes it has reached a Memento. Rather than formally explain what we mean by "sticky, intended datetime", it is easier to explain how it is neither the value in the HTTP response header Last-Modified, nor is it the creation date of the resource (which has no corresponding HTTP header, for reasons that will become clear). For the examples below, we'll define the following abbreviations:
CD (Creation-Datetime) = the datetime the resource was created
MD (Memento-Datetime) = the datetime the representation was observed on the web
LM (Last-Modified) = the datetime the resource last cha
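To make the client-side check concrete, here is a minimal sketch of how a client might decide whether a response is a Memento by looking for the Memento-Datetime header. The function name and the sample header values are hypothetical, and the sketch assumes the header value uses the same HTTP date format as Last-Modified.

```python
from email.utils import parsedate_to_datetime


def memento_datetime(headers):
    """Return the Memento-Datetime of a response as a datetime, or None.

    `headers` is a plain dict of HTTP response headers. Header names are
    matched case-insensitively. This helper is hypothetical and assumes
    the value is an HTTP date such as "Fri, 05 Nov 2010 12:00:00 GMT".
    """
    for name, value in headers.items():
        if name.lower() == "memento-datetime":
            return parsedate_to_datetime(value)
    return None


# Made-up headers for illustration: an archived copy and a live resource.
archived = {
    "Memento-Datetime": "Fri, 05 Nov 2010 12:00:00 GMT",
    "Last-Modified": "Mon, 01 Nov 2010 08:30:00 GMT",
}
live = {"Last-Modified": "Mon, 01 Nov 2010 08:30:00 GMT"}

is_memento = memento_datetime(archived) is not None
```

Note that the two headers in the archived example carry different values, and that a response with only Last-Modified is not treated as a Memento.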

2010-10-21: RRAC Presentation

Tuesday, I gave a presentation introducing some of the research we are doing in our WS-DL group to the Records and Archivists (RRAC) national meeting. This group is made up of archivists at Federally Funded Research and Development Centers (like MITRE and Aerospace) and University Archivists. I used slides from several of Dr. Nelson's and Martin Klein's presentations (credits recently given in the last slide). I also gave the same presentation to the Agile Development department (of which Carlton is a member) on Tuesday. Both groups received the research well and had very interesting ideas and comments. The RRAC folks (who were of non-technical backgrounds) questioned the projected lifespan and availability of archives like the Internet Archive (IA). We also discussed the possibility of the Twitter virus being stored in the IA (and I have yet to investigate this possibility). The other interesting topic of disc

2010-10-11: A Blast from the past: My road to Ws-Dl!

Hello everyone, I am Hany SalahEldeen, a PhD student in my first year, and I am honored to be a new member of the Ws-Dl group at Old Dominion University, supervised by Dr. Michael Nelson. I have been in the group for a couple of months now, so I thought I should introduce myself and give a background summary of my career before Ws-Dl, because I believe that if you don't know where you have been, you will never know where you are going. I received my BSc. in Computer Systems Engineering at Alexandria University, Egypt in 2008. My graduation project, entitled "VOID: The web-based integrated development environment", won first prize in the University's graduation projects competition for 2008. For the last 2 years of my degree I was working in a software company back home called eSpace technologies, where I developed systems using Ruby on Rails and was one of the members who developed Neverblock (an open source project to enable easy developm

2010-10-11: ArchiveFacebook Version 1.2 is released

Celebrating a year since the very first release of ArchiveFacebook, the development team is releasing the new version 1.2. Throughout the last couple of months we have received feedback from users asking for enhancements and reporting issues. We also received lots of compliments and thumbs up! This feedback was channeled and analyzed to give us an idea of how to enhance the user experience. We released version 1.2 three days ago with lots of bug fixes and new features, among them the expansion of stories and posts on comments. Several users suggested that it would be useful to be able to archive all the posts and comments on a certain activity (status update, event attendance, photo...etc). Version 1.2 now supports this for any activity stream within your Facebook profile. The new version seems to be so highly anticipated that the number of downloads within the first three days, even before the release was announced, reached 2000 according to Mozilla: https://addons.mozilla.org/en-US/

2010-10-04: WAC Kickoff Meeting; LC Storage Architectures Meeting, DPC Award Shortlist

On September 24, I attended the kickoff meeting at Stanford for the Web Archiving Cooperative (WAC) Project, a joint NSF project (~$2.8M) between Stanford, Old Dominion and Harding. A summary of the meeting will be published at a later date, but it was attended by several members of our Advisory Board (from memory: Chris Borgman (UCLA), Trisha Cruse (CDL), Rick Furuta (TAMU), Alon Halevy (Google), Carl Lagoze (Cornell), Raghu Ramakrishnan (Yahoo), Herbert Van de Sompel (LANL)) and several members and friends of the Stanford Infolab. I gave two presentations: the first was a quick review of the state of web preservation (with the obligatory heavy emphasis on Memento), and the second was some of my ruminations about future things that we should (or should not) explore in the context of WAC. Slides: "Review of Web Archiving" and "My Point of View: Michael L. Nelson Web Archiving Cooperative".

2010-08-28: A Lookup for Nicknames and Diminutive Names

I created a simple lookup file that contains United States given names (first names) and their associated nicknames or diminutive names. For example "gregory" -> "greg", or "geoffrey" -> "geoff". The file can be downloaded, and contributions made, here: http://code.google.com/p/nickname-and-diminutive-names-lookup/ . This lookup was started from http://www.tngenweb.org/franklin/frannick.htm which is used for genealogy purposes. It was a good source to start from, but because it is used for genealogy purposes there are some pretty old names in there. There was also a significant effort to make it machine readable, i.e. separate names with commas and remove human readable conventions like "rickie(y)", so that it would be made into two different names, "rickie" and "ricky". This is a large list with about 700 entries. Any help from people to clean this list up and add to it is greatly appreciated. Think o
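A minimal sketch of how such a comma-separated lookup could be loaded and queried follows. It assumes each line holds a given name followed by its nicknames; that layout, the function name, and the sample rows are assumptions for illustration and may not match the actual file.

```python
import csv


def load_nicknames(lines):
    """Build a dict mapping each given name to a set of nicknames.

    Assumes one comma-separated record per line: the given name first,
    then its nicknames. This layout is an assumption for illustration.
    """
    lookup = {}
    for row in csv.reader(lines):
        names = [name.strip().lower() for name in row if name.strip()]
        if not names:
            continue  # skip blank lines
        given, nicknames = names[0], names[1:]
        lookup.setdefault(given, set()).update(nicknames)
    return lookup


# Hypothetical sample rows in the assumed format.
sample = [
    "gregory,greg",
    "geoffrey,geoff",
    "richard,rick,rickie,ricky",
]
lookup = load_nicknames(sample)
```

With the real file, the same function could be passed an open file handle instead of a list, since `csv.reader` accepts any iterable of lines.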

2010-08-18: Fall 2010 Classes

There will be two WS-DL classes offered for Fall 2010. CS 418/518 "Web Programming" will be taught by Martin Klein, but it will be similar in format and content to prior offerings, especially in respect to the focus on LAMP. This class involves significant programming, developing a single project throughout the semester. It is a good complement to CS 495/595 "Web Server Development", which was last taught by Martin in Spring 2010. 2010-08-30 edit: The class page for CS 418/518 is now available. I will teach CS 895 "Time on the Web", a new class that will explore the issues of Web resources evolving through time and how we interact with them. Aside from the canonical background readings, we will focus on current and recent projects such as our own Memento & Synchronicity, as well as OAC, Zoetrope, The Re:Search Engine, ADAPT, Past Web Browser, and other projects and papers to be determined. This class will be heavily oriented to

2010-07-27: NDIIPP Partners Meeting, IETF 78

On July 20-22, I was at the NDIIPP Partners Meeting in Arlington VA, along with Martin Klein and Michele Weigle. The Library of Congress has not yet uploaded a public summary of the meeting, but there were a number of interesting additions to previous NDIIPP Partners Meetings (edit: the meeting slides are now available). First, there were keynotes from both the Librarian of Congress, James Billington, as well as the Archivist of the United States, David Ferriero. There was also a ceremony to commemorate the charter members (which include ODU CS) of the National Digital Stewardship Alliance (NDSA). I don't think the NDSA has a canonical web site yet, so the iPRES 2009 paper by Anderson, Gallinger & Potter is probably the best available description (edit: LC has announced an NDSA web site). There was a theme of exploring the questions about "why we should care about digital preservation". The Library of Congress debuted this video, now available on th