
2011-02-08: An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages

The final project for my master's degree focused on the problem of “missing” web pages: URIs that return an error when retrieved. When a web page is no longer available at a given URI, it may be available at a new URI, and this research proposes and demonstrates a new method for finding the new URI. Prior research has proposed using the lexical signature of a page as a search query to find the same or similar content at a new URI. A lexical signature (LS) is a small set of words that appear in the page much more often than they do in other pages on the Web, and so are thought to describe what the page is about. The LS is then submitted as a search query in the hope that the target page appears in the results. Previously proposed methods for using an LS to find a new URI required either that the page be analyzed before being lost (ref: P&W) or that cached or archived versions of the page be available for analysis. If the page had not previously been analyzed and
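
To make the idea of a lexical signature concrete, here is a minimal sketch in Python. It assumes a TF-IDF-style score (term frequency in the page, weighted by how rare the term is elsewhere) and uses a tiny in-memory list of documents as a stand-in for "other pages on the Web". The function names, the `k` parameter, and the scoring details are illustrative assumptions, not the method evaluated in the project.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF-style score.

    `corpus` is a list of other documents standing in for the rest of
    the Web (an assumption made for this sketch).
    """
    page_terms = Counter(tokenize(page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets) + 1  # the corpus plus the page itself

    scores = {}
    for term, tf in page_terms.items():
        # Document frequency: the page itself plus every corpus doc using the term.
        df = 1 + sum(term in terms for terms in doc_sets)
        scores[term] = tf * math.log(n_docs / df)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


if __name__ == "__main__":
    page = "lexical signature lexical signature web page"
    others = ["web page about cats", "another web page about dogs"]
    # Join the signature terms into a search-engine query string.
    print(" ".join(lexical_signature(page, others, k=2)))
```

Terms that occur in every document score zero, so common words drop out of the signature and the remaining terms can be joined into a search query.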

2010-04-22: Papers landed at Hypertext and JCDL 2010

Not without pride, I can report that two of my papers have been accepted at the upcoming conferences ACM Hypertext (HT) and ACM/IEEE Joint Conference on Digital Libraries (JCDL). The paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure" will be published at JCDL. It is co-authored with my advisor, Dr. Michael L. Nelson. As part of my ongoing dissertation work, we are investigating methods to rediscover missing web pages with the help of the web infrastructure (search engines, their caches, the Internet Archive, etc.) in real time, that is, while the user is browsing the web. This paper evaluates the performance of four of these methods: the title of the web page; its lexical signature (LS), representing the most salient terms of its content; its tags obtained from delicious.com; and its neighborhood lexical signature (NHLS), an LS based on the content of pages that link to the centroid page. We generate a corpus of web pages by randomly sampling from the Open Directo