Posts

2014-05-28: The road to the most precious three letters, PHD

The commencement on May 10th, 2014, with hundreds of students wearing their caps and gowns and ready for the moment of graduation, is one I can't forget. For me, it was the crowning moment of a long journey towards my Ph.D. degree in computer science. A few days before that, on May 3rd, 2014, I submitted my dissertation, entitled “Web Archive Services Framework For Tighter Integration Between The Past And Present Web”, to the ODU registrar's office as a declaration that I had completed the requirements for the degree. On Feb 26th, 2014, I defended my dissertation, which I presented with these slides and which is available to watch on video streaming. In my research, I explored a proposed service framework that provides APIs for the web archive corpus to enable users and third-party developers to access the web archive on four levels. The first level is the content level, which gives access to the actual content of web archive corpora with various filters. The second

2014-05-25: IIPC GA 2014

I attended the International Internet Preservation Consortium (IIPC) General Assembly 2014 (#iipcGA14) hosted by the Bibliothèque nationale de France (BnF) in Paris.  Although the GA ran the entire week (May 19 -- May 23), I was only able to attend May 20 & 21.  It looks like I missed some good material on the first day, including keynotes from Wendy Hall and Wolfgang Nejdl, and a presentation from Common Crawl.  Martin Klein also presented an overview of the Hiberlink project, as well as the "mset attribute" that we are working on with the people from Harvard.  I arrived after lunch on May 20, in time for a really strong session on "Harvesting and access: technical updates", featuring talks about Solr indexing (Andy Jackson et al.) (Andy's slides), deduplicating content in WARCs (Kristinn Sigurðsson), Heritrix updates (Kris Carpenter), and Open Wayback (Helen Hockx).  Within WS-DL, we haven't really done much with Solr in our p

2014-05-08: Support for Various HTTP Methods on the Web

While clearly not all URIs will support all HTTP methods, we wanted to know which methods are widely supported and how well that support is advertised in HTTP responses. Support for the full range of HTTP methods is crucial for RESTful Web services. Please read our previous blog post for definitions and pointers about REST and HATEOAS. Earlier, we did a brief analysis of HTTP method support in the HTTP Mailbox paper. We have extended that study to carry out a deeper analysis and look at various aspects of it. We initially sampled 100,000 URIs from DMOZ and found that only 40,870 URIs were live. Our further analysis was based on the response code, "Allow" header, and "Server" header returned for an OPTIONS request to each of those live URIs. We found that out of those 40,870 URIs: 55.31% do not advertise which methods they support; 4.38% refuse the OPTIONS method, with either a 405 or a 501 response code; 15.33% support only HEAD, GET, and OPTIONS; 38.53% support
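
To make the kind of analysis described in this excerpt concrete, here is a minimal Python sketch that buckets a single OPTIONS response by its status code and "Allow" header. The function name, the bucket labels, and the sample responses are all hypothetical, and the buckets here are mutually exclusive for simplicity, which is not necessarily how the study defined its categories.

```python
from collections import Counter
from typing import Optional


def classify_options_response(status: int, allow: Optional[str]) -> str:
    """Bucket one OPTIONS response (hypothetical labels, not the study's)."""
    # 405 Method Not Allowed and 501 Not Implemented both reject OPTIONS.
    if status in (405, 501):
        return "refuses OPTIONS"
    # No Allow header (or an empty one) means nothing is advertised.
    if allow is None or not allow.strip():
        return "does not advertise methods"
    # The Allow header is a comma-separated list of method names.
    methods = {m.strip().upper() for m in allow.split(",") if m.strip()}
    if methods == {"HEAD", "GET", "OPTIONS"}:
        return "HEAD, GET, and OPTIONS only"
    return "advertises other methods"


# Made-up (status, Allow header) pairs standing in for real responses.
sample = [
    (200, None),
    (405, "GET, HEAD"),
    (200, "GET, HEAD, OPTIONS"),
    (200, "GET, HEAD, POST, OPTIONS"),
]
tally = Counter(classify_options_response(s, a) for s, a in sample)
for bucket, count in tally.items():
    print(f"{bucket}: {count / len(sample):.2%}")
```

In practice the status code and header would come from sending an OPTIONS request to each live URI; that network step is left out so the sketch stays self-contained.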

2014-04-14: ECIR 2014 Trip report

From the official ECIR 2014 Flickr account. From April 14 to April 16, 2014, in the beautiful city of Amsterdam in the Netherlands, I attended the 36th European Conference on Information Retrieval (ECIR 2014). The conference started with a workshops and tutorials day on April 13, which I didn't attend. ECIR 2014 had a wide range of workshops and tutorials that covered various aspects of IR, such as: Text Quantification: A Tutorial, the GamifIR'14 workshop, the Context Aware Retrieval and Recommendation workshop (CaRR 2014), the Information Access in Smart Cities workshop (i-ASC 2014), and the Bibliometric-enhanced Information Retrieval workshop (BIR 2014). The main conference started on April 14 with a welcome note from the conference chair, Maarten de Rijke. After that, Ayse Goker, from Robert Gordon University, presented the winner of the Karen Spärck Jones award and keynote speaker, Eugene Agichtein, a professor at Emory University. His

2014-04-18: Grad Cohort Workshop (CRA-W) 2014 Trip Report

Last week, on April 10-11, 2014, I attended the Graduate Cohort Workshop 2014, which took place at the Hyatt Regency in Santa Clara. While there, I enjoyed the nice California weather and saw the home of the headquarters of several high-tech companies. CRA-W (the Computing Research Association's Committee on the Status of Women in Computing Research) sponsors a number of activities focused on helping graduate students succeed in CSE research careers. These include educational and community-building events, and mentoring. The event was part of CRA-W, which has several goals, including: (1) increasing the number of women in computing, (2) providing strategies and information on navigating graduate school, (3) giving early insight into career paths, and (4) meeting and networking with speakers and other graduate students. Women students in their first, second or third year of graduate school in computer science and engineering or a closely related field, who are studying at a US

2014-04-17: TimeGate Design Options For MediaWiki

We've been working on the development, testing, and improvement of the Memento MediaWiki Extension.  One of our principal concerns is performance. The Memento MediaWiki Extension supports all Memento concepts: Original Resource (URI-R), referred to in MediaWiki parlance as a "topic URI"; Memento (URI-M), called an "oldid page" in MediaWiki; TimeMap (URI-T), analogous to the MediaWiki history page, but in a machine-readable format; and TimeGate (URI-G), which has no native equivalent in MediaWiki: it acquires a datetime from the Memento client and supplies back the appropriate URI-M for the client to render. This article will focus primarily on the TimeGate (URI-G), specifically the analysis of two alternatives for implementing the TimeGate.  In this article we use the following terms to refer to these two alternatives: Special:TimeGate, where we use a MediaWiki Special Page to act as a URI-G explicitly; URI-R=URI-G, where a URI-R acts as a URI-G if it det
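
As a rough illustration of what a TimeGate does with the datetime it acquires from the client, the Python sketch below picks a URI-M from a sorted list of revisions. It is not the extension's code: the function name, the example URIs, and the selection rule (the latest revision at or before the requested datetime, falling back to the earliest revision) are assumptions made for this example.

```python
import bisect
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def select_memento(accept_datetime: str, revisions: list) -> str:
    """Pick a URI-M for the datetime a Memento client asked for.

    `revisions` is a list of (datetime, uri_m) pairs sorted by datetime.
    Assumed rule: return the latest revision at or before the requested
    datetime, or the earliest revision if the request predates them all.
    """
    # Accept-Datetime uses the HTTP date format, e.g.
    # "Tue, 01 Apr 2014 12:00:00 GMT".
    target = parsedate_to_datetime(accept_datetime)
    times = [when for when, _ in revisions]
    index = bisect.bisect_right(times, target)
    return revisions[max(index - 1, 0)][1]


# Hypothetical revision history of one wiki page (URIs are made up).
BASE = "http://wiki.example.org/index.php?title=Topic&oldid="
history = [
    (datetime(2014, 1, 1, tzinfo=timezone.utc), BASE + "100"),
    (datetime(2014, 3, 1, tzinfo=timezone.utc), BASE + "200"),
    (datetime(2014, 4, 10, tzinfo=timezone.utc), BASE + "300"),
]
print(select_memento("Tue, 01 Apr 2014 12:00:00 GMT", history))
```

A real TimeGate would answer with a redirect to the selected URI-M rather than returning a string; the sketch only covers the selection step.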

2014-04-01: Yesterday's (Wiki) Page, Today's Image?

Web pages, being complex documents, contain embedded resources like images.  As practitioners of digital preservation well know, ensuring that the correct embedded resource is captured when the main page is preserved presents a very difficult problem.  In A Framework for Evaluation of Composite Memento Temporal Coherence, Scott Ainsworth, Michael L. Nelson, and Herbert Van de Sompel explore this very concept. Figure 1: Web Archive Weather Underground Example Showing the Different Ages of Embedded Resources. In Figure 1, borrowed from that paper, we see a screenshot of the Web Archive's December 9, 2004 memento from Weather Underground.  Even though the ages of most of these embedded images differ greatly from that of the main page, they don't really impact its meaning.  Of interest is the weather map, which differs by 9 months and shows clear skies even though the forecast on the main page calls for clouds and light rain. The Web Archive, as a service external to the resour

2014-03-01: Starting my research internship at NUS

Well, I made it! I am finally on the green fine island. After a long trip from Norfolk International Airport to Washington DC Dulles, then 23 hours in the air except for a fueling pit stop at Tokyo Narita airport, I landed at Changi airport in Singapore. To give you some context, I was invited to spend a semester at the National University of Singapore and work with Dr. Min Yen Kan in the WING research group. The purpose was to work in a common area of interest that helps me progress in the final leg of my PhD marathon and increases the collaboration between our WS-DL lab and WING, yielding a reputable paper (or more?). In short, I am a WING this semester! So buckle up! With jet lag being a miserable companion the first couple of days, I decided not to take the first day off to rest and settle in, and went directly to the university instead. Or maybe it was my excitement? I will never confess. At NUS I did the regular paperwork and met my colleague and fellow research partner for the

2014-03-01: Domains per page over time

A few days ago, I read an interesting blog post by Peter Bengtsson. Peter is sampling web pages and computing basic statistics on the number of domains (RFC 3986 Host) required to completely render the page.  Not surprisingly, the mean is quite high: 33.  Also not surprisingly, he has found pages that depend on more than 100 different domains. This started me thinking about how this has changed over time. Over the course of my research I have acquired a corpus of composite mementos (archived web pages and all their embedded images, CSS, etc.) dating from 1996 to 2013.  So, I did a little number crunching. What I suspected and confirmed is that the number of domains has increased over time and that the rate of increase has also increased. This is reflected in the median domains data shown in Figure 1. Note that the median shown (3) is a fraction of Peter's (25). I believe there are two major reasons for this. First, our current process for recomposing composite memen
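
The "domains per page" statistic discussed in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the sample pages are hypothetical, a "domain" is taken to be the lowercased RFC 3986 host of each URI, and the page's own URI is counted along with its embedded resources.

```python
from statistics import median
from urllib.parse import urlsplit


def count_domains(resource_uris) -> int:
    """Count the distinct hosts among the URIs needed to render one page."""
    # urlsplit(...).hostname is the lowercased host, or None if absent.
    hosts = {urlsplit(uri).hostname for uri in resource_uris}
    hosts.discard(None)
    return len(hosts)


# Made-up pages: each list holds the page URI plus its embedded resources.
pages = {
    "page-a": [
        "http://example.com/",
        "http://example.com/style.css",
        "http://cdn.example.net/logo.png",
    ],
    "page-b": [
        "http://example.org/",
        "http://example.org/photo.jpg",
    ],
    "page-c": [
        "http://example.com/news",
        "http://ads.example.net/banner.gif",
        "http://stats.example.org/pixel.gif",
    ],
}
domain_counts = [count_domains(uris) for uris in pages.values()]
print("domains per page:", domain_counts)
print("median:", median(domain_counts))
```

Grouping such counts by the year of each page's capture would give the kind of per-year median that the post describes.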