Wednesday, May 28, 2014

2014-05-28: The road to the most precious three letters, PHD

On May 10th, 2014, the commencement, with hundreds of students wearing their caps and gowns and ready for the moment of graduation, was unforgettable. For me, it was the culmination of a long trip towards my Ph.D. degree in computer science. A few days before that, on May 3rd, 2014, I submitted my dissertation, entitled “Web Archive Services Framework For Tighter Integration Between The Past And Present Web”, to the ODU registrar's office as a declaration that I had completed the requirements for the degree. On Feb 26th, 2014, I defended my dissertation; the defense was presented with these slides and is available to watch on video streaming.

In my research, I explored a proposed service framework that provided APIs for the web archive corpus to enable users and third party developers to access the web archive on four levels.

  • The first level is the content level, which gives access to the actual content of web archive corpora with various filters.
  • The second level is the metadata level, which gives access to two types of metadata. The first type is the temporal web graph: the ArcLink system extracts, preserves, and delivers the temporal web graph for the corpus. ArcLink was published as a poster at JCDL 2013, along with my favorite minute madness, and in more detail as a tech report. ArcLink was also presented at IIPC GA 2013 and received good feedback from the web archives consortium. The second type of metadata is thumbnails: we proposed thumbnail summarization techniques to select and generate a distinctive set of pages that represents the main changes in the visual appearance of a web page through time. This work was presented at ECIR 2014.
  • The third level is the URI level, where we tried to extend the default URI lookup interface to benefit from HTTP redirection. This research was discussed at TempWeb 2013, and the full paper is available in the proceedings.
  • The fourth level is the archive level, where we quantified current web archiving activities in two directions. The first direction was the percentage of the live web corpus that is held in web archives; this was presented at JCDL 2011, and a detailed version appeared as a tech report. This work attracted the attention of various reporters, including those at The Atlantic, The Chronicle of Higher Education, and MIT Technology Review. The second direction was the distribution of web archive materials, where we developed new methods to profile web archives based on TLD and language. The work was presented at TPDL 2013, and an extended version with a larger dataset has been accepted for publication in an IJDL special issue.
Now, while writing about it from my office at Stanford University Library, where I’m working as a web archiving engineer and leading the technical activities for the new Stanford web archiving project, I remember the long trip since I arrived in the US in Fall 2009 to start my degree. It was a long trip to gain the most precious three letters that will be attached to my name forever: Ahmed AlSum, PhD.
@JFK on Aug 2, 2009
Ahmed AlSum

Sunday, May 25, 2014

2014-05-25: IIPC GA 2014

I attended the International Internet Preservation Consortium (IIPC) General Assembly 2014 (#iipcGA14) hosted by the Bibliothèque nationale de France (BnF) in Paris.  Although the GA ran the entire week (May 19 -- May 23), I was only able to attend May 20 & 21.  It looks like I missed some good material on the first day, including keynotes from Wendy Hall and Wolfgang Nejdl, and a presentation from Common Crawl.  Martin Klein also presented an overview of the Hiberlink project, as well as the "mset attribute" that we are working on with the people from Harvard.

I arrived after lunch on May 20, in time for a really strong session on "Harvesting and access: technical updates", featuring talks about Solr indexing (Andy Jackson et al.) (Andy's slides), deduplicating content in WARCs (Kristinn Sigurðsson), Heritrix updates (Kris Carpenter), and Open Wayback (Helen Hockx).  Within WS-DL, we haven't really done much with Solr in our projects or classes and that's a shortcoming we should address soon.

The morning of May 21 began with presentations from Helen Hockx and Gildas Illien about creating IIPC-branded collections (essentially continuing the Olympics collections available so far), followed by breakout sessions to discuss the legal and technical issues regarding such collections (guess which one is the most problematic!).  Although all considered this an interesting direction for IIPC to pursue, I'm not sure we made much progress on how to proceed.

After lunch, I gave my presentation in a session that included status updates about the KB's web archives (Anna Rademakers (slides)) and the Internet Memory Foundation (Leïla Medjkoune and Florent Carpentier (slides)).  My talk established the metaphor of web archives as "cluttered attics, garages, and basements" and then discussed profiling web archives to better perform query routing at the Memento Aggregator, as well as to provide an interchange format and mechanism to coordinate IIPC crawling and coverage activities, including the contents of dark archives.

The day ended with a session about archiving Dutch public TV (Lotte Belice Baltussen (slides)) and crawling & archiving RSS feeds (Kristinn Sigurðsson (slides)).  Thursday and Friday closed out with public workshops, but I was already well into my homeward bound ordeal during those days. 

As always, the IIPC GA was filled with informative sessions and a collaborative spirit.  It was great catching up with old friends, and especially good to see WS-DL alumni Martin Klein (LANL) and Ahmed AlSum (Stanford).  Unfortunately, it is probably one of the last events at which we'll see Kris Carpenter since she is transitioning out of the Internet Archive.  I regret that my schedule did not allow me to attend the entire GA.  Although it is not quite official yet, it looks like the 2015 GA will be held at/near Stanford.


N.B. I will update the narrative above with links to the slides as they become available.

2014-05-27 Update: A mostly complete set of presentations is now available.

2014-06-18 Update: Blog posts about the IIPC GA from Ahmed AlSum and Nicholas Taylor.  

2014-07-28 Update: The BnF has posted some of the videos from the GA.

Monday, May 12, 2014

2014-05-08: Support for Various HTTP Methods on the Web

While clearly not all URIs will support all HTTP methods, we wanted to know which methods are widely supported and how well that support is advertised in HTTP responses. Support for the full range of HTTP methods is crucial for RESTful Web services. Please read our previous blog post for definitions and pointers about REST and HATEOAS. Earlier, we did a brief analysis of HTTP method support in the HTTP Mailbox paper. We have now extended that study to carry out a deeper analysis and look at various aspects of it.

We initially sampled 100,000 URIs from DMOZ and found that only 40,870 URIs were live. Our further analysis was based on the response code, the "Allow" header, and the "Server" header returned for an OPTIONS request to each of those live URIs. We found that out of those 40,870 URIs:
  • 55.31% do not advertise which methods they support
  • 4.38% refuse the OPTIONS method, either with a 405 or 501 response code
  • 15.33% support only HEAD, GET, and OPTIONS
  • 38.53% support HEAD, GET, POST, and OPTIONS
  • 0.12% have syntactic errors in how they convey which methods they support
  • 2.99% have RFC compliance issues, such as a 200 (OK) response code to an OPTIONS request when OPTIONS is not present in the Allow header, a 405 (Method Not Allowed) response code without an Allow header, or a 405 response code when the OPTIONS method is present in the Allow header
Below is an example of an OPTIONS request with a successful response:

$ curl -I -X OPTIONS
HTTP/1.1 200 OK
Date: Wed, 07 Aug 2013 23:11:04 GMT
Server: Apache/2.2.17 (Unix) PHP/5.3.5 mod_ssl/2.2.17 OpenSSL/0.9.8q
Content-Length: 0
Content-Type: text/html


The above example illustrates that the URI returns a 200 OK response, that it is served by an Apache web server, and that it supports the GET, HEAD, POST, and OPTIONS methods.

The following OPTIONS request illustrates an unsuccessful response that has an RFC compliance issue:

$ curl -I -X OPTIONS
HTTP/1.1 405 Not Allowed
Content-Type: text/html
Date: Wed, 07 Aug 2013 22:24:05 GMT
Server: nginx
Content-Length: 166
Connection: keep-alive


The above example illustrates that the URI returns a 405 Not Allowed response, that it is served by an Nginx web server, and that it does not say which methods it allows.
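The checks described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in our study: the function names and the category labels are hypothetical, and it looks only at the status code and the value of the Allow header.

```python
from typing import Optional, Set


def parse_allow(allow: Optional[str]) -> Optional[Set[str]]:
    """Split an Allow header value into a set of upper-cased method names.

    Returns None when the header is absent from the response.
    """
    if allow is None:
        return None
    return {token.strip().upper() for token in allow.split(",") if token.strip()}


def classify_options_response(status: int, allow: Optional[str]) -> str:
    """Assign an OPTIONS response to one of several (hypothetical) categories."""
    methods = parse_allow(allow)
    # Compliance issues listed in the post are checked first.
    if status == 405 and methods is None:
        return "405 without Allow header"
    if status == 405 and "OPTIONS" in methods:
        return "405 but OPTIONS listed in Allow"
    if status == 200 and methods is not None and "OPTIONS" not in methods:
        return "200 but OPTIONS missing from Allow"
    # Remaining refusals of the OPTIONS method.
    if status in (405, 501):
        return "OPTIONS refused"
    if methods is None:
        return "methods not advertised"
    return "methods advertised"
```

For instance, `classify_options_response(405, None)` falls into the same category as the nginx response shown above, since that response carries no Allow header.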

Table 1: Interleaved Method Support Distribution.

Table 1 gives an interleaved distribution of method support. It shows the count and percentage of URIs in our sample set for all the combinations of supported and unsupported methods. If a combination is not listed in the table then it does not occur in our sample set.
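As a rough illustration of how such an interleaved distribution could be tallied from a list of Allow header values, here is a minimal Python sketch. The set of methods used as columns and the function name are assumptions made for this example; they are not taken from the technical report.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed column order for the sketch; the actual table may use a different set.
METHODS = ("GET", "HEAD", "POST", "PUT", "PATCH", "DELETE", "OPTIONS")


def support_distribution(allow_values: Iterable[str]) -> Dict[Tuple[bool, ...], Tuple[int, float]]:
    """Map each observed supported/unsupported combination to (count, percentage)."""
    counts: Counter = Counter()
    for allow in allow_values:
        advertised = {t.strip().upper() for t in allow.split(",") if t.strip()}
        # One boolean per method, in METHODS order: True means "advertised".
        key = tuple(method in advertised for method in METHODS)
        counts[key] += 1
    total = sum(counts.values())
    return {key: (n, 100.0 * n / total) for key, n in counts.items()}
```

Only combinations that actually occur end up in the result, which mirrors the way unlisted combinations are simply absent from the table.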

In our sample set, about 55% of the URIs claim support for the GET and POST methods, but less than 2% of the URIs claim support for one or more of the PUT, PATCH, or DELETE methods. The full technical report can be found at arXiv.


Sawood Alam

Friday, May 2, 2014

2014-04-14: ECIR 2014 Trip report

From the official ECIR 2014 Flickr account
From Apr. 14 to Apr. 16, 2014, in the beautiful city of Amsterdam in the Netherlands, I attended the 36th European Conference on Information Retrieval (ECIR 2014). The conference started with a Workshops/Tutorials day on Apr. 13, which I didn't attend.

The first day was the workshops and tutorials day. ECIR 2014 had a wide range of workshops/tutorials that covered various aspects of IR such as: Text Quantification: A Tutorial, GamifIR' 14 workshop,  Context Aware Retrieval and Recommendation workshop (CaRR 2014), Information Access in smart cities workshop (i-ASC 2014), and Bibliometric-enhanced Information Retrieval workshop (BIR 2014).

The main conference started on April 14 with a welcome note from the conference chair, Maarten de Rijke. After that, Ayse Goker from Robert Gordon University presented the winner of the Karen Spärck Jones award and keynote speaker, Eugene Agichtein, a professor at Emory University. His presentation, entitled "Inferring Searcher Attention and Intention by Mining Behavior Data", covered the challenges and opportunities in the IR field and future research areas.

First, he compared the challenges of “Search” in 2002, when it aimed to support global information access and contextual retrieval, with those of “Search” in 2012 (SWIRL 2012), when it focused on what lies beyond the ranked list and on evaluation. Eugene then moved to the concept of inferring search intention. In this area, he pointed to the use of interaction data, such as understanding search terms through questions asked in social CQA; some unsuccessful queries may also be converted automatically into questions that are forwarded to people (CQA) to answer. He also considered mining query logs and click logs as sources of data that may enhance the search experience.

Then, Eugene discussed the challenges of obtaining realistic search behavior data outside the major search engines.  He discussed UFindIt, a game for collecting controlled search behavior data at scale. He also showed some examples of ways to avoid big and expensive eye-tracking equipment, such as ViewSer, which enables remote eye tracking.

Finally, Eugene listed some of the future trends in the IR field, such as: behavior models for ubiquitous search, a future vision of the search interface built around an intelligent assistant and augmented reality, new tools for the analysis of cognitive processing, using camera-equipped mobile devices as eye-tracking tools, optimizing the power consumption of search tasks on mobile devices, and privacy concerns in searching.

After the break, there were two parallel sessions (Recommendation and Evaluation). I attended the recommendation session, where Chenyi Zhang from Zhejiang University presented his paper entitled "Content + Attributes: a Latent Factor Model for Recommending Scientific Papers in Heterogeneous Academic Networks". In this paper, they proposed a new enhanced latent factor model for recommending academic papers. The system incorporates the paper content (e.g., title and abstract in plain text) and includes additional attributes (e.g., author, venue, publication year). The system solves the cold-start problem for new users by incorporating social media.  In the evaluation session, Colin Wilkie, from the University of Glasgow, presented Best and Fairest: An Empirical Analysis of Retrieval System Bias. After lunch, we had the first poster/demo session. There was a set of interesting demos: DAIKnow, Khresmoi Professional, and ORMA.

The second day, April 15, started with a panel discussion, "Panel on the Information Retrieval Research Ecosystem", but due to jet lag, I couldn't attend the morning session. After lunch, we started the next poster/demo session. I enjoyed the discussions around GTE-Cluster: A Temporal Search Interface for Implicit Temporal Queries and TripBuilder, which won the best demo award.

On the third and last day, April 16, the keynote speaker was Gilad Mishne, Director of Search at Twitter. Gilad introduced Twitter search as building the train track while the train is running at hundreds of miles an hour. He discussed the challenges of the search task at Twitter, which he defined to be: mainstream input of tweets, on-time indexing, ranking tweets, and aggregating results between tweets and people, which requires multiple indexes and multiple ranking techniques. He also distinguished search behavior on Twitter from behavior on search engines: queries are not repeated as much, since 29% of the top queries on Twitter change hourly and 44% change daily. Gilad explained that there is a human in the loop for tweet annotation: Twitter hires "on-call" crowdsourced workers to categorize queries, for example to determine whether a query is news-related or not. There is a set of IR techniques that will not work for Twitter search, such as: anchor text, term frequency, click data, and relevance judgments. Twitter's results optimization targets decreasing the bad results, which improves the search experience, using an evaluation metric called cr@p3 (the fraction of crap in the top 3 docs).

The next session was "Digital Library" session where I presented my paper "Thumbnail Summarization for Web Archives". In this paper, we proposed various techniques to predict the change in the web page visual appearance based on the change of the HTML text in order to select a subset of the TimeMap that represents the major changes of the website through time. We suggested using SimHash fingerprint to estimate the changes between the pages. We proposed three algorithms that may minimize the size of the TimeMap to 25%.
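To give a feel for the idea, here is a minimal Python sketch of a SimHash fingerprint, a Hamming distance between two fingerprints, and a simple threshold-based selection over a list of page versions. It is not one of the three algorithms from the paper: the tokenization, the use of MD5 as the per-token hash, the `select_mementos` helper, and the threshold of 8 bits are all assumptions made for illustration.

```python
import hashlib
import re
from typing import List


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over the word tokens of `text`.

    `bits` must be a multiple of 8 and at most 128 (the size of an MD5 digest).
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_mementos(pages: List[str], threshold: int = 8) -> List[int]:
    """Return the indexes of pages whose fingerprint differs from the last
    selected page by more than `threshold` bits (hypothetical cut-off)."""
    selected: List[int] = []
    last_fingerprint = None
    for index, html in enumerate(pages):
        fingerprint = simhash(html)
        if last_fingerprint is None or hamming_distance(fingerprint, last_fingerprint) > threshold:
            selected.append(index)
            last_fingerprint = fingerprint
    return selected
```

Given the HTML of the mementos in a TimeMap in chronological order, `select_mementos` always keeps the first one and then only those versions whose text has drifted far enough from the previously kept version.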

The next presentation was "CiteSeerX: A Scholarly Big Dataset" by Cornelia Caragea. She spoke about some use cases for scholarly article databases. Caragea used DBLP content to clean the CiteSeerX database.  She assumed that if two articles are similar in title, author, and number of pages, then they are duplicates. However, one member of the audience raised a special use case in medical publications where this assumption does not hold.

Then, Marijn Koolen from the University of Amsterdam presented User Reviews in the Search Index? That'll Never Work!. Marijn studied user reviews of books on the web, e.g., on Amazon, to enhance the search experience for books. He showed different examples of useful and unhelpful comments. He used a big dataset of 2.8 million book descriptions collected from Amazon and LT, augmented by 1.8 M entries from LoC and BL. The industry track ran in parallel with my session; here is an interesting set of slides from Alessandro Benedetti, Zaizi UK.

After lunch, I attended the industry track session, which featured presentations about the global search engines. Pavel Serdyukov from Yandex presented "Analyzing Behavioral Data for Improving Search Experience at Yandex". Pavel spoke about Yandex's efforts to share user data. Yandex has run a click data challenge for 3 years now. He showed how they anonymized the click logs by converting them into numbers.

The next presenter was Peter Mika from Yahoo Labs. His presentation was entitled "Semantic Search at Yahoo". In this presentation, Peter gave an overview of the status of the semantic web and how it is used by search engines.

The day ended with the closing session, where the conference chair thanked the organizers for their efforts. The ECIR 2015 committee also promoted the next ECIR event in Vienna, Austria. Finally, the ECIR 2014 media committee made this wonderful video that incorporated various moments from ECIR 2014.

Ahmed AlSum