2015-07-07: WADL 2015 Trip Report

The Workshop on Web Archiving and Digital Libraries (WADL) 2015 was scheduled for the last day of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2015, and it started on time. When I entered the workshop room, I realized we needed a couple more chairs to accommodate all the participants, which was a good problem to have. The session started with a brief, informal round of introductions by the participants. Without wasting any time, we moved on to the lightning talks session.

Gerhard Gossen started the lightning talk session with his presentation on "The iCrawl System for Focused and Integrated Web Archive Crawling". It was a short description of how iCrawl can be used to create archives for current events, targeted primarily at researchers and journalists. The demonstration illustrated how to search the Web and Twitter for trending topics to find good seed URLs, manually add seed URLs and keywords, extract entities, configure basic crawling policies, and finally start or schedule the crawl.

Ian Milligan presented his short talk on "Finding Community in the Ruins of GeoCities: Distantly Reading a Web Archive". He introduced GeoCities and explained why it matters. He illustrated preliminary explorations of the data, such as images, text, and topic extraction. He announced plans for a Web Analytics Hackathon in Canada in 2016 based on Warcbase and said he is looking for collaborators. He expressed the need for documentation for researchers. To acknowledge the need for context, he said, "In an archive you can find anything you want to prove, need to contextualize to validate the results."

Zhiwu Xie presented a short talk on "Archiving the Relaxed Consistency Web". It focused on the inconsistency problem mainly seen in crawler-based archives. He described the illusion of consistency on distributed social media systems and the role of time zone differences. They found that newer content is more inconsistent. In a simulation, more than 60% of timelines were found to be inconsistent. They propose proactive redundant crawls and compensatory estimation of archival credibility as potential solutions to the issue.

Martin Lhotak and Tomas Foltyn presented their talk on "The Czech Digital Library - Fedora Commons based solution for aggregation, reuse, dissemination and archiving of digital documents". They introduced three main digitization areas in the Czech Republic: Manuscriptorium (early printed books and manuscripts), Kramerius (modern collections from 1801), and WebArchiv (digital archive of Czech web resources). Their goal is to aggregate all digital library content from the Czech Republic under the Czech Digital Library (CDL).

Todd Suomela presented "Analytics for monitoring usage and users of Archive-IT collections". The University of Alberta has been using Archive-It since 2009 and has 19 different collections, of which 15 are public. Their collections are proposed by the public, faculty, or librarians, and each proposal then goes to the Collection Development Committee for review. Todd evaluated the user activity (using Google Analytics) and the collection management aspects of the UA Digital Libraries.

After the lightning talks were over, workshop participants took a break and looked at the posters and demonstrations associated with the lightning talks above.

Our colleague Lulwah Alkwai was scheduled to present her full paper, "How Well Are Arabic Websites Archived?", which won the Best Student Paper award, on the same day, so we joined her in the main conference track.

During the lunch break, the awards were announced, and our WSDL Research Group secured both the Best Student Paper and the Best Poster awards. While some people were still enjoying their lunch, Dr. J. Stephen Downie presented the closing keynote on the HathiTrust Digital Library. I learned a lot more about HathiTrust, its collections, how they deal with copyright and (not so) open data, and their mantra, "bring computing to the data", for the sake of fair use of the copyrighted data. Finally, there were announcements about next year's JCDL conference, which will be held in Newark, NJ from 19 to 23 June, 2016. After that we assembled again in the workshop room for the remaining sessions of WADL.

Robert Comer and Andrea Copeland together presented "Methods for Capture of Social Media Content for Preservation in Memory Organizations". They talked about preserving personal and community heritage. They outlined the issues and challenges in preserving the history of social communities and the problem of preserving social media in general. They are working on a prototype tool called CHIME (Community History in Motion Everyday).

Mohamed Farag presented his talk on "Building and Archiving Event Web Collections: A focused crawler approach". Mohamed described the current approaches to building event collections: 1) manually, which leads to high quality but requires a lot of effort, and 2) via social media, which is quick but may result in potentially low-quality collections. They are looking for a balance between the two approaches to develop an Event Focused Crawler (EFC) that, with the help of a topic detection model, retrieves web pages similar to the curator-selected seed URLs. They have made an event detection service demo available.
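To make the idea of "pages similar to the seeds" a bit more concrete, here is a minimal sketch in Python. It is not the EFC's topic detection model; as a stand-in I assume plain term-frequency cosine similarity, and the function names and the 0.3 threshold are my own hypothetical choices.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term frequencies (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_relevant(page_text, seed_texts, threshold=0.3):
    """Keep a page if it is close enough to at least one seed page.

    The threshold is a hypothetical value, not one reported in the talk.
    """
    page = term_vector(page_text)
    best = max((cosine(page, term_vector(s)) for s in seed_texts), default=0.0)
    return best >= threshold
```

A crawler built this way would only follow and store pages for which is_relevant returns True, which is one simple way to trade off between manual curation and noisy social media harvesting.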

Zhiwu Xie presented "Server-Driven Memento Datetime Negotiation - A UWS Case". He described the Uninterruptable Web Service (UWS) architecture, which uses Memento to provide continuous service even if a server goes down. Then he proposed an amendment to the workflow of the Memento protocol for server-driven content negotiation, instead of an agent-driven approach, to improve the efficiency of UWS.
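For readers unfamiliar with Memento, the core of datetime negotiation is picking the archived copy closest to the datetime the client asks for in its Accept-Datetime header. The sketch below shows only that selection step, done on the server side; it is not the amendment proposed in the talk, and the negotiate function and the example URIs are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def negotiate(accept_datetime, mementos):
    """Pick the memento closest in time to the requested datetime.

    accept_datetime: value of the Accept-Datetime request header.
    mementos: dict mapping timezone-aware UTC datetimes to memento URIs.
    Returns the chosen URI and a value for the Memento-Datetime header.
    """
    target = parsedate_to_datetime(accept_datetime)
    closest = min(mementos, key=lambda dt: abs(dt - target))
    return mementos[closest], format_datetime(closest, usegmt=True)


# Hypothetical holdings of one server for a single original resource.
holdings = {
    datetime(2015, 1, 1, tzinfo=timezone.utc): "https://archive.example/20150101/http://example.com/",
    datetime(2015, 6, 30, tzinfo=timezone.utc): "https://archive.example/20150630/http://example.com/",
    datetime(2015, 12, 31, tzinfo=timezone.utc): "https://archive.example/20151231/http://example.com/",
}
uri, memento_datetime = negotiate("Tue, 07 Jul 2015 12:00:00 GMT", holdings)
```

Here the request for 7 July 2015 resolves to the 30 June 2015 copy, since it is the nearest of the three in time.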

Luis Meneses presented his talk on "Grading Degradation in an Institutionally Managed Repository". He motivated his talk by saying that degradation in a data collection is like a library of books with missing pages. He illustrated examples from his testbed collection to introduce nine classes of degradation, from the least damaged to the worst: 1) kind of correct, 2) university/institution pages, 3) directory listings, 4) blank pages, 5) failed redirects, 6) error pages, 7) pages in a different language, 8) domain for sale, and 9) deceiving pages.

The last speaker of the session, Sawood Alam (your author), presented "Profiling Web Archives". I briefly described the Memento Aggregator and the need for profiling the long tail of archives to improve the efficiency of the aggregator. I described various profile types and policies, analyzed their cost in terms of space and time, and measured the routing efficiency of each profile. I also discussed the serialization format and scale-related issues such as incremental updates. I took advantage of being the last presenter of the workshop and kept the participants away from their dinner longer than I was supposed to.
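As a toy illustration of why profiles help an aggregator, the sketch below summarizes each archive by the set of host names it holds and only routes a lookup to archives whose profile contains the requested host. This host-only key is just one hypothetical policy chosen for brevity, not a summary of the profile types I presented, and the archive names are made up.

```python
from urllib.parse import urlsplit


def profile_key(uri):
    """Reduce a URI to a coarse key: here, only its host name."""
    return (urlsplit(uri).hostname or "").lower()


def build_profile(held_uris):
    """Summarize an archive's holdings as the set of keys it covers."""
    return {profile_key(u) for u in held_uris}


def route(uri, profiles):
    """Return the names of archives worth querying for this URI."""
    key = profile_key(uri)
    return sorted(name for name, keys in profiles.items() if key in keys)


# Hypothetical archives and holdings.
profiles = {
    "archive-a": build_profile(["http://example.com/a", "http://example.org/"]),
    "archive-b": build_profile(["http://example.org/x"]),
}
```

With these profiles, route("http://example.com/", profiles) only queries archive-a, so the aggregator skips a request to archive-b. A coarser key makes the profile smaller but routes more requests that come back empty, which is the space versus routing-efficiency trade-off mentioned above.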

Thanks, Mat, for your efforts in recording various sessions. Thanks, Martin, for the poster pictures. Thanks to everyone who contributed to the WADL 2015 Group Notes; they were really helpful. Thanks to all the organizers, volunteers, and participants for making it a successful event.


Sawood Alam