Friday, August 28, 2015

2015-08-28 Original Header Replay Considered Coherent

Introduction


As web archives have advanced, their ability to capture and play back web content has grown. The Memento Protocol (RFC 7089) defines an HTTP extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have conducted analysis of web archive temporal coherence; some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives play back original headers.

Archive Headers and Original Headers


Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.

HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback

Try to answer the question "Was the representation provided by the web archive valid for Tue, 28 Apr 2015 12:00:00 GMT?" (i.e., the day before). The best answer possible is maybe. Because I have spent many hours using the Computer Science web site, I know the site changes infrequently. Given this knowledge, I might upgrade the answer from maybe to probably. The difficulty arises because the Last-Modified header reflects the date the image was archived rather than the date the image itself was last modified. And, although it is true that the memento (archived copy) was indeed modified Wed, 29 Apr 2015 15:15:23 GMT, this merging of original resource Last-Modified and memento Last-Modified loses valuable information. (Read Memento-Datetime is not Last-Modified for more details.)

Now consider the headers (Figure 2) for another copy that was archived Sun, 14 Mar 2015 22:21:07 GMT. Take special note of the X-Archive-Orig-* headers. These are a playback of the original headers that were included in the response when the logo image was captured by the web archive.

HTTP/1.1 200 OK
Content-Type: image/gif
X-Archive-Orig-etag: "52d202fb-19db"
X-Archive-Orig-last-modified: Sun, 12 Jan 2014 02:50:35 GMT
X-Archive-Orig-expires: Sat, 19 Dec 2015 13:01:55 GMT
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-cache-control: max-age=31104000
X-Archive-Orig-connection: keep-alive
X-Archive-Orig-date: Wed, 24 Dec 2014 13:01:55 GMT
X-Archive-Orig-content-type: image/gif
X-Archive-Orig-server: nginx
X-Archive-Orig-content-length: 6619
Memento-Datetime: Sun, 14 Mar 2015 22:21:07 GMT
Figure 2. Original Header Playback

Compare the Memento-Datetime (which is the archive datetime) and the X-Archive-Orig-last-modified headers while answering this question: "Was the representation provided by the web archive valid for Tue, 13 Mar 2015 12:00:00 GMT?". Clearly the answer is yes.
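This comparison is easy to script. Below is a minimal sketch (not archive tooling) that uses Python's requests library to fetch a memento and test whether its played-back original Last-Modified shows it was valid at a given datetime; the memento URL in the usage comment is illustrative, and the check only works when the archive plays back X-Archive-Orig-* headers.

# Minimal sketch: decide whether a memento was valid at a given datetime by
# comparing its Memento-Datetime with the played-back original Last-Modified.
from datetime import datetime
import requests

HTTP_DATE = "%a, %d %b %Y %H:%M:%S GMT"

def valid_at(memento_url, when):
    """Return True/False if determinable, or None if the answer is 'maybe'."""
    resp = requests.get(memento_url)
    memento_dt = datetime.strptime(resp.headers["Memento-Datetime"], HTTP_DATE)
    last_modified = resp.headers.get("X-Archive-Orig-Last-Modified")
    if last_modified is None:
        return None  # no original header playback: the best answer is "maybe"
    last_modified_dt = datetime.strptime(last_modified, HTTP_DATE)
    # Valid if the resource was last modified on or before `when`
    # and not captured (hence unchanged) until on or after `when`.
    return last_modified_dt <= when <= memento_dt

# Example (hypothetical memento URL):
# valid_at("https://web.archive.org/web/20150314222107/http://www.cs.odu.edu/logo.gif",
#          datetime(2015, 3, 13, 12, 0))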

Why This Matters


For the casual web archive user, the previous example may seem like mere nit-picking. Still, consider the Weather Underground page archived on Thu, 09 Dec 2004 19:09:26 GMT and shown in Figure 3.

Figure 3. Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
The Weather Underground page (like most) is a composition of many resources including the web page itself, images, style sheets, and JavaScript. Note the conflict between the forecast of light drizzle and the completely clear radar image. Figure 4 shows the relevant headers returned for the radar image:

HTTP/1.1 200 OK
Memento-Datetime: Mon, 12 Sep 2005 22:34:45 GMT
X-Archive-Orig-last-modified: Mon, 12 Sep 2005 22:32:24 GMT
Figure 4. Prima Facie Coherence Headers

Clearly the radar image was captured much later than the web page, over nine months later in fact! But this alone does not prove the radar image is the incorrect image (perhaps Weather Underground radar images were broken on 09 Dec 2004). However, the Memento-Datetime and X-Archive-Orig-last-modified headers tell the full story: not only was the radar image captured well after the web page was archived, it was also modified well after the web page was archived. Thus, together Memento-Datetime and X-Archive-Orig-Last-Modified are prima facie evidence that the radar image is temporally violative with respect to the archived web page in which it is displayed. Figure 5 illustrates this pattern. The black left-to-right arrow is time. The black diamond and text represent the web page; the green represents the radar image. The green line shows that the radar image's X-Archive-Orig-Last-Modified and Memento-Datetime bracket the web page's archival time. This pattern and others are detailed in our framework technical report.

Figure 5. Prima Facie Coherence
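To make the pattern concrete, here is a small sketch of the prima facie test described above (simplified terminology, not the framework's actual code); it assumes the three datetimes have already been parsed from the headers.

# Sketch of the prima facie coherence test for an embedded resource
# (e.g., the radar image) relative to the page that embeds it.
def prima_facie_state(page_memento_dt, embed_memento_dt, embed_last_modified_dt):
    if embed_last_modified_dt is None:
        return "undetermined"  # no original header playback
    if embed_last_modified_dt <= page_memento_dt <= embed_memento_dt:
        # Last-Modified and Memento-Datetime bracket the page's archival time:
        # the embedded resource provably did not change while the page was current.
        return "prima facie coherent"
    if embed_last_modified_dt > page_memento_dt:
        # The embedded resource changed after the page was archived
        # (the Weather Underground radar image above).
        return "prima facie violative"
    return "possibly coherent"  # other patterns need additional evidence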

But Does Header Playback Really Matter?


Of course, if few archived images and other embedded resources include Last-Modified headers, the overall effect could be inconsequential. However, the results to be published at Hypertext'15 show that Last-Modified playback makes a significant difference: using Last-Modified to select embedded resources increased mean prima facie coherence from roughly 41% to roughly 55% compared to using Memento-Datetime alone. And, at the time the research was conducted, only the Internet Archive included Last-Modified playback. If the other 14 web archives included in the study also implemented header playback, we project that mean prima facie coherence would have been about 80%!

Web Archives Today


When the research leading to the Hypertext'15 paper was conducted, only the Internet Archive included Last-Modified playback. This limited prima facie coherence determination to embedded resources retrieved from the Internet Archive. As shown in Table 1, additional web archives now play back original headers. The table also shows which archives implement the Memento Protocol (and are therefore RFC 7089 compliant) and which archives use OpenWayback, which already implements header replay. Although header playback is a long way from universal, progress is happening. We look forward to continuing coherence improvement as additional web archives implement header playback and the Memento Protocol.

Table 1. Current Web Archive Status
Web Archive                          | Header Playback? | Memento Compliant? | OpenWayback?
Archive-It                           | Yes              | Yes                | Yes
archive.today                        | No               | Yes                | No
arXiv                                | No               | No                 | No
Bibliotheca Alexandrina Web Archive  | Unknown¹         | Yes                | Yes
Canadian Government Web Archive      | No               | Proxy              | No
Croatian Web Archive                 | No               | Proxy              | No
DBPedia Archive                      | No               | Yes                | No
Estonian Web Archive                 | No               | Proxy              | No
GitHub                               | No               | Proxy              | No
Icelandic Web Archive                | Yes              | Yes                | Yes
Internet Archive                     | Yes              | Yes                | Yes
Library of Congress Web Archive      | Yes              | Yes                | Yes
NARA Web Archive                     | No               | Proxy              | Yes
Orain                                | No               | Proxy              | No
PastPages Web Archive                | No               | Yes                | No
Portuguese Web Archive               | No               | Proxy              | No
PRONI Web Archive                    | No               | Yes                | Yes
Slovenian Web Archive                | No               | Proxy              | No
Stanford Web Archive                 | Yes              | Yes                | Yes
UK Government Web Archive            | No               | Yes                | Yes
UK Parliament's Web Archive          | No               | Yes                | Yes
UK Web Archive                       | Yes              | Yes                | Yes
Web Archive Singapore                | No               | Proxy              | No
WebCite                              | No               | Proxy              | No
Wikipedia                            | No               | Proxy              | No
¹Unavailable at the time this post was written.

Wrap Up


Web archives featuring both the capture and replay of original headers show significantly better temporal coherence in recomposed web pages. Currently, web archives using Heritrix and OpenWayback implement these features; no archives using other software are known to do so. Implementing original header capture and replay is highly recommended, as it will enable improved recomposition heuristics (which is a topic for another day and another post).


— Scott G. Ainsworth

Friday, August 21, 2015

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting



Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit the Internet Archive in San Francisco to share our latest work with a group of web archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we went around with quick introductions, each of us describing the purpose and nature of our work to IA.

Then Nejdl introduced the Alexandria project and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze web archives in a meaningful way. ArchiveWeb lets users visualize and collaboratively interact with Archive-It collections by adding new resources in the form of tags and comments, and it includes a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo of the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.


The off-topic tool automatically detects when an archived page goes off-topic, that is, when the page changes over time and moves away from its initial scope. The tool suggests a list of off-topic pages based on a similarity threshold supplied by the user. Based on our evaluation of the tool, we suggest threshold values in a research paper* that can be used to detect off-topic pages.
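To give a flavor of how such a threshold works, here is a rough sketch of cosine-similarity-based off-topic detection. This is not the tool's actual code; it assumes the mementos' text has already been extracted and that each memento is compared against the earliest one in the TimeMap.

# Rough sketch: flag mementos whose TF-IDF cosine similarity to the earliest
# memento falls below a user-supplied threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_off_topic(mementos, threshold=0.15):
    """mementos: list of (uri, extracted_text) pairs, ordered earliest to latest."""
    uris, texts = zip(*mementos)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(tfidf[0:1], tfidf).flatten()  # compare against earliest
    return [(sims[i], uris[i]) for i in range(1, len(uris)) if sims[i] < threshold]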

A site for one of the candidates in Egypt's 2012 presidential election. Many of the captures of hamdeensabahy.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples for the usage of the tool:
--------

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list

http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org

50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org

Downloading 4 mementos out of 306
Downloading 14 mementos out of 306

Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/  -m wcount -th -0.85

Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/


Downloading 4 mementos out of 270

Extracting text from the html

Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/

-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/


Nicholas emphasized the importance of the off-topic tool from a QA perspective, while the Internet Archive folks focused on the required computational resources and how the tool could be shared with Archive-It partners. The group discussed some user interface options to display the output of the tool.

After the demo, we discussed the importance of the tool, especially for crawl quality assurance practices. While the ArchiveWeb interface was being demoed, some of the visualizations of pages from different collections showed off-topic pages. We all agreed that it is important that those pages not appear to users when they browse the collections.

It was amazing to spend time at IA and learn about the latest trends from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.

----
Yasmin




Tuesday, August 18, 2015

2015-08-18: Three WS-DL Classes Offered for Fall 2015


https://xkcd.com/657/

The Web Science and Digital Libraries Group is offering three classes this fall.  Unfortunately there are no undergraduate offerings this semester, but there are three graduate classes covering the full WS-DL spectrum:

Note that while 891 classes count toward the 24 hours of 800-level class work for the PhD program, they do not count as one of the "four 800-level regular courses" required.  Students looking to satisfy one of the 800-level regular courses should consider CS 834.  Students considering doing research in the broad areas of Web Science should consider taking all three of these classes this semester.

--Michael

Monday, July 27, 2015

2015-07-27: Upcoming Colloquium, Visit from Herbert Van de Sompel

On Wednesday, August 5, 2015 Herbert Van de Sompel (Los Alamos National Laboratory) will give a colloquium in the ODU Computer Science Department entitled "A Perspective on Archiving the Scholarly Web".  It will be held in the third floor E&CS conference room (r. 3316) at 11am.  Space is somewhat limited (the first floor auditorium is being renovated), but all are welcome to attend.  The abstract for his talk is:

 A Perspective on Archiving the Scholarly Web
As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions -- registration, certification, awareness, archiving -- are fulfilled co-evolves.  Illustrations of the changing implementation of these functions will be used to arrive at a high-level characterization of a future scholarly communication system and of the objects that will be communicated. The focus will then shift to the fulfillment of the archival function for web-native scholarship. Observations regarding the status quo, which largely consists of back-office processes that have their origin in paper-based communication, suggest the need for a change. The outlines of a different archival approach inspired by existing web archiving practices will be explored.
This presentation will be an evolution of ideas following his time as a visiting scholar at DANS, in conjunction with Dr. Andrew Treloar (ANDS) (2014-01 & 2014-12). 

Dr. Van de Sompel is an internationally recognized pioneer in the field of digital libraries and web preservation, with his contributions including many of the architectural solutions that define the community, including: OpenURL, SFX, OAI-PMH, OAI-ORE, info URI, bX, djatoka, MESUR, aDORe, Memento, Open Annotation, SharedCanvas, ResourceSync, and Hiberlink.

Also during his time at ODU, he will be reviewing the research projects of PhD students in the Web Science and Digital Libraries group as well as exploring new areas for collaboration with us.  This will be Dr. Van de Sompel's first trip to ODU since 2011 when he and Dr. Sanderson served as the external committee members for Martin Klein's PhD dissertation defense.

2015-08-17 Edit:

We recorded the presentation, but this professionally edited version from summer 2014 is better to watch online:



I've also compiled the slides of the seven students (Corren McCoy, Alexander Nwala, Scott Ainsworth, Lulwah Alkwai, Sawood Alam, Justin Brunelle, and Mat Kelly) who gave presentations on the status of their PhD research:




--Michael

Friday, July 24, 2015

2015-07-24: ICSU World Data System Webinar #6: Web-Centric Solutions for Web-Based Scholarship

Earlier this week Herbert Van de Sompel gave a webinar for the ICSU World Data System entitled "Web-Centric Solutions for Web-Based Scholarship". It's a short and simple review of some of the interoperability projects we've worked on since 1999, including OAI-PMH, OAI-ORE, and Memento. He ends with a short nod to his simple but powerful "Signposting the Scholarly Web" proposal, but the slides in the appendix give the full description.



The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but fully adopting the semantics as part of the protocol.  Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now.  Some highlights include:
  • Although Google existed, it was not the hegemonic force that it is today, and contemporary search engines that did exist (e.g., AltaVista, Lycos) weren't that great (both in terms of precision and recall).  
  • The Deep Web was still a thing -- search engines did not reliably find obscure resources like scholarly resources (cf. our 2006 IEEE IC study "Search Engine Coverage of the OAI-PMH Corpus" and Kat Hagedorn's 2008 follow-up "Google Still Not Indexing Hidden Web URLs").  
  • Related to the above, the focus in digital libraries was on repositories, not the web itself.  Everyone was sitting on an SQL database of "stuff" and HTTP was seen just as a transport in which to export the database contents.  This meant that the gateway script (ca. 1999, it was probably in Perl DBI) between the web and the database was the primary thing, not the database records or the resultant web pages (i.e., the web "resource").  
  • Focus on database scripts resulted in lots of people (not just us in OAI-PMH) tunneling ad-hoc/homemade protocols over HTTP.  In fairness, Roy Fielding's thesis defining REST only came out in 2000, and the W3C Web Architecture document was drafted in 2002 and finalized in 2004.  Yes, I suppose we should have sensed the essence of these documents in the early HTTP RFCs (2616, 2068, 1945) but... we didn't. 
  • The very existence of technologies such as SOAP (ca. 1998) nicely illustrates the prevailing mindset of HTTP as a replaceable transport. 
  • Technologies similar to OAI-PMH, such as RSS, were in flux and limited to 10 items (belying their news syndication origin which made them unsuitable for digital library applications).  
  • Full-text was relatively rare, so the focus was on metadata (see table 3 in the original UPS paper; every digital library description at the time distinguished between "records" and "records with full-text links").  Even if full-text was available, downloading and indexing it was an expensive operation for everyone involved -- bandwidth was limited and storage was expensive in 1999!  Sites like xxx.lanl.gov even threatened retaliation if you downloaded their full-text (today's text on that page is less antagonistic, but I recall the phrase "we fight back!").  Credit to CiteSeer for being an early digital library that was the first to use full-text (DL 1998).
Eventually Google Scholar announced they were deprecating OAI-PMH support, but the truth is they never really supported it in the first place.  It was just simpler to crawl the web, and the early focus on keeping robots out of the digital library had given way to making sure that they got into the digital library (e.g., Sitemaps).

The OAI-ORE and then Memento projects were more web-centric, as Herbert nicely explains in the slides, with OAI-ORE having a Semantic Web spin and Memento being more grounded in the IETF community.   As Herbert says at the beginning of the video, our perspective in 1999 was understandable given the practices at the time, but he goes on to say that he frequently reviews proposals about data management, scholarly communication, data preservation, etc. that continue to treat the web as a transport protocol over which the "real" protocol is deployed.  I would add that despite the proliferation of web APIs that claim to be RESTful, we're seeing a general retreat from REST/HATEOAS principles by the larger web community and not just the academic and scientific community.

In summary, our advice would be to fully embrace HTTP, since it is our community's Fortran and it's not going anywhere anytime soon.

--Michael

Thursday, July 23, 2015

2015-07-22: I Can Haz Memento

Inspired by the "#icanhazpdf" movement and built upon the Memento service, I Can Haz Memento attempts to expand awareness of web archiving through Twitter. Given a URL (for a page) in a tweet with the hashtag "#icanhazmemento", the I Can Haz Memento service replies to the tweet with a link pointing to an archived version of the page closest to the time of the tweet. The consequence is that the archived version closest to the time of the tweet likely expresses the intent of the user at the time the link was shared.
Consider a scenario where Jane shares a link in a tweet to the front page of CNN about a healthcare story. Given the fluid nature of the news cycle, at some point the healthcare story will be replaced by a fresh story; the link in Jane's tweet then no longer represents her intent (the healthcare story). This is where I Can Haz Memento comes into the picture. If Jane included "#icanhazmemento" in her tweet, the service would have replied to her tweet with a link representing:
  • An archived version (closest to her tweet time) of the CNN front-page healthcare story, if the page had already been archived within a given temporal threshold (e.g., 24 hours), or
  • A newly archived version of the same page. In other words, if the page was not already archived, the service does the archiving and returns the link to the newly archived page (a minimal sketch of the underlying datetime negotiation appears after this list).
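The lookup itself is plain Memento datetime negotiation (RFC 7089). Here is a minimal sketch, assuming the Time Travel aggregator's TimeGate URL pattern (any RFC 7089 TimeGate would work); it is an illustration, not the service's actual code.

# Minimal sketch: ask a TimeGate for the memento closest to the tweet's time.
from datetime import datetime
import requests

HTTP_DATE = "%a, %d %b %Y %H:%M:%S GMT"

def closest_memento(url, tweet_time):
    timegate = "http://timetravel.mementoweb.org/timegate/" + url
    headers = {"Accept-Datetime": tweet_time.strftime(HTTP_DATE)}
    resp = requests.get(timegate, headers=headers, allow_redirects=True)
    # The TimeGate redirects to the selected memento; its URI and
    # Memento-Datetime come back on the final response (None if not archived).
    return resp.url, resp.headers.get("Memento-Datetime")

# Example:
# closest_memento("http://www.cs.odu.edu", datetime(2015, 7, 22, 12, 0))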
How to use I Can Haz Memento
Method 1: In order to use the service, include the hashtag "#icanhazmemento" in the tweet along with the link to the page for which you intend to retrieve (or create) an archived version. For example, consider Shawn Jones' tweet below for http://www.cs.odu.edu:
Which prompted the following reply from the service:
Method 2: In Method 1, the hashtag "#icanhazmemento" and the URL, http://www.cs.odu.edu, reside in the same tweet, but Method 2 does not impose this restriction. If someone (@acnwala) tweeted a link (e.g., arsenal.com) without "#icanhazmemento", and you (@wsdlodu) wished the request to be treated in the same manner as Method 1 (as though "#icanhazmemento" and arsenal.com were in the same tweet), all that is required is a reply to the original tweet with a tweet that includes "#icanhazmemento". Consider an example of Method 2 usage:
  1. @acnwala tweets arsenal.com without "#icanhazmemento"
  2. @wsdlodu replies the @acnwala's tweet with "#icanhazmemento"
  3. @icanhazmemento replies @wsdlodu with the archived versions of arsenal.com
The scenario (1, 2 and 3) is outlined by the following tweet threads:
 I Can Haz Memento - Implementation

I Can Haz Memento is implemented in Python and leverages Tweepy, a Python library for the Twitter API. The implementation is captured by the following subroutines:
  1. Retrieve links from tweets with "#icanhazmemento": This is achieved using Tweepy's api.search method. The sinceIDValue is used to keep track of already-visited tweets. The application also sleeps between requests to comply with Twitter's API rate limits, but not before retrieving the URLs from each tweet (a rough sketch of this polling loop appears after this list).
  2. After the URLs in step 1 have been retrieved, the following subroutine:
    • Makes an HTTP request to the TimeGate in order to get the memento (archived instance of the resource) closest to the time of the tweet (the tweet time is passed as a parameter for datetime content negotiation).
    • If the page is not found in any archive, it is pushed to archive.org and archive.is for archiving.
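Here is a rough sketch of the polling loop in step 1, assuming Tweepy 3.x's api.search and Twitter API v1.1 credentials; it is not the service's actual source, and closest_memento is the TimeGate sketch above.

# Rough sketch: poll Twitter for "#icanhazmemento", extract URLs, reply with
# the closest memento, and sleep between polls to respect rate limits.
import re
import time
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

since_id = None  # tracks already-visited tweets (the sinceIDValue above)
while True:
    for tweet in api.search(q="#icanhazmemento", since_id=since_id):
        since_id = max(since_id or 0, tweet.id)
        for url in re.findall(r"https?://\S+", tweet.text):
            memento_uri, memento_dt = closest_memento(url, tweet.created_at)
            if memento_dt is None:
                continue  # not archived; could be pushed to archive.org/archive.is here
            api.update_status(
                "@%s %s" % (tweet.user.screen_name, memento_uri),
                in_reply_to_status_id=tweet.id)
    time.sleep(60)  # comply with Twitter's API rate limits between polls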
The source code for the application is available on GitHub. We acknowledge the effort of Mat Kelly, who wrote the first draft of the application. And we hope you use #icanhazmemento.
--Nwala

Tuesday, July 7, 2015

2015-07-07: WADL 2015 Trip Report


The Workshop on Web Archiving and Digital Libraries (WADL) 2015 was scheduled on the last day of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2015, and it started on time. When I entered the workshop room, I realized we needed a couple more chairs to accommodate all the participants, which was a good problem to have. The session started with brief informal introductions of the individual participants. Without wasting any time, the lightning talk session began.

Gerhard Gossen started the lightning talk session with his presentation on "The iCrawl System for Focused and Integrated Web Archive Crawling". It was a short description of how iCrawl can be used to create archives for current events, targeted primarily at researchers and journalists. The demonstration illustrated how to search the Web and Twitter for trending topics to find good seed URLs, manually add seed URLs and keywords, extract entities, configure basic crawling policies, and finally start or schedule the crawl.

Ian Milligan presented his short talk on "Finding Community in the Ruins of GeoCities: Distantly Reading a Web Archive". He introduced GeoCities and explained why it matters. He illustrated the preliminary exploration of the data, such as extracting images, text, and topics from it. He announced plans for a Web Analytics Hackathon in Canada in 2016 based on Warcbase and is looking for collaborators. He expressed the need for documentation for researchers. To acknowledge the need for context, he said, "In an archive you can find anything you want to prove, need to contextualize to validate the results."



Zhiwu Xie presented a short talk on "Archiving the Relaxed Consistency Web". It focused on the inconsistency problem mainly seen in crawler-based archives. He described the illusion of consistency in distributed social media systems and the role of timezone differences. They found that newer content is more inconsistent; in a simulation, more than 60% of timelines were found inconsistent. They propose proactive redundant crawls and compensatory estimation of archival credibility as potential solutions to the issue.



Martin Lhotak and Tomas Foltyn presented their talk on "The Czech Digital Library - Fedora Commons based solution for aggregation, reuse, dissemination and archiving of digital documents". They introduced three main digitization areas in the Czech Republic - Manuscriptorium (early printed books and manuscripts), Kramerius (modern collections from 1801), and WebArchiv (digital archive of Czech web resources). Their goal is to aggregate all digital library content from the Czech Republic under the Czech Digital Library (CDL).

Todd Suomela presented "Analytics for monitoring usage and users of Archive-It collections". The University of Alberta has been using Archive-It since 2009, where they have 19 different collections, of which 15 are public. Collections are proposed by the public, faculty, or librarians; each proposal then goes to the Collection Development Committee for review. Todd evaluated the user activity (using Google Analytics) and the collection management aspects of the UA Digital Libraries.



After the lightning talks were over, workshop participants took a break and looked at the posters and demonstrations associated with the lightning talks above.


Our colleague Lulwah Alkwai presented her Best Student Paper award-winning full paper, "How Well Are Arabic Websites Archived?", in the main conference track the same day, so we joined her there.



During the lunch break, awards were announced, and our WS-DL Research Group secured both the Best Student Paper and the Best Poster awards. While some people were still enjoying their lunch, Dr. J. Stephen Downie presented the closing keynote on the HathiTrust Digital Library. I learned a lot more about HathiTrust, its collections, how they deal with copyrighted and (not so) open data, and their mantra, "bring computing to the data", for the sake of fair use of copyrighted data. Finally, there were announcements about next year's JCDL conference, which will be held in Newark, NJ from 19 to 23 June 2016. After that we assembled again in the workshop room for the remaining sessions of the WADL.



Robert Comer and Andrea Copeland together presented "Methods for Capture of Social Media Content for Preservation in Memory Organizations". They talked about preserving personal and community heritage. They outlined the issues and challenges in preserving the history of social communities and the problem of preserving social media in general. They are working on a prototype tool called CHIME (Community History in Motion Everyday).




Mohamed Farag presented his talk on "Building and Archiving Event Web Collections: A focused crawler approach". Mohamed described the current approaches to building event collections: 1) manually, which leads to high quality but requires a lot of effort, and 2) via social media, which is quick but may result in potentially low-quality collections. They are looking for a balance between the two approaches by developing an Event Focused Crawler (EFC) that retrieves web pages similar to the curator-selected seed URLs with the help of a topic detection model. They have made an event detection service demo available.



Zhiwu Xie presented "Server-Driven Memento Datetime Negotiation - A UWS Case". He described Uninterruptable Web Service (UWS) architecture which uses Memento to provide continuous service even if a server goes down. Then he proposed an ammendment in the workflow of the Memento protocol for a server-driven content negotiation instead of an agent-driven approcah to improve the efficiency of UWS.



Luis Meneses presented his talk on "Grading Degradation in an Institutionally Managed Repository". He motivated his talk by saying that degradation in a data collection is like a library full of books with missing pages. He illustrated examples from his testbed collection to introduce nine classes of degradation, from least damaged to worst: 1) kind of correct, 2) university/institution pages, 3) directory listings, 4) blank pages, 5) failed redirects, 6) error pages, 7) pages in a different language, 8) domain for sale, and 9) deceiving pages.



The last speaker of the session, Sawood Alam (your author), presented "Profiling Web Archives". I briefly described the Memento Aggregator and the need to profile the long tail of archives to improve the efficiency of the aggregator. I described various profile types and policies, analyzed their cost in terms of space and time, and measured the routing efficiency of each profile. I also discussed the serialization format and scale-related issues such as incremental updates. I took advantage of being the last presenter of the workshop and kept the participants away from their dinner longer than I was supposed to.





Thanks Mat for your efforts in recording various sessions. Thanks Martin for the poster pictures. Thanks to everyone who contributed to the WADL 2015 Group Notes; they were really helpful. Thanks to all the organizers, volunteers, and participants for making it a successful event.

Resources

--
Sawood Alam