2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017

The Web Archiving and Digital Libraries (WADL) Workshop was held after JCDL 2017, from June 22 to June 23, 2017. I live-tweeted both days, and you can follow along with this blog post on Twitter using the hashtag wadl2017 or via the notes/minutes of WADL2017. I also created a Twitter list of the speakers' and presenters' handles; go give them a follow to keep up to date with their exciting work.

Day 1 (June 22)

WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

@mart1nkle1n kicks off a #JCDL2017 attached session by scrbblinging #WADL2017 hashtag on the blackboard. pic.twitter.com/CUN3fPKS4l
— Sawood Alam (@ibnesayeed) June 22, 2017

Keynote

The opening keynote of WADL2017 was National Digital Platform (NDP), Funding Opportunities, and Examples Of Currently Funded Projects by Ashley Sands (IMLS).

@ashley247 with her opening keynote at #wadl2017 @US_IMLS #jcdl2017 #wadl2017 pic.twitter.com/c0w5mZZGNF
— Martin Klein (@mart1nkle1n) June 22, 2017

In the keynote, Sands spoke about the desired values for the national digital platform, the grant categories and funding opportunities IMLS offers for archiving projects, and the submission procedure for grants, along with tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like and how to apply to become a reviewer of the proposals!

.@ashley247 desired values for national digital platform #wadl2017 pic.twitter.com/XFuIcRjc2z
— John Berlin (@johnaberlin) June 22, 2017

.@ashley247 shares funding opportunities for archiving projects @ #WADL2017 #JCDL2017 pic.twitter.com/HbCL4DkZ54
— Jasmine Mulliken (@jasminemulliken) June 22, 2017

Very helpful tips and recs for IMLS grant proposals! Thanks for a super informative prez @ashley247 #WADL2017 #jcdl2017 pic.twitter.com/KOIdYyou86
— Jasmine Mulliken (@jasminemulliken) June 22, 2017

.@ashley247 successful proposal from 2015. "Combining Social Media Storytelling with Web Archives" @WebSciDL + @archiveitorg #wadl2017 pic.twitter.com/t43S6cZFOq
— John Berlin (@johnaberlin) June 22, 2017

.@ashley247 apply to be a reviewer, more voices more diveristy #wadl2017 pic.twitter.com/rrRUXS8yys
— John Berlin (@johnaberlin) June 22, 2017

Lightning Talks

First up in the lightning talks was Ross Spencer from the New Zealand Web Archive on "HTTPreserve: Auditing Document-Based Hyperlinks" (poster).

.@beet_keeper from New Zealand Web Archive presenting @httpreserve #wadl2017 pic.twitter.com/1z3x9VE7Zn
— John Berlin (@johnaberlin) June 22, 2017

Finally get to meet @beet_keeper! He’s presenting on HTTPreserve, repo at https://t.co/uDJKJ2OCQY. #WADL2017 pic.twitter.com/G92rNVc98N
— Ian Milligan (@ianmilligan1) June 22, 2017

Spencer has created a tool, httpreserve, that checks the status of a URL on the live web and whether it has been archived by the Internet Archive. It is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!
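To make the idea concrete, here is a minimal Python sketch of the same two checks: the status of a URL on the live web and whether the Internet Archive holds a capture. It is not HTTPreserve's own code. It assumes the Wayback Machine availability API at archive.org/wayback/available, and the function names are made up for illustration.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability API query string for a URL."""
    return WAYBACK_API + "?" + urlencode({"url": url})


def closest_snapshot(payload):
    """Return the closest archived snapshot URL from an API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def live_status(url, timeout=10):
    """Return the HTTP status of a URL on the live web, or None if unreachable."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # 4xx/5xx responses still carry a status code
        return err.code
    except URLError:  # DNS failure, refused connection, and so on
        return None


def check_archived(url, timeout=10):
    """Ask the availability API whether a capture of the URL exists."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling live_status and check_archived on each hyperlink extracted from a document would give a rough audit of which links are dead and which have an archived copy to fall back on.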

GREAT!! Awesome to know about projects that welcome issues and PRs! #wadl2017
— John Berlin (@johnaberlin) June 23, 2017

The second talk was Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal is a tool that seeks to provide access for researchers to browse and search through custom collections and provides tools for analyzing these collections via Warcbase.

WARC-Portal: A Tool for Exploring the Past #wadl2017 pic.twitter.com/2RsUADChhO
— John Berlin (@johnaberlin) June 22, 2017

The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation, you can read our blog post or the full technical report.

Impact of URI Canonicalization on Memento Count from Mat Kelly

The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about Nearline vs. Transactional Web Archiving and the advantages of using a Redis cache.

Paper Sessions

First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps on "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".

.@ianmilligan1 @ruebot @RyanDeschamps tag team presentation "The WALK Project"#wadl2017 pic.twitter.com/9LkibDm73F
— John Berlin (@johnaberlin) June 22, 2017

The WALK project seeks to address the issue of "To use Canadian web archives you have to really want to use them, that is you need to be an expert" by "Bringing Canadian web archives into a centralised portal with access to derivative datasets".

And now a joint prez on Building a Nat'l Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Proj. #WADL2017 pic.twitter.com/Ya5HZOTt4A
— Jasmine Mulliken (@jasminemulliken) June 22, 2017

Enter WALK: 61 collections, 16 TB of WARC files, developed new Solr front end based on Project Blacklight (currently indexed 250 million records). The WALK workflow consists of using Warcbase and a handful of other command line tools to retrieve data from the Internet Archive, generate scholarly derivatives (visualizations, etc) automatically, upload those derivatives to Dataverse and ensure the derivatives are available to the research team.

.@RyanDeschamps on the WALK workflow, including description of collection with network graphs #WADL2017 pic.twitter.com/cgkRhtCzLV
— Emily Maemura (@emilymaemura) June 22, 2017

To ensure that WALK can scale, the project will build on top of Blacklight and contribute the result back to the community as WARCLight.

.@ruebot shows (unsurprisingly) the cutest slide of #WADL2017 pic.twitter.com/g0Wob4Bwz1
— Jasmine Mulliken (@jasminemulliken) June 22, 2017

The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker." Alam spoke about how a ServiceWorker can dynamically reroute URIs that were missed during rewriting, or not rewritten at all due to the dynamic nature of the web, so that they hit the archive rather than the live web.
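A real ServiceWorker is JavaScript that intercepts fetch events in the browser. The sketch below only models the routing decision it has to make, written in Python for illustration. The archive prefix, the 14-digit datetime, and the function name are assumptions, not details from the paper.

```python
def reroute(request_url, archive_prefix, datetime14):
    """Decide where a request made during replay should be sent.

    Requests that already point into the archive pass through unchanged.
    Anything else (a URI the rewriter missed) is sent to the archived
    capture instead of leaking to the live web.
    """
    if request_url.startswith(archive_prefix):
        return request_url  # already rewritten, nothing to do
    return f"{archive_prefix}{datetime14}/{request_url}"


# Hypothetical replay context: archive endpoint and capture datetime.
ARCHIVE_PREFIX = "https://archive.example.org/web/"
CAPTURE_DATETIME = "20170622140000"
```

A script on an archived page that requests http://example.com/app.js at runtime would then be served from the archive's capture rather than from the live site.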

Avoiding Zombies in Archival Replay Using ServiceWorker from Sawood Alam

Ian Milligan was up next presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions Milligan raised during his talk was how to train a classifier when there is no annotated data to train it on. To address this, Milligan used bootstrapping, starting from a bag of words and keyword matching. He noted that this method works with noisy but reasonable data. The classifiers were trained to look for biases in the administrations; Trump vs. Obama seems to work, with dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper, see Milligan's blog post about it.
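As a rough illustration of bootstrapping labels from keyword matching over a bag of words, here is a minimal Python sketch. The topics, keyword lists, and function names are invented for the example and are not taken from the paper. The weak labels it produces would serve as noisy training data for a real classifier.

```python
import re

# Hypothetical seed keywords per topic; a real study would curate these.
KEYWORDS = {
    "climate": {"climate", "emissions", "carbon"},
    "healthcare": {"insurance", "health", "medicare"},
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens (bag of words)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def weak_label(text, keywords_by_label):
    """Assign the label whose keywords match most often, or None.

    Returns None when nothing matches or when the top two labels tie,
    so ambiguous documents stay out of the bootstrapped training set.
    """
    tokens = tokenize(text)
    scores = {
        label: len(tokens & keywords)
        for label, keywords in keywords_by_label.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)
```

Abstaining on ties keeps the seed set cleaner at the cost of coverage, which fits the "noisy but reasonable data" caveat above.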

Slides for our #wadl2017 #jcdl2017 talk: “Comparing Topic Shifts Between Two US Presidential Administrations.” https://t.co/rrDsb0HJQu pic.twitter.com/o4OiqZtLzI
— Ian Milligan (@ianmilligan1) June 22, 2017

Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation on "Web Archives: A preliminary exploration of user expectations vs. reality". Ayala spoke about analyzing Archive-It support tickets, exported as XML and then cleaned and anonymized, using qualitative coding and grounded theory. Ayala presented four user expectations, reflecting users' mental models when working with archives, along with the reality behind each.

If the original website had X documents, it follows that the archived website also has X documents.

Reality: an archived website was often much larger or smaller than the user had expected.

A web archive only includes content that is closely related to the topic.

Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw the presence of this content as being of little relevance and superfluous.

Content that looks irrelevant is actually irrelevant.

Reality: A website contains pages or elements that are not obviously important but help “behind the scenes” to make other elements or pages render correctly or function properly.

This is knowledge that is known by the partner specialist, but usually unknown or invisible to the user or creator of an archive. Partner specialists often had to explain the true nature of this seemingly irrelevant content.

Domains and sub-domains are the same thing, and they do not affect the capture of a website.

Reality: These differences usually affect how a website is captured.

2017-08-25 edit: Slides accompanying Ayala's talk made available. Web archives: A preliminary exploration of user expectations vs. reality hosted by The Portal to Texas History

Day 2 (June 23)

Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event took place in December in Toronto to preserve the EPA data from the Obama administration during the Trump transition. The event had roughly two hundred participants and resulted in hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley were hosted or co-hosted, with thirty-one more planned in cities across the country.

Matt Price talks about very important EDGI project--keeping track of environmental data online: https://t.co/21JcMwsfBm #WADL2017 #JCDL2017
— Jasmine Mulliken (@jasminemulliken) June 23, 2017

After the panel was Tom J. Smyth on "Legal Deposit, Collection Development, Preservation, and Web Archiving at Library and Archives Canada". Smyth spoke about how to start building a collection for a budding web archive that does not have the scale of an established one, and about the pieces such an archive has in place:

Web Archival Scoping Documents

What priority
What type
What are we trying to document
What degree are we trying to document

Controlled Collection Metadata, Controlled vocabulary

Evolves over time with the collection topic

Quality Control Framework

Essential for setting a cut-off point for quality control

Selected Web Resources must pass four checkpoints

Is the resource in-scope of the collection and theme
(when in doubt consult the Scoping Document)
Heritage value: is the content unique or available in other formats
(in what contexts can it be used)
Technology / Preservation
Quality Control

.@smythbound gives @RyanDeschamps a #wadl2017 shoutout for his "topic Jeopardy," when thinking about curating collections. pic.twitter.com/k26LhCdObo
— Ian Milligan (@ianmilligan1) June 23, 2017

@smythbound you won the cuteness round of #Zombies and #Unicorns against @ibnesayeed at #WADL2017 #JCDL2017 pic.twitter.com/xqE9yR22aL
— Sawood Alam (@ibnesayeed) June 23, 2017

The next paper presenters up were Muhammad Umar Qasim and Sam-Chin Li for "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI - DPN)". The Canadian Government Information Digital Preservation Network (CGI - DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup server in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals, the project is using Archive-It for the web crawls and collection building, then using LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).

Sam-chin Li and Muhammed Umar Qasim "Working Together Towards A Shared Vision" #wadl2017 pic.twitter.com/oLN8o4ftdb
— John Berlin (@johnaberlin) June 23, 2017

Nick Ruest was up next speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and handling it can be difficult. Ruest also spoke about how to handle big Twitter data in a sane manner by using tools such as Hydrator or Twarc from the DocNow project.

.@ruebot Strategies for handling large news worthy tweet collections @documentnow #wadl2017 pic.twitter.com/Mu0j9Lzhtw
— John Berlin (@johnaberlin) June 23, 2017

Here are my #WADL2017 slides if you want follow along at home.https://t.co/PZtJzhxhMH

❤️ @documentnow
— nick ruest (@ruebot) June 23, 2017

The final paper presentation of the day was Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on a Word2Vec-based vector representation of tweets and how it can be used to label unlabeled examples. The talk also covered how training a classifier using Augmented Training improves classification efficacy, and how a Word2Vec-based representation generated from a richer corpus like Google News provides better improvements with augmented training.
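For readers unfamiliar with the idea, here is a minimal Python sketch of labeling an unlabeled tweet by cosine similarity to labeled examples. It makes several assumptions that are not from the paper: a tweet vector is the mean of its word vectors, the two-dimensional embeddings are toy values standing in for real Word2Vec vectors, and the 0.8 threshold is arbitrary.

```python
import math

# Toy 2-D "embeddings" standing in for real Word2Vec vectors (assumption).
EMBEDDINGS = {
    "storm": [1.0, 0.0],
    "flood": [0.9, 0.1],
    "goal": [0.0, 1.0],
    "match": [0.1, 0.9],
}

# Hypothetical labeled seed examples: (tokens, label).
LABELED = [(["storm"], "weather"), (["goal"], "sports")]


def mean_vector(tokens, embeddings):
    """Average the vectors of known tokens; None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_by_similarity(tokens, labeled, embeddings, threshold=0.8):
    """Return the label of the most similar labeled example, or None.

    Only examples with similarity at or above the threshold count, so
    tweets that resemble nothing in the seed set stay unlabeled.
    """
    vec = mean_vector(tokens, embeddings)
    if vec is None:
        return None
    best_label, best_score = None, threshold
    for example_tokens, label in labeled:
        example_vec = mean_vector(example_tokens, embeddings)
        if example_vec is None:
            continue
        score = cosine(vec, example_vec)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label
```

Tweets labeled this way could then be added to the training set, which is the general shape of augmenting training data with automatically labeled examples.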

"Classification Of Tweets Using Augmented Training" Datasets #wadl2017 pic.twitter.com/pLKywUe3YP
— John Berlin (@johnaberlin) June 23, 2017

This is a great project, given how hard it is to classify short tweets – they train a dataset, then use on auxiliary tweets. #wadl2017 pic.twitter.com/wFPDFhH31o
— Ian Milligan (@ianmilligan1) June 23, 2017

Closing Round Table On WADL

The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. There were a lot of great ideas and suggestions made as the round table progressed, with the participants becoming most excited about the following:

WADL 2018 (naturally of course)
Seeking out additional collaboration and information sharing with those who are actively looking for web archiving but are unaware of / did not meet up for WADL
Looking into bringing proceedings to WADL, perhaps even a journal
Extending the length of WADL to a full two or three day event
Integration of remote site participation for those who wish to attend but cannot due to geographical location or travel expenses

Until the Joint Conference on Digital Libraries 2018, June 3 - 7 in Fort Worth, Texas, USA
- John Berlin

Web Science and Digital Libraries Research Group