Wednesday, July 5, 2017

2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017


The Web Archiving and Digital Libraries (WADL) Workshop was held after JCDL 2017, from June 22, 2017, to June 23, 2017. I live-tweeted both days, and you can follow along on Twitter with this blog post using the hashtag wadl2017 or via the notes/minutes of WADL2017. I also created a Twitter list of the speakers' and presenters' Twitter handles. Go give them a follow to keep up to date with their exciting work.

Day 1 (June 22)

WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

Keynote

The opening keynote of WADL2017 was National Digital Platform (NDP), Funding Opportunities, and Examples Of Currently Funded Projects by Ashley Sands (IMLS).
In the keynote, Sands spoke about the desired values for the national digital platform, the grant categories and funding opportunities IMLS offers for archiving projects, and the submission procedure for grants, along with tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like, and how to apply to become a reviewer of the proposals!

Lightning Talks

First up in the lightning talks was Ross Spenser from the New Zealand Web Archive on "HTTPreserve: Auditing Document-Based Hyperlinks" (poster)

Spenser has created a tool, httpreserve, that checks the status of a URL on the live web and whether it has been archived by the Internet Archive. It is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!
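To make the idea concrete, here is a minimal Python sketch of that kind of audit. It is not httpreserve's code: it assumes the Internet Archive's public Wayback Machine availability endpoint, and the function names (closest_snapshot, live_status, audit) are made up for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Public Wayback Machine availability endpoint (assumed here; httpreserve may work differently).
AVAILABILITY_API = "https://archive.org/wayback/available"


def closest_snapshot(api_response):
    """Return (snapshot_url, timestamp) from an availability response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def live_status(url, timeout=10):
    """Return the final HTTP status of a URL on the live web, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects, so this is the status of the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still carry a status code
    except urllib.error.URLError:
        return None  # DNS failure, refused connection, timeout, ...


def audit(url):
    """Check one hyperlink against the live web and the Internet Archive."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}", timeout=10) as response:
        payload = json.load(response)
    return {"url": url, "live_status": live_status(url), "snapshot": closest_snapshot(payload)}
```

Calling audit on each hyperlink extracted from a document would give a simple report of which links are dead on the live web but still have an archived copy.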
The second talk was Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal is a tool that seeks to provide access for researchers to browse and search through custom collections and provides tools for analyzing these collections via Warcbase.
The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation of this you can read our blog post or the full technical report.
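As a rough illustration of why canonicalization matters for memento counts, the Python sketch below collapses a few common URI variants and then tallies captures per canonical URI by status class. The canonicalization rules and the function names are simplified assumptions for this post, not the rules any particular archive uses and not the method from the technical report.

```python
from collections import Counter
from urllib.parse import urlsplit


def canonicalize(uri):
    """Collapse common URI variants: scheme, port, 'www.' prefix, host case, trailing slash."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()  # hostname drops any port
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def memento_breakdown(captures):
    """Count captures per canonical URI, split into representations (2xx) and redirects (3xx)."""
    breakdown = {}
    for uri, status in captures:
        counts = breakdown.setdefault(canonicalize(uri), Counter())
        if 200 <= status < 300:
            counts["representation"] += 1
        elif 300 <= status < 400:
            counts["redirect"] += 1
        else:
            counts["other"] += 1
    return breakdown
```

With these rules, captures of http://example.com/ and https://www.example.com are counted under the same canonical URI, so a raw memento count for that URI can include captures that only redirect to one another.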

The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about Nearline vs. Transactional Web Archiving and the advantages of using a Redis cache.

Paper Sessions

First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps on "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".
The WALK project seeks to address the issue of "To use Canadian web archives you have to really want to use them, that is you need to be an expert" by "Bringing Canadian web archives into a centralised portal with access to derivative datasets".
Enter WALK: 61 collections, 16 TB of WARC files, and a new Solr front end based on Project Blacklight (currently 250 million records indexed). The WALK workflow consists of using Warcbase and a handful of other command line tools to retrieve data from the Internet Archive, generate scholarly derivatives (visualizations, etc.) automatically, upload those derivatives to Dataverse, and ensure the derivatives are available to the research team.
To ensure that WALK can scale, the WALK project will be building on top of Blacklight and contributing it back to the community as WARCLight.
The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker." Alam explained that URIs that were missed during rewriting, or could not be rewritten at all due to the dynamic nature of the web, can be rerouted dynamically by a ServiceWorker so that they hit the archive rather than the live web.
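The core rerouting decision can be sketched in a few lines. A real ServiceWorker is written in JavaScript and makes this decision inside its fetch event handler; the Python below only illustrates the logic, and the replay prefix, the URL layout, and the function name are hypothetical.

```python
# Hypothetical replay endpoint; real archives use their own prefix and URL layout.
REPLAY_PREFIX = "https://archive.example.org/web/"


def reroute(request_url, memento_datetime, replay_prefix=REPLAY_PREFIX):
    """Return the URL a request should be sent to during archival replay."""
    if request_url.startswith(replay_prefix):
        # Already rewritten to point at the archive: leave it alone.
        return request_url
    # Missed by rewriting: this would leak to the live web (a "zombie"),
    # so send it to the archive at the datetime of the page being replayed.
    return f"{replay_prefix}{memento_datetime}/{request_url}"
```

For example, reroute("http://example.com/app.js", "20170622120000") returns the archived form of the script URL, while a URL that already starts with the replay prefix passes through unchanged.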

Ian Milligan was up next presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions that Milligan noted during his talk was how to go about training a classifier when there is no annotated data to train it on. To address this issue, Milligan used bootstrapping, starting off with bag of words and keyword matching. He noted that this method works with noisy but reasonable data. The classifiers were trained to look for biases in the administrations. Trump vs. Obama seems to work, with dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper, see Milligan's blog post about it.
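A toy version of that bootstrapping idea looks like the following Python sketch: keyword matching produces noisy labels, and a bag-of-words classifier is then trained on them. The seed keywords and class names are invented for illustration, and Naive Bayes is my own choice of classifier here; the talk did not specify one in my notes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed keywords per class; a real study would choose these carefully.
SEED_KEYWORDS = {
    "environment": {"climate", "emissions", "pollution"},
    "economy": {"jobs", "tax", "trade"},
}


def tokenize(text):
    return text.lower().split()


def weak_label(text):
    """Label a document by seed keyword hits; return None on no hits or a tie."""
    tokens = set(tokenize(text))
    hits = {label: len(tokens & words) for label, words in SEED_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    tied = sum(1 for count in hits.values() if count == hits[best]) > 1
    return None if hits[best] == 0 or tied else best


def train_naive_bayes(documents):
    """Train a bag-of-words multinomial Naive Bayes model on the weakly labeled documents."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text in documents:
        label = weak_label(text)
        if label is not None:  # documents without a confident weak label are skipped
            label_counts[label] += 1
            word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the most likely label, using log probabilities with Laplace smoothing."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)
```

The point of the second stage is that the trained model can label text containing none of the seed keywords, because it has picked up the other words that co-occur with them.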
Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation on "Web Archives: A preliminary exploration vs reality". Ayala described analyzing Archive-It support tickets, which were exported as XML, cleaned, and anonymized, then analyzed using qualitative coding and grounded theory. She presented the following user expectations, which reflect users' mental models when working with archives, alongside the reality for each:
Expectation: If the original website had X documents, the archived website also has X documents.
Reality: An archived website was often much larger or smaller than the user had expected.
Expectation: A web archive only includes content that is closely related to the topic.
Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw this content as being of little relevance and superfluous.
Expectation: Content that looks irrelevant is actually irrelevant.
Reality: A website contains pages or elements that are not obviously important but help “behind the scenes” to make other elements or pages render correctly or function properly. This is knowledge that is known by the partner specialist, but usually unknown or invisible to the user or creator of an archive. Partner specialists often had to explain the true nature of this seemingly irrelevant content.
Expectation: Domains and sub-domains are the same thing, and they do not affect the capture of a website.
Reality: These differences usually affect how a website is captured.

Day 2 (June 23)

Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event took place in December in Toronto to preserve the EPA data from the Obama administration during the Trump transition. The event had roughly two hundred participants and resulted in hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events have been hosted or co-hosted in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley, with thirty-one more planned in cities across the country.
After the panel was Tom J. Smyth on "Legal Deposit, Collection Development, Preservation, and Web Archiving at Library and Archives Canada". Smyth spoke about how to start building a collection for a budding web archive that does not have the scale of an established one, and outlined the following:
Web Archival Scoping Documents
  • What priority
  • What type
  • What are we trying to document
  • What degree are we trying to document
Controlled Collection Metadata, Controlled vocabulary
  • Evolves over time with the collection topic
Quality Control Framework
  • Essential for setting a cut-off point for quality control
Selected Web Resources must pass four checkpoints
  • Is the resource in-scope of the collection and theme
    (when in doubt consult the Scoping Document)
  • Heritage Value: is the content unique, or is it available in other formats
    (in what contexts can it be used)
  • Technology / Preservation
  • Quality Control

The next paper presenters up were Muhammad Umar Qasim and Sam-Chin Li for "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI - DPN)". The Canadian Government Information Digital Preservation Network (CGI - DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup server in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals, the project is using Archive-It for the web crawls and collection building, then using LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).
Nick Ruest was up next speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and how handling it can be difficult. Ruest also spoke about how to handle big Twitter data in a sane manner by using tools such as Hydrator or Twarc from the DocNow project.


The final paper presentation of the day was by Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on a Word2Vec-based vector representation of tweets and how it can be used to label unlabeled examples. He also showed that training a classifier using Augmented Training does improve classification efficacy, and that a Word2Vec-based representation generated from a richer corpus like Google News yields better improvements with augmented training.
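Here is a small Python sketch of the labeling step as I understood it: represent each tweet as a vector, then give an unlabeled tweet the label of its most similar labeled tweet when the cosine similarity is high enough. The toy three-dimensional word vectors, the averaging of word vectors into a tweet vector, the 0.9 threshold, and the function names are all assumptions for illustration, not details from the paper.

```python
import math

# Toy 3-dimensional word vectors standing in for a real Word2Vec model (made-up values).
WORD_VECTORS = {
    "flood": [0.9, 0.1, 0.0],
    "storm": [0.8, 0.2, 0.1],
    "rain": [0.7, 0.3, 0.0],
    "game": [0.0, 0.9, 0.2],
    "score": [0.1, 0.8, 0.3],
    "team": [0.0, 0.7, 0.4],
}


def tweet_vector(text):
    """Average the vectors of known words; return None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def augment(labeled, unlabeled, threshold=0.9):
    """Add unlabeled tweets to the training set using their nearest labeled tweet's label."""
    augmented = list(labeled)
    for text in unlabeled:
        vector = tweet_vector(text)
        if vector is None:
            continue  # no known words, so no basis for a label
        best_label, best_score = None, threshold
        for labeled_text, label in labeled:
            labeled_vector = tweet_vector(labeled_text)
            if labeled_vector is None:
                continue
            score = cosine_similarity(vector, labeled_vector)
            if score >= best_score:
                best_label, best_score = label, score
        if best_label is not None:
            augmented.append((text, best_label))
    return augmented
```

The augmented list can then be used to train any classifier. With real Word2Vec embeddings, the quality of the added labels depends heavily on the corpus the embeddings were trained on, which matches the point made in the talk about Google News.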

Closing Round Table On WADL

The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. There were a lot of great ideas and suggestions made as the round table progressed, with the participants becoming most excited about the following:
  1. WADL 2018 (naturally of course)
  2. Seeking out additional collaboration and information sharing with those who are actively looking for web archiving but are unaware of / did not meet up for WADL
  3. Looking into bringing proceedings to WADL, perhaps even a journal
  4. Extending the length of WADL to a full two or three day event
  5. Integration of remote site participation for those who wish to attend but cannot due to geographical location or travel expenses
Until the Joint Conference on Digital Libraries 2018, June 3 - 7 in Fort Worth, Texas, USA
- John Berlin
