2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017
Day 1 (June 22)
WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

@mart1nkle1n kicks off a #JCDL2017 attached session by scribbling the #WADL2017 hashtag on the blackboard. pic.twitter.com/CUN3fPKS4l— Sawood Alam (@ibnesayeed) June 22, 2017
Keynote
The opening keynote of WADL2017 was "National Digital Platform (NDP), Funding Opportunities, and Examples of Currently Funded Projects" by Ashley Sands (IMLS).
In the keynote Sands spoke about the desired values for the national digital platform, the various grant categories and funding opportunities IMLS offers for archiving projects, and the submission procedure for grants, as well as tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like and how to apply to become a reviewer of proposals!

@ashley247 with her opening keynote at #wadl2017 @US_IMLS #jcdl2017 #wadl2017 pic.twitter.com/c0w5mZZGNF— Martin Klein (@mart1nkle1n) June 22, 2017
.@ashley247 shares funding opportunities for archiving projects @ #WADL2017 #JCDL2017 pic.twitter.com/HbCL4DkZ54— Jasmine Mulliken (@jasminemulliken) June 22, 2017
Very helpful tips and recs for IMLS grant proposals! Thanks for a super informative prez @ashley247 #WADL2017 #jcdl2017 pic.twitter.com/KOIdYyou86— Jasmine Mulliken (@jasminemulliken) June 22, 2017
.@ashley247 successful proposal from 2015. "Combining Social Media Storytelling with Web Archives" @WebSciDL + @archiveitorg #wadl2017 pic.twitter.com/t43S6cZFOq— John Berlin (@johnaberlin) June 22, 2017
Lightning Talks
First up in the lightning talks was Ross Spencer from the New Zealand Web Archive on "HTTPreserve: Auditing Document-Based Hyperlinks" (poster).
.@beet_keeper from New Zealand Web Archive presenting @httpreserve #wadl2017 pic.twitter.com/1z3x9VE7Zn— John Berlin (@johnaberlin) June 22, 2017
Spencer has created a tool, httpreserve, that checks the status of a URL on the live web and whether it has been archived by the Internet Archive; it is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!

Finally get to meet @beet_keeper! He’s presenting on HTTPreserve, repo at https://t.co/uDJKJ2OCQY. #WADL2017 pic.twitter.com/G92rNVc98N— Ian Milligan (@ianmilligan1) June 22, 2017
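httpreserve itself is a Go project, so the snippet below is only a rough Python sketch of the underlying idea rather than how the tool is actually implemented: check whether a link still resolves on the live web, then ask the Internet Archive's public Wayback availability API whether a capture exists.

```python
import requests

def check_link(url):
    """Report whether a URL resolves on the live web and whether the
    Internet Archive holds a capture of it. A sketch of the idea behind
    httpreserve, not the tool's actual implementation."""
    live_status = None
    try:
        live_status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        pass  # dead link, DNS failure, timeout, etc.

    # The Wayback Machine availability API returns the closest capture, if any.
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=10).json()
    snapshot = resp.get("archived_snapshots", {}).get("closest")

    return {
        "url": url,
        "live_status": live_status,
        "archived": bool(snapshot),
        "memento": snapshot["url"] if snapshot else None,
    }

if __name__ == "__main__":
    print(check_link("http://example.com/"))
```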
The second talk was by Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal seeks to give researchers access to browse and search through custom collections and provides tools for analyzing those collections via Warcbase.

GREAT!! Awesome to know about projects that welcome issues and PRs! #wadl2017— John Berlin (@johnaberlin) June 23, 2017
The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation you can read our blog post or the full technical report.

WARC-Portal: A Tool for Exploring the Past #wadl2017 pic.twitter.com/2RsUADChhO— John Berlin (@johnaberlin) June 22, 2017
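As a rough illustration of that measurement (not the scripts behind the technical report), one could dereference each memento URI without following redirects and tally the response classes. The function below assumes you already have a plain list of memento URIs, e.g. extracted from a TimeMap.

```python
import requests

def classify_mementos(memento_uris):
    """Dereference each memento URI without following redirects and tally
    whether the archive answers with a representation (2xx) or a redirect
    (3xx). A simplified sketch of the measurement described in the talk."""
    counts = {"representation": 0, "redirect": 0, "other": 0}
    for uri in memento_uris:
        try:
            status = requests.head(uri, allow_redirects=False, timeout=10).status_code
        except requests.RequestException:
            counts["other"] += 1
            continue
        if 200 <= status < 300:
            counts["representation"] += 1
        elif 300 <= status < 400:
            counts["redirect"] += 1
        else:
            counts["other"] += 1
    counts["redirect_ratio"] = counts["redirect"] / max(1, len(memento_uris))
    return counts
```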
The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about nearline vs. transactional web archiving and the advantages of using a Redis cache.
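Fox described the approach at a high level; the sketch below is only an assumed illustration of the nearline idea, with hypothetical key names and a placeholder for the downstream WARC-writing step. The serving path drops rendered pages into Redis, and a separate process drains them later, so archiving adds no latency to the live request.

```python
import time
import redis  # redis-py client; assumes a local Redis instance

r = redis.Redis(host="localhost", port=6379)

def cache_page(uri, html):
    """Called from the serving path: stash the page the server just rendered
    so an archiver can pick it up later (the 'nearline' part of the idea)."""
    key = f"page:{uri}:{int(time.time())}"
    r.set(key, html)
    r.lpush("archive-queue", key)

def drain_to_archive():
    """Run out-of-band: pop cached pages and hand them to whatever writes
    WARC records, without slowing down the live request path."""
    while True:
        key = r.rpop("archive-queue")
        if key is None:
            break
        html = r.get(key)
        # write_warc_record(key, html)  # hypothetical downstream WARC writer
        r.delete(key)
```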
Paper Sessions
First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps on "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".
The WALK project seeks to address the issue of "To use Canadian web archives you have to really want to use them, that is you need to be an expert" by "Bringing Canadian web archives into a centralised portal with access to derivative datasets".

.@ianmilligan1 @ruebot @RyanDeschamps tag team presentation "The WALK Project"#wadl2017 pic.twitter.com/9LkibDm73F— John Berlin (@johnaberlin) June 22, 2017
Enter WALK: 61 collections, 16 TB of WARC files, and a new Solr front end based on Project Blacklight (with 250 million records currently indexed). The WALK workflow consists of using Warcbase and a handful of other command line tools to retrieve data from the Internet Archive, generate scholarly derivatives (visualizations, etc.) automatically, upload those derivatives to Dataverse, and ensure the derivatives are available to the research team.

And now a joint prez on Building a Nat'l Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Proj. #WADL2017 pic.twitter.com/Ya5HZOTt4A— Jasmine Mulliken (@jasminemulliken) June 22, 2017
To ensure that WALK can scale, the project will build on top of Blacklight and contribute the work back to the community as WARCLight.

.@RyanDeschamps on the WALK workflow, including description of collection with network graphs #WADL2017 pic.twitter.com/cgkRhtCzLV— Emily Maemura (@emilymaemura) June 22, 2017
The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker." Alam spoke about how, through the use of a ServiceWorker, URIs that were missed during rewriting or not rewritten at all due to the dynamic nature of the web can be rerouted dynamically to hit the archive rather than the live web.
Avoiding Zombies in Archival Replay Using ServiceWorker from Sawood Alam
Ian Milligan was up next presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions Milligan raised during his talk was how to train a classifier when there is no annotated data to train it on. To address this, Milligan used bootstrapping, starting with a bag-of-words representation and keyword matching. He noted that this method works with noisy but reasonable data. The classifiers were trained to look for biases between administrations; comparing Trump vs. Obama produced dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper see Milligan's blog post about it.
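To make the bootstrapping step concrete, here is a minimal sketch (the labels and keyword lists are invented for illustration, not taken from the paper): keyword matching produces weak seed labels, and a bag-of-words classifier is then trained on that seed set so it can label the documents the keywords missed.

```python
# Sketch of the bootstrapping idea: seed labels via keyword matching,
# then train a bag-of-words classifier on the weakly labelled documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative seed keywords, not the ones used in the paper.
SEED_KEYWORDS = {
    "environment": ["climate", "epa", "emissions"],
    "healthcare": ["aca", "medicaid", "insurance"],
}

def weak_label(text):
    text = text.lower()
    for label, words in SEED_KEYWORDS.items():
        if any(w in text for w in words):
            return label
    return None  # leave ambiguous documents out of the seed set

def bootstrap_classifier(documents):
    seed = [(d, weak_label(d)) for d in documents]
    seed = [(d, y) for d, y in seed if y is not None]
    texts, labels = zip(*seed)
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf  # can now label the documents that keyword matching missed
```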
Slides for our #wadl2017 #jcdl2017 talk: “Comparing Topic Shifts Between Two US Presidential Administrations.” https://t.co/rrDsb0HJQu pic.twitter.com/o4OiqZtLzI— Ian Milligan (@ianmilligan1) June 22, 2017

Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation on "Web Archives: A Preliminary Exploration of User Expectations vs. Reality". Ayala looked at Archive-It support tickets, exported as XML and then cleaned and anonymized, and analyzed them using qualitative coding and grounded theory. She presented the expectations users bring to web archives, based on their mental models, contrasted with reality:
- Expectation: The original website had X number of documents, so it would follow that the archived website also has X number of documents.
  Reality: An archived website was often much larger or smaller than the user had expected.
- Expectation: A web archive only includes content that is closely related to the topic.
  Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw the presence of this content as being of little relevance and superfluous.
- Expectation: Content that looks irrelevant is actually irrelevant.
  Reality: A website contains pages or elements that are not obviously important but help “behind the scenes” to make other elements or pages render correctly or function properly. This is knowledge that is known by the partner specialist, but usually unknown or invisible to the user or creator of an archive. Partner specialists often had to explain the true nature of this seemingly irrelevant content.
- Expectation: Domains and sub-domains are the same thing, and they do not affect the capture of a website.
  Reality: These differences usually affect how a website is captured.
2017-08-25 edit: Slides accompanying Ayala's talk made available.
Web archives: A preliminary exploration of user expectations vs. reality, from Brenda Reyes Ayala, hosted by The Portal to Texas History
Day 2 (June 23)
Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event took place in December in Toronto to preserve EPA data from the Obama administration during the Trump transition. The event drew roughly two hundred participants and resulted in hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley have been hosted or co-hosted, with thirty-one more planned in cities across the country.

Matt Price talks about very important EDGI project--keeping track of environmental data online: https://t.co/21JcMwsfBm #WADL2017 #JCDL2017— Jasmine Mulliken (@jasminemulliken) June 23, 2017

After the panel was Tom J. Smyth on "Legal Deposit, Collection Development, Preservation, and Web Archiving at Library and Archives Canada". Smyth spoke about how to start building a collection for a budding web archive that does not yet have the scale of an established one, and shared the following notes on web archival scoping documents.
Web Archival Scoping Documents
- What priority
- What type
- What are we trying to document
- What degree are we trying to document
- Evolves over time with the collection topic
- Essential for setting a cut-off point for quality control
- Is the resource in-scope of the collection and theme (when in doubt consult the Scoping Document)
- Heritage Value: is the content unique, is it available in other formats, and in what contexts can it be used
- Technology / Preservation
- Quality Control
.@smythbound gives @RyanDeschamps a #wadl2017 shoutout for his "topic Jeopardy," when thinking about curating collections. pic.twitter.com/k26LhCdObo— Ian Milligan (@ianmilligan1) June 23, 2017
The next paper presenters were Muhammad Umar Qasim and Sam-Chin Li with "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI-DPN)". The Canadian Government Information Digital Preservation Network (CGI-DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals the project is using Archive-It for the web crawls and collection building, and LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).

@smythbound you won the cuteness round of #Zombies and #Unicorns against @ibnesayeed at #WADL2017 #JCDL2017 pic.twitter.com/xqE9yR22aL— Sawood Alam (@ibnesayeed) June 23, 2017
Nick Ruest was up next speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and handling it can be difficult, and about how to handle that big Twitter data in a sane manner using tools such as Hydrator or twarc from the DocNow project.

Sam-chin Li and Muhammed Umar Qasim "Working Together Towards A Shared Vision" #wadl2017 pic.twitter.com/oLN8o4ftdb— John Berlin (@johnaberlin) June 23, 2017
.@ruebot Strategies for handling large news worthy tweet collections @documentnow #wadl2017 pic.twitter.com/Mu0j9Lzhtw— John Berlin (@johnaberlin) June 23, 2017
Here are my #WADL2017 slides if you want follow along at home.https://t.co/PZtJzhxhMH— nick ruest (@ruebot) June 23, 2017
❤️ @documentnow
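For those who want to follow along at home, a minimal sketch of collecting and later rehydrating tweets with twarc's Python API might look like the following; the credentials, query, and filenames are placeholders, and Hydrator offers the same ID-based rehydration through a desktop GUI.

```python
import json
from twarc import Twarc  # DocNow's twarc library (v1-style API)

# Placeholder credentials -- supply your own Twitter API keys.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

t = Twarc(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Collect tweets for a newsworthy event via the search API, keeping the
# full JSON so the dataset can later be shared as a list of tweet IDs.
with open("event_tweets.jsonl", "w") as out:
    for tweet in t.search("#SomeNewsworthyEvent"):
        out.write(json.dumps(tweet) + "\n")

# Anyone with the ID list can rebuild the dataset ("hydration"), which is
# what the Hydrator desktop tool does behind its GUI.
with open("tweet_ids.txt") as ids, open("rehydrated.jsonl", "w") as out:
    for tweet in t.hydrate(line.strip() for line in ids):
        out.write(json.dumps(tweet) + "\n")
```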
The final paper presentation of the day was by Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on Word2Vec-based vector representations of tweets to label unlabeled examples, how training a classifier with this augmented training set improves classification efficacy, and how a Word2Vec representation generated from a richer corpus like Google News provides further improvements with augmented training.
"Classification Of Tweets Using Augmented Training" Datasets #wadl2017 pic.twitter.com/pLKywUe3YP— John Berlin (@johnaberlin) June 23, 2017
This is a great project, given how hard it is to classify short tweets – they train a dataset, then use on auxiliary tweets. #wadl2017 pic.twitter.com/wFPDFhH31o— Ian Milligan (@ianmilligan1) June 23, 2017
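As a hedged sketch of the augmentation idea (the function names, similarity threshold, and nearest-labeled-tweet rule are illustrative assumptions, and pretrained Google News vectors are assumed to be loaded separately, e.g. with gensim), each tweet can be represented as the average of its word vectors, with unlabeled tweets adopting the label of their most similar labeled neighbor when the cosine similarity is high enough.

```python
import numpy as np

def tweet_vector(tokens, word_vectors, dim=300):
    """Average the word vectors of a tweet's tokens. `word_vectors` is any
    mapping from token to numpy array, e.g. pretrained Google News
    embeddings (loading them is assumed, not shown)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def augment_labels(labeled, unlabeled, word_vectors, threshold=0.7):
    """Give each unlabeled tweet the label of its most similar labeled
    tweet, but only when the cosine similarity clears a threshold.
    A sketch of the augmentation idea; the threshold is illustrative."""
    augmented = []
    lab_vecs = [(tweet_vector(toks, word_vectors), y) for toks, y in labeled]
    for toks in unlabeled:
        v = tweet_vector(toks, word_vectors)
        best_sim, best_label = max((cosine(v, lv), y) for lv, y in lab_vecs)
        if best_sim >= threshold:
            augmented.append((toks, best_label))
    return augmented  # merge with the labeled set to train the classifier
```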
Closing Round Table On WADL
The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. There were a lot of great ideas and suggestions made as the round table progressed, with the participants becoming the most excited about the following:
- WADL 2018 (naturally, of course)
- Seeking out additional collaboration and information sharing with those who are active in web archiving but are unaware of WADL or did not make it to the workshop
- Looking into bringing proceedings to WADL, perhaps even a journal
- Extending the length of WADL to a full two or three day event
- Integration of remote site participation for those who wish to attend but can not due to geographical location or travel expenses
- John Berlin