Monday, November 20, 2017

2017-11-20: Dodging the Memory Hole 2017 Trip Report

It was rainy in San Francisco, but that did not deter those of us attending Dodging the Memory Hole 2017 at the Internet Archive. We engaged in discussions about a very important topic: the preservation of online news content.

Keynote: Brewster Kahle, founder and digital librarian for the Internet Archive

Brewster Kahle is well known in digital preservation and especially web archiving circles. He founded the Internet Archive in May 1996. The WS-DL and LANL's Prototyping Team collaborate heavily with those from the Internet Archive, so hearing his talk was quite inspirational.

We are familiar with the Internet Archive's efforts to archive the Web, visible mostly through the Wayback Machine, but the goal of the Internet Archive is "Universal Access to All Knowledge", something that Kahle equates to the original Library of Alexandria or putting humans on the moon. To that end, he highlighted many initiatives by the Internet Archive to meet this goal. He mentioned that the contents of a book take up roughly a megabyte. With 28 terabytes, the works of the Library of Congress can be stored digitally. Digitizing them is another matter, but it is completely doable, and by digitizing them we remove restrictions on access due to distance and other factors. Why stop with documents? There are many other types of content. Kahle highlighted the efforts by the Internet Archive to make television content, video games, audio, and more available. They also have a lending program whereby users can borrow books, which are also digitized using book scanners. He stressed that, because of its mission to provide content to all, the Internet Archive is indeed a library.
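As a rough check of the storage figures Kahle quoted, the short calculation below works out how many books fit in 28 terabytes at about one megabyte per book. It assumes decimal units and treats "roughly a megabyte" as exactly one megabyte, both simplifications made only for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the keynote.
# Assumptions (not from the talk): decimal units, exactly 1 MB per book.
MEGABYTE = 10**6
TERABYTE = 10**12

bytes_per_book = 1 * MEGABYTE
total_storage = 28 * TERABYTE

# Integer division: how many one-megabyte books fit in the total storage.
books = total_storage // bytes_per_book
print(f"{books:,} books")  # prints "28,000,000 books"
```

Under those assumptions, 28 terabytes corresponds to about 28 million books.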

As a library, the Internet Archive also becomes a target for governments seeking information on the activities of their citizens. Kahle highlighted one incident in which the FBI sent a letter demanding information from the Internet Archive. Thanks to help from the Electronic Frontier Foundation, the Internet Archive sued the United States government and won, defending the rights of those using their services.

Kahle emphasized that we can all help with preserving the web by helping the Internet Archive build its holdings of web content. The Internet Archive contains a form with a simple "save page now" button, but they also support other methods of submitting content.

Contributions from Los Alamos National Laboratory (LANL) and Old Dominion University (ODU)

Martin Klein from LANL and Mark Graham from the Internet Archive

Martin Klein presented work on Robust Links. Martin briefly covered motivating work he had done with Herbert Van de Sompel at Los Alamos National Laboratory, mentioning the problems of link rot and content drift, the latter of which I have also worked on.
He covered how one can create links that are robust by:
  1. submitting a URI to a web archive
  2. decorating the link HTML so that future users can reach archived versions of the linked content
For the first item, he talked about how one can use tools like the Internet Archive's "Save Page Now" button as well as WS-DL's own ArchiveNow. The second item is covered by the Robust Links specification. Mark Graham, Director of the Wayback Machine at the Internet Archive, further expanded upon Martin's talk by describing how the Wayback Extension also provides the capability to save pages, navigate the archive, and more. It is available for Chrome, Safari, and Firefox, as shown in the screenshots below.
A screenshot of the Wayback Extension in Chrome.
A screenshot of the Wayback Extension in Safari. Note the availability of the option "Site Map", which is not present in the Chrome version.
A screenshot of the Wayback Extension in Firefox. Note how there is less functionality.

Of course, the WS-DL efforts of ArchiveNow and Mink augment these preservation efforts by submitting content to multiple web archives, including the Internet Archive.
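The two steps Martin described for making a link robust can be sketched in a few lines of Python. This is only an illustration, not the code behind Robust Links or ArchiveNow. It assumes that prefixing a URI with https://web.archive.org/save/ is how one asks the Internet Archive for a capture, and that the decoration uses the data-versionurl and data-versiondate attributes as I understand them from the Robust Links specification. The function names and the example URIs are hypothetical, and no network request is made.

```python
from datetime import date
from html import escape

# Assumption: requesting this prefix plus a URI asks the Internet Archive
# to capture that URI (the "Save Page Now" service).
SAVE_PAGE_NOW_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(original_url: str) -> str:
    """Step 1: build the URL one would request to submit a URI for archiving."""
    return SAVE_PAGE_NOW_PREFIX + original_url


def robust_link(original_url: str, memento_url: str,
                version_date: date, text: str) -> str:
    """Step 2: decorate an HTML link so readers can reach the archived copy.

    The href keeps pointing at the original resource, while the data-*
    attributes record which archived version was linked and when.
    """
    return (
        f'<a href="{escape(original_url, quote=True)}" '
        f'data-versionurl="{escape(memento_url, quote=True)}" '
        f'data-versiondate="{version_date.isoformat()}">'
        f"{escape(text)}</a>"
    )


# Hypothetical example values.
original = "https://example.com/story?id=1&lang=en"
memento = "https://web.archive.org/web/20171120000000/" + original

print(save_page_now_url(original))
print(robust_link(original, memento, date(2017, 11, 20), "A news story"))
```

Keeping the original URI in href means the link still works normally today, while the extra attributes give tools and future readers a path to the archived version if the page disappears or changes.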

I enjoyed one of the most profound revelations from Martin and Mark's talk: URIs are addresses, not the content that was on the page at the moment you read it. I realize that efforts like IPFS are trying to use hashes to address this dichotomy, but the web has not yet migrated to them.

Shawn M. Jones from ODU

I presented a lightning talk highlighting a blog post from earlier this year where I try to answer the question: where can we post stories summarizing web archive collections? I talked about why storytelling works as a visualization method for summarizing collections and then evaluated a number of storytelling and curation tools with the goal of finding those that best support this visualization method.

Selected Presentations

I tried to cover elements of all presentations while live tweeting during the event, and I wish I could go into more detail here, but, as usual, I will only cover a subset.

Mark Graham highlighted the Internet Archive's relationships with online news content. He highlighted a report by Rachel Maddow where she used the Internet Archive to recover tweets posted by former US National Security Advisor Michael Flynn, thus incriminating him. He talked about other efforts, such as NewsGrabber, Archive-It, and the GDELT project, which all further archive online news or provide analysis of archived content. Most importantly, he covered "News At Risk"—content that has been removed from the web by repressive regimes, further emphasizing the importance of archiving it for future generations. In that vein, he discussed the Environmental Data & Governance Initiative, set up to archive environmental data from government agencies after Donald Trump's election.

Ilya Kreymer and Anna Perricci presented their work on Webrecorder, web preservation software available as a hosted web platform. An impressive tool for "high fidelity" web archiving, Webrecorder allows one to record a web browsing session and save it to a WARC. Kreymer demonstrated its use on a CNN news web site with an embedded video, showing how the video was captured as well as the rest of the content on the page. The web platform allows users to record using their native browser, or they can choose from a few other browsers and configurations in case the user agent plays a role in the quality of recording or playback. For offline use, they have also developed Webrecorder Player, in which one can play back WARCs without requiring an Internet connection. Anna Perricci said that it is perfect for browsing a recorded web session on an airplane. Contributors to this blog have written about Webrecorder before.

Katherine Boss, Meredith Broussard, Fernando Chirigati, and Rémi Rampin discussed the problems surrounding the preservation of news apps: interactive content on news sites that allows readers to explore data collected by journalists on a particular issue. Because of their dynamic nature, news apps are difficult to archive. Unlike static documents, they cannot be printed or merely copied. They often consist of client-side and server-side code developed without a focus on reproducibility. Preserving news apps often requires the assistance of the organization that created the news app, which is not always available. Rémi Rampin noted that, for those organizations that were willing to help them, their group had success using the research reproducibility tool ReproZip to preserve and play back news apps.

Roger Macdonald and Will Crichton provided an overview of the Internet Archive's efforts to provide information from TV news. They have employed the Esper video search tool as a way to explore their collection. Because it is difficult for machines to derive meaning from pixels within videos, they used captioning as a way to provide for effective searching and analysis of the TV news content at the Internet Archive. Their goal is to allow search engines to connect fact checking to the TV media. To this end, they employed facial recognition on hours of video to find content where certain US politicians were present. From there one can search for a politician and see where they have given interviews on such news channels as CNN, BBC, and Fox News. They are also exploring identifying the body position of each person in a frame. Using this, it might be possible to answer queries such as "every video where a man is standing over a woman". The goal is to make video as easy as text to search for meaning.

Maria Praetzellis highlighted a project named Community Webs that uses Archive-It. Community Webs provides libraries the tools necessary to preserve news and other content relevant to their local community. Through Community Webs, local public libraries receive education and training, help with collection development, and archiving services and infrastructure.

Kathryn Stine and Stephen Abrams presented the work done on the Cobweb Project. Cobweb provides an environment where many users can collaborate to produce seeds that can then be captured by web archiving initiatives. If an event is unfolding and news stories are being written, the documents containing these stories may change quickly, so it is imperative for our cultural memory that these stories be captured as close to publication as possible. Cobweb provides an environment for the community to create a collection of seeds and metadata related to one of these events.

Matthew Weber shared some results from the News Measures Research Project. This project started as an attempt to "create an archive of local news content in order to assess the breadth and depth of local news coverage in the United States". The researchers were surprised to discover that local news in the United States covers a much larger area than expected: 546 miles on average. Most areas are "woefully underserved". Consolidation of corporate news ownership has led to fewer news outlets in many areas, and the focus of these outlets is becoming less local and more regional. These changes are of concern because the press is important to the democratic processes within the United States.


As usual, I met quite a few people during our meals and breaks. I appreciated talks over lunch with Sativa Peterson of Arizona State Library and Carolina Hernandez of the University of Oregon. It was nice to discuss the talks and their implications for journalism with Eva Tucker of Centered Media and Barrett Golding of Hearing Voices. I also appreciated feedback and ideas from Ana Krahmer of the University of North Texas, Kenneth Haggerty of the University of Missouri, Matthew Collins of the University of San Francisco Gleeson Library, Kathleen A. Hansen of the University of Minnesota, and Nora Paul, retired director of the Minnesota Journalism Center. I was especially intrigued by discussions with Mark Graham on using storytelling with web archives, Rob Brackett of Brackett Development, who is interested in content drift, and James Heilman, who works on WikiProject Medicine with Wikipedia.


Like last year, Dodging the Memory Hole was an inspirational conference highlighting current efforts to save online news. Having it at the Internet Archive further provided expertise and stimulated additional discussion on the techniques and capabilities afforded by web archives. Pictures of the event are available on Facebook. Video coverage is broken up into several YouTube videos: Day 1 before lunch, Day 1 after lunch, Day 2 before lunch, Day 2 after lunch, and lightning talks. DTMH highlights the importance of news in an era of a changing media presence in the United States, further emphasizing that web archiving can help us fact-check statements so that we not only hold onto a record of how we got here, but also have a guide to where we might go next. -- Shawn M. Jones
