Tuesday, June 9, 2015

2015-06-09: Web Archiving Collaboration: New Tools and Models Trip Report

Mat Kelly and Michele Weigle travel to and present at the Web Archiving Collaboration Conference in NYC.

On June 4 and 5, 2015, Dr. Weigle (@weiglemc) and I (@machawk1) traveled to New York City to attend the Web Archiving Collaboration conference held at the Columbia School of International and Public Affairs. The conference gave us an opportunity to present our work from the incentive award provided to us by Columbia University Libraries and the Andrew W. Mellon Foundation in 2014.

Robert Wolven of Columbia University Libraries started off the conference by welcoming the audience and emphasizing the variety of presentations that were to occur that day. He then introduced Jim Neal, the keynote speaker.

Jim Neal started by noting the challenges of "repository chaos", namely, which version of a document should be cited for online resources if multiple versions exist. "Born-digital content must deal with integrity", he said, "and remain as unimpaired and undivided as possible to ensure scholarly access."

Brian Carver (@brianwc) and Michael Lissner (@mlissner) of Free Law Project (@freelawproject) followed the keynote with Brian first stating, "Too frequently I encounter public access systems that have utterly useless tools on top of them and I think that is unfair." He described his project's efforts to make available court data from the wide variety of systems digitally deployed by various courts on the web. "A one-size-fits-all solution cannot guarantee this across hundreds of different court websites", he stated, further explaining that each site needs its own scraping algorithm to extract content.

To facilitate the crowdsourcing of scraping algorithms, he has created a system where users can supply "recipes" to extract content from the courts' sites as they are posted. "Everything I work with is in the public domain. If anyone says otherwise, I will fight them about it", he said regarding the demands people have brought to him after finding their names in the now-accessible court documents. "We still find courts using WordPerfect. They can cling to old technology like no one else."
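The per-site "recipe" idea can be illustrated with a small registry that maps each court to its own extraction function. This is only a sketch under stated assumptions, not Free Law Project's code: the court identifiers, the HTML patterns, and the `recipe` and `extract` helpers are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry: court identifier -> function that extracts case names.
RECIPES: Dict[str, Callable[[str], List[str]]] = {}


def recipe(court_id: str):
    """Register a contributed scraping function for one court's site."""
    def register(func: Callable[[str], List[str]]):
        RECIPES[court_id] = func
        return func
    return register


@recipe("court_a")
def scrape_court_a(html: str) -> List[str]:
    # This made-up court lists opinions as <li class="opinion">...</li>.
    return re.findall(r'<li class="opinion">(.*?)</li>', html)


@recipe("court_b")
def scrape_court_b(html: str) -> List[str]:
    # This made-up court publishes opinions in table cells instead.
    return re.findall(r'<td class="case">(.*?)</td>', html)


def extract(court_id: str, html: str) -> List[str]:
    """Dispatch to the recipe contributed for this court, if any."""
    if court_id not in RECIPES:
        raise KeyError(f"no recipe contributed for {court_id}")
    return RECIPES[court_id](html)
```

A new court is supported by contributing one more decorated function, which is why no single extractor has to understand every site's markup.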

Free Law Project slides

Shailin Thomas (@shailinthomas) and Jack Cushman from the Berkman Center for Internet and Society, Harvard University spoke next about Perma.cc. "From the digital citations in the Harvard Law Review from the last 10 years, 73% of the online links were broken. Over 50% of the links cited by the Supreme Court are broken." They went on to describe the Perma API and the service's recent Memento compliance.
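For readers unfamiliar with Memento (RFC 7089), a compliant archive lets a client ask for the capture of a page closest to a given time by sending an `Accept-Datetime` header to a TimeGate. The sketch below only builds such a request and does not send it. The TimeGate URL is a hypothetical placeholder, not Perma.cc's actual endpoint, and `memento_request` is a name made up for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def memento_request(timegate_url: str, when: datetime) -> Request:
    """Build (but do not send) a Memento datetime-negotiation request."""
    # RFC 7089 expects Accept-Datetime to carry an HTTP-date expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(timegate_url, headers={"Accept-Datetime": stamp})


req = memento_request(
    "https://timegate.example.org/http://example.com/",  # hypothetical TimeGate
    datetime(2015, 6, 4, tzinfo=timezone.utc),
)
# urllib normalizes header names, so the stored key is "Accept-datetime".
print(req.get_header("Accept-datetime"))
```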

Perma.cc slides

After a short break, Deborah Kempe (@nyarcist) of the Frick Art Reference Library described her recent observation that there is a digital shift in art moving to the Internet. She has been working with Archive-It for quality assurance of captured websites and with Hanzo Archives for on-demand captures of sites that her organization found particularly challenging. One example of the latter is Wangechi Mutu's site, which has an animation on the homepage that Archive-It was unable to capture but Hanzo was.

In the same session, Lily Pregill (@technelily) of NYARC stated, "We needed a discovery system to unite NYARC arcade and our Archive-It collection. We anticipated creating yet another silo of an archive." While she stated that the user interface is still under construction, it does allow the results of her organization's archive to be supplemented with results from Archive-It.

New York Art Resources Consortium (NYARC) slides

Following Lily in the session, Anna Perricci (@AnnaPerricci) of Columbia University Libraries talked about the Contemporary Composers Web Archive, which consists of 11 participating curators from 56 sites currently available in Archive-It. The "Ivies Plus" collaboration has Columbia building web archives with seeds chosen by subject specialists from Ivy League universities along with a few other universities.

Ivies Plus slides

In the same session, Alex Thurman (@athurman) (also from Columbia) presented on the IIPC Collaborative Collection Initiative. He referenced the varying legal environments between members based on countries, some being able to do full TLD crawling while some members (namely, in the U.S.) have no protection from copyright. He spoke of the preservation of Olympics web sites from 2010, 2012, and 2014, the last of which had the first logo to contain a web address. "Though Archive-It had a higher upfront cost", he said about the initial weighing of various options for Olympic website archiving, "it was all-inclusive of preservation, indexing, metadata, replay, etc." To publish their collections, they are looking into utilizing the .int TLD, which is reserved for internationally significant information but is underutilized in that only about 100 sites exist, all of which have research value.

International Internet Preservation Consortium collaborative collections slides

The conference then broke for a provided lunch and resumed with Lightning Talks.

To start off the Lightning Talks, Michael Lissner (@mlissner) spoke about RECAP: what it is, what it has done, and what is next for the project. Much of the content contained within the Public Access to Court Electronic Records (PACER) system consists of paywalled public domain documents. Obtaining the documents costs users ten cents per page with a three dollar maximum. "To download the Lehman Brothers proceedings would cost $27000", he said. His system uses a browser extension to save a copy of each document a user downloads to the Internet Archive, and it first queries the archive to see whether the document has already been downloaded.
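The archive-first flow and the fee arithmetic can be sketched as follows. This is an illustration rather than RECAP's code: the dictionary stands in for the Internet Archive, `buy` stands in for a paid PACER download, and the three dollar maximum is assumed here to apply per document.

```python
from typing import Callable, Dict

PER_PAGE_CENTS = 10   # ten cents per page, as quoted in the talk
DOC_CAP_CENTS = 300   # three dollar maximum, assumed to apply per document


def pacer_cost_cents(pages: int) -> int:
    """Fee for one document under the assumed per-document cap."""
    return min(pages * PER_PAGE_CENTS, DOC_CAP_CENTS)


def get_document(doc_id: str,
                 archive: Dict[str, bytes],
                 buy: Callable[[str], bytes]) -> bytes:
    """Return a document, paying only if no one has archived it yet."""
    # Ask the shared archive first so nobody pays twice for the same file.
    if doc_id in archive:
        return archive[doc_id]
    data = buy(doc_id)        # paid download from the court system
    archive[doc_id] = data    # contribute the copy for the next user
    return data
```

Under these assumptions a 5-page filing costs 50 cents and a 50-page filing costs 300 cents, and a second request for the same `doc_id` never reaches `buy`.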

Dragan Espenschied (@despens) gave the next Lightning Talk on preserving digital art pieces, namely those on the web. He noted one particular example where the artist extensively used scrollbars, which are less commonplace in user interfaces today. To accurately re-experience the work, he fired up a browser-based MacOS 9 emulator.

Jefferson Bailey (@jefferson_bail) followed Dragan with his work investigating archive access methods that are not URI-centric. He has begun working with WATs (web archive transformations), LGAs (longitudinal graph analyses), and WANEs (web archive named entities).

Dan Chudnov (@dchud) then spoke of his work at GWU Libraries. He had developed Social Feed Manager, a Django application to collect social media data from Twitter. Previously, researchers had been copying and pasting tweets into Excel documents. His tool automated this process. "We want to 1. See how to present this stuff, 2. Do analytics to see what's in the data and 3. Find out how to document the now. What do you collect for live events? What keywords are arising? Whose info should you collect?", he said.

Jack Cushman from Perma.cc gave the next lightning talk about ToolsForTimeTravel.org, a site that is trying to make a strong dark archive. The concept would prevent archivists from reading material within until conditions are met. Examples where this would be applicable are the IRA Archive at Boston College, Hillary Clinton's e-mails, etc.

With the completion of the Lightning Talks, Jimmy Lin (@lintool) of University of Maryland and Ian Milligan (@ianmilligan1) of University of Waterloo rhetorically asked, "When does an event become history?", stating that history is written 20 to 30 years after an event has occurred. "History of the 60s was written in the 90s. Where are the Monica Lewinsky web pages now? We are getting ready to write the history of the 1990s", Jimmy said. "Users can't do much with current web archives. It's hard to develop tools for non-existent users. We need deep collaborations between users (archivists, journalists, historians, digital humanists, etc.) and tool builders. What would a modern archiving platform built on big data infrastructure look like?" He compared his recent work in creating warcbase with the monolithic OpenWayback Tomcat application. "Existing tools are not adequate."

Warcbase: Building a scalable platform on HBase and Hadoop slides (part 1)

Ian then talked about warcbase as an open source platform for managing web archives with Hadoop and HBase. WARC data is ingested into HBase and Spark is used for text analysis and services.

Warcbase: Building a scalable platform on HBase and Hadoop slides (part 2)

Zhiwu Xie (@zxie) of Virginia Tech then presented his group's work on maintaining web site persistence when the original site is no longer available. Using an approach akin to a proxy server, the system continues to serve the content from when the site was last available in lieu of the live site. "If we have an archive that archives every change of that web site and the website goes down, we can use the archive to fill the downtimes", he said.
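A proxy of this kind can be sketched in a few lines: every successful response is remembered, and the remembered copy is served when the origin fails. This is a minimal illustration under assumptions, not the Virginia Tech implementation; the `FallbackProxy` class and the injected `fetch_live` callable are hypothetical, and a real system would archive every change rather than only the latest response.

```python
from typing import Callable, Dict, Optional


class FallbackProxy:
    """Serve live content when possible, the last good copy otherwise."""

    def __init__(self, fetch_live: Callable[[str], bytes]):
        self.fetch_live = fetch_live
        self.last_good: Dict[str, bytes] = {}  # stands in for the archive

    def get(self, url: str) -> Optional[bytes]:
        try:
            body = self.fetch_live(url)
        except OSError:
            # Origin is down: fill the downtime from the archived copy.
            # Returns None if the URL was never captured.
            return self.last_good.get(url)
        self.last_good[url] = body  # record the latest successful response
        return body
```

Connection failures in Python's standard networking stack are subclasses of `OSError`, which is why that is the exception caught here.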

Archiving transactions towards an uninterruptible web service slides

Mat Kelly (@machawk1, your author) presented next with "Visualizing digital collections of web archives" where I described the SimHash archival summarization strategy to efficiently generate a visual representation of how a web page changed over time. In the work, I created a stand-alone interface, Wayback add-on, and embeddable service for a summarization to be generated for a live web page. At the close of the presentation, I attempted a live demo.
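For context, SimHash reduces a document to a short fingerprint such that similar documents tend to produce fingerprints differing in only a few bits, so change between two captures can be estimated from the Hamming distance of their fingerprints. The sketch below is a generic textbook version, not the code from the presentation: whitespace tokenization, unweighted tokens, and SHA-256 as the per-token hash are all assumptions made for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Comparing the fingerprints of consecutive captures of a page and keeping only those whose distance exceeds some chosen cutoff is one way to pick representative captures for a visual summary.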

WS-DL's own Michele Weigle (@weiglemc) next presented Yasmin's (@yasmina_anwar) work on Detecting Off-Topic Pages. The recently accepted TPDL 2015 paper had her looking at how pages in Archive-It collections have changed over time and detecting when a page is no longer relevant to what the archivist intended to capture. She evaluated six similarity metrics and found that cosine similarity performed the best.

In the final presentation of the day, Andrea Goethals (@andreagoethals) of Harvard Library and Stephen Abrams of California Digital Library discussed difficulties in keeping up with web archiving locally, citing the outdated tools and systems. A hierarchical diagram of a potential collaborative model that they showed piqued the audience's interest as being overcomplicated for smaller archives.
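Cosine similarity between two pages can be computed from their term-frequency vectors. The sketch below is a simplified illustration, not the method from the paper: comparing a later capture against the first capture, whitespace tokenization, and the `0.15` threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)  # missing terms count as 0
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def is_off_topic(first_capture: str, later_capture: str,
                 threshold: float = 0.15) -> bool:
    """Flag a capture whose text has drifted away from the first capture."""
    # The threshold is a hypothetical value, not one reported in the paper.
    return cosine_similarity(first_capture, later_capture) < threshold
```

A capture that shares no terms with the original, such as a parked-domain page replacing a news site, scores 0.0 and is flagged.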

Exploring a national collaborative model for web archiving slides

To close out the day, Robert Wolven gave a synopsis of the challenges to come and expressed his hope that there was something for everyone.

Day 2

The second day of the conference contained multiple concurrent topical sessions that were somewhat open-ended to facilitate more group discussion. I initially attended David Rosenthal's talk, where he discussed the need for tools and APIs for integration into various systems for standardization of access. "A random URL on the web has less than 50% chance of getting preserved anywhere", he said. "We need to use resources as efficiently as possible to up that percentage."

DSHR then discussed repairing archives for bit-level integrity and LOCKSS' approach to accomplishing it. "How would we go about establishing a standard archival vocabulary?", he asked. "'Crawl scope' means something different in Archive-It vs. other systems."

I then changed rooms to catch the last half hour of Dragan Espenschied's tools session, where he discussed pywb (the software behind webrecorder.io) in more depth. The software allows people to create their own public and private archives and offers a pass-through model in which it does not record login information. Further, it can capture embedded YouTube and Google Maps content.

Following the first set of concurrent sessions, I attended Ian Milligan's demo of using warcbase for analysis of Canadian Political Parties (a private repo as of this writing but will be public once cleaned up). He also demonstrated using Web Archives for Historical Research. In the subsequent and final presentation of day 2, Jefferson Bailey demonstrated Vinay Goel's (@vinaygo) Archive Research Services Workshop, which was created to serve as an introduction to data mining and computational tools and methods for work with web archives for researchers, developers, and general users. The system utilizes the WAT, LGA, and WANE derived data formats that Jefferson spoke of in his Day 1 Lightning Talk.

After Jefferson's talk, Robert Wolven again collected everyone into a single session to go over what was discussed in each session on the second day and gave a final closing.

Overall, the conference was very interesting and very relevant to my research in web archiving. I hope to dig into some of the projects and resources I learned about and follow up with contacts I made at the Columbia Web Archiving Collaboration conference.

— Mat (@machawk1)
