2014-11-20: Archive-It Partners Meeting 2014

I attended the 2014 Archive-It Partners Meeting in Montgomery, AL on November 18.  The meeting attendees are representatives from Archive-It partners with interests ranging from archiving webpages about art and music to archiving government webpages.  (Presentation slides are now available on the Archive-It wiki.)  This is ODU's third consecutive Partners Meeting (see trip reports from 2012 and 2013).

The morning program was focused on presentations from partners who are building collections.  Here's a brief overview of each of those.

Penny Baker and Susan Roeper from the Clark Art Institute talked about their experience in archiving the 2013 Venice Biennale international art exhibition (Archive-It collection) and plans for the upcoming exhibition.  Their collection includes exhibition catalogs, monographs, and press releases about the event.  The material also includes a number of videos (mainly from vimeo), which Archive-It can now capture.

Beth Downs from the Montana State Library (Archive-It collection) spoke about working with partners around the state to fulfill the state mandate to make all government documents publicly available and working to make the materials available to state employees, librarians, teachers, and the general public.  One of the nice things they've added to their site footer is a Page History link that goes directly to the Archive-It Wayback calendar page for the current page.

Beth has also provided instructions for their state agencies on how to include the Page History link and how to embed a Search box into the archive on their pages.  This could be easily adapted to point to other state government archives or to the general Internet Archive Wayback Machine.
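
To give a rough idea of how little is needed for such a link, here is a minimal sketch in Python. It is not Montana's actual implementation. It assumes the general Internet Archive Wayback Machine, whose calendar view for a page lives at `https://web.archive.org/web/*/` followed by the page's URL. The function names are made up for illustration.

```python
import html

# Assumption: the general Wayback Machine shows the capture calendar for a
# page at https://web.archive.org/web/*/<page URL>.
WAYBACK_CALENDAR_PREFIX = "https://web.archive.org/web/*/"


def page_history_url(page_url: str) -> str:
    """Return the Wayback calendar address for the given page URL."""
    return WAYBACK_CALENDAR_PREFIX + page_url


def page_history_link(page_url: str, label: str = "Page History") -> str:
    """Return an HTML anchor that could be dropped into a site footer."""
    href = html.escape(page_history_url(page_url), quote=True)
    return f'<a href="{href}">{html.escape(label)}</a>'


if __name__ == "__main__":
    # example.org is a placeholder, not a real agency site.
    print(page_history_link("http://example.org/about"))
```

A site pointing at its own Archive-It collection would swap in that collection's Wayback address for the prefix.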

Dory Bower from the US Government Printing Office talked about the FDLP (Federal Depository Library Program) Web Archive (Archive-It collections).  They have several archiving strategies and use Archive-It mainly for the more content-rich websites along with born-digital materials.

Heather Slania, Director of the Betty Boyd Dettre Library and Research Center at the National Museum of Women in the Arts (Archive-It collections), spoke about the challenges of capturing dynamic content from artists' websites.  This includes animation, video (mainly vimeo), and other types of Internet art. She has initially focused on capturing websites of a selection of Internet artists.  These sites include over 6000 videos (from just 30 artists).  The next step is to archive the work of video artists and web comics.  As part of this project, she has been considering what types of materials are currently capture-able and categorizing the amount of loss in the archived sites.  This is related to our group's recent work on measuring memento damage (pdf, slides) and investigating the archivability of websites over time (pdf at arXiv, slides).

Nicholas Taylor from Stanford University Libraries gave an overview of the 2013 NDSA (National Digital Stewardship Alliance) Survey Report (pdf).  The latest survey was conducted in 2013 and the first was done in 2011.  NDSA's goal is to conduct this every 2 years.  Nicholas had lots of great stats in his slides, but here are a few that I noted:
  • 50% of respondents were university programs
  • 7% affiliated with IIPC, 33% with NDSA, 45% Web Archiving Roundtable, 71% with Archive-It
  • many are concerned with capturing social media, databases, and video
  • about 80% of respondents are using external services for archiving, like Archive-It
  • 80% haven't transferred data to their local repository
  • many are using tools that don't support WARC (but the percentage using WARC has increased since 2011)
Abbie Nordenhaug and Sara Grimm from the Wisconsin Historical Society (Archive-It collections) presented next.  They're just getting started archiving in a systematic manner.  Their state agency partners have websites that range from dynamic to fairly static.  So far, they've set up monthly, quarterly, semi-annual, and annual crawls for those sites.

After these presentations, it was time for lunch.  Since we were in Alabama, I found my way to Dreamland BBQ.

After lunch, the presentations focused on collaborations, an update on 2014-2015 Archive-It plans, BOF breakout sessions, and strategies and services.

Anna Perricci from Columbia University Libraries spoke about their experiences with collaborative web archiving projects (Archive-It collections), including the Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) collection and the Contemporary Composers Web Archive (CCWA) collection.

Kent Underwood, Head of the Avery Fisher Center for Music and Media at the NYU Libraries, spoke about web archiving for music history (Archive-It collection).  Kent gave an eloquent argument for web archiving:  "Today’s websites will become tomorrow’s historical documents, and archival websites must certainly be an integral part of tomorrow’s libraries. But websites are fragile and impermanent, and they cannot endure as historical documents without active curatorial attention and intervention. We must act quickly to curate and preserve the memory of the Internet now, while we have the chance, so that researchers of tomorrow will have the opportunity to discover their own past. The decisions and actions that we take today in web archiving will be crucial in determining what our descendants know and understand about their musical history and culture."

Patricia Carlson from Mount Dora High School in Florida spoke about Archive-It's K-12 Archiving Program and its impact on her students (Mount Dora's Archive-It collection).  She talked about its role in introducing her students to primary sources and metadata.  She's also been able to use things that they already do (like tag people on Facebook) as examples of adding metadata. The students have even made a video chronicling their archiving experiences.

After the updates on ongoing collaborations, Lori Donovan and Maria LaCalle from Archive-It gave an overview of Archive-It's 2014 activities and upcoming plans for 2015.  Archive-It currently has 330 partners in 48 US states (only missing Arkansas and North Dakota!) and 16 countries.  In 2014, with version 4.9, Archive-It began crawling certain pages with Heritrix and Umbra, which allows Heritrix to access sites in the same way a browser would.  This allows for capture of client-side scripting (such as JavaScript) and improves the capture of social media sites.  There were several new features in the 5.0 release, among them integration with Google Analytics. There will be both a winter 2014 release and a spring/summer 2015 release.  In the spring/summer release several new features are planned, including a visual/UI redesign of the web app, the ability to move and share seeds between collections, the ability to manually rank metadata facets on the public site, enhanced integration with archive.org, an updated Wayback look and feel, and links between related pages on the Wayback calendar (in case a URI changed over time).

After a short break, we divided up into BOF groups:
  • Archive.org v2
  • Researcher Services
  • Cross-archive collaboration
  • QA (quality assurance)
  • Archiving video, audio, animations, social media
  • State Libraries
I attended the Research Services BOF, led by Jefferson Bailey and Vinay Goel from Internet Archive and Archive-It.  Jefferson and Vinay described their intentions with launching research services and asked for feedback and requests.  The idea is to use the Internet Archive's big data infrastructure to process data and provide smaller datasets of derived data to partners from their collections.  This would allow researchers to work on smaller datasets that would be manageable without necessarily needing big data tools.  This could also be used to provide a teaser as to what's in the collection, highlight link structure in the collection, etc.  One of the initial goals is to seed example use cases of these derivative datasets to show others what might be possible.  The ultimate goal is to help people get more value from the archive.  Jefferson and Vinay talked in more detail about what's upcoming in the last talk of the meeting (see below). Most of the other participants in the BOF were interested in ways that their users could make research use of their archived collections.

After the BOF breakout, the final session featured talks on strategies and services.

First up was yours truly (Michele Weigle from the WS-DL research group at Old Dominion University).  My talk was a quick update on several of our ongoing projects, funded by NEH Office of Digital Humanities and the Columbia University Libraries Web Archiving Incentives program.

The tools I mentioned (WARCreate, WAIL, and Mink) are all available from our Software page.  If you try them out, please let us know what you think (contact info is on the last slide).

Mohamed Farag from Virginia Tech's CTRnet research group presented their work on an event-focused crawler (EFC).  Their previous work on automatic seed generation from URIs shared on Twitter produced lots of seeds, but not all of them were relevant.  The new work allows a curator to select high-quality seed URIs and then uses the EFC to retrieve webpages that are highly similar to the seeds.  The EFC can also read WARCs and perform text analysis (entities, topics, etc.) on them.  This enables event modeling, describing what happened, where, and when.
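
To make "highly similar to the seeds" a bit more concrete, here is a small sketch of one common way to score a candidate page against seed pages: bag-of-words cosine similarity with a cutoff. This is only my illustration and not the EFC's actual model; the function names and the 0.3 threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(candidate_text, seed_texts, threshold=0.3):
    """Keep a candidate page if it is close enough to at least one seed page.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    candidate = term_vector(candidate_text)
    return any(
        cosine_similarity(candidate, term_vector(seed)) >= threshold
        for seed in seed_texts
    )


if __name__ == "__main__":
    seeds = ["earthquake strikes coastal city, rescue teams deployed"]
    print(is_relevant("rescue teams reach coastal city after earthquake", seeds))
    print(is_relevant("recipe for chocolate cake", seeds))
```

A focused crawler would apply a check like this to each fetched page before following its links, so that off-topic pages do not pull the crawl away from the event.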

In the final presentation of the meeting, Jefferson Bailey and Vinay Goel from Internet Archive spoke about building Archive-It Research Services, planned to launch in January 2015. The goals are to expand access models to web archives, enable new insights into collections, and facilitate computational analysis.  The plan is to leverage the Internet Archive's infrastructure for large-scale processing.  This could result in increasing the use, visibility, and value of Archive-It collections.  Initially, three main types of datasets are planned:
  • WAT - consists of key metadata from a WARC file, includes text data (title, meta-keywords, description) and link data (including anchor text) for HTML
  • LGA - longitudinal graph analysis - what links to what over time
  • WANE - web archive named entities
All of these datasets are significantly smaller than the original WARC files.  Jefferson and Vinay have built several visualizations based on some of this data for demonstration and will be putting some of these online.  Their future work includes developing programmatic APIs, custom datasets, and custom processing.
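
As a rough illustration of how link metadata of the WAT kind could feed a longitudinal link graph, here is a short Python sketch. The record layout (`url`, `timestamp`, `links`) is a simplified stand-in that I made up for the example. It is not the real WAT or LGA schema, and it is not how Archive-It Research Services builds these datasets.

```python
from collections import defaultdict


def links_by_year(records):
    """Group (source, target) link pairs by the year of the capture.

    Each record is a hypothetical, simplified dict with:
      "url":       the captured page
      "timestamp": a 14-digit capture time, e.g. "20141118093000"
      "links":     a list of {"url": ..., "text": ...} outlinks
    """
    graph = defaultdict(set)
    for record in records:
        year = record["timestamp"][:4]
        for link in record.get("links", []):
            graph[year].add((record["url"], link["url"]))
    # Sort the edges so the output is stable and easy to compare across years.
    return {year: sorted(edges) for year, edges in graph.items()}


if __name__ == "__main__":
    # Placeholder records, not real capture data.
    sample = [
        {"url": "http://example.org/", "timestamp": "20130501120000",
         "links": [{"url": "http://example.org/a", "text": "A"}]},
        {"url": "http://example.org/", "timestamp": "20140501120000",
         "links": [{"url": "http://example.org/b", "text": "B"}]},
    ]
    for year, edges in sorted(links_by_year(sample).items()):
        print(year, edges)
```

Comparing the edge lists from one year to the next shows which links appeared or disappeared, which is the kind of "what links to what over time" question the LGA data is meant to support.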

All in all, it was a great meeting with lots of interesting presentations. It was good to see some familiar faces and to actually meet others I'd only previously emailed with.  It was also nice to be in an audience where I didn't have to motivate the need for web archiving.

There were several people live-tweeting the meeting (#ait14).  I'll conclude with some of the tweets.