Wednesday, November 13, 2013

2013-11-13: 2013 Archive-It Partner Meeting Trip Report


On November 12, I attended the 2013 Archive-It Partner Meeting in Salt Lake City, Utah, our research group's second year of attendance (see 2012 Trip Report). The meeting started off casually at 9am with breakfast and registration. Once everyone was settled, Kristine Hanna, the Director of Archiving Services at Internet Archive, introduced the members of her team who were present at the meeting. Kristine acknowledged the fire at Internet Archive last week and the extent of the damage. "It did burn to the ground but thankfully, nobody was injured." She reminded the partners to review Archive-It's storage and preservation policy and mentioned the redundancies in place, including a soon-to-be mirror at our very own ODU. Kristine then shared news of a new partnership with Reed Technologies to jointly market and sell Archive-It (@archiveitorg). She reassured the audience that nothing would change beyond having more resources for them to accomplish their goals.

Kristine then briefly mentioned the upcoming release of Archive-It 5.0, which would be covered in depth in a later presentation. She asked everyone in the room (probably 50 or so attendees) to introduce themselves and state their affiliation. With the intros out of the way, the presentations began.

Kate Legg of National Center for Atmospheric Research (NCAR) presented "First steps toward digital preservation at NCAR". She started by saying that NCAR is a federally funded research and development center (FFRDC) whose mission is to "preserve, maintain and make accessible records and materials that document the history and scientific contributions of NCAR". With over 70 collections and 1500 employees, digital preservation is on the organization's radar. Although they have a small library and staff, their plan is to pursue digital preservation alongside other competing priorities.

"Few people were thinking about the archives for collecting current information", Kate said, describing how some in the organization did not understand that preserving now will create archives for later. "The archive is not just where old stuff goes, but new stuff as well." One of the big obstacles for the organization's archiving initiatives has been funding. Even with this limitation, however, NCAR was still able to subscribe to Archive-It through a low-level subscription. With this subscription, they started to preserve their Facebook group but increasingly found huge amounts of data, including videos, that they felt were too resource-heavy to archive. The next step for the initiative is to add a place on the organization's webpage where archived content will be accessible to the public.

Jaime McCurry (@jaime_ann) of Folger Shakespeare Library followed Kate with "The Short and the Long of It: Web Archiving at the Folger Shakespeare Library". Jaime is currently participating in the National Digital Stewardship Residency, where her goal is to establish local routines and best practices for archiving and preserving the library's born-digital content. They currently have two collections with over 6 million documents (over 400 gigabytes of data), which aim to preserve web content relating to the works of Shakespeare (particularly in social media and from festivals). Describing the large extent of the available content, Jaime said, "In trying to archive Shakespeare's presence on the web, you really have to define what you're looking for. Shakespeare is everywhere!" She noted that one of the first things she realized when she started on the project at Folger was that nobody knew that the organization was performing web archiving, so she wished to establish an organization-wide web archiving policy. One of the recent potential targets of her archiving project was the NYTimes' Hamlet contest, wherein the newspaper suggested Instagram users create 15-second clips of their interpretation of a passage from the play. Because this related to Shakespeare, it would be an appropriate target for the Folger Shakespeare Library.

EDIT: Jaime also created a trip report of the meeting on her blog.

After Jaime finished, Sharon Farnel of University of Alberta began her presentation "Metadata workflows for web archiving – thinking and experimenting with ‘metadata in bulk’". In her presentation she referenced a project called Blacklight, an open source project that provides a discovery interface for any Solr index via a customizable, template-based user interface. For her collection, she wished to think, from the perspective of metadata, about where and why discovery of content takes place in web archiving. She utilized a mixed model wherein entries might have MARC records, Dublin Core data or both. Sharon emphasized that metadata was an important functionality of Archive-It. To better parse the data, her group created XSLT stylesheets to export the data into a more interoperable format like Excel, which could then be imported back into Blacklight after manipulation. She referenced some of the difficulties in working with the different technologies but said, "None of these tools were a perfect solution on their own but by combining the tools in-house, we can get good results with the metadata."
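To make the idea of exporting metadata "in bulk" a bit more concrete, here is a minimal sketch of flattening Dublin Core records into spreadsheet-friendly CSV rows. It uses Python's standard library rather than the XSLT stylesheets Sharon's group wrote, and the record layout, sample values, and function name are assumptions made purely for illustration, not Archive-It's or Blacklight's actual export format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical input: the <records>/<record> wrapper is an assumption,
# not the export format of any particular tool.
SAMPLE_XML = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title>Example Seed A</dc:title>
    <dc:creator>Example Org</dc:creator>
    <dc:subject>Web archiving</dc:subject>
    <dc:subject>Metadata</dc:subject>
  </record>
  <record>
    <dc:title>Example Seed B</dc:title>
  </record>
</records>"""

COLUMNS = ("title", "creator", "subject")


def dc_records_to_csv(xml_text: str) -> str:
    """Flatten each record into one CSV row; repeated fields are joined with '; '."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for record in root.findall("record"):
        row = []
        for column in COLUMNS:
            elements = record.findall(f"{{{DC_NS}}}{column}")
            row.append("; ".join(el.text or "" for el in elements))
        writer.writerow(row)
    return out.getvalue()


csv_text = dc_records_to_csv(SAMPLE_XML)
print(csv_text)
```

Joining repeated fields into a single cell keeps one row per record, which is convenient for editing in a spreadsheet, at the cost of needing to split the values again on re-import.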

After a short break (to caffeinate), Abbie Grotke (@agrotke) of Library of Congress remotely presented "NDSA Web Archiving Survey Update". In her voice-over presentation from DC, she gave preliminary results of the NDSA Web Archiving Survey, stating that the NDIIPP initiative had yielded about 50 respondents so far. The biggest concern about web archiving reported by the survey participants was database preservation, followed by social media and video archiving. She stated that the survey is still open and encouraged attendees to take it (Take it here).

Trevor Alvord of Brigham Young University was next with "A Muddy Leak: Using the ARL Code of Best Practices in Fair Use with Web Archiving". His effort with the L. Tom Perry Special Collections at BYU was to build a thematic collection on Mormonism. He illustrated that many battles had been fought and won over digital preservation content rights (e.g., Perfect 10 vs. Google and Students vs. iParadigms), so his collection should be justified based on the premises in those cases. "Web archiving is routinely done by two wealthiest corporations (Google and Microsoft)", he quoted Jonathan Band, a recognized figure in the lawsuits versus Google. "In the last few months, libraries have prevailed", Trevor said. "Even with our efforts, we have not received any complaints about their website being archived by libraries."

Trevor then went on to describe the problem with his data set, likening it to the Teton Dam flood: millions of documents are being produced about Mormonism and he has to capture whatever he can. This is partially due to the lowering of the age allowed for missionaries and the Mormon church's encouragement for young Mormons to post online. He showed two examples of Mormon "mommy" blogs: Peace Love Lauren, a very low-impact blog, and NieNie Dialogs, a very popular one. He asked the audience, "How do you prioritize what content to archive given popular content is more important but also more likely to be preserved?"

Following Trevor, Vinay Goel of Internet Archive presented "Web Archive Analysis". He started by saying that "Most Partners access Archive-It via the Wayback Machine"; other methods include using the Archive-It search service or downloading the archival contents. He spoke of de-duplication and how it is represented in WARCs via a revisit record. The core of his presentation covered the various WARC meta formats: Web Archive Transformation (WAT) files and CDX files, the format used for WARC indexing. "WAT files are WARC metadata records", he said. "CDX files are space delimited text files that record where a file resides in a WARC and its offset." Vinay has come up with an analysis toolkit that would allow researchers to express the questions they want to ask about the archives in a high-level language that would then be translated to a low-level language understandable by an analysis system. "We can capture YouTube content", he said, giving an example use case, "but the content is difficult to replay." Some of the analysis he displayed identified this non-replayable content in the archives and showed the in-degree and out-degree of each resource. Further, his toolkit is useful in studying how this linking behavior changes over time.
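As a rough illustration of what Vinay meant by a CDX file being space-delimited text that records where a capture lives in a WARC, here is a minimal Python sketch that parses a single CDX line. The 11-field order below is an assumption (the commonly seen "CDX N b a m s k r M S V g" layout); real CDX files declare their own field order in a header line, and the sample line, digest, and filename are made up for the example.

```python
# Assumed field order, matching the common " CDX N b a m s k r M S V g" header.
# Real CDX files may declare a different order in their first line.
FIELDS = (
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
)


def parse_cdx_line(line: str) -> dict:
    """Split one space-delimited CDX line into a dict keyed by field name."""
    parts = line.strip().split(" ")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    # The record's compressed length and byte offset locate it inside the WARC.
    record["length"] = int(record["length"])
    record["offset"] = int(record["offset"])
    return record


# Hypothetical sample line; the digest and filename are placeholders.
sample = (
    "org,example)/ 20131112090000 http://example.org/ text/html 200 "
    "EXAMPLEDIGEST - - 1234 5678 example-20131112.warc.gz"
)

rec = parse_cdx_line(sample)
print(rec["filename"], rec["offset"], rec["length"])
```

With the filename, offset, and length in hand, a replay tool can seek directly to the record in the WARC instead of scanning the whole file.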

The crowd then broke for lunch, returning to Scott Reed (@vector_ctrl) of Internet Archive presenting the new features in the next iteration of Archive-It, 5.0. The new system, among other things, allows users to create test crawls and is better at social media archiving. Some of the goals to be implemented in the system before the end of the year are to get the best capture and to display the capture in currently existing tools. Scott mentioned an effort by Archive-It to utilize PhantomJS (with which we're familiar at WS-DL through our experiments) through a feature they're calling "Ghost". Further, the new version promises to have an API. Along with Scott, Maria LaCalle spoke of a survey completed about the current version of Archive-It and Solomon Kidd spoke of work done on user interface refinements of the upcoming system.

Following Scott, the presentations continued with your author, Mat Kelly (@machawk1), presenting "Archive What I See Now".

After I finished my presentation, the final presentation of the day was by Debbie Kempe of The Frick Collection and Frick Art Reference Library with "Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art Resources". In her presentation, she stated that there was a broad overlap of art between the Brooklyn Museum, Museum of Modern Art, and the Frick Art Reference Library. Citing Abbie Grotke's survey from earlier, she reminded the audience that no museums responded to the survey, which is problematic for evaluating their archiving needs. "Not all information is digital in the art community", Debbie said. In initiating an archiving effort, the questions for the museums' organizers were not so much why or how web archiving of their content should be done but rather, "Who will do it?" and "How will we pay for it?" She ran a small experiment in accomplishing the preservation tasks of the museum and is now running a longer "experiment", given that more of the content being created in their collections is digital and less is in print. In the longer trial, she hopes to test and formulate a sustainable workflow, including re-skilling and organizational changes.

After Debbie, the crowd was freed into a Birds of a Feather session to discuss the web archiving issues that interested each individual. I gathered with a group about "Capture", given my various software projects relating to the topic. After the BoF session, Lori Donovan and Kristine Hanna adjourned the meeting to a reception.

Overall, I felt the trip to Utah to meet with a group with a common interest was a unique experience that I don't get at other conferences where some of the audiences' focuses are disjoint from one another. The feedback I received on my research and the discussion I had with various attendees was extremely valuable in learning how the Archive-It community works and I hope to attend again next year.

EDIT: Since publishing this post, the Archive-It team have posted the slides from the Partner Meeting.

— Mat (@machawk1)
