The time of year has again arrived for conferences related to our research area of web sciences and digital libraries. While much of our group will be representing the university at the Joint Conference on Digital Libraries (JCDL) in Indianapolis (trip report), I was given the opportunity to attend Digital Preservation 2013 in Alexandria, Virginia.
Because Alexandria is much closer to home in Hampton Roads, this is the third year running that I have attended this conference (2012 Trip Report, 2011 Trip Report), having presented digital preservation tools at the previous two: Archive Facebook in 2011 and WARCreate in 2012. Following up on the recent public release of WARCreate (see the announcement), I gave a presentation titled "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy". It covered WARCreate, another package I had created called Web Archiving Integration Layer (WAIL), originally unveiled at Personal Digital Archiving 2013 in February (Trip Report), and how all of the pieces fit together.
Long before it was my turn to present, however, the lineup included a fantastic cast of other presenters. To start off the conference, Bill LeFurgy (@blefurgy) gave the welcoming remarks.
Bill started by noting that this was the 9th year of the annual NDIIPP meeting and that a lot had changed in that time. He reminisced about when the conference first started in 2004 and about how much progress has been made in preservation efforts since. "One of the principal goals was to build a community around the process of digital stewardship.", he said. Bill then introduced the first speaker of the conference, Hilary Mason.
Hilary Mason (@hmason) is the chief scientist at bit.ly. Hilary started her presentation, titled "Humans and Data", by noting that, being an engineer, she was there to learn. She offered her expertise on "How engineers and startup people think about preservation when they think about it at all.", she described, "...Which is not that often. That's the punchline."
Commenting on social behavior, she referenced a reddit thread that posed the question "If someone from the 1950s suddenly appeared today, what would be the most difficult thing to explain to them about life today?" to which the top answer was, "I possess a device, in my pocket, that is capable of accessing the entirety of information known to man. I use it to look at pictures of cats and get in arguments with strangers."
"That's the Internet!", she exclaimed, "While technology and our technical capability has changed very rapidly, human nature has not changed at all."
She continued on to speak of the origins of bit.ly and how it was a complete accident. bit.ly was part of a feature of another product spun out of a company called Betaworks. "They [Betaworks] had this brilliant idea that when you're reading a news article on the Internet, other people are reading that article at the same time and yet our experience of that is very lonely. It is not in any way social.", she continued, "So they thought, what if we added a social layer to news consumption? They built a system where you could see the mouse cursors of everybody else on the news article with you. So you can guess what happened then." She described the behavior of the users, who did the exact opposite of what was intended, swearing at each other and chasing each other around the screen.
"It was horrible!", she said, "It had the opposite social effect that the product was intended to have. But two different things that were useful came out of it. One of them was bit.ly, which was just a little way to share content in that tool."
Along with most of the presentations at Digital Preservation 2013, I captured this one on digital video and made it viewable here.
Following Hilary, Sarah Werner (@wykenhimself) of the Folger Shakespeare Library presented "Disembodying the Past to Preserve It". She spoke of collections of indulgences and how, because the physical items were not considered valuable, the ones that did survive were reused as waste paper and thus found in bindings of other saved items. "Being treated as disposable is how they survived." she said.
She continued by describing The Great Parchment Book, a collection of 165 leaves documenting a survey, compiled in 1639, of all of the estates managed by the city of London. The leaves were badly damaged by a fire in 1786.
"Through careful preservation about 50% of the text was recovered but the brittle wrinkled parchment remained an intractable obstacle for further work.", she said, "After extensive physical preservation work, the UCL [University College London] team was able to virtually un-wrinkle the pages."
She continued, "About 90% of the text of The Great Parchment Book is now readable and available for examination online as images of the leaves, enhanced images, or transcription of the text. In both of these cases, digitization makes available objects for study that would otherwise be restricted either because they're too fragile to handle or they're too dispersed to work with."
After a short break, Micah Altman (@drmaltman), the Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, formally announced the 2014 National Digital Stewardship Agenda and gave a brief on the document.
Describing why such a document is drawn up, he said "Effective digital stewardship is vital for maintaining authenticity of public records...and information on how to do it, what to do, and what's going on is distributed across practice, research, sectors, disciplines, communities of practice. There's a diversity of perspectives in organizations that are involved. So that sort of sounds like us." More on the reasoning for the document can be seen in the full video.
Following Micah, Leslie Johnston (@lljohnston) of Library of Congress introduced the next panel titled "Creative Approaches to Content Preservation", which "is only a panel in name.", she stated, commenting on the greater similarity of the format to a series of presentations with subsequent questions to the group rather than the traditional format.
Anne Wootton (@annewootton), one-half of Pop-Up Archive (the other being Bailey Smith (@baileyspace)), started the panel by first describing her organization, then "starting with the tale of an archive": the Kitchen Sisters, and her organization's work with them when approached with an "archival crisis". The Sisters had been working in public radio for decades, recording thousands of hours of sound, and had these recordings stored on a variety of media in a variety of places.
For their Master's thesis at Berkeley, Anne and Bailey surveyed the digital archiving and public media ecosystems to see if they could identify a solution that would meet the Kitchen Sisters' needs while keeping in mind the restrictions of resources, workflow and lack of technical proficiency. "We saw the need for an inexpensive tool", she said, "that could be used by oral history, archives, and media creators alike to store and/or create access to their materials safely and make it discoverable in a way that would be standardized with their industries." Their initial efforts were in creating plugins for Omeka.
Travis May of the Federal Reserve Bank of Saint Louis followed Anne, describing his work on FRED, an economic database with over 83,000 economic series from 57 different sources with a majority of the data coming from the United States.
Cal Lee of University of North Carolina at Chapel Hill was up next with "Taking Bitstream Seriously". "The category of dealing with everything we get is pretty large...relates to trying to be a little bit more systematic to try to deal with messy situations where we get this kind of media.", he said after referencing existing systematic tools for transfer like bag.it. Cal went on to describe his project, BitCurator (funded by the Andrew W. Mellon Foundation), which is soon to head into Phase II. "The main goals are to develop and disseminate a package and support open source tools that can help people apply digital forensics methods.", he said. "There are two main things that aren't traditionally addressed by the digital forensics field itself: building these things into library/archival workflows and supporting provisioning of public access to data."
The famed Jason Scott (@textfiles) of Archive Team took the stage next in patriotic attire (and his token black hat) and began, "I am the harbinger of death. I am the angel of death. I am the sad grim reaper that sits at the crossroads of your lost and dying dreams. I am the boatman on the River Styx who takes your hard drive from you and rides with you across the river to your utter destiny.", he continued, humorously, "When the handshakes no longer happen and when the smiles fade - that's where I am living. I am living in this world because I helped found something called Archive Team, archiveteam.org."
He went on to describe a few of Archive Team's recent projects, including saving all of Xanga (project page), for which the progress he described indicated that the preservation won't end well, and Snapjoy (now down, project page), for which he showed more hope due to the small number of users.
Jason emphasized that there are many online communities that are "real shifting sand" that have no guarantees or laws preventing them from going away. "The fundamental question", he said, "is 'Is an online presence a valid humanitarian concern?'".
"Unfortunately, we are now the victims of the 'brogrammer/journalist' complex, which has worked together to really convince us that the place to put all of our stuff is with people we don't know for reasons we don't know until they decide that they're done with us...or they've sold it to Google.", he said. "We have three virtues within Archive Team: Rage, Paranoia, and Kleptomania. So basically, we're very angry about these things going away, we have an enormous paranoia about things that might go away at any given time, and we take everything as fast as we can."
Jason went on with further allusions and anecdotes about Archive Team and their projects but the video (below) does his presentation better justice.
With the panel complete, a series of Lightning Talks followed.
William Ying of ARTstor presented "ARTstor Shared Shelf Preservation Plan Based on the NDSA Levels of Digital Preservation".
Abbie Grotke (@grotke) of Library of Congress presented "Content Working Group Case Studies".
David Brunton of Library of Congress presented "The Importance of Being Developers".
Cathy N. Hartman of University of North Texas presented "International Internet Preservation Consortium: Update".
Patrick Loughney of Library of Congress presented "The Library of Congress National Recording Preservation Plan".
Christina Drummond of University of North Texas presented "Your VO 'Lab' Results Are in: What NDSA Members Think of the NDSA".
With Yvonne closing up the lightning talks, Barrie Howard of NDIIPP excused the audience from the first day and encouraged everyone to view the poster session just outside of the main presentation room.
Lisa Green (@boudicca), Director of @commoncrawl, started off Day 2 (with an introduction by Bill LeFurgy) with her presentation "Digital Preservation for Machine-Scale Access and Analysis". Citing Hilary's work from Day 1, she said "By machine scale I believe we have to be doing digital preservation in such a way that enables us to do data science on information we're preserving."
Lisa continued by giving a history of how we have moved from the concept of archiving hardbound information to machine-readable information. "By the end of the 20th century, we had significantly increased our storage capacity. At this point, we were able to store and move around megabytes of data very easily. This was about the time that some of the really forward thinking people at Library of Congress started thinking about digital preservation. We can store so much information now that we needed a new unit to even wrap our heads around the amount of storage we have: a 'Library of Congress' worth of information."
She continued, describing how rare books are set up for display without being accessible (e.g., behind glass) and juxtaposing them with Google Books and Google NGram Viewer, in that the latter do not necessarily give direct access to information in the former. "We're not building a time capsule here. We're not putting things away so that they're safe for future generations and maybe we take a peek at them now and then.", she said, citing the Library of Congress' mission statement:
The Library's mission is to make its resources available and useful to the Congress and the American people and to sustain and preserve a universal collection of knowledge and creativity for future generations.
"To me, the first part is the most important part - To make its resources available and useful. What good is collecting all of the information if we're not pushing forward the boundaries of human knowledge? I would propose that some efforts in digital preservation are focused a little too much on the second part, to preserve and sustain, and not enough on the available and useful.", she said.
Emily Gore (@ncschistory) of Digital Public Library of America followed Lisa. She spoke of the various partners that have contributed data to the organization and said that "we free our data. Our data is your data. Our partners' data is your data. You can download the complete repository of data you've given us. Do with it what you will.", stating that, by default, partners' data is under a CC0 license.
As with the first day of the conference, a panel followed titled "Green Bytes: Sustainable Approaches to Digital Stewardship" with an introduction by Erin Engle (@erinengle).
Green Bytes Panel
With the completion of the panel, the crowd was given a half hour to preview the workshops to follow. The five workshops/sessions occurred simultaneously, each on a different topic.
While all were very relevant to our interests at WS-DL, I presented at the Web Archiving session and so cannot give a firsthand account of the others.
The Tools of the Trade: The Library of Congress Perspective session contained presentations titled "World Digital Library" by Sandy Bostian of Library of Congress; "Jukebox" by Sam Brylawski of University of California, Santa Barbara; and "Congress.gov" by Andrew Weber (@atweber) of the Law Library of Congress.
The Digital Curation Education and Curriculum session started with a presentation titled "National Digital Stewardship Residency Program" by Kris Nelson of Library of Congress, Bob Horton of IMLS, Andrea Goethals (@andreagoethals) of Harvard University, Jefferson Bailey (@jefferson_bail) of Metropolitan New York Library Council, and Prue Adler of Association of Research Libraries. The second presentation of the session was titled "Closing the Digital Curation Gaps: Getting Started Guide" by Helen Tibbo of UNC at Chapel Hill.
The Digital Preservation Tools session contained presentations titled "WGBH Media Library and Archives" by Karen Cariani of WGBH and "DSpace and Fedora Commons: A Comparison of Projects" by Wayne State University students.
The Managing Software Projects session was more panel-like, with David Brunton of Library of Congress, Lisa LaPlant of GPO, and Daniel Chudnov of George Washington University Libraries, moderated by Kate Zwaard (@kzwa) of Library of Congress.
Lunch was followed by a short break and then another set of simultaneous workshops and sessions.
The Web Archiving session contained presentations titled "WARCreate and WAIL" by Mat Kelly (@machawk1) of Old Dominion University and "DuraCloud and Archive-It Integration: Preserving Web Collections" by Carissa Smith of DuraCloud.
The Digital Preservation Services session contained presentations titled "Digital Preservation Network" by David Minor of UCSD and "Integrating Repositories for Research Data Sharing" by Stephen Abrams of UCC.
The Graduate Curriculum in Digital Preservation session was panel-like involving Jane Zhang of Catholic University, Anthony Cocciolo of Pratt Institute, Kara Van Malssen of AudioVisual Preservation Solutions and Jefferson Bailey of Metropolitan New York Library Council.
The Digital Stewardship Tools from the Library of Congress session contained presentations titled "EDeposit and DMS" by Anupama Rai and Laura Graham of Library of Congress, "NDNP/ChronAm" by David Brunton of Library of Congress and "Viewshare" by Camille Salas of Library of Congress.
The Project Pitching session required prior sign-up and involved three funding agencies: Institute of Museum and Library Services, National Historical Publications and Records Commission, and National Endowment for the Humanities.
The final session of the day was another panel, titled "Innovative Approaches to Digital Stewardship".
Innovative Approaches to Digital Stewardship
Amy Robinson of EyeWire
Rodrigo Davies of MIT Center for Civic Media
Aaron Straup Cope of Cooper-Hewitt Museum Labs
After Aaron's presentation, the three presenters fielded questions.
With the completion of the panel, the conference wrapped up and the crowd was adjourned.
— Mat