2016-06-27: Symposium on Saving The Web at the Library of Congress

On June 16, 2016, the Library of Congress hosted a one-day symposium entitled Saving the Web: The Ethics and Challenges of Preserving What's on the Internet. The symposium came at the end of the Archives Unleashed 2.0 Web Archive Datathon. The Datathon itself is covered in an earlier report. In addition to presenting the results of datathon projects, the wide variety of speakers at the symposium defended the need for preservation, discussed the special issues associated with preservation of data, and finally highlighted the importance of, and concepts surrounding, the preservation of multimedia.

Keynote by Vint Cerf

The symposium opened with Dame Wendy Hall, Professor of Computer Science at the University of Southampton and Kluge Chair in Science and Technology at the Library of Congress. She mentioned that the Internet has only been available for a few decades and we need to preserve its openness and freedom, because that openness and freedom are always in peril.  She used those points to introduce one of the inventors of TCP/IP, Vint Cerf.

Vint Cerf talked about the instability of the current Internet. He introduced the idea of "Digital Vellum", referring to vellum, a form of parchment created from animal skin that was once used to create fine quality documents. The goal is to not only capture the documents and data that make up the Internet, but also be able to recreate them in the distant future.

He highlighted a number of problems with the current Internet. URLs associated with domain names are not stable; a change in ownership or solvency of a company or organization can cause a domain name to stop responding.

Digital objects also require a lot of metadata to be captured in addition to their original content, so that the context surrounding them can be understood. Users need enough information to correctly interpret the content that has been preserved because its context may be lost to time.

Copyright law needs to be amended to give rights and protections to archival organizations, like the Library of Congress. Serious questions about protections for archival and replay of archived content still exist.

He explained the importance of multiple web archives, noting historical issues with libraries and archives being lost to natural disasters or war. He thinks there is something "delicious" about the Library of Alexandria being a backup site for the Internet Archive. He said that we still have many of the artifacts of the ancient world purely by luck or accident, and that we can do better.

Talking about other digital objects, Vint Cerf then discussed efforts to preserve software. He mentioned the OLIVE preservation project for archiving and replaying executable content.  They are currently looking into streaming virtual machines in order to replay old executables on modern systems. He did confess that we have a long way to go before we're able to reproduce old results in some cases.

Vint Cerf mentioned the idea of the "self-archiving Web", indicating that we need collaboration, open design, and new business models. He said that, due to its success, the Internet could serve as a good source of lessons for how one would go about designing the self-archiving Web. Participating in the current Internet is done by just following the agreed-upon protocols. It works largely because of its modularity and its capacity for layered evolution. The self-archiving Web should also try to embrace these strengths.

He listed outstanding questions with the approach of archiving the Internet. He said that he had some issues with contemplating the idea of the Internet containing itself. Is there a better replacement for hyperlinks, given their deterioration? Should we be using something like DOIs instead? When should a snapshot be taken? How do we know when a change has occurred in a resource? How do we ensure that old formats, like old versions of HTML, will render well for future users? Do we store malware or encryption keys? How do we handle access control for resources?

The other inventor of TCP/IP, Bob Kahn of the Corporation for National Research Initiatives (CNRI), was unable to attend. He was scheduled to present Digital Object Architecture (DOA), so Vint Cerf continued by presenting that work as well. He talked about the existing handle systems, such as DOI, where identifiers, rather than locations, are used for digital objects. These identifiers are then submitted to a resolution system that locates the object and then delivers it to the user. Of course, this resolution system is an additional layer of infrastructure that must be managed. There are quite a few handle system implementations, including those supported by the Library of Congress, CrossRef, and mEDRA.
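The indirection described above can be illustrated with a small sketch. It is not the Handle System or DOI API: the `Resolver` class, the identifier `10.9999/example.1`, and both URLs are hypothetical, chosen only to show how an identifier can stay fixed while the location behind it changes.

```python
class Resolver:
    """Toy resolution layer mapping persistent identifiers to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        # Registering the same identifier again replaces its location.
        self._locations[identifier] = url

    def resolve(self, identifier):
        # Look up the current location; unknown identifiers are an error.
        try:
            return self._locations[identifier]
        except KeyError:
            raise LookupError(f"unknown identifier: {identifier}") from None


resolver = Resolver()
resolver.register("10.9999/example.1", "http://old-host.example/paper.pdf")
first = resolver.resolve("10.9999/example.1")

# The object moves to a new host; only the resolver record changes.
resolver.register("10.9999/example.1", "http://new-host.example/archive/paper.pdf")
second = resolver.resolve("10.9999/example.1")

print(first)
print(second)
```

Anyone who cited the identifier keeps a working reference after the move, which is the property plain URLs lack. The cost is the one noted above: the resolver's own records become infrastructure that someone has to keep running and up to date.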

He finished up by accepting a few questions from the audience. From these exchanges came additional insight - and funny moments - shown in the tweets below.

Archives Unleashed Presentation

Ian Milligan is an Assistant Professor in the Department of History at the University of Waterloo. He discussed the use of web archives in studying history, highlighting how he used warcbase in a study of 186 million archived pages from geocities.com. He spoke about the importance of studying online communities in order to understand a period of history.

Then Ian and some team representatives presented our hard work from the Archives Unleashed Datathon. We had worked on a variety of projects with different datasets. A lot of natural language processing combined with temporal metadata and modeling allowed our groups to study sentiment in elections, uncover differences in media reporting based on country, discover documents related to terrorism, and more.

The Need for Preservation

Next was a series of presentations and a panel discussion moderated by David Lazer (Northeastern University) with Abbie Grotke (Library of Congress), Jefferson Bailey (Internet Archive), Richard Marciano (University of Maryland), and Richard Price (British Library). Their topic was "The Need for Preservation". David Lazer started by discussing the curation of archives and posed the open question of how to determine which archived pages are valuable.

David Lazer

David Lazer is a Professor in Political Science and Computer and Information Science at Northeastern University. He started the session by discussing the quality of what has been archived. He mentioned that digital media allows us to think of documents and data in a different way. He discussed the issues with finding useful information in Twitter data, due to the presence of bots and other sources of noise.

Jefferson Bailey

Jefferson Bailey is the Director of Web Archiving Programs for the Archive-It Team at the Internet Archive. He highlighted some statistics about the Internet Archive's current holdings and talked about its multimodal crawling strategy, including work with libraries. At the moment, researchers must develop problem-specific tools to work with the Internet Archive. The team is currently gathering information on research interests in an attempt to create a set of general-purpose tools for research.

Richard Marciano

Richard Marciano leads the Digital Curation Innovation Center (DCIC) at the University of Maryland. He spoke about DCIC's work with big data and how it was related to digital archives. He finished up with some thought-provoking questions shown below.

Richard Price

Richard Price is the Head of Contemporary British Collections at the British Library. He discussed the mission of saving the web, and stressed that advocacy has always been important for libraries and archives. He mentioned that users are often the best advocates and that choosing the right language matters when advocating for web archiving. He prefers the term "time travel" because it seems to engender more interest from the public.

Abbie Grotke

Abbie Grotke is the web archiving team lead for the Library of Congress. She discussed the curated web archive maintained by the Library of Congress. They perform regular crawls of specific websites and use RSS feeds to inform their crawling. Currently the team is focused on acquiring web content, but they do not yet have the resources to make it all accessible. She said that there are challenges to archiving the web in the United States, because most sites do not sit under a country-specific top-level domain. The question and answer session afterwards brought up a number of good thoughts. What are the ethics of archiving? Many archives have a national focus, but many topics are international; how do we curate topics so that they are available across archives? Do people have a right to be forgotten?

Putting Data to Work

The next session was moderated by Dame Wendy Hall. The speakers for this session were Lee Rainie (Pew Research Center), Katy Börner (Indiana University Bloomington), James Hendler (Rensselaer Polytechnic Institute), and Philip E. Schreur (Stanford University).

Lee Rainie

Lee Rainie is the director of Internet, Science and Technology for the Pew Research Center. He stated that he was happy that so many large-scale projects involving Internet data, and especially archived data, have a civic focus. He bemoaned the decline of civic news provided by newspapers, but said that librarians and archivists can play an important role in ensuring that civic information gets archived in web archives. He did warn that, though so many research projects acquire data from Twitter, only 20% of Americans use Twitter, meaning that many perspectives are lost.

Katy Börner

Katy Börner is a Distinguished Professor of Information Science at the School of Informatics and Computing at Indiana University Bloomington. She discussed the exciting world of visualizing (web) science. She featured some of the work at scimaps.org, a site dedicated to visualizations of scientific data. I was surprised to see her highlight the "Clickstream Map of Science" that was "near and dear" to her, with which I was very familiar because it was created by Johan Bollen, Herbert Van de Sompel, and others as part of "Clickstream Data Yields High-Resolution Maps of Science". She mentioned the need to not only create tools for visualizing web data, but also the importance of pursuing information literacy so that many can use these tools as well.

James Hendler

James Hendler is Director of the Institute for Data Exploration and Applications and the Tetherless World Professor of Computer, Web and Cognitive Sciences at Rensselaer Polytechnic Institute (RPI). He is one of the originators of the Semantic Web. He discussed data and how important it is to ensure that the data we use for research is suitable for others to consume as well. He mentioned the importance of metadata for making sense of data in context, echoing earlier points made by Vint Cerf. He talked about the temporal nature of data and how accessing datasets at different points in time is in itself useful. I spoke to him during one of the breaks about work the LANL Prototyping Team has been doing with regard to temporal access to semantic web data.

Philip Schreur

Philip Schreur is the Assistant University Librarian for Technical and Access Services at Stanford University Library. He discussed the issues of metadata and how libraries are engaged in a migration to linked data. He mentioned the importance of metadata in understanding historical context. He said that shifting from MARC and other metadata formats will be difficult, but necessary for the future of libraries. He sees a future where libraries will be creating metadata for the purposes of sharing it with the web. He also agrees that libraries will continue to curate data, but acquisition of content will be automated.

Dame Wendy Hall

Dame Wendy Hall then began talking about where she would take libraries, emphasizing that it is data that patrons are looking for. That data may take the form of documents, datasets, etc., but is more than just articles. She mentioned that librarians need to become more data-savvy and that discovery will become more and more important.

Saving Media

The last session was moderated by Matthew Weber (Rutgers University). This session included Philip Napoli (Rutgers University), Ramesh Jain (University of California, Irvine), and Katrin Weller (GESIS Leibniz Institute for the Social Sciences and former Kluge Fellow in Digital Studies).

Matthew Weber

Matthew Weber is an Assistant Professor in the School of Communication and Information at Rutgers University. He began the session by talking about how web content changes and how it is possible to view the perceptions of a group at a specific point in time because of these changes.

Philip Napoli

Philip Napoli is a Professor of Journalism and Media Studies in the Rutgers School of Communication and Information. He began by echoing one of Vint Cerf's points: there is so much diverse content that it is more difficult to do a study of the early 2000s than it is to study media from the past. He mentioned that there needs to be a focus on archiving local news because it is getting lost. It is also an area that local libraries can participate in.

Ramesh Jain

Ramesh Jain is a Professor at the Bren School of Information and Computer Sciences at the University of California, Irvine. His area of research includes multimedia information systems. He spoke about multimedia and how the growth in the number of cameras has created an unprecedented capability for capturing events. He mentioned how a change is occurring, in part thanks to social media, whereby now we are producing "visual documents" that contain text rather than textual documents that begrudgingly contain photos. He emphasized that we have begun not just creating a web of documents, but a "web of events".

Katrin Weller

Katrin Weller is an information scientist working at the GESIS Leibniz Institute for the Social Sciences. She discussed the issue of context in social media. Will present hashtags have any meaning in the future? She mentioned that future historians may use past instructional texts, like "Twitter for Dummies", to understand how our current tools are used. In some cases, it is important to understand that people change social media accounts over time.

Conclusion by Dame Wendy Hall

Dame Wendy Hall concluded the symposium by discussing the growth of the Internet and how it has changed the world. Her group at the University of Southampton hosts the Web Science Trust, with the goal of facilitating the development of Web Science. She explained that while libraries will be maintaining physical collections, data has also become important to researchers, requiring librarians to learn new data science skills. This led her to introduce Web Observatory, a place to share and link datasets so that researchers can answer questions about the web. The goal is to have metadata in a standard format that will support discovery, but also allow libraries to share each other's data rather than having to collect all of it themselves.

Thoughts and Thanks

All in all, this was an excellent experience and I am glad I attended. I was able to make contact with some of the best minds from a variety of fields while learning about their really fascinating work.

Thanks to Vint Cerf, Ian Milligan, David Lazer, Abbie Grotke, Jefferson Bailey, Richard Marciano, Richard Price, Lee Rainie, Katy Börner, James Hendler, Philip E. Schreur, Dame Wendy Hall, Matthew Weber, Philip Napoli, Ramesh Jain, and Katrin Weller for the excellent thought-provoking presentations.

Thanks to Matthew Weber, Ian Milligan, Jimmy Lin, Noshir Contractor, David Lazer, Wendy Hall, Nicholas Taylor, and Jefferson Bailey for making Archives Unleashed a reality and connecting it to the Saving the Web Symposium. Also, thanks to all of the Archives Unleashed attendees who made the experience quite rewarding.

And final thanks go to Dame Wendy Hall and the John W. Kluge Center at the Library of Congress for hosting the event.

Thanks much for tweets from @DameWendyDBE, @EpistolaryBrown, @joecar25, @kwelle, @nullhandle, @websitemgmt, @lljohnston, @tsuomela, @jesseajohnston, @kzwa, @jillreillyjames, @lrainie, @ianmilligan1, @KlugeCtr, @NEH_PresAccess, @KingsleySteph, @justin_littman, @jahendler, @MikeNelson, @docmattweber, @raiminetinati, @earlymodernpost

Many others have also written articles about this event.

-- Shawn M. Jones