Keynote by Vint Cerf
The symposium opened with Dame Wendy Hall, Professor of Computer Science at the University of Southampton and Kluge Chair in Science and Technology at the Library of Congress. She mentioned that the Internet has only been available for a few decades and we need to preserve its openness and freedom, because that openness and freedom are always in peril. She used those points to introduce one of the inventors of TCP/IP, Vint Cerf.
vellum, a form of parchment created from animal skin that was once used to create fine quality documents. The goal is to not only capture the documents and data that make up the Internet, but also be able to recreate them in the distant future.
He highlighted a number of problems with the current Internet. URLs associated with domain names are not stable; a change in ownership or solvency of a company or organization can cause a domain name to stop responding.
“Unfortunately I think preserving all of the digital information on vellum would require a lot of goats.” -Vint Cerf #SaveTheWeb— Kate Zwaard (@kzwa) June 16, 2016
Copyright law needs to be amended to give rights and protections to archival organizations, like the Library of Congress. Serious questions about protections for archival and replay of archived content still exist.
At @KlugeCtr #SaveTheWeb conference, @VGCerf: In 22nd Century we may know more about Lincoln Admin. than Obama Admin pic.twitter.com/a4buzzivJl— Michael Nelson (@MikeNelson) June 16, 2016
He explained the importance of multiple web archives, noting historical issues with libraries and archives being lost to natural disasters or war. He thinks there is something "delicious" about the Library of Alexandria being a backup site for the Internet Archive. He said that we still have many of the artifacts of the ancient world purely by luck or accident, and that we can do better.
Copyright laws should be amended to give rights to preservation agencies like Library of Congress - per @vgcerf #SaveTheWeb— Lee Rainie (@lrainie) June 16, 2016
Vint Cerf: "We don't want preservation by accident" #SavetheWeb #hackarchives pic.twitter.com/HWThUavbQE— Shawn M. Jones (@shawnmjones) June 16, 2016
Talking about other digital objects, Vint Cerf then discussed efforts to preserve software. He mentioned the OLIVE preservation project for archiving and replaying executable content. They are currently looking into streaming virtual machines in order to replay old executables on modern systems. He did confess that we have a long way to go before we're able to reproduce old results in some cases.
OLIVE project at Carnegie Mellon a good model for archiving digital content - https://t.co/1cMSRwCohe - Touted by @vgcerf #savetheweb— Lee Rainie (@lrainie) June 16, 2016
Now Cerf on the OLIVE project, streaming VMs rather than downloading them – https://t.co/V87LTwvEOh. Discussing tech issues. #SaveTheWeb— Ian Milligan (@ianmilligan1) June 16, 2016
Vint Cerf mentioned the idea of the "self-archiving Web", indicating that we need collaboration, open design, and new business models. He said that, due to its success, the Internet could serve as a good source of lessons for how one would go about designing the self-archiving Web. Participating in the current Internet is done by just following the agreed-upon protocols. It works largely because of its modularity and its capacity for layered evolution. The self-archiving Web should also try to embrace these strengths.
He listed outstanding questions with the approach of archiving the Internet. He said that he had some issues with contemplating the idea of the Internet containing itself. Is there a better replacement for hyperlinks, given their deterioration? Should we be using something like DOIs instead? When should a snapshot be taken? How do we know when a change has occurred in a resource? How do we ensure that old formats, like old versions of HTML, will render well for future users? Do we store malware or encryption keys? How do we handle access control for resources?
We need a "self-archiving Web" -- @vgcerf backs idea from @timberners_lee. Need collaboration, open design, new biz models #savetheweb— Lee Rainie (@lrainie) June 16, 2016
#SaveTheWeb it's not an easy task - vint cerf --> web is temporal, so how do you make it permanent? pic.twitter.com/jQdos1ggma— Matthew Weber (@docmattweber) June 16, 2016
Vint Cerf: At what rate should we snapshot the WWW? Do we archive malware for historical purposes? #SaveTheWeb pic.twitter.com/dlwp5HRi9g— Shawn M. Jones (@shawnmjones) June 16, 2016
@vgcerf: "I'd like to have a clue that something is worth snapshotting rather than taking a million pictures" #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
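One simple way to get "a clue that something is worth snapshotting" is to compare a digest of the freshly fetched content against the digest of the last stored capture, and keep a new snapshot only when they differ. The sketch below is my own illustration of that idea, not something presented at the symposium; the `SnapshotStore` class and its method names are hypothetical, and it ignores real-world complications such as dynamic page elements that change on every fetch.

```python
import hashlib


class SnapshotStore:
    """Hypothetical in-memory archive: keeps a capture only when content changes."""

    def __init__(self):
        # Maps a URL to its list of (digest, content) captures, oldest first.
        self._snapshots = {}

    def maybe_snapshot(self, url: str, content: bytes) -> bool:
        """Store content if it differs from the latest capture; return True if stored."""
        digest = hashlib.sha256(content).hexdigest()
        history = self._snapshots.setdefault(url, [])
        if history and history[-1][0] == digest:
            return False  # unchanged since the last capture, nothing to store
        history.append((digest, content))
        return True

    def count(self, url: str) -> int:
        """Number of captures held for a URL."""
        return len(self._snapshots.get(url, []))
```

Fetching the same page a million times would then yield only as many captures as there were distinct versions observed.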
The other inventor of TCP/IP, Bob Kahn of the Corporation for National Research Initiatives (CNRI), was unable to attend. He was scheduled to present Digital Object Architecture (DOA), so Vint Cerf continued by presenting that work as well. He talked about the existing handle systems, such as DOI, where identifiers, rather than locations, are used for digital objects. These identifiers are then submitted to a resolution system that locates the object and then delivers it to the user. Of course, this resolution system is an additional layer of infrastructure that must be managed. There are quite a few handle system implementations, including those supported by the Library of Congress, CrossRef, and mEDRA.
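To make the identifier-versus-location distinction concrete, here is a minimal sketch of a resolution layer. It is only an illustration under my own assumptions: the `Resolver` class, the identifier `10.9999/example`, and the example URLs are all hypothetical, and this is not the actual Handle System protocol or API.

```python
class Resolver:
    """Hypothetical resolution service: maps persistent identifiers to current locations."""

    def __init__(self):
        self._locations = {}  # identifier -> current location (URL)

    def register(self, identifier: str, location: str) -> None:
        """Create or update the location recorded for an identifier."""
        self._locations[identifier] = location

    def resolve(self, identifier: str) -> str:
        """Return the current location for an identifier, or raise LookupError."""
        try:
            return self._locations[identifier]
        except KeyError:
            raise LookupError(f"unknown identifier: {identifier}") from None


resolver = Resolver()
resolver.register("10.9999/example", "http://old-host.example/paper.pdf")

# The object moves to a new host; only the resolver record changes.
resolver.register("10.9999/example", "http://new-host.example/paper.pdf")

# Anyone citing the identifier still reaches the object at its new home.
print(resolver.resolve("10.9999/example"))
```

The citation never changes when the object moves, but the resolver's records have to be kept up to date, which is the extra infrastructure burden mentioned above.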
He finished up by accepting a few questions from the audience. From these exchanges came additional insight - and funny moments - shown in the tweets below.
At @KlugeCtr #SaveTheWeb conf., @VGCerf pitch hits for Bob Kahn @cnri and explains Digital Object Architecture, DOA pic.twitter.com/mhFWqKHdJK— Michael Nelson (@MikeNelson) June 16, 2016
.@vgcerf at #SaveTheWeb: Once you’ve published on the world wide web, you’ve committed yourself to history.— Justin Littman (@justin_littman) June 16, 2016
Vint Cerf: build preservation into the norm rather than as the action of a few parties #SaveTheWeb pic.twitter.com/ux5BW4qMb4— Shawn M. Jones (@shawnmjones) June 16, 2016
Archives Unleashed Presentation
Ian Milligan is an Assistant Professor in the Department of History at the University of Waterloo. He discussed the use of web archives in studying history, highlighting how he used warcbase in a study of 186 million archived pages from geocities.com. He spoke about the importance of studying online communities in order to understand a period of history.
. @ianmilligan1 is discussing his study of 186 million URIs on https://t.co/bafMuJDF9u #SaveTheWeb #hackarchives pic.twitter.com/3qvOh4BRC1— Shawn M. Jones (@shawnmjones) June 16, 2016
@ianmilligan1 presenting Geocities as a use case for the usefulness of web archiving for historians. Thanks @textfiles #SavetheWeb— Leslie Johnston (@lljohnston) June 16, 2016
@ianmilligan1 Telling about the value of geocities web archiving. Thanks to the efforts of Archiveteam #SaveTheWeb— Todd Suomela (@tsuomela) June 16, 2016
Team The Mojitos worked on understanding how Obama's visit to Cuba was reported in Cuban media #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
Hid: studied the UK Elections 2015 on twitter, limited to MPs tweeting #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
Thanks for sharing the insights from #hackarchives projects at #savetheweb - here are some summaries pic.twitter.com/0b0kArH4eH— Katrin Weller (@kwelle) June 16, 2016
CounterTerrorism: used an ideology classifyer to classify text from web radio transcripts #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
So glad these will be online! Three minutes is really just a teaser! Really interesting projects! #SaveTheWeb https://t.co/YVF8KV6lkm— Stephanie A Kingsley (@KingsleySteph) June 16, 2016
VintCerf is impressed with what we have done at #hackarchives: Is this a completely new way of understanding our own knowledge? #SaveTheWeb— Shawn M. Jones (@shawnmjones) June 16, 2016
The Need for Preservation
Next was a series of presentations and panel discussion moderated by David Lazer (Northeastern University) with Abbie Grotke (Library of Congress), Jefferson Bailey (Internet Archive), Richard Marciano (University of Maryland), and Richard Price (British Library). Their topic was "The Need for Preservation". David Lazer started by discussing the curation of archives and posed the open question of how to determine which archived pages are valuable.
"The need for preservation" #SaveTheWeb now starting @librarycongress @KlugeCtr - webcast at https://t.co/65RliQfuMU pic.twitter.com/VOT9DLJHC3— NEH Pres Access (@NEH_PresAccess) June 16, 2016
David Lazer is a Professor in Political Science and Computer and Information Science at Northeastern University. He started the session by discussing the quality of what has been archived. He mentioned that digital media allows us to think of documents and data in a different way. He discussed the issues with finding useful information in Twitter data, due to the presence of bots and other sources of noise.
Jefferson Bailey is the Director of Web Archiving Programs for the Archive-It Team at the Internet Archive. He highlighted some statistics about its current holdings as well as talking about its multimodal crawling strategy including work with libraries. At the moment, researchers must develop problem-specific tools to work with the Internet Archive. They are currently gathering information on research interests in an attempt to create a set of general purpose tools for research.
Jefferson Bailey of the Internet Archive says their Web Archive is about to hit 50 billion Web captures #SaveTheWeb— John W. Kluge Center (@KlugeCtr) June 16, 2016
20yr anniversary of Internet Archive this yr, Jefferson Bailey #SaveTheWeb— Lizzy Williamson (@earlymodernpost) June 16, 2016
Richard Marciano leads the Digital Curation Innovation Center (DCIC) at the University of Maryland. He spoke about DCIC's work with big data and how it was related to digital archives. He finished up with some thought-provoking questions shown below.
Richard Price is the Head of Contemporary British Collections at the British Library. He discussed the mission of saving the web, and stressed that advocacy has always been important for libraries and archives. He mentioned that users are often the best advocates and that choosing the right language matters when advocating for web archiving; he prefers the term "time travel" because it seems to engender more interest from the public.
Price: One of the fantastic myths about the digital is that it is somehow free, and we internalize this #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
"Really really important that we behave non-virtually" -- meet in REAL places, says UK Librarian Richard Price #SaveTheWeb— Lee Rainie (@lrainie) June 16, 2016
Abbie Grotke is the web archiving team lead for the Library of Congress. She discussed the curated web archive maintained by the Library of Congress. They perform regular crawls of specific websites and use RSS feeds to inform their crawling. Currently the team is focused on acquiring web content, but they do not yet have the resources to make it all accessible. She said that there are challenges to archiving the web in the United States, because most sites do not sit under a country-specific top level domain.
LOC web archivists are very envious of Iceland's well-defined (and smaller) preservation mandate #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
Putting Data to Work
The next session was moderated by Dame Wendy Hall. The speakers for this session were Lee Rainie (Pew Research Center), Katy Börner (Indiana University Bloomington), James Hendler (Rensselaer Polytechnic Institute), and Philip E. Schreur (Stanford University).
Lee Rainie is the director of Internet, Science and Technology for the Pew Research Center. He stated that he was happy that so many large scale projects involving Internet data, and especially archived data, have a civic focus. He bemoaned the decline of civic news provided by newspapers, but said that librarians and archivists can play an important role in ensuring that civic information gets archived in web archives. He did warn that, though so many research projects acquire data from Twitter, only 20% of Americans use Twitter, meaning that many perspectives are lost.
Katy Börner is a Distinguished Professor of Information Science at the School of Informatics and Computing at Indiana University Bloomington. She discussed the exciting world of visualizing (web) science. She featured some of the work at scimaps.org, a site dedicated to visualizations of scientific data. I was surprised to see her highlight the "Clickstream Map of Science" that was "near and dear" to her, with which I was very familiar because it was created by Johan Bollen, Herbert Van de Sompel, and others as part of "Clickstream Data Yields High-Resolution Maps of Science". She mentioned the need to not only create tools for visualizing web data, but also the importance of pursuing information literacy so that many can use these tools as well.
James Hendler is Director of the Institute for Data Exploration and Applications and the Tetherless World Professor of Computer, Web and Cognitive Sciences at Rensselaer Polytechnic Institute (RPI). He is one of the originators of the Semantic Web. He discussed data and how important it is to ensure that the data we use for research is suitable for others to consume as well. He mentioned the importance of metadata for making sense of data in context, echoing earlier points made by Vint Cerf. He talked about the temporal nature of data and how accessing datasets at different points in time is in itself useful. I spoke to him during one of the breaks about work the LANL Prototyping Team has been doing in regards to temporal access to semantic web data.
Hendler: For the web itself, the URL system provides some kind of organizational structure #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
Hendler: Anything we're going to do about web data can't be in a taxonomy, it can't be a tree #SaveTheWeb— Meaghan Brown (@EpistolaryBrown) June 16, 2016
Philip Schreur is the Assistant University Librarian for Technical and Access Services at Stanford University Library. He discussed the issues of metadata and how libraries are engaged in a migration to linked data. He mentioned the importance of metadata in understanding historical context. He said that shifting from MARC and other metadata formats will be difficult, but necessary for the future of libraries. He sees a future where libraries will be creating metadata for the purposes of sharing it with the web. He also agrees that libraries will continue to curate data, but acquisition of content will be automated.
Philip E. Schreur mentioned the importance of transitioning from MARC to linked data #SaveTheWeb— Shawn M. Jones (@shawnmjones) June 16, 2016
"Metadata is data with an ulterior motive." - Philip Schruer #SaveTheWeb— Jaimie Murdock (@JaimieMurdock) June 16, 2016
Dame Wendy Hall
Dame Wendy Hall then began talking about where she would take libraries, emphasizing that it is data that patrons are looking for. That data may take the form of documents, datasets, etc., but is more than just articles. She mentioned that librarians need to become more data-savvy and that discovery will become more and more important.
The last session was moderated by Matthew Weber (Rutgers University). This session included Philip Napoli (Rutgers University), Ramesh Jain (University of California, Irvine), and Katrin Weller (GESIS Leibniz Institute for the Social Sciences and former Kluge Fellow in Digital Studies).
Matthew Weber is an Assistant Professor in the School of Communication and Information at Rutgers University. He began the session by talking about how web content changes and how it is possible to view the perceptions of a group in a specific point in time because of these changes.
Philip Napoli is a Professor of Journalism and Media Studies in the Rutgers School of Communication and Information. He began by echoing one of Vint Cerf's points: there is so much diverse content that it is more difficult to do a study of media from the early 2000s than it is to study media from the more distant past. He mentioned that there needs to be a focus on archiving local news because it is getting lost. It is also an area in which local libraries can participate.
Philip Napoli: “It’s still easier to do a study of news coverage from literally 1940 than it is from a year ago.” 😫 Seriously. #SaveTheWeb— Ian Milligan (@ianmilligan1) June 16, 2016
Ramesh Jain is a Professor at the Bren School of Information and Computer Sciences at the University of California, Irvine. His area of research includes multimedia information systems. He spoke about multimedia and how the growth in the number of cameras has created an unprecedented capability for capturing events. He mentioned how a change is occurring, in part thanks to social media, whereby now we are producing "visual documents" that contain text rather than textual documents that begrudgingly contain photos. He emphasized that we have begun not just creating a web of documents, but a "web of events".
A digital photo is a micro-report - a picture, datestamp, geolocation - almost enough to fully contextualize. #SavetheWeb— Leslie Johnston (@lljohnston) June 16, 2016
Katrin Weller is an information scientist working at the GESIS Leibniz Institute for the Social Sciences. She discussed the issue of context in social media. Will present hashtags have any meaning in the future? She mentioned that future historians may use past instructional texts, like "Twitter for Dummies", to understand how our current tools are used. In some cases, it is important to understand that people change social media accounts over time.
Conclusion by Dame Wendy Hall
Dame Wendy Hall concluded the symposium by discussing the growth of the Internet and how it has changed the world. Her group at the University of Southampton hosts the Web Science Trust, with the goal of facilitating the development of Web Science. She explained that while libraries will be maintaining physical collections, data has also become important to researchers, requiring librarians to learn new data science skills. This led her to introduce the Web Observatory, a place to share and link datasets so that researchers can answer questions about the web. The goal is to have metadata in a standard format that will support discovery, but also allow libraries to share each other's data rather than having to collect all of it themselves.
Thoughts and Thanks
All in all, this was an excellent experience and I am glad I attended. I was able to make contact with some of the best minds from a variety of fields while learning about their really fascinating work.
Thanks to Vint Cerf, Ian Milligan, David Lazer, Abbie Grotke, Jefferson Bailey, Richard Marciano, Richard Price, Lee Rainie, Katy Börner, James Hendler, Philip E. Schreur, Dame Wendy Hall, Matthew Weber, Philip Napoli, Ramesh Jain, and Katrin Weller for the excellent thought-provoking presentations.
Thanks to Matthew Weber, Ian Milligan, Jimmy Lin, Noshir Contractor, David Lazer, Wendy Hall, Nicholas Taylor, and Jefferson Bailey for making Archives Unleashed a reality and connecting it to the Save the Web Symposium. Also, thanks to all of the Archives Unleashed attendees who made the experience quite rewarding.
And final thanks go to Dame Wendy Hall and the John W. Kluge Center at the Library of Congress for hosting the event.
Thanks much for tweets from @DameWendyDBE, @EpistolaryBrown, @joecar25, @kwelle, @nullhandle, @websitemgmt, @lljohnston, @tsuomela, @jesseajohnston, @kzwa, @jillreillyjames, @lrainie, @ianmilligan1, @KlugeCtr, @NEH_PresAccess, @KingsleySteph, @justin_littman, @jahendler, @MikeNelson, @docmattweber, @raiminetinati, @earlymodernpost
Many others have also written articles about this event, including:
- Stephanie Kingsley for the American Historical Association
- Jason Steinhauer for Library of Congress
- Blog Post by Web Science Trust
- Storify by Web Science Trust