Friday, July 25, 2014

2014-07-25: Digital Preservation 2014 Trip Report

On July 22 and 23, 2014, Dr. Michael Nelson (@phonedude_mln) and I (@machawk1) attended Digital Preservation 2014 in Washington, DC. This was my fourth consecutive NDIIPP (@ndiipp) / NDSA (@ndsa2) meeting (see trip reports from Digital Preservation 2011, 2012, 2013). With the largest attendance yet (300+) and compressed into two days, the schedule was jam-packed with interesting talks. Per usual, videos for most of the presentations are included inline below.

Day One

Micah Altman (@drmaltman) led the presentations with information about the NDSA and asked, regarding Amazon's claim of 99.999999999% (eleven nines) durability for its storage, "What do the eleven nines mean?" "There are a number of risks that we know about [as archivists] that Amazon doesn't," he said, continuing, "No single institution can account for all the risks." Micah spoke about the updated National Agenda for Digital Stewardship, which is to have the theme of "developing evidence for collective action".
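As an aside (my own back-of-the-envelope illustration, not something from the talk), reading eleven nines as an annual per-object durability figure shows why the number alone says so little about institutional risk:

durability = 0.99999999999               # "eleven nines"
annual_loss_prob = 1 - durability        # ~1e-11 per object per year
objects = 1_000_000_000                  # a hypothetical billion-object collection
print(objects * annual_loss_prob)        # ~0.01 expected objects lost per year

The arithmetic assumes independent, uncorrelated losses, which is exactly the assumption that institution-level, correlated risks break.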

Matt Kirschenbaum (@mkirschenbaum) followed Micah with his presentation Software, It’s a Thing, with reference to the recently excavated, infamous Atari E.T. game. "Though the data on the tapes had been available for years, the uncovering was more about aura and allure." Matt spoke about George R. R. Martin's (of Game of Thrones fame) recent interview in which he stated that he still uses a word processing program called WordStar on an un-networked DOS machine and that it is his "secret weapon" against distraction and viruses. In the 1980s, WordStar dominated the market until WordPerfect took over, followed by Microsoft Word. "A power user who has memorized all of the WordStar commands could utilize the software with the ease of picking up a pencil and starting to write."

Matt went on to talk (also see his Medium post) about software as different concepts, including as an asset, as an object, as a kind of notation or score (qua music), as shrinkwrap, etc. For a full explanation of each, see his presentation:

Following Matt, Shannon Mattern (@shannonmattern) shared her presentation on Preservation Aesthetics. "One of preservationists' primary concerns is whether an item has intrinsic value," she said. Shannon then spoke about various sorts of auto-destructive software and art, including art that is light sensitive and software that deletes itself on execution. In addition to recording her talk (see below), she graciously included the full text of her talk online.

The conference then had a brief break and a quick summary of the poster session to come later in day 1. After the break, there was a panel titled "Stewarding Space Data". Hampapuram Ramapriyan of NASA began the panel stating that "NASA data is available as soon as the satellites are launched." He continued, "This data is preserved...We shouldn't lose bits that we worked hard to collect and process for results." He also stated that he (and NASA) is part of the Earth Science Information Partners (ESIP) Data Stewardship Committee.

Deirdre Byrne of NOAA then presented, speaking of her group's dynamics and the further need to document data, preserve it with provenance, provide IT support to maintain the data's integrity, and be able to work with the data in the future (a digital preservation plus). Deirdre then referenced Pathfinder, a technology that allows the visualization of sea surface temperatures among other features like coral bleaching, fish stocks, and coastal water quality. Citing its use now as a de facto standard for this purpose, she described the physical setup as having 8 different satellites for its functionality along with 31 mirror satellites on standby for redundancy.

Emily Frieda Shaw (@emilyfshaw) of University of Iowa Libraries followed Deirdre on the panel and spoke about Iowa's role in preserving the original data from the Explorer I launch. Upon converting and analyzing the data, her fellow researchers realized that at certain altitudes the radiation detection dropped to zero, which indicated that there were large belts of particles surrounding the Earth (later recognized as the Van Allen belts). After discovering more data about the launch in a basement, the group received a grant to recover and restore the badly damaged and moldy documentation.

Karl Nilsen (@KarlNilsen) and Robin Dasler (@rdasler) of University of Maryland Libraries were next, with Robin first talking about her concern with data-related issues in the library realm. She referenced one project whose data still resided at the University of Hawaii's Institute for Astronomy because it was the home institution of one of the original researchers. The project consisted of data measuring the distances between galaxies, compiled from various sources of both observational data and standard corpora. To display the data (500 gigabytes total), the group developed a web UI backed by technologies like MySQL to make the data more accessible. "Researchers were concerned about their data disappearing before they retired," she stated about the original motive for increasing the data's accessibility.

Karl changed topics somewhat by stating that two different perspectives can be taken about data from a preservation standpoint (format-centric or system-centric). "The intellectual value of a database comes from ad hoc combination from multiple tables in the form of joins and selections," he said. "Thinking about how to provide access," he continued, "is itself preservation." He followed this with approaches including utilizing virtual machines (VMs), migrating from one application to a more modern web application, and collapsing the digital preservation horizon to ten years at a time.

Following a brief Q&A for the panel was a series of lightning talks. James (Jim) A. Bradley of Ball State University started with his talk, "Beyond the Russian Doll Effect: Reflexivity and the Digital Repository Paradigm", where he spoke about promoting and sharing digital assets for reuse. Jim then talked about the Digital Media Repository (DMR), which allowed information to be shared and made available at the page level. His group had the unique opportunity to tell what materials are in the library, who was using them, and when. From these patterns, grad students made 3-D models, which were then added and attributed to the original objects.

Dave Gibson (@davegilbson) of the Library of Congress followed Jim by presenting Video Game Source Disc Preservation. He stated that his group has been the "keepers of The Game" since 1996 and has source code excerpts from a variety of unreleased games, including Duke Nukem Critical Mass, which was released for Nintendo DS but not PlayStation Portable (PSP), despite being developed for both platforms. In their exploration of the game, they uncovered 28 different file formats on the discs, many of which were proprietary, and wondered how they could get access to the files' contents. After using MediaCoder to convert many of the audio files and hex editors to read the ASCII, his group found source code fragments hidden within the files. From this work, they now have the ability to preserve unreleased games.

Rebecca (Becky) Katz of the Council of the District of Columbia was next with UELMA-Compliant Preservation: Questions and Answers?. She first described UELMA, an act declaring that if a U.S. state adopts it and its digital legal publications are official, the state must make sure that those publications are preserved in some way. Because of this requirement, many states are reluctant to rely solely on digital documents and instead keep paper copies in addition to the digital copies. Many of the barriers in the preservation process for the states lie in how they ought to preserve the digital documents. "For long term access," she said, "we want to be preserving our digital content." Becky also spoke about appropriate formats for preservation and noted that common features of formats, like authentication for PDF, are not open source. "All the metadata in the world doesn't mean anything if the data is not usable," she said. "We want to have a user interface that integrates with the [preservation] repository." She concluded by recommending that states adopting UELMA have agreements with universities or other long-standing institutions to allow users to download updates of the documents, ensuring that many copies are available.

Kate Murray of the Library of Congress followed Becky with "We Want You Just the Way You Are: The What, Why and When of Fixity in Digital Preservation". She referenced the "Migrant Mother" photo and how, through moving the digital photo from one storage component to another, subtle changes had been introduced. "I never realized that there are three children in the photo!", she said, referencing the comparison between the altered and original photos. To detect these changes, she uses fixity checks (related article on The Signal) across a whole collection of data, which ensure bit-level integrity.
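For those unfamiliar with the mechanics, a fixity check typically amounts to computing cryptographic checksums for every file in a collection and comparing them against a previously recorded manifest. A minimal Python sketch (my own illustration, not a tool from the talk):

import hashlib
from pathlib import Path

def fixity_manifest(collection_dir):
    # Record a SHA-256 checksum for every file in the collection.
    manifest = {}
    for path in sorted(Path(collection_dir).rglob("*")):
        if path.is_file():
            manifest[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

def changed_files(old_manifest, new_manifest):
    # Any file whose checksum differs (or is missing) has lost bit-level integrity.
    return [p for p, digest in new_manifest.items() if old_manifest.get(p) != digest]

Comparing a fresh manifest against one recorded before a storage migration surfaces exactly the kind of subtle, silent change Kate described.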

Following Kate, Krystyna Ohnesorge of the Swiss Federal Archives (SFA) presented "Save Your Databases Using SIARD!". "About 90 percent of specialized applications are based on relational databases," she said. SIARD is used to preserve database content for the long term so that the data can still be understood in 20 or 50 years. The system is already used in 54 countries, with 341 licenses currently existing for different international organizations. "If an institution needs to archive relational databases, don't hesitate to use the SIARD suite and SIARD format!"

The conference then broke for lunch where the 2014 Innovation Awards were presented.

Following lunch, Cole Crawford (@coleinthecloud) of the Open Compute Foundation presented "Community Driven Innovation", where he spoke about Open Compute being an internationally based open source project. "Microsoft is moving all Azure data to Open Compute", he said. "Our technology requirements are different. To have an open innovation system that's reusable is important." He then emphasized that his talk would focus specifically on the storage aspects of Open Compute. He started with "FACT: The 19 inch server rack originated in the railroad industry, then propagated to the music industry, then it was adopted by IT." He continued, "One of the most interesting things Facebook has done recently is move from tape storage to ten thousand Blu-ray discs for cold storage". He stated that in 2010, the Internet consisted of 0.8 zettabytes. In 2012, this number was 3.0 zettabytes, and by 2015, he claimed, it will be 40 zettabytes in size. "As stewards of digital data, you guys can be working with our project to fit your requirements. We can be a great engine for procurement. As you need more access to content, we can get that."

After Cole was another panel titled "Community Approaches to Digital Stewardship". Fran Berman (@resdatall) of Rensselaer Polytechnic Institute started off with reference to the Research Data Alliance. "All communities are working to develop the infrastructure that's appropriate to them," she said. "If I want to worry about asthma (referencing an earlier comment about whether asthma is more likely to develop in Mexico City versus Los Angeles), I don't want to wait years until the infrastructure is in place. If you have to worry about data, that data needs to live somewhere."

Bradley Daigle (@BradleyDaigle) of the University of Virginia followed Fran and spoke about the Academic Preservation Trust, a group of 17 members that takes a community-based approach, attempting to have not just more solutions but better solutions. The group would like to create a business model based on preservation services. "If you have X amount of data, I can tell you it will take Y amount of cost to preserve that data," he said, describing an ideal model. "The AP Trust can serve as a scratch space with preservation being applied to the data."

Following Bradley on the panel, Aaron Rubinstein of the University of Massachusetts Amherst described his organization's scheme as being similar to Scooby Doo, iterating through each character displayed on-screen and stating the name of a member organization. "One of the things that makes our institution privileged is that we have administrators who understand the need for preservation."

The last presenter in the panel, Jamie Schumacher of Northern Illinois University, started with "Smaller institutions have challenges when starting digital preservation. Instead of obtaining an implementation grant when applying to the NEH, we got a 'Figure it Out' grant. ... Our mission was to investigate a handful of digital preservation solutions that were affordable for organizations with restricted resources, like small staff sizes and those with lone rangers. We discovered that practitioners are overwhelmed. To them, digital objects are a foreign thing." Some of the roadblocks her team eliminated were the questions of which tools and services to use for preservation tasks, for which Google frequently gave poor or too many results.

Following a short break, the conference split into five different concurrent workshops and breakout sessions. I attended the session titled Next Generation: The First Digital Repository for Museum Collections where Ben Fino-Radin (@benfinoradin) of Museum of Modern Art, Dan Gillean of Artefactual Systems and Kara Van Malssen (@kvanmalssen) of AVPreserve gave a demo of their work.

As I was presenting a poster at Digital Preservation 2014, I was unable to stay for the second presentation in the session, Revisiting Digital Forensics Workflows in Collecting Institutions by Marty Gengenbach of Gates Archive, as I was required to set up my poster. At 5 o'clock, the breakout sessions ended and a reception with the poster session was held in the main area of the hotel. My poster, "Efficient Thumbnail Summarization for Web Archives", is an implementation of Ahmed AlSum's initial work published at ECIR 2014 (see his ECIR Trip Report).

Day Two

The second day started off with breakfast and an immediate set of workshops and breakout sessions. Among these, I attended Digital Preservation Questions and Answers from the NDSA Innovation Working Group, where Trevor Owens (@tjowens), among other group members, introduced the history and migration of an online digital preservation question-and-answer system. The current site, residing at http://qanda.digipres.org, is the result of migration from previous attempts, including a failed try at a Digital Preservation Stack Exchange. This work, completed in part by Andy Jackson (@anjacks0n) of the UK Web Archive, began with his Zombse project, which extracted all of the initial questions and data from the failed Stack Exchange into a format that would eventually be readable by another Q&A system.

Following a short break after the first long set of sessions, the conference re-split into the second set of breakout sessions for the day, where I attended the session titled Preserving News and Journalism. Aurelia Moser (@auremoser) moderated this panel-style presentation and initially showed a URI where the presentation's resources could be found (I typed bit.ly/1klZ4f2 but that seems to be incorrect).

The panel, consisting of Anne Wootton (@annewooton), Leslie Johnston (@lljohnston), and Edward McCain (@e_mccain) of the Reynolds Journalism Institute, initially asked, "What is born digital and what are news apps?". The group had sent a survey to 476 news organizations, consisting of 406 hybrid organizations (those that put content in print and online) and 70 online-only publications.

From the results, the surveyors asked what issues arose in the responses, having kept the answers open ended so the journalists could give an accurate account of their practices. "Newspapers that are in chains are more likely to have written policies for preservation. The smaller organizations are where we're seeing data loss." At one point, Anne Wootton's group organized a "Design-a-Thon" where they gathered journalists, archivists, and developers. Regarding preservation practice, the group stated that Content Management System (CMS) vendors for news outlets hold the keys to the kingdom for newspapers in regard to preservation.

After the third breakout session of the conference, lunch was served (Mexican!) with Trevor Owens of the Library of Congress, Amanda Brennan (@continuants) of Tumblr, and Trevor Blank (@trevorjblank) of The State University of New York at Potsdam giving a preview of CURATECamp, to occur the day after the conference concluded. While lunch proceeded, a series of lightning talks was presented.

The first lightning talk was by Kate Holterhoff (@KateHolterhoff) of Carnegie Mellon University and titled Visual Haggard and Digitizing Illustration. In her talk, she introduced Visual Haggard, a catalog of many images from public domain books and documents that attempts to provide better quality representations of the images in these documents than other online systems like Google Books. "Digital archivists should contextualize and mediate access to digital illustrations", she said.

Michele Kimpton of DuraSpace followed Kate with DuraSpace and Chronopolis Partner to Build a Long-term Access and Preservation Platform. In her presentation she introduced tools like Chronopolis (used for dark archiving) and DuraCloud, and her group's approach toward getting various open source tools to work together to provide a more comprehensive solution for preserving digital content.

Following Michele, Ted Westervelt of the Library of Congress presented Library of Congress Recommended Formats, where he referenced the Best Edition Statement, a largely obsolete but useful document that needed supplementation to account for modern best practice and newer mediums. His group has developed the "Recommended Format Specification", a work-in-progress that provides this without superseding the original document. The goal of the specification is to set parameters for target objects so that most content currently unhandled by the in-place specification will have directives to ensure that digital preservation of the objects is guided and easy.

After Ted, Jane Zhang of the Catholic University of America presented Electronic Records and Digital Archivists: A Landscape Review, a cursory study of employment positions for digital archivists, both formally trained and trained on-the-job. She attempted to answer the question "Are libraries hiring digital archivists?" and tried to discern a pattern from one hundred job descriptions.

After the lightning talks, another panel was held, titled "Research Data and Curation". Inna Kouper (@inkouper) and Dharma Akmon (@DharmaAkmon), of Indiana University and the University of Michigan, Ann Arbor, respectively, discussed Sustainable Environment Actionable Data (SEAD, @SEADdatanet) and a Research Object Framework (ROF) for the data that will be very familiar to practitioners working with video producers. "Research projects are bundles," they said. "The ROF captures multiple aspects of working with data, including unique IDs, agents, states, relationships, and content, and how they cyclically relate. Research objects change states."

They continued, "Curation activities are happening from the beginning to the end of an object's lifecycle. An object goes through three main states", they listed, "Live objects are in a state of flux, handled by members of project teams, and their transition is initiated by the intent to publish. Curation objects consist of content packaged using the BagIt protocol with metadata, and relationships via OAI/ORE maps, which are mutable but allow selective changes to metadata. Finally, publication objects are immutable and citable via a DOI and have revisions and derivations tracked."

Ixchel Faniel from OCLC Research then presented, stating that there are three perspectives for archaeological practice. "The data has a lifecycle from the data collection, data sharing, and data reuse perspective and cycle." Her investigation consisted of detecting how actions in one part of the lifecycle facilitated work in other parts of the lifecycle. She collected data over one and a half years (2012-2014) from 9 data producers, 2 repository staff, and 7 data re-users and concluded that actions in one part of the cycle influence what occurs in other stages. "Repository actions are overwhelmingly positive but cannot always reverse or explain documentation procedures."

Following Ixchel and the panel, George Oates (@goodformand), director of Good, Form & Spectacle, presented Contending with the Network. "Design can increase access to the material," she said, citing her experience with Flickr and the Internet Archive. Relating to her work with Flickr, she referenced The Commons, a program that attempts to catalog the world's public photo archives. In her work with IA, she was most proud of the interface she designed for the Understanding 9/11 collection. She then worked with a group named MapStack and created a project called Surging Seas, an interactive tool for visualizing sea level rise. She recently started a new business, Good, Form & Spectacle, and has made a formal mission of archiving all documents related to its work through metadata. "It's always useful to see someone use what you've designed. They'll do stuff you anticipate and that you think is not so clear."

Following a short break, the last session of the day started with the Web Archiving Panel. Stephen Abrams of California Digital Library (CDL) started the presentation asking "Why web archiving? Before, the web was a giant document retrieval system. This is no longer the case. Now, the web browser is a virtual machine where the language of choice is JavaScript and not HTML." He stated that the web is a primary research data object and that we need to provide programmatic and business ways to support web archiving.

After Stephen, Martin Klein (@mart1nkle1n) of Los Alamos National Laboratory (LANL), formerly of our research group, gave an update on the state of the work done with Memento. "There's extensive Memento infrastructure in place now!", he said. New web services soon to be offered by LANL include a Memento redirect service (for example, going to http://example.com/memento/20040723150000/http://matkelly.com would automatically resolve to the closest available archived copy); a Memento list/search service to allow memento lookup through a user interface by specifying dates, times, and a URI; and finally, a Memento TimeMap service.
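For context, the datetime negotiation underneath these services follows the standard Memento pattern (RFC 7089): the client sends an Accept-Datetime header to a TimeGate and is redirected to the temporally closest memento. A minimal sketch, using the Internet Archive's public TimeGate rather than the forthcoming LANL services:

import requests

# Ask a Memento TimeGate for the archived copy closest to a desired datetime.
timegate = "http://web.archive.org/web/"             # Internet Archive TimeGate
target = "http://matkelly.com"
headers = {"Accept-Datetime": "Fri, 23 Jul 2004 15:00:00 GMT"}

resp = requests.get(timegate + target, headers=headers)
print(resp.url)                                      # URI of the closest memento
print(resp.headers.get("Memento-Datetime"))          # archival datetime of that memento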

After Martin, Jimmy Lin (@lintool) of the University of Maryland, formerly of Twitter, presented on how to leverage his big data expertise for use in digital preservation. "The problem," he said, "is that web archives are an important part of heritage but are vastly underused. Users can't do that much with web archives currently." His goal is to build tools to support exploration and discovery in web archives. A tool his group built, Warcbase, uses Hadoop and HBase for topic modeling.

After Jimmy, ODU WS-DL's very own Michael Nelson (@phonedude_mln) presented, starting off with "The problem is that right now, we're still in the phase of 'Hooray! It's in the web archive!'" whenever something shows up. What we should be asking is, "How well did we archive it?" Referencing the recent publicity of the Internet Archive capturing evidence related to the plane shot down over Ukraine, Michael said, "We were just happy that we had it archived. When you click on one of the videos, however, it just sits there and hangs. We have the page archived but maybe not all the stuff archived that we would like." He then went on to describe the ways his group is assessing web archives: determining the importance of what's missing, detecting temporal violations, and benchmarking how well the tools handle the content they're made to capture.


After Dr. Nelson presented, the audience had an extensive number of questions.

After the final panel, Dragan Espenschied (@despens) of Rhizome presented Big Data, Little Narration (see his interview). In his unconventional presentation, he reiterated that some artifacts don't make sense in the archives. "Each data point needs additional data about it somewhere else to give it meaning.", he said, giving a demonstration of an authentic replay of Geocities sites in Netscape 2.0 via a browser-accessible emulator. "Every instance of digital culture is too large for an institution because it's digital and we cannot completely represent it."

As the conference adjourned, I was glad I was able to experience it, see the progress other attendees have made in the last three (or more) years, and present the status of my research.

— Mat Kelly (@machawk1)

Tuesday, July 22, 2014

2014-07-22: "Archive What I See Now" Project Funded by NEH Office of Digital Humanities

We are grateful for the continued support of the National Endowment for the Humanities and their Office of Digital Humanities for our "Archive What I See Now" project.
In 2013, we received support for 1 year through a Digital Humanities Start-Up Grant.  This week, along with our collaborator Dr. Liza Potts from Michigan State, we were awarded a 3-year Digital Humanities Implementation Grant. We are excited to be one of the seven projects selected this year.

Our project goals are two-fold:
  1. to enable users to generate files suitable for use by large-scale archives (i.e., WARC files) with tools as simple as the "bookmarking" or "save page as" approaches that they already know
  2. to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of the Wayback Machine (wayback).
Our innovation is in allowing individuals to "archive what I see now". The user can create a standard web archive file ("archive") of the content displayed in the browser ("what I see") at a particular time ("now").
Our work focuses on bringing the power of institutional web archiving tools like Heritrix and wayback to humanities scholars through open-source tools for personal-scale web archiving. We are building the following tools:
  • WARCreate - A browser extension (for both Google Chrome and Firefox) that can create an archive of a single webpage in the standard WARC format and save it to local disk. It allows a user to archive pages that are behind authentication or that have been modified after user interaction.
  • WAIL (Web Archiving Integration Layer) - A stand-alone application that provides one-click installation and GUI-based configuration of both Heritrix and wayback on the user’s personal computer.
  • Mink - A browser extension (for both Google Chrome and Firefox) that provides access to archived versions of live webpages. This is an additional Memento client that can be configured to access locally stored WARC files created by WARCreate.
With these three tools, a researcher could, in her normal workflow, discover a web resource (using her browser), archive the resource as she saw it (using WARCreate in her browser), and then later index and replay the archived resource (using WAIL). Once the archived resource is indexed, it would be available for view in the researcher’s browser (using Mink).
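As a loose illustration of what "a standard web archive file" looks like to downstream tools, the sketch below walks the records of a WARC with the warcio library (warcio is not one of our tools, and the file name is hypothetical):

from warcio.archiveiterator import ArchiveIterator

# List the captured URIs and content types in a WARC file, e.g. one
# produced by a page-level archiving tool.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            ctype = record.http_headers.get_header("Content-Type")
            print(uri, ctype)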

We are looking forward to working with our project partners and advisory board members: Kristine Hanna (Archive-It), Lily Pregill (NY Art Resources Consortium), Dr. Louisa Wood Ruby (Frick Art Reference Library), Dr. Steven Schneider (SUNY Institute of Technology), and Dr. Avi Santo (ODU).

A previous post described the work we did for the start-up grant:
http://ws-dl.blogspot.com/2013/10/2013-10-11-archive-what-i-see-now.html

We've also posted previously about some of the tools (WARCreate and WAIL) that we've developed as part of this project:
http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

See also "ODU gets largest of 11 humanities grants in Virginia" from The Virginian-Pilot.

-Michele

Monday, July 14, 2014

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc.  You can also describe it as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages.  Most of the time JavaScript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident.

First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received.  This is occasionally useful (like loading a file from local disk), but more often than not it seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME type sniffing.  In general, having data formats specify protocol behavior is a bad idea (see the discussion about orthogonality in the W3C Web Architecture), but few can resist the temptation.  Specifically, http-equiv="refresh" makes things even worse, since the HTTP header "Refresh" never officially existed, and it was eventually dropped from the HTML specification as well.

However, it is a nice illustration of a common but non-standard HTML/fake-HTTP extension that nearly everyone supports.  Here's how it works, using www.cnn.com as an example:



This line:

<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>

tells the client to wait 30 minutes (1800 seconds) and then load the URL given in the optional url= argument (if no URL is provided, the client reloads the current page's URL).  CNN has used this "wait 30 minutes and reload" functionality for many years, and it is certainly desirable for a news site to cause the client to periodically reload its front page.  The problem comes when a page is archived, but the refresh capability is 1) not removed or 2) the URL argument is not (correctly) rewritten.
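To make those two remedies concrete, here is a rough sketch (mine, not any archive's actual implementation) of what rewriting or stripping the tag might look like on the archive side:

import re

META_REFRESH = re.compile(
    r'<meta\s+http-equiv="refresh"\s+content="(\d+);url=([^"]+)"\s*/?>',
    re.IGNORECASE)

def rewrite_refresh(html, archive_prefix):
    # Wayback-style: point the refresh URL back into the archive.
    return META_REFRESH.sub(
        lambda m: '<meta http-equiv="refresh" content="%s;url=%s%s"/>'
                  % (m.group(1), archive_prefix, m.group(2)),
        html)

def strip_refresh(html):
    # archive.today-style: remove the tag altogether.
    return META_REFRESH.sub("", html)

page = '<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>'
print(rewrite_refresh(page, "/web/20091121211700/"))
print(strip_refresh(page))

The examples below show how the real archives fall into (or between) these two behaviors.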

Last week I had loaded a memento of cnn.com from WebCitation, specifically: http://webcitation.org/5lRYaE8eZ, that shows the page as it existed at 2009-11-21:


I hid that page, did some work, and then when I came back I noticed that it had reloaded to the page as of 2014-07-11, even though the URL and the archival banner at the top remained unchanged:


The problem is that WebCitation leaves the meta refresh tag as is, causing the page to reload from the live web after 30 minutes.  I had never noticed this behavior before, so I decided to check how some other archives handle it.

The Internet Archive rewrites the URL, so although the client still refreshes the page, it gets an archived page.  Checking:

http://web.archive.org/web/20091121211700/http://www.cnn.com/


we find:

<meta http-equiv="refresh" content="1800;url=/web/20091121211700/http://www.cnn.com/?refresh=1">


But since the IA doesn't know to canonicalize www.cnn.com/?refresh=1 to www.cnn.com, you actually get a different archived page:



Instead of ending up on 2009-11-21, we end up two days in the past at 2009-11-19:


To be fair, ignoring "?refresh=1" is not a standard canonicalization rule but could be added (standard caveats apply).  And although this is not quite a zombie, it is potentially unsettling since the original memento (2009-11-21) is silently exchanged for another memento (2009-11-19; future refreshes will stay on the 2009-11-19 version).  Presumably other Wayback-based archives behave similarly.  Checking the British Library I saw:

http://www.webarchive.org.uk/wayback/archive/20090914012158/http://www.cnn.com/

redirect to:

http://www.webarchive.org.uk/wayback/archive/20090402030800/http://www.cnn.com/?refresh=1

In this case the jump is more noticeable (five months: 2009-09-14 vs. 2009-04-02) since the BL's archives of cnn.com are more sparse.
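A site-specific canonicalization rule of the sort suggested above could be as simple as dropping the refresh parameter before lookup. A hypothetical sketch (not an actual Wayback rule):

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(uri):
    # Drop a 'refresh' query parameter so http://www.cnn.com/?refresh=1 and
    # http://www.cnn.com/ resolve to the same capture (hypothetical rule).
    parts = urlparse(uri)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "refresh"]
    return urlunparse(parts._replace(query=urlencode(query)))

print(canonicalize("http://www.cnn.com/?refresh=1"))   # http://www.cnn.com/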

Perma.cc behaves similarly to the Internet Archive (i.e., rewriting but not canonicalizing), but presumably because it is a newer archive, it does not yet have a "?refresh=1" version of cnn.com archived.  It is possible that Perma.cc has a Wayback backend, but I'm not sure.  I had to push a 2014-07-11 version into Perma.cc (i.e., it did not already have cnn.com archived).  Checking:

http://perma.cc/89QJ-Y632?type=source


we see:

<meta http-equiv="refresh" content="1800;url=/warc/89QJ-Y632/http://www.cnn.com/?refresh=1"/>

And after 30 minutes it will refresh to a framed 404 because cnn.com/?refresh=1 is not archived:


As Perma.cc becomes more populated, the 404 behavior will likely disappear and be replaced with something like the Internet Archive and British Library examples.

Archive.today is the only archive that correctly handled this situation.  Loading:

https://archive.today/Zn6HS

produces:


A check of the HTML source reveals that they simply strip out the meta refresh tag altogether, so this memento will stay parked on 2013-06-27 no matter how long it stays in the client.

In summary:

  • WebCitation did not rewrite the URI and thus created a zombie
  • Internet Archive (and other Wayback archives) rewrites the URI, but because of site-specific canonicalization, it violates the user's expectations with a single time jump (the distance of which is dependent on the sparsity of the archive)
  • Perma.cc rewrites the URI, but in this case, because it is a new archive, produces a 404 instead of a time jump
  • Archive.today strips the meta refresh tag and avoids the behavior altogether

--Michael

2014-07-14: The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript

One very large part of digital preservation is the act of crawling and saving pages on the live Web into a format for future generations to view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often built to purpose. Because these tools are purpose-built, users expect them to function much better than a general purpose tool.

As anyone that has looked up a complex web page in the Internet Archive can tell you, the more complex the page, the less likely it is that all resources will be captured to replay the page. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.

We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.

In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.

The Archival Acid Test evaluates features that modern browsers execute well but preservation tools have trouble handling. We have grouped these tests into three categories with various tests under each category:

The Basics

  • 1a - Local image, relative to the test
  • 1b - Local image, absolute URI
  • 1c - Remote image, absolute
  • 1d - Inline content, encoded image
  • 1e - Scheme-less resource
  • 1f - Recursively included CSS

JavaScript

  • 2a - Script, local
  • 2b - Script, remote
  • 2c - Script inline, DOM manipulation
  • 2d - Ajax image replacement of content that should be in archive
  • 2e - Ajax requests with content that should be included in the archive, test for false positive (e.g., same origin policy)
  • 2f - Code that manipulates DOM after a certain delay (test the synchronicity of the tools)
  • 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
  • 2h - Code that dynamically adds stylesheets

HTML5 Features

  • 3a - HTML5 Canvas Drawing
  • 3b - LocalStorage
  • 3c - External Webpage
  • 3d - Embedded Objects (HTML5 video)

For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).

The results for this initial study (Figure 2) were accepted for publication (see the paper) to the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England.

The actual test we used is available at http://acid.matkelly.com for you to exercise with your tools/websites and the code that runs the site is available on GitHub.

— Mat Kelly (@machawk1)

Thursday, July 10, 2014

2014-07-10: Federal Cloud Computing Summit



As mentioned in my previous post, I attended the Federal Cloud Computing Summit on July 8th and 9th at the Ronald Reagan Building in Washington, D.C. I helped the host organization, the Advanced Technology And Research Center (ATARC), organize and run the MITRE-ATARC Collaboration Sessions that kicked off the event on July 8th. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on the current challenges in cloud computing.

A FedRAMP primer was held at 10:00 AM on July 8th in a Government-only session. At its conclusion, we began the MITRE-ATARC Collaboration Sessions, which focused on Cloud Computing in Austere Environments, Cloud Computing for the Mobile Worker, Security as a Service, and the Impact of Cloud Computing on the Enterprise. Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will be constructing a summary document that outlines the main points of the discussions, identifies orthogonal ideas between the challenge areas, and makes recommendations for the Government and academia based on the discussions. For reference, please see the 2013 Federal Cloud Computing Summit Whitepaper.

On July 9th, I attended the Summit which is a conference-style series of panels and speakers with an industry trade-show held before the event and during lunch. At 3:30-4:15, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes from the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the third Federal Summit event in which I have participated. They are great events that the Government participants have consistently identified as high-value and I am excited to start planning the Winter installment of the Federal Cloud Computing Summit.

--Justin F. Brunelle

Tuesday, July 8, 2014

2014-07-08: Presenting WS-DL Research to PES University Undergrads

On July 7th and 8th, 2014, Hany SalahEldeen and I (Mat Kelly) were given the opportunity to present our PhD research to visiting undergraduate seniors from a leading university in Bangalore, India (PES University). About thirty students were in attendance at each session and indicated their interest in the topics through a large number of relevant questions.


Dr. Weigle (@weiglemc)

Prior to ODU CS students' presentations, Dr. Michele C. Weigle (@weiglemc) gave the students an overview presentation of some of WS-DL's research topics with her presentation Bits of Research.

In her presentation she covered both our lab's foundational work, recent work, some outstanding research questions, as well as some potential projects to entice interested students to work with our research group.


Mat (@machawk1), your author

Of the two of us, I (Mat Kelly) presented first, giving a fairly high-level yet technical overview titled Browser-Based Digital Preservation, which highlighted my recent work in creating WARCreate, a Google Chrome extension that allows web pages to be preserved from the browser.

Rather than merely giving a demo of the tool (as was done at Digital Preservation and JCDL 2012), I initially gave a primer on the dynamics of the web, HTTP, the state of web archiving, and some issues relating to personal versus institutional web archiving, and then finally addressed the problems that WARCreate solves. I also covered some other related topics and browser-based preservation dynamics, which can be seen in the slides included in this post.


Hany (@hanysalaheldeen) presented the day after my presentation, giving a general overview of his academic career and research topics. His presentation Zen and the Art of Data Mining covered a wide range of topics including (but not limited to) temporal user intention, the Egyptian Revolution, and his experience as an ODU WS-DL PhD student (to, again, entice the students).


The opportunity for Hany and me to present what we work on day-to-day to bright-eyed undergraduate students was unique, as their interests lie within our research area (computer science) yet, as potential graduate students, they still have open doors as to which research path to take.

We hope that the presentations and questions we were able to answer were of some help in facilitating their decisions to pursue a graduate career at the Web Science and Digital Libraries Research Lab at Old Dominion University.



— Mat Kelly