Thursday, August 28, 2014

2014-08-28: InfoVis 2013 Class Projects

(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous posts: Fall 2011, Fall 2012)

In Spring 2013, I taught an Applied Visual Analytics course that asked students to create visualizations based on the "Life in Hampton Roads" annual survey performed by the Social Science Research Center at ODU.  In Fall 2013, I taught the traditional InfoVis course that allowed students to choose their own topics.  (All class projects are listed in my InfoVis Gallery.)

Life in Hampton Roads
Created by Ben Pitts, Adarsh Sriram Sathyanarayana, Rahul Ganta


This project (currently available at https://ws-dl.cs.odu.edu/vis/LIHR/) provides a visualization of the "Life in Hampton Roads" survey for 2010-2012.  Data cleansing was done with the help of tools like Excel and OpenRefine.  The results of the survey are shown using maps and bar charts, making it easier to understand public opinion on a particular question.  The visualizations were created using tools such as D3, JavaScript, jQuery, and HTML, and they are reader-driven and exploratory: the user can interactively change the filters in the top half of the page and see the data only for the selected groups below.
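
As a rough illustration of the filter-then-aggregate pattern behind the charts (the project itself implements this client-side with D3/JavaScript), here is a hypothetical Python/pandas sketch; the column names and values are made up, not the project's data:

import pandas as pd

# Hypothetical survey extract; the real project uses the cleaned LIHR data.
responses = pd.DataFrame({
    "year":     [2010, 2011, 2012, 2012],
    "city":     ["Norfolk", "Chesapeake", "Norfolk", "Virginia Beach"],
    "q_safety": ["Agree", "Disagree", "Agree", "Neutral"],
})

# The "filters" a user would select in the top half of the page.
selected_years = {2011, 2012}
selected_cities = {"Norfolk", "Chesapeake"}

subset = responses[responses["year"].isin(selected_years)
                   & responses["city"].isin(selected_cities)]

# Counts that would drive the bar chart for this one question.
print(subset["q_safety"].value_counts())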

The video below provides a demo of the tool.


MeSH Viewer
Created by Gil Moskowitz

MeSH is a controlled vocabulary developed and maintained by the U.S. National Library of Medicine (NLM) for indexing biomedical databases, including the PubMed citation database. PubMed queries can be made more precise, returning fewer citations with higher relevance, by issuing them with specific reference to MeSH terms. The database of MeSH terms is large, with many interrelationships between terms. The MeSH viewer (currently available at https://ws-dl.cs.odu.edu/vis/MeSH-vis/) visually presents a subset of MeSH terms, specifically MeSH descriptors. The MeSH descriptors are organized into trees based on a hierarchy of MeSH numbers. This work includes a tree view to show the relationships as structured by the NLM. However, individual descriptors may have multiple MeSH numbers, hence multiple locations in the forest of trees, so this view is augmented with a force-directed node-link view of terms found in response to a user search. The terms can be selected and used to build PubMed search strings, and an estimate of the specificity of this combination of terms is displayed.
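
To make the search-string and specificity ideas concrete, here is a minimal Python sketch (not the MeSH viewer's own code) that joins a couple of example MeSH descriptors into a PubMed query and uses the hit count from NCBI's public E-utilities esearch service as a rough specificity estimate; the descriptor names are arbitrary and network access is assumed:

import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(mesh_terms):
    """Combine MeSH descriptors with AND and return PubMed's hit count."""
    query = " AND ".join(f'"{t}"[MeSH Terms]' for t in mesh_terms)
    params = urllib.parse.urlencode({"db": "pubmed", "term": query,
                                     "retmode": "json", "retmax": 0})
    with urllib.request.urlopen(f"{ESEARCH}?{params}") as resp:
        data = json.load(resp)
    return query, int(data["esearchresult"]["count"])

# Fewer hits suggests a more specific combination of descriptors.
query, count = pubmed_count(["Neoplasms", "Drug Therapy"])
print(query, "->", count, "citations")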

The video below provides a demo of the tool.


Visualizing Currency Volatility
Created by Jason Long



This project (currently available at https://ws-dl.cs.odu.edu/vis/currency/) was focused on showing changes in currency values over time. The initial display shows the values of 39 currencies (including Bitcoin) as compared to the US Dollar (USD).  It's easy to see when the Euro was introduced and individual currencies in EU countries dropped off.  Clicking on a particular currency expands its chart and allows for closer inspection of its change over time.  There is also a tab for a heatmap view that allows the user to view a moving average of the difference from USD.  The color indicates whether the USD is appreciating (green) or depreciating (red).
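
The heatmap's moving-average encoding can be sketched in a few lines of Python; the rates below are made-up values rather than the project's data, and reading "difference from USD" as a day-over-day change in the exchange rate is my assumption:

# Daily EUR-per-USD rates (made-up values for illustration only).
rates = [0.74, 0.75, 0.73, 0.72, 0.74, 0.76, 0.77]

def moving_average(values, window=3):
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Day-over-day change; positive means a dollar buys more EUR (USD appreciating).
diffs = [b - a for a, b in zip(rates, rates[1:])]
for day, avg in enumerate(moving_average(diffs), start=3):
    color = "green (appreciating)" if avg > 0 else "red (depreciating)"
    print(f"day {day}: {avg:+.4f} -> {color}")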

The video below provides a demo of the tool.

-Michele

Tuesday, August 26, 2014

2014-08-26: Memento 101 -- An Overview of Memento in 101 Slides

In preparation for the upcoming "Tools and Techniques for Revisiting Online Scholarly Content" tutorial at JCDL 2014, Herbert and I have revamped the canonical slide deck for Memento, and have called it "Memento 101" for the 101 slides it contains.  The previous slide deck was from May 2011 and was no longer current with the RFC (December 2013).  The slides cover Memento basic and intermediate concepts, with pointers for some of the more detailed and esoteric bits (like patterns 2, 3, and 4, as well as the special cases) of interest to only the most hard-core archive wonks. 
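
For readers who want to try the core interaction the slides cover, datetime negotiation per the Memento RFC (RFC 7089), here is a minimal Python sketch against the Internet Archive's public TimeGate; it is an independent illustration, not part of the slide deck, and it assumes network access and the archive's current behavior:

import urllib.request

# TimeGate for a URI-R at the Internet Archive (http://web.archive.org/web/{URI-R}).
timegate = "http://web.archive.org/web/http://www.cnn.com/"
req = urllib.request.Request(
    timegate,
    headers={"Accept-Datetime": "Thu, 01 May 2014 00:00:00 GMT"},
    method="HEAD",
)
with urllib.request.urlopen(req) as resp:
    # The TimeGate redirects to the memento closest to the requested datetime.
    print("URI-M:", resp.geturl())
    print("Memento-Datetime:", resp.headers.get("Memento-Datetime"))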

The JCDL 2014 tutorial will choose a subset of these slides, combined with updates from the Hiberlink project and various demos.  If you find yourself in need of explaining Memento please feel free to use these slides in part or in whole (PPT is available for download from slideshare). 




--Michael & Herbert

Thursday, August 21, 2014

2014-08-22: One WS-DL Class Offered for Fall 2014

This fall, only one WS-DL class will be offered:
This class approaches the Web as a phenomenon to be studied in its own right.  In this class we will explore a variety of tools (e.g., Python, R, D3) as well as applications (e.g., social networks, recommendations, clustering, classification) that are commonly used in analyzing the socio-technical structures of the Web. 

The class will be similar to the fall 2013 offering. 

Right now we're planning on offering in spring 2015:
  • CS 418 Web Programming 
  • CS 725/825 Information Visualization
  • CS 751/851 Introduction to Digital Libraries
--Michael

Friday, July 25, 2014

2014-07-25: Digital Preservation 2014 Trip Report

Mat Kelly and Dr. Michael L. Nelson traveled to Washington, DC to report on their current research and to learn about others' work in the field.
On July 22 and 23, 2014, Dr. Michael Nelson (@phonedude_mln) and I (@machawk1) attended Digital Preservation 2014 in Washington, DC. This was my fourth consecutive NDIIPP (@ndiipp) / NDSA (@ndsa2) meeting (see trip reports from Digital Preservation 2011, 2012, 2013). With the largest attendance yet (300+) and compressed into two days, the schedule was jam-packed with interesting talks. Per usual, videos for most of the presentations are included inline below.

Day One

Micah Altman (@drmaltman) led the presentations with information about the NDSA and asked, regarding Amazon's claimed reliability of 99.999999999% (eleven nines), "What do the eleven nines mean?". "There are a number of risks that we know about [as archivists] that Amazon doesn't", he said, continuing, "No single institution can account for all the risks." Micah spoke about the updated National Agenda for Digital Stewardship, which is to have the theme of "developing evidence for collective action".

Matt Kirschenbaum (@mkirschenbaum) followed Micah with his presentation Software, It’s a Thing, with reference to the recently excavated infamous Atari E.T. game. "Though the data on the tapes had been available for years, the uncovering was more about aura and allure". Matt spoke about George R. R. Martin's (of Game of Thrones fame) recent interview where he stated that he still uses a word processing program called WordStar on an un-networked DOS machine and that it is his "secret weapon from distraction and viruses." In the 1980s, WordStar dominated the market until WordPerfect took over, followed by Microsoft Word. "A power user that has memorized all of the WordStar commands could utilize the software with the ease of picking up a pencil and starting to write."
Matt went on to talk (also see his Medium post) about software as different concepts: as an asset, as an object, as a kind of notation or score (qua music), as shrinkwrap, etc. For a full explanation of each, see his presentation:

Following Matt, Shannon Mattern (@shannonmattern) shared her presentation on Preservation Aesthetics. "One of preservationists' primary concerns is whether an item has intrinsic value", she said. Shannon then spoke about various sorts of auto-destructive software and art, including art that is light sensitive and software that deletes itself on execution. In addition to recording her talk (see below), she graciously included the full text of her talk online.

The conference briefly had a break and a quick summary of the poster session to come later in day 1. After the break, there was a panel titled "Stewarding Space Data". Hampapuram Ramapriyan of NASA began the panel stating that "NASA data is available as soon as the satellites are launched." He continued, "This data is preserved... We shouldn't lose bits that we worked hard to collect and process for results." He also stated that he (and NASA) is part of the Earth Science Information Partners (ESIP) Data Stewardship Committee.

Deirdre Byrne of NOAA then spoke about NOAA's needs in documenting data, preserving it with provenance, providing IT support to maintain the data's integrity, and being able to work with the data in the future (a digital preservation plus). Deirdre then referenced Pathfinder, a technology that allows the visualization of sea surface temperatures, with applications such as indicating coral bleaching, tracking fish stocks, and monitoring coastal water quality. Citing its current use as a de facto standard means for this purpose, she described the physical dynamics as having 8 different satellites for its functionality along with 31 mirror satellites on standby for redundancy.

Emily Frieda Shaw (@emilyfshaw) of University of Iowa Libraries followed in the panel after Deirdre, and spoke about Iowa's role in preserving the original development data for the Explorer I launch. Upon converting and analyzing the data, her fellow researchers realized that at certain altitudes, the radiation detection dropped to zero, which indicated that there were large belts of particles surrounding the Earth (later recognized as the Van Allen belts). After discovering more data about the launch in a basement, the group received a grant to recover and restore the badly damaged and moldy documentation.

Karl Nilsen (@KarlNilsen) and Robin Dasler (@rdasler) of University of Maryland Libraries were next, with Robin first talking about her concern with issues in the library realm related to data. She referenced one project whose data still resided at the University of Hawaii's Institute for Astronomy because it was the home institution of one of the original researchers on the project. The project consisted of data measuring the distances between galaxies, compiled from various data sources originating from both observational data and standard corpora. To display the data (500 gigabytes total), the group developed a web UI backed by technologies like MySQL to make the data more accessible. "Researchers were concerned about their data disappearing before they retired", she stated about the original motive for increasing the data's accessibility.
Karl shifted topics somewhat, stating that two different perspectives can be taken on data from a preservation standpoint (format-centric or system-centric). "The intellectual value of a database comes from ad hoc combination from multiple tables in the form of joins and selections", he said. "Thinking about how to provide access", he continued, "is itself preservation." He followed this with approaches including utilizing virtual machines (VMs), migrating from one application to a more modern web application, and collapsing the digital preservation horizon to ten years at a time.
Following a brief Q&A for the panel was a series of lightning talks. James (Jim) A. Bradley of Ball State University started with his talk, "Beyond the Russian Doll Effect: Reflexivity and the Digital Repository Paradigm", where he spoke about promoting and sharing digital assets for reuse. Jim then talked about the Digital Media Repository (DMR), which allowed information to be shared and made available at the page level. His group had the unique opportunity to tell what materials are in the library, who was using them, and when. From these patterns, grad students made 3-D models, which were then subsequently added and attributed to the original objects.
Dave Gibson (@davegilbson) of the Library of Congress followed Jim by presenting Video Game Source Disc Preservation. He stated that his group has been the "keepers of The Game" since 1996 and has various source code excerpts from a variety of unreleased games, including Duke Nukem Critical Mass, which was released for Nintendo DS but not PlayStation Portable (PSP), despite being developed for both platforms. In their exploration of the game, they uncovered 28 different file formats on the discs, many of which were proprietary, and wondered how they could get access to the files' contents. After using MediaCoder to convert many of the audio files and hex editors to read the ASCII, his group found source code fragments hidden within the files. From this work, they now have the ability to preserve unreleased games.
Rebecca (Becky) Katz of the Council of the District of Columbia was next with UELMA-Compliant Preservation: Questions and Answers?. She first described UELMA, an act declaring that if a U.S. state passes the act and its digital legal publications are official, the state has to make sure that the publications are preserved in some way. Because of this requirement, many states are reluctant to rely solely on digital documents and instead keep paper copies in addition to the digital copies. Many of the barriers in the preservation process for the states lie in how they ought to preserve the digital documents. "For long term access," she said, "we want to be preserving our digital content." Becky also spoke about appropriate formats for preservation and noted that common features of formats, like authentication for PDF, are not open source. "All the metadata in the world doesn't mean anything if the data is not usable", she said. "We want to have a user interface that integrates with the [preservation] repository." She concluded by recommending that states that adopt UELMA have agreements with universities or long-standing institutions to allow users to download updates of the documents to ensure that many copies are available.
Kate Murray of Library of Congress followed Becky with "We Want You Just the Way You Are: The What, Why and When of fixity in Digital Preservation". She referenced the "Migrant Mother" photo and how, through moving the digital photo from one storage component to another, there have been subtle changes to the photo. "I never realized that there are three children in the photo!", she said, referencing the comparison between the altered and original photo. To detect these changes, she uses fixity (related article on The Signal) on a whole collection of data, which ensures bit level integrity.
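
The fixity idea Kate described reduces to recording checksums for every file in a collection and re-verifying them later; here is a generic Python sketch (not any particular institution's workflow, and the "collection/" path is hypothetical):

import hashlib
import pathlib

def fixity_manifest(root):
    """Return {relative path: sha256 digest} for every file under root."""
    manifest = {}
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file():
            manifest[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()).hexdigest()
    return manifest

# Record a baseline once, then re-run later and compare to detect silent changes.
baseline = fixity_manifest("collection/")
later = fixity_manifest("collection/")
changed = [p for p, digest in later.items() if baseline.get(p) != digest]
print("files changed or added:", changed)
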
Following Kate, Krystyna Ohnesorge of the Swiss Federal Archives (SFA) presented "Save Your Databases Using SIARD!". "About 90 percent of specialized applications are based on relational databases", she said. SIARD is used to preserve database content for the long term so that the data can still be understood in 20 or 50 years. The system is already used in 54 countries, with 341 licenses currently existing for different international organizations. "If an institution needs to archive relational databases, don't hesitate to use the SIARD suite and SIARD format!"
The conference then broke for lunch where the 2014 Innovation Awards were presented.
Following lunch, Cole Crawford (@coleinthecloud) of the Open Compute Foundation presented "Community Driven Innovation", where he spoke about Open Compute being an internationally based open source project. "Microsoft is moving all Azure data to Open Compute", he said. "Our technology requirements are different. To have an open innovation system that's reusable is important." He then emphasized that his talk would focus specifically on the storage aspects of Open Compute. He started with "FACT: The 19 inch server rack originated in the railroad industry, then propagated to the music industry, then it was adopted by IT." He continued, "One of the most interesting things Facebook has done recently is move from tape storage to ten thousand Blu-ray discs for cold storage". He stated that in 2010, the Internet consisted of 0.8 zettabytes. In 2012, this number was 3.0 zettabytes, and by 2015, he claimed, it will be 40 zettabytes in size. "As stewards of digital data, you guys can be working with our project to fit your requirements. We can be a great engine for procurement. As you need more access to content, we can get that."

After Cole was another panel titled "Community Approaches to Digital Stewardship". Fran Berman (@resdatall) of Rensselaer Polytechnic Institute started off with reference to the Research Data Alliance. "All communities are working to develop the infrastructure that's appropriate to them", she said. "If I want to worry about asthma (referencing an earlier comment about whether asthma is more likely to develop in Mexico City versus Los Angeles), I don't want to wait years until the infrastructure is in place. If you have to worry about data, that data needs to live somewhere."

Bradley Daigle (@BradleyDaigle) of the University of Virginia followed Fran and spoke about the Academic Preservation Trust, a group of 17 members that takes a community-based approach and is attempting to deliver not just more solutions but better ones. The group would like to create a business model based on preservation services. "If you have X amount of data, I can tell you it will take Y amount of cost to preserve that data", he said, describing an ideal model. "The AP Trust can serve as a scratch space with preservation being applied to the data."

Following Bradley on the panel, Aaron Rubinstein from the University of Massachusetts Amherst described his organization's scheme as being similar to Scooby Doo, iterating through each character displayed on-screen and stating the name of a member organization. "One of the things that makes our institution privileged is that we have administrators that understand the need for preservation."

The last presenter in the panel, Jamie Schumacher of Northern Illinois University, started with "Smaller institutions have challenges when starting digital preservation. Instead of obtaining an implementation grant when applying to the NEH, we got a 'Figure it Out' grant. ... Our mission was to investigate a handful of digital preservation solutions that were affordable for organizations with restricted resources like small staff sizes and those with lone rangers. We discovered that practitioners are overwhelmed. To them, digital objects are a foreign thing." Some of the roadblocks her team eliminated were the questions of which tools and services to use for preservation tasks, for which Google frequently gave poor or too many results.

Following a short break, the conference split into five different concurrent workshops and breakout sessions. I attended the session titled Next Generation: The First Digital Repository for Museum Collections where Ben Fino-Radin (@benfinoradin) of Museum of Modern Art, Dan Gillean of Artefactual Systems and Kara Van Malssen (@kvanmalssen) of AVPreserve gave a demo of their work.

As I was presenting a poster at Digital Preservation 2014, I was unable to stay for the second presentation in the session, Revisiting Digital Forensics Workflows in Collecting Institutions by Marty Gengenbach of Gates Archive, as I was required to set up my poster. At 5 o'clock, the breakout sessions ended and a reception was held with the poster session in the main area of the hotel. My poster, "Efficient Thumbnail Summarization for Web Archives", is an implementation of Ahmed AlSum's initial work published at ECIR 2014 (see his ECIR Trip Report).

Day Two

The second day started off with breakfast and an immediate set of workshops and breakout sessions. Among these, I attended Digital Preservation Questions and Answers from the NDSA Innovation Working Group, where Trevor Owens (@tjowens), among other group members, introduced the history and migration of an online digital preservation question and answer system. The current site, residing at http://qanda.digipres.org, is in the process of migration from previous attempts, including a failed try at a Digital Preservation Stack Exchange. This work, completed in part by Andy Jackson (@anjacks0n) at the UK Web Archive, began its migration with his Zombse project, which extracted all of the initial questions and data from the failed Stack Exchange into a format that would eventually be readable by another Q&A system.
Following a short break after the first long set of sessions, the conference re-split into the second set of breakout sessions for the day, where I attended the session titled Preserving News and Journalism. Aurelia Moser (@auremoser) moderated this panel-style presentation and initially showed a URI where the presentation's resources could be found (I typed bit.ly/1klZ4f2 but that seems to be incorrect).
The panel, consisting of Anne Wootton (@annewootton), Leslie Johnston (@lljohnston), and Edward McCain (@e_mccain) of the Reynolds Journalism Institute, initially asked, "What is born digital and what are news apps?". The group had put forth a survey to 476 news organizations, consisting of 406 hybrid organizations (those that put content in print and online) and 70 "online only" publications.
From the results, the surveyors noted that they had kept the answers open ended so that journalists could give an accurate account of their practices. "Newspapers that are in chains are more likely to have written policies for preservation. The smaller organizations are where we're seeing data loss." At one point, Anne Wootton's group organized a "Design-a-Thon" where they gathered journalists, archivists, and developers. Regarding the surveyors' practice, the group stated that Content Management System (CMS) vendors for news outlets are the holders of the keys to the kingdom for newspapers in regard to preservation.

After the third breakout session of the conference, lunch was served (Mexican!) with Trevor Owens of the Library of Congress, Amanda Brennan (@continuants) of Tumblr, and Trevor Blank (@trevorjblank) of The State University of New York at Potsdam giving a preview of CURATECamp, to occur the day after the completion of the conference. While lunch proceeded, a series of lightning talks was presented.
The first lightning talk was by Kate Holterhoff (@KateHolterhoff) of Carnegie Mellon University and titled Visual Haggard and Digitizing Illustration. In her talk, she introduced Visual Haggard, a catalog of many images from public domain books and documents that attempts to have better quality representations of the images in these documents compared to other online systems like Google Books. "Digital archivists should contextualize and mediate access to digital illustrations", she said.

Michele Kimpton of DuraSpace followed Kate with DuraSpace and Chronopolis Partner to Build a Long-term Access and Preservation Platform. In her presentation she introduced tools like Chronopolis (used for dark archiving) and DuraCloud, and her group's approach to getting various open source tools to work together to provide a more comprehensive solution for preserving digital content.

Following Michele, Ted Westervelt of the Library of Congress presented Library of Congress Recommended Formats, where he referenced the Best Edition Statement, a largely obsolete but useful document that needed supplementation to account for modern best practice and newer mediums. His group has developed the "Recommended Format Specification", a work-in-progress that provides this without superseding the original document. The goal of the document is to set parameters for the target objects so that most content currently unhandled by the in-place specification will have directives to ensure that digital preservation of the objects is guided and easy.

After Ted, Jane Zhang of the Catholic University of America presented Electronic Records and Digital Archivists: A Landscape Review, a cursory study of employment positions for digital archivists, both formally trained and trained on-the-job. She attempted to answer the question "Are libraries hiring digital archivists?" and tried to discern a pattern from one hundred job descriptions.

After the lightning talks, another panel was held, titled "Research Data and Curation". Inna Kouper (@inkouper) and Dharma Akmon (@DharmaAkmon) of Indiana University and the University of Michigan, Ann Arbor, respectively, discussed Sustainable Environment Actionable Data (SEAD, @SEADdatanet) and a Research Object Framework for the data that will be very familiar to practitioners working with video producers. "Research projects are bundles", they said. "The ROF captures multiple aspects of working with data, including unique ID, agents, states, relationships, and content, and how they cyclically relate. Research objects change states."
They continued, "Curation activities are happening from the beginning to the end of an object's lifecycle. An object goes through three main states", they listed, "Live objects are in a state of flux, handled by members of project teams, and their transition is initiated by the intent to publish. Curation objects consist of content packaged using the BagIt protocol with metadata, and relationships via OAI/ORE maps, which are mutable but allow selective changes to metadata. Finally, publication objects are immutable and citable via a DOI and have revisions and derivations tracked."
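
Since curation objects are described as content packaged using the BagIt protocol with metadata, the small sketch below, using the Library of Congress bagit-python library, gives a feel for that packaging step; it is my approximation rather than SEAD's actual code, and the directory and metadata values are hypothetical:

import bagit  # pip install bagit (Library of Congress bagit-python)

# Turn a directory of research data into a BagIt bag in place, generating
# payload manifests (checksums) and bag-info.txt metadata.
bag = bagit.make_bag(
    "curation_object/",
    {"Source-Organization": "Example Project Team",       # hypothetical values
     "External-Identifier": "doi:10.0000/example"},
    checksums=["sha256"],
)

# Later, verify fixity and completeness of the bag.
bag.validate()
print("valid bag with", len(list(bag.payload_files())), "payload files")
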
Ixchel Faniel from OCLC Research then presented, stating that there are three perspectives for archeological practice. "The data has a lifecycle from the data collection, data sharing, and data reuse perspective and cycle." Her investigation consisted of detecting how actions in one part of the lifecycle facilitated work in other parts of the lifecycle. She collected data over one and one-half years (2012-2014) from 9 data producers, 2 repository staff, and 7 data re-users and concluded that actions in one part of the cycle have influence on things that occur in other stages. "Repository actions are overwhelmingly positive but cannot always reverse or explain documentation procedures."
Following Ixchel and the panel, George Oates (@goodformand), Director of Good, Form & Spectacle presented Contending with the Network. "Design can increase access to the material", she said, stating her experience with Flickr and Internet Archive. Relating to her work with Flickr, she referenced The Commons, a program that is attempting to catalog the world's public photo archive. In her work with IA, she was most proud of the interface she designed for the Understanding 9/11 interface. She then worked for a group named MapStack and created a project called Surging Seas, an interactive tool for visualizing sea level rise. She recently started a new business "Good, Form, & Spectacle" and proceeded on a formal mission of archiving all documents related to the mission through metadata. "It's always useful to see someone use what you've designed. They'll do stuff you anticipate and that you think is not so clear."
Following a short break, the last session of the day started with the Web Archiving Panel. Stephen Abrams of California Digital Library (CDL) started the presentation asking "Why web archiving? Before, the web was a giant document retrieval system. This is no longer the case. Now, the web browser is a virtual machine where the language of choice is JavaScript and not HTML." He stated that the web is a primary research data object and that we need to provide programmatic and business ways to support web archiving.
After Stephen, Martin Klein (@mart1nkle1n) of Los Alamos National Laboratory (LANL), and formerly of our research group, gave an update on the state of the work done with Memento. "There's extensive Memento infrastructure in place now!", he said. New web services soon to be offered by LANL include a Memento redirect service (for example, going to http://example.com/memento/20040723150000/http://matkelly.com will automatically be resolved in the archives to the closest available archived copy); a Memento list/search service to allow memento lookup through a user interface for specifying dates, times, and a URI; and finally, a Memento TimeMap service.
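
The redirect pattern Martin described, with a datetime embedded in the URI resolving to the closest available memento, can be tried today against the Internet Archive's Wayback Machine, which behaves the same way; a short Python sketch (network access assumed, and the requested datetime is arbitrary):

import urllib.request

# Ask for a datetime the archive almost certainly does not hold exactly;
# the service redirects to the closest memento it does have.
requested = "http://web.archive.org/web/20040723150000/http://www.cnn.com/"
req = urllib.request.Request(requested, method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("requested:", requested)
    print("resolved :", resp.geturl())  # URI-M of the nearest capture
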
After Martin, Jimmy Lin (@lintool) of the University of Maryland, and formerly of Twitter, presented on how to leverage his big data expertise for use in digital preservation. "The problem", he said, "is that web archives are an important part of heritage but are vastly underused. Users can't do that much with web archives currently." His goal is to build tools to support exploration and discovery in web archives. A tool his group built, Warcbase, uses Hadoop and HBase for topic modeling.
After Jimmy, ODU WS-DL's very own Michael Nelson (@phonedude_mln) presented, starting off with "The problem is that right now, we're still in the phase of 'Hooray! It's in the web archive!'" whenever something shows up. What we should be asking is, "How well did we archive it?" Referencing the recent publicity around the Internet Archive capturing evidence related to the plane shot down over Ukraine, Michael said, "We were just happy that we had it archived. When you click on one of the videos, however, it just sits there and hangs. We have the page archived but maybe not all the stuff archived that we like." He then went on to describe the ways his group is assessing web archives: determining the importance of what's missing, detecting temporal violations, and benchmarking how well the tools handle the content they're made to capture.

After Dr. Nelson presented, the audience asked a large number of questions.
After the final panel, Dragan Espenschied (@despens) of Rhizome presented Big Data, Little Narration (see his interview & links page). In his unconventional presentation, he reiterated that some artifacts don't make sense in the archives. "Each data point needs additional data about it somewhere else to give it meaning.", he said, giving a demonstration of an authentic replay of Geocities sites in Netscape 2.0 via a browser-accessible emulator. "Every instance of digital culture is too large for an institution because it's digital and we cannot completely represent it."
As the conference adjourned, I was glad I was able to experience it, see the progress other attendees have made in the last three (or more) years, and present the status of my research.
— Mat Kelly (@machawk1)

Tuesday, July 22, 2014

2014-07-22: "Archive What I See Now" Project Funded by NEH Office of Digital Humanities

We are grateful for the continued support of the National Endowment for the Humanities and their Office of Digital Humanities for our "Archive What I See Now" project.
In 2013, we received support for 1 year through a Digital Humanities Start-Up Grant.  This week, along with our collaborator Dr. Liza Potts from Michigan State, we were awarded a 3-year Digital Humanities Implementation Grant. We are excited to be one of the seven projects selected this year.

Our project goals are two-fold:
  1. to enable users to generate files suitable for use by large-scale archives (i.e., WARC files) with tools as simple as the "bookmarking" or "save page as" approaches that they already know
  2. to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of the Wayback Machine (wayback).
Our innovation is in allowing individuals to "archive what I see now". The user can create a standard web archive file ("archive") of the content displayed in the browser ("what I see") at a particular time ("now").
Our work focuses on bringing the power of institutional web archiving tools like Heritrix and wayback to humanities scholars through open-source tools for personal-scale web archiving. We are building the following tools:
  • WARCreate - A browser extension (for both Google Chrome and Firefox) that can create an archive of a single webpage in the standard WARC format and save it to local disk. It can allow a user to archive pages behind authentication or that have been modified after user interaction.
  • WAIL (Web Archiving Integration Layer) - A stand-alone application that provides one-click installation and GUI-based configuration of both Heritrix and wayback on the user’s personal computer.
  • Mink - A browser extension (for both Google Chrome and Firefox) that provides access to archived versions of live webpages. This is an additional Memento client that can be configured to access locally stored WARC files created by WARCreate.
With these three tools, a researcher could, in her normal workflow, discover a web resource (using her browser), archive the resource as she saw it (using WARCreate in her browser), and then later index and replay the archived resource (using WAIL). Once the archived resource is indexed, it would be available for view in the researcher’s browser (using Mink).
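
To give a rough sense of what such a "standard web archive file" contains, the sketch below writes a single response record to a WARC using the warcio Python library; this is a generic illustration rather than WARCreate's implementation (which runs inside the browser), and the URI and payload are made up:

from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

# Write one archived response ("what I see") into example.warc.gz.
with open("example.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    payload = b"<html><body>Hello, archived web!</body></html>"
    http_headers = StatusAndHeaders(
        "200 OK",
        [("Content-Type", "text/html"),
         ("Content-Length", str(len(payload)))],
        protocol="HTTP/1.1",
    )
    record = writer.create_warc_record(
        "http://example.com/",          # URI of the captured page
        "response",
        payload=BytesIO(payload),
        http_headers=http_headers,
    )
    writer.write_record(record)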

We are looking forward to working with our project partners and advisory board members: Kristine Hanna (Archive-It), Lily Pregill (NY Art Resources Consortium), Dr. Louisa Wood Ruby (Frick Art Reference Library), Dr. Steven Schneider (SUNY Institute of Technology), and Dr. Avi Santo (ODU).

A previous post described the work we did for the start-up grant:
http://ws-dl.blogspot.com/2013/10/2013-10-11-archive-what-i-see-now.html

We've also posted previously about some of the tools (WARCreate and WAIL) that we've developed as part of this project:
http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

See also "ODU gets largest of 11 humanities grants in Virginia" from The Virginian-Pilot.

-Michele

Monday, July 14, 2014

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc.  You can also describe it as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages.  Most of the time JavaScript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident.

First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received.  This is occasionally useful (like loading a file from local disk), but more often than not it seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME type sniffing.  In general, having data formats specify protocol behavior is a bad idea (see the discussion about orthogonality in the W3C Web Architecture), but few can resist the temptation.  Specifically, http-equiv="refresh" makes things even worse, since the HTTP header "Refresh" never officially existed, and it was eventually dropped from the HTML specification as well.

However, it is a nice illustration of a common but non-standard HTML/fake-HTTP extension that nearly everyone supports.  Here's how it works, using www.cnn.com as an example:



This line:

<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>

tells the client to wait 30 minutes (1800 seconds) and then load the URL given in the optional url= argument (if no URL is provided, the client reloads the current page's URL).  CNN has used this "wait 30 minutes and reload" functionality for many years, and it is certainly desirable for a news site to cause the client to periodically reload its front page.  The problem comes when a page is archived, but the refresh capability is 1) not removed or 2) the URL argument is not (correctly) rewritten.
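
To make those two remedies concrete, here is a small Python sketch (not any archive's actual rewriting code) that parses a captured meta refresh tag and either rewrites its target into Wayback-style archive URL space or strips the tag entirely; the timestamp in the rewrite matches the example memento discussed below:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

captured = '<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>'
soup = BeautifulSoup(captured, "html.parser")

for tag in soup.find_all("meta", attrs={"http-equiv": "refresh"}):
    delay, _, url = tag["content"].partition(";url=")
    # Remedy 1: rewrite the target into the archive's own URL space
    # (the /web/<timestamp>/ prefix follows the Wayback convention).
    tag["content"] = f"{delay};url=/web/20091121211700/{url}"
    # Remedy 2 (what Archive.today does, as shown later): drop the tag entirely.
    # tag.decompose()

print(soup)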

Last week I had loaded a memento of cnn.com from WebCitation, specifically: http://webcitation.org/5lRYaE8eZ, that shows the page as it existed at 2009-11-21:


I hid that page, did some work, and then when I came back I noticed that it had reloaded to the page as of 2014-07-11, even though the URL and the archival banner at the top remained unchanged:


The problem is that WebCitation leaves the meta refresh tag as is, causing the page to reload from the live web after 30 minutes.  I had never noticed this behavior before, so I decided to check how some other archives handle it.

The Internet Archive rewrites the URL, so although the client still refreshes the page, it gets an archived page.  Checking:

http://web.archive.org/web/20091121211700/http://www.cnn.com/


we find:

<meta http-equiv="refresh" content="1800;url=/web/20091121211700/http://www.cnn.com/?refresh=1">


But since the IA doesn't know to canonicalize www.cnn.com/?refresh=1 to www.cnn.com, you actually get a different archived page:



Instead of ending up on 2009-11-21, we end up two days in the past at 2009-11-19:


To be fair, ignoring "?refresh=1" is not a standard canonicalization rule but could be added (standard caveats apply).  And although this is not quite a zombie, it is potentially unsettling since the original memento (2009-11-21) is silently exchanged for another memento (2009-11-19; future refreshes will stay on the 2009-11-19 version).  Presumably other Wayback-based archives behave similarly.  Checking the British Library I saw:

http://www.webarchive.org.uk/wayback/archive/20090914012158/http://www.cnn.com/

redirect to:

http://www.webarchive.org.uk/wayback/archive/20090402030800/http://www.cnn.com/?refresh=1

In this case the jump is more noticeable (five months: 2009-09-14 vs. 2009-04-02) since the BL's archives of cnn.com are more sparse. 
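
A site-specific rule of the kind suggested above (ignoring "?refresh=1") could be as simple as dropping that query parameter before lookup; here is a hedged Python sketch, which is not the Internet Archive's actual canonicalizer (that applies many more rules):

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(uri):
    """Drop a 'refresh' query parameter so both forms map to the same memento."""
    parts = urlparse(uri)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "refresh"]
    return urlunparse(parts._replace(query=urlencode(query)))

print(canonicalize("http://www.cnn.com/?refresh=1"))  # -> http://www.cnn.com/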

Perma.cc behaves similarly to the Internet Archive (i.e., rewriting but not canonicalizing), but presumably because it is a newer archive, it does not yet have a "?refresh=1" version of cnn.com archived.  It is possible that Perma.cc has a Wayback backend, but I'm not sure.  I had to push a 2014-07-11 version into Perma.cc (i.e., it did not already have cnn.com archived).  Checking:

http://perma.cc/89QJ-Y632?type=source


we see:

<meta http-equiv="refresh" content="1800;url=/warc/89QJ-Y632/http://www.cnn.com/?refresh=1"/>

And after 30 minutes it will refresh to a framed 404 because cnn.com/?refresh=1 is not archived:


As Perma.cc becomes more populated, the 404 behavior will likely disappear and be replaced with something like the Internet Archive and British Library examples.

Archive.today is the only archive that correctly handled this situation.  Loading:

https://archive.today/Zn6HS

produces:


A check of the HTML source reveals that they simply strip out the meta refresh tag altogether, so this memento will stay parked on 2013-06-27 no matter how long it stays in the client.

In summary:

  • WebCitation did not rewrite the URI and thus created a zombie
  • Internet Archive (and other Wayback archives) rewrites the URI, but because of site-specific canonicalization, it violates the user's expectations with a single time jump (the distance of which is dependent on the sparsity of the archive)
  • Perma.cc rewrites the URI, but in this case, because it is a new archive, produces a 404 instead of a time jump
  • Archive.today strips the meta refresh tag and avoids the behavior altogether

--Michael

2014-07-14: The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript

One very large part of digital preservation is the act of crawling and saving pages on the live Web into a format for future generations to view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often built for this purpose. Because these tools are purpose-built, users expect them to perform much better than a general-purpose tool would.

As anyone that has looked up a complex web page in The Archive can tell you, the more complex the page, the less likely it is that all resources will be captured to replay the page. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.

We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.

In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.

The Archival Acid Test evaluates features that modern browsers execute well but preservation tools have trouble handling. We have grouped these tests into three categories with various tests under each category:

The Basics

  • 1a - Local image, relative to the test
  • 1b - Local image, absolute URI
  • 1c - Remote image, absolute
  • 1d - Inline content, encoded image
  • 1e - Scheme-less resource
  • 1f - Recursively included CSS

JavaScript

  • 2a - Script, local
  • 2b - Script, remote
  • 2c - Script inline, DOM manipulation
  • 2d - Ajax image replacement of content that should be in archive
  • 2e - Ajax requests with content that should be included in the archive, test for false positive (e.g., same origin policy)
  • 2f - Code that manipulates DOM after a certain delay (test the synchronicity of the tools)
  • 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
  • 2h - Code that dynamically adds stylesheets

HTML5 Features

  • 3a - HTML5 Canvas Drawing
  • 3b - LocalStorage
  • 3c - External Webpage
  • 3d - Embedded Objects (HTML5 video)

For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).
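
Scoring a tool against tests like these amounts to checking which of the expected test resources actually made it into the capture. The sketch below is illustrative only, using the warcio Python library and hypothetical resource URIs rather than the Acid Test's own grading code:

from warcio.archiveiterator import ArchiveIterator

# Hypothetical URIs that a perfect capture of the test page would contain.
expected = {
    "http://acid.matkelly.com/img/local-relative.png",   # e.g., test 1a
    "http://acid.matkelly.com/js/dom-manipulation.js",   # e.g., test 2c
}

captured = set()
with open("acid-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            captured.add(record.rec_headers.get_header("WARC-Target-URI"))

for uri in sorted(expected):
    print("PASS" if uri in captured else "MISS", uri)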

The results for this initial study (Figure 2) were accepted for publication (see the paper) to the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England.

The actual test we used is available at http://acid.matkelly.com for you to exercise with your tools/websites and the code that runs the site is available on GitHub.

— Mat Kelly (@machawk1)