Tuesday, July 22, 2014

2014-07-22: "Archive What I See Now" Project Funded by NEH Office of Digital Humanities

We are grateful for the continued support of the National Endowment for the Humanities and their Office of Digital Humanities for our "Archive What I See Now" project.
In 2013, we received support for 1 year through a Digital Humanities Start-Up Grant.  This week, along with our collaborator Dr. Liza Potts from Michigan State, we were awarded a 3-year Digital Humanities Implementation Grant. We are excited to be one of the seven projects selected this year.

Our project goals are two-fold:
  1. to enable users to generate files suitable for use by large-scale archives (i.e., WARC files) with tools as simple as the "bookmarking" or "save page as" approaches that they already know
  2. to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of the Wayback Machine (wayback).
Our innovation is in allowing individuals to "archive what I see now". The user can create a standard web archive file ("archive") of the content displayed in the browser ("what I see") at a particular time ("now").
Our work focuses on bringing the power of institutional web archiving tools like Heritrix and wayback to humanities scholars through open-source tools for personal-scale web archiving. We are building the following tools:
  • WARCreate - A browser extension (for both Google Chrome and Firefox) that can create an archive of a single webpage in the standard WARC format and save it to local disk. It allows a user to archive pages that are behind authentication or that have been modified through user interaction.
  • WAIL (Web Archiving Integration Layer) - A stand-alone application that provides one-click installation and GUI-based configuration of both Heritrix and wayback on the user’s personal computer.
  • Mink - A browser extension (for both Google Chrome and Firefox) that provides access to archived versions of live webpages. This is an additional Memento client that can be configured to access locally stored WARC files created by WARCreate.
With these three tools, a researcher could, in her normal workflow, discover a web resource (using her browser), archive the resource as she saw it (using WARCreate in her browser), and then later index and replay the archived resource (using WAIL). Once the archived resource is indexed, it would be available for viewing in the researcher’s browser (using Mink).
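
As a rough illustration of the WARC format these tools produce and consume, the sketch below writes a single response record by hand (a simplified approximation, not WARCreate's actual code); real WARC files also contain other record types, such as warcinfo and request records.

import uuid
from datetime import datetime, timezone

def write_warc_response(path, target_uri, http_bytes):
    # Simplified sketch of one WARC/1.0 response record -- not WARCreate's implementation.
    record_headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: %s" % target_uri,
        "Content-Type: application/http; msgtype=response",
        "Content-Length: %d" % len(http_bytes),
    ])
    with open(path, "wb") as warc:
        warc.write(record_headers.encode("utf-8") + b"\r\n\r\n")  # record header block
        warc.write(http_bytes)                                    # raw HTTP response bytes
        warc.write(b"\r\n\r\n")                                   # record separator

# Hypothetical usage; in WARCreate the HTTP bytes come from the browser's own traffic.
write_warc_response("example.warc", "http://www.example.com/",
                    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>...</html>")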

We are looking forward to working with our project partners and advisory board members: Kristine Hanna (Archive-It), Lily Pregill (NY Art Resources Consortium), Dr. Louisa Wood Ruby (Frick Art Reference Library), Dr. Steven Schneider (SUNY Institute of Technology), and Dr. Avi Santo (ODU).

A previous post described the work we did for the start-up grant:
http://ws-dl.blogspot.com/2013/10/2013-10-11-archive-what-i-see-now.html

We've also posted previously about some of the tools (WARCreate and WAIL) that we've developed as part of this project:
http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

See also "ODU gets largest of 11 humanities grants in Virginia" from The Virginian-Pilot.

-Michele

Monday, July 14, 2014

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc.  You can also describe it as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages.  Most of the time JavaScript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident.

First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received.  This is occasionally useful (like loading a file from local disk), but more often than not it seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME type sniffing.  In general, having data formats specify protocol behavior is a bad idea (see the discussion about orthogonality in the W3C Web Architecture), but few can resist the temptation.  Specifically, http-equiv="refresh" makes things even worse, since the HTTP header "Refresh" never officially existed, and it was eventually dropped from the HTML specification as well.

However, it is a nice illustration of a common but non-standard HTML/fake-HTTP extension that nearly everyone supports.  Here's how it works, using www.cnn.com as an example:



This line:

<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>

tells the client to wait 30 minutes (1800 seconds) and then load the URL specified in the optional url= argument (if no URL is provided, the client reloads the current page).  CNN has used this "wait 30 minutes and reload" functionality for many years, and it is certainly desirable for a news site to cause the client to periodically reload its front page.  The problem comes when a page is archived but the refresh tag is 1) not removed or 2) its URL argument is not (correctly) rewritten.

Last week I had loaded a memento of cnn.com from WebCitation, specifically http://webcitation.org/5lRYaE8eZ, which shows the page as it existed on 2009-11-21:


I hid that page, did some work, and then when I came back I noticed that it had reloaded to the page as of 2014-07-11, even though the URL and the archival banner at the top remained unchanged:


The problem is that WebCitation leaves the meta refresh tag as is, causing the page to reload from the live web after 30 minutes.  I had never noticed this behavior before, so I decided to check how some other archives handle it.

The Internet Archive rewrites the URL, so although the client still refreshes the page, it gets an archived page.  Checking:

http://web.archive.org/web/20091121211700/http://www.cnn.com/


we find:

<meta http-equiv="refresh" content="1800;url=/web/20091121211700/http://www.cnn.com/?refresh=1">


But since the IA doesn't know to canonicalize www.cnn.com/?refresh=1 to www.cnn.com, you actually get a different archived page:



Instead of ending up on 2009-11-21, we end up two days in the past at 2009-11-19:


To be fair, ignoring "?refresh=1" is not a standard canonicalization rule, but it could be added (standard caveats apply).  And although this is not quite a zombie, it is potentially unsettling, since the original memento (2009-11-21) is silently exchanged for another memento (2009-11-19; future refreshes will stay on the 2009-11-19 version).  Presumably other Wayback-based archives behave similarly.  Checking the British Library, I saw:

http://www.webarchive.org.uk/wayback/archive/20090914012158/http://www.cnn.com/

redirect to:

http://www.webarchive.org.uk/wayback/archive/20090402030800/http://www.cnn.com/?refresh=1

In this case the jump is more noticeable (five months: 2009-09-14 vs. 2009-04-02) since the BL's archives of cnn.com are sparser.
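
As a sketch of what the site-specific canonicalization rule mentioned above might look like (a hypothetical illustration, not something the Wayback Machine actually applies), the rule would simply drop the refresh argument before looking up the memento:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(uri):
    # Hypothetical rule: ignore the refresh query argument so that
    # www.cnn.com/?refresh=1 maps to the same memento as www.cnn.com.
    parts = urlsplit(uri)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "refresh"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))

print(canonicalize("http://www.cnn.com/?refresh=1"))  # -> http://www.cnn.com/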

Perma.cc behaves similarly to the Internet Archive (i.e., rewriting but not canonicalizing); it is possible that Perma.cc has a Wayback backend, but I'm not sure.  Because it is a newer archive, it did not already have cnn.com archived (I had to push a 2014-07-11 version into Perma.cc), nor does it yet have a "?refresh=1" version of cnn.com.  Checking:

http://perma.cc/89QJ-Y632?type=source


we see:

<meta http-equiv="refresh" content="1800;url=/warc/89QJ-Y632/http://www.cnn.com/?refresh=1"/>

And after 30 minutes it will refresh to a framed 404 because cnn.com/?refresh=1 is not archived:


As Perma.cc becomes more populated, the 404 behavior will likely disappear and be replaced with something like the Internet Archive and British Library examples.

Archive.today is the only archive that correctly handled this situation.  Loading:

https://archive.today/Zn6HS

produces:


A check of the HTML source reveals that they simply strip out the meta refresh tag altogether, so this memento will stay parked on 2013-06-27 no matter how long it stays in the client.
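
Stripping the tag at capture or replay time is straightforward; here is a minimal sketch of one way to do it (my own illustration using BeautifulSoup, not archive.today's actual code):

import re
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

def strip_meta_refresh(html):
    # Remove any <meta http-equiv="refresh" ...> tags so the memento stays parked.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("meta", attrs={"http-equiv": re.compile("^refresh$", re.I)}):
        tag.decompose()
    return str(soup)

page = '<head><meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/></head>'
print(strip_meta_refresh(page))  # -> '<head></head>'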

In summary:

  • WebCitation did not rewrite the URI and thus created a zombie
  • Internet Archive (and other Wayback-based archives) rewrites the URI, but because it lacks a site-specific canonicalization rule, it violates the user's expectations with a single time jump (the distance of which depends on the sparsity of the archive)
  • Perma.cc rewrites the URI, but in this case, because it is a new archive, produces a 404 instead of a time jump
  • Archive.today strips the meta refresh tag and avoids the behavior altogether

--Michael

2014-07-14: The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript

One very large part of digital preservation is the act of crawling and saving pages on the live Web into a format for future generations to view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often built specifically for the task. Because these tools are purpose-built, users expect them to perform much better than a general-purpose tool.

As anyone who has looked up a complex web page in the Internet Archive can tell you, the more complex the page, the less likely it is that all of its resources were captured well enough to replay the page. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.

We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.

In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.

The Archival Acid Test evaluates features that modern browsers execute well but preservation tools have trouble handling. We have grouped these tests into three categories with various tests under each category:

The Basics

  • 1a - Local image, relative to the test
  • 1b - Local image, absolute URI
  • 1c - Remote image, absolute
  • 1d - Inline content, encoded image
  • 1e - Scheme-less resource
  • 1f - Recursively included CSS

JavaScript

  • 2a - Script, local
  • 2b - Script, remote
  • 2c - Script inline, DOM manipulation
  • 2d - Ajax image replacement of content that should be in archive
  • 2e - Ajax requests with content that should be included in the archive, test for false positive (e.g., same origin policy)
  • 2f - Code that manipulates DOM after a certain delay (test the synchronicity of the tools)
  • 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
  • 2h - Code that dynamically adds stylesheets

HTML5 Features

  • 3a - HTML5 Canvas Drawing
  • 3b - LocalStorage
  • 3c - External Webpage
  • 3d - Embedded Objects (HTML5 video)

For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).

The results of this initial study (Figure 2) were accepted for publication (see the paper) at the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England.

The actual test we used is available at http://acid.matkelly.com for you to exercise with your own tools and websites, and the code that runs the site is available on GitHub.
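
If you run a WARC-producing tool against the test (e.g., Wget with its --warc-file option, or WARCreate), a quick way to see which test resources were actually captured is to list the WARC-Target-URI headers in the resulting file. A minimal sketch (the file names here are hypothetical):

import gzip

def captured_uris(warc_path):
    # Yield the WARC-Target-URI of every record in a (possibly gzipped) WARC file.
    opener = gzip.open if warc_path.endswith(".gz") else open
    with opener(warc_path, "rt", encoding="utf-8", errors="replace") as warc:
        for line in warc:
            if line.startswith("WARC-Target-URI:"):
                yield line.split(":", 1)[1].strip()

# e.g., after `wget -p --warc-file=acid http://acid.matkelly.com`:
for uri in sorted(set(captured_uris("acid.warc.gz"))):
    print(uri)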

— Mat Kelly (@machawk1)

Thursday, July 10, 2014

2014-07-10: Federal Cloud Computing Summit



As mentioned in my previous post, I attended the Federal Cloud Computing Summit on July 8th and 9th at the Ronald Reagan Building in Washington, D.C. I helped the host organization, the Advanced Technology And Research Center (ATARC), organize and run the MITRE-ATARC Collaboration Sessions that kicked off the event on July 8th. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on current challenges in cloud computing.

A FedRAMP primer was held at 10:00 AM on July 8th in a Government-only session. At its conclusion, we began the MITRE-ATARC Collaboration Sessions, which focused on Cloud Computing in Austere Environments, Cloud Computing for the Mobile Worker, Security as a Service, and the Impact of Cloud Computing on the Enterprise. Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or the discussions in the collaboration sessions. MITRE will construct a summary document from the discussions that outlines their main points, identifies orthogonal ideas across the challenge areas, and makes recommendations for the Government and academia. For reference, please see the 2013 Federal Cloud Computing Summit Whitepaper.

On July 9th, I attended the Summit itself, a conference-style series of panels and speakers with an industry trade show held before the event and during lunch. From 3:30 to 4:15, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes of the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the third Federal Summit event in which I have participated. They are great events that the Government participants have consistently identified as high-value, and I am excited to start planning the Winter installment of the Federal Cloud Computing Summit.

--Justin F. Brunelle

Tuesday, July 8, 2014

2014-07-08: Presenting WS-DL Research to PES University Undergrads

On July 7th and 8th, 2014, Hany SalahEldeen and I (Mat Kelly) were given the opportunity to present our PhD research to visiting undergraduate seniors from a leading university in Bangalore, India (PES University). About thirty students attended each session and indicated their interest in the topics through the large number of relevant questions they asked.


Dr. Weigle (@weiglemc)

Prior to ODU CS students' presentations, Dr. Michele C. Weigle (@weiglemc) gave the students an overview presentation of some of WS-DL's research topics with her presentation Bits of Research.

In her presentation she covered our lab's foundational work, recent work, some outstanding research questions, and some potential projects to entice interested students to work with our research group.


Mat (@machawk1), your author

Of the two of us, I (Mat Kelly) presented first, giving a fairly high-level yet technical overview titled Browser-Based Digital Preservation, which highlighted my recent work creating WARCreate, a Google Chrome extension that allows web pages to be preserved from the browser.

The talk was not merely a demo of the tool (as was given at Digital Preservation and JCDL 2012); instead, I began with a primer on the dynamics of the web, HTTP, the state of web archiving, and some issues relating to personal versus institutional web archiving, and then turned to the problems that WARCreate addresses. I also covered some other related topics and browser-based preservation dynamics, which can be seen in the slides included in this post.


Hany (@hanysalaheldeen) presented the day after my presentation, giving a general overview of his academic career and research topics. His presentation Zen and the Art of Data Mining covered a wide range of topics, including (but not limited to) temporal user intention, the Egyptian Revolution, and his experience as an ODU WS-DL PhD student (to, again, entice the students).


The opportunity for Hany and me to present what we work on day-to-day to bright-eyed undergraduate students was unique: their interests lie within our research area (computer science), yet as potential graduate students they still have open doors as to which research path to take.

We hope that the presentations and questions we were able to answer were of some help in facilitating their decisions to pursue a graduate career at the Web Science and Digital Libraries Research Lab at Old Dominion University.



— Mat Kelly

2014-07-08: Potential MediaWiki Web Time Travel for Wayback Machine Visitors






Over the past year, I've been working on the Memento MediaWiki Extension.  In addition to trying to produce a decent product, we've also been trying to build support for the Memento MediaWiki Extension at WikiConference USA 2014.  Recently, we've reached out via Twitter to raise awareness and find additional supporters.

To that end, we attempt to answer two questions:
  1. The Memento extension provides the ability to access the page revision closest to, but not after, the datetime specified by the user.  As mentioned in an earlier blog post, the Internet Archive only has access to the revisions of articles that existed at the times it crawled them, but a wiki can access every revision.  How effective is the Wayback Machine at ensuring that visitors gain access to pages close to the datetimes they desire?
  2. How many visitors of the Wayback Machine could benefit from the use of the Memento MediaWiki Extension?
Answering the first question shows why the Wayback Machine is not a suitable replacement for a native MediaWiki Extension.

Answering the second question gives us an idea of the potential user base for the Memento MediaWiki Extension.

Thanks to Yasmin AlNoamany's work in "Who and What Links to the Internet Archive", we have access to 766 GB of (compressed) anonymized Internet Archive logs in a common Apache format.  Each log file represents a single day of access to the Wayback Machine.  We can use these logs to answer these questions.

Effectiveness of accessing closest desired datetime in the Wayback Machine

How effective is the Wayback Machine at ensuring that visitors gain access to pages close to the datetimes they desire?
To answer the first question, I used the following shell command to review the logs.



This command was run against only this single log file, to find an example English Wikipedia page to trace through the logs; it was used solely in answering the first question above.
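
A rough Python equivalent of that kind of one-off search (a hypothetical sketch, not the exact command I ran) looks like this:

import gzip

# Hypothetical sketch: scan one day's (gzipped) Wayback Machine access log for
# requests to English Wikipedia articles; the file name is made up.
with gzip.open("wayback-access.log.gz", "rt", errors="replace") as log:
    for entry in log:
        if "en.wikipedia.org/wiki/" in entry:
            print(entry.rstrip())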

From that search, I found a Wayback Machine capture of a Wikipedia article about the Gulf War.  The logs were anonymized, so of course I couldn't see the actual IP address of the visitor, but I was able to follow the chain of referrers back to see what path the user took as they browsed the Wayback Machine.



We see that the user engages in a Dive pattern, as defined in Yasmin AlNoamany's "Access Patterns for Robots and Humans in Web Archives".
  1. http://web.archive.org/web/20071218235221/angel.ap.teacup.com/gamenotatsujin/24.html
  2. http://web.archive.org/web/20080112081044/http://angel.ap.teacup.com/gamenotatsujin/259.html
  3. http://web.archive.org/web/20071228223131/http://angel.ap.teacup.com/gamenotatsujin/261.html
  4. http://web.archive.org/web/20071228202222/http://angel.ap.teacup.com/gamenotatsujin/262.html
  5. http://web.archive.org/web/20080105140810/http://angel.ap.teacup.com/gamenotatsujin/263.html
  6. http://web.archive.org/web/20071228202227/http://angel.ap.teacup.com/gamenotatsujin/264.html
  7. http://web.archive.org/web/20071228223136/http://angel.ap.teacup.com/gamenotatsujin/267.html
  8. http://web.archive.org/web/20071228223141/http://angel.ap.teacup.com/gamenotatsujin/268.html
  9. http://web.archive.org/web/20080102052100/http://en.wikipedia.org/wiki/Gulf_War

The point of this exercise is not the Japanese blog that the user was initially reading.  From this series of referrers, we see that the user started from a URI with a datetime of 2007/12/18 23:52:21 (the 20071218235221 part of the archive.org URI).  This is the best proxy we have for the Accept-Datetime they would have chosen had they been using Memento.  What they actually got at the end was an article with a Memento-Datetime of 2008/01/02 05:21:00.


So, we might assume that there were no changes to this article between these two dates.  The Wikipedia history for that article shows a different story, listing 51 changes to the article in that time.

The Internet Archive produced a page that maps to revision 181419148 (1 January 2008), rather than revision 178800602 (19 December 2007), which is the closest revision to what the visitor actually desired.

What did the user miss out on by getting the more recent version of the article?  The old revision discusses how the Gulf War was the last time the United States used battleships in war, but an editor in between decided to strike this information from the article.  The old revision listed different countries in the Gulf War coalition than the new revision.

So, because the Internet Archive's Wayback Machine slides the user from date to date, the user ends up with a different revision than originally desired.  This algorithm makes sense in an archival environment like the Wayback Machine, where the mementos are sparse.

The Memento MediaWiki Extension has access to all revisions, meaning that the user can get the revision closest to the date they want.

Potential Memento MediaWiki Extension Users at the Internet Archive

How many visitors of the Wayback Machine could benefit from the use of the Memento MediaWiki Extension?
The second question involves discovering how many visitors are using the Wayback Machine for browsing Wikipedia when they could be using the Memento MediaWiki Extension.

We processed these logs in several stages to find the answer, using different scripts and commands than the one used earlier.

First, a simple grep command, depicted below, was run on each logfile.  The variable $inputfile was the compressed log file, and the $outputfile was stored in a separate location.



Considering that we were looping through 766 GB of data, this took quite some time to complete on our dual-core 2.4 GHz virtual machine with 2 GB of RAM.

As Yasmin AlNoamany showed in "Who and What Links to the Internet Archive", wikipedia.org is the biggest referrer to the Internet Archive, but we wanted direct users, so we had to discard any entries that were merely referred from Wikipedia.  Because Wikipedia links to the Internet Archive to avoid dead links in article references, these logs contain many entries with Wikipedia referrers.

We used the simple Python script below on each of the 288 output files returned from the first pass, stripping out all entries whose referrer contains the string 'wikipedia.org'.



Python was used because it offered better performance than merely using a combination of sed, grep, and awk to achieve the same goal.
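
A minimal sketch of that second pass (an illustration under the assumption that the referrer is the second quoted string in each combined-format entry, not the exact script we ran):

import sys

# Illustrative sketch: drop any log entry whose referrer field contains
# 'wikipedia.org', keeping only direct requests.
with open(sys.argv[1], errors="replace") as infile, open(sys.argv[2], "w") as outfile:
    for entry in infile:
        fields = entry.split('"')
        referrer = fields[3] if len(fields) > 3 else "-"
        if "wikipedia.org" not in referrer:
            outfile.write(entry)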

Once we had stripped those referrer entries from the processed log data, we could count the accesses to Wikipedia with another script. The script below was run with the argument wikipedia.org as the search term. Since we had removed the referrer-only entries, only actual requests for wikipedia.org should remain.



Because each log file represents one day of activity, this script gives us a CSV containing a date and a count of how many wikipedia.org requests occur for that day.
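
That counting step could be sketched roughly as follows (again an illustration; it assumes the date can be taken from the log file's name):

import os
import sys

# Illustrative sketch: count matching requests in each filtered log file and
# print one "date,count" CSV row per file, using the file name (which encodes
# the date) as the first column.
search_term = sys.argv[1]          # e.g., "wikipedia.org"
for path in sys.argv[2:]:
    with open(path, errors="replace") as log:
        count = sum(1 for entry in log if search_term in entry)
    print("%s,%d" % (os.path.basename(path), count))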

Now that we have a list of counts from each day, it is easy to take the numbers from the count column of this CSV and find the mean.  Again, enter Python, because it was simple.
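
A few lines like these will do it (a sketch; the CSV file name is hypothetical, and no header row is assumed):

import csv

with open("wikipedia-counts.csv") as f:
    counts = [int(row[1]) for row in csv.reader(f)]
print(sum(counts) / len(counts))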



It turns out that the Wayback Machine, on average, receives 131,438 requests for Wikipedia articles each day.

If we perform the same exercise for other popular wiki sites on the web, we get the results shown in the table below.

Wiki Site                                  Mean number of daily requests to the Wayback Machine
*.wikipedia.org (All Wikipedia sites)      131,438
*.wikimedia.org (Wikimedia Commons)        26,721
*.wikia.com (All Wikia sites)              9,574

So, there are potentially 168,000 Memento requests per day that could benefit if these wikis used the Memento MediaWiki Extension.

On top of that, these logs represent a snapshot in time for the Wayback Machine only.  The Internet Archive has other methods of access that were not included in this study, so the number of potential Memento requests per day is actually much higher.

Summary


We have established support for two things:
  1. the Memento MediaWiki Extension will produce results closer to the date requested than the Wayback Machine
  2. there are potentially 168,000 Memento requests per day that could benefit from the Memento MediaWiki Extension
Information like this is useful when newcomers ask: who could benefit from Memento?

--Shawn M. Jones

Monday, July 7, 2014

2014-07-07: InfoVis Fall 2012 Class Projects


(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous post: Fall 2011)

The Fall 2012 projects were based on the 2012 ODU Infographics Contest. Participants were tasked with visualizing the history and trajectory of work done in the area of quantum sensing.

Top Quantum Sensing Trends
Created by Wayne Stilwell


This project (currently available at https://ws-dl.cs.odu.edu/vis/quantum-stilwell/) is a visualization for displaying the history and trajectory of quantum sensing. History is shown as a year-by-year slideshow. The most publicized quantum sensing areas for the selected year are displayed. Clicking on a topic shows the number of publications on that subject over time compared to the most popular topic (gray line). This allows users to see when a subject started to rise in popularity and at what point in time (if any) it started to decline. The visualization also shows which research groups have the most publications for the selected subject. When a new year is chosen, animation is used to show which topics increased in popularity and which decreased. The final slide in the visualization is a projection for the year 2025 to show where quantum sensing is headed in the future.  This project won the overall competition.

The video below provides a demo of the tool.



BibTeX Corpus Visualizer
Created by Mat Kelly


One method to find trends in any industry is to examine the publications related to that industry. Given a set of publications, one should be able to extrapolate trends based solely on the publications' metadata, e.g., title, keywords, abstract. Manually analyzing this text data to determine trends is daunting, so another method is needed that analyzes the data and presents it in a way that can be easily consumed by a casual user, who should then be able to identify trends in the respective industry. This project (currently available at https://ws-dl.cs.odu.edu/vis/quantum-kelly/index.php) is a visualization that examines a small corpus consisting of metadata (in BibTeX format) about a collection of articles related to Quantum Sensing. The interface allows a user to explore this data and draw conclusions about many attributes of the data set and industry, including trends.  This project was built using the jQuery and D3.js libraries.

The video below provides a demo of the tool. 


Mat is one of our PhD students and has done other visualization work (described in his IEEE VIS 2013 trip report).

-Michele