Monday, April 20, 2015

2015-04-20: Virginia Space Grant Consortium Student Research Conference Report

Mat Kelly and other graduate students from across Virginia present their research at the Virginia Space Grant Consortium Student Research Conference.

On Friday, April 17, 2015, I attended the Virginia Space Grant Consortium (VSGC) Student Research Conference at NASA Langley Research Center (LaRC) in Hampton, Virginia. This conference is slightly beyond the scope of what we at ODU WS-DL (@webscidl) usually investigate, as the requirement was that the research be relevant to NASA's objectives as a space agency.

My previous work with LaRC's satellite imagery allowed me to approach the imagery files from the perspective of a computational scientist. More on my presentation, "Facilitation of the A Posteriori Replication of Web Published Satellite Imagery", appears below.

The conference started off with registration and a continental breakfast. Mary Sandy, the VSGC Director, and Chris Carter, the VSGC Deputy Director, began by describing the history of the Virginia Space Grant Consortium program, including the amount contributed since its inception and the number of recipients who have benefited from its funding.

The conference was organized into concurrent themed sessions, each consisting of two to three presentations by undergraduate and graduate students from various Virginia universities.

First Concurrent Sessions

I attended the "Aerospace" session in the first morning session. In this session Maria Rye (Virginia Tech) started with her explorative research in suppressing distortions in tensegrity systems, a flexible structure held together by interconnected bars and tendons.

Marie Ivanco (Old Dominion University) followed Maria with her research in applying Analytic Hierarchy Processes (AHPs) for analytical sensitivity analysis and local inconsistency checks for engineering applications.

Peter Marquis (Virginia Tech) spoke third in the session with his research on characterizing the design variables to trim the LAICE CubeSat to obtain a statically stable flight configuration.

Second Concurrent Sessions

The second session seamlessly continued with Stephen Noel (Virginia Tech) presenting similar work related to LAICE. His work consisted of developing software to read, parse, and interpret calibration data for the system.

Cameron Orr (Virginia Tech) presented the final work in the second Aerospace session, exploring the development of adapted capacitance manometers for thermospheric applications. Introducing this additional component, along with some detection circuitry, allowed more accurate measurement of pressure changes.

Third Concurrent Sessions

After a short break where posters from graduate students around Virginia were presented, I opted to move to another room to view the Applied Science presentations.

Atticus Stovall (University of Virginia) described his system for modeling forest carbon, using height-to-biomass relationships as well as voxel-based volume modeling as a means of evaluating the amount of carbon stored.

Matthew Giarra (Virginia Tech) wrapped up the short session with a visual investigation of the flow of hemolymph (blood) in insects' bodies as a potential model for non-directional fluid pumping.

Fourth Concurrent Sessions

The third session immediately segued into the fourth session of the day, where I changed rooms to attend the Astrophysics presentations.

Charles Fancher (William & Mary) presented work on a theoretical prototype for an ultracold atom-based magnetometer for accurate timekeeping in space.


John Blalock (Hampton University) presented next in the Astrophysics session with his work on using various techniques to measure wind speeds on Saturn from the results returned by the Cassini orbiter's Imaging Science Subsystem.


Kimberly Sokal (University of Virginia) wrapped up the fourth session with her enthusiastic presentation on emerging super star clusters containing Wolf-Rayet stars. Her group discovered that the star cluster S26 in NGC 4449 is undergoing an evolutionary transition that is not well understood. The ongoing work may provide insight into the tipping point of the emerging process that affects a super star cluster's ability to remain bound.

The conference then broke for an invitation-only lunch with a keynote address by Dr. David Bowles, Acting Director of NASA Langley Research Center.

Fifth Concurrent Sessions

For the final session of the day, I attended and presented at the Astrophysics session. Emily Mitchell (University of Virginia) presented first with her study of irradiation effects on H2-laden porous water ice films in the interstellar medium (ISM). She exposed ice to hydrogen gas at different pressures after deposition and during irradiation. She reports that H2 concentration increases with decreasing ion flux, suggesting that as much as 7 percent solid H2 is trapped in interstellar ice by radiation impacts.

Following Emily, I (your author, Mat Kelly of Old Dominion University) presented my work on the Facilitation of the A Posteriori Replication of Web Published Satellite Imagery. By creating software to mine the metadata and a system that allows peer-to-peer sharing of the public domain satellite imagery currently located solely on the NASA Langley servers, I was able to mitigate the reliance on a single source of the data. The system I created utilizes concepts from ResourceSync, BitTorrent, and WebRTC.

Wrap Up

The Virginia Space Grant Consortium Student Research Conference was extremely interesting despite being somewhat different in topic compared to our usual conferences. I am very glad that I got the opportunity to do the research for the fellowship and hope to progress the work for further applications beyond satellite imagery.

Mat (@machawk1)

Sunday, April 5, 2015

2015-04-05: From Student To Researcher...

In 2010, I decided to resume my studies at the Old Dominion University Computer Science Department in pursuit of better employment opportunities. After taking some classes, I realized that I did not merely want to take classes and earn a Master's Degree, but also wanted to contribute knowledge, like those who wrote the many research papers I had read during my courses.

My Master's Thesis is titled "Avoiding Spoilers On MediaWiki Fan Sites Using Memento".   I came to the topic via a strange route.

During Dr. Nelson's Introduction to Digital Libraries course, we built a digital library based on a single fictional universe.  I chose the television show Lost, and specifically archived Lostpedia, a site that my wife and I used while watching and discussing the show.  We realized that fans were updating Lostpedia while episodes aired.  This highlighted the idea that wiki revisions created prior to the episode obviously did not contain information about that episode, and emphasized that episodes led to wiki revisions.

A few years later, a discussion about Game of Thrones came up at work.  I realized that some of us had seen the previous night's episode while others had not.  We wanted to use the Game of Thrones Wiki to continue our conversation, but realized that those who had not seen the episode easily encountered spoilers.  By this point, I was quite familiar with Memento, had used Memento for Chrome, and was working on the Memento MediaWiki Extension.  The idea of using Memento to avoid spoilers was born.

The figure above exhibits the Naïve Spoiler Concept.  The concept is that wiki revisions in the past of a given episode should not contain spoilers, because the information has not yet been revealed by the episode, so fans could not have written about it.  Conversely, wiki revisions in the future of a given episode will likely contain spoilers, seeing as episodes cause fans to write wiki revisions.

It turned out that there was more to avoiding spoilers in fan wiki sites than merely using Memento and the Naïve Spoiler Concept.  Most TimeGates use a heuristic that is not reliable for avoiding spoilers, so I proposed a new one and demonstrated why the existing heuristic was insufficient by calculating the probability of encountering a spoiler using the current heuristic.  I also used the Memento MediaWiki Extension to demonstrate this new heuristic in action.  In this way I was able to develop a Computer Science Master's Thesis on the topic.

Mindist (minimum distance) is the heuristic used by most TimeGates. This works well for a sparse archive, because often the closest memento to the datetime you have requested is best.  Wikis have access to every revision, allowing us to use a new heuristic, minpast (minimum distance in the past, i.e., minimum distance without going over the given date).  Using records from fan wikis, I showed that, if one is trying to avoid spoilers, there can be as much as a 66% chance of encountering a spoiler if we use the Wayback Machine or a Memento TimeGate using mindist.  I also analyzed Wayback Machine logs for wikia.com requests and found that 19% of those requests ended up at a memento in the future of the requested datetime.  From these studies, it was clear that using minpast directly on wikis was the best way to avoid spoilers.
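To make the difference between the two heuristics concrete, here is a minimal sketch in Python. The dates, URIs, and function names are hypothetical illustrations, not code from the thesis.

from datetime import datetime

# Wiki revisions (or mementos) modeled as (datetime, uri) pairs; hypothetical data.
mementos = [
    (datetime(2014, 4, 1, 12, 0), "https://example.org/wiki/rev/100"),
    (datetime(2014, 4, 7, 2, 30), "https://example.org/wiki/rev/101"),
    (datetime(2014, 4, 7, 4, 15), "https://example.org/wiki/rev/102"),
]

def mindist(mementos, target):
    """Closest memento to the target datetime, past or future (spoiler risk)."""
    return min(mementos, key=lambda m: abs(m[0] - target))

def minpast(mementos, target):
    """Closest memento at or before the target datetime (spoiler safe)."""
    past = [m for m in mementos if m[0] <= target]
    return max(past, key=lambda m: m[0]) if past else None

# An episode airs on the evening of April 6 and the viewer has not yet seen it.
request_time = datetime(2014, 4, 6, 21, 0)
print(mindist(mementos, request_time))  # may select a post-episode revision (spoiler)
print(minpast(mementos, request_time))  # selects the last pre-episode revision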

While I was examining fan wikis for spoilers, I also had the opportunity to compare wiki revisions with mementos recorded by the Internet Archive.  Using this information, I was able to reveal how the Internet Archive's sparsity is changing over time.  Because wikis keep track of every revision, we can see which updates the Internet Archive missed.

In the figure above, we see a timeline for each wiki page included in the study.  The X-axis shows time and the Y-axis consists of an identifier for each wiki page.  Darker colors indicate more missed updates by the Internet Archive.  We see that the colors are getting lighter over time, meaning that the Internet Archive has become more aggressive in capturing pages.

Below are the slides for the presentation, available on my SlideShare account, followed by the video of my defense posted to YouTube.  The full document of my Master's Thesis is available here.







Thanks to Dr. Irwin Levinstein and Dr. Michele Weigle for serving on my committee.  Their support has been invaluable during this process. Were it not for Dr. Levinstein, I would not have been able to become a graduate student.  Were it not for Dr. Weigle's wonderful Networking class, I would not have been able to draw some of the conclusions necessary to complete this thesis.

Much of the thanks goes to my advisor, Dr. Michael L. Nelson, who spent hours discussing these concepts with me, helping correct my assumptions and assessments when I erred and offering praise when I came up with something original and new.  His patience and devotion, not only to the area of study but also to the art of mentoring, led me down the path of success.

In the process of creating this thesis, I also created a technical report which can be referenced using the BibTeX code below.



So, what is next?  Do I use wikis to study the problem of missed updates in more detail? Do I study the use of the naïve spoiler concept in another setting?  Or do I do something completely different?

I realize that I have merely begun my journey from student to researcher, but know even more now that I will enjoy the path I have chosen.

--Shawn M. Jones, Researcher

Monday, March 23, 2015

2015-03-23: 2015 Capital Region Celebration of Women in Computing (CAPWIC)

On February 27-28, I attended the 2015 Capital Region Celebration of Women in Computing (CAPWIC) in Harrisonburg, VA on the campus of James Madison University.  Two of our graduating Master's students, Apeksha Barhanpur (ACM president) and Kayla Henneman (ACM-W president), attended with me.

With the snow that had blanketed the Hampton Roads region, we were lucky to get out of town on Friday morning.  We were also lucky that Harrisonburg had gotten its foot of snow the previous weekend, so there was plenty of time for all of the roads to be cleared.  We had some lovely scenery to view along the way.

We arrived a little late on Friday afternoon, but Apeksha and Kayla were able to attend "How to Get a Tech Job" by Ann Lewis, Director of Engineering at Pedago.  This talk focused on how each student has to pick the right field of technology for their career. The speaker presented some basic information on the different fields of technology and different levels of job positions and companies. The speaker also mentioned the "Because Software is Awesome" Google Group, which is a private group for students seeking information on programming languages and career development.

While they attended the talk, I caught up with ODU alum and JMU assistant professor, Samy El-Tawab.

After a break, I put on my Graduate Program Director hat and gave a talk titled "What's Grad School All About?"


I got to reminisce about my grad school days, share experiences of encountering the imposter syndrome, and discuss the differences between the MS and PhD degrees in computer science.


After my talk, we set up for the College and Career Fair.  ODU served as an academic sponsor, meaning that we got a table where we were able to talk with several women interested in graduate school.  Apeksha and Kayla also got to pass out their resumes to the companies that were represented.

I also got to show off my deck of Notable Women in Computing playing cards.  (You can get your own deck at notabletechnicalwomen.org.)


Our dinner keynote, "Technology and Why Diversity Matters," was given by Sydney Klein, VP for Information Security and Risk Management at Capital One. (Capital One had a huge presence at the conference.) One thing she emphasized is that Capital One now sees itself as more of a technology company than a bank. Klein spoke about the importance of women in technology and the percentages of women that are represented in the field at various levels. She also mentioned various opportunities present within the market for women.

After dinner, we had an ice breaker/contest where everyone was divided into groups with the task of creating a flag representing the group and its relation to the field of computer science. Apeksha was on the winning team!  Their flag represented the theme of the conference, "Women make the world work", and how they were connected to the field of technology. Apeksha noted that it was a great experience to work with a group of women from different regions around the world.

On Saturday morning, Apeksha and Kayla attended the "Byte of Pi" talk given by Tejaswini Nerayanan and Courtney Christensen from FireEye. They demonstrated programming using the Raspberry Pi, a single-board computer.  The students were given a small demonstration on writing code and building projects.

Later Saturday, my grad school buddy, Mave Houston arrived for her talk.  Mave is the Founder and Head of USERLabs and User Research Strategy at Capital One. Mave gave a great talk, titled "Freedom to Fail". She also talked about using "stepping stones on the way to success." She let us play with Play-Doh, figuring out how to make a better toothbrush. My partner, a graduate student at Virginia State University, heard me talk about trying to get my kids interested in brushing their teeth and came up with a great idea for a toothbrush with buttons that would let them play games and give instructions while they brushed. Another group wanted to add a sensor that would tell people where they needed to focus their brushing.

We ended Saturday with a panel on graduate school that both Mave and I participated in; hopefully we encouraged some of the attending students to continue their studies.

-Michele

Tuesday, March 10, 2015

2015-03-10: Where in the Archive Is Michele Weigle?

(Title is an homage to a popular 1980s computer game "Where in the World Is Carmen Sandiego?")

I was recently working on a talk to present to the Southeast Women in Computing Conference about telling stories with web archives (slideshare). In addition to our Hurricane Katrina story, I wanted to include my academic story, as told through the archive.

I was a grad student at UNC from 1996-2003, and I found that my personal webpage there had been very well preserved.  It's been captured 162 times between June 1997 and October 2013 (https://web.archive.org/web/*/http://www.cs.unc.edu/~clark/), so I was able to come up with several great snapshots of my time in grad school.

https://web.archive.org/web/20070912025322/http://www.cs.unc.edu/~clark/
Aside: My UNC page was archived 20 times in 2013, but the archived pages don't have the standard Wayback Machine banner, nor are their outgoing links re-written to point to the archive. For example, see https://web.archive.org/web/20130203101303/http://www.cs.unc.edu/~clark/
Before I joined ODU, I was an Assistant Professor at Clemson University (2004-2006). The Wayback Machine shows that my Clemson home page was only crawled 2 times, both in 2011 (https://web.archive.org/web/*/www.cs.clemson.edu/~mweigle/). Unfortunately, I no longer worked at Clemson in 2011, so those both return 404s:


Sadly, there is no record of my Clemson home page. But, I can use the archive to prove that I worked there. The CS department's faculty page was captured in April 2006 and lists my name.

https://web.archive.org/web/20060427162818/http://www.cs.clemson.edu/People/faculty.shtml
Why does the 404 show up in the Wayback Machine's calendar view? Heritrix archives every response, no matter the status code. Everything that isn't 500-level (server error) is listed in the Wayback Machine. Redirects (300-level responses) and Not Founds (404s) do record the fact that the target webserver was up and running at the time of the crawl.
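One way to see those recorded status codes for yourself is to query the Wayback Machine's public CDX API. The short Python sketch below is only illustrative; the endpoint and field names reflect my understanding of that API rather than anything shown in this post.

import requests  # third-party HTTP client, assumed to be installed

# List each capture of a page along with the status code recorded at crawl time.
# The CDX endpoint and the "fl" field names are assumptions about the public API.
CDX = "http://web.archive.org/cdx/search/cdx"
params = {
    "url": "www.cs.clemson.edu/~mweigle/",
    "output": "json",
    "fl": "timestamp,statuscode",
}

rows = requests.get(CDX, params=params, timeout=30).json()
for timestamp, statuscode in rows[1:]:  # the first row is a header row
    print(timestamp, statuscode)        # e.g., a 2011 capture recorded as a 404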

Wouldn't it be cool if when I request a page that 404s, like http://www.cs.clemson.edu/~mweigle/, the archive could figure out that there is a similar page (http://www.cs.unc.edu/~clark/) that links to the requested page?
https://web.archive.org/web/20060718131722/http://www.cs.unc.edu/~clark/
It'd be even cooler if the archive could then figure out that the latest memento of that UNC page now links to my ODU page (http://www.cs.odu.edu/~mweigle/) instead of the Clemson page. Then, the archive could suggest http://www.cs.odu.edu/~mweigle/ to the user.

https://web.archive.org/web/20120501221108/http://www.cs.unc.edu/~clark/
I joined ODU in August 2006.  Since then, my ODU home page has been saved 53 times (https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/).

The only memento from 2014 is on Aug 9, 2014, but it returns a 302 redirecting to an earlier memento from 2013.



It appears that Heritrix crawled http://www.cs.odu.edu/~mweigle (note the lack of a trailing /), which resulted in a 302, but http://www.cs.odu.edu/~mweigle/ was never crawled. The Wayback Machine's canonicalization is likely the reason that the redirect points to the most recent capture of http://www.cs.odu.edu/~mweigle/. (That is, the Wayback Machine knows that http://www.cs.odu.edu/~mweigle and http://www.cs.odu.edu/~mweigle/ are really the same page.)
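As a toy illustration of that canonicalization idea (to the best of my knowledge the Wayback Machine uses SURT-based rules that do far more than this; the hypothetical function below only handles the trailing-slash case discussed here):

# Toy canonicalization: both spellings of the URI map to the same index key,
# so a redirect captured for one can be resolved against captures of the other.
def canonicalize(uri: str) -> str:
    return uri if uri.endswith("/") else uri + "/"

assert canonicalize("http://www.cs.odu.edu/~mweigle") == \
       canonicalize("http://www.cs.odu.edu/~mweigle/")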

My home page is managed by wiki software and the web server does some URL re-writing. Another way to get to my home page is through http://www.cs.odu.edu/~mweigle/Main/Home/, which has been saved 3 times between 2008 and 2010. (I switched to the wiki software sometime in May 2008.) See https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/Main/Home/

Since these two pages point to the same thing, should these two timemaps be merged? What happens if at some point in the future I decide to stop using this particular wiki software and end up with http://www.cs.odu.edu/~mweigle/ and http://www.cs.odu.edu/~mweigle/Main/Home/ being two totally separate pages?

Finally, although my main ODU webpage itself is fairly well-archived, several of the links are not.  For example, http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe is not archived.


Also, several of the links that are archived have not been recently captured.  For instance, the page with my list of students was last archived in 2010 (https://web.archive.org/web/20100621205039/http://www.cs.odu.edu/~mweigle/Main/Students), but none of these students are still at ODU.

Now, I'm off to submit my pages to the Internet Archive's "Save Page Now" service!

--Michele

Monday, March 2, 2015

2015-03-02 Reproducible Research: Lessons Learned from Massive Open Online Courses

Source: Dr. Roger Peng (2011). Reproducible Research in Computational Science. Science 334: 122

Have you ever needed to look back at a program and research data from lab work performed last year, last month or maybe last week and had a difficult time recalling how the pieces fit together? Or, perhaps the reasoning behind the decisions you made while conducting your experiments is now obscure due to incomplete or poorly written documentation.  I never gave this idea much thought until I enrolled in a series of Massive Open Online Courses (MOOCs) offered on the Coursera platform. The courses, which I took during the period from August to December of 2014, were part of a nine course specialization in the area of data science. The various topics included R Programming, Statistical Inference and Machine Learning. Because these courses are entirely free, you might think they would lack academic rigor. That's not the case. In fact, these particular courses and others on Coursera are facilitated by many of the top research universities in the country. The courses I took were taught by professors in the biostatistics department of the Johns Hopkins Bloomberg School of Public Health. I found the work to be quite challenging and was impressed by the amount of material we covered in each four-week session. Thank goodness for the Q&A forums and the community teaching assistants as the weekly pre-recorded lectures, quizzes, programming assignments, and peer reviews required a considerable amount of effort each week.

While the data science courses are primarily focused on data collection, analysis and methods for producing statistical evidence, there was a persistent theme throughout -- this notion of reproducible research. In the figure above, Dr. Roger Peng, a professor at Johns Hopkins University and one of the primary instructors for several of the courses in the data science specialization, illustrates the gap between no replication and the possibilities for full replication when both the data and the computer code are made available. This was a recurring theme that was reinforced with the programming assignments. Each course concluded with a peer-reviewed major project where we were required to document our methodology, present findings and provide the code to a group of anonymous reviewers (other students in the course). This task, in itself, was an excellent way to either confirm the validity of your approach or learn new techniques from someone else's submission.

If you're interested in more details, the following short lecture from one of the courses (16:05), also presented by Dr. Peng, gives a concise introduction to the overall concepts and ideas related to reproducible research.





I received an introduction to reproducible research as a component of the MOOCs, but you might be wondering why this concept is important to the data scientist, analyst or anyone interested in preserving research material. Consider the media accounts in the latter part of 2014 of admonishments for scientists who could not adequately reproduce the results of groundbreaking stem cell research (Japanese Institute Fails to Reproduce Results of Controversial Stem-Cell Research) or the Duke University medical research scandal which was documented in a 2012 segment of 60 Minutes. On the surface these may seem like isolated incidents, but they’re not.  With some additional investigation, I discovered some studies, as noted in a November 2013 edition of The Economist, which have shown reproducibility rates as low as 10% for landmark publications posted in scientific journals (Unreliable Research: Trouble at the Lab). In addition to a loss of credibility for the researcher and the associated institution, scientific discoveries which cannot be reproduced can also lead to retracted publications which affect not only the original researcher but anyone else whose work was informed by possibly erroneous results or faulty reasoning. The challenge of reproducibility is further compounded by technology advances that empower researchers to rapidly and economically collect very large data sets related to their discipline; data which is both volatile and complex. You need only think about how quickly a small data set can grow when it's aggregated with other data sources.


Cartoon by Sidney Harris (The New Yorker)


So, what steps should the researcher take to ensure reproducibility? I found an article published in 2013, which lists Ten Simple Rules for Reproducible Computational Research. These rules are a good summary of the ideas that were presented in the data science courses.
  • Rule 1: For Every Result, Keep Track of How It Was Produced. This should include the workflow for the analysis, shell scripts, along with the exact parameters and input that was used.
  • Rule 2: Avoid Manual Data Manipulation Steps. Any tweaking of data files or copying and pasting between documents should be performed by a custom script.
  • Rule 3: Archive the Exact Versions of All External Programs Used. This is needed to preserve dependencies between program packages and operating systems that may not be readily available at a later date.
  • Rule 4: Version Control All Custom Scripts. Exact reproduction of results may depend upon a particular script. Version control tools such as Subversion or Git can be used to track the evolution of code as it's being developed.
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Intermediate results can reveal faulty assumptions and uncover bugs that may not be apparent in the final results.
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. Using the same random seed ensures exact reproduction of results rather than approximations (a small sketch illustrating this rule appears after the list).
  • Rule 7: Always Store Raw Data behind Plots. You may need to modify plots to improve readability. If raw data are stored in a systematic manner, you can modify the plotting procedure instead of redoing the entire analysis.
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying any summaries.
  • Rule 9: Connect Textual Statements to Underlying Results. Statements that are connected to underlying results can include a simple file path to detailed results or the ID of a result in the analysis framework.
  • Rule 10: Provide Public Access to Scripts, Runs, and Results. Most journals allow articles to be supplemented with online material. As a minimum, you should submit the main data and source code as supplementary material and be prepared to respond to any requests for further data or methodology details by peers.
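As noted under Rule 6, here is a small, hypothetical Python sketch of what following Rules 1, 6, and 7 can look like in practice: fix the random seed, record the exact parameters alongside the result, and store the raw values behind any plot so a figure can be regenerated without redoing the analysis. It is only an illustration, not material from the courses.

import json
import random

# Hypothetical run record illustrating Rules 1, 6, and 7.
PARAMS = {"seed": 42, "n_samples": 1000, "threshold": 0.5}

random.seed(PARAMS["seed"])                      # Rule 6: note the random seed
samples = [random.random() for _ in range(PARAMS["n_samples"])]
result = sum(s > PARAMS["threshold"] for s in samples) / PARAMS["n_samples"]

with open("run_record.json", "w") as f:          # Rules 1 and 7: keep the parameters
    json.dump({"params": PARAMS, "result": result, "raw": samples}, f)  # and raw data
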
In addition to the processing rules, we were also encouraged to adopt suitable technology packages as part of our toolkit. The following list represents just a few of the many products we used to assemble a reproducible framework and also introduce literate programming and analytical techniques into the assignments.
  • R and RStudio: The R language and an integrated development environment for it.
  • Sweave: An R package that allows you to embed R code in LaTeX documents.
  • Knitr: New enhancements to the Sweave package for dynamic report generation. It supports publishing to the web using R Markdown and R HTML.
  • R Markdown: Integrates with knitr and RStudio. Allows you to execute R code in chunks and create reproducible documents for display on the web.
  • RPubs: Web publication tool for sharing R markdown files. The gallery of example documents illustrates some useful techniques.
  • Git and GitHub: Open source version control system and a hosting service for Git repositories.
  • Apache Subversion (SVN): Open source version control system.
  • iPython Notebook: Creates literate webpages and documents interactively in Python. You can combine code execution, text, mathematics, plots and rich media into a single document. This gallery of videos and screencasts includes tutorials and hands-on demonstrations.
  • Notebook Viewer: Web publication tool for sharing iPython notebook files.

As a result of my experience with the MOOCs, I now have a greater appreciation for the importance of reproducible research and all that it encompasses. For more information on the latest developments, you can refer to any of these additional resources or follow Dr. Peng (@rdpeng) on Twitter.

-- Corren McCoy

Tuesday, February 17, 2015

2015-02-17: Reactions To Vint Cerf's "Digital Vellum"

Don't you just love reading BuzzFeed-like articles, constructed solely of content embedded from external sources?  Yeah, me neither.  But I'm going to pull one together anyway.

Vint Cerf generated a lot of buzz last week when, at an AAAS meeting, he gave a talk titled "Digital Vellum".  The AAAS version, to the best of my knowledge, is not online but this version of "Digital Vellum" at CMU-SV from earlier the same week is probably the same.



The media (e.g., The Guardian, The Atlantic, BBC) picked up on it, because when Vint Cerf speaks people rightly pay attention.  However, the reaction from archiving practitioners and researchers was akin to having your favorite uncle forget your birthday, mostly because Cerf's talk seemed to ignore the last 20 or so years of work in preservation.  For a thoughtful discussion of Cerf's talk, I recommend David Rosenthal's blog post.  But let's get to the BuzzFeed part...

In the wake of the media coverage, I found myself retweeting many of my favorite wry responses starting with Ian Milligan's observation:


Andy Jackson went a lot further, using his web archive (!) to find out how long we've been talking about "digital dark ages":



And another one showing how long The Guardian has been talking about it:


And then Andy went on a tear with pointers to projects (mostly defunct) with similar aims as "Digital Vellum":









Andy's dead right, of course.  But perhaps Jason Scott has the best take on the whole thing:



So maybe Vint didn't forget our birthday, but we didn't get a pony either.  Instead we got a dime kitty.

--Michael

2015-02-17: Fixing Links on the Live Web, Breaking Them in the Archive


On February 2nd, 2015, Rene Voorburg announced the JavaScript utility robustify.js. The robustify.js code, when embedded in the HTML of a web page, helps address the challenge of link rot by detecting when a clicked link will return an HTTP 404 and using the Memento Time Travel Service to discover mementos of the URI-R. Robustify.js assigns an onclick event to each anchor tag in the HTML. When the event occurs, robustify.js makes an Ajax call to a service to test the HTTP response code of the target URI.

When robustify.js detects an HTTP 404 response code, it makes an Ajax call to a remote server that uses the Memento Time Travel Service to find mementos of the URI-R, and then uses a JavaScript alert to let the user know that they will be redirected to the memento.

Our recent studies have shown that JavaScript -- particularly Ajax -- normally makes preservation more difficult, but robustify.js is a useful utility that is easily implemented to solve an important challenge. Along this line of thought, we wanted to see how a tool like robustify.js would behave when archived.

We constructed two very simple test pages, both of which include links to Voorburg's missing page http://www.dds.nl/~krantb/stellingen/.
  1. http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html which does not use robustify.js
  2. http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html which does use robustify.js
In robustifyTest.html, when the user clicks on the link to http://www.dds.nl/~krantb/stellingen/, an HTTP GET request is issued by robustify.js to an API that returns an existing memento of the page:

GET /services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F HTTP/1.1
Host: digitopia.nl
Connection: keep-alive
Origin: http://www.cs.odu.edu
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4
Accept: */*
Referer: http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Fri, 06 Feb 2015 21:47:51 GMT
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.15
Access-Control-Allow-Origin: *

The resulting JSON is used by robustify.js to then redirect the user to the memento http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ as expected.
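Conceptually, the client-side flow can be sketched as follows (in Python rather than JavaScript, purely for illustration). The statuscode.php service is the one shown in the trace above, but the Time Travel JSON endpoint and the response field names are my assumptions about the public Memento aggregator API, not necessarily the exact calls robustify.js makes.

import requests  # stands in for the Ajax calls the browser would make

STATUS_SERVICE = "http://digitopia.nl/services/statuscode.php"
TIME_TRAVEL = "http://timetravel.mementoweb.org/api/json/{datetime}/{url}"

def resolve(uri_r: str, datetime: str = "19990830104212") -> str:
    """Return uri_r if it still resolves; otherwise the URI-M of the closest memento."""
    status = requests.get(STATUS_SERVICE, params={"url": uri_r}, timeout=30).json()
    if status.get("statuscode") != 404:          # response field name is an assumption
        return uri_r                             # live page: follow the link as-is
    tt = requests.get(TIME_TRAVEL.format(datetime=datetime, url=uri_r),
                      timeout=30).json()
    return tt["mementos"]["closest"]["uri"][0]   # e.g., the 1999 Wayback Machine memento

print(resolve("http://www.dds.nl/~krantb/stellingen/"))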

Given this success, we wanted to understand how our test pages would behave in the archives. We also included a link to the stellingen memento in our test page before archiving to understand how a URI-M would behave in the archives. We used the Internet Archive's Save Page Now feature to create the mementos at URI-Ms http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html and http://web.archive.org/web/20150206215522/http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html.

The Internet Archive re-wrote the embedded links in the memento to be relative to the archive, converting http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://www.dds.nl/~krantb/stellingen/. Upon further investigation, we noticed that robustify.js does not assign onclick events to anchor tags linking to pages within the same domain as the host page. An onclick event is therefore not assigned to any of the embedded anchor tags, because all of the links point within the Internet Archive, the host domain. Due to this design decision, robustify.js is never invoked within the archive.

When the user clicks the re-written link, the 2015-02-06 memento of http://www.dds.nl/~krantb/stellingen/ does not exist, so the Internet Archive redirects the user to the closest memento, http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/. The user ends up at the 1999 memento because the Internet Archive understands how to redirect from the 2015 URI-M of a memento that does not exist to the 1999 URI-M of a memento that does exist. If the Internet Archive had no memento for http://www.dds.nl/~krantb/stellingen/, the user would simply receive a 404 and not have the benefit of robustify.js using the Memento Time Travel Service to search additional archives.

The robustify.js file is archived (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js), but its embedded URI-Rs are re-written by the Internet Archive.  The original, live web JavaScript has URI templates embedded in the code that are completed at run time by substituting the "yyyymmddhhmmss" and "url" variables into the templates:

archive:"http://timetravel.mementoweb.org/memento/{yyyymmddhhmmss}/{url}",statusservice:"http://digitopia.nl/services/statuscode.php?url={url}"

These templates are rewritten during playback to be relative to the Internet Archive:

archive:"/web/20150206214020/http://timetravel.mementoweb.org/memento/{yyyymmddhhmmss}/{url}",statusservice:"/web/20150206214020/http://digitopia.nl/services/statuscode.php?url={url}"

Because robustify.js is modified during archiving, we wanted to understand the impact of including the URI-M of robustify.js (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) in our test page (http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html). In this scenario, the JavaScript attempts to execute when the user clicks on the page's links, but the re-written URIs point to /web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2 (since test-r.html exists on www.cs.odu.edu, the links are relative to www.cs.odu.edu instead of archive.org).

Instead of issuing an HTTP GET for http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F, robustify.js issues an HTTP GET for
http://www.cs.odu.edu/web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F which returns an HTTP 404 when dereferenced.
The robustify.js script does not handle the HTTP 404 response when looking for its service, and throws an exception in this scenario. Note that the memento that references the URI-M of robustify.js does not throw an exception because the robustify.js script does not make a call to digitopia.nl/services/.

In our test mementos, the Internet Archive also re-writes the URI-M http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/.

This memento of a memento (in a near Yo Dawg situation) does not exist. Clicking on the apparent memento of a memento link leads to the user being told by the Internet Archive that the page is available to be archived.

We also created an Archive.today memento of our robustifyTest.html page: https://archive.today/l9j3O. In this memento, the functionality of the robustify script is removed, so clicking the link sends the user directly to http://www.dds.nl/~krantb/stellingen/, which results in an HTTP 404 response from the live web. The link to the Internet Archive memento is re-written to https://archive.today/o/l9j3O/http://www.dds.nl/~krantb/stellingen/, which results in a redirect (via a refresh) to http://www.dds.nl/~krantb/stellingen/, which again results in an HTTP 404 response from the live web. Archive.today uses this redirect approach as standard operating procedure; it re-writes all links to URI-Ms back to their respective URI-Rs.

This is a different path to a broken URI-M than the Internet Archive takes, but results in a broken URI-M, nonetheless.  Note that Archive.today simply removes the robustify.js file from the memento, not only removing the functionality, but also removing any trace that it was present in the original page.

In an odd turn of events, our investigation into whether a JavaScript tool would behave properly in the archives has also identified a problem with URI-Ms in the archives. If web content authors continue to use URI-Ms to mitigate link rot, or tools that help discover mementos of defunct links, the archives may see more challenges of this nature arise.


--Justin Brunelle