Posts

Showing posts from June, 2015

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

Image
My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retriev…

2015-06-26: JCDL 2015 Doctoral Consortium

Image
Mat Kelly attended and presented at the JCDL 2015 Doctoral Consortium. This is his report.                           Evaluating progress between milestones in a PhD program is difficult due to the inherent open-endedness of research. A means of evaluating whether a student's topic is sound and has merit while still early on in his career is to attend a doctoral consortium. Such an event, as the one held at the annual Joint Conference on Digital Libraries (JCDL), has previously provided a platform for WS-DL students (see 2014, 2013, 2012, and others) to network with faculty and researchers from other institutions as well as observe the approach that other PhD students at the same point in their career use to explain their respective topics.As the wheels have turned, I have showed enough progress in my research for it to be suitable for preliminary presentation at the 2015 JCDL Doctoral Consortium -- so did so this past Sunday in Knoxville, Tennessee. Along with seven other graduate…

2015-06-09: Mobile Mink merges the mobile and desktop webs

Image
As part of my 9-to-5 job at The MITRE Corporation, I lead several STEM outreach efforts in the local academic community. One of our partnerships with the New Horizon's Governor's School for Science and Technology pairs high school seniors with professionals in STEM careers. Wes Jordan has been working with me since October 2014 as part of this program and for his senior mentorship project as a requirement for graduation from the Governor's School.

Wes has developed Mobile Mink (soon to be available in the Google Play store). Inspired by Mat Kelly's Mink add-on for Chrome, Wes adapted the functionality to an Android application. This blog post discusses the motivation for and operation of Mobile Mink.

Motivation The growth of the mobile web has encouraged web archivists to focus on ensuring its thorough archiving. However, the mobile URIs are not as prevalent in the archives as their non-mobile (or as we will refer to them: desktop) URIs. This is apparent when we compar…

2015-06-09: Web Archiving Collaboration: New Tools and Models Trip Report

Image
Mat Kelly and Michele Weigle travel to and present at the Web Archiving Collaboration Conference in NYC.                           On June 4 and 5, 2015, Dr. Weigle (@weiglemc) and I (@machawk1) traveled to New York City to attend the Web Archiving Collaboration conference held at the Columbia School of International and Public Affairs. The conference gave us an opportunity to present our work from the incentive award provided to us by Columbia University Libraries and the Andrew W. Mellon Foundation in 2014.Robert Wolven of Columbia University Libraries started off the conference with welcoming the audience and emphasizing the variety of presentations that were to occur on that day. He then introduced Jim Neal, the keynote speaker.Jim Neal starting by noting the challenges of "repository chaos", namely, which version of a document should be cited for online resources if multiple versions exist. "Born-digital content must deal with integrity", he said, "and re…