
Showing posts with the label deferred representations

2020-11-03: 19 Years of Wayback – Inspiring the collection and replay of the web

The Internet Archive’s Wayback Machine is almost 20 years old. As the Wayback Machine nears its second full decade of operation, I reflected on how my research has been inspired by the work that goes into enabling the historical replay of the web.

"The @waybackmachine is officially old enough to vote but not drink this year. 468 Billion web pages and more than 1,000,000,000,000 captures later, the Wayback Machine is still public, free, and committed to access for all. https://t.co/EdYMeVz2q5" (Internet Archive, @internetarchive, October 26, 2020)

I’ve been away from the WS-DL blog for a little while, so a reintroduction is probably worthwhile. I am a PhD alumnus of the WS-DL group and am currently a Principal Researcher at The MITRE Corporation. As you may guess, my work in the WS-DL group focused on web archiving, specifically the crawler and information collection trade-offs of using crawlers that exercise JavaScript on web pages (e.g., Brozzler) vs those that...

2016-04-15: How I learned not to work full-time and get a PhD

ODU's commencement on May 7th marks the last day of my academic career as a student. I began my career at ODU in the Fall of 2004 and graduated with my BS in CS in the Spring of 2008, at which point I immediately began my Master's work under Dr. Levinstein. I completed my MS in Spring 2010, spent the summer with June Wright (now June Brunelle), and started my Ph.D. under Dr. Nelson in the Fall of 2010 (which is referred to as the Great Bait-and-Switch in our family). I will finish in the Spring of 2016, only to return as an adjunct instructor teaching CS418/518 at ODU in the Fall of 2016. On February 5th, I defended my dissertation "Scripts in a Frame: A Framework for Archiving Deferred Representations" (above picture courtesy of Dr. Danette Allen, video courtesy of Mat Kelly). My research in the WS-DL group focused on understanding, measuring, and mitigating the impacts of client-side technologies like JavaScript on the archives. In short, we showed that JavaS...

2015-11-06: iPRES2015 Trip Report

From November 2nd through November 5th, Dr. Nelson, Dr. Weigle, and I attended the iPRES2015 conference at the University of North Carolina at Chapel Hill. This served as a return visit for Drs. Nelson and Weigle; Dr. Nelson worked at UNC through a NASA fellowship and Dr. Weigle received her PhD from UNC. We also met with Martin Klein, a WS-DL alumnus now at the UCLA Library. While the last ODU contingent to visit UNC was not so lucky, we returned to Norfolk relatively unscathed. Cal Lee and Helen Tibbo opened the conference with a welcome on November 3rd, followed by Nancy McGovern's keynote address, delivered with Leo Konstantelos and Maureen Pennock. This was not a traditional keynote but an interactive dialogue in which several challenge areas were presented to the audience, and the audience responded, live and on Twitter, with significant achievements or advances in those challenge areas from #lastyear. For example, Dr. Nelson identified the #iCanHazMemento...

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focus on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript. For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax. For example, the Kelley Blue Book Car Values website (Figure 1) uses...
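
To make the "peeking" limitation concrete, here is a minimal Python sketch of static URI extraction from script text. It is not Heritrix's actual extractor; the regular expression, the function name `extract_literal_uris`, and the sample script with its `example.com` URIs are all assumptions made for illustration. The point it shows is that a scan which never executes the script can only find URIs that appear verbatim in the source.

```python
import re

# Assumed, simplified pattern: absolute http(s) URIs written literally in the
# script source. A real crawler's extractor is more elaborate than this.
URI_PATTERN = re.compile(r"""https?://[^\s"'<>)]+""")


def extract_literal_uris(script_text: str) -> list[str]:
    """Return URIs that appear verbatim in the script, without executing it."""
    return URI_PATTERN.findall(script_text)


# Hypothetical embedded script: one literal URI, one URI assembled at run time.
SCRIPT = '''
var logo = "http://example.com/img/logo.png";
var host = "example.com";
var feed = "http://" + host + "/api/values?year=" + 2015;
'''

found = extract_literal_uris(SCRIPT)

# The literal URI is discovered by the static scan.
print("discovered:", found)

# The URI the browser would actually request only exists after the string
# concatenation runs, so a scan that does not execute the script never sees it.
runtime_uri = "http://example.com/api/values?year=2015"
print("missed:", runtime_uri not in found)
```

A crawler that drives a real or headless browser would see the run-time request as well, at the cost of executing the page's scripts for every capture.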