2020-11-03: 19 Years of Wayback – Inspiring the collection and replay of the web

The Internet Archive’s Wayback Machine is almost 20 years old. As the Wayback Machine nears the end of its second full decade of operation, I have been reflecting on how my research has been inspired by the work that goes into enabling the historical replay of the web.

The @waybackmachine is officially old enough to vote but not drink this year.

468 Billion web pages and more than 1,000,000,000,000 captures later, the Wayback Machine is still public, free, and committed to access for all. https://t.co/EdYMeVz2q5
— Internet Archive (@internetarchive) October 26, 2020

I’ve been away from the WS-DL blog for a little while, so a reintroduction is probably worthwhile. I am a PhD alumnus of the WS-DL group and am currently a Principal Researcher at The MITRE Corporation. As you may guess, my work in the WS-DL group focused on web archiving, specifically the information collection trade-offs of using crawlers that exercise JavaScript on web pages (e.g., Brozzler) vs those that do not (e.g., Heritrix). My MITRE research includes a heavy focus on web science and continues to be inspired by web archiving, including the work being done at the Internet Archive.

Those of us working in and around archiving theory have benefited from the research and work that enables the Wayback Machine’s operation. That is, capturing and replaying the web has led to considerable theoretic and applied research that has influenced other disciplines within computer science, including the various disciplines involved in my work at ODU and MITRE. That gets us to my reflection: my work has taken advantage of the threads of research associated with the Wayback Machine. While the Wayback Machine is focused on the replay of web archival captures (i.e., mementos), it drives much of the work around web crawling, capture, and storage to make the Wayback Machine replays work. Similarly, my research includes not only the replay of mementos but also has focused on the ways that web scientists can improve web data collection, storage, and replay.

In 2013, Mat Kelly and I took a very early look at the challenges presented to archival replay created by personalized representations such as local news delivered based on the geographic location of the user-agent’s IP address. If we assume that mementos are contributed by human users to a web archive for replay (and it’s worth noting that this is a core principle in Mat’s research – how to share mementos from a personal archive), personalized representations could cause confusion during replay. We presented some replay options that could make replay of personalized mementos in the Wayback Machine feasible and less confusing for users.

In 2015-2016, I worked with a team of MITRE and WS-DL researchers to evaluate the feasibility of using the Internet Archive’s archiving infrastructure on the MITRE intranet. We implemented tools to crawl (i.e., Heritrix) and replay (i.e., the Wayback Machine) on the MITRE intranet and evaluated the viability of using this approach to provide temporal navigation within our corporate environment. We determined that indefinitely capturing and managing improperly stored sensitive material posed too great a risk (e.g., if a project’s sensitive findings were accidentally released on the MITRE intranet, then crawled and made available through the MITRE archival infrastructure, the difficulty of controlling the spill would increase significantly). This investigation tested the limits of re-implementing the Internet Archive's archiving infrastructure against the peculiarities of a corporate intranet.

My work at MITRE has recently pivoted to applying web capture and replay theory to a new web science discipline. I have been a senior advisor on a MITRE internal research project exploring the intersection of web science and web accessibility. This is not my first encounter with web accessibility; Mat and I briefly touched on web accessibility (e.g., Section 508) in our work studying the archivability of websites (including government websites) over time. However, the MITRE research project – named Demodocus – takes my web accessibility experience to a new depth with a slightly different focus. Demodocus began with an investigation into how web science theory – much of which has been driven by the Internet Archive's web archiving practices – can help web accessibility evaluations by taking some of the facets of web accessibility testing and using a web crawler to navigate through a web application while performing accessibility evaluations. This matched web accessibility tools with web archiving theory. The Demodocus research team designed an automatic web application accessibility evaluation framework by implementing the crawling framework from my dissertation on web crawling for deferred representations (which studied the models for crawling and storing JavaScript-driven web applications for replay in the Wayback Machine).

An excerpt from our presentation at iHSED2019 on the intersection of web science, web archiving, and web accessibility. This presentation is Approved for Public Release; Distribution Unlimited. Public Release Case Number 19-2335

The Demodocus framework was inspired by the Internet Archive’s web archiving theory and created a solution at the intersection of web accessibility and web archiving. It also extends my prior research in improving the fidelity and completeness of web application capture. The Demodocus team created a prototype implementation of the framework that performs automatic evaluations of web page accessibility by navigating through the client-side states of a web application (Trevor Bostic, Daniel Chudnov, Brittany Tracy, Jeff Stanley, John Higgins, Justin F. Brunelle. “Demodocus: Automated Web Accessibility Evaluations”, Proceedings of the 2020 ICT Accessibility Testing Symposium, October 2020, pp. 81-90). This project represents the adoption of web archiving research – inspired by the efforts surrounding the Wayback Machine to enable its operation – to the domain of web accessibility.
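To make the idea of "navigating through the client-side states of a web application" a little more concrete, here is a minimal sketch of the general technique: a breadth-first walk over an application's states that runs a check at each one. This is not the Demodocus implementation. The `Element` type, the `APP` state map, and the single "has an accessible label" check are all hypothetical stand-ins I made up for illustration; a real crawler would drive a browser and apply far richer accessibility rules.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """Toy stand-in for a DOM element (hypothetical, for illustration only)."""
    tag: str
    label: str = ""                 # accessible name, e.g. alt text or aria-label
    target: Optional[str] = None    # state reached when the element is activated


# Hypothetical application: each client-side state maps to the elements it renders.
APP: Dict[str, Tuple[Element, ...]] = {
    "home": (
        Element("button", "Open menu", target="menu"),
        Element("img", ""),                        # missing text alternative
    ),
    "menu": (
        Element("button", "", target="dialog"),    # unlabeled control
        Element("button", "Close", target="home"),
    ),
    "dialog": (
        Element("img", "Company logo"),
    ),
}


def check_state(elements: Tuple[Element, ...]) -> List[str]:
    """Return one message per element that lacks an accessible label."""
    return [f"<{e.tag}> has no accessible label" for e in elements if not e.label]


def crawl(app: Dict[str, Tuple[Element, ...]], start: str) -> Dict[str, List[str]]:
    """Breadth-first walk over reachable states, evaluating each state once."""
    findings: Dict[str, List[str]] = {}
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        findings[state] = check_state(app[state])
        for element in app[state]:
            nxt = element.target
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return findings


report = crawl(APP, "home")
for state, problems in report.items():
    print(state, problems)
```

The point of the sketch is that states reachable only through interaction (here, "menu" and "dialog") get evaluated too, which a crawler that only looks at the initially loaded page would miss.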

Upon reflection on my recent research threads, I recognize the impact that the Internet Archive’s Wayback Machine has made on domains beyond web archiving. Along with the obvious contributions to web archiving, these research efforts have benefited the exploration of corporate knowledge management and have led to new methods of evaluating web application accessibility. I found it personally gratifying to take inspiration from the celebration of the Wayback Machine’s nearly two decades of work to perform a bit of a retrospective on my admittedly young career as a web science researcher.

Here's to the prep for the Wayback’s 20th birthday party!

-- Justin F. Brunelle

Approved for public release. Distribution unlimited 20-02582-1.
