2022-08-04: Web Archiving in Popular Media II: User Tasks of Journalists


Figure 1: The two most common goals for journalists who use web archives as evidence in their articles are to view unavailable pages and to view page content change over time.

Different groups of users collectively have different levels of understanding about web archives. In a previous post from 2016, Web Archiving in Popular Media, Scott Ainsworth demonstrated the emergence of web archives as evidence in journalism. The list of news articles that he presented as examples was a novel contribution at that time. Users with a strong mental model for the past web could benefit from advanced web archives features such as full-text search. What is the current mental model for web archives held by journalists, and how do web archives help journalists?

Collecting Articles that Reference Web Archives

In "Where'd it Go?" (2007), Teevan analyzed a set of web pages that were collected from a phrase search with the goal of understanding user behavior about re-finding [1]. By searching for a salient phrase with a news aggregator, I used this same methodology to analyze how journalists currently perceive and use web archives.

I chose the search phrase "Wayback Machine" because of the Internet Archive's prominence. I conducted one manual search for this phrase using Google News on May 11, 2022. Since this search yielded appropriate results, I automated the search with the GNews Python API. In addition to specifying keywords, GNews queries can also restrict results to a customizable time span. Based on the article dates and distribution seen in the manual search, I chose a time span of 14 days. Each GNews query returns 100 results in JSON format. I executed this GNews query every two weeks starting May 25, 2022 and ending July 6, 2022.
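As a rough sketch of the collection schedule described above, the query dates and the 14-day window each one covers can be derived with the standard library. The function name `query_schedule` is hypothetical, and the GNews call itself is left out because the post does not give its exact parameters.

```python
from datetime import date, timedelta


def query_schedule(first, last, step_days=14):
    """Yield (window_start, query_date) pairs, one per scheduled search.

    Each query is assumed to look back `step_days` days, so consecutive
    windows tile the collection period without gaps.
    """
    step = timedelta(days=step_days)
    current = first
    while current <= last:
        yield current - step, current
        current += step


# Dates taken from the post: first automated query May 25, last July 6, 2022.
schedule = list(query_schedule(date(2022, 5, 25), date(2022, 7, 6)))

for window_start, query_date in schedule:
    print(f"query on {query_date}: articles published since {window_start}")
```

With a two-week step, each window starts where the previous one ended, so an article should fall into only one result set, assuming the aggregator honors the time span exactly.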

When Teevan searched for "Where'd it go," about 25% of the results contained usable information about re-finding behavior. Similarly, not all of the articles collected by searching for the phrase "Wayback Machine" included information about how journalists use web archives. Some of the articles included "wayback machine" as a colloquial phrase, while others were articles about web archives. Ultimately, 106 articles contained evidence showing how journalists use web archives.

I analyzed the articles after each search iteration and categorized them by user task, based on how the journalist was using the memento from the web archive. There are a variety of ways journalists use web archives when investigating stories, but not all user tasks were present in each result set. The distribution of articles coded with each task was also different between result sets; considering the count of articles coded with each task cumulatively resulted in a more accurate task distribution. The last set of articles collected on July 6 showed that the list of tasks and their distribution was stable. The article data set can be found on GitHub.
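The cumulative counting described above can be illustrated with a small sketch. The task labels and result sets below are hypothetical placeholders, not the actual coded data, and `cumulative_distributions` is an illustrative name.

```python
from collections import Counter

# Hypothetical task labels per result set; NOT the post's data.
result_sets = [
    ["deleted page", "content evolution", "deleted page"],
    ["deleted content", "deleted page"],
    ["content evolution", "deleted page", "deleted content"],
]


def cumulative_distributions(sets):
    """Return the cumulative share of each task after every result set."""
    running = Counter()
    snapshots = []
    for labels in sets:
        running.update(labels)  # add this result set to the running totals
        total = sum(running.values())
        snapshots.append({task: count / total for task, count in running.items()})
    return snapshots


for i, dist in enumerate(cumulative_distributions(result_sets), start=1):
    print(i, {task: round(share, 2) for task, share in sorted(dist.items())})
```

Comparing consecutive snapshots is one simple way to judge whether the distribution has stabilized: once adding a result set barely moves the shares, further collection is unlikely to change the picture.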

User Tasks of Journalists

One way that journalists use web archives is to view web pages that are no longer available on the live web. When trying to view unavailable content, the most common task was viewing a single page that has been deleted. For example, these pages could have a 404 HTTP status code, while the parent site is still available on the live web. A more advanced use of web archives by journalists is to investigate how content changes on web pages over time. The top two tasks matching this goal are viewing how content has evolved on a page over time and viewing content (such as a sentence) that has been deleted from a page that is still available on the live web. Other tasks in this category include calculating the lifespan of certain content, determining when content was added to a page, comparing a previous version to the current version, and examining terminology evolution on a page.

Below are a few examples of articles where journalists used web archives to examine the change in web pages over time.

In "Did Herschel Walker Lie About Lying? GOP Senate Candidate Denies Making False Claim About Graduating College" (2022-05-27), People.com used web archives to show that the political candidate Herschel Walker removed the statement about having earned a college degree from his biography on his website.

Figure 2: Comparison of mementos 20211215212100 and 20211217135409 for https://www.teamherschel.com/about/


In "QS Ranks Russian Universities, Contrary to Original Plan" (2022-06-13), InsideHigherEd.com used web archives to show that the sentence about Quacquarelli Symonds excluding Russian universities from being ranked had been removed from the company's Ukraine Crisis statement.

Figure 3: Comparison of mementos 20220307192336 and 20220324094041 for https://www.qs.com/ukraine-crisis/ 


In "Army locked public report after story on overdue suicide regulation" (2022-05-27), ArmyTimes.com used the Internet Archive's wayback-diff tool to show that a note about needing to log in to view a resource had been added.

Figure 4: Comparison of mementos 20220406212423 and 20220407223123 for https://armypubs.army.mil/ using wayback-diff 


In "Tesla increases prices across lineup, with Model X up as much as $6,000" (2022-06-15), TechCrunch.com used web archives to show how the price of a Tesla changed over time.


In "Communication safety in Messages expands to four new countries" (2022-05-16), iMore.com used web archives to show when Apple made new child safety features available to additional countries.

Figure 6: Comparison of mementos 20220513100426 and 20220526224531 for https://www.apple.com/child-safety/ 
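The comparisons in the figures above pair two mementos of the same page. As a minimal sketch, assuming the page text of each memento has already been extracted, the memento addresses can be built from the 14-digit datetimes in the captions and removed lines can be found with Python's difflib. The helper names and the sample text are hypothetical, and this is not how the wayback-diff tool is implemented.

```python
import difflib

WAYBACK_PREFIX = "https://web.archive.org/web"


def memento_url(timestamp, original_url):
    """Build a Wayback Machine memento address from a 14-digit datetime."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def removed_lines(old_text, new_text):
    """Return lines present in the older version but absent from the newer one."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes lines unique to the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]


# Datetimes from the Figure 2 caption; the page text below is a placeholder.
older = memento_url("20211215212100", "https://www.teamherschel.com/about/")
newer = memento_url("20211217135409", "https://www.teamherschel.com/about/")

old_text = "First sentence.\nSecond sentence.\nThird sentence."
new_text = "First sentence.\nThird sentence."

print(older)
print(newer)
print(removed_lines(old_text, new_text))
```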


The Need for Robust Links

While journalists have a strong mental model of web archives and use them to support a variety of tasks, they were less successful at linking to the mementos they found. Only two-thirds of the journalists successfully linked to a memento. The most common problem that journalists encountered when trying to link to mementos was that they included no links at all, either to the memento or to the resource on the live web. This problem was present in 20% of the articles. A third of these articles appeared on media sites that only allow internal links or don't have any links in articles, but the other articles contained links to external sites, so the majority of these articles do demonstrate users' difficulty in linking to mementos. Journalists also linked to pages on the live web rather than the archived versions that they used when writing their articles. Some journalists linked to time maps instead of individual mementos, and another journalist had a typo in their memento link.

However, linking to one memento alone is not sufficient. In order to show how content has changed on a page over time, there needs to be a link to the previous version of the page, the current version of the page on the live web, and a snapshot of the current page. It is best practice to link to the live web version of the page, so that the original address is not obscured. Linking to the live web version of the page also makes sense in this case because the articles show that journalists have a strong disposition for referencing the original resource. By linking to the live web version, the journalists wanted to link to the version of the page at that moment in time. But because web pages change over time, some of the live web versions have content that doesn't match what's referred to in the articles. Content drift is the major reason why each journalist should have created a snapshot of the current page version at the time of their article. They could have used services such as the Internet Archive's Save Page Now tool or Archive.today.

Robust Links provide a way to link all of these versions of the page in a clear way. Dr. Martin Klein recently gave a talk about Robust Links as a part of the invited talk series of the ODU CS REU in Disinformation Detection and Analytics. Robust Links contain a link to the original resource, a link to a snapshot of the current version, and the datetime of that snapshot. By including the datetime, alternate comparable snapshots can be located if necessary. Using Robust Links helps to ingrain a process for preserving functional hyperlinks: creating a snapshot of the live version of the page, as well as linking to resources that can be used to recover the information if the link becomes broken. The Robust Links site contains directions for robustifying links.
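As an illustration of the three pieces of information a Robust Link carries, the sketch below assembles an anchor element using the same attribute names as the examples in this post (`href`, `data-versionurl`, `data-versiondate`). The function `robust_link` is a hypothetical helper, not part of any Robust Links tooling, and it assumes the snapshot address uses a 14-digit Wayback Machine datetime.

```python
from datetime import datetime
from html import escape


def robust_link(original_url, snapshot_url, timestamp, text):
    """Return an HTML anchor carrying the original URL, snapshot URL, and date."""
    # Derive the ISO date from the 14-digit datetime so the two cannot disagree.
    version_date = datetime.strptime(timestamp, "%Y%m%d%H%M%S").date().isoformat()
    return (
        f'<a href="{escape(original_url)}" '
        f'data-versionurl="{escape(snapshot_url)}" '
        f'data-versiondate="{version_date}">'
        f"{escape(text, quote=False)}</a>"
    )


# Memento datetime and URL taken from the Figure 6 caption.
link = robust_link(
    "https://www.apple.com/child-safety/",
    "https://web.archive.org/web/20220526224531/https://www.apple.com/child-safety/",
    "20220526224531",
    "Robust Link to Apple's child safety page",
)
print(link)
```

Deriving the date from the memento datetime, rather than typing it by hand, avoids a mismatch between the snapshot and its stated date.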

Below are some examples that demonstrate the need for robust links in the articles, as well as how the links could have been robustified.

In "A Victory for Women?" (2022-07-04), Tol.org used web archives to access deleted content from Hungarian President Katalin Novák's blog. The memento linked in the article is http://web.archive.org/web/20100424030605/http:/schmitten.blog.hu/. While original addresses are not obfuscated by the Wayback Machine, the original address included in this memento address is missing a slash after the scheme. The Wayback Machine uses the handy_url module in the Python surt library to parse addresses, so this memento does replay. However, this example shows why it is not possible to obtain original addresses directly from memento addresses without the surt library, making the case for including the original address on its own. In fact, this article is one of fourteen that include a memento with an original address missing a slash or the scheme entirely.
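To make the problem concrete, the sketch below naively splits a memento address into its datetime and original address, then applies a heuristic repair to the scheme. Both helpers are hypothetical, the regular expression covers only the common Wayback Machine address shape, and the repair step is a guess rather than the parsing that surt performs. Defaulting to `http` when the scheme is absent is an assumption.

```python
import re

# Assumed shape: web.archive.org/web/<14 digits><optional modifier>/<original>
MEMENTO_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


def split_memento(memento_url):
    """Naively split a memento address into (datetime, original address)."""
    match = MEMENTO_RE.match(memento_url)
    if match is None:
        raise ValueError(f"not a recognized memento address: {memento_url}")
    return match.group(1), match.group(2)


def repair_scheme(original):
    """Heuristically restore 'scheme://' in an original address."""
    fixed = re.sub(r"^(https?):/+", r"\1://", original)
    if not re.match(r"^https?://", fixed):
        fixed = "http://" + fixed  # assumption: fall back to http
    return fixed


timestamp, raw_original = split_memento(
    "http://web.archive.org/web/20100424030605/http:/schmitten.blog.hu/"
)
print(raw_original)                 # the malformed address, exactly as embedded
print(repair_scheme(raw_original))  # heuristic repair
```

The naive split returns the malformed address as it was embedded, which is why a separately recorded original address is more reliable than one recovered from the memento address.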

To create a Robust Link that contains the correct original address and the specific memento, the HTML is:

<a href="http://schmitten.blog.hu/" data-versionurl="http://web.archive.org/web/20100424030605/http:/schmitten.blog.hu/" data-versiondate="2010-04-24">Robust Link to Katalin Novák's blog</a>

The link displays like this: Robust Link to Katalin Novák's blog

In "China blasts US over wording change on State Department's Taiwan website" (2022-05-10), NYPost.com used web archives to compare the current version of the US-Taiwan fact sheet with the previous version from August 2018. The journalist linked to both the live version of the page and the memento of the previous version. However, the live version of the page referenced in the article refers to the fact sheet updated on May 5, while the fact sheet has been updated again as of May 28, so the live version of the page no longer contains the content referenced in the article. The May 5 version of the page has been archived, so it is possible to view the page as it was when the article was written. However, this version is not linked in the article, so viewing the current live version of the page would be confusing to readers. This example shows how content drift negatively affects linking to live resources, and why there should be links to both the live page and a snapshot of that version.

To create a Robust Link that contains the link to the version of the page at the time the article was written, the HTML is:

<a href="https://www.state.gov/u-s-relations-with-taiwan/" data-versionurl="https://web.archive.org/web/20220508062239/https://www.state.gov/u-s-relations-with-taiwan/" data-versiondate="2022-05-08">Robust Link to the US-Taiwan fact sheet</a>

The link displays like this: Robust Link to the US-Taiwan fact sheet

There is still more work that needs to be done for robust links to be successful with all web users. Researchers have repeatedly found that users can become disoriented when viewing archived web pages, similar to the "lost in hyperspace" phenomenon that affected internet users in the 1990s. Even as recently as 2019, participants in Archive-It's usability study had trouble perceiving whether they were navigating the live web or the past web [2]. Web archives, and sites employing robust links, need to strongly distinguish between navigating to pages on the live web and replaying pages from the past web.

Outlook

This set of 106 articles was chosen because the articles contain information about how journalists use web archives. Circumstantially, each article also contains a reference to an archived web page. A time span of change for each web page can be defined by taking the date of the memento linked in the article as the start date and the article publication date as the end date. These pages with their time spans constitute a versioned document collection with an interesting set of changes.
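The time span definition above can be sketched in a few lines. The function name `change_span` is hypothetical. The example uses the memento datetime and publication date of the Tol.org article discussed earlier in this post.

```python
from datetime import date, datetime


def change_span(memento_timestamp, article_date):
    """Return (start, end, days) for a page's time span of change.

    start: date of the memento linked in the article
    end:   publication date of the article
    """
    start = datetime.strptime(memento_timestamp, "%Y%m%d%H%M%S").date()
    if article_date < start:
        raise ValueError("article predates the linked memento")
    return start, article_date, (article_date - start).days


start, end, days = change_span("20100424030605", date(2022, 7, 4))
print(f"{start} to {end}: {days} days")
```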

Many of the journalists wanted to examine change in pages over time, such as viewing deleted content. Currently, there isn't a way to search for which version of a page contains that kind of information, so each memento must be inspected one by one. Journalists would be able to accomplish their goals more easily if the changes in a web page over time were searchable. A tool that can both support queries for changes in a web page over time and present the changes in context is identified as a possible application of Jatowt et al.'s "Journey to the past: proposal of a framework for past web browser." [3]

Journalists who are interested in viewing deleted content on a page represent one specific case of re-finding. Perhaps the key to more effectively using web archives for re-finding lies in asking "When" something went rather than "Where."

Acknowledgements

I thank Dr. Michele C. Weigle and Dr. Michael L. Nelson for their guidance and collaboration.

Sources

  1. Jaime Teevan. "Where'd it go?": How people ask after lost Web information. ASIST 2007: 1-19. https://doi.org/10.1002/meet.1450440269
  2. Samantha Abrams; Alexis Antracoli; Rachel Appel; Celia Caust-Ellenbogen; Sarah Denison; Sumitra Duncan; Stefanie Ramsay. "Sowing the Seeds for More Usable Web Archives: A Usability Study of Archive-It." The American Archivist (2019) 82 (2): 440–469. https://doi.org/10.17723/aarc-82-02-19
  3. Adam Jatowt, Yukiko Kawai, Satoshi Nakamura, Yutaka Kidawara, Katsumi Tanaka. "Journey to the past: proposal of a framework for past web browser." In Proceedings of the seventeenth conference on Hypertext and hypermedia, 135-144, HYPERTEXT '06. https://doi.org/10.1145/1149941.1149969


--Lesley Frew
