2023-02-26: Animating Changes in Webpages, Featuring George Santos's Biography

Figure 1: The Wayback Machine "Changes" tool showing the difference between George Santos's biography on December 19, 2022 and February 3, 2023. Deletions are highlighted in yellow and additions are highlighted in blue. Source: https://web.archive.org/web/diff/20221219173515/20230203162225/https://georgeforny.com/about/


The Washington Post recently published the article "See the evolution of lies in George Santos's campaign biography." George Santos is a member of the U.S. House of Representatives in the 118th U.S. Congress. He has steadily removed claims from his website that have been proven false, such as holding a bachelor's degree from Baruch College. In the Washington Post article, the journalists used the Internet Archive's Wayback Machine to find and view previous versions of his webpage that included the false claims. To show change over time, they interspersed colored text boxes containing the changed text throughout the article. Previously, we investigated common tasks for journalists who use web archives, and we found that viewing deleted content on webpages is one of them.


Figure 2: The Washington Post article about George Santos used colored boxes, representing the topic on the website, to show change over time. The boxes included headings with the approximate date of the content. Source: https://www.washingtonpost.com/politics/interactive/2023/george-santos-resume-lies/


Besides manual inspection, there is currently one way to compute and show changes between historical versions of webpages. The "Changes" tool on the Wayback Machine computes and shows the side-by-side HTML diff between two mementos (archived versions of a webpage).

We have been developing an alternative way to find and view changes on webpages in order to emphasize the fact that a person made these changes. Our approach is to use storytelling and show the changes as an animation. The technique is similar to showing a text message letter by letter to imitate the message being typed in real time, as in the Netflix docuseries The Most Hated Man on the Internet.

Three of the changes featured in the Washington Post article are about George Santos's work history (Figure 3), the religious heritage of his grandparents (Figure 4), and his philanthropy (Figure 5). These three changes are shown below in the new animation format that we developed. Figure 3 and Figure 4 show examples where statements were deleted and then replaced with new content. Figure 3 specifically shows that George Santos originally claimed he worked for Goldman Sachs, but later removed the company's name from his work history, as there is no record he ever worked there. The animation shows the incremental deletion highlighted in red, followed by the incremental addition highlighted in green. Figure 5 shows an animation for deleted content that was not replaced with any new wording.

Figure 3: This animation shows the deletion of the statement about George Santos's work history. The animation was created using the mementos from April 2, 2022 and October 14, 2022.




Figure 4: This animation shows the deletion of the statement about George Santos's grandparents being Jewish. The animation was created using the mementos from December 23, 2022 and December 30, 2022.



Figure 5: This animation shows the deletion of the statement about George Santos's philanthropy. The animation was created using the mementos from December 23, 2022 and December 30, 2022.

How the animation works


Hypercane is a tool for summarizing web archive collections, and its "synthesize warc" action can save a memento with its resources, such as images and stylesheets, into a WARC file. Using Hypercane, I downloaded 29 versions of George Santos's biography webpage saved by the Wayback Machine between June 2020 and February 2023.

So that users can easily determine when the changes occurred, we have developed a change-text search interface. First, the WARCs are indexed into the search backend Lucene using the UKWA WARC indexer. Next, a change-text script calculates the validity ranges of the mementos as defined by Berberich in his work on Temporal Search for Web Archives. This script also calculates all of the added and deleted terms for each webpage version. These calculations are posted directly into the Lucene index. The search interface uses Solarium, a PHP library for interacting with the Solr search platform built on top of Lucene. The user can type a term or phrase, and behind the scenes this is translated into a formal Lucene query over the deleted terms field. The snippet in the search engine results page is generated by the PHP-diff library.
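To make the change-text step more concrete, here is a minimal sketch in Python of the kind of calculation described above. It is not the actual script: the function names, the record fields, and the simple word tokenizer are all assumptions made for illustration, and the sample text is invented. It treats a version's validity range as running from its capture date until the next capture whose terms differ, which is a simplified reading of Berberich's definition.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase words and numbers only.
    return re.findall(r"[a-z0-9']+", text.lower())

def change_text(versions):
    """versions: list of (capture_date, text) tuples, sorted oldest first.

    Returns one record per distinct version with its validity range and
    the terms added and deleted relative to the previous distinct version.
    """
    records = []
    prev_terms = None
    for captured, text in versions:
        terms = set(tokenize(text))
        if prev_terms is not None and terms == prev_terms:
            # Unchanged capture: the previous validity range keeps extending.
            continue
        if records:
            # The previous version stops being valid when a change is observed.
            records[-1]["valid_to"] = captured
        records.append({
            "valid_from": captured,
            "valid_to": None,  # still valid as of the last capture
            "added": sorted(terms - prev_terms) if prev_terms is not None else [],
            "deleted": sorted(prev_terms - terms) if prev_terms is not None else [],
        })
        prev_terms = terms
    return records

# Invented sample data, for illustration only.
sample = [
    ("2021-01-01", "Worked at Example Bank for years"),
    ("2021-02-01", "Worked at Example Bank for years"),
    ("2021-03-01", "Worked in finance for years"),
]
for record in change_text(sample):
    print(record)
```

Records shaped like these could then be posted to the index as extra fields on each version, so that a search over the deleted terms finds the version in which a term disappeared.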

For example, searching for the debunked claim that George Santos's grandparents were Jewish shows that the claim was added to his biography no later than October 14, 2022, and was removed sometime between December 23 and December 30, 2022.
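The "no later than" and "sometime between" wording follows from the fact that a web archive only observes a page at capture times. A rough sketch of that reasoning, assuming a hypothetical `phrase_lifespan` helper and invented captures (neither the dates nor the text below come from the real mementos):

```python
def phrase_lifespan(versions, phrase):
    """versions: list of (capture_date, text) tuples, sorted oldest first.

    Returns the capture dates that bound when the phrase appeared and
    disappeared, or None if no capture contains the phrase.
    """
    needle = phrase.lower()
    present = [needle in text.lower() for _, text in versions]
    if True not in present:
        return None
    first = present.index(True)
    last = len(present) - 1 - present[::-1].index(True)
    return {
        # The phrase was added after this capture (None: present from the start).
        "added_after": versions[first - 1][0] if first > 0 else None,
        # ...and no later than this one.
        "added_no_later_than": versions[first][0],
        # The phrase was still there at this capture...
        "last_seen": versions[last][0],
        # ...and gone by this one (None: never observed to be removed).
        "removed_by": versions[last + 1][0] if last + 1 < len(versions) else None,
    }

# Invented captures, for illustration only.
captures = [
    ("2021-01-01", "A short biography."),
    ("2021-06-01", "A short biography with a notable claim."),
    ("2021-09-01", "A longer biography with a notable claim."),
    ("2021-12-01", "A longer biography."),
]
print(phrase_lifespan(captures, "notable claim"))
```

The removal can only be narrowed down to the gap between `last_seen` and `removed_by`, so more frequent captures give a tighter bound.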

Figure 6: The change-text search interface allows users to search for deleted terms and phrases. The interface provides information about the changes, the content lifespan, links to replay the mementos, and links to examine the changes in more detail via an animation or a sliding diff over multiple page versions.

The animated deletion uses two mementos and a deleted term or phrase as its input. The Wayback Machine's "Changes" tool uses the HTML-diff library developed by the Environmental Data and Governance Initiative (EDGI), along with a web archive replay engine. The animated deletion also uses EDGI's HTML-diff library, but uses the combined diff rather than the side-by-side diff. The replay engine used by the animated deletion is PyWB. After computing the HTML diff between the two mementos, only deletions corresponding to the queried term or phrase are kept. These differences are labeled in the HTML so that they can be animated sequentially. JavaScript controls the page jumps between differences and animates the text.
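The filter-and-label step can be illustrated with a small sketch. This is not the EDGI HTML-diff library: it swaps in Python's standard `difflib` on plain text, and the `label_deletions` function, the `anim` class, and the `data-step` attribute are hypothetical names chosen for the example. The idea is the same, though: compute a combined diff, keep only the deletions that match the query, and number them so a script can animate them in order.

```python
import difflib
import html

def label_deletions(old_text, new_text, query):
    """Return (markup, count): the new text with matching deletions
    wrapped in numbered <del> elements, and how many were labeled."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts, step = [], 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = " ".join(old_words[i1:i2])
        current = " ".join(new_words[j1:j2])
        # Keep a deletion only if it contains the queried term or phrase.
        if tag in ("delete", "replace") and query.lower() in removed.lower():
            parts.append(
                f'<del class="anim" data-step="{step}">{html.escape(removed)}</del>'
            )
            step += 1
        # Text from the newer version is always shown.
        if tag != "delete" and current:
            parts.append(html.escape(current))
    return " ".join(parts), step

# Invented sentences, for illustration only.
markup, count = label_deletions(
    "He worked at Example Bank for years",
    "He worked in finance for years",
    "Example Bank",
)
print(count)
print(markup)
```

One limitation of this simplified version is that a queried phrase split across two separate diff regions would not match. A front-end script could then select the `del.anim` elements in `data-step` order, scroll to each one, and remove its characters one at a time.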

Outlook

George Santos's webpage also contains some more routine edits, which can be identified by examining the full change-text calculations that were posted to the Lucene index. Between August and September of 2020, there were some edits to fix misspellings. Between August and September of 2021, he truthfully updated his work history. These kinds of edits are typical, especially for a page with a life span of over two years. While not all changes to webpages are newsworthy, examining the uninteresting edits is still worthwhile because it is a step toward automatically detecting semantically meaningful edits on webpages in general.

The public has a right to see the edits made to congressional and government websites. Tools such as this change-text search interface will facilitate journalists' use of web archives to hold the government accountable when changes are not made in a transparent way.


--Lesley Frew
