2025-01-27: Can LLMs Detect and Analyze Changes in Webpages?

Image Generated by DALL-E

The utility of Large Language Models (LLMs) across diverse tasks raises important questions about the role of AI in web archive contexts, particularly in facilitating change analysis tasks. For instance, can LLMs effectively detect and analyze changes in webpages? How can this be achieved? What advantages and limitations do AI approaches have compared to traditional methods?

In this blog post, I summarize our paper “Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach” (Jhon G. Botello, Lesley Frew, Jose J. Padilla, and Michele C. Weigle) published at the 9th Computational Archival Science (CAS) Workshop 2024 (check Lesley Frew's trip report for more details about the workshop). The paper represents our first step in determining the ability of LLMs and the extent to which they can assist with change analysis tasks.

Motivation

As we interact with the web, we may notice that it is constantly changing. Fortunately, web archives preserve mementos, meaning historical versions of how original resources appeared in the past. Analyzing webpage changes is of interest in different areas. For instance, the removal of speeches from the UK Conservative Party website or the use of web archives as evidence of content manipulation in traditional media demonstrates the importance of studying changes in areas such as journalism, politics, and government.

Change analysis has traditionally been a manual and tedious process. Current tools may be helpful, but challenges still remain. For instance, the Wayback Machine's changes feature highlights additions or deletions of terms and phrases but does not capture complex changes like content structure modifications or semantic context shifts. While TMVis analyzes a set of mementos and selects those that have large differences—often detecting webpage design updates—it does not highlight specific changes or analyze them at a semantic level. In this regard, Large Language Models (LLMs) can offer a novel approach by considering contextual and semantic nuances.

Methodology

We used WARC-GPT, an open-source tool based on a Retrieval-Augmented Generation (RAG) approach designed to create customizable chatbots for interacting with web archives. Our knowledge base included a small sample of 30 webpages from 4 U.S. federal agency websites related to climate change (EPA, DOE, NASA, and NOAA), which we identified as having undergone modifications between 2016 and 2020. We evaluated the system’s ability to identify relevant passages where changes occurred and utilized GPT-4o to analyze those modifications.
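To make the retrieval step concrete, here is a minimal sketch of how a RAG system like the one described above can rank memento passages against a query embedding. This is an illustrative simplification, not WARC-GPT's actual implementation; the `Passage` class and function names are hypothetical, and a real system would use an embedding model rather than hand-supplied vectors.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    memento_datetime: str  # e.g., "2016-04-03T08:53:45Z"
    text: str              # passage extracted from the archived page
    embedding: list        # vector produced by an embedding model

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve(query_embedding, passages, k=4):
    """Return the k passages most similar to the query embedding.

    The retrieved passages, labeled by memento datetime, are what the
    LLM sees when asked to compare versions of a page.
    """
    return sorted(passages,
                  key=lambda p: cosine(query_embedding, p.embedding),
                  reverse=True)[:k]
```

In a full pipeline, passages from all mementos of a page are embedded once at indexing time, and the query (e.g., "What changed on this page?") is embedded at question time; only the top-k retrieved passages are passed to the LLM.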

To enhance the system's performance, we used prompt engineering to adjust the task prompt so that the system focused on detecting changes in terms, phrases, and paragraphs, as well as changes in semantic context. The analysis involved comparing two or more snapshots of the same page to evaluate the magnitude and significance of the changes. Results were evaluated through human assessment to ensure the accuracy of the answers based on the content of each website.
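The kind of task prompt described above can be sketched as a template that is filled with retrieved passages, each labeled by its memento datetime so the LLM can attribute changes to specific versions. The template wording and function below are illustrative assumptions, not the exact prompt used in the paper.

```python
# Hypothetical task-prompt template for change detection across mementos.
CHANGE_PROMPT = """You are analyzing archived versions (mementos) of {url}.
Using only the passages below, identify:
1. Changes in terms, phrases, or paragraphs between versions.
2. Shifts in semantic context (topic, purpose, or framing).
Refer to passages by their memento datetime.

Passages:
{passages}
"""

def build_prompt(url, passages):
    """Fill the template; passages is a list of (memento_datetime, text)."""
    body = "\n\n".join(f"[{dt}] {text}" for dt, text in passages)
    return CHANGE_PROMPT.format(url=url, passages=body)
```

Labeling each passage with its capture datetime is what lets the model ground statements like "by 2020, the content has expanded" in a specific memento rather than guessing at chronology.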

WARC-GPT Architecture Considering a Set of Mementos as a Knowledge Source.

Results

We designed two queries to evaluate the performance of the method used. Query 1 focused on assessing the ability to detect changes, while Query 2 examined the ability to analyze shifts in semantic context.

Our results showed several interesting observations:
  • The system successfully detected accurate changes in 24 out of the 30 analyzed web pages.
  • In six cases, responses regarding changes were lacking, due either to insufficient relevant passages retrieved from the mementos or to embedding-generation failures caused by problems with the WARC file.
  • Pages from the same domain occasionally caused confusion in the model, leading the system to extract passages from a different page within the same domain (e.g., the landing page versus a services page).
  • The approach demonstrated scalability in handling multiple mementos per page for detailed comparisons, meaning the system can analyze changes between pairs or among more than two mementos.
Below is an example showcasing two versions (2016 and 2020) of a webpage from the United States Environmental Protection Agency (EPA) and the answers from the LLM. 

The webpage offers training related to Enforcement and Compliance History Online (ECHO), a publicly accessible platform that provides detailed information about the compliance and enforcement history of environmental regulations for facilities across the United States. In 2016, the page primarily focused on webinars that provided foundational knowledge about ECHO's functionalities. By 2020, the webinars evolved to include more training content, better descriptions, and a variety of formats for user convenience. The LLM was able to identify that, while the site's core purpose of providing ECHO training remained consistent, there was a significant evolution in the semantic context. This evolution involved an expansion of topics and a more structured presentation of information to address user training more effectively.

Version: 2016-04-03T08:53:45Z

Version: 2020-06-30T10:52:36Z

Query 1: What are the changes or inconsistencies in terms, phrases, or paragraphs among the mementos of webpage X?

GPT-4o:
The archived versions of the webpage "echo.epa.gov/help/training" show several changes and updates over time:

1. Webinar Information and Content:
- In the 2016 version, the webpage mentions a series of webinars, including "Introduction to ECHO" and "Intro to the Acts in ECHO," with specific dates and times for these events [1]. By 2020, the content has expanded to include more detailed descriptions of various webinars, such as "Intro to ECHO," "Advanced ECHO Webinar: Behind the Scenes," and "Advanced ECHO Webinar: Water" [3]. These later versions provide more comprehensive information about the webinars, including the topics covered, the format (e.g., PDF presentations, MP4 recordings), and the requirement for registration before accessing the recordings.

2. Focus and Structure of Webinars:
- The 2016 version focuses on introducing the ECHO website and its basic functionalities, with webinars aimed at providing a foundational understanding of the acts and tools available on the site [1]. By 2020, the webinars have evolved to include more advanced topics and detailed demonstrations, such as accessing and using water tools and reports, and exploring air facility search capabilities [3].

3. Additional Resources and Tools:
- The 2016 version lists various resources and tools available on the ECHO website, such as facility search help, enforcement case search help, and dashboards help [1]. In the 2020 version, there is a more structured presentation of resources, including video tutorials, quick start guides, and detailed help sections for different types of reports and dashboards [3].

4. Registration and Access Requirements:
- In the 2016 version, there is a mention of a registration step required for downloading webinar recordings, with a note that the EPA plans to replace this with a non-registration option in the future [1]. By 2020, the registration requirement is still present for accessing webinar recordings, indicating that this aspect has not changed [3].

Overall, the changes reflect an expansion and enhancement of the training resources and webinars offered by the ECHO website, with a shift towards more detailed and advanced content over time.

Query 2: Compare any change regarding the semantic context among the mementos for the webpage X. 

GPT-4o:

The semantic context of the webpage "echo.epa.gov/help/training" has evolved over time, reflecting changes in focus and the depth of content offered:

1. Expansion of Webinar Topics: Initially, the focus was on introducing the ECHO website and its basic functionalities, with webinars aimed at providing a foundational understanding of the acts and tools available on the site [1]. By 2020, the webinars had expanded to include more advanced topics and detailed demonstrations, such as accessing and using water tools and reports, and exploring air facility search capabilities [3]. This shift indicates a move towards providing more comprehensive and specialized training resources.

2. Increased Detail and Structure: Over time, the content became more structured and detailed. The 2016 version provided basic information about webinars and resources [1], while the 2020 version offered more comprehensive descriptions, including the topics covered, formats available (e.g., PDF presentations, MP4 recordings), and registration requirements [3]. This change suggests an effort to enhance user understanding and accessibility of the training materials.

3. Focus on Advanced Analysis: The later versions of the webpage emphasize advanced analysis capabilities, such as using ECHO tools for detailed environmental data interpretation and compliance tracking [3]. This reflects a shift in semantic context from basic introductions to enabling users to perform more complex analyses and gain deeper insights into environmental data.

Overall, the semantic context has shifted from providing introductory content to offering more advanced and detailed training resources, reflecting an evolution in the scope and depth of the educational materials provided by the ECHO website.

Our initial change detection screening was not aimed at evaluating whether the LLM could identify all the changes that may have occurred. Instead, we focused on assessing its ability to detect any type of change at a semantic level and determine the accuracy of these detections in relation to the content of the webpage.

The Road Ahead: AI and Web Archiving

WARC-GPT is one of the first tools to contribute to the intersection of artificial intelligence (AI) and web archiving. While its primary purpose was to facilitate interaction with collections of archived web pages, we have observed significant potential in its application to change analysis through the implementation of a Retrieval-Augmented Generation (RAG) methodology. This opens new possibilities for future research. 

Our exploration, while a first step, highlights the importance of continuing contributions to the intersection of artificial intelligence and web archiving. This study was limited to the use of a single LLM and a small dataset. It remains essential to investigate the differences that might arise when employing various LLMs and embedding models, as well as to evaluate the system's performance when analyzing large volumes of archived data across diverse topics, languages, and structures. Like humans, LLMs appear to adopt different perspectives, as demonstrated in tasks such as entity extraction, an experiment conducted by Brian Llinás from Storymodelers and WS-DL Labs (more information can be found in the blog post "Do Large Language Models Agree on Entity Extraction?"). Therefore, it would be worthwhile to analyze their behavior in change detection tasks.

With the rapid advancement of AI models, it is also important to consider the integration of multimodality. By combining text and image processing, these models could analyze not only textual changes but also visual changes, such as graphics, layouts, and multimedia elements. This expansion would enable a more comprehensive understanding of how web content evolves over time.

Ethical Considerations

As with any AI-driven tool, using LLMs to analyze changes may raise important ethical concerns, particularly regarding the reliability of the information they provide. While detecting changes relies on a comparison of vectors, generating answers about shifts in semantic context can lead to incorrect or misleading interpretations of data due to bias in the information the model was trained with. This poses a significant challenge, especially in contexts where the tool might be used by individuals who lack the expertise to critically evaluate its outputs. In this sense, there is always a risk of amplifying misinformation, making continued research in this area essential.

We do not claim that such methodologies and tools will replace traditional methods or the critical thinking and reasoning required by those analyzing changes, at least not in the short term. However, AI can effectively complement traditional tools by providing an extra layer of insight and saving time when detecting changes.

Lesley presented this work at the workshop. The presentation slides are below:

Jhon G. Botello
