2026-03-18: A Glimpse into How AI Tools Can Enhance the Way We Study Web Archive Content: Challenges and Opportunities

Artificial intelligence (AI) has transformed nearly every field. Today, we can access and train models that generate text, images, sound, video, and code. This transformation is reshaping how we think, analyze, and preserve information. Yet, despite the rapid growth of AI, its use for analyzing web archive content seems to advance at a slower pace. 

Web archiving is the process of collecting, preserving, and providing access to web content over time; a memento is a previous version of a web resource as it existed at a specific moment in the past. Much of the recent work within the web archiving community (e.g., [1], [2], [3]) has focused on making the archiving process itself more intelligent, integrating AI into tasks such as web crawling, storage optimization, and metadata generation. In contrast, the application of AI to the analysis of already archived web content has received comparatively less attention. This gap represents a great opportunity for innovation and contribution, particularly as web archives continue to grow in size, diversity, and historical importance.

In this blog, I aim to outline (based on my perspective, analysis, preliminary work, and insights gained during my PhD candidacy exam) areas where AI could play a role, as well as key challenges involved in integrating AI into web archiving.

My Preliminary Work 

Since I joined the PhD program at ODU in 2023 (Blog post introducing myself) under the supervision of Dr. Michele C. Weigle, my work has focused on the intersection of web archiving and AI, with a particular emphasis on leveraging Large Language Models (LLMs) through Retrieval-Augmented Generation (RAG) to detect and interpret text changes across mementos. Identifying the exact moment when content was modified often requires carefully comparing multiple archived versions, a process that can be both tedious and time-consuming. Moreover, detecting and analyzing where important changes occur is not a straightforward process. Users often need to select a subset of captures from thousands available, and even then, there is no guarantee that the differences they find will be meaningful or important. Traditional approaches to memento change analysis, such as lexical comparisons and indexing (e.g., [4], [5]), focus on showing the deletion or addition of terms or phrases but ignore semantic context. As a result, they miss subtle shifts in meaning and rely heavily on human interpretation.
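
As a minimal illustration of the lexical comparison approaches mentioned above (not my actual pipeline), the sketch below uses Python's `difflib` on two invented memento texts. It surfaces which terms were added or deleted, but says nothing about whether the meaning changed, which is exactly the limitation semantic approaches aim to address:

```python
import difflib

# Two invented snapshots of the same hypothetical page.
memento_2015 = "The mayor announced the project will begin in the spring".split()
memento_2016 = "The mayor announced the project will be cancelled".split()

# SequenceMatcher reports term-level insertions and deletions between the lists.
deletions, additions = [], []
sm = difflib.SequenceMatcher(a=memento_2015, b=memento_2016)
for op, a1, a2, b1, b2 in sm.get_opcodes():
    if op in ("delete", "replace"):
        deletions.extend(memento_2015[a1:a2])
    if op in ("insert", "replace"):
        additions.extend(memento_2016[b1:b2])

print("deleted:", deletions)
print("added:", additions)
```

The diff correctly lists the changed terms, but it cannot tell us that the project's fate was reversed; that judgment is left to the human reader.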

My early work resulted in a paper titled “Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach,” coauthored with Lesley Frew, Dr. Jose J. Padilla, and Dr. Michele C. Weigle. The results of this initial exploration demonstrated that an LLM, when combined with tools such as RAG over a set of mementos, can effectively retrieve and analyze changes in archived web content. However, it remains necessary to constrain the analysis to distinguish between important and non-important changes. Building on this, I have been developing a pipeline to automatically determine whether a change alters meaning or context and should be considered significant. This aims to reduce manual effort and cognitive load, support integration into web archive systems, and advance methods for analyzing archived web content at scale.

My PhD Candidacy Exam

During the summer of 2025, I passed my PhD candidacy exam (pdf, slides). This milestone marked an important transition in my doctoral studies and provided an opportunity to reflect on my preliminary work, learn, and identify new ways to contribute to the intersection of AI and web archiving. In my candidacy exam, I reviewed a set of ten papers related to analyzing changes and temporal coherence in archived web pages and websites.  Changes refer to any modifications observed in web content over time, including the addition, deletion, or alteration of text, images, structure, or other embedded resources. Temporal coherence, on the other hand, refers to the degree to which all components of an archived web page (such as HTML, text, images, and stylesheets) or website (such as interconnected pages and resources) were captured close enough in time to accurately represent how it appeared and functioned at a specific moment. A lack of temporal coherence can result in inconsistencies in how the archived page or site looks or behaves, which may affect the accuracy of change analysis.

Figure 2. A moment from my PhD candidacy exam, where I presented a ten-paper review on analyzing changes and temporal coherence in archived web pages and websites.

AI in Web Archiving: Opportunities

Over time, several researchers have addressed the analysis of changes and temporal coherence in web archives; however, the use of AI in this context has been limited. Below, based on insights gained from my preliminary work and candidacy exam, I outline some research opportunities and challenges for how AI could play a role in these activities.

Topic Drift

AlNoamany et al. [6] studied web archive collections to identify off-topic pages within TimeMaps, which occur when a webpage that was originally relevant to a collection later changes into unrelated content. For example, in a collection about the 2003 California Recall Election (Figure 3), the site johnbeard4gov.com initially supported candidate John Beard (September 24, 2003) but later transformed into an unrelated adult-oriented page (December 12, 2003), making it irrelevant to the collection. To detect such changes, AlNoamany et al. proposed automated methods including text-based similarity metrics (cosine similarity, Jaccard similarity, and term overlap), a kernel-based method using web search context, and structural features such as changes in page length and word count. Using manually labeled TimeMap versions as ground truth, they found that the best performance was achieved by combining TF-IDF cosine similarity with word-count change.
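
To make the combined signal concrete, here is a simplified stdlib sketch of the two features that performed best for AlNoamany et al.: TF-IDF cosine similarity plus word-count change. The capture texts are invented, and this is not their implementation, only an illustration of the idea:

```python
import math
from collections import Counter

def tfidf(docs):
    # Smoothed IDF (1 + log(n/df)) so terms shared by all documents still contribute.
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * (1 + math.log(n / df[t])) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented captures: the second stays on topic, the third drifts off-topic.
captures = [
    "john beard for governor california recall election campaign".split(),
    "john beard governor election volunteer and donate today".split(),
    "cheap domain for sale click here".split(),
]
vecs = tfidf(captures)
for i in (1, 2):
    sim = cosine(vecs[0], vecs[i])
    wc_change = abs(len(captures[i]) - len(captures[0])) / len(captures[0])
    print(i, round(sim, 2), round(wc_change, 2))
```

The off-topic capture scores a much lower similarity to the first capture than the on-topic one, and its word-count change adds a second, structural signal.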

Figure 3. Example of johnbeard4gov.com going off-topic. The first capture (September 24, 2003) shows the site supporting a California gubernatorial candidate, while the later capture (December 12, 2003) shows the domain transformed into unrelated adult-oriented content. Source: AlNoamany et al. [6]

Recent advances in AI and representation learning offer opportunities to enhance off-topic detection in web archives beyond traditional term frequency measures. Instead of relying on TF-IDF, future approaches could use dense semantic embeddings from transformer models to better capture meaning and context, enabling the detection of more subtle topic drift. Comparing embedding-based similarity with the methods proposed by AlNoamany et al. could help determine which approach is more effective, particularly when topic shifts are not immediately apparent.
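
An embedding-based variant could look like the sketch below. The vectors here are toy placeholders; a real system would obtain them from a transformer model such as a sentence-transformer, and the 0.5 threshold is an arbitrary assumption that would need tuning against labeled data:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def flag_off_topic(timemap, threshold=0.5):
    """Return indices of captures whose embedding drifts away from the first capture."""
    anchor = timemap[0][1]
    return [i for i, (_, emb) in enumerate(timemap)
            if cosine(anchor, emb) < threshold]

# Toy (timestamp, embedding) pairs standing in for model output.
timemap = [
    ("2003-09-24", [0.90, 0.10, 0.00]),
    ("2003-10-30", [0.85, 0.20, 0.10]),
    ("2003-12-12", [0.05, 0.10, 0.95]),  # semantically unrelated capture
]
print(flag_off_topic(timemap))  # → [2]
```

Because the comparison happens in a semantic space rather than over surface terms, this style of detector can in principle catch drift even when pages share vocabulary.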

Temporal Coherence

Weigle et al. [7] highlight a key challenge in modern web archiving: many sites, such as CNN.com, rely on client-side rendering, where the server delivers basic HTML and JavaScript that later fetch dynamic content (often JSON) through API calls. Traditional crawlers like Heritrix do not execute JavaScript or consistently capture these dynamic resources, leading to temporal violations in which archived HTML and embedded JSON files have different capture times, potentially misrepresenting events or news stories. The issue is illustrated in Figure 4, which shows archived CNN.com pages captured between September 2015 and July 2016. The top row displays pages replayed in the Wayback Machine that show the same top-level headline despite being captured months apart. The bottom row shows mementos from the same dates with the correct top-level headlines; however, the second-level stories remain temporally inconsistent.

By measuring time differences between base HTML captures and embedded JSON resources using CNN.com pages (September 2015–July 2016), Weigle et al. identified nearly 15,000 mementos with mismatches exceeding two days. They conclude that browser-based crawlers best reduce such inconsistencies, though due to their higher cost and slower performance, they recommend deploying them selectively for pages that depend on client-side rendering.

Figure 4. Example of temporal coherence violation in archived CNN.com pages using client-side rendering. Source: Weigle et al. [7].

AI can enhance existing approaches to temporal coherence in web archives, such as those proposed by Weigle et al., by helping identify pages that depend on client-side rendering. For example, a machine learning model could be fine-tuned to analyze the initial HTML and related resources to detect signals such as empty or minimally populated DOM structures and classify whether a webpage relies on client-side rendering. AI-based analysis could also estimate the proportion of JavaScript relative to textual content and detect patterns associated with common client-side frameworks. Combined with indicators such as API endpoints referenced in scripts, these features can be used to flag pages that are unlikely to render correctly with traditional crawlers and may require browser-based crawling.
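
As a rough illustration of the script-heavy signal (not Weigle et al.'s method, and far simpler than a trained classifier), the stdlib sketch below compares the volume of script content to visible text; the 0.7 ratio threshold is an assumption for illustration only:

```python
from html.parser import HTMLParser

class CSRSignals(HTMLParser):
    """Collect rough signals that a page depends on client-side rendering."""
    def __init__(self):
        super().__init__()
        self.script_chars = 0
        self.text_chars = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())

def likely_client_side_rendered(html, ratio=0.7):
    # Heuristic: flag pages whose delivered content is mostly script.
    p = CSRSignals()
    p.feed(html)
    total = p.script_chars + p.text_chars
    return total > 0 and p.script_chars / total > ratio

spa = ("<html><body><div id='root'></div><script>"
       + "fetch('/api/top');" * 20 + "</script></body></html>")
static = ("<html><body><h1>Headline</h1><p>"
          + "Readable article text. " * 20 + "</p></body></html>")
print(likely_client_side_rendered(spa), likely_client_side_rendered(static))
```

A production classifier would combine such features (empty root containers, framework fingerprints, API endpoints in scripts) rather than rely on a single ratio.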

AI for Enhancing Web Archive Interfaces

While platforms such as Google and others have begun integrating AI into their user interfaces, web archives have largely remained unchanged in this respect. This is notable given the potential of AI to make web archive interfaces more intuitive and more informative for a wide range of users. For example, as my preliminary work suggests, when analyzing content changes, users currently must manually browse long lists of captures or compare multiple archived versions of a webpage. AI could instead automatically identify moments when important changes occur and direct users’ attention to those points in time.

Along the same lines, the Internet Archive’s Wayback Machine provides a “Changes” feature that highlights deletions and additions between two snapshots and a calendar view where color intensity reflects the amount of variation. However, this variation is based on the quantity of changes rather than their significance. As a result, many small edits may appear more important than fewer but meaningful modifications. An AI-enhanced interface could address this limitation by incorporating semantic change detection. For instance, a calendar view that highlights when the meaning or message of a page changes could make large-scale temporal analysis more efficient and accessible. Moreover, users could ask natural-language questions such as “When did this page change its message?” or “What were the major updates during a specific period?” and receive concise, understandable answers.

AI could also guide users through large collections by recommending related pages, explaining why certain versions are relevant, or warning when an archived page may contain temporally inconsistent content. For non-experts, visual aids generated by AI, such as timelines, change highlights, or short explanations, could make complex web archive data easier to interpret. 

AI in Web Archiving: Challenges

While there are opportunities for AI integration into web archiving, there are also challenges that must be considered.

Technical Challenges

From a technical standpoint, I identified three primary challenges regarding using AI for analyzing archived web content. The first concerns the nature of archived web data. Web archiving systems typically store collected content using the Web ARChive (WARC) format. Each WARC file stores complete HTTP response headers, HTML content, and additional embedded resources such as images and JavaScript files. Although this format provides structure and supports long-term preservation, it is verbose and was not designed to support AI-based analysis. Consequently, researchers must perform extensive parsing and preprocessing before AI models can effectively use archived web content.
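
The layered structure is visible even in a minimal, synthetic WARC response record like the one below. Production pipelines would use a library such as warcio and handle gzip-compressed files with many records; this stdlib sketch only shows why two rounds of header parsing are needed before the HTML is reachable:

```python
# Synthetic single WARC response record: a WARC header block wraps a full
# HTTP response, which in turn wraps the HTML payload.
http = (b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n\r\n"
        b"<html><body>Archived page</body></html>")
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"WARC-Date: 2016-07-01T12:00:00Z\r\n"
          b"Content-Length: " + str(len(http)).encode() + b"\r\n"
          b"\r\n" + http + b"\r\n\r\n")

def parse_record(raw):
    """Peel the WARC header layer, then the HTTP layer, to reach the HTML."""
    warc_head, _, payload = raw.partition(b"\r\n\r\n")
    headers = dict(line.split(": ", 1)
                   for line in warc_head.decode().splitlines()[1:])
    http_head, _, body = payload.partition(b"\r\n\r\n")
    return headers, body.rstrip(b"\r\n")

headers, body = parse_record(record)
print(headers["WARC-Target-URI"], body)
```

Only after both layers are stripped can the text be extracted, cleaned, and chunked for a model, which is the preprocessing burden described above.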

Second, many web archives, such as the Internet Archive’s Wayback Machine, prioritize long-term storage and preservation over indexing and large-scale content retrieval. As a result, a single web page may have hundreds or even thousands of archived versions over time. Building and maintaining large-scale vector indexes over such temporally dense collections quickly becomes computationally expensive and, in many cases, impractical.
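
One common mitigation, already reflected in WARC's own revisit records, is digest-based deduplication: only captures whose content actually changed need to be embedded and indexed. A sketch with invented capture data (real systems typically digest raw payload bytes rather than extracted text):

```python
import hashlib

# Invented (timestamp, page_text) captures; many are byte-identical.
captures = [
    ("20150901", "Breaking: storm approaches coast"),
    ("20150902", "Breaking: storm approaches coast"),
    ("20150903", "Breaking: storm approaches coast"),
    ("20150910", "Update: storm makes landfall"),
]

def unique_captures(captures):
    """Keep only captures whose content digest has not been seen, in order."""
    seen, kept = set(), []
    for ts, text in captures:
        digest = hashlib.sha1(text.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((ts, text))
    return kept

print(len(unique_captures(captures)))  # far fewer items to embed and index
```

For temporally dense pages where most captures are unchanged, this can shrink the set of versions that need expensive embedding by orders of magnitude.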

Third, even when working with controlled data scenarios, such as curated web archive collections, AI-driven analysis still depends on the availability of ground truth for evaluation and validation. For instance, training models to detect significant changes across mementos would require large-scale, high-quality annotations that capture not only what changed, but whether those changes meaningfully affect content interpretation. At present, no large-scale annotated datasets exist that support systematic analysis of change significance across archived web versions, creating a major barrier to training and evaluating AI models in this domain.

Ethical Challenges

Beyond technical limitations, the integration of AI into web archive analysis raises important ethical challenges. For instance, web archives preserve content as it existed at specific points in time, often without the consent or awareness of content creators or the individuals represented in that content. When AI models analyze archived web data, they may surface, reinterpret, or amplify sensitive information that was never intended to be reused in new analytical contexts. For this reason, it is important to carefully consider how AI is applied within web archiving. I contend that AI should be viewed as a complementary tool, one that supports, rather than replaces, human judgment. For example, AI can assist in identifying potential moments of relevant changes, flagging or summarizing them, while humans interpret the results and make decisions.

It is also important to note that recent debates highlight growing tensions between web archives and content owners regarding the use of archived data for AI training and analysis. For example, major news publishers have begun restricting access to resources like the Internet Archive due to concerns that archived content is being used for large-scale AI scraping without compensation or consent [8]. In response to such restrictions, researchers and practitioners—including Mark Graham, Director of the Wayback Machine—have argued that limiting access to web archives poses a significant risk to the preservation of digital history [9]. From this perspective, the primary concern is not excessive access, but rather the potential loss of the web as a historical record if archiving efforts are weakened.

Conceptual Challenges

AI models, particularly LLMs, typically operate on individual snapshots of data. As a result, they are not inherently designed to reason about evolution, temporal coherence, or change over time in archived web content. Consequently, when these models are applied without additional structure or context, they should not be expected to answer temporally grounded questions.

In static analysis scenarios, AI models can perform effectively. For example, given a single archived web page, an LLM can generate a summary, identify main topics, extract named entities, or analyze embedded resources such as images, videos, or scripts. Temporal analysis in web archiving, however, requires a different mode of reasoning. The central questions are not “What does this page say?” or “What is this page about?” but rather “What changed?”, “When did it change?”, “Why did it happen?”, and “What impact does the change have over time?” Answering these questions requires comparing multiple archived versions, reasoning based on context, and perhaps correlating changes across web pages.
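
A toy sketch of the “when did it change?” question: scan ordered captures and report the first timestamp at which similarity to the preceding capture falls below an assumed threshold. Here plain Jaccard token overlap stands in for the semantic comparison an LLM or embedding model would provide:

```python
def overlap(a, b):
    """Jaccard overlap between the token sets of two capture texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def first_major_change(captures, threshold=0.5):
    """Return the timestamp of the first capture that diverges from its predecessor."""
    for (_, prev), (ts, cur) in zip(captures, captures[1:]):
        if overlap(prev, cur) < threshold:
            return ts
    return None

# Invented (timestamp, text) captures of one page.
captures = [
    ("2015-09-01", "city council approves new park budget"),
    ("2015-09-15", "city council approves new park budget today"),
    ("2015-10-02", "domain for sale contact broker"),
]
print(first_major_change(captures))  # → 2015-10-02
```

The structure of the question is temporal, not static: the answer exists only in the relationship between versions, which is exactly what snapshot-oriented models lack by default.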

Integrating AI into web archiving is therefore not only about efficiency, but about enabling new forms of discovery. This requires clearly defining desired outcomes and using AI to support or accelerate processes that have traditionally been manual.

Final Reflections

To conclude, I would like to leave the reader with a set of open questions as we continue moving toward the integration of AI in web archiving. One of the most visible changes introduced by AI is the ability to go beyond syntactic analysis and begin exploring semantic analysis, where meaning, context, and interpretation matter. This shift is not about replacing existing techniques, but about expanding the types of questions we can ask when working with web archive data.

I contend that traditional algorithms remain essential for many web archiving tasks. They are precise, transparent, and well understood. AI, by contrast, offers strengths in areas where rules struggle: interpreting context, assessing relevance, and reasoning across multiple versions of content. Rather than framing this as a competition between algorithms and AI, a more productive question is how these approaches can complement one another, and in which parts of the analysis pipeline each is most appropriate.

In the short term, I expect that AI tools are unlikely to replace algorithmic methods. However, they already show promise as assistive tools that can guide analysis, prioritize attention, and help humans reason about large and complex temporal collections. This naturally raises a forward-looking question: if AI continues to improve in its ability to reason about time, meaning, and change, how should the web archiving community adapt its tools, workflows, and standards?

The WARC format has proven effective for long-term preservation, but it was not designed with AI-driven analysis in mind. Should we aim to augment existing archival formats with AI-aware representations, or should we focus on developing AI methods that better adapt to current standards such as WARC? How we answer this will shape not only how we analyze web archives, but also how future generations access and understand the web’s past.

References

[1] AK, Ashfauk Ahamed. “AI driven web crawling for semantic extraction of news content from newspapers.” Scientific Reports, 2025. [Online]. https://doi.org/10.1038/s41598-025-25616-x.

[2] Abrar, M. F., Saqib, M., Alferaidi, A., Almuraziq, T. S., Uddin, R., Khan, W., & Khan, Z. H. “Intelligent web archiving and ranking of fake news using metadata-driven credibility assessment and machine learning.” Scientific Reports, 2025. [Online]. https://doi.org/10.1038/s41598-025-31583-0.

[3] Nair, A., Goh, Z. R., Liu, T., and Huang, A. Y. “Web archives metadata generation with gpt-4o: Challenges and insights,” arXiv, Tech. Rep. arXiv:2411.05409, Nov. 2024. [Online]. https://arxiv.org/abs/2411.05409.

[4] L. Frew, M. L. Nelson, and M. C. Weigle, “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives,” in Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2023, pp. 71–81. https://doi.org/10.1109/JCDL57899.2023.00021

[5] T. Sherratt and A. Jackson, GLAM-Workbench/web-archives, https://zenodo.org/records/6450762, version v1.1.0, Apr. 2022. DOI: 10.5281/zenodo.6450762.

[6] Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within timemaps in web archives,” International Journal on Digital Libraries, vol. 17, no. 3, pp. 203–221, 2016. https://doi.org/10.1007/s00799-016-0183-5.

[7] M. C. Weigle, M. L. Nelson, S. Alam, and M. Graham, “Right HTML, wrong JSON: Challenges in replaying archived webpages built with client-side rendering,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Jun. 2023, pp. 82–92. https://doi.org/10.1109/JCDL57899.2023.0002.

[8] Robertson, K. “News publishers limit Internet Archive access due to AI scraping concerns.” Nieman Lab, Jan. 2026. [Online]. https://www.niemanlab.org/2026/01/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/

[9] Graham, M. “Preserving the web is not the problem — losing it is.” Techdirt, Feb. 17, 2026. [Online]. https://www.techdirt.com/2026/02/17/preserving-the-web-is-not-the-problem-losing-it-is/




