2022-07-22: Summary of "Web Archiving and Search Personalized"

The Web Archiving and Search Personalized system automatically captures, archives, and indexes pages for both full-text search and replay. (Source: Kiesel et al., Figure 1a)

A 2007 study by Teevan et al. found that 39% of search queries are attempts to re-find previously viewed pages [1]. One approach to supporting users in this task is automatic personal web archiving. Each page that the user visits is saved so that it can be found later, similar to an automated version of the "bookmark as archive" feature in Mabe et al.’s Memento-aware browser prototype [2]. However, creating a system that can save web pages as they are viewed, index them for full-text search, and replay them later is an ambitious goal. Johannes Kiesel (@KieselJohannes), Arjen P. de Vries (@arjenpdevries), Matthias Hagen (@matthias_hagen), Benno Stein (@bennostein), and Martin Potthast (@martinpotthast) present a prototype system for this purpose in their paper “Web Archiving and Search Personalized” from DESIRES 2018 [3].


DESIRES (@DESIRES_IR), Design of Experimental Search & Information REtrieval Systems, is a biennial conference that promotes innovation in information retrieval systems. The conference proceedings are published through CEUR Workshop Proceedings (CEUR-WS.org), which is hosted by Sun SITE Central Europe. Currently, the four types of papers at the conference are full papers, prototypes, open problems, and abstracts. Papers are encouraged to deviate from traditional approaches, highlight both successes and failures, and favor boldness over convention.


Architecture


The Web Archiving and Search Personalized (WASP) system automatically archives and then indexes visited pages, allows for discovery of these pages through a search interface, and enables access to the pages through a replay engine.


To automatically archive pages as they are visited, WASP uses warcprox, a proxy that writes the web traffic passing through it to WARC files. warcprox saves the WARCs that it writes to a single folder, which in this case is the same as the default WARC folder of the replay engine, PyWB. This choice was made because PyWB already includes functionality to automatically index all WARCs in its default folder: PyWB watches its directory for WARCs that have been modified more recently than its index file.
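The modification-time comparison described above can be sketched in a few lines of Python. This is not PyWB's actual code, only a minimal illustration of the idea; the function name `warcs_newer_than_index` and the `*.warc*` file pattern are assumptions made for this example.

```python
from pathlib import Path


def warcs_newer_than_index(warc_dir, index_path):
    """Return WARC files modified more recently than the index file.

    Hypothetical helper: a sketch of the idea, not PyWB's implementation.
    """
    index = Path(index_path)
    # If no index exists yet, every WARC in the folder still needs indexing.
    index_mtime = index.stat().st_mtime_ns if index.exists() else -1
    candidates = sorted(Path(warc_dir).glob("*.warc*"))
    return [p for p in candidates if p.stat().st_mtime_ns > index_mtime]
```

Calling this periodically and re-indexing whatever it returns would approximate the behavior the paragraph describes, at the cost of re-reading a WARC that is still being written.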


The WARCs also need to be automatically indexed for full-text search. WASP uses the Elasticsearch search engine platform, which is built on a Lucene index. Java's NIO file package (java.nio.file) provides the Watch Service API, which can monitor directories for changes to files. As warcprox adds information to a WARC, WASP captures and analyzes the new information and sends it to be indexed through the Elasticsearch Java API. Each WARC is parsed with the Lemur Project WARC reader, and the content is then checked with the Apache HttpClient library to see whether it is HTML. HTML content is parsed with the Jericho HTML Parser library before being indexed.
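To make the last steps of this pipeline concrete, here is a rough Python sketch of the "is it HTML, and if so extract the text" stage. WASP itself is written in Java and uses the libraries named above; this example swaps in Python's standard `html.parser` module purely for illustration. The function names and the document fields (`url`, `timestamp`, `content`) are assumptions, not WASP's actual schema.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def is_html(content_type):
    """True when an HTTP Content-Type header value denotes HTML."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_index_document(url, timestamp, content_type, body):
    """Build a dict for the search index, or None for non-HTML responses.

    The field names are hypothetical, chosen only for this sketch.
    """
    if not is_html(content_type):
        return None
    return {"url": url, "timestamp": timestamp, "content": extract_text(body)}
```

A real indexer would also need to handle character encodings and malformed markup, which is part of why a dedicated parser library is a sensible choice.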



In this figure, the Web Archiving and Search Personalized system supports (b) a search interface via interaction with an Elasticsearch/Lucene index and the Elasticsearch Java API, as well as (c) a replay engine with PyWB. (Source: Kiesel et al., Figure 1b-c)


WASP includes both a search interface and a replay engine for finding and viewing archived pages. To allow all of the services to function independently, each service (search, replay, etc.) is assigned its own port. WASP’s search interface is a custom interface built with the Elasticsearch Java API. Users can search by keyword, and they can also specify a time interval of interest. Each search result includes a link to replay the archived page with PyWB, as well as a link to view the page on the live web. Because search runs locally, WASP gives users more privacy than live web search engines do.
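A keyword search restricted to a time interval maps naturally onto an Elasticsearch bool query that combines a `match` clause with a `range` filter. The sketch below builds such a request body as a plain Python dict. It is only an illustration of the query shape: the field names `content` and `timestamp` are assumptions, and WASP's actual queries, written against the Java API, may be structured differently.

```python
def build_search_query(keywords, start=None, end=None):
    """Build an Elasticsearch request body for a keyword search.

    `start` and `end` optionally bound the capture time. The field names
    "content" and "timestamp" are hypothetical.
    """
    query = {"bool": {"must": [{"match": {"content": keywords}}]}}
    bounds = {}
    if start is not None:
        bounds["gte"] = start
    if end is not None:
        bounds["lte"] = end
    if bounds:
        # A filter clause restricts the results without affecting scoring.
        query["bool"]["filter"] = [{"range": {"timestamp": bounds}}]
    return {"query": query}
```

For example, `build_search_query("web archiving", start="2018-01-01", end="2018-06-30")` would describe a search for pages captured in the first half of 2018.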


In this figure, the WASP search interface allows users to search their personal archive by keyword and also specify a time range of interest. Each search result includes two links: one to replay the archived version, and another to view the page on the live web. (Source: Kiesel et al., Figure 2)

The source code and Docker instructions for WASP are available on GitHub.


Possible Evaluation Paths and Lessons Learned

“In case several non-identical versions of the same page are found in the requested interval, the prototype displays all of them as separate results. However, we expect that more mature personal web archiving and search systems will rather condense the different versions of a web page, especially when the context of the query terms is similar in the versions.” – Kiesel et al.

Does personal web archiving make pages easier to find than live web search engines do? The authors propose evaluating their system by presenting the user with a screenshot of a page they are to re-find, and then measuring the success rate, the number of clicks, and the time taken. A hindrance to the success of the WASP system is the presence of multiple versions of the same page in the search results. Because the versions were near duplicates, they all matched the search query, but listing each of them separately muddled the results overall.


The authors finish with a discussion of lessons learned. First, there is a need for both privacy and control over the pages archived. Control exists via a proxy-switching browser plugin, but there is not yet a way for users to clear parts of their personal web archive index the way that they can clear their browsing history. Next, the authors present the idea of using time spent on a web page to trigger indexing only the important pages. Finally, the authors discuss modern web page designs that challenge the idea of a document unit, such as unnatural pagination in support of ad revenue and timelines with infinite scrolling.


Outlook


How should multiple versions of the same page be represented in search results?


In the WASP system, multiple versions of the same page are displayed separately. What are the advantages and disadvantages of condensing the versions? Kiesel et al. observed that displaying the page versions as separate results “cluttered” the results list. However, in “Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities,” Jackson et al. noted that grouping can hide the ways that pages change over time [4]. Users are interested in seeing how pages change over time. A search engine that presents multiple versions of a page as one result could indicate how the query terms on the page have changed over time, such as by identifying term additions and deletions. Showing the search results in this way would make it possible to group multiple versions of the same page without hiding the pages’ changes over time.
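As a thought experiment, the grouping idea could look something like the following Python sketch. It groups results by URL, orders each group by capture time, and records which query terms appear or disappear from one version to the next. None of this comes from WASP or from Jackson et al.; the function name, the result fields (`url`, `timestamp`, `content`), and the naive whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict


def condense_versions(results, query_terms):
    """Group results by URL and report query-term changes between versions.

    Hypothetical sketch: `results` is a list of dicts with "url",
    "timestamp", and "content" keys.
    """
    terms = {t.lower() for t in query_terms}
    by_url = defaultdict(list)
    for result in results:
        by_url[result["url"]].append(result)

    condensed = []
    for url, versions in by_url.items():
        versions.sort(key=lambda r: r["timestamp"])
        history = []
        previous = None
        for version in versions:
            words = set(version["content"].lower().split())
            present = terms & words
            if previous is None:
                # First capture: every matching term counts as an addition.
                added, removed = present, set()
            else:
                added, removed = present - previous, previous - present
            history.append({
                "timestamp": version["timestamp"],
                "added": sorted(added),
                "removed": sorted(removed),
            })
            previous = present
        condensed.append({"url": url, "versions": history})
    return condensed
```

Each URL then appears once in the results list, while the per-version `added` and `removed` lists keep the page's changes visible.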


Why is automatic indexing significant for full-text search in web archives?


At the 2022 IIPC Web Archiving Conference (trip summary), Andy Jackson (@anjacks0n), Anders Klindt Myrvoll (@AndersKlindt), and Ben O'Brien (@ob1_ben_ob) led a session on Full-Text Search for Web Archives. One of the open problems identified by the panelists is aligning the Solr index (Lucene) that is necessary for full-text search with the PyWB index (CDXJ) that is necessary for state-of-the-art replay. Anders Klindt Myrvoll commented (36:30), “It could be great if we could just reuse the Solr index!” While the UKWA WARC Indexer populates Lucene with most of the fields necessary for the CDXJ, the Solr/Lucene field “content_length” is not compatible with the CDXJ field “length” because the latter represents the entire length of each record, including both its content and headers. Another option is to query the Solr/Lucene index for a list of every WARC it contains, and then send that list to PyWB to be indexed for replay. This has proven to be a viable way to keep the PyWB index aligned with the Lucene index, but it’s a bit clunky. Automatic indexing of both at capture time, as in WASP, would be a cleaner solution to this problem.
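The mismatch between the two length fields is easy to see on a single record. The Python sketch below takes one uncompressed WARC record, reads its Content-Length header, and compares that with the size of the whole record (headers, content, and the trailing blank lines). The helper name is hypothetical, and the example deliberately ignores compression: for gzipped WARCs the stored record size would differ again.

```python
def record_lengths(record):
    """Return (content_length, total_length) for one uncompressed WARC record.

    Hypothetical helper for illustration only. `record` is the raw bytes of
    a single record, including the two CRLF pairs that terminate it.
    """
    # The WARC header block ends at the first blank line.
    header_end = record.index(b"\r\n\r\n") + 4
    header_lines = record[:header_end].decode("utf-8").split("\r\n")
    content_length = None
    for line in header_lines[1:]:  # skip the "WARC/1.0" version line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            content_length = int(value.strip())
    if content_length is None:
        raise ValueError("record has no Content-Length header")
    # Headers + content block + the CRLF CRLF that closes the record.
    total_length = header_end + content_length + 4
    return content_length, total_length
```

Because the total always exceeds the content length by the size of the headers and terminator, one value cannot simply be copied into the other field.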


Sources

  1. Jaime Teevan, Eytan Adar, Rosie Jones, and Michael A. S. Potts. “Information re-retrieval: repeat queries in Yahoo's logs.” In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 151–158. SIGIR '07. https://doi.org/10.1145/1277741.1277770
  2. Abby Mabe, Michael Nelson, and Michele Weigle. “A Chromium-based Memento-aware Web Browser.” In Proceedings of TPDL 2022. https://arxiv.org/abs/2104.13361
  3. Johannes Kiesel, Arjen P. de Vries, Matthias Hagen, Benno Stein, and Martin Potthast. “WASP: Web Archiving and Search Personalized.” DESIRES 2018, 16–21. https://desires.dei.unipd.it/2018/papers/paper4.pdf
  4. Andrew Jackson, Jimmy Lin, Ian Milligan, and Nick Ruest. “Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities.” In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, 103–106. JCDL ’16. https://doi.org/10.1145/2910896.2910912


--Lesley Frew
