2022-05-03: Summary of "Temporal Search in Web Archives"

In this figure, nine document versions (grey) are shown ordered with respect to time. The versions have different term frequencies, represented by the score axis. The nine original versions are coalesced into three versions (black) in the inverted index. (Source: Berberich, Figure 3.3)

According to a user survey conducted by the National Library of the Netherlands in 2007, full-text search is the top requested feature for web archives. Some web archives do allow for the public to perform full-text searches, while other web archives only allow the public to search by website address. For web archives that have implemented full-text search, every version of a document may be indexed, regardless of the similarity between consecutive document versions. In “Temporal Search in Web Archives” (2010), Berberich develops time-travel text search to improve searching in web archives.

What extensions to existing architecture will support full-text search with respect to time across versioned documents in sizable web archives?

In this figure, a sample time travel inverted index shows the terms with validity time ranges in grey, and the documents with validity time ranges in white. The term cat has been partitioned into two posting lists. Note that document nine is duplicated in the index for the term cat since the document's validity range matches the validity range of both posting lists. (Source: Berberich, Figure 3.1)

In a traditional inverted index, each term has one listing of matching documents. However, this setup doesn't efficiently or effectively take into account how documents change over time. The Time Travel Inverted indeX (TTIX) extends the traditional inverted index by storing each term with multiple posting lists that each have an associated valid time range. Each versioned document also has a valid time range, which is the timestamp of the identified version of the document, inclusive, to the timestamp of the next version of the document, exclusive. By setting up the index in this way, a query for a term over time can be retrieved by selecting a posting list for that term that matches the time specified in the query. This framework allows for more efficient retrieval from the index, since documents that match the term but don’t match the time range are pre-filtered from the results via the validity time ranges of the term posting lists. However, increasing the efficiency of query processing comes at an index space cost. The best-case scenario for efficient retrieval involves partitioning the index into one posting list per valid time interval. In this case, documents with time ranges that match multiple posting list time intervals are duplicated in the index, which increases the size of the index. The best case for efficient index space involves one posting list per term, which negates the benefits of the validity time ranges. Given the space bound requirements, the retrieval efficiency can be maximized within this threshold. Thus, the TTIX can create a viable balance for efficiency between the index’s size and retrieval from the index.

Temporal coalescing reduces the size of the index by combining the entries for documents with similar consecutive versions. Versions of a document are only coalesced if they are similar for the term within a specified threshold and if they are temporally adjacent. If a term disappears and then appears, this would be reflected by a gap in the coalescences. Coalescing is also possible when considering the frequency of the term rather than just its presence or absence. Finally, coalescing is also possible when considering the position of the word in the document. This positional coalescing does reduce the size of the index but comes at an index retrieval efficiency cost when compared to the skip lists used in high-performance search systems (one widely used example of this kind of search system is Lucene). In order to account for this, it’s possible to specify a bound on the efficiency loss that still maximizes index space savings.

When a coalesced document is identified as a match for a query, the list of versions of the document in sorted time order is consulted in order to retrieve all versions of the document that are a match to the query.

In order to evaluate his approach, Berberich implemented the TTIX in Java and ran his algorithms with different coalescing thresholds on three datasets: 15 million English Wikipedia revisions from 2001 - 2005, 17 million United Kingdom government web pages captures crawled from 2004 - 2005, and 1.8 million articles in the New York Times Annotated Corpus from 1987 - 2007. The Wikipedia and UK datasets include versioning, while the New York Times dataset does not. Berberich first ran the algorithms with different coalescing thresholds using the Okapi BM25 term frequency scores as a similarity measure. The first result that he noticed was that approximately one third of the UK web page crawls were identical duplicates to a previously crawled version of that page, and approximately one half of the UK web page crawls and Wikipedia revisions were near duplicates. The result of this property of the datasets is that temporal coalescing is highly effective in reducing the index size at all coalescing thresholds. Similar results were obtained even when considering the positions of each term in the document rather than the term frequency.

Reducing the index size is only valuable if the query results are comparable to a non-coalesced index. The queries used to evaluate the coalescing technique were drawn from AOL query logs from 2006 whose query results matched pages in the data sets, along with random time intervals with different granularities in the validity range of each data set. Berberich found that the recall for queries was inversely proportional to the temporal coalescing threshold. For example, a higher coalescing factor, representing the ability to merge documents with less similarity, results in lower recall. He also found that time granularities perform differently with respect to temporal coalescing. When comparing the first 25 search results for the queries from a coalesced index versus a non-coalesced index, searching with a temporal range of one week kept the results pretty similar even when using a coalescing factor of 0.05. This coalescing factor allows for less similarity than the coalescing factor of 0.01 representing near duplicates, which makes the index smaller. However, using a temporal search range of one year had poor performance for all coalescing factors above 0.01, which means that only near duplicates can be merged for this time range.

Finally, Berberich measured the processing time for the queries. Because of the number of near duplicates present in the data sets, temporal coalescing lowered the time required for queries by nearly 50%. For phrase queries, which require the index to maintain the position of each term, there was a slight increase in processing time when using identical duplicate temporal coalescing, which is what was expected because of the advantage that skip-list based search systems have in this scenario. However, higher coalescing thresholds did reduce the processing time below the time used in those systems.

How can search queries using contemporary terminology be mapped to their historical verbiage in order to retrieve relevant documents?

Berberich developed an across-time semantic similarity measure to create a mapping between contemporary terminology and the historical terminology used in relevant archived documents. Thinking of the past as its own language, this mapping creates a dictionary to translate between current terms and past terms. For each term in the corpus, the co-occurrence of other terms with respect to a given time unit is precomputed. If two terms from different times are used in similar contexts, their co-terms will be similar. If the query contains multiple terms, each term will be reformulated into its historical counterpart. In order to make sure the combination of the historical terms is a good match for the contemporary query, another aspect of the model is making sure that when multiple terms are translated, the combination of historical terms appears together in documents. Reformulating terms is only effective if it results in finding matching documents, so the popularity of the historical query is also calculated. Therefore, the probability-based model that facilitates effective query reformulation is based on the across-time semantic similarity, the co-occurrence of multiple terms, and the popularity of the suggested query.

In this figure, Walkman and iPod are determined to be across-time similar based on the terms they frequently are used with in their respective times, such as portable, earphones, and music. (Source: Berberich, Figure 4.1)

The goal of query reformulation is to provide the user with search suggestions, so it’s important for the model to run quickly. Two design decisions facilitate the speed needed. First, a subset of the full-term list is pruned to eliminate combinations that are unlikely to occur. Next, reformulations are computed using a shortest path heuristic.

In order to evaluate this algorithm, terms from 2005 were mapped to 1990 terms using the New York Times data set. The historical terms suggested by the algorithm were all manually examined and determined to be feasible reformulations.

How can temporal expressions that are extracted from text queries containing both a textual part and a temporal part be better integrated into the retrieval model to improve effectiveness?

Text queries that contain both a textual part and a temporal part are the current way that users attempt to make temporal queries. The temporal expressions that are extracted from the query contain an inherent amount of uncertainty. To account for this uncertainty, Berberich developed a validity range model for the start and end time associated with the query. In order to have a high ranking in the search results, a document must have a high score matching the textual part of the query as well as having a large overlap between its validity range and the validity range of the query. This overlap can be determined mathematically as an intersection.

The evaluation of the retrieval model centered on a user study. First, Mechanical Turk workers specified entities that matched given temporal expressions and vice versa. Next, workers evaluated search results from the New York Times and Wikipedia data sets as relevant or not relevant. Similar to the evaluation of temporal coalescing, different time granularities had different results. The retrieval model was effective for the New York Times data set across all time granularities but was not effective for the Wikipedia data set for time granularities smaller than a decade.

Outlook

In conclusion, Berberich’s temporal coalescing technique is important for search in web archives, when many near duplicate versions of a page are present. Berberich applied temporal coalescing at the index level but noted it would be worthwhile to also apply temporal coalescing to the query results for future work. Berberich also demonstrated the importance of temporal validity ranges for documents and queries. There are now existing search systems, like Lucene, that support document date ranges including calculating the intersection of two date ranges and variable temporal granularity levels. Finally, Berberich identified the inherent uncertainty in text queries that have both a textual part and a temporal part. For searching a corpus with a significant temporal dimension, allowing the user to control the temporal part of their query instead of typing it as a text-based expression could improve both the relevancy of the search results and the user experience.

Sources

Klaus Berberich, “Temporal Search in Web Archives,” PhD thesis, Saarland University, July 2010.
Ras and van Bussel. “Web Archiving User Study.” National Library of the Netherlands, July 2007.

-Lesley Frew

Search This Blog

Web Science and Digital Libraries Research Group