2020-05-21: Visualizing Webpage Changes Over Time With TMVis


This work has been supported by an NEH/IMLS Digital Humanities Advancement Grant (HAA-256368-17).

The web is dynamic, meaning webpages that exist today may not exist tomorrow. Even if a webpage continues to exist, it could display completely different content than it used to. Web archives, such as the Internet Archive (IA), Archive-It (AIT), and many others, preserve past versions of webpages for use by scholars, researchers, and the general public. Using Memento terminology, an archived version of a webpage at a particular time is called a memento, or URI-M, and the list of all mementos for a particular webpage is called a TimeMap. TimeMaps vary widely in size. For example, the TimeMap for odu.edu contains over 2000 mementos, while the TimeMap for cnn.com contains around 300,000. Analyzing such large TimeMaps manually is nearly impossible.
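To make the terminology concrete, here is a minimal Python sketch that pulls the mementos (URI-M plus datetime) out of a TimeMap serialized in the link format defined by the Memento specification (RFC 7089). The sample TimeMap, the archive.example host, and the function name parse_timemap are made up for illustration; this is not how TMVis itself reads TimeMaps.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <URI> followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (datetime, URI-M) pairs, oldest first, from a link-format TimeMap."""
    mementos = []
    for uri, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        # rel can hold several space-separated values, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)


# Made-up TimeMap for illustration only (hypothetical archive host).
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example/timegate/http://example.com/>; rel="timegate",
<http://archive.example/20100301120000/http://example.com/>; rel="first memento"; datetime="Mon, 01 Mar 2010 12:00:00 GMT",
<http://archive.example/20150614083000/http://example.com/>; rel="last memento"; datetime="Sun, 14 Jun 2015 08:30:00 GMT"'''

for when, uri_m in parse_timemap(SAMPLE_TIMEMAP):
    print(when.date(), uri_m)
```

Only entries whose rel value includes "memento" are kept, so the original resource and the TimeGate link are skipped.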

Based on previous work (AlSum and Nelson, ECIR 2014), TimeMap Visualization (TMVis) determines which mementos show significant changes in the webpage, thus finding the "unique" mementos that are worth rendering. TMVis then creates a thumbnail screenshot of each unique memento and places the thumbnails into four visualizations that help the user view the mementos: an Image Grid, an Image Slider, a Timeline, and an Animated GIF. This process makes it easier for researchers and historians to gain an overview of how a webpage changed over time. (We currently only support the Internet Archive and Archive-It, but the list of archives could be expanded to any Memento-compatible public archive.) The source code is available from https://github.com/oduwsdl/tmvis.

System Walk-through

To use TMVis, visit http://tmvis.cs.odu.edu/. Enter a URI and click "View TimeMap". A request is sent to the selected server (Internet Archive or Archive-It) and the TimeMap of the URI is fetched. The user is then presented with a histogram showing the distribution of mementos in the TimeMap over time. This histogram, shown below, is binned by month and year.

Histogram showing the distribution of mementos in the TimeMap. The user can zoom in on a portion of the histogram by clicking and dragging over it, or inputting a date range in the boxes below.
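The month-and-year binning behind the histogram, and the date-range selection that drives the zoomed view, can be sketched in a few lines of Python. The function names and the sample datetimes are hypothetical and chosen only for illustration; the actual TMVis interface does this in the browser.

```python
from collections import Counter
from datetime import datetime


def monthly_histogram(datetimes):
    """Count mementos per (year, month) bin, in chronological order."""
    counts = Counter((dt.year, dt.month) for dt in datetimes)
    return dict(sorted(counts.items()))


def select_range(datetimes, start, end):
    """Keep only the memento datetimes that fall inside [start, end]."""
    return [dt for dt in datetimes if start <= dt <= end]


# Made-up memento datetimes for illustration.
memento_datetimes = [
    datetime(2019, 1, 5),
    datetime(2019, 1, 20),
    datetime(2019, 3, 2),
    datetime(2020, 1, 1),
]

full = monthly_histogram(memento_datetimes)
zoomed = monthly_histogram(
    select_range(memento_datetimes, datetime(2019, 1, 10), datetime(2019, 12, 31))
)
print(full)
print(zoomed)
```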

By clicking and dragging over the histogram, the user can select a date range in the TimeMap. Alternatively, a date range can be entered in the input boxes below the histogram. When a date range is selected, a zoomed-in version of the histogram appears below the original histogram, making the differences in the number of mementos per month easier to see. The number of mementos in the selected date range is also displayed. This number affects how long it will take to calculate the unique mementos. Because this calculation can take several minutes, at most 1000 mementos from the selected date range will be analyzed. When the user clicks Calculate Unique, the process of selecting the most unique mementos begins. If the selected date range contains more than 1000 mementos, we first run the following algorithm to extract a sample of up to 1000 mementos.

Algorithm to determine which mementos to analyze.

The TimeMap is first filtered based on the selected date range. Then, the TimeMap is split into 250 equal partitions. For example, if the TimeMap contains 2000 mementos, each partition will contain 8 mementos. From each partition, up to 4 mementos will be chosen, which gives the overall limit of 250 × 4 = 1000 mementos. The algorithm always starts by choosing the earliest memento in the TimeMap. Each time a memento is chosen, a counter denoting the number of mementos that can still be selected from the partition is decremented. The datetime of the first memento is then compared to the datetime of the second memento. If the time between the two mementos is less than 3 days, the second memento is skipped. The first memento is compared to each consecutive memento until the distance between their datetimes is at least 3 days. As shown in the diagram above, this occurs when the first memento is compared to the sixth memento. If the end of a partition is reached before the counter reaches 0, the leftover count carries over, so up to 4 mementos plus the leftover can be chosen from the next partition. Only two mementos were selected from the first partition in the diagram, so 6 mementos can be selected from the second partition. This process continues until the end of the TimeMap is reached. The figure below shows the histogram of the TimeMap of moma.org compared to the mementos chosen to be analyzed by the algorithm. We chose a minimum gap of 3 days between mementos; this value can be adjusted, or a different sampling method can be employed in the future.

Histograms for the full TimeMap of moma.org and the TimeMap after selecting up to 1000 mementos to analyze with the algorithm.
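The sampling procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the TMVis source: the function and parameter names are hypothetical, the 3-day comparison is assumed to be made against the most recently chosen memento (carrying across partition boundaries), and uneven partitions are split by integer index arithmetic. The demo mirrors the diagram's case, where the first and sixth mementos are chosen from the first partition and the unused budget of 2 raises the second partition's budget to 6.

```python
from datetime import datetime, timedelta


def sample_mementos(datetimes, partitions=250, per_partition=4,
                    min_gap=timedelta(days=3), limit=1000):
    """Pick at most partitions * per_partition datetimes from a sorted list."""
    n = len(datetimes)
    if n <= limit:
        return list(datetimes)  # small TimeMaps are analyzed in full
    chosen = []
    budget = 0
    for p in range(partitions):
        lo, hi = p * n // partitions, (p + 1) * n // partitions
        budget += per_partition  # unused budget carries over to this partition
        for dt in datetimes[lo:hi]:
            if budget == 0:
                break  # nothing more may be chosen from this partition
            # Assumption: compare against the most recently chosen memento.
            if not chosen or dt - chosen[-1] >= min_gap:
                chosen.append(dt)
                budget -= 1
    return chosen


# Made-up TimeMap: 16 mementos, given as day offsets from a start date.
start = datetime(2020, 1, 1)
offsets = [0, 1, 1.5, 2, 2.5, 3, 4, 5, 6, 9, 12, 15, 18, 21, 24, 27]
timemap = [start + timedelta(days=d) for d in offsets]

# Two partitions of 8 mementos each; limit lowered so sampling kicks in.
sample = sample_mementos(timemap, partitions=2, per_partition=4, limit=10)
print([(d - start).days for d in sample])  # [0, 3, 6, 9, 12, 15, 18, 21]
```

With the default arguments, the function returns TimeMaps of 1000 or fewer mementos unchanged and caps larger ones at 250 × 4 = 1000.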

Next, the unique mementos are calculated from this sample of up to 1000 mementos. The first step is to calculate the SimHash of each memento. The SimHashes are then compared to filter the mementos using different Hamming distance thresholds. The filtering process begins by selecting the first memento in the TimeMap to be included in the summarization. This memento acts as a baseline to compare with subsequent mementos.

Unique memento calculation process.

The diagram above shows the unique memento calculation process. In the diagram, the Hamming distance threshold is denoted as HDT. Each memento is denoted by M1, M2, ... Mn. The first memento, M1, is compared to each consecutive memento until the Hamming distance is greater than or equal to the threshold. In the diagram, this occurs when M1 is compared to M3.
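The calculation above can be illustrated with a small Python sketch. It makes several assumptions that the post does not confirm: the SimHash here is a generic 64-bit, word-level variant built on SHA-256 (TMVis may tokenize and hash differently), and once a memento meets the threshold it is assumed to become the new baseline for the comparisons that follow. All function names are hypothetical.

```python
import hashlib


def simhash(text, bits=64):
    """Generic word-level SimHash (illustrative; not necessarily TMVis's variant)."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where set bits outnumbered clear bits.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def unique_mementos(fingerprints, threshold):
    """Return indexes of mementos kept as unique for one threshold (HDT)."""
    if not fingerprints:
        return []
    kept = [0]                      # M1 is always included
    baseline = fingerprints[0]
    for i in range(1, len(fingerprints)):
        if hamming_distance(baseline, fingerprints[i]) >= threshold:
            kept.append(i)
            baseline = fingerprints[i]  # assumption: new baseline
    return kept


# Hand-made 4-bit fingerprints for M1..M5, so the result is easy to check.
fingerprints = [0b0000, 0b0001, 0b0111, 0b0110, 0b1000]
print(unique_mementos(fingerprints, threshold=3))  # [0, 2, 4]: M1, M3, M5
```

In this toy input, M2 is 1 bit away from M1 and is skipped, while M3 is 3 bits away and is kept, matching the diagram's case where the threshold is first met at M3.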

When the unique memento calculation is completed, the user is presented with several options for the number of unique mementos to generate thumbnails for, as shown below. Each option corresponds to a different Hamming distance threshold used in the calculation. A higher Hamming distance threshold results in fewer unique mementos.

Stats page showing the different unique thumbnail options. Each option gives the number of thumbnails to generate, based on different Hamming distance thresholds.

When the user selects the number of thumbnails to generate, the estimated time needed to generate them is displayed. As shown in the image above, it takes about 34 minutes (depending upon the responsiveness of the archive) to generate 62 thumbnails for odu.edu. The time to generate thumbnails includes the time to render each of the selected mementos in a headless browser and create screenshots. Selecting fewer thumbnails reduces the time required. Clicking Generate Thumbnails takes the user to the final page, which shows the four visualizations.
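As a rough back-of-the-envelope check, the odu.edu figures above work out to about 33 seconds per thumbnail (34 × 60 / 62). The Python sketch below scales that rate linearly. The linear model and the function name are assumptions made for illustration; the post does not describe how TMVis computes its estimate, and actual times depend on the archive's responsiveness.

```python
def estimate_minutes(thumbnails, seconds_per_thumbnail=34 * 60 / 62):
    """Linear time estimate, using the odu.edu example as the per-thumbnail rate."""
    return thumbnails * seconds_per_thumbnail / 60


print(round(34 * 60 / 62, 1))       # seconds per thumbnail in the example
print(round(estimate_minutes(62)))  # the example itself: about 34 minutes
print(round(estimate_minutes(31)))  # half as many thumbnails
```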

Image Grid

Image Grid showing thumbnails for odu.edu.

The first visualization is the Image Grid shown above. In this visualization, the thumbnails are arranged left to right, top to bottom. The user can remove any thumbnail from all four visualizations by pressing the X in the top right corner of the thumbnail to be removed. Once all thumbnails to be removed have been selected, pressing Update will redraw the Image Grid. The user can undo the removals and include all thumbnails in all visualizations again by clicking Revert.

Embed code can be generated for the Image Grid by clicking Embed Image Grid on the top left. This embed code takes into account which thumbnails are currently displayed in the visualizations, so if a user removes a thumbnail, it will not be included in the embed code. This code can be used to embed the visualization in any webpage (example page with embedded visualizations). The Download URI-M List button allows the user to download a text file containing the URI-Ms of the mementos. This text file can be used as input to Raintale, a tool that allows users to create and share stories using archived webpages.

Embed code for the Image Grid.

Portion of downloaded URI-M list.

Image Slider

Image Slider with embed code.

The image above shows the Image Slider. This visualization imitates the photo roller functionality used in iPhoto. Moving the cursor across the thumbnail shows the next thumbnail. Clicking the currently displayed thumbnail takes the user to the actual archived page. The user can also cycle through the thumbnails by clicking the arrow buttons to the left and right of the Slider. Like the Image Grid, the Image Slider has the option to generate embed code.

Timeline

Timeline visualization for odu.edu.

The Timeline view shown above arranges the thumbnails according to datetime, similar to the histogram. The user can navigate the Timeline view using the zoom, next, previous, next unique, and previous unique buttons. Each yellow bar represents a unique memento, while each gray bar represents a regular memento. The timeline visualization is based on the Timeline Setter library.

Animated GIF

Animated GIF displaying a thumbnail for odu.edu.

The figure above shows the Animated GIF visualization. It uses the GifShot library to create the GIF from the thumbnails. The GIF can be downloaded by clicking Download GIF. The animation interval can be adjusted by changing the number of seconds in the box above and then clicking Update GIF. Checking the Add Timestamp box and then clicking Update GIF adds a watermark of the appropriate datetime to each thumbnail.

TMVis is a great tool for summarizing how webpages have changed over time. Web archives capture and store mementos, while TMVis takes these mementos and presents them in organized visualizations. Sharing the visualizations is made easy with the embeddable Image Grid and Image Slider, and the downloadable Animated GIF. It has been an enjoyable and valuable experience to contribute to this project. Archiving the web and being able to study how it has evolved is important for preserving history.

-- Abigail Mabe (@abigail_mabe) and Dhruv Patel (@dhruv_282)

Weigle and Nelson: We are grateful for the support of the National Endowment for the Humanities (NEH) and the Institute of Museum and Library Services (IMLS), and for the input from our partners, Deborah Kempe from the Frick Art Reference Library and New York Art Resources Consortium and Pamela Graham and Alex Thurman from Columbia University Libraries. This project is an extension of AlSum and Nelson's Thumbnail Summarization Techniques for Web Archives, published in ECIR 2014 (presentation slides), and our previous work, funded by an incentive grant from Columbia University Libraries and the Andrew W. Mellon Foundation. This project has benefited from major contributions from Mat Kelly (built the initial web implementation of AlSum's summarization algorithm), Sawood Alam (provided advice and technical support), Surbhi Shankar (mocked up the initial visualizations), Maheedhar Gunnam (connected Mat's summarization code with the visualizations in a web service), and Miranda Smith (made UI enhancements).

Update (6/11/2020): We published a tech report on this project, available on arXiv.

Abigail Mabe, Dhruv Patel, Maheedhar Gunnam, Surbhi Shankar, Mat Kelly, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. Visualizing Webpage Changes Over Time. Technical report arXiv:2006.02487, June 2020, https://arxiv.org/abs/2006.02487
