2017-12-11: Difficulties in timestamping archived web pages

Figure 1: A web page from nasa.gov is archived by Michael's Evil Wayback in July 2017.

Figure 2: When visiting the same archived page in October 2017, we found that the content of the page has been tampered with.
 
The 2016 Survey of Web Archiving in the United States shows an increasing trend of using public and private web archives in addition to the Internet Archive (IA). Because of this trend, we should consider the validity of the archived web pages delivered by these archives.
Let us look at an example. The important web page https://climate.nasa.gov/vital-signs/carbon-dioxide/, which keeps a record of the carbon dioxide (CO2) level in the Earth’s atmosphere, was captured by a private web archive, “Michael’s Evil Wayback”, on July 17, 2017 at 18:51 GMT. At that time, as Figure 1 shows, the CO2 level was 406.31 ppm.
When revisiting the same archived page in October 2017, we should be presented with the same content. Surprisingly, the CO2 level had changed to 270.31 ppm, as Figure 2 shows. So which one is the “real” archived page?
A simple way to detect that the content of an archived web page has been modified is to generate a cryptographic hash value on the returned HTML code and compare it with a hash recorded earlier. For example, the following command downloads the web page https://climate.nasa.gov/vital-signs/carbon-dioxide/ and generates a SHA-256 hash value on its HTML content:
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
b87320c612905c17d1f05ffb2f9401ef45a6727ed6c80703b00240a209c3e828  -
The next figure illustrates how this simple approach of generating hashes can detect tampering with the content of archived pages. In this example, the "black hat" in the figure (i.e., Michael’s Evil Wayback) has changed the CO2 level to a lower value (i.e., in favor of individuals or organizations who deny that CO2 is one of the main causes of global warming).
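The comparison described above can be sketched in shell. This is a minimal sketch, not the tooling used for the example: the helper names sha256_stdin and matches_digest are hypothetical, and it assumes the archived page has already been saved to a local file and that a digest was recorded when the page was first captured.

```shell
#!/bin/sh
# sha256_stdin and matches_digest are hypothetical helper names.

# Print the SHA-256 hex digest of whatever arrives on standard input.
# Uses sha256sum where available and falls back to shasum otherwise.
sha256_stdin() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum | cut -d ' ' -f 1
    else
        shasum -a 256 | cut -d ' ' -f 1
    fi
}

# matches_digest FILE EXPECTED
# Succeed only if the content of FILE hashes to the digest EXPECTED.
matches_digest() {
    [ "$(sha256_stdin < "$1")" = "$2" ]
}
```

For instance, after saving the archived page to memento.html with curl -s, calling matches_digest memento.html with the digest recorded in July would be expected to fail for the altered copy shown in Figure 2, because even a small change in the HTML produces a different SHA-256 value.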


Another possible solution to validate archived web pages is to use timestamping. If a trusted timestamp is issued on an archived web page, anyone can verify that a particular representation of the web page existed at a specific time in the past.
As of today, many systems, such as OriginStamp and OpenTimestamps, offer a free-of-charge service to generate trusted timestamps of digital documents based on blockchain networks such as Bitcoin. These tools perform multiple steps to successfully create timestamps. One of these steps requires computing a hash value which represents the content of the resource (e.g., with the cURL command above). Next, this hash value is converted to a Bitcoin address, and then a Bitcoin transaction is made where one of the two sides of the transaction (i.e., the source and destination) should be the newly generated address. Once the transaction is approved by the blockchain, its creation datetime is considered to be a trusted timestamp. Shawn Jones describes in "Trusted Timestamping of Mementos" how to create trusted timestamps of archived web pages using blockchain networks.
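To make the flow of these steps concrete, here is a toy sketch in shell. It is not how OriginStamp or OpenTimestamps work internally: a local append-only text file stands in for the blockchain, no Bitcoin address or transaction is created, and the names LEDGER, sha256_stdin, stamp_file, and lookup_file are all hypothetical. It only illustrates the idea of hashing a document, recording the digest together with a time, and later looking the digest up.

```shell
#!/bin/sh
# Toy illustration only: a local file stands in for the blockchain.
# LEDGER, sha256_stdin, stamp_file, and lookup_file are hypothetical names.

LEDGER=${LEDGER:-./toy-ledger.txt}

# Print the SHA-256 hex digest of standard input.
sha256_stdin() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum | cut -d ' ' -f 1
    else
        shasum -a 256 | cut -d ' ' -f 1
    fi
}

# stamp_file FILE
# Hash FILE and append "digest UTC-time" to the ledger.
# A real service would embed the digest in a Bitcoin transaction instead.
stamp_file() {
    stamp_digest=$(sha256_stdin < "$1")
    printf '%s %s\n' "$stamp_digest" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$LEDGER"
}

# lookup_file FILE
# Print the earliest recorded time for FILE's digest; fail if none exists.
lookup_file() {
    lookup_digest=$(sha256_stdin < "$1")
    grep "^$lookup_digest " "$LEDGER" 2>/dev/null | head -n 1 | cut -d ' ' -f 2 | grep .
}
```

Unlike this local file, which anyone with write access could edit, a blockchain is designed so that recorded transactions are hard to alter afterwards, which is what makes the recorded time trustworthy.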
In our technical report "Difficulties of Timestamping Archived Web Pages", we show that trusted timestamping of archived web pages is not an easy task for several reasons. The main reason is that a hash value calculated on the content of an archived web page (i.e., memento) should be repeatable. That is, we should always obtain the same hash value each time we retrieve the memento. In addition to describing those difficulties, we introduce some requirements to be fulfilled in order to generate repeatable hash values of mementos.
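A quick way to probe repeatability is to retrieve the same memento several times and compare the resulting hashes. The following is a minimal sketch under stated assumptions, not the procedure from the technical report: the helper names sha256_stdin and hash_is_repeatable are hypothetical, and the command used to retrieve the memento is passed in by the caller.

```shell
#!/bin/sh
# sha256_stdin and hash_is_repeatable are hypothetical helper names.

# Print the SHA-256 hex digest of standard input.
sha256_stdin() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum | cut -d ' ' -f 1
    else
        shasum -a 256 | cut -d ' ' -f 1
    fi
}

# hash_is_repeatable RUNS CMD [ARGS...]
# Run CMD RUNS times, hash its standard output each time, and succeed
# only if every run produces the same digest as the first run.
hash_is_repeatable() {
    rep_runs=$1
    shift
    rep_first=""
    rep_i=0
    while [ "$rep_i" -lt "$rep_runs" ]; do
        rep_digest=$("$@" | sha256_stdin)
        if [ "$rep_i" -eq 0 ]; then
            rep_first=$rep_digest
        elif [ "$rep_digest" != "$rep_first" ]; then
            return 1
        fi
        rep_i=$((rep_i + 1))
    done
    return 0
}
```

For example, hash_is_repeatable 5 curl -s followed by the memento's URI would download the memento five times and report whether the digests agree. Note that a failed download yields empty output on every run, which this sketch would wrongly report as repeatable, so the retrieval command's success should be checked separately.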

--Mohamed Aturban


Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, "Difficulties of Timestamping Archived Web Pages." 2017. Technical Report. arXiv:1712.03140.
