2024-03-27: If the First Archived Copy by the Wayback Machine is 404, When did the URL First Exist?

Figure 1: TimeMap for the URL https://twitter.com/elonmusk/status/1675187969420828672

On July 1, 2023, Elon Musk posted a tweet that was archived multiple times by the Wayback Machine. Figure 1 shows a representation of the TimeMap for the URI-R (URI of the original resource) of the tweet. Each timestamp shown indicates the time, or Memento-Datetime, when the URI-R was archived. In this interface, the color of the timestamp indicates the type of HTTP response that was received at the time of archiving. A blue timestamp indicates an archived HTTP 200 response (the URI-R returned the expected result), a green timestamp indicates an archived HTTP 302 response (the URI-R redirected), and an orange timestamp indicates an archived HTTP 404 response (the URI-R was not found). Each timestamp links to a URI-M, which is the URI of the archived webpage, or memento. Typically, we can take the first appearance of a URI-M in a TimeMap as evidence that the URI-R existed at that Memento-Datetime (the time of archiving). But what if that first memento is of an HTTP 404 response, as shown in Figure 1?

We can obtain the machine-readable version of what we see in Figure 1 using the Wayback Machine’s CDX API. This is shown below:

% curl -s "http://web.archive.org/cdx/timemap/cdx?url=https://twitter.com/elonmusk/status/1675187969420828672" | head -20
com,twitter)/elonmusk/status/1675187969420828672 20230701170320 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3112
com,twitter)/elonmusk/status/1675187969420828672 20230701170812 https://twitter.com/elonmusk/status/1675187969420828672 text/plain 302 OHPCCN5YK6NNMWGU4KWSLZHO3GKXRARC 428
com,twitter)/elonmusk/status/1675187969420828672 20230701182920 http://twitter.com/elonmusk/status/1675187969420828672 text/plain 302 OHPCCN5YK6NNMWGU4KWSLZHO3GKXRARC 419
com,twitter)/elonmusk/status/1675187969420828672 20230701183708 https://twitter.com/elonmusk/status/1675187969420828672 text/html 302 UGK2GNV3LUKVKQ5QKAHBBCDKQTBXSI7Q 995
com,twitter)/elonmusk/status/1675187969420828672 20230701183709 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 D7KQDUUH7LPO4RLK3PSMNZXOC4BFBXCA 46643
com,twitter)/elonmusk/status/1675187969420828672 20230701194954 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 2985
com,twitter)/elonmusk/status/1675187969420828672 20230702021719 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3113
com,twitter)/elonmusk/status/1675187969420828672 20230702025114 https://twitter.com/elonmusk/status/1675187969420828672 warc/revisit - 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 1349
com,twitter)/elonmusk/status/1675187969420828672 20230702033829 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 GNHMEO26OG5LDCLUVADEZNAVS5LIHBXR 46677
com,twitter)/elonmusk/status/1675187969420828672 20230702072102 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 C2QH4WWYF6H6YUCGSQRZRSYWDOX4GANL 45702
com,twitter)/elonmusk/status/1675187969420828672 20230702101254 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 LPHYSFLYDXP4NUZEGDYO3PGRYSVA3LM5 45707
com,twitter)/elonmusk/status/1675187969420828672 20230702103309 https://twitter.com/elonmusk/status/1675187969420828672 text/html 429 F7K2MYJF6CBXBOPA3SUBCSHV7554U7OH 10684
com,twitter)/elonmusk/status/1675187969420828672 20230702175608 https://twitter.com/elonmusk/status/1675187969420828672 text/html 429 BDDEYBIWPVUCK3TKJQQV66OOK3H5ES6N 9895
com,twitter)/elonmusk/status/1675187969420828672 20230703072239 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 3K7C5USBPB4IQLZODMKLKFQMJKMAEM6R 45583
com,twitter)/elonmusk/status/1675187969420828672 20230703183110 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3108
com,twitter)/elonmusk/status/1675187969420828672 20230703202104 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 ASXJZ3FBEGXAJ5KUCJIXVAZMYZSYIOB7 46743
com,twitter)/elonmusk/status/1675187969420828672 20230703202104 https://twitter.com/elonmusk/status/1675187969420828672 text/html 302 UGK2GNV3LUKVKQ5QKAHBBCDKQTBXSI7Q 993
com,twitter)/elonmusk/status/1675187969420828672 20230704053725 https://twitter.com/elonmusk/status/1675187969420828672 text/html 403 2D3D2EJRQDM6OD3G2VY46MVXY6YR2X34 954
com,twitter)/elonmusk/status/1675187969420828672 20230705030327 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 PJSS5VSAZXT6TKYJXHJ4CN6HJXGNJSO6 14756
com,twitter)/elonmusk/status/1675187969420828672 20230705082346 http://twitter.com/elonmusk/status/1675187969420828672 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 604
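Each record in the CDX output has seven space-separated fields. The following is a minimal Python sketch for reading such records, assuming the field order seen in the output above (URL key, timestamp, original URL, MIME type, status code, digest, length). The names `CdxRecord` and `parse_cdx` are illustrative and not part of any Wayback Machine library, and the sample holds only four of the twenty records.

```python
from collections import Counter
from typing import NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str
    timestamp: str  # 14-digit Memento-Datetime, YYYYMMDDhhmmss
    original: str   # the URI-R as it was requested
    mimetype: str
    status: str     # kept as text: revisit records carry "-" instead of a code
    digest: str
    length: int


def parse_cdx(text: str) -> list[CdxRecord]:
    """Parse space-separated CDX lines, skipping blank or malformed ones."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        records.append(CdxRecord(*fields[:6], int(fields[6])))
    return records


# Four records copied from the output above.
SAMPLE = """\
com,twitter)/elonmusk/status/1675187969420828672 20230701170320 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3112
com,twitter)/elonmusk/status/1675187969420828672 20230701170812 https://twitter.com/elonmusk/status/1675187969420828672 text/plain 302 OHPCCN5YK6NNMWGU4KWSLZHO3GKXRARC 428
com,twitter)/elonmusk/status/1675187969420828672 20230701183709 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 D7KQDUUH7LPO4RLK3PSMNZXOC4BFBXCA 46643
com,twitter)/elonmusk/status/1675187969420828672 20230702025114 https://twitter.com/elonmusk/status/1675187969420828672 warc/revisit - 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 1349
"""

records = parse_cdx(SAMPLE)
status_counts = Counter(r.status for r in records)
first_200 = next((r for r in records if r.status == "200"), None)

print(records[0].timestamp, records[0].status)  # the earliest capture
print(first_200.timestamp)                      # the earliest archived 200
print(dict(status_counts))
```

Keeping the status code as a string is a deliberate choice here, since the `warc/revisit` record has `-` in that column rather than a number.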

Fake mementos can be created for standard URI-Rs, but not for tweet URI-Rs

In most cases, a memento of a URI-R means that the URI-R existed at that time, in part because an HTTP response code of 200 is assumed. If someone could predict a URI-R that would be used in the future, then this could be used to suggest the existence of a resource at a time earlier than it actually existed. It is possible that a standard URI-R could be predicted in advance for cases like personal websites or news websites, for example:

https://www.cnn.com/2024/01/07/health/exercise-smoking-alcohol-resolutions-fitness-health-wellness/index.html

The above URL contains no unique identifier that would make the URI-R path hard to construct. Instead, it uses a combination of "datetime/subject/slug" and thus could be predicted in advance. For this reason, it is also possible to create fake links for standard URI-Rs. For example, a "fake" URL for the (fake) upcoming story on CNN about the ongoing pizza preference controversy in our research group would look like:

https://www.cnn.com/2024/04/01/politics/michael-nelson-admits-like-pineapple-pizza/index.html
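To illustrate how little is needed to guess such a URI-R, here is a small Python sketch that assembles a URL from a date, a section, and a slug. The function name `guess_article_url` is hypothetical, and the pattern is only inferred from the two example URLs above; it is not a documented CNN scheme.

```python
from datetime import date


def guess_article_url(published: date, section: str, slug: str) -> str:
    """Assemble a URL following the date/section/slug pattern of the examples."""
    return (
        f"https://www.cnn.com/{published:%Y/%m/%d}/"
        f"{section}/{slug}/index.html"
    )


fake_url = guess_article_url(
    date(2024, 4, 1),
    "politics",
    "michael-nelson-admits-like-pineapple-pizza",
)
print(fake_url)
```

Every input is something a person could choose in advance, which is what makes this kind of URI-R predictable.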

A TimeMap representation for a URI-R in the web archive might have 404s, but that does not mean the URI-R actually existed. Darryl Mead analyzed a case study in his article “Creating disinformation: Archiving fake links on the Wayback Machine viewed through the lens of routine activity theory” about how fake links were created within the Internet Archive for the anti-porn website yourbrainonporn.com with the intent to defame the website owner. The links never existed, but screenshots of these archived 404 responses were circulated on social media to spread disinformation through the appearance of them having been 200 responses. Michael L. Nelson provided a summary of this article, including a discussion of web archives and disinformation, in the following tweet thread:

"Creating disinformation: Archiving fake links on the Wayback Machine viewed through the lens of routine activity theory"

an interesting article in First Monday by:

Darryl Mead of The Reward Foundation (@brain_love_sex) https://t.co/ELJo3dzgbV

🧵 for #WebArchiveWednesday
— Michael L. Nelson (@phonedude_mln) October 25, 2023

We typically assume that a TimeMap for a URL begins with archived 200 responses, and possibly later shifts to redirects or 404 responses as the URL and its associated content evolve. But the yourbrainonporn.com example challenges our assumptions by having TimeMaps that consist solely of archived 404 responses. Such TimeMaps are possible because anyone can submit a URI-R (even one that does not exist) to be archived through IA's Save Page Now service. This creates a record in the Internet Archive for that URI-R, even if the response is a 404.

However, URL construction for tweets is different, and it is highly unlikely that the URI-R of a particular tweet could be predicted before it exists. The URI-R of each tweet has a Snowflake ID, a unique identifier that can be generated by distributed clients. Since the Snowflake ID is based on the timestamp and machine ID, it is implausible to predict the URL of a tweet before it exists. In the case of Elon Musk’s tweet, we can show using the TweetedAt tool that the tweet ID has a creation date of 2023-07-01T17:01:50Z (Figure 2). This implies that the tweet actually existed at that datetime.
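The creation date can also be recovered with a few lines of Python. This is a minimal sketch and not TweetedAt's implementation. It assumes the commonly documented Snowflake layout, in which the bits above the low 22 hold milliseconds since Twitter's Snowflake epoch. The epoch constant and the function name `tweet_id_to_datetime` do not come from this post, and the approach applies only to IDs generated under the Snowflake scheme.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Twitter's Snowflake epoch in milliseconds (2010-11-04T01:42:54.657Z).
TWITTER_EPOCH_MS = 1288834974657


def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Extract the creation time encoded in a Snowflake tweet ID."""
    # Drop the low 22 bits (machine ID and sequence number); what remains
    # counts milliseconds since the Snowflake epoch.
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    seconds, millis = divmod(ms, 1000)
    return (datetime.fromtimestamp(seconds, tz=timezone.utc)
            + timedelta(milliseconds=millis))


created = tweet_id_to_datetime(1675187969420828672)
print(created.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2023-07-01T17:01:50Z
```

Because the timestamp is assigned when the ID is generated, a valid tweet URI-R cannot plausibly be archived before the time encoded in its ID.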

Figure 2: Extracting the creation date of the tweet from the tweet ID using TweetedAt.

Analyzing some mementos of the tweet crawled by the Internet Archive

Since we know that the first memento of Elon Musk's tweet was likely not faked, why was it archived as a 404 response? Figure 3 shows what the Internet Archive’s crawler saw when it first tried to archive the tweet on 2023-07-01T17:03:20Z. We know the tweet existed then because it was created at 2023-07-01T17:01:50Z. But, as the curl response for the memento (shown below) confirms, the crawler received an HTTP 404 response indicating that the content does not exist, so the first Memento-Datetime has an archived 404. In other words, the Wayback Machine got a 404 just 1 minute and 30 seconds after the tweet was created. One reason that an unauthenticated crawler, like the Wayback Machine's, might receive an HTTP 404 response is that the Twitter account has gone private. However, it is unlikely that Elon Musk's account was ever private. Our previous curl interaction with the CDX API also showed an HTTP 403 response, meaning that the Twitter server was refusing to provide access to the tweet, and HTTP 429 responses, signaling that Twitter's rate limiting was triggered.
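The gap between the tweet's creation and its captures can be checked with simple datetime arithmetic. The sketch below takes the creation time stated in the text and two 14-digit timestamps from the CDX output; the helper name `cdx_timestamp_to_datetime` is illustrative, and the timestamps are assumed to be in UTC.

```python
from datetime import datetime, timezone

# Creation time of the tweet, as reported in the text.
tweet_created = datetime(2023, 7, 1, 17, 1, 50, tzinfo=timezone.utc)


def cdx_timestamp_to_datetime(ts: str) -> datetime:
    """Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss) to an aware datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


first_memento = cdx_timestamp_to_datetime("20230701170320")
july_5_capture = cdx_timestamp_to_datetime("20230705030327")

gap = first_memento - tweet_created
to_july_5 = july_5_capture - tweet_created

print(gap.total_seconds())  # 90.0
print(to_july_5)            # 3 days, 10:01:37
```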

Figure 3: The first memento on July 1, 2023 (17:03:20 GMT) shows an HTTP 404 response.

Curl response of the first memento on July 1, 2023 (17:03:20 GMT)

The second memento on 2023-07-01T17:08:12Z redirects to the same URI-M as the first memento (shown in Figure 3), which results in an HTTP 404 response. From the curl response for this memento, we get an HTTP 302 response indicating there is a redirection. As a result, we have an archived 302 for the second Memento-Datetime. The Wayback Machine redirected the second memento to the capture found at the first memento.
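The two URI-Ms involved in this redirect can be written out by combining the 14-digit Memento-Datetime with the URI-R. This is a small sketch of the Wayback Machine's usual `/web/<timestamp>/<URI-R>` form; the helper name `build_uri_m` is illustrative, and the code only builds strings without contacting the archive.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def build_uri_m(timestamp: str, uri_r: str) -> str:
    """Build a Wayback Machine URI-M from a 14-digit timestamp and a URI-R."""
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("expected a 14-digit timestamp, YYYYMMDDhhmmss")
    return f"{WAYBACK_PREFIX}/{timestamp}/{uri_r}"


uri_r = "https://twitter.com/elonmusk/status/1675187969420828672"
first_uri_m = build_uri_m("20230701170320", uri_r)
second_uri_m = build_uri_m("20230701170812", uri_r)

print(first_uri_m)
print(second_uri_m)
```

According to the description above, requesting the second URI-M leads to the first one, which is the archived 404.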

Curl response of the second memento on July 1, 2023 (17:08:12 GMT)

The third memento on 2023-07-01T18:29:20Z and fourth memento on 2023-07-01T18:37:08Z both redirect to the fifth memento. The fifth memento, on 2023-07-01T18:37:09Z, returns an HTTP 200 response (Figure 4). From the curl response, we get an HTTP 200 response indicating that the request was successful. Even though we have a ‘200 OK’ response, the full rendering of the URI-M does not yield a successfully captured memento. The reason is detailed in the next section.

Figure 4: The fifth memento on 2023-07-01T18:37:09Z shows an HTTP 200 response for the main page, but the subsequent API calls return 429 or 404. The result is a default error page served with an HTTP 200 response, with the tweet content unavailable.

Curl response of the fifth memento on July 1, 2023 (18:37:09 GMT)

Finally, the first memento on 2023-07-05T03:03:27Z shows the expected result (Figure 5), the actual tweet content. This was more than three days after the tweet was created and the URI-R was first archived. From the curl response, we get an HTTP 200 response indicating that the request was successful, and we can see a successfully archived and replayed memento as well.

Figure 5: Content of the first memento on July 5, 2023 (03:03:27 GMT)

Curl response of the first memento on July 5, 2023 (03:03:27 GMT)

Archiving Twitter’s new UI is hard

In July 2019, Twitter's developers moved to a new website architecture intended to provide a more responsive experience for both mobile and desktop users. The new UI depends heavily on JavaScript, which affects archiving because web crawlers often miss JavaScript-dependent representations. John Berlin’s blog post “CNN.com has been unarchivable since November 1st, 2016” from 2017 and Scott G. Ainsworth’s blog post “Web Archive Study Informs Website Design” from 2016 show how JavaScript greatly complicates web archiving and significantly reduces archive quality when conventional tools are used.

Among the mementos discussed above, the fifth memento of July 1, 2023 shows an HTTP 200 response for the main page, but the subsequent API calls are 429 or 404, which results in an incomplete capture of the URI-R. This is because the new Twitter UI is not friendly to most web archives. Twitter’s new UI talks to api.twitter.com (Figure 6), which imposes aggressive rate limiting, and this rate limiting makes the new UI difficult to archive. Kritika Garg and Himarsha Jayanetti discussed in detail how the new Twitter UI is not easily archivable by most web archives in their blog post “Twitter Was Already Difficult To Archive, Now It's Worse!” Their follow-on blog post "New Twitter UI: Replaying Archived Twitter Pages That Never Existed" reveals that the new UI first provides a placeholder HTML template and then builds the page from various API requests and their corresponding JSON responses.

In summary, the resulting web page starts as a skeleton page, which is a default error page, and then issues a number of API calls. If the JSON responses needed to build the page are not archived, because the API calls failed, were never made, or were blocked by the server's rate limiting, the errors in the template are displayed and we are left with the default error page, that is, an incomplete capture. In our case, the subsequent API calls were not made and no JSON responses were parsed to build the page, so we get the default error page. Figure 7 shows that the API request for the tweet content (https://twitter.com/i/api/graphql/...) receives a 302 redirect to an archived 404 response on 2023-07-01T18:37:11Z (the accompanying curl request and response make this clearer).

Figure 6: Twitter’s new UI talks to api.twitter.com

Figure 7: The request for the archived tweet content JSON at 2023-07-01T18:37:09Z receives a 302 redirect to a different memento (2023-07-01T18:37:11Z)

Curl response of the memento on July 1, 2023 (18:37:11 GMT)

Another reason for such varying HTTP status codes in the TimeMap could be that the tweet was deleted from the live web. However, searching for the text of the tweet on the live web using Google search confirms that the tweet still exists on the live web (Figure 8).

Figure 8: https://twitter.com/elonmusk/status/1675187969420828672

This Google search reveals that while JavaScript continues to be challenging for crawlers run by web archives and search engines, Google was able to obtain the text of the tweet with an unauthenticated crawler. So, how does Google crawl tweets? One way is to use a headless browser and execute JavaScript. However, we suspect Google is not doing headless crawling because it is slow. Justin F. Brunelle investigated the Google crawler’s ability to index JavaScript-dependent representations almost 10 years ago in his blog post “Google and JavaScript.” He showed that Google effectively renders such representations at Web scale, yet further work could be done to properly crawl these types of representations.

It is evident that Twitter’s new UI has some adverse effects on archiving services. Garg and Jayanetti showed that using ‘Googlebot’ as a user agent can be a viable workaround. However, since late November 2023, it is no longer possible to obtain the server-side rendered content this way, so the ‘Googlebot’ user agent workaround no longer works. Although we do not receive any server-side rendering when using ‘Googlebot’ as a user agent, we observe that the rendering still occurs for search engines, which means that the Google search engine is still indexing tweets (Figure 9). It is plausible that Twitter allows access from a certain range of IP addresses, enabling server-side UI rendering for them so that the search engine can index tweets. We speculate that Twitter is now deploying some anti-bot measures and that this might have an impact on the archiving process.
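For readers unfamiliar with the user-agent approach, the sketch below shows how a request can present itself as ‘Googlebot’ using Python's standard library. The user-agent string is Google's commonly published Googlebot string and the function name `googlebot_request` is illustrative; neither is taken from Garg and Jayanetti's post. The request is only constructed here, not sent, and per the discussion above it should not be expected to return server-side rendered content anymore.

```python
from urllib.request import Request

# Assumption: the widely published Googlebot user-agent string.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def googlebot_request(url: str) -> Request:
    """Prepare (but do not send) a request that identifies itself as Googlebot."""
    return Request(url, headers={"User-Agent": GOOGLEBOT_UA})


req = googlebot_request(
    "https://twitter.com/elonmusk/status/1675187969420828672"
)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A header like this changes only what the client claims to be. It does nothing about IP-based allowlisting, which is the mechanism speculated about above.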

Figure 9: Elon Musk’s tweet on the live web using Google search.

https://archive.ph/sENKo

Conclusion

When trying to replay archived tweets, there are three main outcomes: 404 (content does not exist), incomplete capture (content not fully rendered), and complete capture (content fully rendered). Complete and incomplete captures both indicate that the URL really existed. But if the archived content resolves to 404s, this does not necessarily indicate that the URL really existed. For standard URLs it is easy to create fake links, which would show 404s when archived. However, for tweet URLs this is not the case. The example we discuss here is a tweet authored by Elon Musk, and we found that:

- The tweet was created at 2023-07-01T17:01:50Z.

- The tweet is still on the live web, and Elon Musk’s account was presumably never private nor restricted.

- The first memento, archived 90 seconds after that time, is a 404, and the next three are archived 302 redirects.

- It is not until 2023-07-05T03:03:27Z that we have a successfully replayable archived copy of that tweet.

- It is possible that Twitter was employing anti-bot measures, and this impacted the Wayback Machine.

This gives us the answer to our question: can we consider the existence of a URI-M in a TimeMap as proof of the existence of that tweet even though the memento is of an HTTP 404 response? While not true for general URLs, in this case we can consider the first memento in the TimeMap as proof that the tweet existed at that time, even though the HTTP status code is 404.

-- Tarannum Zaki (@tarannum_zaki)

Web Science and Digital Libraries Research Group
