2024-03-27: If the First Archived Copy by the Wayback Machine is 404, When did the URL First Exist?
On July 1, 2023, Elon Musk posted a tweet that was archived multiple times by the Wayback Machine. Figure 1 shows a representation of the TimeMap for the URI-R (URI of the original resource) of the tweet. Each timestamp shown indicates the time, or Memento-Datetime, when the URI-R was archived. In this interface, the color of the timestamp indicates the type of HTTP response that was received at the time of archiving. A blue timestamp indicates an archived HTTP 200 response (the URI-R returned the expected result), a green timestamp indicates an archived HTTP 302 response (the URI-R redirected), and an orange timestamp indicates an archived HTTP 404 response (the URI-R was not found). Each timestamp links to a URI-M, which is the URI of the archived webpage, or memento. Typically, we can take the first appearance of a URI-M in a TimeMap as evidence that the URI-R existed at that Memento-Datetime (the time of archiving). But what if that first memento is of an HTTP 404 response, as shown in Figure 1?
We can obtain the machine-readable version of what we see in Figure 1 using the Wayback Machine’s CDX API. This is shown below:
% curl -s "http://web.archive.org/cdx/timemap/cdx?url=https://twitter.com/elonmusk/status/1675187969420828672" | head -20
com,twitter)/elonmusk/status/1675187969420828672 20230701170320 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3112
com,twitter)/elonmusk/status/1675187969420828672 20230701170812 https://twitter.com/elonmusk/status/1675187969420828672 text/plain 302 OHPCCN5YK6NNMWGU4KWSLZHO3GKXRARC 428
com,twitter)/elonmusk/status/1675187969420828672 20230701182920 http://twitter.com/elonmusk/status/1675187969420828672 text/plain 302 OHPCCN5YK6NNMWGU4KWSLZHO3GKXRARC 419
com,twitter)/elonmusk/status/1675187969420828672 20230701183708 https://twitter.com/elonmusk/status/1675187969420828672 text/html 302 UGK2GNV3LUKVKQ5QKAHBBCDKQTBXSI7Q 995
com,twitter)/elonmusk/status/1675187969420828672 20230701183709 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 D7KQDUUH7LPO4RLK3PSMNZXOC4BFBXCA 46643
com,twitter)/elonmusk/status/1675187969420828672 20230701194954 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 2985
com,twitter)/elonmusk/status/1675187969420828672 20230702021719 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3113
com,twitter)/elonmusk/status/1675187969420828672 20230702025114 https://twitter.com/elonmusk/status/1675187969420828672 warc/revisit - 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 1349
com,twitter)/elonmusk/status/1675187969420828672 20230702033829 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 GNHMEO26OG5LDCLUVADEZNAVS5LIHBXR 46677
com,twitter)/elonmusk/status/1675187969420828672 20230702072102 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 C2QH4WWYF6H6YUCGSQRZRSYWDOX4GANL 45702
com,twitter)/elonmusk/status/1675187969420828672 20230702101254 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 LPHYSFLYDXP4NUZEGDYO3PGRYSVA3LM5 45707
com,twitter)/elonmusk/status/1675187969420828672 20230702103309 https://twitter.com/elonmusk/status/1675187969420828672 text/html 429 F7K2MYJF6CBXBOPA3SUBCSHV7554U7OH 10684
com,twitter)/elonmusk/status/1675187969420828672 20230702175608 https://twitter.com/elonmusk/status/1675187969420828672 text/html 429 BDDEYBIWPVUCK3TKJQQV66OOK3H5ES6N 9895
com,twitter)/elonmusk/status/1675187969420828672 20230703072239 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 3K7C5USBPB4IQLZODMKLKFQMJKMAEM6R 45583
com,twitter)/elonmusk/status/1675187969420828672 20230703183110 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3108
com,twitter)/elonmusk/status/1675187969420828672 20230703202104 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 ASXJZ3FBEGXAJ5KUCJIXVAZMYZSYIOB7 46743
com,twitter)/elonmusk/status/1675187969420828672 20230703202104 https://twitter.com/elonmusk/status/1675187969420828672 text/html 302 UGK2GNV3LUKVKQ5QKAHBBCDKQTBXSI7Q 993
com,twitter)/elonmusk/status/1675187969420828672 20230704053725 https://twitter.com/elonmusk/status/1675187969420828672 text/html 403 2D3D2EJRQDM6OD3G2VY46MVXY6YR2X34 954
com,twitter)/elonmusk/status/1675187969420828672 20230705030327 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 PJSS5VSAZXT6TKYJXHJ4CN6HJXGNJSO6 14756
com,twitter)/elonmusk/status/1675187969420828672 20230705082346 http://twitter.com/elonmusk/status/1675187969420828672 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 604
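To work with this listing programmatically, each CDX line can be split into its seven space-separated fields (URL key, 14-digit timestamp, original URL, MIME type, status code, digest, and length). The following Python sketch parses four of the lines above that were pasted in as a sample; the names CdxRecord and parse_cdx are our own illustrative choices, not part of any Wayback Machine library.

```python
from collections import Counter
from typing import List, NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str
    timestamp: str  # 14-digit YYYYMMDDhhmmss, in UTC
    original: str
    mimetype: str
    status: str     # "-" for warc/revisit records
    digest: str
    length: str


def parse_cdx(text: str) -> List[CdxRecord]:
    """Split each non-empty CDX line into its seven space-separated fields."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7:  # skip blank or malformed lines
            records.append(CdxRecord(*fields))
    return records


# Four lines copied from the CDX listing in this post.
SAMPLE = """\
com,twitter)/elonmusk/status/1675187969420828672 20230701170320 https://twitter.com/elonmusk/status/1675187969420828672 text/html 404 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 3112
com,twitter)/elonmusk/status/1675187969420828672 20230701170812 https://twitter.com/elonmusk/status/1675187969420828672 text/plain 302 OHPCCN5YK6NNMWGU4KWSLZHO3GKXRARC 428
com,twitter)/elonmusk/status/1675187969420828672 20230701183709 https://twitter.com/elonmusk/status/1675187969420828672 text/html 200 D7KQDUUH7LPO4RLK3PSMNZXOC4BFBXCA 46643
com,twitter)/elonmusk/status/1675187969420828672 20230702025114 https://twitter.com/elonmusk/status/1675187969420828672 warc/revisit - 7WTMU2LRRXABBBVHB4MWPFINUOKZZV2G 1349
"""

records = parse_cdx(SAMPLE)
status_counts = Counter(r.status for r in records)
first = records[0]
first_200 = next((r for r in records if r.status == "200"), None)

print("first capture:", first.timestamp, first.status)
print("first archived 200:", first_200.timestamp if first_200 else None)
print("status counts:", dict(status_counts))
```

The same function could be applied to the full curl output saved to a file. Note that an archived 200 only describes the response to the main page, which, as discussed later in this post, does not guarantee a complete capture.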
Fake mementos can be created for standard URI-Rs, but not for tweet URI-Rs
In most cases, a memento of a URI-R means that the URI-R existed at that time, in part because an HTTP response code of 200 is assumed. If someone could predict a URI-R that would be used in the future, then this could be used to suggest the existence of a resource at a time earlier than it actually existed. It is possible that a standard URI-R could be predicted in advance for cases like personal websites or news websites, for example:
The above URL does not use any unique identifier that would create any complication when constructing the URI-R path. Instead, it uses a combination of "datetime/subject/slug" and thus, could be predicted in advance. For this reason, it is also possible that fake links can be created for standard URI-Rs. For example, a "fake" URL for the (fake) upcoming story on CNN about the ongoing pizza preference controversy in our research group would look like:
https://www.cnn.com/2024/04/01/politics/michael-nelson-admits-like-pineapple-pizza/index.html
A TimeMap representation for a URI-R in the web archive might have 404s, but that does not mean the URI-R actually existed. Darryl Mead analyzed a case study in his article “Creating disinformation: Archiving fake links on the Wayback Machine viewed through the lens of routine activity theory” about how fake links were created within the Internet Archive for the anti-porn website yourbrainonporn.com with the intent to defame the website owner. The links never existed, but screenshots of these archived 404 responses were circulated on social media to spread disinformation through the appearance of them having been 200 responses. Michael L. Nelson provided a summary of this article, including a discussion of web archives and disinformation, in the following tweet thread:
"Creating disinformation: Archiving fake links on the Wayback Machine viewed through the lens of routine activity theory"
— Michael L. Nelson (@phonedude_mln) October 25, 2023
an interesting article in First Monday by:
Darryl Mead of The Reward Foundation (@brain_love_sex) https://t.co/ELJo3dzgbV
🧵 for #WebArchiveWednesday
We typically assume that a TimeMap for a URL begins with archived 200 responses, and possibly later shifts to redirects or 404 responses as the URL and its associated content evolve. But the yourbrainonporn.com example challenges this assumption with TimeMaps that consist solely of archived 404 responses. This is possible because anyone can submit a URI-R (even one that does not exist) to be archived through IA's Save Page Now service, which then creates a record in the Internet Archive for that URI-R, even if the response is a 404.
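As a rough illustration of this distinction, the Python sketch below flags a TimeMap whose archived responses are all 404s. The function name and the fabricated_link list are hypothetical; the tweet list holds the first five status codes from the CDX listing earlier in this post.

```python
def consists_solely_of_404s(statuses):
    """True when the TimeMap is non-empty and every archived response is a 404."""
    return len(statuses) > 0 and all(s == "404" for s in statuses)


# Hypothetical TimeMap of a link that never existed but was submitted repeatedly.
fabricated_link = ["404", "404", "404"]

# First five status codes from the CDX listing for the tweet in this post.
tweet = ["404", "302", "302", "302", "200"]

print(consists_solely_of_404s(fabricated_link))
print(consists_solely_of_404s(tweet))
```

A TimeMap that passes this check offers no evidence by itself that the URI-R ever existed, while one that fails it still needs the closer inspection described in the following sections.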
However, URL construction for tweets is different, and it is highly unlikely that the URI-R of a particular tweet could be predicted before it exists. The URI-R of each tweet has a Snowflake ID, a unique identifier that can be generated by distributed clients. Since the Snowflake ID is based on the timestamp and machine ID, it is implausible to predict the URL of a tweet before it exists. In the case of Elon Musk’s tweet, we can show using the TweetedAt tool that the tweet ID has a creation date of 2023-07-01T17:01:50Z (Figure 2). This implies that the tweet actually existed at that datetime.
Figure 2: Extracting the creation date of the tweet from the tweet ID using TweetedAt.
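The creation date can also be recovered directly from the tweet ID. The Python sketch below is not the TweetedAt implementation; it assumes the commonly documented Snowflake layout, in which the bits above the lowest 22 hold the number of milliseconds since Twitter's custom epoch (1288834974657 ms after the Unix epoch).

```python
from datetime import datetime, timedelta, timezone

# Twitter's Snowflake epoch in milliseconds since the Unix epoch
# (assumed from the commonly documented Snowflake layout).
TWITTER_EPOCH_MS = 1288834974657

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Drop the low 22 bits (worker and sequence bits) and add the epoch offset."""
    ms_since_unix_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=ms_since_unix_epoch)


created = snowflake_to_datetime(1675187969420828672)
print(created.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2023-07-01T17:01:50Z
```

The result agrees with the creation date reported by TweetedAt in Figure 2.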
Analyzing some mementos of the tweet crawled by the Internet Archive
Since we know that the first memento of Elon Musk's tweet was likely not faked, why was it archived as a 404 response? Figure 3 shows what the Internet Archive’s crawler saw when trying to archive Elon Musk's tweet for the first time on 2023-07-01T17:03:20Z. We know the tweet existed then because it was created at 2023-07-01T17:01:50Z. But, as we see in the curl response for the memento, shown below, we get an HTTP 404 response indicating that the content does not exist. As a result, we have an archived 404 for the first Memento-Datetime, which indicates that the Wayback Machine received a 404 at this datetime, 1 minute and 30 seconds after the tweet was created. One reason that an unauthenticated crawler, like the Wayback Machine's, might receive an HTTP 404 response is that the Twitter account has gone private. However, it is unlikely that Elon Musk's account was ever private. Our previous curl interaction with the CDX API also showed an HTTP 403 response, meaning that the Twitter server was refusing to provide access to the tweet, and HTTP 429 responses, signaling that Twitter's rate limiting was triggered.
Figure 3: The first memento on July 1, 2023 (17:03:20 GMT) shows an HTTP 404 response.
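The interval between the tweet's creation and a given memento can be checked with a few lines of Python. This is a minimal sketch: parse_wayback_timestamp is our own helper name, and the creation time is the value reported above rather than something the code looks up.

```python
from datetime import datetime, timezone


def parse_wayback_timestamp(ts: str) -> datetime:
    """Convert a 14-digit Wayback Machine timestamp (UTC) to an aware datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


# Creation time of the tweet as reported in this post (2023-07-01T17:01:50Z).
tweet_created = datetime(2023, 7, 1, 17, 1, 50, tzinfo=timezone.utc)

# Timestamps taken from the CDX listing in this post.
first_memento = parse_wayback_timestamp("20230701170320")
first_replayable = parse_wayback_timestamp("20230705030327")

gap_to_first = first_memento - tweet_created
gap_to_replayable = first_replayable - tweet_created

print(gap_to_first.total_seconds())  # 90.0
print(gap_to_replayable)             # 3 days, 10:01:37
```

A non-negative gap is what makes the archived 404 consistent with the tweet already existing when the crawler visited.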
Curl response of the first memento on July 1, 2023 (17:03:20 GMT)
The second memento on 2023-07-01T17:08:12Z redirects to the same URI-M as the first memento (shown in Figure 3), which results in an HTTP 404 response. From the curl response for this memento, we get an HTTP 302 response indicating there is a redirection. As a result, we have an archived 302 for the second Memento-Datetime. The Wayback Machine redirected the second memento to the capture found at the first memento.
Curl response of the second memento on July 1, 2023 (17:08:12 GMT)
The third memento on 2023-07-01T18:29:20Z and fourth memento on 2023-07-01T18:37:08Z both redirect to the fifth memento. The fifth memento, on 2023-07-01T18:37:09Z, returns an HTTP 200 response (Figure 4). From the curl response, we get an HTTP 200 response indicating that the request was successful. Even though we have a ‘200 OK’ response, the full rendering of the URI-M does not yield a successfully captured memento. The reason for this is detailed in the next section.
Figure 4: The fifth memento on 2023-07-01T18:37:09Z shows an HTTP 200 response for the main page, but the subsequent API calls are 429 or 404. The result is a default error page served with an HTTP 200 response, with the tweet content unavailable.
Curl response of the fifth memento on July 1, 2023 (18:37:09 GMT)
Finally, the first memento on 2023-07-05T03:03:27Z shows the expected result (Figure 5): the actual tweet content. This was about three and a half days after the tweet was created and the URI-R was first archived. From the curl response, we get an HTTP 200 response indicating that the request was successful, and we can see a successfully archived and replayed memento as well.
Figure 5: Content of the first memento on July 5, 2023 (03:03:27 GMT)
Curl response of the first memento on July 5, 2023 (03:03:27 GMT)
Archiving Twitter’s new UI is hard
In July 2019, the developers of Twitter decided to focus on a new website architecture to serve a more responsive experience for both mobile and desktop users. JavaScript has an impact on archiving web pages as web crawlers often miss JavaScript-dependent representations. John Berlin’s blog post “CNN.com has been unarchivable since November 1st, 2016” from 2017 and Scott G. Ainsworth’s blog post “Web Archive Study Informs Website Design” from 2016 show how JavaScript greatly complicates web archiving and reduces archive quality significantly when conventional tools are used.
Among the mementos discussed above, the fifth memento of July 1, 2023 shows an HTTP 200 response for the main page, but the subsequent API calls are 429 or 404, which results in an incomplete capture of the URI-R. This happens because the new Twitter UI is not friendly to most web archives: it talks to api.twitter.com (Figure 6), which imposes aggressive rate limiting, and this rate limiting makes the new UI difficult to archive. Kritika Garg and Himarsha Jayanetti discussed in detail how the new Twitter UI is not easily archivable by most web archives in their blog post “Twitter Was Already Difficult To Archive, Now It's Worse!” Their follow-on blog post "New Twitter UI: Replaying Archived Twitter Pages That Never Existed" reveals that the new UI first provides a placeholder HTML template and then builds the page from various API requests and the corresponding JSON responses. If the JSON responses needed to build the page are not archived, the errors in the template are displayed and the memento is an incomplete capture.
In summary, the page starts as a skeleton, which is a default error page, and then issues a number of API calls. If any of these API calls fail, are not made, or are blocked by the server's rate limiting, we are left with the default error page. In our case, the subsequent API calls were not made and no JSON responses were parsed to build the page, so we get the default error page. Figure 7 shows that the API request for the tweet content (https://twitter.com/i/api/graphql/...) receives a 302 redirect to an archived 404 response on 2023-07-01T18:37:11Z (the accompanying curl request and response make this clearer).
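The relationship between the main page's status and the API responses can be summarized as a small decision function. The Python sketch below is a toy model of the reasoning in this section, not a description of how the Wayback Machine replays pages; classify_capture and its labels are hypothetical names chosen for illustration.

```python
from typing import List


def classify_capture(page_status: int, api_statuses: List[int]) -> str:
    """Label a replayed tweet memento from the main page status and the
    statuses of the API calls the page makes to fetch the tweet JSON."""
    if page_status == 404:
        return "404"
    if page_status == 200:
        # The skeleton page only fills in when every API call succeeds.
        if api_statuses and all(s == 200 for s in api_statuses):
            return "complete"
        # No API calls, or at least one failed: the default error page remains.
        return "incomplete"
    return "other"


# First memento: the main page itself was archived as a 404.
print(classify_capture(404, []))          # 404
# Fifth memento: 200 main page, but the tweet JSON request resolved to a 404.
print(classify_capture(200, [404]))       # incomplete
# Hypothetical capture in which every API response was archived successfully.
print(classify_capture(200, [200, 200]))  # complete
```

Treating an empty list of API statuses as incomplete reflects the case described above, where the calls were never made and the skeleton error page is all that remains.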
Figure 6: Twitter’s new UI talks to api.twitter.com
Figure 7: The request for the archived tweet content JSON at 2023-07-01T18:37:09Z receives a 302 redirect to a different memento (2023-07-01T18:37:11Z)
Curl response of the memento on July 1, 2023 (18:37:11 GMT)
Figure 9: Elon Musk’s tweet on the live web using Google search.
Conclusion
When trying to replay archived tweets, there are three main outcomes: 404 (content does not exist), incomplete capture (content not fully rendered), and complete capture (content fully rendered). Complete and incomplete captures both indicate that the URL really existed. But if the archived content resolves to a 404, that alone does not necessarily indicate that the URL ever existed. It is easy to create fake links for standard URLs, which show 404s when archived. However, this is not the case for tweet URLs. The example we discuss here is a tweet authored by Elon Musk, for which we found that:
- The tweet was created at 2023-07-01T17:01:50Z.
- The tweet is still on the live web, and Elon Musk’s account was presumably never private nor restricted.
- The first memento archived after that time (2023-07-01T17:03:20Z) is a 404, and the second memento redirects to it.
- It is not until 2023-07-05T03:03:27Z that we have a successfully replayable archived copy of that tweet.
- It is possible that Twitter was employing anti-bot measures, and this impacted the Wayback Machine.
This gives us the answer to our question: can we consider the existence of a URI-M in a TimeMap as proof of the existence of that tweet even though the memento is of an HTTP 404 response? While not true for general URLs, in this case we can consider the first memento in the TimeMap as proof that the tweet existed at that time, even though the HTTP status code is 404.
-- Tarannum Zaki (@tarannum_zaki)