2019-10-28: The interaction between search engine caches and web archives
My brother, a lawyer in India, recently sent me two screenshots, shown in Figures 1 and 2, of news articles about a corruption case involving a renowned doctor from India. To pursue legal proceedings against the newspapers for publishing the articles, he needed evidence that the articles had been published, so he asked me to help find the URLs of the articles shown in the screenshots. The articles were published in an English language newspaper, The Asian Age, and a Hindi language newspaper, Punjab Kesari.
Figure 1: Screenshot of the news article from the English language newspaper, The Asian Age, shared with me by my brother
Figure 2: Screenshot of the news article from the Hindi language newspaper, Punjab Kesari, shared with me by my brother
Finding URLs for the news articles in the screenshots
I searched the websites of The Asian Age and Punjab Kesari for the articles and found links to the articles (shown in the Original URL row of Tables 1 and 2) but they both redirected to a 404 page, as shown in Figures 3 and 4. Fortunately, we found search engine (SE) cached copies of both articles in the Google and Bing caches, as shown in Figures 5 and 6.
Plinio Vargas in his post "Link to Web Archives, not Search Engine Caches" talks about the ephemeral nature of the SE cache URLs and highlights the reason for linking to the web archives over the SE cache URLs. Furthermore, Dr. Michael Nelson in his post "Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives" has already shown us the use of SE cache URLs and the web archives to find answers to real world problems.
Figure 3: A 404 page appears on accessing the news article from Punjab Kesari
Figure 4: A 404 page appears on accessing the news article from The Asian Age
cURL response for The Asian Age news article, which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html"
HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:07 GMT
Server: Apache/2.4.7 (Ubuntu)
Location: https://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
Cache-Control: max-age=300
Expires: Fri, 20 Sep 2019 18:40:07 GMT
Connection: close
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:08 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=dsp7g2kkn5sfk2eggaftg3un84; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: /404.html
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html

HTTP/1.1 200 OK
Date: Fri, 20 Sep 2019 18:35:10 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=koaujt0tiaqgjvafa5je1djps5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html
Figure 6: Google cache for http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
cURL response for the Punjab Kesari news article, which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "https://haryana.punjabkesari.in/national/news/police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341"
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Connection: keep-alive
Cache-Control: private
Location: https://haryana.punjabkesari.in/common404.aspx
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 18:45:12 GMT
X-Cache: Miss from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: Ub5SmJxPQWHJQSIg9xEz-GVZLQtNA4KHkXHT2-qp_6ZD8AFKF_fQKQ==

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 76757
Connection: keep-alive
Cache-Control: public, no-cache="Set-Cookie", max-age=15000
Expires: Fri, 20 Sep 2019 17:17:08 GMT
Last-Modified: Fri, 20 Sep 2019 13:07:08 GMT
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 13:07:08 GMT
Vary: Accept-Encoding,Cookie
X-Cache: Hit from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: 5PzkcGPXziNxfNLDffTV3-V6Ks2w3FQiEUWnHMzfZm_aDKfyBKjw7A==
Age: 20281
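Both cURL traces end in a 200 OK, yet the articles are gone: the servers redirect to an error page instead of answering with a 404 status. The following Python sketch shows one way such a redirect chain could be classified. The function name `classify_chain` and the "404 in the final file name" heuristic are my own assumptions for illustration and are not a general soft-404 detector. The hops are transcribed by hand from the cURL output above, so the sketch makes no network requests.

```python
from urllib.parse import urljoin, urlsplit


def classify_chain(start_url, hops):
    """Classify a redirect chain as printed by `curl -IL`.

    hops is a list of (status_code, location_header_or_None) tuples in the
    order the responses were received. Returns (label, final_url).
    """
    current = start_url
    redirected = False
    for status, location in hops:
        if 300 <= status < 400 and location:
            # Resolve relative Location values such as "/404.html".
            current = urljoin(current, location)
            redirected = True
            continue
        if status in (404, 410):
            return "hard-404", current
        if status == 200:
            # Heuristic (an assumption): a redirect that lands on a file
            # whose name contains "404" is treated as a soft 404.
            filename = urlsplit(current).path.rsplit("/", 1)[-1].lower()
            if redirected and "404" in filename:
                return "soft-404", current
            return "ok", current
        return "other", current
    return "incomplete", current


# Hops transcribed from the Punjab Kesari trace above.
punjab_kesari = classify_chain(
    "https://haryana.punjabkesari.in/national/news/"
    "police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341",
    [(301, "https://haryana.punjabkesari.in/common404.aspx"), (200, None)],
)
print(punjab_kesari)
```

A chain that ends in a real 404 or 410 status is reported as "hard-404", which keeps the two failure modes apart when deciding whether a URL is worth looking up in a search engine cache.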
Push the cached URLs to multiple web archives
We pushed the Bing and Google cache URLs (URI-R-SEs) for both news articles to the Internet Archive, perma.cc, and archive.is. The URI-Ms for the URI-R-SEs are shown in Tables 1 and 2. We can use ArchiveNow to automate pushing of URLs to multiple web archives. We also captured the WARC files of the URI-R-SEs for the articles using Webrecorder and stored the WARCs locally.
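As a rough illustration of that automation, here is a minimal Python sketch built around ArchiveNow's Python interface. It assumes ArchiveNow is installed (`pip install archivenow`) and that `archivenow.push(url, archive_id)` accepts the identifiers "ia" (Internet Archive) and "is" (archive.is), as described in the ArchiveNow README. The wrapper `push_to_archives` and its `push` parameter are hypothetical names of mine. perma.cc is left out because it requires an API key.

```python
def push_to_archives(url, archive_ids=("ia", "is"), push=None):
    """Push one URL to several web archives and collect the results.

    push is any callable taking (url, archive_id). When it is omitted,
    ArchiveNow's push function is used (assumption: archivenow is installed).
    """
    if push is None:
        # Imported lazily so the function can be exercised without ArchiveNow.
        from archivenow import archivenow
        push = archivenow.push

    results = {}
    for archive_id in archive_ids:
        try:
            results[archive_id] = push(url, archive_id)
        except Exception as exc:
            # One failing archive should not stop the remaining pushes.
            results[archive_id] = [f"error: {exc}"]
    return results
```

Calling `push_to_archives(cache_url)` with a search engine cache URL would then attempt one capture per archive and return whatever each archive reports, typically the URI-M of the new memento or an error message.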
Accessing the Cache URLs in the Web Archives
Web archives index mementos by their URI-R. An SE cache URI-M can only be accessed by users who know the URI-R-SE, which is mostly opaque as a result of various parameters and encodings. As shown in Figure 7, the URI-R-SE for the same web resource may vary by geographic location, which means that the same web resource may be indexed under different URI-R-SEs in the web archives.
In the US, the Bing cache URL for The Asian Age news article is
http://cc.bingj.com/cache.aspx?
In India, the Bing cache URL for The Asian Age news article is
http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4857393311190329&mkt=en-IN&setlang=en-US&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y
Figure 7: The Bing cache URL for the US (left) returns 200 and the one for India (right) returns 404
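The Bing cache URL for India can be taken apart with Python's standard library to see which parts vary. This is only a sketch: the `q` parameter visibly carries the percent-encoded article URL, but reading `mkt` and `setlang` as market and language settings is my assumption from their values, and the function name `describe_bing_cache_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# The Bing cache URL observed from India, copied from the text above.
INDIA_CACHE_URL = (
    "http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com%2fmetros"
    "%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html"
    "&d=4857393311190329&mkt=en-IN&setlang=en-US&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y"
)


def describe_bing_cache_url(cache_url):
    """Return the embedded URI-R and the other query parameters of a cache URL."""
    params = parse_qs(urlsplit(cache_url).query)

    def first(key):
        # parse_qs returns a list per key; take the first value or "".
        return params.get(key, [""])[0]

    return {
        # The leading "+" in q decodes to a space, so strip it.
        "uri_r": first("q").strip(),
        "mkt": first("mkt"),          # assumed to be the market (en-IN)
        "setlang": first("setlang"),  # assumed to be the interface language
        "d": first("d"),              # opaque identifier
        "w": first("w"),              # opaque token
    }


info = describe_bing_cache_url(INDIA_CACHE_URL)
print(info["uri_r"])
print(info["mkt"])
```

Because an archive indexes the whole cache URL as the URI-R, two cache URLs that differ in any of these parameters end up under different index keys, even though the `q` value points at the same article.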
Pushing the URI-R-SE to multiple web archives not only makes it accessible from multiple web archives, but also lets us use some web archives to find mementos in others. As shown in Figure 8, archive.is extracts the URI-R of the article from the URI-R-SE and indexes the URI-Ms for the URI-R-SE under both the URI-R and the URI-R-SE. As shown in Figure 9, we looked up the article's URI-R in archive.is to obtain the URI-R-SE, which is what the other web archives treat as the URI-R, and then used it to retrieve a memento of the SE cache from the Internet Archive.
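Once the URI-R-SE is known, looking it up in the Internet Archive comes down to appending it to a Wayback Machine lookup URL. The sketch below only builds those lookup URLs and does not fetch them. The two URL patterns are the Wayback Machine's TimeMap and calendar forms as I understand them, and the function name `wayback_lookup_urls` is hypothetical. Whether a cache URL with a long query string needs extra percent-encoding is not tested here.

```python
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_lookup_urls(uri_r):
    """Build Wayback Machine lookup URLs for a given URI-R (or URI-R-SE).

    Assumption: the archive accepts the original URL appended verbatim.
    """
    return {
        # Link-format TimeMap listing every memento of the URI-R.
        "timemap": f"{WAYBACK_PREFIX}timemap/link/{uri_r}",
        # Calendar view of the captures for the URI-R.
        "calendar": f"{WAYBACK_PREFIX}*/{uri_r}",
    }


article = (
    "http://www.asianage.com/metros/delhi/260819/"
    "rs-100-crore-duping-claim-against-top-doctor.html"
)
for name, lookup_url in wayback_lookup_urls(article).items():
    print(name, lookup_url)
```

Passing the article's own URI-R shows whether the live page was ever captured, while passing the URI-R-SE obtained from archive.is targets the mementos of the search engine cache.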
Figure 9: Using the Bing cache URL from archive.is to retrieve mementos of the search engine cache from the Internet Archive
Figure 10: Memento of an SE cache which did not capture the intended content
Figure 11: Google indexed a document from the Internet Archive which lists the memento from perma.cc for The Asian Age news article
Sometimes SE caches hold pages that are missing (404) from the live web but have not yet been archived. In such cases we should push the SE cache URL (URI-R-SE) to multiple web archives, a process that ArchiveNow can automate by saving a URL to multiple web archives simultaneously. We can also use web archives like archive.is to get the URI-R-SE from the URI-R of the resource, and then use that URI-R-SE to search the other web archives for mementos of the SE cache.
My studies in web archiving helped me solve a real-world problem posed by my brother, who needed the URLs of news articles for which he could only provide screenshots. I found the articles in SE caches and pushed the cached copies to multiple web archives, and he will use the resulting mementos in his legal proceedings.
Mohammed Nauman Siddique
(@m_nsiddique)