Monday, October 28, 2019

2019-10-28: The interaction between search engine caches and web archives

News articles from Indian newspapers about a corruption case involving an Indian doctor. The left images show screenshots of the articles from the print newspapers. The right images show the article URLs returning 404 pages.

My brother, a lawyer in India, recently sent me two screenshots, shown in Figures 1 and 2, of news articles about a corruption case involving a renowned doctor from India. To begin legal proceedings against the newspapers for publishing the articles, he needed evidence that the articles had been published, so he asked me to help find the URLs of the articles shown in the screenshots. The articles were published in an English-language newspaper, The Asian Age, and a Hindi-language newspaper, Punjab Kesari.

Figure 1: Screenshot of the news article from the English-language newspaper The Asian Age, shared with me by my brother

Figure 2: Screenshot of the news article from the Hindi-language newspaper Punjab Kesari, shared with me by my brother

Finding URLs for the screenshot of the news articles


I searched the websites of The Asian Age and Punjab Kesari for the articles and found links to the articles (shown in the Original URL row of Tables 1 and 2) but they both redirected to a 404 page, as shown in Figures 3 and 4. Fortunately, we found search engine (SE) cached copies of both articles in the Google and Bing caches, as shown in Figures 5 and 6.

Plinio Vargas in his post "Link to Web Archives, not Search Engine Caches" talks about the ephemeral nature of the SE cache URLs and highlights the reason for linking to the web archives over the SE cache URLs. Furthermore, Dr. Michael Nelson in his post "Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives" has already shown us the use of SE cache URLs and the web archives to find answers to real world problems.

Figure 3: A 404 page appears on accessing the news article from Punjab Kesari
Figure 4: A 404 page appears on accessing the news article from The Asian Age 



cURL response for The Asian Age news article, which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html"
HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:07 GMT
Server: Apache/2.4.7 (Ubuntu)
Location: https://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
Cache-Control: max-age=300
Expires: Fri, 20 Sep 2019 18:40:07 GMT
Connection: close
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:08 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=dsp7g2kkn5sfk2eggaftg3un84; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: /404.html
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html

HTTP/1.1 200 OK
Date: Fri, 20 Sep 2019 18:35:10 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=koaujt0tiaqgjvafa5je1djps5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html

Figure 5: Bing Cache http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html  

Figure 6: Google Cache http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html


cURL response for the Punjab Kesari news article which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "https://haryana.punjabkesari.in/national/news/police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341"
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Connection: keep-alive
Cache-Control: private
Location: https://haryana.punjabkesari.in/common404.aspx
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 18:45:12 GMT
X-Cache: Miss from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: Ub5SmJxPQWHJQSIg9xEz-GVZLQtNA4KHkXHT2-qp_6ZD8AFKF_fQKQ==

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 76757
Connection: keep-alive
Cache-Control: public, no-cache="Set-Cookie", max-age=15000
Expires: Fri, 20 Sep 2019 17:17:08 GMT
Last-Modified: Fri, 20 Sep 2019 13:07:08 GMT
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 13:07:08 GMT
Vary: Accept-Encoding,Cookie
X-Cache: Hit from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: 5PzkcGPXziNxfNLDffTV3-V6Ks2w3FQiEUWnHMzfZm_aDKfyBKjw7A==
Age: 20281
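
The two header dumps above can also be read mechanically. The following is a minimal Python sketch, not part of the original investigation, that splits `curl -IL` output into one (status, Location) pair per hop and flags the pattern seen in both responses: a redirect to a path containing "404" that ends in `200 OK`, which is commonly called a soft 404. The names `redirect_chain` and `looks_like_soft_404` are hypothetical, the sample is abbreviated from The Asian Age response, and the "404 in the Location header" check is a heuristic assumption rather than a general detector.

```python
import re
from typing import List, Optional, Tuple

# Abbreviated from the curl -IL output for The Asian Age article above.
SAMPLE = """\
HTTP/1.1 301 Moved Permanently
Location: https://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html

HTTP/1.1 301 Moved Permanently
location: /404.html

HTTP/1.1 200 OK
Content-Type: text/html
"""

Hop = Tuple[int, Optional[str]]


def redirect_chain(header_dump: str) -> List[Hop]:
    """Return one (status_code, Location-or-None) pair per response in the dump."""
    hops: List[Hop] = []
    text = header_dump.replace("\r\n", "\n").strip()
    # curl separates the headers of successive responses with a blank line.
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines or not lines[0].startswith("HTTP/"):
            continue  # skip shell prompts or other non-header text
        status = int(lines[0].split()[1])
        location = None
        for ln in lines[1:]:
            name, _, value = ln.partition(":")
            if name.strip().lower() == "location":  # header names are case-insensitive
                location = value.strip()
        hops.append((status, location))
    return hops


def looks_like_soft_404(hops: List[Hop]) -> bool:
    """Heuristic (assumption): final 200 reached via a redirect whose target mentions 404."""
    if not hops or hops[-1][0] != 200:
        return False
    return any(bool(loc) and "404" in loc for _, loc in hops[:-1])


if __name__ == "__main__":
    chain = redirect_chain(SAMPLE)
    print(chain)
    print("soft 404?", looks_like_soft_404(chain))
```

The same function applies to the Punjab Kesari dump, where the single redirect points at `common404.aspx` before the final `200 OK`.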


Push the cached URLs to multiple web archives


We pushed the Bing and Google cache URLs (URI-R-SEs) for both news articles to the Internet Archive, perma.cc, and archive.is. The URI-Ms for the URI-R-SEs are shown in Tables 1 and 2. We can use ArchiveNow to automate pushing of URLs to multiple web archives. We also captured the WARC files of the URI-R-SEs for the articles using Webrecorder and stored the WARCs locally.
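
As a rough illustration of the automation mentioned above, here is a minimal Python sketch built around ArchiveNow's `archivenow.push(url, archive_id)` call. The wrapper `push_to_archives` is hypothetical, and the archive identifiers `"ia"` (Internet Archive) and `"is"` (archive.is) are taken from my reading of the ArchiveNow README and should be verified against the installed version. perma.cc is left out because it needs an API key. The ArchiveNow import is deferred so that the wrapper can be exercised with a stand-in pusher and without network access.

```python
from typing import Callable, Dict, List, Sequence


def _archivenow_push(url: str, archive_id: str) -> List[str]:
    """Default pusher: delegates to ArchiveNow (requires `pip install archivenow`)."""
    # Imported lazily so the wrapper below works even if the package is missing.
    from archivenow import archivenow
    return archivenow.push(url, archive_id)


def push_to_archives(
    url: str,
    archive_ids: Sequence[str] = ("ia", "is"),  # assumed ArchiveNow identifiers
    pusher: Callable[[str, str], List[str]] = _archivenow_push,
) -> Dict[str, List[str]]:
    """Push one URL to each archive and collect whatever each push returns."""
    results: Dict[str, List[str]] = {}
    for archive_id in archive_ids:
        try:
            results[archive_id] = pusher(url, archive_id)
        except Exception as exc:  # keep going if one archive fails
            results[archive_id] = [f"error: {exc}"]
    return results


if __name__ == "__main__":
    # Stand-in pusher for a dry run; swap in the default to contact real archives.
    def dry_run(url: str, archive_id: str) -> List[str]:
        return [f"(dry run) would push {url} to {archive_id}"]

    print(push_to_archives("http://example.com/", pusher=dry_run))
```

Collecting results per archive makes it easy to see which push failed, which matters when a cache URL may disappear before a retry.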

Table 1: Links to the original URL, SE cache URLs, and the mementos for The Asian Age news article.
The Asian Age News Article
Original URL (URI-R): http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
Google Cache URL (URI-R-SE): https://webcache.googleusercontent.com/search?q=cache:NZBrw4FQYRUJ:https://www.asianage.com/amp/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html%3Futm_source%3DlatestPromotion%26utm_medium%3Dend%26utm_campaign%3Dlatest+&cd=1&hl=en&ct=clnk&gl=us
Mementos for Google Cache (URI-M):
    Internet Archive: http://web.archive.org/web/20190913044502/https://webcache.googleusercontent.com/search?q=cache:NZBrw4FQYRUJ:https://www.asianage.com/amp/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html%3Futm_source%3DlatestPromotion%26utm_medium%3Dend%26utm_campaign%3Dlatest+&cd=1&hl=en&ct=clnk&gl=us
    archive.is: http://archive.is/Pc5Ss
    perma.cc: https://perma.cc/43A5-ZGUP
Bing Cache URL (URI-R-SE): http://cc.bingj.com/cache.aspx?q=http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4900757184710965&mkt=en-US&setlang=en-GB&w=2azCVRmBXeu0mxbmz4qBzg-JwMMcKGUO
Mementos for Bing Cache (URI-M):
    Internet Archive: https://web.archive.org/web/20190913222156/http://cc.bingj.com/cache.aspx?q=http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4900757184710965&mkt=en-US&setlang=en-GB&w=2azCVRmBXeu0mxbmz4qBzg-JwMMcKGUO
    archive.is: http://archive.is/MLEPL
    perma.cc: https://perma.cc/L87D-YNTD


Table 2: Links to the original URL, SE cache URLs, and the mementos for the Punjab Kesari news article.
Punjab Kesari News Article
Original URL (URI-R): https://haryana.punjabkesari.in/national/news/police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341
Google Cache URL (URI-R-SE): No results found
Mementos for Google Cache (URI-M): No mementos
Bing Cache URL (URI-R-SE): http://cc.bingj.com/cache.aspx?q=https%3a%2f%2fharyana.punjabkesari.in%2fnational%2fnews%2fpolice-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341&d=4562373894013993&mkt=en-US&setlang=en-GB&w=h-neQ61HBpgaugtetBnppSMOpz05iO79
Mementos for Bing Cache (URI-M):
    Internet Archive: http://web.archive.org/web/20190915055809/http://cc.bingj.com/cache.aspx?q=https%3a%2f%2fharyana.punjabkesari.in%2fnational%2fnews%2fpolice-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341&d=4562373894013993&mkt=en-US&setlang=en-GB&w=h-neQ61HBpgaugtetBnppSMOpz05iO79
    archive.is: http://archive.is/4yhYv
    perma.cc: https://perma.cc/C3V3-DYA5
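
The Internet Archive URI-Ms in the tables follow the pattern `web.archive.org/web/<14-digit timestamp>/<archived URL>`, so the capture time and the archived URI-R-SE can be read straight out of the URI-M. The following is a minimal Python sketch under that assumption, using the Bing cache memento from Table 2. The function name `split_wayback_urim` is hypothetical, and the timestamp is interpreted as UTC.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Internet Archive memento of the Bing cache URL, copied from Table 2.
URIM = (
    "http://web.archive.org/web/20190915055809/"
    "http://cc.bingj.com/cache.aspx?q=https%3a%2f%2fharyana.punjabkesari.in"
    "%2fnational%2fnews%2fpolice-is-not-taking-action-on-dr-purushottam-who-"
    "cheated-the-patients-1050341&d=4562373894013993&mkt=en-US&setlang=en-GB"
    "&w=h-neQ61HBpgaugtetBnppSMOpz05iO79"
)

_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_wayback_urim(urim: str) -> Tuple[datetime, str]:
    """Split a Wayback Machine URI-M into (capture datetime in UTC, archived URL)."""
    match = _WAYBACK.match(urim)
    if match is None:
        raise ValueError(f"not a Wayback Machine URI-M: {urim!r}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)


if __name__ == "__main__":
    captured, archived_url = split_wayback_urim(URIM)
    print(captured.isoformat())
    print(archived_url)
```

The second element returned is the full Bing cache URL, not the newspaper's URL, which is why a lookup by the article's own URL does not find this memento in the Internet Archive.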


Accessing the Cache URLs in the Web Archives  


Web archives index mementos by their URI-R. A SE cache URI-M can therefore be accessed only by users who know the URI-R-SE, which is mostly opaque because of its various parameters and encodings. As shown in Figure 7, the URI-R-SE for the same web resource may vary by geographic location, which means that the same web resource may be indexed under different URI-R-SEs in the web archives.

In the US, the Bing cache URL for The Asian Age news article is
http://cc.bingj.com/cache.aspx?

In India, the Bing cache URL for The Asian Age news article is

http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4857393311190329&mkt=en-IN&setlang=en-US&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y

Figure 7: The Bing cache URL for the US (left) returns 200 and the one for India (right) returns 404
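
To make the opacity concrete, the following minimal Python sketch decodes the query strings of two Bing cache URLs for the same article: the one listed in Table 1 (with `mkt=en-US`) and the one observed from India. It assumes that Bing's `q` parameter carries the percent-encoded URI-R, which is what these two URLs suggest; the helper names are hypothetical and the meaning of the `d` and `w` parameters is not known to me.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlsplit

# Bing cache URL from Table 1 (mkt=en-US) and the one observed from India.
TABLE1_CACHE_URL = (
    "http://cc.bingj.com/cache.aspx?q=http%3a%2f%2fwww.asianage.com%2fmetros"
    "%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html"
    "&d=4900757184710965&mkt=en-US&setlang=en-GB&w=2azCVRmBXeu0mxbmz4qBzg-JwMMcKGUO"
)
INDIA_CACHE_URL = (
    "http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com%2fmetros"
    "%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html"
    "&d=4857393311190329&mkt=en-IN&setlang=en-US&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y"
)


def cache_params(cache_url: str) -> Dict[str, str]:
    """Decode the query string; strip() drops the space that a leading '+' decodes to."""
    query = parse_qs(urlsplit(cache_url).query)
    return {name: values[0].strip() for name, values in query.items()}


def uri_r_from_bing_cache(cache_url: str) -> str:
    """Return the original URL (URI-R) carried in the q parameter (assumption)."""
    return cache_params(cache_url)["q"]


def differing_params(url_a: str, url_b: str) -> List[str]:
    """List the parameter names whose decoded values differ between two cache URLs."""
    a, b = cache_params(url_a), cache_params(url_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


if __name__ == "__main__":
    print(uri_r_from_bing_cache(TABLE1_CACHE_URL))
    print(uri_r_from_bing_cache(INDIA_CACHE_URL))
    print(differing_params(TABLE1_CACHE_URL, INDIA_CACHE_URL))
```

Both URLs decode to the same URI-R, while every other parameter differs, so a web archive that keys on the full cache URL treats them as two unrelated resources.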

Pushing the URI-R-SE to multiple web archives not only makes it accessible from multiple web archives, but also lets us use some web archives to find mementos in others. As shown in Figure 8, archive.is extracts the URI-R of the article from the URI-R-SE and indexes the URI-Ms for the URI-R-SE under both the URI-R and the URI-R-SE. As shown in Figure 9, we used the URI-R-SE listed by archive.is, which is what the other web archives treat as the URI-R, to access a memento of the URI-R-SE from the Internet Archive.
Figure 8: archive.is lists the Bing cache URL for the memento when searched by the URL of the web page; that cache URL can then be used to search other web archives

Figure 9: Using the Bing cache URL from archive.is to retrieve mementos of the search engine cache from the Internet Archive
Figure 10: Memento of a SE cache which did not capture the intended content 
Figure 11: Google indexed a document from the Internet Archive which lists the memento from perma.cc for The Asian Age news article
As shown in Figure 10, the Internet Archive has archived Bing's soft 404 for the URI-R-SE. Fortunately, archive.is, as shown in Figure 8, archived its memento before the URI-R-SE became a soft 404. At times, we can also find URI-Ms for a page that now returns a 404 indexed in Google search results. As shown in Figure 11, the Google search results for The Asian Age news article listed a document from the Internet Archive which contains the URI-M from perma.cc for the news article.

Sometimes SE caches hold pages that are missing (404) from the live web but not yet archived. We should push such SE cache URLs (URI-R-SEs) to multiple web archives, and we can automate saving URLs to multiple web archives simultaneously by using ArchiveNow. We can also use web archives like archive.is to look up the URI-R-SE from the URI-R of the resource, and then use that URI-R-SE to search the other web archives for its mementos.

My studies in web archiving helped me solve a real-world problem posed by my brother, who needed the URLs of news articles for which he had only screenshots. I found those URLs in SE caches and pushed them to multiple web archives; he will use the resulting mementos in his legal proceedings.
------
Mohammed Nauman Siddique
(@m_nsiddique)
