Wednesday, March 21, 2018

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English


Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages that we expected to be in English, but whose template appears in a foreign language when the archived page is retrieved. For example, the tweet content of former US President Barack Obama’s archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. You may notice that some of the interface text, such as "followers", "following", and "log in", is displayed in Urdu instead of English. A similar observation was expressed by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications when looked at from a digital archivist's perspective.

The problem became more evident when Miranda Smith (a member of the WSDL lab) was finalizing the implementation of a Twitter Follower-History-Count tool. The application uses mementos extracted from the Internet Archive (IA) to find the number of followers that a particular Twitter account had acquired through time. The tool expects the Web page retrieved from the IA to be rendered in English so that it can scrape the page and extract the number of followers a Twitter account had at a particular time. Since it was now evident that Twitter pages were not archived in English only, we had to decide whether to account for all possible language settings or discard non-English mementos. We asked ourselves: Why are some Twitter pages that we generally expected to be in English archived in non-English languages? Note that we are referring to the interface/template language and not the language of the tweet content.

We later found that this issue is more prevalent than we initially thought. We selected former US President Barack Obama as our subject to explore in how many languages, and how often, his Twitter page was archived. We downloaded the TimeMap of his page using MemGator and then downloaded all the mementos in it for analysis. We found that his Twitter page was archived in 47 different languages (all the languages that Twitter currently supports, a subset of which is supported in their widgets) across five different web archives: Internet Archive (IA), Archive-It (AIT), Library of Congress (LoC), UK Web Archive (UKWA), and Portuguese Web Archive (PT). Our dataset shows that overall only 53% of his pages (out of over 9,000 properly archived mementos) were archived in English. The remaining 47% of mementos comprised 22% archived in Kannada and 25% in 45 other languages combined. We excluded mementos from our dataset that were not "200 OK" or did not have language information.
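
A tally like the one described above can be approximated by reading the lang attribute of each memento's <html> element. The following is a minimal sketch, assuming the mementos are already available as HTML strings. The class and function names and the sample inputs are illustrative assumptions, not the tooling used to build our dataset.

```python
from collections import Counter
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> element that has one."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def template_language(html_text):
    """Return the template language of a page, or None if it is not declared."""
    parser = HtmlLangParser()
    parser.feed(html_text)
    return parser.lang


def language_distribution(pages):
    """Return {lang: percentage}, ignoring pages without language information."""
    counts = Counter(
        lang for lang in (template_language(p) for p in pages) if lang
    )
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}


# Hypothetical sample: four mementos, one without language information.
sample = [
    '<html lang="en" data-scribe-reduced-action-queue="true"><body></body></html>',
    '<html lang="kn"><body></body></html>',
    '<html lang="en"><body></body></html>',
    '<html><body></body></html>',
]
distribution = language_distribution(sample)
```

Pages without a lang attribute are dropped before the percentages are computed, which mirrors the exclusion rule stated above.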

Fig. 2 shows that in the UKWA, English accounts for only 5% of the mementos of Barack Obama's Twitter page. In the IA, by contrast, about half of the mementos of his Twitter page are in English, as many as in all the remaining languages combined. It is worth noting that AIT is a subset of the IA. On the one hand, it is good to have more language diversity in archives (for example, the archival record is more complete for English language web pages than other languages). On the other hand, it is very disconcerting when the page is captured in a language not anticipated. We also noted that Twitter pages in the Kannada language are archived more often than those in any other non-English language, although Kannada ranks 32nd globally by number of native speakers, who make up 0.58% of the global population. We tried to find out why some Twitter pages that belong to accounts that generally tweet in English were archived in non-English languages. We also tried to find out why Kannada is so prevalent among the many non-English languages. Our findings follow.

Fig. 2 - Barack Obama Twitter Page Language Distribution in Web Archives

We started investigating the reason why web archives sometimes capture pages in non-English languages, and we came up with the following potential reasons:
  • Some JavaScript in the archived page is changing the template text into another language at replay time
  • A cached page on a shared proxy is serving content in other languages
  • "Save Page Now"-like features are utilizing users' browsers' language preferences to capture pages
  • Geo-location-based language setting
  • Crawler jobs are intentionally or unintentionally configured to send a different "Accept-Language" header
The actual reason turned out to have nothing to do with any of these. Instead, it was related to cookies. Still, describing our thought process and how we arrived at the root of the issue offers some important lessons worth sharing.

Evil JavaScript


Since JavaScript is known to cause issues in web archiving, both at capture and replay time (a previous blog post by John Berlin expands on this problem), we first thought this had to do with some client-side localization in which a wrong translation file leaks at replay time. However, when we looked at the page source in a browser as well as on the terminal using curl (as illustrated below), it was clear that the translated markup was being generated on the server side. Hence, this possibility was ruled out.

$ curl --silent "https://twitter.com/?lang=ar" | grep "<meta name=\"description\""
  <meta name="description" content="من الأخبار العاجله حتى الترفيه إلى الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">

Caching


We thought Twitter might be doing content negotiation using the "Accept-Language" request header, so we changed the language preference in our web browser and opened Twitter in an incognito window, which confirmed our hypothesis. Twitter did indeed consider the language preference sent by the browser and responded with a page in that language. However, when we investigated the HTTP response headers, we found that twitter.com does not return the "Vary" header when it should. This behavior can be dangerous because the content negotiation happens on the "Accept-Language" header, but that header is not advertised as a factor of content negotiation. This means that a proxy can cache a response to a URI in some language and serve it back to someone else when the same URI is requested, even with a different language in the "Accept-Language" header. We considered this a possible way for an undesired response to get archived.
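
To make the risk concrete, here is a toy model of a shared cache, written as a minimal sketch rather than a description of any real proxy. The SharedCache class, the two origin functions, and the example.org URL are all hypothetical. The cache builds its key from the URL plus only those request headers that the origin lists in "Vary".

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL plus every header named in Vary."""
    varied = tuple(
        (name.lower(), request_headers.get(name.lower(), ""))
        for name in sorted(vary)
    )
    return (url, varied)


class SharedCache:
    """Toy shared proxy cache that stores every response it sees."""

    def __init__(self, origin):
        self.origin = origin      # callable: (url, headers) -> (body, vary list)
        self.store = {}
        self.vary_by_url = {}     # Vary list last advertised for each URL

    def fetch(self, url, request_headers):
        headers = {k.lower(): v for k, v in request_headers.items()}
        key = cache_key(url, headers, self.vary_by_url.get(url, []))
        if key in self.store:
            return self.store[key]
        body, vary = self.origin(url, headers)
        self.vary_by_url[url] = vary
        self.store[cache_key(url, headers, vary)] = body
        return body


def origin_without_vary(url, headers):
    # Negotiates on Accept-Language but does not advertise it.
    return "page in " + headers.get("accept-language", "en"), []


def origin_with_vary(url, headers):
    # Same negotiation, but advertised through Vary.
    return "page in " + headers.get("accept-language", "en"), ["Accept-Language"]


leaky = SharedCache(origin_without_vary)
first = leaky.fetch("https://example.org/", {"Accept-Language": "ur"})
second = leaky.fetch("https://example.org/", {"Accept-Language": "en"})

safe = SharedCache(origin_with_vary)
safe.fetch("https://example.org/", {"Accept-Language": "ur"})
third = safe.fetch("https://example.org/", {"Accept-Language": "en"})
```

In this model, the second request to the cache without "Vary" receives the Urdu page even though it asked for English, while the cache that is told about "Accept-Language" goes back to the origin and returns the English page.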

On further investigation we found that Twitter tries very hard (sometimes in wrong ways) to make sure their pages are not cached. This can be seen in their response headers illustrated below. The Cache-Control and obsolete Pragma headers explicitly ask proxies and clients not to cache the response itself or anything about the response by setting values to "no-cache" and "no-store". The Date (the date/time at which the response was originated) and Last-Modified headers are set to the same value to ensure that the cache (if stored) becomes invalid immediately. Additionally, the Expires header (the date/time after which the response is considered stale) is set to March 31, 1981, a date far in the past, long before Twitter even existed, to further enforce cache invalidation.


$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
pragma: no-cache
date: Sun, 18 Mar 2018 17:43:25 GMT
last-modified: Sun, 18 Mar 2018 17:43:25 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
...

Hence, the possibility of a cache returning pages in different languages due to the missing "Vary" header was also not sufficient to justify the number of mementos in non-English languages.
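
As an illustration of why these headers keep a well-behaved shared cache from storing the page, the sketch below parses a Cache-Control value and applies the standard rule that a shared cache must not store a response marked "no-store" or "private". The function names are assumptions made for this example, and the check is deliberately simplified: it ignores every other caching rule.

```python
def parse_cache_control(value):
    """Split a Cache-Control header into a {directive: argument} mapping."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip() or None
    return directives


def may_store(response_headers):
    """Return False when the response forbids storage by a shared cache."""
    headers = {k.lower(): v for k, v in response_headers.items()}
    directives = parse_cache_control(headers.get("cache-control", ""))
    return "no-store" not in directives and "private" not in directives


# Header values copied from the curl output shown above.
twitter_headers = {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "pragma": "no-cache",
    "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
}
storable = may_store(twitter_headers)
```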

Geo-location


We thought about the possibility that Twitter identifies a potential language for guest visitors based on their IP address (to guess the geo-location). However, the languages seen in mementos do not align with the places where archival crawlers are located. For example, the Kannada language that is dominating in the UK Web Archive is spoken in the State of Karnataka in India, and it is unlikely that the UK Web Archive is crawling from machines located in Karnataka.


On-demand Archiving


The Internet Archive recently introduced the "Save Page Now" feature, which acts as a proxy and forwards the request headers of the user to the upstream web server rather than its own. This behavior can be observed in a memento of an HTTP echo service, HTTPBin, that we requested from our browser. The service echoes back in its response the data that it receives from the client in the request. By archiving its response, we can see which headers the service received from the requesting client. The headers shown there are those of our browser, not of the IA's crawler, especially the "Accept-Language" header (which we customized in our browser) and the "User-Agent" header. This confirms our hypothesis that IA's Save Page Now feature acts as a proxy.

$ curl http://web.archive.org/web/20180227154007/http://httpbin.org/anything
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "es",
    "Connection": "close",
    "Host": "httpbin.org",
    "Referer": "https://web.archive.org/",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
  },
  "json": null,
  "method": "GET",
  "origin": "207.241.225.235",
  "url": "https://httpbin.org/anything"
}

This behavior made us consider that people from different regions of the world with different language settings in their browsers, when using the "Save Page Now" feature, would end up preserving Twitter pages in the language of their preference (since Twitter does honor the "Accept-Language" header in some cases). However, we were unable to replicate this in our browser. Also, not every archive offers on-demand archiving, so those that do not could never have replayed users' request headers.

We also repeated this experiment in Archive.is, another on-demand web archive. Unlike IA, it does not replay users' headers like a proxy; instead, it sends its own custom request headers. Archive.is does not show the original markup but modifies the page heavily before serving it, so a curl output would not be very useful. However, the content of our archived HTTP echo service page looks like this:


{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip",
    "Accept-Language": "tt,en;q=0.5",
    "Connection": "close",
    "Host": "httpbin.org",
    "Referer": "https://www.google.co.uk/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2704.79 Safari/537.36"
  },
  "json": null,
  "method": "GET",
  "origin": "128.82.7.11, 128.82.7.11",
  "url": "https://httpbin.org/anything"
}

Note that it has its own custom "Accept-Language" and "User-Agent" headers (different from those of the browser from which we requested the capture). It also includes a custom "Referer" header. However, unlike IA, it replayed our IP address as the origin. We then captured https://twitter.com/?lang=ar (http://archive.is/cioM5) followed by https://twitter.com/phonedude_mln/ (http://archive.is/IbHgB) to see if the language session sticks across two successive Twitter requests, but that was not the case, as the second page was archived in English (not in Arabic). This does not necessarily prove that their crawler is free of this issue, though. It is possible that two different instances of their crawler handled the two requests, or that some other Twitter links (with "?lang=en") were archived between the two requests by someone else. We do not have sufficient information to be certain about it.

Misconfigured Crawler


Some of the early mementos in which we observed this behavior were from Archive-It. So, we thought that some collection maintainers might have misconfigured their crawling job to send a non-default "Accept-Language" header, resulting in such mementos. Since we did not have access to their crawling configuration, there was very little we could do to test this hypothesis. Many of the leading web archives, including Archive-It, use Heritrix as their crawler, and we happen to have some WARC files from AIT, so we started looking into those. We looked through the request records of those WARC files for any Twitter links to see what "Accept-Language" header was sent. We were quite surprised to see that Heritrix never sent an "Accept-Language" header to any server, so this could not be the reason at all. However, when looking into those WARC files, we saw "Cookie" headers sent to the servers in the request records of Twitter and many others. This led us to uncover the actual cause of the issue.
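
The kind of inspection described above can be sketched with the standard library alone. The helper below is a hypothetical illustration, not the script we ran: it takes the text of a single, uncompressed WARC request record and returns the WARC headers and the HTTP request headers as dictionaries with lowercased names. The sample record is abridged from the Heritrix crawl discussed later in this post. For real WARC files, which are usually compressed and hold many records, a dedicated WARC library would be a better fit.

```python
def _parse_header_lines(lines):
    """Turn 'Name: value' lines into a dict with lowercased names."""
    headers = {}
    for line in lines:
        if not line.strip():
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def http_request_headers(warc_record_text):
    """Return (warc_headers, http_headers) for one WARC request record."""
    text = warc_record_text.replace("\r\n", "\n").strip()
    # A record is: WARC headers, a blank line, then the HTTP request message.
    warc_part, _, http_part = text.partition("\n\n")
    warc_headers = _parse_header_lines(warc_part.splitlines()[1:])  # skip "WARC/1.0"
    http_headers = _parse_header_lines(http_part.splitlines()[1:])  # skip request line
    return warc_headers, http_headers


record = """WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/?lang=ar

GET /?lang=ar HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Host: twitter.com
Cookie: guest_id=v1%3A152123752160566016
"""
warc, req = http_request_headers(record)
sent_accept_language = "accept-language" in req
sent_cookie = "cookie" in req
```

For this record the check reports no "Accept-Language" header but does report a "Cookie" header, which is the pattern that pointed us toward cookies.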

Cookies, the Real Culprit


So far, we had been considering Heritrix to be a stateless crawler, but when we looked into the WARC files from AIT, we observed cookies being sent to servers. This means Heritrix does have cookie management built in (which is often necessary to meaningfully capture some sites). With this discovery, we started investigating Twitter's behavior from a different perspective. The page source of Twitter has a list of alternate links for each language they provide localization for (currently, 47 languages). This list can get added to the frontier queue of the crawler. Although these links have different URIs (i.e., they carry a query parameter "?lang=<lang-code>"), once any of them is loaded, the session is set to that language until the language is explicitly changed or the session expires or is cleared. In the past, Twitter had options in the interface to manually select a language, which was then set for the session. It is understandable that general-purpose web sites cannot rely completely on the "Accept-Language" header for localization-related content negotiation, as browsers have made it difficult to customize language preferences, especially if one has to set them on a per-site basis.

We experimented with Twitter's language related behavior in our web browser by navigating to https://twitter.com/?lang=ar, which yields the page in the Arabic language. Then navigating to any Twitter page such as https://twitter.com/ or https://twitter.com/ibnesayeed (without the explicit "lang" query parameter) continues to serve Arabic pages (if a Twitter account is not logged in). Here is how Twitter's server behaves for language negotiation:

  • If a "lang" query parameter (with a supported language) is present in any Twitter link, that page is served in the corresponding language.
  • If the user is a guest, the value of the "lang" parameter is set for the session (this gets set each time an explicit language parameter is passed) and remains sticky until changed/cleared.
  • If the user is logged in (using Twitter's credentials), the default language preference is taken from their profile preferences, so the page will only show in a different language if an explicit "lang" parameter is present in the URI. However, it is worth noting that crawlers generally behave like guests.
  • If the user is a guest and no "lang" parameter is passed, Twitter falls back to the language supplied in the "Accept-Language" header.
  • If the user is a guest, no "lang" parameter is passed, and no "Accept-Language" header is provided, then responses are in English (though, this could be affected by Geo-IP, which we did not test).
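
The rules above can be summarized as a small decision function. This is a sketch of the behavior we observed from the outside, not Twitter's actual implementation. The function name, its parameters, and the abridged SUPPORTED set are assumptions made for illustration. It also simplifies two things: only the first entry of "Accept-Language" is considered, and a logged-in user is assumed not to receive a sticky cookie.

```python
SUPPORTED = {"en", "ar", "ur", "kn", "fr"}  # illustrative subset, not the full list


def negotiate_language(query_lang=None, cookie_lang=None,
                       profile_lang=None, accept_language=None):
    """Return (page_language, cookie_to_set) following the observed precedence."""
    logged_in = profile_lang is not None
    if query_lang in SUPPORTED:
        # An explicit "lang" parameter always wins; guests also get a sticky cookie.
        return query_lang, (None if logged_in else query_lang)
    if logged_in:
        # Logged-in users fall back to their profile preference.
        return profile_lang, None
    if cookie_lang in SUPPORTED:
        # Guests keep the language stored in the session cookie.
        return cookie_lang, None
    if accept_language:
        first = accept_language.split(",")[0].split(";")[0].strip().lower()
        if first in SUPPORTED:
            return first, None
    return "en", None


# Guest requests in the order: explicit parameter, nothing, header only, cookie only.
steps = [
    negotiate_language(query_lang="ar"),
    negotiate_language(),
    negotiate_language(accept_language="ur"),
    negotiate_language(cookie_lang="ar"),
]
```

The last case is the one that matters for crawlers: a guest request with no "lang" parameter and no "Accept-Language" header still gets Arabic if an earlier response left a "lang=ar" cookie behind.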

In the example below we illustrate some of that behavior using curl. First, we fetched Twitter's home page in Arabic using an explicit "lang" query parameter and confirmed that the response was indeed in Arabic, as it contains the lang="ar" attribute in the <html> element. We also saved any cookies that the server wanted to set in the "/tmp/twitter.cookie" file. We then confirmed that the cookie file does indeed have a "lang" cookie with the value "ar" (there are some other cookies in it, but those are not relevant here). In the next step, we fetched Twitter's home page without any explicit "lang" query parameter and received a response in the default English language. Then we fetched the home page with the "Accept-Language: ur" header and got the response in Urdu. Finally, we fetched the home page again, but this time supplied the saved cookies (which include the "lang=ar" cookie) and received the response in Arabic again.

$ curl --silent -c /tmp/twitter.cookie "https://twitter.com/?lang=ar" | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">

$ grep lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0 lang ar

$ curl --silent https://twitter.com/ | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">

$ curl --silent -H "Accept-Language: ur" https://twitter.com/ | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">

$ curl --silent -b /tmp/twitter.cookie https://twitter.com/ | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">


Twitter Cookies and Heritrix


Now that we understood the reason, we wanted to replicate what is happening in a real archival crawler. We used Heritrix to simulate the effect that Twitter cookies have when a Twitter page gets archived in the IA. We seeded the following URLs and placed them in the same sequence inside Heritrix's configuration file:
  • https://twitter.com/?lang=ar
  • https://twitter.com/phonedude_mln/
The order of these links was carefully chosen to see whether the first link sets the language to Arabic and the second one then gets captured in Arabic. We had already shown that the first URI, which includes the language identifier for Arabic (lang=ar), places the language identifier inside the cookie. The question now becomes: What effect will this cookie have on subsequent requests for other Twitter pages? Is the language identifier going to stay the same as the one already set in the cookie? Or is it going to revert to a default language preference? The common expectation for our seeded URIs is that the first Twitter page will be archived in Arabic and the second page will be archived in English, since a request to a top-level .com domain usually defaults to the English language. However, since we have observed that the Twitter cookies contain the language identifier when this parameter is passed in the URI, it is plausible that the language identifier will be maintained if subsequent requests for Twitter pages use the same cookie.

After running the crawling job in Heritrix for the seeded URIs, we inspected the WARC file generated by Heritrix. The results were as we expected. Heritrix was indeed saving and replaying "Cookie" headers, resulting in the second page being captured in Arabic. Relevant portions of the resulting WARC file are shown below:

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Concurrent-To: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
WARC-Record-ID: <urn:uuid:473273f6-48fa-4dd3-a5f0-81caf9786e07>
Content-Type: application/http; msgtype=request
Content-Length: 301

GET /?lang=ar HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC file shown above is a request record for the URI https://twitter.com/?lang=ar. The request line and the "Host" header show the GET request made to the host "twitter.com" with the path and query string "/?lang=ar". This request yielded a response from Twitter that contains a "set-cookie" header carrying the language identifier from the URI ("lang=ar"), as shown in the portion of the WARC file below. The HTML was rendered in Arabic (notice the lang attribute of the <html> element in the response payload below).

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Payload-Digest: sha1:FCOPDBN2U5LXU7FEUUGQ4WXYGR7OP5JI
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
Content-Type: application/http; msgtype=response
Content-Length: 151985

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 150665
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:44 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:44 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:34 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: lang=ar; Path=/
set-cookie: ct0=10558ec97ee83fe0f2bc6de552ed4b0e; Expires=Sat, 17 Mar 2018 03:58:44 UTC; Path=/; Domain=.twitter.com; Secure
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: 2a2fc89f51b930202ab24be79b305312
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 100
x-transaction: 001495f800dc517f
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">
...

The subsequent request in the seeded Heritrix configuration file (https://twitter.com/phonedude_mln/) generated an additional request record, shown in the WARC file portion below. The request line and the "Host" header show the GET request made to the host "twitter.com" with the path "/phonedude_mln/". Notice that the "Cookie" header of this request includes lang=ar, which was set by the response to the first seeded URI.

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Concurrent-To: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
WARC-Record-ID: <urn:uuid:eef134ed-f3dc-459b-95e7-624b4d747bc1>
Content-Type: application/http; msgtype=request
Content-Length: 655

GET /phonedude_mln/ HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: lang=ar; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; ct0=10558ec97ee83fe0f2bc6de552ed4b0e; guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC file shown below illustrates the effect of Heritrix saving and replaying the "Cookie" headers. The lang attribute of the <html> element shows that the HTML language identifier was set to Arabic for the second seeded URI (https://twitter.com/phonedude_mln/), although the URI did not include the language identifier.

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Payload-Digest: sha1:5LI3DGWO6NGK4LWSIHFZZHW43H2Z2IWA
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
Content-Type: application/http; msgtype=response
Content-Length: 518086

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 516921
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:48 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:48 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:38 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: ef102c969c74f3abf92966e5ffddb6ba
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 335
x-transaction: 0014986c00687fa3
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">
...

We used PyWb to replay pages from the captured WARC file. Fig. 3 is the page rendered after retrieving the first seeded URI of our collection (https://twitter.com/?lang=ar). For those not familiar with Arabic, this is indeed Twitter's home page in Arabic.

Fig. 3 - https://twitter.com/?lang=ar

Fig. 4 is the representation given by PyWb after requesting the second seeded URI (https://twitter.com/phonedude_mln). The page was rendered using Arabic as the default language, although we did not include this setting in the URI, nor did our browser language settings include Arabic.

Fig. 4 - https://twitter.com/phonedude_mln/ in Arabic

Why is Kannada More Prominent?


As we noted before, Twitter's page source now includes a list of alternate links for 47 supported languages. These links look something like this:

<link rel="alternate" hreflang="fr" href="https://twitter.com/?lang=fr">
<link rel="alternate" hreflang="en" href="https://twitter.com/?lang=en">
<link rel="alternate" hreflang="ar" href="https://twitter.com/?lang=ar">
...
<link rel="alternate" hreflang="kn" href="https://twitter.com/?lang=kn">

The fact that Kannada ("kn") is the last language in the list is why it is so prevalent in web archives. Each language-specific link overwrites the session language set by its predecessor, so the last one stays in effect for many more Twitter links in the frontier queue. Twitter started supporting Kannada along with three other Indian languages in July 2015 and placed it at the very end of the language-related alternate links. Since then, it has been captured more often in various archives than any other non-English language. Before these new languages were added, Bengali was the last link in the alternate language links for about a year. Our dataset shows dense archival activity for Bengali between July 2014 and July 2015, after which Kannada took over. This confirms our hypothesis that the language of the last alternate link sticks to the session for a long time. It affects all upcoming links from the same domain in the crawlers' frontier queue until another language-specific link overwrites the session.
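
The "last link wins" effect can be reproduced with a toy crawl. The sketch below is a deliberately simplified model, not Heritrix: it uses one shared cookie dictionary, a plain first-in-first-out frontier, an abridged list of alternate languages, and hypothetical profile URLs (user1 to user3). A real frontier orders and interleaves URLs in more complicated ways.

```python
from collections import deque

ALTERNATE_LANGS = ["fr", "en", "ar", "kn"]  # abridged; "kn" is last, as on Twitter


def serve(url, cookies):
    """Toy server: ?lang=xx sets a sticky cookie; otherwise the cookie decides."""
    if "?lang=" in url:
        lang = url.split("?lang=", 1)[1]
        cookies["lang"] = lang
        return lang
    return cookies.get("lang", "en")


def crawl(seeds, profile_pages):
    """First-in-first-out crawl with one cookie jar shared by every request."""
    cookies = {}
    frontier = deque(seeds)
    captured = {}
    while frontier:
        url = frontier.popleft()
        captured[url] = serve(url, cookies)
        if url == "https://twitter.com/":
            # The home page exposes the alternate links, then ordinary pages.
            frontier.extend(
                "https://twitter.com/?lang=" + lang for lang in ALTERNATE_LANGS
            )
            frontier.extend(profile_pages)
    return captured


profiles = ["https://twitter.com/user%d" % i for i in range(1, 4)]
result = crawl(["https://twitter.com/"], profiles)
profile_languages = [result[p] for p in profiles]
```

In this model the home page is captured in English, each alternate link is captured in its own language, and every profile page that follows is captured in Kannada, because "kn" was the last value written to the shared cookie.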

What Should We Do About It?


Disabling cookies does not seem to be a good option for crawlers, as some sites try hard to set a cookie by repeatedly returning redirect responses until their desired "Cookie" header is included in the request. However, explicitly reducing the cookie expiration duration in crawlers could potentially mitigate the long-lasting impact of such sticky cookies. Garbage collecting any cookie that was set more than a few seconds ago would ensure that no cookie is reused for more than a few successive requests. Sandboxing crawl jobs in many isolated sessions is another potential solution to minimize the impact. Alternatively, filtering policies could be set so that URLs that set session cookies are downloaded in a separate short-lived session, isolating them from the rest of the crawl frontier queue.
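
The garbage-collection idea can be sketched as a cookie store that forgets anything older than a few seconds. This is a hypothetical illustration of the policy, not a Heritrix setting or API. The class name, the five-second default, and the injectable clock (used here so the example does not have to sleep) are all assumptions.

```python
import time


class ShortLivedCookieJar:
    """Cookie store that forgets any cookie older than max_age seconds."""

    def __init__(self, max_age=5.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._cookies = {}  # name -> (value, time it was set)

    def set(self, name, value):
        self._cookies[name] = (value, self.clock())

    def header(self):
        """Evict stale cookies, then build a Cookie header value (or None)."""
        now = self.clock()
        self._cookies = {
            name: (value, set_at)
            for name, (value, set_at) in self._cookies.items()
            if now - set_at <= self.max_age
        }
        if not self._cookies:
            return None
        return "; ".join(
            "%s=%s" % (name, value) for name, (value, _) in self._cookies.items()
        )


# A fake clock stands in for real time in this example.
fake_now = [0.0]
jar = ShortLivedCookieJar(max_age=5.0, clock=lambda: fake_now[0])
jar.set("lang", "kn")
fresh = jar.header()   # replayed right after it was set
fake_now[0] = 6.0
stale = jar.header()   # evicted once it is older than max_age
```

A site that needs a cookie to get through a redirect loop would still receive it on the immediate follow-up request, while a sticky language cookie would no longer leak into requests made several seconds later.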

Conclusions


The problem of Twitter pages unintentionally being archived in non-English languages is quite significant. We found that 47% of mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case it is disconcerting and counter-intuitive. We found that the root cause is Twitter's sticky language sessions, maintained using cookies, which the Heritrix crawler seems to honor.

Because Kannada is the last one in the list of language-specific alternate links on Twitter's pages, it overwrites the language cookies resulting from the URLs of the other languages listed above it. This causes more Twitter pages in the frontier queue to be archived in Kannada than in other non-English languages. Crawlers are generally considered to be stateless, but honoring cookies makes them somewhat stateful. This behavior in web archives may not be specific to Twitter; many other sites that use cookies for content negotiation might have similar consequences. This issue can potentially be mitigated by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling task for the URLs of the same domain across many small sandboxed instances.

--
Sawood Alam
and
Plinio Vargas
