Monday, March 18, 2019

2019-03-18: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages

Figure 1: Mixed language blocks on a memento of a Twitter timeline. Highlighted with blue colored box for Portuguese, orange for English, and red for Urdu. Dotted border indicates the template present in the original HTML response while blocks with solid borders indicate lazily loaded content.

Would you be surprised if I were to tell you that Twitter is a multi-lingual website, supporting 47 different international languages? How about if I were to tell you that a usual Twitter timeline page can contain tweets in whatever languages the owner of the handle chooses to tweet, but can also show the navigation bar and various sidebar blocks in many different languages simultaneously? Surprised now? Well, while it makes no sense, it may actually happen in web archives when a memento of a timeline is accessed, as shown in Figure 1. Spoiler alert! Cookies are to blame, once again.

Last month, I was investigating a real life version of "Ron Burgundy will read anything on the teleprompter (Anchorman)" and "Chatur's speech (3 Idiots)" moments, when I noticed something that caught my eye. I was looking at a memento (i.e., a historical version of a web page) of Pratik Sinha's Twitter timeline from the Internet Archive. Pratik is the co-founder of Alt News (an Indian fact checking website) and the person who edited an internal document of the IT Cell of BJP (the current ruling party of India), which was then copy-pasted and tweeted by some prominent handles of the party. Tweets on his timeline are generally in English, but the archived page's template language was not English (although I did not request the page in any specific language). However, this was not surprising to me as we had already investigated the reason behind this template language behavior last year and found that HTTP cookies were causing it. After spending a minute or so on the page, a small notice appeared in the main content area, right above the list of tweets, suggesting that there were 20 more tweets, but the message was in Urdu, a Right-to-Left (RTL) language, very different from the language used in the page's navigation bar. As Urdu is my first language, this immediately alerted me that something was not quite right. Upon further investigation, I found that the page was composed of three different languages (Portuguese, English, and Urdu), as highlighted in Figure 1 (here I am not talking about the language of tweets themselves).

What Can Deface a Composite Memento?


This defaced composite memento is a serious archival replay problem as it is showing a web page that perhaps never existed. While the individual representations all separately existed on the live web, they were never combined in the page as it is replayed by the web archive. In the Web Science and Digital Libraries Research Group, we uncovered a couple of causes in the past that can yield defaced composite mementos. One of them is live-leakage (also known as Zombies), for which Andy Jackson proposed we should use Content-Security-Policy, Ada Lerner et al. took a security-centric approach that was deployed by the Internet Archive's Wayback Machine, and we proposed Reconstructive as a potential solution using Service Worker. The other known cause is temporal violations, on which Scott Ainsworth is working as his PhD research. However, this mixed-language Twitter timeline issue cannot be explained by either zombies or temporal violations.

Anatomy of a Twitter Timeline


To uncover the cause, I further investigated the anatomy of a Twitter timeline page and various network requests it makes when accessed live or from a web archive as illustrated in Figure 2. Currently, when a Twitter timeline is loaded anonymously (without logging in), the page is returned with a block containing a brief description of the user, a navigation bar (containing a summary of the numbers of tweets, followers, etc.), a sidebar block to encourage visitors to create a new account, and an initial set of tweets. The page also contains empty placeholders of some sidebar blocks such as related users to follow, globally trending topics, and recent media posted on that timeline. Apart from loading page requisites, the page also makes some follow-up XHR requests to populate these blocks. When the page is active (i.e., the browser tab is focused) it polls for new tweets every 30 seconds and global trends every 5 minutes. Successful responses to these asynchronous XHR requests contain data in JSON format, but instead of providing language-independent bare-bones structured data to rendering templates on the client-side, they contain some server-side rendered encoded markup. This markup is then decoded on the client-side and directly injected in the corresponding empty placeholders (or replaces any existing content), then the block is set to visible. This server-side partial markup rendering needs to know the language of the parent page in order to utilize phrases translated in the corresponding language to yield a consistent page.
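The injection step can be sketched as follows. This is only a minimal illustration under stated assumptions, not Twitter's client code: the `inject_module` helper and the `placeholders` dictionary are hypothetical stand-ins for the page's DOM, and the payload is a heavily trimmed imitation of the "/i/trends" response shape (a "module_html" key holding pre-rendered markup).

```python
import json
from html.parser import HTMLParser


class HeadingText(HTMLParser):
    """Collect the text found inside <h3> elements of a markup fragment."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.headings[-1] += data


def inject_module(placeholders, block_name, xhr_body):
    """Decode the JSON body and drop its pre-rendered markup into a block.

    Hypothetical helper: the markup is used as-is, so whatever language the
    server rendered it in ends up on the page.
    """
    payload = json.loads(xhr_body)
    placeholders[block_name] = payload["module_html"]
    return placeholders


# The parent page may be in any language; the XHR body below carries Urdu.
placeholders = {"trends": ""}
xhr_body = json.dumps({
    "module_html": '<div><h3><span class="trend-location">'
                   "دنیا بھر کے میں رجحانات</span></h3></div>",
    "personalized": False,
    "woeid": 1,
})
inject_module(placeholders, "trends", xhr_body)

parser = HeadingText()
parser.feed(placeholders["trends"])
print(parser.headings)
```

Because the client never re-renders the fragment from language-neutral data, nothing on the client side can correct a language mismatch between the template and the injected block.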

Figure 2: An active Twitter timeline page asynchronously populates related users and recent media blocks then polls for new tweets every 30 seconds and global trends every 5 minutes.

How Does Twitter's Language Internationalization Work?


From our past investigation we know that Twitter handles languages in two primary ways, a query parameter and a cookie header. In order to fetch a page in a specific language (from their 47 currently supported languages) one can either add a "?lang=<language-code>" query parameter in the URI (e.g.,
https://twitter.com/ibnesayeed?lang=ur
for Urdu) or send a Cookie header containing the "lang=<language-code>" name/value pair. A URI query parameter takes precedence in this case and also sets the "lang" Cookie accordingly (overwriting any existing value) for all the subsequent requests until overwritten again explicitly. This works well on the live site, but has some unfortunate consequences when a memento of a Twitter timeline is replayed from a web archive, causing this hodgepodge illustrated in Figure 1 (area highlighted by dotted border indicates the template served in the initial HTML response while areas surrounded with solid border were lazily loaded). This mixed-language rendering does not happen when a memento of a timeline is loaded with an explicit language query parameter in the URI as illustrated in Figures 3, 4, and 5 (the "lang" query parameter is highlighted in the archival banner and also the lazily loaded blocks from each language that corresponds to the blocks in Figure 1). In this case, all the subsequent XHR URIs also contain the explicit "lang" query parameter.
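The precedence rule described above can be modeled in a few lines. This is a sketch of the observed behavior only, not Twitter's implementation; `resolve_language` and `DEFAULT_LANG` are hypothetical names, and the fallback to English is an assumption (the real default may depend on geo-location or other factors).

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_LANG = "en"  # assumption: used when neither parameter nor cookie is present


def resolve_language(uri, cookies):
    """Return (language, cookies_after) for a single request.

    A "lang" query parameter wins over the "lang" cookie and also
    overwrites the cookie for all subsequent requests.
    """
    query = parse_qs(urlsplit(uri).query)
    cookies = dict(cookies)  # do not mutate the caller's jar
    if "lang" in query:
        cookies["lang"] = query["lang"][0]
    return cookies.get("lang", DEFAULT_LANG), cookies


lang, jar = resolve_language("https://twitter.com/ibnesayeed?lang=ur", {"lang": "pt"})
print(lang, jar)
```

The important property is the side effect: one request with an explicit parameter changes the outcome of every later request that relies on the cookie.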

Figure 3: A memento of a Twitter timeline explicitly in Portuguese.

Figure 4: A memento of a Twitter timeline explicitly in English.

Figure 5: A memento of a Twitter timeline explicitly in Urdu. The direction of the page is Right-to-Left (RTL), as a result, sidebar blocks are moved to the left hand side.

To understand the issue, consider the following sequence of events during the crawling of a Twitter timeline page. Suppose, we begin a fresh crawling session and start with fetching the https://twitter.com/ibnesayeed page without any specific language code supplied. Depending on the geo-location of the crawler or any other factors Twitter might return the page in a specific language, for instance, in English. The crawler extracts links of all the page requisites and hyperlinks to add them into the frontier queue. The crawler may also attempt to extract URIs of potential XHR or other JS initiated requests, which might add URIs like:
https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module
and
https://twitter.com/i/related_users/28631536
(and various other lazily loaded resources) to the frontier queue. The HTML page also contains 47 language-specific alternate links (and one x-default hreflang) in its markup (with "?lang=<language-code>" style parameters). These alternate links will also be added to the frontier queue of the crawler in some order. When these language-specific links are fetched by the crawler, the lang Cookie will be set, overwriting any prior value. Now, suppose https://twitter.com/ibnesayeed?lang=ur was fetched before the "/i/trends" data; it would set the language for any subsequent requests to be served in Urdu. When the data for the global trends block is fetched, Twitter's server will return server-side rendered markup in Urdu, which will be injected in the page that was initially served in English. This will cause the header of the block to say "دنیا بھر کے میں رجحانات" instead of "Worldwide trends". Here, I would take a long pause of silence to express my condolence on the brutal murder of a language with more than 100 million speakers worldwide by a platform as big as Twitter. The Urdu translation of this phrase appearing in such a prominent place on the page is nonsense and grammatically wrong. Twitter, if you are listening, please change it to something like "عالمی رجحانات" and get an audit of other translated phrases. Now, back to the original problem, the following is a walk-through of the scenario described above.

$ curl --silent "https://twitter.com/ibnesayeed" | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">
$ curl --silent -c /tmp/twitter.cookie "https://twitter.com/ibnesayeed?lang=ur" | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">
$ grep lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0 lang ur
$ curl --silent -b /tmp/twitter.cookie "https://twitter.com/ibnesayeed" | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">
$ curl --silent -b /tmp/twitter.cookie "https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module" | jq
{
  "module_html": "<div class=\"flex-module trends-container context-trends-container\">\n  <div class=\"flex-module-header\">\n    \n    <h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>\n  </div>\n  <div class=\"flex-module-inner\">\n    <ul class=\"trend-items js-trends\">\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#PiDay\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/PiDay?src=tren&amp;data_id=tweet%3A1106214111183020034\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#PiDay</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Google employee sets new record for calculating π to 31.4 trillion digits\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#SaveODAAT\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/SaveODAAT?src=tren&amp;data_id=tweet%3A1106252880921747457\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#SaveODAAT</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Netflix cancels One Day at a Time after three seasons\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  
context-trend-item\"\n    data-trend-name=\"Beto\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=Beto&amp;src=tren&amp;data_id=tweet%3A1106142158023786496\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Beto</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Beto O’Rourke announces 2020 presidential bid\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#AvengersEndgame\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/AvengersEndgame?src=tren&amp;data_id=tweet%3A1106169765830295552\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#AvengersEndgame</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Marvel dropped a new Avengers: Endgame trailer\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"12 Republicans\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%2212%20Republicans%22&amp;src=tren\"\n        
data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">12 Republicans</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          6,157 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#NationalAgDay\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/NationalAgDay?src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#NationalAgDay</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          6,651 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Kyle Guy\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%22Kyle%20Guy%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Kyle Guy</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          1,926 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#314Day\"\n    data-trends-id=\"1025618545345384837\"\n    
data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/314Day?src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#314Day</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          12 ہزار ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Tillis\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=Tillis&amp;src=tren&amp;data_id=tweet%3A1106266707230777344\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Tillis</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Senate votes to block Trump&#39;s border emergency declaration\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Bikers for Trump\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%22Bikers%20for%20Trump%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Bikers for Trump</span>\n\n      \n      <div 
class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          16.8 ہزار ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n    </ul>\n  </div>\n</div>\n",
  "personalized": false,
  "woeid": 1
}

Here, I started by fetching my Twitter timeline without specifying any language in the URI or via cookies. The response was returned in English. I then fetched the same page with an explicit "?lang=ur" query parameter and saved any returned cookies in the "/tmp/twitter.cookie" file. This illustrated that the response was indeed returned in Urdu. I then checked the saved cookie file to see if it contained a "lang" cookie, which it did, with a value of "ur". I then utilized the saved cookie file to fetch the main timeline page again, but without an explicit "?lang=ur" query parameter, to illustrate that Twitter's server respects the cookie and returns the response in Urdu. Finally, I fetched global trends data while utilizing saved cookies and illustrated that the response contains JSON-serialized HTML markup with Urdu header text in it as
"<h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>"
under the "module_html" JSON key. The original response is encoded using Unicode escapes, but I used the jq utility here to pretty-print JSON and decode escaped markup for easier illustration.
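The same sequence can be replayed offline as a toy simulation of a crawler that shares one cookie jar across its frontier queue. Everything here is a sketch under assumptions: `fake_twitter` is a hypothetical stand-in for the server's language handling (English when nothing is set), and `crawl` is a deliberately simplified crawler that records only the language each URI was captured in.

```python
from collections import deque
from urllib.parse import urlsplit, parse_qs


def fake_twitter(uri, jar):
    """Stand-in server: the query parameter beats the cookie, then sets it."""
    query = parse_qs(urlsplit(uri).query)
    if "lang" in query:
        jar["lang"] = query["lang"][0]
    return jar.get("lang", "en")  # assumption: English when nothing is set


def crawl(frontier):
    """Fetch URIs in queue order while sharing ONE cookie jar for the domain."""
    jar, captured = {}, {}
    queue = deque(frontier)
    while queue:
        uri = queue.popleft()
        captured[uri] = fake_twitter(uri, jar)
    return captured


frontier = [
    "https://twitter.com/ibnesayeed",
    "https://twitter.com/ibnesayeed?lang=ur",  # one of the hreflang alternate links
    "https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module",
]
for uri, lang in crawl(frontier).items():
    print(lang, uri)
```

In this ordering the timeline is captured in English while its trends block is captured in Urdu, which is exactly the mismatch that later surfaces at replay time.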

Understanding Cookie Violations


When fetching a single page (and all its page requisites) at a time, this problem (let's name it a cookie violation) might not happen as often. However, when crawling is done on a large scale, completely preventing such unfortunate misalignment of the frontier queue is almost impossible, especially since the "lang" cookie is set for the root path of the domain and affects every resource from the domain.

The root cause here can more broadly be described as lossy state information being utilized when replaying a stateful resource representation from archives that originally performed content negotiation based on cookies or other headers. Most of the popular archival replay systems (e.g., OpenWayback, PyWB, and even our own InterPlanetary Wayback) do not perform any content negotiation when serving a memento other than the Accept-Datetime header (which is not part of the original crawl-time interaction, but a means to add the time dimension to the web). Traditional archival crawlers (such as Heritrix) mostly interacted with web servers by using only URIs without any custom request headers that might affect the returned response. This means that, generally, a canonicalized URI along with the datetime of the capture was sufficient to identify a memento. However, cookies are an exception to this assumption as they are needed for some sites to behave properly, hence cookie management support was added to these crawlers a long time ago. Cookies can be used for tracking, client-side configurations, key/value store, and authentication/authorization session management, but in some cases they can also be used for content negotiation (as is the case with Twitter). When cookies are used for content negotiation, the server should advertise it in the "Vary" header, but Twitter does not. Accommodating cookies at capture/crawl time, but not utilizing them at replay time has this consequence of cookie violations, resulting in defaced composite mementos. Similarly, in aggregated personal web archiving, which is the PhD research topic of Mat Kelly, not utilizing session cookies (or other forms of authorization headers) at replay time can result in a serious security vulnerability of private content leakage.
In modern headless browser-based crawlers there might even be some custom headers that a site utilizes in XHR (or fetch API) for content negotiation, which should be considered when indexing the content for replay (or filtering at replay time from a subset). Ideally, a web archive should behave like an HTTP proxy/cache when it comes to content negotiation, but it may not always be feasible.
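To make the proxy/cache analogy concrete, here is a minimal sketch of how an HTTP cache derives a lookup key from the request headers named in a response's "Vary" header. The `cache_key` function is hypothetical and heavily simplified (no header normalization, no handling of "Vary: *"); it only shows why a missing "Vary: Cookie" makes two differently negotiated responses indistinguishable.

```python
def cache_key(uri, request_headers, vary):
    """Build a lookup key from the URI plus the request headers named in Vary."""
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return (uri, tuple((n, lowered.get(n, "")) for n in names))


uri = "https://twitter.com/i/trends?k=&pc=true"
english = {"Cookie": "lang=en"}
urdu = {"Cookie": "lang=ur"}

# Without "Vary: Cookie" both requests collapse onto the same key.
print(cache_key(uri, english, "") == cache_key(uri, urdu, ""))
# With "Vary: Cookie" the two representations stay separate.
print(cache_key(uri, english, "Cookie") == cache_key(uri, urdu, "Cookie"))
```

A replay index keyed only on the canonicalized URI and capture datetime behaves like the first case: it has no way to tell the English capture from the Urdu one.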

What Should We Do About It?


So, should we include cookies in the replay index and only return a memento if the cookies in the request headers match? Well, that will be a disaster as it will cause an enormous number of false negatives (i.e., mementos that are present in an archive and should be returned, but won't). Perhaps we can canonicalize cookies and only index ones that are authentication/authorization session-related or used for content negotiation. However, identifying such cookies will be difficult and will require some heuristic analysis or machine learning, because these are opaque strings and their names are decided by the server application (rather than using any standardized names).

Even if we can somehow sort this issue out, there are even bigger problems in making it work. For example, how do we get the client to send suitable cookies in the first place? How will the web archive know when to send a "Set-Cookie" header? Should the client follow the exact path of interactions with pages as the crawler did when a set of pages was captured in order to set appropriate cookies?

Let's ignore session cookies for now and only focus on the content negotiation related cookies. Also, let's relax the cookie matching condition further by only filtering mementos if a Cookie header is present in a request, otherwise ignore cookies from the index. This means the replay system can send a Set-Cookie header if the memento in question was originally observed with a Set-Cookie header and expect to see it in the subsequent requests. Sounds easy? Welcome to the cookie collision hell. Cookies from various domains will need to be rewritten to set the domain name of the web archive that is serving the memento. As a result, the same cookie names from various domains served over time from the same archive will step over each other (it's worth mentioning that often a single web page has page requisites from many different domains). Even the web archive can have some of its own cookies independent of the memento being served.
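The collision can be shown with a toy cookie jar. This sketch assumes the usual browser storage rule that a cookie is identified by its name, domain, and path; the `store` and `rewrite_for_archive` helpers are hypothetical, and the second cookie from example.org is made up purely for illustration.

```python
def store(jar, cookie):
    """Store a cookie; an entry with the same (name, domain, path) is overwritten."""
    jar[(cookie["name"], cookie["domain"], cookie["path"])] = cookie["value"]
    return jar


def rewrite_for_archive(cookie, archive_host="web.archive.org"):
    """Rewrite only the domain so the browser sends the cookie to the archive."""
    return {**cookie, "domain": archive_host}


observed = [
    {"name": "lang", "value": "ur", "domain": "twitter.com", "path": "/"},
    # Hypothetical cookie from an unrelated site that happens to use the same name.
    {"name": "lang", "value": "de", "domain": "example.org", "path": "/"},
]

live_jar, archive_jar = {}, {}
for cookie in observed:
    store(live_jar, cookie)
    store(archive_jar, rewrite_for_archive(cookie))

print(len(live_jar), len(archive_jar))
```

On the live web the two cookies coexist, but once both are rewritten to the archive's domain the later one silently replaces the earlier one.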

We can attempt to solve this collision issue by rewriting the path of cookies and prefixing it with the original domain name to limit the scope (e.g., change
"Set-Cookie: lang=ur; Domain=twitter.com; Path=/"
to
"Set-Cookie: lang=ur; Domain=web.archive.org; Path=/twitter.com/"
). This is not going to work because the client will not send this cookie unless the requested URI-M path has a prefix of "/twitter.com/", but the root path of Twitter is usually rewritten as something like "/web/20190214075028/https://twitter.com/" instead. If the same rewriting rule is used in the cookie path, then the unique 14-digit datetime path segment will block it from being sent with subsequent requests that have a different datetime (which is almost always the case after an initial redirect). Unfortunately, the cookie path does not support wildcard paths like "/web/*/https://twitter.com/".
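The failure can be checked against the cookie path-matching rule from RFC 6265 (section 5.1.4). The `path_matches` function below is my own small transcription of that rule for illustration, not code taken from any browser or replay system; the request paths reuse URI-Ms mentioned in this post.

```python
def path_matches(request_path, cookie_path):
    """Cookie path-match as described in RFC 6265, section 5.1.4."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # The prefix must end at a "/" boundary to count as a match.
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


timeline = "/web/20190214075028/https://twitter.com/ibnesayeed"
trends = "/web/20190211132145/https://twitter.com/i/trends"

# A domain-prefixed path never matches a Wayback-style URI-M path.
print(path_matches(timeline, "/twitter.com/"))
# A datetime-specific path matches only requests with the same datetime.
print(path_matches(timeline, "/web/20190214075028/https://twitter.com/"))
print(path_matches(trends, "/web/20190214075028/https://twitter.com/"))
# A "*" in a cookie path is compared literally, so it is not a wildcard.
print(path_matches(timeline, "/web/*/https://twitter.com/"))
```

Only the second check succeeds, which is why neither rewriting strategy lets the cookie follow the page's lazily loaded resources across capture datetimes.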

Another possibility could be prefixing the name of the cookie with the original domain [and path] (with some custom encoding and unique-enough delimiters) then setting the path to the root of the replay (e.g., change the above example to
"Set-Cookie: twitter__com___lang=ur; Domain=web.archive.org; Path=/web/"
), which the replay server understands how to decode and apply properly. I am not aware of any other attributes of cookies that can be exploited to annotate with additional information. The downside of this approach is that if the client is relying on these cookies for certain functionalities then the changed name will affect them.
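One possible encoding, matching the "twitter__com___lang" example, is sketched below. The delimiter choices and the `encode_name` and `decode_name` helpers are assumptions for illustration only (no replay system is known to do this), and the scheme relies on hostnames not containing underscores.

```python
DOT = "__"         # assumed replacement for "." in the original domain
DELIMITER = "___"  # assumed separator between the encoded domain and the cookie name


def encode_name(domain, name):
    """Prefix a cookie name with its original domain."""
    return domain.replace(".", DOT) + DELIMITER + name


def decode_name(encoded):
    """Recover (domain, name); splits on the first delimiter only."""
    domain, sep, name = encoded.partition(DELIMITER)
    if not sep:
        raise ValueError("not an archive-prefixed cookie name: " + encoded)
    return domain.replace(DOT, "."), name


encoded = encode_name("twitter.com", "lang")
print("Set-Cookie: " + encoded + "=ur; Domain=web.archive.org; Path=/web/")
print(decode_name(encoded))
```

Splitting on the first delimiter keeps cookie names that themselves start with an underscore (such as "_twitter_sess") intact on the way back.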

Additionally, an archival replay system should also rewrite cookie expiration time to a short-lived future value (irrespective of the original value, which could be a value in the past or a very distant value in the future) otherwise the growing pile of cookies from many different pages will increase the request size significantly over time. Moreover, incorporating cookies in replay systems will have some consequences in cross-archive aggregated memento reconstruction.
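Rewriting the lifetime could look something like the sketch below, which drops any original "Expires" or "Max-Age" attribute and appends a short "Max-Age" instead. The `shorten_lifetime` helper and the 60-second default are assumptions chosen for illustration, not values used by any replay system.

```python
def shorten_lifetime(set_cookie_value, max_age=60):
    """Replace a cookie's original lifetime with a short-lived Max-Age."""
    parts = [p.strip() for p in set_cookie_value.split(";") if p.strip()]
    name_value, attributes = parts[0], parts[1:]
    kept = [
        attr for attr in attributes
        if attr.split("=", 1)[0].strip().lower() not in ("expires", "max-age")
    ]
    return "; ".join([name_value] + kept + ["Max-Age=" + str(max_age)])


original = "lang=ur; Expires=Tue, 31 Mar 1981 05:00:00 GMT; Path=/; Secure"
print(shorten_lifetime(original))
```

The name/value pair is always kept as the first element, so a cookie that happens to be named "expires" would not be dropped by mistake.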

In our previous post about another cookie related issue, we proposed that explicitly expiring cookies (and garbage collecting cookies older than a few seconds) may reduce the impact. We also proposed that distributing crawl jobs of the URIs from the same domain in smaller sandboxed instances could minimize the impact. I think these two approaches can be helpful in mitigating this mixed-language issue as well. However, it is worth noting that these are crawl-time solutions, which will not solve the replay issues of existing mementos.

Dissecting the Composite Memento


Now, back to the memento of Pratik's timeline from the Internet Archive. The page is archived primarily in Portuguese. When it is loaded in a web browser that can execute JavaScript, the page makes subsequent asynchronous requests to populate various blocks as it does on the live site. Recent media block is not archived, so it does not show up. Related users block is populated in Portuguese (because this block is generally populated immediately after the main page is loaded and does not get a chance to be updated later, hence, unlikely to load a version in a different language). The closest successful memento of the global trends data is loaded from
https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=module
(which is in English). As the page starts to poll for new tweets for the account, it first finds the closest memento at
https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856
URI-M in Urdu. This adds a notification bar above the main content area that suggests there are 20 new tweets available (clicking on this bar will insert those twenty tweets in the timeline as the necessary markup is already returned in the response, waiting for a user action). I found the behavior of the page to be inconsistent due to intermittent issues, but reloading the page a few times and waiting for a while helps. In the subsequent polling attempts the latent_count parameter changes from "0" to "20" (this suggests how many new tweets are loaded and ready to be inserted) and the min_position parameter changes from "1095942934640377856" to "1100819673937960960" (these are IDs of the most recent tweets loaded so far). Every other parameter generally remains the same in the successive XHR calls after every 30 seconds. If one waits for long enough on this page (while the tab is still active), occasionally another successful response arrives that updates the new tweets notification from 20 to 42 (but in a different language from Urdu). To see if there are any other clues that can explain why the banner was inserted in Urdu, I investigated the HTTP response as shown below (the payload is decoded, pretty-printed, and truncated for ease of inspection):

$ curl --silent -i "https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856"
HTTP/2 200 
server: nginx/1.15.8
date: Fri, 15 Mar 2019 04:25:14 GMT
content-type: text/javascript; charset=utf-8
x-archive-orig-status: 200 OK
x-archive-orig-x-response-time: 36
x-archive-orig-content-length: 995
x-archive-orig-strict-transport-security: max-age=631138519
x-archive-orig-x-twitter-response-tags: BouncerCompliant
x-archive-orig-x-transaction: 00becd1200f8d18b
x-archive-orig-x-content-type-options: nosniff
content-encoding: gzip
x-archive-orig-set-cookie: fm=0; Max-Age=0; Expires=Mon, 11 Feb 2019 13:21:45 GMT; Path=/; Domain=.twitter.com; Secure; HTTPOnly, _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCIRWidxoAToMY3NyZl9p%250AZCIlYzlmNGViODk4ZDI0YmI0NzcyMTMyMzA3M2M5ZTRjZDI6B2lkIiU2ODFi%250AZjgzYjMzYjEyYzk1NGNlMDlmYzRkNDIzZTY3Mg%253D%253D--22900f43bec575790847d2e75f88b12296c330bc; Path=/; Domain=.twitter.com; Secure; HTTPOnly
x-archive-orig-expires: Tue, 31 Mar 1981 05:00:00 GMT
x-archive-orig-server: tsa_a
x-archive-orig-last-modified: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report
x-archive-orig-x-connection-hash: bca4678d59abc86b8401176fd37858de
x-archive-orig-pragma: no-cache
x-archive-orig-cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
x-archive-orig-date: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-frame-options: 
cache-control: max-age=1800
x-archive-guessed-content-type: application/json
x-archive-guessed-encoding: utf-8
memento-datetime: Mon, 11 Feb 2019 13:21:45 GMT
link: <https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="original", <https://web.archive.org/web/timemap/link/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timegate", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="first memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="next memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="last memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org
x-archive-src: liveweb-20190211133005/liveweb-20190211132143-wwwb-spn01.us.archive.org.warc.gz
x-app-server: wwwb-app23
x-ts: ----
x-location: All
x-cache-key: httpsweb.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=moduleUS
x-page-cache: MISS

{
  "max_position": "1100819673937960960",
  "has_more_items": true,
  "items_html": "\n      <li class=\"js-stream-item stream-item stream-item\n\" data-item-id=\"1100648521127129088\"\nid=\"stream-item-tweet-1100648521127129088\"\ndata-item-type=\"tweet\"\n data-suggestion-json=\"{&quot;suggestion_details&quot;:{},&quot;tweet_ids&quot;:&quot;1100648521127129088&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}\"> ... [REDACTED] ... </li>",
  "new_latent_count": 20,
  "new_tweets_bar_html": "  <button class=\"new-tweets-bar js-new-tweets-bar\" data-item-count=\"20\" style=\"width:100%\">\n        دیکھیں 20 نئی ٹویٹس\n\n  </button>\n",
  "new_tweets_bar_alternate_html": []
}

While many web archives are good at exposing original response headers via X-Archive-Orig-* headers in mementos, I do not know of any web archive (yet) that exposes the corresponding original request headers as well (I propose using something like X-Archive-Request-Orig-* headers). By looking at the above response we can understand the structure of how the new tweets notification works on a Twitter timeline, but it does not answer why the response was in Urdu (as highlighted in the value of the "new_tweets_bar_html" JSON key). Based on my assessment and the experiment above, I think that the corresponding request would have carried a header like "Cookie: lang=ur" (for a cookie scoped to the twitter.com domain and the "/" path), which could be verified if the corresponding WARC file were available.
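A minimal sketch of how the replayed headers can be separated, assuming only the X-Archive-Orig-* naming convention visible in the response above. The function name `split_memento_headers` is hypothetical, and the sample is a few header lines copied from that response. It illustrates that the original response headers can be recovered from a memento, while nothing about the original request (such as its Cookie header) is present.

```python
# Hypothetical helper: separate the original (replayed) response headers,
# which carry the "x-archive-orig-" prefix, from the archive's own headers.
PREFIX = "x-archive-orig-"


def split_memento_headers(raw_headers):
    original, archive = {}, {}
    for line in raw_headers.strip().splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        name, _, value = line.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if name.startswith(PREFIX):
            original[name[len(PREFIX):]] = value
        else:
            archive[name] = value
    return original, archive


# A few header lines taken from the memento response shown above.
sample = """\
x-archive-orig-server: tsa_a
x-archive-orig-pragma: no-cache
x-archive-orig-date: Mon, 11 Feb 2019 13:21:45 GMT
cache-control: max-age=1800
memento-datetime: Mon, 11 Feb 2019 13:21:45 GMT
"""

original, archive = split_memento_headers(sample)
print("original response headers:", original)
print("archive headers:", archive)
# Request headers were never part of the replay, so no cookie shows up here.
print("cookie exposed:", "cookie" in original)
```

If an archive adopted something like the proposed X-Archive-Request-Orig-* headers, the same prefix-based split could recover the request side as well.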

Cookie Violations Experiment on the Live Site


Finally, I attempted to recreate this language hodgepodge on the live site on my own Twitter timeline. I followed the steps below and ended up with the page shown in Figure 6 (which contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have all 47 supported languages).

  1. Open your Twitter timeline in English by explicitly supplying the "?lang=en" query parameter in a browser tab (it can be an incognito window) without logging in; let's call it Tab A
  2. Open another tab in the same window and load your timeline without any "lang" query parameter (it should show your timeline in English); let's call it Tab B
  3. Switch to Tab A, change the value of the "lang" parameter to one of the 47 supported language codes, and load the page to update the "lang" cookie (which will be reflected in all the tabs of the same window)
  4. From a different browser (that does not share cookies with the above tabs) or device, log in to your Twitter account (if not logged in already) and retweet something
  5. Switch to Tab B and wait for a notification to appear suggesting one new tweet in the language selected in Tab A (it may take a little over 30 seconds)
  6. If you want to add more languages, click on the notification bar (which will insert the new tweet in the current language) and repeat from step 3; otherwise continue
  7. To see the global trends block of Tab B in a different language, perform step 3 with the desired language code, switch back to Tab B, and wait until it changes (it may take a little over 5 minutes)
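The steps above can be modeled with a toy simulation. This is only a sketch of the observed behavior, not Twitter's actual implementation: the classes `ToySite`, `Browser`, and `Tab`, and the block names, are all hypothetical. It assumes that a "lang" query parameter sets a "lang" cookie, that all tabs of a window share one cookie jar, and that lazily loaded blocks are fetched without a "lang" parameter.

```python
class ToySite:
    """Stand-in server: a 'lang' query parameter sets a cookie, and each
    response is rendered in the language the request asks for."""
    DEFAULT = "en"

    def handle(self, cookies, lang_param=None):
        set_cookie = {}
        if lang_param is not None:
            set_cookie["lang"] = lang_param
        # An explicit parameter wins; otherwise fall back to the cookie.
        lang = lang_param or cookies.get("lang", self.DEFAULT)
        return lang, set_cookie


class Browser:
    """One window: every tab shares the same cookie jar."""

    def __init__(self, site):
        self.site = site
        self.jar = {}

    def request(self, lang_param=None):
        lang, set_cookie = self.site.handle(dict(self.jar), lang_param)
        self.jar.update(set_cookie)
        return lang


class Tab:
    def __init__(self, browser):
        self.browser = browser
        self.blocks = {}  # block name -> language it was rendered in

    def load(self, lang_param=None):
        # A full page load replaces everything with a fresh template.
        self.blocks = {"template": self.browser.request(lang_param)}

    def poll(self, block):
        # Lazy request without a lang parameter: the shared cookie decides.
        self.blocks[block] = self.browser.request()


browser = Browser(ToySite())
tab_a, tab_b = Tab(browser), Tab(browser)

tab_a.load("en")               # step 1: Tab A sets lang=en
tab_b.load()                   # step 2: Tab B renders in English
tab_a.load("ur")               # step 3: Tab A switches the cookie to Urdu
tab_b.poll("new_tweets_bar")   # step 5: Tab B's notification arrives in Urdu
tab_a.load("pt")               # step 7: Tab A switches the cookie again
tab_b.poll("trends")           # Tab B's trends block arrives in Portuguese

print(tab_b.blocks)
# {'template': 'en', 'new_tweets_bar': 'ur', 'trends': 'pt'}
```

Tab B never reloads, so its template stays in the language of its initial load while each lazily loaded block reflects whatever the shared cookie says at the time of the request.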

Figure 6: Mixed language illustration on Twitter's live website. It contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have all 47 supported languages.


Conclusions


With the above experiment on the live site, I am confident in my assessment that a cookie violation could be one reason why a composite memento would be defaced. How common this issue is in Twitter's mementos and on other sites is still an open question. While I do not know of a silver-bullet solution to this issue yet, I think it can potentially be mitigated to some extent for future mementos by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling task for URLs of the same domain across many small sandboxed instances. Investigating options for filtering responses by matching cookies needs more rigorous research.
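The two mitigation ideas can be sketched as follows. This is an illustration under stated assumptions, not how any existing crawler works: `ShortLivedCookieJar` and `sandboxed_batches` are hypothetical names, the 60-second cap and batch size are arbitrary, and the URLs are placeholders.

```python
import time


class ShortLivedCookieJar:
    """Cookie jar that caps every cookie's lifetime at max_age seconds,
    regardless of the lifetime the server asked for."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age
        self.clock = clock
        self._store = {}  # (domain, name) -> (value, expires_at)

    def set(self, domain, name, value, server_max_age=None):
        if server_max_age is None:
            ttl = self.max_age
        else:
            ttl = min(server_max_age, self.max_age)
        self._store[(domain, name)] = (value, self.clock() + ttl)

    def get(self, domain):
        now = self.clock()
        # Drop expired cookies before answering.
        self._store = {k: v for k, v in self._store.items() if v[1] > now}
        return {n: v for (d, n), (v, _) in self._store.items() if d == domain}


def sandboxed_batches(urls, batch_size, max_age):
    """Yield (fresh cookie jar, batch of URLs) pairs so that cookies set
    while crawling one batch cannot leak into another."""
    for i in range(0, len(urls), batch_size):
        yield ShortLivedCookieJar(max_age), urls[i:i + batch_size]


# Demonstration with a fake clock so the expiry is deterministic.
now = [0.0]
jar = ShortLivedCookieJar(max_age=60, clock=lambda: now[0])
jar.set("twitter.com", "lang", "ur", server_max_age=86400)
before = jar.get("twitter.com")   # cookie is still valid
now[0] = 61.0
after = jar.get("twitter.com")    # cookie has expired after the cap
print(before, after)

# Placeholder URLs, split into small batches with independent jars.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
batches = list(sandboxed_batches(urls, batch_size=2, max_age=60))
print([batch for _, batch in batches])
```

Neither idea removes the problem entirely: a cookie set by one request can still affect later requests made within the same batch before the cap expires.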

--
Sawood Alam

2 comments:

  1. Google Chrome is exploring means to classify cookies for better transparency and control so that, for example, session cookies can be isolated from tracking cookies. Improving privacy and security on the web
