Monday, March 18, 2019

2019-03-18: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages

Figure 1: Mixed language blocks on a memento of a Twitter timeline, highlighted with a blue box for Portuguese, orange for English, and red for Urdu. A dotted border indicates the template present in the original HTML response while blocks with solid borders indicate lazily loaded content.

Would you be surprised if I were to tell you that Twitter is a multilingual website, supporting 47 different international languages? How about if I were to tell you that a typical Twitter timeline page can contain tweets in whatever languages the owner of the handle chooses to tweet in, but can also show the navigation bar and various sidebar blocks in many different languages simultaneously; surprised now? Well, while it makes no sense, it can actually happen in web archives when a memento of a timeline is accessed, as shown in Figure 1. Spoiler alert! Cookies are to be blamed, once again.

Last month, I was investigating a real-life version of "Ron Burgundy will read anything on the teleprompter (Anchorman)" and "Chatur's speech (3 Idiots)" moments when I noticed something that caught my eye. I was looking at a memento (i.e., a historical version of a web page) of Pratik Sinha's Twitter timeline from the Internet Archive. Pratik is the co-founder of Alt News (an Indian fact checking website) and the person who edited an internal document of the IT Cell of BJP (the current ruling party of India), which was then copy-pasted and tweeted by some prominent handles of the party. Tweets on his timeline are generally in English, but the archived page's template language was not English (although I did not request the page in any specific language). However, this was not surprising to me, as we had already investigated the reason behind this template language behavior last year and found that HTTP cookies were causing it. After spending a minute or so on the page, a small notice appeared in the main content area, right above the list of tweets, suggesting that there were 20 more tweets, but the message was in Urdu, a Right-to-Left (RTL) language, very different from the language used in the page's navigation bar. Urdu being my first language, I was immediately alerted that something was not quite right. Upon further investigation, I found that the page was composed of three different languages, Portuguese, English, and Urdu, as highlighted in Figure 1 (here I am not talking about the language of the tweets themselves).

What Can Deface a Composite Memento?


This defaced composite memento is a serious archival replay problem, as it shows a web page that perhaps never existed. While the individual representations all separately existed on the live web, they were never combined in the page as it is replayed by the web archive. In the Web Science and Digital Libraries Research Group, we have uncovered a couple of causes in the past that can yield defaced composite mementos. One of them is live-leakage (also known as Zombies), for which Andy Jackson proposed the use of Content-Security-Policy, Ada Lerner et al. took a security-centric approach that was deployed by the Internet Archive's Wayback Machine, and we proposed Reconstructive as a potential solution using Service Workers. The other known cause is temporal violations, on which Scott Ainsworth is working as his PhD research. However, this mixed-language Twitter timeline issue cannot be explained by either zombies or temporal violations.

Anatomy of a Twitter Timeline


To uncover the cause, I further investigated the anatomy of a Twitter timeline page and the various network requests it makes when accessed live or from a web archive, as illustrated in Figure 2. Currently, when a Twitter timeline is loaded anonymously (without logging in), the page is returned with a block containing a brief description of the user, a navigation bar (containing a summary of the number of tweets, followers, etc.), a sidebar block to encourage visitors to create a new account, and an initial set of tweets. The page also contains empty placeholders for some sidebar blocks such as related users to follow, globally trending topics, and recent media posted on that timeline. Apart from loading page requisites, the page also makes some follow-up XHR requests to populate these blocks. When the page is active (i.e., the browser tab is focused) it polls for new tweets every 30 seconds and for global trends every 5 minutes. Successful responses to these asynchronous XHR requests contain data in JSON format, but instead of providing language-independent, bare-bones structured data to rendering templates on the client side, they contain server-side rendered, encoded markup, which is decoded on the client side and injected directly into the corresponding empty placeholder (replacing any existing content), after which the block is made visible. This server-side partial markup rendering needs to know the language of the parent page in order to utilize phrases translated into the corresponding language to yield a consistent page.

Figure 2: An active Twitter timeline page asynchronously populates related users and recent media blocks then polls for new tweets every 30 seconds and global trends every 5 minutes.

How Does Twitter's Language Internationalization Work?


From our past investigation we know that Twitter handles languages in two primary ways: a query parameter and a cookie. In order to fetch a page in a specific language (from their 47 currently supported languages) one can either add a "?lang=<language-code>" query parameter to the URI (e.g.,
https://twitter.com/ibnesayeed?lang=ur
for Urdu) or send a Cookie header containing the "lang=<language-code>" name/value pair. A URI query parameter takes precedence in this case and also sets the "lang" cookie accordingly (overwriting any existing value) for all subsequent requests until it is explicitly overwritten again. This works well on the live site, but it has some unfortunate consequences when a memento of a Twitter timeline is replayed from a web archive, causing the hodgepodge illustrated in Figure 1 (the area highlighted with a dotted border indicates the template served in the initial HTML response while areas surrounded with solid borders were lazily loaded). This mixed-language rendering does not happen when a memento of a timeline is loaded with an explicit language query parameter in the URI, as illustrated in Figures 3, 4, and 5 (the "lang" query parameter is highlighted in the archival banner, as are the lazily loaded blocks from each language that correspond to the blocks in Figure 1). In this case, all the subsequent XHR URIs also contain the explicit "lang" query parameter.

Figure 3: A memento of a Twitter timeline explicitly in Portuguese.

Figure 4: A memento of a Twitter timeline explicitly in English.

Figure 5: A memento of a Twitter timeline explicitly in Urdu. The direction of the page is Right-to-Left (RTL), as a result, sidebar blocks are moved to the left hand side.

To understand the issue, consider the following sequence of events during the crawling of a Twitter timeline page. Suppose we begin a fresh crawling session and start by fetching the https://twitter.com/ibnesayeed page without any specific language code supplied. Depending on the geo-location of the crawler or other factors, Twitter might return the page in a specific language, for instance, in English. The crawler extracts links to all the page requisites and hyperlinks and adds them to the frontier queue. The crawler may also attempt to extract URIs of potential XHR or other JS-initiated requests, which might add URIs like:
https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module
and
https://twitter.com/i/related_users/28631536
(and various other lazily loaded resources) to the frontier queue. The HTML page also contains 47 language-specific alternate links (and one x-default hreflang) in its markup (with "?lang=<language-code>" style parameters). These alternate links will also be added to the frontier queue of the crawler in some order. When these language-specific links are fetched by the crawler, the lang cookie will be set, overwriting any prior value. Now, suppose https://twitter.com/ibnesayeed?lang=ur is fetched before the "/i/trends" data; it would set the language for any subsequent requests to be served in Urdu. When the data for the global trends block is fetched, Twitter's server will return server-side rendered markup in Urdu, which will be injected into the page that was initially served in English. This will cause the header of the block to say "دنیا بھر کے میں رجحانات" instead of "Worldwide trends". Here, I would take a long pause of silence to express my condolences on the brutal murder of a language with more than 100 million speakers worldwide by a platform as big as Twitter. The Urdu translation of this phrase, appearing in such a prominent place on the page, is nonsense and grammatically wrong. Twitter, if you are listening, please change it to something like "عالمی رجحانات" and get an audit of other translated phrases. Now, back to the original problem, following is a walk-through of the scenario described above.

$ curl --silent "https://twitter.com/ibnesayeed" | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">
$ curl --silent -c /tmp/twitter.cookie "https://twitter.com/ibnesayeed?lang=ur" | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">
$ grep lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0 lang ur
$ curl --silent -b /tmp/twitter.cookie "https://twitter.com/ibnesayeed" | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">
$ curl --silent -b /tmp/twitter.cookie "https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module" | jq .
{
  "module_html": "<div class=\"flex-module trends-container context-trends-container\">\n  <div class=\"flex-module-header\">\n    \n    <h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>\n  </div>\n  <div class=\"flex-module-inner\">\n    <ul class=\"trend-items js-trends\">\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#PiDay\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/PiDay?src=tren&amp;data_id=tweet%3A1106214111183020034\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#PiDay</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Google employee sets new record for calculating π to 31.4 trillion digits\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#SaveODAAT\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/SaveODAAT?src=tren&amp;data_id=tweet%3A1106252880921747457\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#SaveODAAT</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Netflix cancels One Day at a Time after three seasons\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Beto\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=Beto&amp;src=tren&amp;data_id=tweet%3A1106142158023786496\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Beto</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Beto O’Rourke announces 2020 presidential bid\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#AvengersEndgame\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/AvengersEndgame?src=tren&amp;data_id=tweet%3A1106169765830295552\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#AvengersEndgame</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Marvel dropped a new Avengers: Endgame trailer\n        </div>\n    
</a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"12 Republicans\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%2212%20Republicans%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">12 Republicans</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          6,157 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#NationalAgDay\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/NationalAgDay?src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#NationalAgDay</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          6,651 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Kyle Guy\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%22Kyle%20Guy%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Kyle Guy</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          1,926 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#314Day\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/314Day?src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#314Day</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          12 ہزار ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Tillis\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=Tillis&amp;src=tren&amp;data_id=tweet%3A1106266707230777344\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" 
dir=\"ltr\">Tillis</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Senate votes to block Trump&#39;s border emergency declaration\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Bikers for Trump\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%22Bikers%20for%20Trump%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Bikers for Trump</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          16.8 ہزار ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n    </ul>\n  </div>\n</div>\n",
  "personalized": false,
  "woeid": 1
}

Here, I started by fetching my Twitter timeline without specifying any language in the URI or via cookies. The response was returned in English. I then fetched the same page with an explicit "?lang=ur" query parameter and saved any returned cookies in the "/tmp/twitter.cookie" file. We illustrated that the response was indeed returned in Urdu. We then checked the saved cookie file to see if it contains a "lang" cookie, which it does, with a value of "ur". We then utilized the saved cookie file to fetch the main timeline page again, but without an explicit "?lang=ur" query parameter, to illustrate that Twitter's server respects the cookie and returns the response in Urdu. Finally, we fetched the global trends data while utilizing the saved cookies and illustrated that the response contains JSON-serialized HTML markup with Urdu header text in it as
"<h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>"
under the "module_html" JSON key. The original response is encoded using Unicode escapes, but we used jq utility here to pretty-print JSON and decode escaped markup for easier illustration.

Understanding Cookie Violations


When fetching a single page (and all its page requisites) at a time, this problem, let's name it a cookie violation, might not happen as often. However, when crawling is done at a large scale, completely preventing such unfortunate misalignment of the frontier queue is almost impossible, especially since the "lang" cookie is set for the root path of the domain and affects every resource from the domain.

The root cause here can more broadly be described as lossy state information being utilized when replaying a stateful resource representation from archives that originally performed content negotiation based on cookies or other headers. Most of the popular archival replay systems (e.g., OpenWayback, PyWB, and even our own InterPlanetary Wayback) do not perform any content negotiation when serving a memento other than on the Accept-Datetime header (which is not part of the original crawl-time interaction, but a means to add the time dimension to the web). Traditional archival crawlers (such as Heritrix) mostly interacted with web servers by using only URIs, without any custom request headers that might affect the returned response. This means that, generally, a canonicalized URI along with the datetime of the capture was sufficient to identify a memento. However, cookies are an exception to this assumption as they are needed for some sites to behave properly, hence cookie management support was added to these crawlers a long time ago. Cookies can be used for tracking, client-side configuration, key/value storage, and authentication/authorization session management, but in some cases they can also be used for content negotiation (as is the case with Twitter). When cookies are used for content negotiation, the server should advertise it in the "Vary" header, but Twitter does not. Accommodating cookies at capture/crawl time, but not utilizing them at replay time, has the consequence of cookie violations, resulting in defaced composite mementos. Similarly, in aggregated personal web archiving, which is the PhD research topic of Mat Kelly, not utilizing session cookies (or other forms of authorization headers) at replay time can result in a serious security vulnerability of private content leakage. In modern headless browser-based crawlers there might even be some custom headers that a site utilizes in XHR (or fetch API) requests for content negotiation, which should be considered when indexing the content for replay (or filtering at replay time from a subset). Ideally, a web archive should behave like an HTTP proxy/cache when it comes to content negotiation, but that may not always be feasible.

What Should We Do About It?


So, should we include cookies in the replay index and only return a memento if the cookies in the request headers match? Well, that would be a disaster, as it would cause an enormous number of false negatives (i.e., mementos that are present in an archive and should be returned, but won't be). Perhaps we can canonicalize cookies and only index the ones that are related to authentication/authorization sessions or used for content negotiation. However, identifying such cookies will be difficult and will require some heuristic analysis or machine learning, because these are opaque strings and their names are decided by the server application (rather than following any standardized names).
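As a rough illustration of what such selective indexing could look like (this is only a sketch, not how any existing replay system works; the whitelist of negotiation cookies and the key format are hypothetical), the lookup key could fold in only the cookies believed to affect content negotiation:

import hashlib

# Illustrative whitelist; in practice this would come from heuristics,
# machine learning, or a "Vary: Cookie" hint from the origin server.
NEGOTIATION_COOKIES = {"lang"}

def replay_lookup_key(canonical_uri, request_cookies):
    # Extend the usual (canonicalized URI, datetime) key with a digest of
    # the content-negotiation cookies, ignoring session and tracking cookies.
    negotiated = sorted((k, v) for k, v in request_cookies.items()
                        if k in NEGOTIATION_COOKIES)
    if not negotiated:
        return canonical_uri
    digest = hashlib.sha1(repr(negotiated).encode("utf-8")).hexdigest()[:16]
    return f"{canonical_uri}!{digest}"

# Example: the same trends URI captured with lang=ur and lang=en would map
# to two different index keys, so replay could pick the matching variant.
print(replay_lookup_key("com,twitter)/i/trends", {"lang": "ur", "_twitter_sess": "..."}))
print(replay_lookup_key("com,twitter)/i/trends", {"lang": "en"}))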

Even if we can somehow sort this issue out, there are even bigger problems in making it work. For example, how do we get the client to send suitable cookies in the first place? How will the web archive know when to send a "Set-Cookie" header? Should the client follow the exact path of interactions with pages that the crawler did when a set of pages was captured in order to set appropriate cookies?

Let's ignore session cookies for now and only focus on the content negotiation related cookies. Also, let's relax the cookie matching condition further by only filtering mementos if a Cookie header is present in a request, otherwise ignoring cookies from the index. This means the replay system can send a Set-Cookie header if the memento in question was originally observed with a Set-Cookie header and expect to see it in the subsequent requests. Sounds easy? Welcome to the cookie collision hell. Cookies from various domains will need to be rewritten to set the domain name of the web archive that is serving the memento. As a result, the same cookie names from various domains served over time from the same archive will step over each other (it's worth mentioning that often a single web page has page requisites from many different domains). Even the web archive can have some of its own cookies independent of the memento being served.

We can attempt to solve this collision issue by rewriting the path of cookies and prefixing it with the original domain name to limit the scope (e.g., change
"Set-Cookie: lang=ur; Domain: twitter.com; Path=/"
to
"Set-Cookie: lang=ur; Domain: web.archive.org; Path=/twitter.com/"
). This is not going to work because the client will not send this cookie unless the requested URI-M path has a prefix of "/twitter.com/", but the root path of Twitter is usually rewritten as something like "/web/20190214075028/https://twitter.com/" instead. If the same rewriting rule is used for the cookie path, then the unique 14-digit datetime path segment will prevent the cookie from being sent with subsequent requests that have a different datetime (which is almost always the case after an initial redirect). Unfortunately, the cookie Path attribute does not support wildcards like "/web/*/https://twitter.com/".

Another possibility could be prefixing the name of the cookie with the original domain [and path] (with some custom encoding and unique-enough delimiters) and then setting the path to the root of the replay (e.g., change the above example to
"Set-Cookie: twitter__com___lang=ur; Domain: web.archive.org; Path=/web/"
), which the replay server understands how to decode and apply properly. I am not aware of any other cookie attributes that can be exploited to carry additional information. The downside of this approach is that if the client relies on these cookies for certain functionality, then the changed names will affect it.

Additionally, an archival replay system should also rewrite cookie expiration times to a short-lived future value (irrespective of the original value, which could be in the past or in the very distant future), otherwise the growing pile of cookies from many different pages will significantly increase the request size over time. Moreover, incorporating cookies in replay systems will have some consequences for cross-archive aggregated memento reconstruction.
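Here is a minimal sketch of what such a rewrite could look like, combining the name-prefixing idea with a short-lived expiry (this is only an illustration; a real replay system would need a proper Set-Cookie parser and careful handling of the remaining attributes):

from datetime import datetime, timedelta, timezone

def rewrite_set_cookie(name, value, original_domain,
                       archive_domain="web.archive.org",
                       replay_root="/web/", ttl_seconds=300):
    # Prefix the cookie name with an encoded form of its original domain,
    # scope it to the replay root, and force a short expiration.
    prefixed_name = original_domain.replace(".", "__") + "___" + name
    expires = datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)
    return "Set-Cookie: {}={}; Domain={}; Path={}; Expires={}".format(
        prefixed_name, value, archive_domain, replay_root,
        expires.strftime("%a, %d %b %Y %H:%M:%S GMT"))

# rewrite_set_cookie("lang", "ur", "twitter.com") would yield something like
# "Set-Cookie: twitter__com___lang=ur; Domain=web.archive.org; Path=/web/; Expires=..."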

In our previous post about another cookie-related issue, we proposed that explicitly expiring cookies (and garbage collecting cookies older than a few seconds) may reduce the impact. We also proposed that distributing crawl jobs for URIs from the same domain across smaller sandboxed instances could minimize the impact. I think these two approaches can be helpful in mitigating this mixed-language issue as well. However, it is worth noting that these are crawl-time solutions, which will not solve the replay issues of existing mementos.

Dissecting the Composite Memento


Now, back to the memento of Pratik's timeline from the Internet Archive. The page is archived primarily in Portuguese. When it is loaded in a web browser that can execute JavaScript, the page makes subsequent asynchronous requests to populate various blocks, as it does on the live site. The recent media block is not archived, so it does not show up. The related users block is populated in Portuguese (because this block is generally populated immediately after the main page is loaded and does not get a chance to be updated later, it is unlikely to load a version in a different language). The closest successful memento of the global trends data is loaded from
https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=module
(which is in English). As the page starts to poll for new tweets for the account, it first finds the closest memento at
https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856
(a URI-M in Urdu). This adds a notification bar above the main content area that suggests there are 20 new tweets available (clicking on this bar will insert those twenty tweets in the timeline, as the necessary markup is already returned in the response, waiting for a user action). I found the behavior of the page to be inconsistent due to intermittent issues, but reloading the page a few times and waiting for a while helps. In the subsequent polling attempts the latent_count parameter changes from "0" to "20" (this indicates how many new tweets are loaded and ready to be inserted) and the min_position parameter changes from "1095942934640377856" to "1100819673937960960" (these are IDs of the most recent tweets loaded so far). Every other parameter generally remains the same in the successive XHR calls made every 30 seconds. If one waits long enough on this page (while the tab is still active), occasionally another successful response arrives that updates the new tweets notification from 20 to 42 (but in a language other than Urdu). To see if there are any other clues that can explain why the banner was inserted in Urdu, I investigated the HTTP response as shown below (the payload is decoded, pretty-printed, and truncated for ease of inspection):

$ curl --silent -i "https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856"
HTTP/2 200 
server: nginx/1.15.8
date: Fri, 15 Mar 2019 04:25:14 GMT
content-type: text/javascript; charset=utf-8
x-archive-orig-status: 200 OK
x-archive-orig-x-response-time: 36
x-archive-orig-content-length: 995
x-archive-orig-strict-transport-security: max-age=631138519
x-archive-orig-x-twitter-response-tags: BouncerCompliant
x-archive-orig-x-transaction: 00becd1200f8d18b
x-archive-orig-x-content-type-options: nosniff
content-encoding: gzip
x-archive-orig-set-cookie: fm=0; Max-Age=0; Expires=Mon, 11 Feb 2019 13:21:45 GMT; Path=/; Domain=.twitter.com; Secure; HTTPOnly, _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCIRWidxoAToMY3NyZl9p%250AZCIlYzlmNGViODk4ZDI0YmI0NzcyMTMyMzA3M2M5ZTRjZDI6B2lkIiU2ODFi%250AZjgzYjMzYjEyYzk1NGNlMDlmYzRkNDIzZTY3Mg%253D%253D--22900f43bec575790847d2e75f88b12296c330bc; Path=/; Domain=.twitter.com; Secure; HTTPOnly
x-archive-orig-expires: Tue, 31 Mar 1981 05:00:00 GMT
x-archive-orig-server: tsa_a
x-archive-orig-last-modified: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report
x-archive-orig-x-connection-hash: bca4678d59abc86b8401176fd37858de
x-archive-orig-pragma: no-cache
x-archive-orig-cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
x-archive-orig-date: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-frame-options: 
cache-control: max-age=1800
x-archive-guessed-content-type: application/json
x-archive-guessed-encoding: utf-8
memento-datetime: Mon, 11 Feb 2019 13:21:45 GMT
link: <https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="original", <https://web.archive.org/web/timemap/link/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timegate", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="first memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="next memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="last memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org
x-archive-src: liveweb-20190211133005/liveweb-20190211132143-wwwb-spn01.us.archive.org.warc.gz
x-app-server: wwwb-app23
x-ts: ----
x-location: All
x-cache-key: httpsweb.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=moduleUS
x-page-cache: MISS

{
  "max_position": "1100819673937960960",
  "has_more_items": true,
  "items_html": "\n      <li class=\"js-stream-item stream-item stream-item\n\" data-item-id=\"1100648521127129088\"\nid=\"stream-item-tweet-1100648521127129088\"\ndata-item-type=\"tweet\"\n data-suggestion-json=\"{&quot;suggestion_details&quot;:{},&quot;tweet_ids&quot;:&quot;1100648521127129088&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}\"> ... [REDACTED] ... </li>",
  "new_latent_count": 20,
  "new_tweets_bar_html": "  <button class=\"new-tweets-bar js-new-tweets-bar\" data-item-count=\"20\" style=\"width:100%\">\n        دیکھیں 20 نئی ٹویٹس\n\n  </button>\n",
  "new_tweets_bar_alternate_html": []
}

While many web archives are good at exposing original response headers via X-Archive-Orig-* headers in mementos, I don't know of any web archive (yet) that exposes the corresponding original request headers as well (I propose using something like X-Archive-Request-Orig-* headers, e.g., "X-Archive-Request-Orig-Cookie: lang=ur"). By looking at the above response we can understand the structure of how the new tweets notification works on a Twitter timeline, but it does not answer why the response was in Urdu (as highlighted in the value of the "new_tweets_bar_html" JSON key). Based on my assessment and the experiment above, I think that the corresponding request should have had a header like "Cookie: lang=ur", which could be verified if the corresponding WARC file were available.

Cookie Violations Experiment on the Live Site


Finally, I attempted to recreate this language hodgepodge on the live site on my own Twitter timeline. I followed the steps below and ended up with the page shown in Figure 6 (which contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have had all 47 supported languages).

  1. Open your Twitter timeline in English by explicitly supplying "?lang=en" query parameter in a browser tab (it can be an incognito window) without logging in, let's call it Tab A
  2. Open another tab in the same window and load your timeline without any "lang" query parameter (it should show your timeline in English), let's call it Tab B
  3. Switch to Tab A and change the value of the "lang" parameter to one of the 47 supported language codes and load the page to update the "lang" cookie (which will be reflected in all the tabs of the same window)
  4. From a different browser (that does not share cookies with the above tabs) or device, log into your Twitter account (if not logged in already) and retweet something
  5. Switch to Tab B and wait for a notification to appear suggesting one new tweet in the language selected in Tab A (it may take a little over 30 seconds)
  6. If you want to add more languages then click on the notification bar (which will insert the new tweet in the current language) and repeat from step 3 otherwise continue
  7. To see the global trends block of Tab B in a different language perform step 3 with the desired language code, switch back to Tab B, and wait until it changes (it may take a little over 5 minutes)

Figure 6: Mixed language illustration on Twitter's live website. It contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have all 47 supported languages.


Conclusions


With the above experiment on the live site, I am confident in my assessment that a cookie violation could be one reason why a composite memento would be defaced. How common this issue is in Twitter's mementos and on other sites is still an open question. While I do not know of a silver-bullet solution to this issue yet, I think it can potentially be mitigated to some extent for future mementos by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling task for URLs of the same domain across many small sandboxed instances. Investigating options for filtering responses by matching cookies needs more rigorous research.

--
Sawood Alam

Tuesday, March 5, 2019

2019-03-05: 365 dots in 2018 - top news stories of 2018

Fig. 1: News stories for 365 days in 2018. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all the 144 story graphs for a given day. The x-axis represents time, the y-axis represents the average degree of the selected GCC. Click to expand image.

There was no shortage of big news headlines in 2018. Amidst this abundance, a natural question is: what were the top news stories of 2018? There are multiple lists from different news organizations that present candidate top stories of 2018, such as CNN's most popular stories and videos of 2018, and The year in review: Top news stories of 2018 month by month from CBS. Even though such lists from respectable news organizations pass the "seems right test," they mostly present the top news stories without explaining the process behind them. In other words, they often do not state why one story made the list and why another did not. We consider such information very helpful for two reasons. First, an explanation or presentation of the criteria for why a story made the list opens the criteria to critique and helps alleviate concerns about bias. Second, the criteria are inherently valuable because they could be reused and reapplied to a different collection. For example, one could apply the process to find the top news stories in a different country.

Fortunately, StoryGraph is well suited to answer our main question, "what were the top news stories of 2018?"

A brief introduction to StoryGraph
We plan to publish a blog post introducing and explaining StoryGraph in the near future. However, here is a quick explanation. StoryGraph is a service that periodically (at 10-minute intervals) generates a news similarity graph. In this graph, the nodes represent news stories and an edge between a pair of nodes represents a high degree of similarity between the nodes (news stories). For example, this story, In eulogy, Obama says McCain called on nation to be ‘bigger’ than politics ‘born in fear’, is highly similar (similarity: 0.46) to this story, John McCain honored at National Cathedral memorial service; therefore, an edge exists between both stories in their parent story graph.
Small stories vs big stories: how StoryGraph quantifies the magnitude of news stories
On a slow news day, news organizations report on multiple different news stories. This results in a low degree of similarity between pairs of news stories (e.g., Fig. 2) and in smaller connected components.
In contrast, shortly after a major news event, news organizations publish multiple highly similar news stories. This results in a high degree of similarity between pairs of news stories. This often leads to a Giant Connected Component (GCC) in the news story graph  (e.g., Fig. 3).
Fig. 2: A small news story exhibits lower pairwise node similarity and a lower Giant Connected Component average degree (4) compared to a big news story (e.g., 17.03 in Fig. 3).
In short, the larger the average degree of a Giant Connected Component of a story graph, the bigger the news event, and vice versa.
StoryGraph generates 144 graphs (1 per 10 minutes) for a single day; this means there are 144 candidate news graphs (duplicate stories included) to pick from when selecting the top news story for a given day. The following steps were applied in order to select the top news story for a single day. First, each story graph was awarded a score derived from the average degree of the giant connected component in the graph. Second, from the set of 144 story graphs, the graph with the highest score, i.e., the graph whose giant connected component has the highest average degree, was selected to represent the top news story for the day. Steps 1 and 2 were applied across 365 days to generate the scores plotted in Fig. 1. Fig. 1 captures the top news stories for each of the 365 days in 2018.
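For readers who want to reproduce this selection, here is a minimal sketch of the two steps, assuming the 144 story graphs of a day are available as NetworkX graphs (the function names are illustrative, not StoryGraph's actual code):

import networkx as nx

def gcc_average_degree(graph):
    # Score a story graph by the average degree of its largest connected component.
    gcc_nodes = max(nx.connected_components(graph), key=len)
    gcc = graph.subgraph(gcc_nodes)
    degrees = [d for _, d in gcc.degree()]
    return sum(degrees) / len(degrees)

def top_story_graph(day_graphs):
    # day_graphs: the 144 story graphs generated for one day.
    return max(day_graphs, key=gcc_average_degree)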
The top news stories of 2018
From Fig. 1 and the table below, it is clear that the Kavanaugh hearings were the biggest news story of 2018, with a GCC avg. degree of 25.85. In fact, this story was the top news story for about 25 days (red dots in Fig. 1). Also, the top story had three sibling story graphs with GCC avg. degrees (18.96 - 21.94) higher than that of the second top story.

Rank Date (MM-DD) News Story (Selected) Title GCC Avg. Deg
1 09-27 Kavanaugh accuser gives vivid details of alleged assault - CNN Video 25.85
2 02-02 Disputed GOP-Nunes memo released - CNNPolitics 18.81
3 06-12 Kim Jong Un, Trump participate in signing ceremony, make history: Live updates - ABC News 18.15
4 10-24 Clinton and Obama bombs: Secret Service intercepts suspicious packages - The Washington Post 17.03
5 03-17 Trump celebrates McCabe firing on Twitter - CNN Video 16.32
6 06-14 DOJ IG report 'reaffirmed' Trump's 'suspicions' of bias of some in FBI: White House - ABC News 15.63
7 08-29 A Black Progressive and a Trump Acolyte Win Florida Governor Primaries - The New York Times 15.37
8 04-14 Trump orders strike on Syria in response to chemical attack - ABC News 15.21
9 02-25 Trump calls Schiff a 'bad guy,' Democratic memo 'a nothing' - CNNPolitics 15.13
10 11-07 John Brennan Warns After Sessions’ Firing: ‘Constitutional Crisis Very Soon’ - Breitbart 14.88

The next major news story of 2018 was the news surrounding the release of the Nunes memo (GCC avg. degree: 18.81). Similar to the Kavanaugh hearings, the story about the release of the controversial memo was the top news story for seven days. In contrast, the story about Schiff's rebuttal memo did not receive as much attention, ranking 9th with a GCC avg. degree of 15.13. In third place was the Trump-Kim summit with a GCC avg. degree of 18.15. Unlike the top two stories, this story, although initially big, did not linger beyond two days. This is an example of a big news story that lacked staying power.

Multiple news stories in our list were included in the lists of top stories from other news organizations such as CNN, CBS, NBC News, and Business Insider. For example, the Kavanaugh hearings (No. 1), the Trump-Kim summit (No. 3), the pipe bomber (No. 4), and the midterm elections (No. 7) were included in multiple top news lists. However, to our surprise, the MSD shooting news story (GCC avg. degree of 7.74) was not in our list of top 10 news stories, even though it appeared in multiple top news lists from multiple news organizations. Also, the Nunes memo story (No. 2) was our second top story, but it was absent from the lists of top news stories of the four major news organizations we considered.

President Trump was a dominant figure in the 2018 news discourse. As shown in Fig. 1, out of the 365 days, "Trump" was included in the title representing the day's top story graph 197 times (~54%).

-- Alexander Nwala (@acnwala)

Thursday, February 14, 2019

2019-02-14: CISE papers need a shake -- spend more time on the data section



I know this is a large topic and I may not have enough evidence to convince everyone, but based on my experience reviewing journal articles and conference proceedings, I strongly feel that computer and information science and engineering (CISE) papers need to devote more text to describing and analyzing their data.

This argument partially comes from my background in astronomy and astrophysics. Astronomers and astrophysicists usually spend a huge chunk of text in their papers talking about the data they adopt, including but not limited to where the data are collected, why they do not use another dataset, how the raw data are pre-processed, and careful justification of why they rule out outliers. They also analyze the data and report statistical properties, trends, or biases to ensure that they are using legitimate points in their plots.

In contrast, in many papers I have read and reviewed, even in top conferences, CISE people do not often do such work. They usually assume that because the datasets were used before, they can use them. Many emphasize the size of the data, but few look into the structure, completeness, taxonomy, noise, and potential outliers in the data. The consequence is that they spend a lot of space on algorithms and report results better than baselines, but that is not a guarantee of anything. Good CISE papers usually discuss the bias and potential risks caused by the data, but good papers are rare, even in top conferences.

Algorithms are one of the pillars of CISE, but that does not mean they are everything. An algorithm only provides the framework, like a photo frame. Data is like the photo. Without the right photo, the picture (frame + photo) will not look pleasing. Even if it looks pleasing for a particular photo, it won't for other photos. Of course, no algorithm will fit all data, but at least the paper should discuss what types of data the algorithm should be applied to.

The good news is that many CISE people have started paying attention to this problem. At the IEEE Big Data Conference, Blaise Aguera y Arcas, the Google AI director, emphasized that AI algorithms have to be accompanied by the right data to be ethical and useful. Recently, a WSJ article titled "A Crucial Step for Averting AI Disasters" echoed the idea. The article quoted Douglas Merrill's words -- “The answer to almost every question in machine learning is more data.” I would supplement this by adding "right" after "more". If we claim we are doing Data Science, how can we neglect the first part?

Jian Wu 

Friday, February 8, 2019

2019-02-08: Google+ Is Being Shuttered, Have We Preserved Enough of It?


In 2017 I reviewed many storytelling, curation, and social media tools to determine which might be a suitable replacement for Storify, should the need arise. Google+ was one of the tools under consideration. It did not make the list of top three, but it did produce quality surrogates.

On January 31, 2019, Sean Keane of CNET published an article indicating that the consumer version of Google+ will shut down on April 2, 2019. I knew that the social media service was being shut down in August, but I was surprised to see the new date. Google's blog mentions that they changed the deadline on December 10, 2018, for security reasons. David Rosenthal's recent blog post cites Google+ as yet another example of Google moving up a service decommission date.


This blog post is long because I am trying to answer several useful questions for would-be Google+ archivists. Here are the main bullet points:
  • End users can create a Google Takeout archive of their Google+ content. The pages from the archive do not use the familiar Google+ stylesheets. The archive only includes images that you explicitly posted to Google+.
  • Google+ pages load more content when a user scrolls. Webrecorder.io is the only web archiving tool that I know of that can capture this content.
  • Google+ consists of mostly empty, unused profiles. We can detect empty, unused profiles by page size. Profile pages less than 568kB are likely empty.
  • The robots.txt for plus.google.com does not block web archives.
  • Even when only considering estimates of active profiles, I estimate that less than 1% of Google+ is archived in either the Internet Archive or Archive.today.
  • I sampled some Google+ mementos from the Internet Archive and found a mean Memento Damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.
Update on 2019/03/20 at 10:37 GMT: Google has started sending email out to users telling them to get started downloading their Google+ data on March 31, 2019. Even though the service will shut down on April 2, it may take a lot of time for some users to save their data. This indicates that the April 2 shutdown date is likely a hard shutdown meaning that any data extraction in progress during the shutdown may not complete. Plan your use of Google Takeout and other preservation methods accordingly.
Google+ will join the long list of shuttered Web platforms. Verizon will be shuttering some former AOL and Yahoo! services in the next year. Here are some more recent examples:
Sometimes the service is not shuttered, but large swaths of content are removed, such as with Tumblr's recent crackdown on porn blogs, and Flickr's mass deletion of the photos of non-paying users.

The content of these services represents serious value for historians. Thus Geocities, Vine, and Tumblr were the targets of concerted hard-won archiving efforts.

Google launched Google+ in 2011. Writers have been declaring Google+ dead since its launch. Google+ has been unsuccessful for many reasons. Here are some mentioned in the news over the years:
As seen below, Google+ still has active users. I lost interest in 2016, but WS-DL member Sawood Alam, Dave Matthews Band, and Marvel Entertainment still post content to the social network. Barack Obama did not last as long as I did, with his last post in 2015.

I stopped posting to Google+ in 2016.
WS-DL member Sawood Alam is a more active Google+ member, having posted 17 weeks ago.

Dave Matthews Band uses Google+ to advertise concerts. Their last post was 1 week ago.

Marvel Entertainment was still posting to Google+ while I was writing this blog post.

Barack Obama lost interest in Google+. His last post was on March 6, 2015.

Back in July of 2018, I analyzed how much of the U.S. Government's AHRQ websites were archived. Google+ is much bigger than those two sites. Google+ allows users to share content with small groups or the public. In this blog post, I focus primarily on public content and current content.

I will use the following Memento terminology in this blog post:
  • original resource - a live web resource
  • memento - an observation, a capture, of a live web resource
  • URI-R - the URI of an original resource
  • URI-M - the URI of a memento
ArchiveTeam has a wiki page devoted to the shutdown of Google+. They list the archiving status as "Not saved yet." As shown below, I have found less than 1% of Google+ pages in the Internet Archive or Archive.today.

Update on 2019/03/18 at 16:07 GMT: ArchiveTeam's archiving status has been updated to "In progress...". According to this article by the Verge, there is a concerted effort now underway by ArchiveTeam and the Internet Archive to archive parts of Google+. There are limitations to web archiving, as only up to 500 comments can be archived per post. To help in these efforts, please read the rest of this post so that you can ensure that your own Google+ data is preserved.
Update on 2019/03/18 at 16:21 GMT: On Twitter, Sawood Alam has mentioned that this Reddit post has more information on ArchiveTeam's efforts. For live tracking of the process, visit this page.


In the spirit of my 2017 blog post about saving data from Storify, I cover how one can acquire their own Google+ data. My goal is to provide information for archivists trying to capture the accounts under their care. Finally, in the spirit of the AHRQ post, I discuss how I determined how much of Google+ is probably archived.

Saving Google+ Data

Google Takeout


There are professional services like PageFreezer that specialize in preserving Google+ content for professionals and companies. Here I focus on how individuals might save their content.

Google Takeout allows users to acquire their data from all of Google's services. 


Google provides Google Takeout as a way to download personal content for any of their services. After logging into Google Takeout, it presents you with a list of services. Click "Select None" and then scroll down until you see the Google+ entries.

Select "Google+ Stream" to get the content of your "main stream" (i.e., your posts). There are additional services from which you can download Google+ data. "Google+ Circles" allows you to download vCard-formatted data for your Google+ contacts. "Google+ Communities" allows you to download the content for your communities.

Once you have selected the desired services, click Next. Then click Create Archive on the following page. You will receive an email with a link allowing you to download your archive.

From the email sent by Google, a link to a page like the one in the screenshot allows one to download their data.

The Takeout archive is a zip file that decompresses into a folder containing an HTML file and a set of folders. These HTML pages include your posts, events, information about posts that you +1'd, comments you made on others' posts, poll votes, and photos.

Note that the actual files of some of these images are not part of this archive. It does include your profile pictures and pictures that you uploaded to posts. Images from any Google+ albums you created are also available. With a few exceptions, references to images from within the HTML files in the archive are all absolute URIs pointing to googleusercontent.com. They will no longer function if googleusercontent.com is shut down. Anyone trying to use this Google Takeout archive will need to do some additional crawling for the missing image content, as sketched below.
Google Takeout (right) does not save some formatting elements in your Google+ posts (left). The image, in this case, was included in my Google Takeout download because it is one that I posted to the service.
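A minimal sketch of such a crawl, assuming the Takeout archive was extracted to a hypothetical "Takeout/Google+ Stream" folder and using only Python's standard library:

import pathlib
import re
import urllib.request

takeout_dir = pathlib.Path("Takeout/Google+ Stream")  # assumed extraction path
image_dir = pathlib.Path("images")
image_dir.mkdir(exist_ok=True)

# Collect googleusercontent.com URIs referenced by the Takeout HTML files.
uri_pattern = re.compile(r'https://[a-z0-9]+\.googleusercontent\.com/[^"\s]+')
uris = set()
for html_file in takeout_dir.rglob("*.html"):
    uris.update(uri_pattern.findall(html_file.read_text(encoding="utf-8", errors="ignore")))

# Fetch a local copy of each image before googleusercontent.com goes away.
for i, uri in enumerate(sorted(uris)):
    try:
        urllib.request.urlretrieve(uri, image_dir / f"{i:06d}")
    except OSError as error:
        print(f"Failed to fetch {uri}: {error}")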

Webrecorder.io


One could use webrecorder.io to preserve their profile pages. Webrecorder saves web content to WARCs for use in many web archive playback tools. I chose Webrecorder because Google+ pages require scrolling to load all content, and scrolling is a feature with which Webrecorder assists.

A screenshot of my public Google+ profile replayed in Webrecorder.io.
One of Webrecorder's strengths is the ability to authenticate to services. We should be able to use this authentication ability to capture private Google+ data.

I tried saving my profile using my native Firefox, but that did not work well. Unfortunately, as shown below, sometimes Google's cookie infrastructure got in the way of authenticating with Google from within Webrecorder.io.

In Firefox, Google does not allow me to log into my Google+ account via Webrecorder.io.

I recommend changing the internal Webrecorder.io browser to Chrome to preserve your profile page. I tried to patch the recording a few times to capture all of the JavaScript and images. Even in these cases, I was unable to record all private posts. If someone else has better luck with Webrecorder and their private data, please indicate how you got it to work in the comments.

Other Web Archiving Tools

The following methods only work on your public Google+ pages. Google+ supports a robots.txt that does not block web archives.

The robots.txt for plus.google.com as of February 5, 2019, is shown below:



You can manually browse through each of your Google+ pages and save them to multiple web archives using the Mink Chrome Extension. The screenshots below show it in action saving my public Google+ profile.

The Mink Chrome Extension in action, click to enlarge. Click the Mink icon to show the banner (far left), and then click on the "Archive Page To..." button (center left). From there choose the archive to which you wish to save the page (center right), or select "All Three Archives" to save to multiple archives. The far right displays a WebCite memento of my profile saved using this process.
Archive.is and the Internet Archive both support forms where you can insert a URI and have it saved. Using the URIs of your Google+ public profile, collections, and other content, manually submit them to these forms and the content will be saved.

The Internet Archive (left) has a Save Page Now form as part of the Wayback Machine.
archive.today (right) has similar functionality on its front page.
My Google+ profile saved using the Internet Archive's Save Page Now form.
If you have all of your Google+ profile, collection, community, post, photo, and other URIs in a file and wish to push them to web archives automatically, submit them to the ArchiveNow tool, as sketched below. ArchiveNow can save them to archive.is, archive.st, the Internet Archive, and WebCite by default. It also provides support for Perma.cc if you have a Perma.cc API key.
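A minimal sketch of that workflow with ArchiveNow's Python interface (assuming the push() function described in the project's documentation; the input file name is a placeholder):

from archivenow import archivenow  # pip install archivenow

# One URI per line in a hypothetical list of Google+ URIs.
with open("googleplus_uris.txt") as uri_file:
    uris = [line.strip() for line in uri_file if line.strip()]

for uri in uris:
    # "ia" pushes to the Internet Archive; other handlers such as "is"
    # (archive.today) can be used instead or in addition.
    result = archivenow.push(uri, "ia")
    print(uri, result)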

Current Archiving Status of Google+

How Much of Google+ Should Be Archived?

This section is not about making relevance judgments based on the historical importance of specific Google+ pages. A more serious problem exists: most Google+ profiles are empty. Google made it quite difficult to enroll in its services without signing up for Google+ at the same time. At one time, anyone who wanted a Google account for Gmail, Android, Google Sheets, Hangouts, or a host of other services was inadvertently signed up for Google+ as well. Acquiring an actual count of active users has been difficult because Google reported engagement numbers for all of its services as if they were for Google+. President Obama, Tyra Banks, and Steven Spielberg have all hosted Google Hangouts. That participation can be misleading: Hangouts and Photos were the features users engaged with most often, and those users may never have maintained a Google+ profile. Again, there are a lot of empty Google+ profiles.

In 2015, Forbes wrote that fewer than 1% of users (111 million) were genuinely active, citing a study by Stone Temple Consulting. In 2018, Statistics Brain estimated 295 million active Google+ accounts.

As archivists trying to increase the quality of our captures, we need to detect the empty Google+ profiles. Crawlers start with seeds from sitemaps. I reviewed the robots.txt for plus.google.com and found four sitemaps, one of which focuses on profiles. The sitemap index at http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml points to 50,000 additional sitemaps. Due to the number and size of these files, I did not download them all to get an exact profile count. Each contains between 67,000 and 68,000 URIs, for an estimated total of 3,375,000,000 Google+ profiles.
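This estimate can be reproduced with a short script. The sketch below assumes the child sitemaps are plain, uncompressed XML in the standard sitemap namespace; if Google serves them gzipped, they would need to be decompressed first.

import random
import requests
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
INDEX = "http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml"

# The index lists the child sitemaps; each child sitemap lists profile URIs.
index = ET.fromstring(requests.get(INDEX).content)
children = [loc.text for loc in index.findall(".//sm:loc", NS)]
print(len(children), "child sitemaps listed")

# Sample a few child sitemaps to estimate the number of URIs per sitemap.
counts = []
for sitemap_uri in random.sample(children, 5):
    urlset = ET.fromstring(requests.get(sitemap_uri).content)
    counts.append(len(urlset.findall(".//sm:loc", NS)))

mean_uris = sum(counts) / len(counts)
print("estimated total profiles:", round(mean_uris * len(children)))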


An example of an "empty" Google+ profile.


How do we detect accounts that were never used, like the one shown above? The sheer number of URIs makes it challenging to perform an extensive lexical analysis in a short amount of time, so I took a random sample of 601 profile page URIs from the sitemaps. I chose the sample size using the Sample Size Calculator provided by Qualtrics and verified it with similar calculators provided by SurveyMonkey, Raosoft, the Australian Bureau of Statistics, and Creative Research Systems. This sample size corresponds to a confidence level of 95% and a margin of error of ±4%.
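For reference, the standard sample-size formula for a proportion, n = z² · p(1 − p) / e², reproduces this number with z = 1.96 (95% confidence), p = 0.5 (the worst-case proportion), and e = 0.04 (the ±4% margin of error):

import math

z, p, e = 1.96, 0.5, 0.04          # 95% confidence, worst-case proportion, ±4% margin
n = (z ** 2) * p * (1 - p) / (e ** 2)
print(math.ceil(n))                # 601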

Detecting unused and empty profiles is similar to the off-topic problem that I tackled for web archive collections, and it turns out that page size is a good indicator of whether a profile is unused. I attempted to download all 601 URIs with wget, but 18 returned a 404 status. A manual review of this sample indicated that profiles of size 568kB or larger contain at least one post. Anyone attempting to detect an empty Google+ profile can issue an HTTP HEAD request and record the byte size from the Content-Length header. If the byte size is less than 568kB, then the page likely represents an empty profile and can be ignored.
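A minimal sketch of that check in Python, treating kB as 1,000 bytes and assuming plus.google.com reports Content-Length on HEAD responses; the example URI is my numeric profile URI, which appears again later in this post:

import requests

EMPTY_THRESHOLD = 568 * 1000       # 568kB, the empirical cutoff from the sample above

def looks_empty(profile_uri):
    # Issue an HTTP HEAD and inspect Content-Length instead of downloading the page.
    response = requests.head(profile_uri, allow_redirects=True)
    return int(response.headers.get("Content-Length", 0)) < EMPTY_THRESHOLD

print(looks_empty("https://plus.google.com/115814011603595795989"))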

One could also perform this check from the command line with a tool like curl. Below we see a command that extracts the status line, date, and content-length for an "empty" Google+ profile of 567,342 bytes:




The same command for a different profile URI shows a size of 720,352 bytes for a non-empty Google+ profile:



An example of a 703kB Google+ profile with posts from 3 weeks ago.

An example of Google+ loading more posts in response to user scrolling. Note the partial blue circle on the bottom of the page, indicating that more posts will load.

As seen above, Google+ profiles load more posts when the user scrolls. Profiles of 663kB or greater have filled the first "page" of scrolling, and any Google+ profile larger than this has more posts to view. Unfortunately, crawling tools must execute a scroll event on the page to load this additional content, so web archive recording tools that do not automatically scroll the page will not record it.
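For tools that drive a real browser, triggering that lazy loading is a matter of issuing scroll events until the page stops growing. The sketch below illustrates the general technique with Selenium and Chrome; it is not the implementation used by any particular crawler.

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://plus.google.com/115814011603595795989")

# Scroll to the bottom repeatedly until the page height stops growing,
# which indicates that no more posts are being lazily loaded.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)                  # give the lazy-loading JavaScript time to fetch more posts
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

driver.quit()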
This histogram displays counts of the file sizes of the downloaded Google+ profile pages. Most profiles are empty, hence a significant spike for the bin containing 554kB.
From my sample, 57/601 (9.32%) had content larger than 568kB. Only 12/601 (2.00%) had content larger than 663kB, potentially indicating active users. Applying this 2% to the total number of profiles yields an estimate of 67.5 million active profiles. Of course, based on the sample size calculator, my estimate may be off by as much as 4%, leaving an upper estimate of 135 million, which falls between the 111 million figure from the 2015 Stone Temple study and the 295 million figure from the 2018 Statistics Brain web page. The inconsistencies likely result from the sitemaps not reporting all profiles for the entire history of Google+, as well as from differences in how these studies define a profile.

I looked at various news sources that had linked to Google+ profiles. The profile URIs from the sitemaps do not correspond to those most often shared and linked by users. For example, my vanity Google+ profile URI is https://plus.google.com/+ShawnMJones, but the sitemap lists it as the numeric profile URI https://plus.google.com/115814011603595795989. Google+ uses the canonical link relation to tie these two URIs together, but it reveals this relation only in the HTML of the pages. For a tool to discover the relationship, it must dereference each sitemap profile URI, an expensive discovery process at scale. If Google had placed these relations in an HTTP Link header, then archivists could have used an HTTP HEAD request to discover the relationship. The additional URI forms make it difficult to use profile URIs from sitemaps alone for analysis.
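A sketch of that discovery step is shown below. It assumes the relation appears as a <link rel="canonical"> element in the page's HTML and uses requests with BeautifulSoup; neither detail is documented by Google, so treat both as assumptions.

import requests
from bs4 import BeautifulSoup

def canonical_uri(profile_uri):
    # Dereference the profile URI and pull the canonical link relation out of
    # the HTML, since Google+ does not expose it in an HTTP Link header.
    html = requests.get(profile_uri).text
    link = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    return link["href"] if link else None

print(canonical_uri("https://plus.google.com/+ShawnMJones"))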

The content of the pages found at the vanity and numeric profile URIs is slightly different: their SHA256 hashes do not match. A review in vimdiff indicates that the differences are self-referential identifiers in the HTML (i.e., JavaScript variables containing +ShawnMJones vs. 115814011603595795989), a nonce that Google+ calculates and inserts into the content when it serves each page, and some additional JavaScript. Visually, the two pages look the same when rendered.

How Much of Google+ Is Archived?


The lack of easy canonicalization of profile URIs makes it challenging to use the URIs found in sitemaps for web archive analysis. I chose instead to evaluate the holdings reported by two existing web archives.

For comparison in the following sections, I use the totals from the sitemaps downloaded directly from plus.google.com.
Internet Archive Search Engine Result Pages
I turned to the Internet Archive to understand how many Google+ pages exist in its holdings. I downloaded the data file used in the AJAX call that produces the page shown in the screenshot below.

The Internet Archive reports 83,162 URI-Rs captured for plus.google.com.

The Internet Archive reports 83,162 URI-Rs captured. I further analyzed the data file and broke the URI-Rs down by URI into the categories shown in the table below.

Category      # in Internet Archive    % of Total from Sitemap
Collections   1                        0.00000572%
Communities   0                        0%
Posts         12,946                   Not reported in sitemap
Profiles      65,000                   0.00193%
Topics        0                        0%
Other         5,217                    Not reported in sitemap
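The categorization itself was done by URI. A rough sketch of such a classifier appears below; the patterns (/posts/, /communities/, /collection/, /topics/, and numeric or +vanity profile paths) are my assumptions about Google+ URI structure, not an official scheme.

import re

def categorize(uri_r):
    # Classify a plus.google.com URI-R by path, using assumed URI patterns.
    path = uri_r.split("plus.google.com", 1)[-1]
    if "/posts/" in path:
        return "post"
    if "/communities/" in path:
        return "community"
    if "/collection/" in path:
        return "collection"
    if "/topics/" in path:
        return "topic"
    if re.match(r"^/(\+\w+|\d+)/?$", path):
        return "profile"
    return "other"

print(categorize("https://plus.google.com/115814011603595795989"))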

The archived profile page URIs include both the vanity and numeric forms. Without dereferencing each, it is difficult to determine how much overlap exists. Assuming no overlap, the Internet Archive possesses 65,000 profile pages, which is far less than 1% of the 3 billion profiles in the sitemaps and 0.0481% of our estimate of 135 million active profiles from the previous section.

I randomly sampled 2,334 URI-Rs from this list, corresponding to a confidence level of 95% and a margin of error of ±2%. I downloaded TimeMaps for these URI-Rs and calculated a mean of 67.24 mementos per original resource.
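A sketch of the TimeMap step, assuming the Internet Archive's link-format TimeMap endpoint (http://web.archive.org/web/timemap/link/{URI-R}) and counting the memento entries in each response:

import requests

def memento_count(uri_r):
    # Fetch the link-format TimeMap for this URI-R and count its memento entries.
    response = requests.get("http://web.archive.org/web/timemap/link/" + uri_r)
    if response.status_code != 200:
        return 0
    return sum(1 for line in response.text.splitlines()
               if "memento" in line and "datetime=" in line)

sample = ["https://plus.google.com/+ShawnMJones"]   # replace with the sampled URI-Rs
counts = [memento_count(u) for u in sample]
print(sum(counts) / len(counts))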
Archive.today Search Engine Result Pages
As shown in the screenshot below, Archive.today also provides a search interface on its web site.

Archive.today reports 2551 URI-Rs captured for plus.google.com.

Archive.today reports 2,551 URI-Rs, but scraping its search pages returns 3,061 URI-Rs. I analyzed the URI-Rs returned from the scraping script to place them into the categories shown in the table below.

Category      # in Archive.today    % of Total from Sitemap
Collections   10                    0.0000572%
Communities   0                     0%
Photos        22                    Not reported in sitemap
Posts         1,994                 Not reported in sitemap
Profiles      989                   0.0000293%
Topics        1                     0.248%
Other         45                    Not reported in sitemap


Archive.today contains 989 profiles, a tiny percentage of both the 3 billion profiles suggested by the sitemaps and the 135 million active profiles estimated in the previous section.

Archive.today is Memento-compliant, so I attempted to download TimeMaps for these URI-Rs. For 354 URI-Rs, I received 404s for their TimeMaps, leaving me with 2,707 TimeMaps. Using these TimeMaps, I calculated a mean of 1.44 mementos per original resource.

Are These Mementos of Good Quality?

It is not enough for archives merely to contain mementos; their quality matters as well. Crawling web content often results in missing embedded resources such as stylesheets. Fortunately, Justin Brunelle developed an algorithm for scoring the quality of a memento that takes missing embedded resources into account, and Erika Siregar developed the Memento Damage tool based on Justin's algorithm so that we can calculate these scores. I used the Memento Damage tool to score the quality of some mementos from the Internet Archive.

The histogram of memento damage scores from our random sample shows that most have a damage score of 0.
Memento damage takes a long time to calculate, so I needed to keep the sample size small. I randomly sampled 383 URI-Rs from the list acquired from the Internet Archive and downloaded their TimeMaps. I then acquired a list of 383 URI-Ms by randomly sampling one URI-M from each TimeMap and fed these URI-Ms into a local instance of the Memento Damage tool. The Memento Damage tool experienced errors for 41 URI-Ms.
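The sampling of one URI-M per TimeMap can be sketched as follows, reusing the TimeMap endpoint assumed earlier; scoring the chosen URI-Ms is then left to the Memento Damage tool itself.

import random
import requests

def random_memento(uri_r):
    # Pick one memento URI-M at random from this URI-R's Wayback TimeMap.
    timemap = requests.get("http://web.archive.org/web/timemap/link/" + uri_r).text
    mementos = [line.strip().split(">")[0].lstrip("<")
                for line in timemap.splitlines()
                if "memento" in line and "datetime=" in line]
    return random.choice(mementos) if mementos else None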

This memento has the highest damage score of 0.941 in our sample. The raw size of its base page is 635 kB.


The mean damage score for these mementos is 0.347. A score of 0 indicates no damage. This score may be misleading, however, because more content is loaded via JavaScript when the user scrolls down the page. Most crawling software does not trigger this JavaScript code and hence misses this content.

The screenshot above displays the largest memento in our sample. The base page has a size of 1.3 MB and a damage score of 0.0. It is not a profile page, but a page for a single post with comments.
The screenshot above displays the smallest memento in our sample with a size greater than zero and no errors while computing damage. This single post page redirects to a page not captured by the Internet Archive. The base page has a size of 71kB and a damage score of 0.516.
The screenshot above displays a memento for a profile page of size 568kB, the lower bound of pages with posts from our earlier live sample. It has a memento damage score of 0.

This histogram displays the file sizes in our memento sample. Note how most have a size between 600kB and 700kB. 

As an alternative to memento damage, I also downloaded the raw memento content of the 383 mementos to examine their sizes. The HTML has a mean size of 466kB and a median of 500kB. This sample mixes mementos of posts and other types of pages, and post pages appear to be smaller. The memento of a profile page shown below still contains posts at 532kB, while mementos of profile pages smaller than this had just a user name and no posts. It is possible that the true lower bound for profiles with posts is around 532kB.

This memento demonstrates a possible new lower bound in profile size at 532kB. The Internet Archive captured it in January of 2019.
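To download raw (unrewritten) memento content from the Wayback Machine as described above, one can append the id_ modifier to the 14-digit timestamp in a URI-M. The sketch below illustrates the idea; the URI-M shown is a placeholder pattern rather than one from this sample.

import re
import requests

def raw_memento_size(uri_m):
    # Insert the "id_" modifier after the 14-digit timestamp so the Wayback
    # Machine returns the archived bytes without replay rewriting.
    raw_uri = re.sub(r"(/web/\d{14})/", r"\1id_/", uri_m, count=1)
    return len(requests.get(raw_uri).content)

print(raw_memento_size("http://web.archive.org/web/20190101000000/https://plus.google.com/+ShawnMJones"))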

Discussion and Conclusions


Google+ is being shut down on April 2, 2019. What direct evidence will future historians have of its existence? We have less than two months to preserve much of Google+. In this post, I detailed how users might preserve their profiles with Google Takeout, Webrecorder.io, and other web archiving tools.

I mentioned that there are questions about how many active users ever existed on Google+. In Google's attempt to make all of its services "social," it conflated the number of active Google+ users with the number of active users of its other services. Third-party estimates of active Google+ users over the years have ranged from 111 million to 295 million. With a sample of 601 profiles from the profile sitemaps at plus.google.com, I estimated that the number might be as high as 135 million.

To archive non-empty Google+ pages, we have to be able to detect pages that are empty. I analyzed a small sample of Google+ profile pages and discovered that pages of size 663kB or larger contain enough posts to fill the first "page" of scrolling. I also discovered that inactive profile pages tend to be smaller than 568kB. Using the HTTP HEAD method and the Content-Length header, archivists can apply these thresholds to detect unused or sparsely used Google+ profiles before downloading their content.

I estimated how much of Google+ exists in public web archives. Scraping URIs from the search result pages of the Internet Archive, the most extensive web archive, reveals only 83,162 URI-Rs for Google+. Archive.today reveals only 2,551 URI-Rs. Both hold less than 1% of the totals for the Google+ page categories found in the sitemaps. That so few pages are archived may indicate that archiving crawls rarely found Google+ profiles because few web pages linked to them.

I sampled some mementos from the Internet Archive and found a mean damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.

Because Google+ relies on page scrolling to load more content, many mementos will likely be of poor quality if recorded without tools like Webrecorder.io. With the sheer number of pages to preserve, we may have to choose quantity over quality.

If a sizable sample of those profiles is considered to be valuable to historians, then web archives have much catching up to do.

A concerted effort will be necessary to acquire a significant number of profile pages by April 2, 2019. My recommendations are for users to archive their public profile URIs with ArchiveNow, Mink, or the save page forms at the Internet Archive and Archive.today. Archivists looking to archive Google+ more generally should download the topics sitemap and at least capture the 404 (four hundred four, not the HTTP status) topic pages using these same tools. Enterprising archivists can search news sources, like this Huffington Post article and this Forbes article, that feature popular and famous Google+ users. Sadly, because these articles lack links, much of their data is not in a machine-readable form. A Google+ archivist would need to search Google+ for these profile page URIs manually. Once that is done, the archivist can save these URIs using the tools mentioned above.

Due to its lower usage compared to other social networks and its controversial history, some may ask "Is Google+ worth archiving?" Only future historians will know, and by then it will be too late, so we must act now.

-- Shawn M. Jones