Wednesday, April 17, 2019

2019-04-17: Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives


On March 11, 2019 in the NBA, the Utah Jazz hosted their Northwest Division rivals, the Oklahoma City Thunder.  During the game, a Utah fan (Shane Keisel) and an Oklahoma City player (Russell Westbrook) engaged in a verbal exchange, with the player stating the fan was directing racist comments to him and the fan admitting to heckling but denying that his comments were racist.  The event was well documented (see, for example, this Bleacher Report article), and the following day the fan received a lifetime ban from all events at the Vivint Smart Home Arena and the player received a $25k fine from the NBA.

Disclaimer: I have no knowledge of what the fan said during the game, nor do I have an opinion regarding the appropriateness of the respective penalties.  My interest is that after the game, the fan gave at least one interview with a TV station reporter in which he exposed his identity.  That set off a rapidly evolving series of events with both real and fake Twitter accounts, which we unravel with the aid of multiple web archives.  The initial analysis was performed by Justin Whitlock as a project in my CS 895 "Web Archiving Forensics" class; prior to Justin proposing it as a project topic, my only knowledge of this event was via the Daily Show.


First, let's establish a timeline of events.  The timeline is a little complicated because, although the game was played in the Mountain time zone, most media reports are relative to Eastern time, and the web crawlers report their time in UTC (or GMT).  Furthermore, daylight saving time began on Sunday, March 10, and the game was played on Monday, March 11.  This means there is a four hour differential between UTC and EDT, and a six hour differential between UTC and MDT.  Although most events occur after the switch to daylight saving time, some occur before it (where there is a five hour differential between UTC and EST).
  • 2019-03-12T01:00:00Z -- the game is scheduled to begin at March 11, 9pm EDT (March 12, 1am UTC).  An NBA game will typically last 2--2.5 hours, and at least one tweet shows Westbrook talking to someone in the bleachers midway through the second quarter (there may be other videos in circulation as well).
  • 2019-03-12T03:58:00Z -- based on the empty seats and the timestamp on the tweet (11:58pm EDT), the post-game interview with a KSL reporter embedded above reveals the fan's name and face.  The uncommon surname of "Keisel" combined with a closeup of his face enables people to quickly find his Twitter account: "@skeisel391".
  • 2019-03-12T04:57:34Z -- Within an hour of the KSL interview being posted, Keisel's Twitter account is "protected". This means we can see his banner and avatar photos and his account metadata, but not his tweets.
  • 2019-03-12T12:23:42Z -- Less than 9 hours after the KSL interview, his Twitter account is "deleted". No information is available from his account at this time.
  • 2019-03-12T15:29:47Z -- Although his Twitter account is deleted, the first page (i.e., first 20 tweets) is still in Google's cache and someone has pushed Google's cached version of the page into a web archive.  The banner of the web archive (archive.is) obscures the banner inserted by Google's cache, but a search of the source code of http://archive.is/K6gP4 reveals: 
    "It is a snapshot of the page as it appeared on Mar 6, 2019 11:29:08 GMT." 
In other words, an archived version of Google's cached page reveals Keisel's tweets (the most recent 20 tweets anyway) from nearly a week before (i.e., 2019-03-06T11:29:08Z) the game on March 11, 2019.
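The offset arithmetic above is easy to get wrong by hand, so here is a minimal Python sketch that lets the time zone database do it.  It assumes Python 3.9+ with time zone data installed; the zone names "America/New_York" and "America/Denver" are my choice for Eastern and Mountain time, and local_views is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumed zone names for Eastern and Mountain time.
EASTERN = ZoneInfo("America/New_York")
MOUNTAIN = ZoneInfo("America/Denver")


def local_views(utc_stamp):
    """Render an ISO 8601 UTC timestamp (trailing 'Z') in Eastern and Mountain time."""
    utc = datetime.strptime(utc_stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "utc": utc,
        "eastern": utc.astimezone(EASTERN),
        "mountain": utc.astimezone(MOUNTAIN),
    }


# Scheduled start of the game: after the switch, so EDT (-04:00) and MDT (-06:00).
tipoff = local_views("2019-03-12T01:00:00Z")
print(tipoff["eastern"].isoformat())   # 2019-03-11T21:00:00-04:00
print(tipoff["mountain"].isoformat())  # 2019-03-11T19:00:00-06:00

# The Google cache snapshot predates the switch, so Eastern is still EST (-05:00).
cached = local_views("2019-03-06T11:29:08Z")
print(cached["eastern"].isoformat())   # 2019-03-06T06:29:08-05:00

# Elapsed time between the KSL interview tweet and the "protected" memento.
interview = local_views("2019-03-12T03:58:00Z")["utc"]
protected = local_views("2019-03-12T04:57:34Z")["utc"]
print(protected - interview)           # 0:59:34
```

Keeping every timestamp in UTC and converting only for display avoids mixing the EST and EDT offsets within the same timeline.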

Although Keisel quickly protected and then ultimately deleted his account, until it was deleted his photos and account metadata were available and allowed a number of fake accounts to proliferate.  The most successful fake is "@skeiseI391", which is also now deleted but stayed online until at least 2019-03-17T04:18:48Z.  "@skeiseI391" replaces the lowercase L ("l") with an uppercase I ("I").  Depending on the font of your browser, the two characters can be all but indistinguishable (here they are side-by-side: lI).  I'm not sure who created this account, but we discovered it in this tweet, where the user provides not only screen shots but also a video of scrolling and clicking through the @skeiseI391 account before it was deleted.
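Because the two handles can look identical on screen, a character-by-character comparison is more reliable than eyeballing them.  The following is a small Python sketch (differing_positions is a hypothetical helper, not part of any tool we used); it also relies on the fact that Twitter handles are case-insensitive, so lowercasing the fake handle exposes the substitution.

```python
REAL = "skeisel391"  # lowercase L in position 6
FAKE = "skeiseI391"  # uppercase I in position 6


def differing_positions(a, b):
    """Return (index, char_a, char_b) for each position where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("handles must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(differing_positions(REAL, FAKE))  # [(6, 'l', 'I')]

# Lowercasing turns the uppercase I into a visibly different "i".
print(FAKE.lower())                  # skeisei391
print(REAL.lower() == FAKE.lower())  # False
```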






The video has significant engagement: originally posted at 2019-03-12T10:55:00Z, it now has greater than 1k RTs, 3k likes, and 381k views.  There are many other accounts circulating these screen shots, some of which are provably true, some of which are provably false, and some of which cannot be verified using public web archives.  The screen shots have had an impact in the news as well, showing up in, among others, The Root, News One, and BET.  BET even quoted a provably fake tweet in the headline of their article:

This article's headline references a fake tweet.
The Internet Archive has mementos (archived web pages) for both the fake @skeiseI391 and the real @skeisel391 accounts, but the Twitter account metadata (e.g., when the account was created, how many followers, how many tweets) is in Chinese for the fake account and in Kannada for the real account.  This is admittedly confusing, but is a result of how the Internet Archive's crawler and Twitter's cookies interact; see our research group's posts from 2018-03 and 2019-03 on these topics for further information.  Fortunately, archive.is does not have the same problems with cookies, so we use their mementos for the following screen shots (two from the real account at archive.is and one from the fake account at archive.is).

real account, 2019-03-06T11:29:08Z (Google cache)
real account, 2019-03-12T04:57:34Z
From the account metadata, we can see this was not an especially active account: established in October 2011, it has 202 total tweets, 832 likes, following 51 accounts, and from March 6 to March 12, it went from 41 to 53 followers.  The geographic location is set to "Utah, USA", and the bio has no linked URL and has three flag emojis.

fake account; note the difference in the account metadata
The fake account has notably different metadata: the bio has only two flag emojis, plus a link to "h.cm", a page for a parked domain that appears to have never had actual content (the Internet Archive has mementos back to 2012). Furthermore, this account is far more active with 7k tweets, 23k likes, 1500 followers and following 1300 accounts, all since being created in August 2018.

Twitter allows users to change their username (or "handle") without losing followers, old tweets, etc.  Since the handle is reflected in the URL and web archives only index by URL, we cannot know what the original handle of the fake @skeiseI391 account was, but at some point after the game the owner changed from the original handle to "skeiseI391".  Since the account is no longer live, we cannot use the Twitter API to extract more information about the account (e.g., followers and following, tweets prior to the game), but given the link to a parked/spam web page and the high level of engagement in a short amount of time, this was likely a burner bot account designed to amplify legitimate accounts (cf. "The Follower Factory"), and then was adapted for this purpose.

We can pinpoint when the fake @skeiseI391 account was changed.  By examining the HTML source from the IA mementos of the fake and real accounts, we can determine the URLs of the profile images:

Real: https://pbs.twimg.com/profile_images/872289541541044225/X6vI_-xq_400x400.jpg

Fake: https://pbs.twimg.com/profile_images/1105325330347249665/YHcWGvYD_400x400.jpg

Both images are 404 now, but they are archived at those URLs in the Internet Archive:

Archived real image, uploaded 2017-06-07T03:08:07Z
Archived fake image, uploaded 2019-03-12T04:29:09Z
Also note that the tool used to download the real image and then upload as the fake image maintained the circular profile pic instead of the original square.

For those familiar with curl, I include just a portion of the command line interface that shows the original "Last-Modified" HTTP response header from twitter.com.  It is those dates that record when the image changed at Twitter; these are separate from the dates from when the image was archived at the Internet Archive.  The relevant response headers are shown below:

Real image:
$ curl -I http://web.archive.org/web/20190312045057/https://pbs.twimg.com/profile_images/872289541541044225/X6vI_-xq_400x400.jpg
HTTP/1.1 200 OK
Server: nginx/1.15.8
Date: Wed, 17 Apr 2019 15:12:02 GMT
Content-Type: image/jpeg
...

X-Archive-Orig-last-modified: Wed, 07 Jun 2017 03:08:07 GMT
...

Memento-Datetime: Tue, 12 Mar 2019 04:50:57 GMT
...


Fake image:
$  curl -I http://web.archive.org/web/20190312061306/https://pbs.twimg.com/profile_images/1105325330347249665/YHcWGvYD_400x400.jpg
HTTP/1.1 200 OK
Server: nginx/1.15.8
Date: Wed, 17 Apr 2019 15:13:21 GMT
Content-Type: image/jpeg
...

X-Archive-Orig-last-modified: Tue, 12 Mar 2019 04:29:09 GMT
...

Memento-Datetime: Tue, 12 Mar 2019 06:13:06 GMT
...


The "Memento-Datetime" response header is when the Internet Archive crawled/created the memento (real = 2019-03-12T04:50:57Z; fake = 2019-03-12T06:13:06Z), and the "X-Archive-Orig-last-modified" response header is the Internet Archive echoing the "Last-Modified" response header it received from twitter.com at crawl time.  From this we can establish that the image was uploaded to the fake account at 2019-03-12T04:29:09Z, not quite 30 minutes before we can establish that the real account was set to "protected" (2019-03-12T04:57:34Z).
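For readers who would rather compute the intervals than subtract timestamps by hand, here is a minimal Python sketch.  It only parses the header values already shown in the curl output for the fake image (it does not contact the Internet Archive), and the variable names are mine.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Header values copied from the curl output for the fake profile image.
fake_headers = {
    "X-Archive-Orig-last-modified": "Tue, 12 Mar 2019 04:29:09 GMT",
    "Memento-Datetime": "Tue, 12 Mar 2019 06:13:06 GMT",
}

# parsedate_to_datetime understands the HTTP date format and returns
# timezone-aware datetimes for values that end in "GMT".
uploaded = parsedate_to_datetime(fake_headers["X-Archive-Orig-last-modified"])
archived = parsedate_to_datetime(fake_headers["Memento-Datetime"])

# Datetime of the memento that first shows the real account as "protected".
protected_seen = datetime(2019, 3, 12, 4, 57, 34, tzinfo=timezone.utc)

print(protected_seen - uploaded)  # 0:28:25 -> image uploaded before the real account was seen protected
print(archived - uploaded)        # 1:43:57 -> lag between upload and capture by the archive
```

The same two lines of parsing work for the real image's headers; only the header values change.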

We've presented a preponderance of evidence that the account @skeiseI391 is fake and that this fake account is responsible for the "come at me _____ boy" tweet referenced in multiple news outlets.  But what about some of the other screen shots referenced in social media and the news?  Are they real?  Are they photoshopped?  Are they from other, yet-to-be-uncovered fake accounts?

First, any tweet that is a reply to another tweet will be difficult to verify with web archives unless we know the direct URL of the original tweet or the reply itself (e.g., twitter.com/[handle]/status/[many numbers]).  Unfortunately, the deep links for individual tweets are rarely crawled and archived for less popular accounts.  While the top level page will be crawled and the most recent 20 tweets included, one has to be logged in to Twitter to see the tweets included in the "Tweets & replies" tab, and public web archives are not logged in when they crawl, so those contents are typically not available.  As such, it is hard to establish via web archives if the screen shot of the reply below is real or fake.  The original thread is still on the live web, but of the 45 replies, two of them are marked "This Tweet is unavailable".  One of those could be a reply from the real @skeisel391, but we don't have enough information to definitively rule on whether that is true.  The particular tweet shown below ("#poorloser") is of issue because even though it was from nearly a year ago, it would contradict the "we were having fun" attitude from the KSL interview.  Other screen shots that appear as replies will be similarly difficult to uncover using web archives.

This could be a real reply, but with web archives it is difficult to establish provenance of reply tweets.
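When sifting through lists of archived URLs, it helps to separate deep links to individual tweets from top-level account pages.  The sketch below is one way to do that in Python; the regular expression is my own approximation of the twitter.com/[handle]/status/[many numbers] pattern, and the status ID in the example is made up for illustration.

```python
import re

# Approximation of a tweet deep link: handles are up to 15 letters, digits,
# or underscores, and the status ID is a run of digits.
TWEET_URL = re.compile(
    r"^https?://(?:www\.|mobile\.)?twitter\.com/"
    r"(?P<handle>[A-Za-z0-9_]{1,15})/status/(?P<tweet_id>\d+)/?$"
)


def parse_tweet_url(url):
    """Return (handle, tweet_id) for a tweet deep link, or None for any other URL."""
    match = TWEET_URL.match(url)
    if match is None:
        return None
    return match["handle"], match["tweet_id"]


# Hypothetical status ID, used only to exercise the pattern.
print(parse_tweet_url("https://twitter.com/skeisel391/status/1234567890"))
# ('skeisel391', '1234567890')
print(parse_tweet_url("https://twitter.com/skeisel391"))
# None
```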
The tweet below is more difficult to establish, since it does not appear to be a reply and the datetime that it was posted (2018-10-06T16:11:00Z) falls within the date range of the memento of the page in the Google cache, which has tweets from 2019-02-27 to 2018-10-06.  The use of "#MAGA" is in line with what we know Keisel has tweeted (at least 7 of the 20 tweets are clearly conservative / right-wing).  At first glance it appears that the memento covers tweets all the way back to 2018-10-04, since a retweet with that timestamp appears as the 20th and final tweet on the page, and thus a tweet from 2018-10-06 should appear before the one with a timestamp of 2018-10-04.  But retweeting a tweet does not reset its timestamp; for example, if I tweeted something yesterday and you retweet it today, your retweet will show my timestamp of yesterday.  So although the last timestamp shown on the page is 2018-10-04, the 19th tweet on the page is from Keisel and shows a timestamp of 2018-10-06.  So it's possible that the retweet occurred on 2018-10-06 and the tweet below just missed being included in the 20 most recent tweets (i.e., it was the 21st most recent tweet).  The screen shot shows a time of "11:11am", and in the HTML source of Google's cached page, for the 19th tweet it has:

title="8:11 AM - 6 Oct 2018"

Which would suggest that the screen shot happened after the 19th tweet, but without time zone information we can't reliably sequence the tweets.  Depending on the GeoIP of Google's crawler, Twitter would set the "8:11 AM" value relative to that timezone.  It's tempting to think it's in California and thus PST, but we can't be certain.  Regardless, there's no way to know the default time zone of the presumed client in the screen shot.
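To illustrate how much the missing time zone matters, the Python sketch below interprets the cached "8:11 AM - 6 Oct 2018" string in several candidate zones and converts each to UTC.  The list of candidate zones is purely an assumption for illustration (we do not know where Google's crawler was geolocated), and the offsets are whatever the time zone database reports for that date.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Wall-clock time from the cached page; it carries no time zone information.
naive = datetime.strptime("8:11 AM - 6 Oct 2018", "%I:%M %p - %d %b %Y")

# Candidate zones chosen only for illustration.
CANDIDATES = [
    "America/Los_Angeles",
    "America/Denver",
    "America/Chicago",
    "America/New_York",
    "UTC",
]

# Attach each candidate zone to the same wall-clock time, then convert to UTC.
candidates_utc = {
    name: naive.replace(tzinfo=ZoneInfo(name)).astimezone(timezone.utc)
    for name in CANDIDATES
}

for name, instant in candidates_utc.items():
    print(f"{name:20} {instant.isoformat()}")

# The same string can refer to instants up to seven hours apart.
spread = max(candidates_utc.values()) - min(candidates_utc.values())
print(spread)  # 7:00:00
```

With a spread of hours between plausible interpretations, two tweets posted on the same day cannot be ordered from their displayed times alone.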

We cannot definitively establish the provenance of this tweet.
Bing's cache also has a copy of Keisel's page, and it covers a period of 2018-09-14 to 2018-03-27.  Unfortunately, that leaves a coverage gap from 2018-10-06 to 2018-09-14, inclusive, and if the "#MAGA" tweet is real it could fall between the coverage provided by Google's cache and Bing's cache.
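The size of the gap is simple date arithmetic; a short Python sketch (the constant and function names are mine) makes the boundaries explicit.  Both endpoints are counted because each cache only covers part of its boundary day.

```python
from datetime import date

GOOGLE_OLDEST = date(2018, 10, 6)  # oldest tweet date on Google's cached page
BING_NEWEST = date(2018, 9, 14)    # newest tweet date on Bing's cached page

# Inclusive count of calendar days that neither cache fully covers.
gap_days = (GOOGLE_OLDEST - BING_NEWEST).days + 1
print(gap_days)  # 23


def in_gap(day):
    """True if a tweet dated `day` could sit between the two cached pages."""
    return BING_NEWEST <= day <= GOOGLE_OLDEST


print(in_gap(date(2018, 10, 6)))  # True: the date shown on the "#MAGA" screen shot
```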

This leaves three scenarios to account for the above "#MAGA" tweet and why we don't have a cached copy of it:
  1. Keisel deleted this tweet on or before March 6, 2019 in anticipation of the game on March 11, 2019.  While not impossible, it does not seem probable because it would require someone taking a screen shot of the tweet prior to the KSL interview.  Since the real @skeisel391 account was not popular (~200 tweets, < 50 followers), this seems like an unlikely scenario.
  2. Someone photoshopped or otherwise created a fake tweet.  Given the existence of the fake @skeiseI391 account (and other fake accounts), this cannot be ruled out.  If it is a fake, it does not appear to have the same origin as the fake @skeiseI391 account.  
  3. The screen shot is legitimate and we are simply unlucky that the tweet in question fell in the coverage gap between the Google cache and the Bing cache, just missing appearing on the page in Google's cache.
I should note that in the process of extending Justin's analysis we came across this thread from sports journalist @JonMHamm, where he uncovered the fake @ account and also looked at the page in Google's cache, although he was unaware that the earliest date it establishes is 2018-10-06 and not 2018-10-04.  He also vouches for a contact that claims to have seen the "#MAGA" tweet while it was still live, but that's not something I can independently verify.




In summary, of the three primary tweets offered as evidence, we can reach the following conclusions:
  1. "come at me _____ boy" -- this tweet is definitively fake.
  2. "#poorloser" -- this tweet is a reply, and in general reply tweets will not appear in public web archives, so web archives cannot help us evaluate this tweet.
  3. "#MAGA" -- this tweet is either faked, or it falls in the gap between what appears in the Google cache and what appears in the Bing cache; using web archives we cannot definitively determine which explanation is more likely.
We welcome any feedback, additional cache sources, deep links to individual tweets, evidence that these tweets were ever embedded in HTML pages, or any additional forensic evidence.  I thank Justin Whitlock for the initial analysis, but I take responsibility for any errors (including the persistent fear of incorrectly computing time zone offsets).

Finally, in the future please don't just take a screen shot; push the page to multiple web archives as well.

--Michael




Note: There are other fake Twitter accounts, for example: @skeisell391 (two lowercase L's),  @skeisel_ (trailing underscore), but they are not well-executed and I have omitted them from the discussion above.  

Monday, April 1, 2019

2019-04-01: Creating a data set for 116th Congress Twitter handles

Senators from Alabama in the 115th Congress

Any researcher conducting research on Twitter and the US Congress might think, "how hard could it be to create a data set of Twitter handles for the members of Congress?". At any given time, we know the number of members in the US Congress and we also know who the current members are. At this point, creating a data set of Twitter handles for the members of Congress might seem like an easy task, but it turns out to be a lot more challenging than expected. We present the challenges involved in creating a data set of Twitter handles for the members of the 116th US Congress and provide such a data set.

Brief about the US Congress


The US Congress is a bicameral legislature comprising the Senate and the House of Representatives. The Congress consists of:

  • 100 senators, two from each of the fifty states.
  • 435 representatives, with seats distributed by population across the fifty states.
  • 6 non-voting members from the District of Columbia and US territories which include American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and US Virgin Islands.
Every US Congress is consecutively numbered and has a term of two years. The current US Congress is the 116th Congress which began on 2019-01-03 and will end on 2021-01-03.       

Previous Work on Congressional Twitter


Since the inception of social media, Congress members have aggressively used it as a medium of communication with the rest of the world. Previous researchers have completed their US Congress Twitter handles data set by both using other lists and manually adding to them. 

Jennifer Golbeck et al. in their papers "Twitter Use by the US Congress" (2010) and "Congressional twitter use revisited on the platform's 10-year anniversary" (2018) used Tweet Congress to build their data set of Twitter handles for the members of Congress. An important highlight from their 2018 paper is that every member of Congress has a Twitter account. Libby Hemphill in "What's congress doing on twitter?" talks about the manual creation of a list of 380 Twitter handles for the US Congress, which were used for collecting tweets in the winter of 2012. Theresa Loraine Cardenas in "The Tweet Delete of Congress: Congress and Deleted Posts on Twitter" (2013) used Politwoops to create the list of Twitter handles for members of Congress. Jihui Lee et al. in their paper "Detecting Changes in Congressional Twitter Networks over Time" used the community maintained GitHub repository from @unitedstates to collect Twitter data for 369 of the 435 representatives from the 114th US Congress. Libby Hemphill and Matthew A. Shapiro in their paper "Appealing to the Base or to the Moveable Middle? Incumbents’ Partisan Messaging Before the 2016 U.S. Congressional Elections" (2018) also used the community maintained GitHub repository from @unitedstates.
Screenshot from Tweet Congress

Twitter Handles of the 116th Congress 


January 3, 2019 marked the beginning of the 116th United States Congress, with 99 freshman members. It has already been two months since the new Congress was sworn in. Now, let us review Tweet Congress and the @unitedstates GitHub repository to check how up-to-date these sources are with the Twitter handles for the current members of Congress. We also review the CSPAN Twitter lists for the members of Congress in our analysis.

Tweet Congress 

Tweet Congress is an initiative from the Sunlight Foundation with help from Twitter to create a transparent environment which allows easy conversation between lawmakers and voters in real time. It was launched in 2011. It lists all the members of Congress and their contact information. The service also provides visualizations and analytics for Congressional accounts.     

@unitedstates (GitHub Repository)

It is a community maintained GitHub repository which has a list of members of the United States Congress from 1789 to present, congressional committees from 1973 to present, current committee memberships, and information about all the presidents and vice-presidents of the United States. The data is available in YAML, JSON, and CSV format.

CSPAN (Twitter List)

CSPAN maintains Twitter lists for the 116th US Representatives and US Senators. The Representatives list has 482 Twitter accounts while the Senators list has 114 Twitter accounts. 

Combining Lists  


We used the Wikipedia page on the 116th Congress as our gold-standard data for the current members of Congress. The data from Wikipedia was collected on 2019-03-01. Correspondingly, the data from CSPAN, @unitedstates (GitHub Repository), and Tweet Congress was also collected on 2019-03-01. We then manually compiled a CSV file with the members of Congress and the presence of their Twitter handles in all the different sources. The reason for manual compilation of the list was largely due to discrepancies in the names of the members of Congress across the different sources under consideration.
  • Some of the members of Congress use diacritic characters. For example, Wikipedia and Tweet Congress have the name of a representative from New York as Nydia Velázquez, while Twitter and the @unitedstates repository have her name as Nydia Velazquez.
Screenshot from Wikipedia showing Nydia Velazquez, representative from New York using diacritic characters

Screenshot from Twitter for Rep. Nydia Velazquez from New York without diacritic characters
  • Some of the members of Congress have abbreviated middle names or suffixes in their names. For example, Wikipedia has the name of a representative from Tennessee as Mark E. Green while Tweet Congress has his name as Mark Green.
Screenshot from Wikipedia for Rep. Mark Green from Tennessee with his middle name


Screenshot from Twitter for Rep. Mark Green from Tennessee without his middle name
Screenshot from Tweet Congress for Rep. Mark Green from Tennessee without his middle name
Screenshot from Wikipedia for Rep. Chuck Fleischmann from Tennessee using his nick name
Screenshot from Twitter for Rep. Chuck Fleischmann from Tennessee using his nick name
Screenshot from Tweet Congress for Rep. Chuck Fleischmann from Tennessee using his given name
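We compiled the list manually, but some of these discrepancies could be reduced automatically before a manual pass.  The Python sketch below is one possible normalization, not the procedure we used: name_key is a hypothetical helper that strips diacritics, punctuation, single-letter middle initials, and a small assumed set of suffixes.  It cannot reconcile a nickname with a given name, which would still need a hand-maintained alias table.

```python
import re
import unicodedata

# Assumed set of name suffixes to ignore when matching.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_key(name):
    """Build a comparison key: no diacritics, no punctuation, no middle initials or suffixes."""
    # Decompose accented letters (e.g. "á" -> "a" + combining accent), then drop the accents.
    decomposed = unicodedata.normalize("NFKD", name.replace("_", " "))
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = [t for t in re.findall(r"[a-z]+", ascii_only.lower()) if t not in SUFFIXES]
    if len(tokens) > 2:
        # Keep first and last tokens; drop single-letter tokens (initials) in between.
        tokens = [tokens[0]] + [t for t in tokens[1:-1] if len(t) > 1] + [tokens[-1]]
    return " ".join(tokens)


print(name_key("Nydia Velázquez"))  # nydia velazquez
print(name_key("Nydia Velazquez"))  # nydia velazquez
print(name_key("Mark E. Green"))    # mark green
print(name_key("Mark Green"))       # mark green
```

Matching on such keys would leave only the nickname cases and genuinely different names for manual review.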

What did we learn from our analysis?


As of 2019-03-01, the US Congress had 538 members of 541 with three vacant representative positions. The three vacant positions include the third and ninth Congressional Districts of North Carolina and the twelfth Congressional District of Pennsylvania. Of the 538 members of Congress, 537 have Twitter accounts while the non-voting member from Guam, Michael San Nicolas, has no Twitter account.


Name             Position  Joined Congress  CSPAN  @unitedstates  TweetCongress  Remark
Collin Peterson  Rep.      1991-01-03       F      F              F              @collinpeterson
Greg Gianforte   Rep.      2017-06-21       F      F              F              @GregForMontana
Gregorio Sablan  Del.      2019-01-03       F      T              T
Rick Scott       Sen.      2019-01-08       T      !T             F
Tim Kaine        Sen.      2013-01-03       T      !T             F
James Comer      Rep.      2016-11-08       T      !T             F
Justin Amash     Rep.      2011-01-03       T      !T             F
Lacy Clay        Rep.      2001-01-03       T      !T             F
Bill Cassidy     Sen.      2015-01-03       T      !T             T
Members of the 116th Congress whose Twitter handles are missing from either one or all of the sources. T represents both name and Twitter handle present, !T represents name present but Twitter handle missing, and F represents both the name and Twitter handle missing.
  • CSPAN has Twitter handles for 534 members of Congress out of the 537 members of Congress with two representatives and a non-voting member missing from its list. The absentees from the list are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), and Delegate Gregorio Sablan (@Kilili_Sablan).
  • The GitHub repository, @unitedstates, has Twitter handles for 529 members of Congress out of the 537 members of Congress with five representatives and three senators missing from its data set. The absentees from the repository are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), Sen. Rick Scott (@SenRickScott), Sen. Tim Kaine (@timkaine), Rep. James Comer (@KYComer), Rep. Justin Amash (@justinamash), Rep. Lacy Clay (@LacyClayMO1), and Sen. Bill Cassidy (@SenBillCassidy).
  • Tweet Congress has Twitter handles for 530 members of Congress out of the 537 members of Congress with five representatives and two senators missing.  The absentees are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), Sen. Rick Scott (@SenRickScott), Sen. Tim Kaine (@timkaine), Rep. James Comer (@KYComer), Rep. Justin Amash (@justinamash), and Rep. Lacy Clay (@LacyClayMO1).
The combined list of Twitter handles for the members of Congress from all the sources has two representatives missing, namely Collin Peterson, who has been a representative from Minnesota since 1991-01-03, and Greg Gianforte, who has been a representative from Montana since 2017-06-21. The combined list from all the sources also has six members of Congress who have different Twitter handles in different sources.
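The per-source counts and the two members missing everywhere follow directly from the absentee lists above.  As a sanity check, here is a small Python sketch using sets keyed by surname (the surnames and the 537 total come from the text; the variable names are mine).

```python
# Members on Twitter whose handles are absent from each source, by surname.
missing = {
    "CSPAN": {"Peterson", "Gianforte", "Sablan"},
    "@unitedstates": {
        "Peterson", "Gianforte", "Scott", "Kaine",
        "Comer", "Amash", "Clay", "Cassidy",
    },
    "Tweet Congress": {
        "Peterson", "Gianforte", "Scott", "Kaine",
        "Comer", "Amash", "Clay",
    },
}

ON_TWITTER = 537  # members of Congress with a Twitter account as of 2019-03-01

# Handles covered by each source.
coverage = {source: ON_TWITTER - len(absent) for source, absent in missing.items()}
print(coverage)  # {'CSPAN': 534, '@unitedstates': 529, 'Tweet Congress': 530}

# Members absent from every source stay missing even after combining the lists.
missing_everywhere = set.intersection(*missing.values())
print(sorted(missing_everywhere))  # ['Gianforte', 'Peterson']
```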


Name             Position  Joined Congress  CSPAN           @unitedstates + TweetCongress
Chris Murphy     Sen.      2013-01-03       @ChrisMurphyCT  @senmurphyoffice
Marco Rubio      Sen.      2011-01-03       @marcorubio     @SenRubioPress
James Inhofe     Sen.      1994-11-16       @JimInhofe      @InhofePress
Julia Brownley   Rep.      2013-01-03       @RepBrownley    @JuliaBrownley26
Seth Moulton     Rep.      2015-01-03       @Sethmoulton    @teammoulton
Earl Blumenauer  Rep.      1996-05-21       @repblumenauer  @BlumenauerMedia
Members of the 116th Congress who have different Twitter handles in different sources

Possible reasons for disagreement in creating a Members of Congress Twitter handles data set


Scenarios involved in creating Twitter handles for members of Congress when done over a period of time

One Seat - One Member - One Twitter Handle: When creating our data set of Twitter handles for members of Congress over a period of time, the perfect situation is where we have one seat in the Congress which is held by one member for the entire congress tenure who holds one Twitter account. For example, Amy Klobuchar, senator from Minnesota has only one Twitter account @amyklobuchar.

Google search screenshot for Sen. Amy Klobuchar's Twitter account
Twitter screenshot for Sen. Amy Klobuchar's Twitter account

One Seat - One Member - No Twitter Handle: When creating our data set of Twitter handles for members of Congress over a period of time, we have one seat in Congress which is held by one member for the entire congress tenure and does not have a Twitter account. For example, Michael San Nicolas, delegate from Guam has no Twitter account.

Screenshot from Congressman Michael San Nicolas page showing a Twitter link for HouseDems Twitter account while the rest of the social media icons are linked to his personal accounts

One Seat - One Member - Multiple Twitter Handles: When creating our data set of Twitter handles for members of Congress over a period of time, we have one seat in Congress which is held by one member for the entire congress tenure who has more than one Twitter account. A member of Congress can have multiple Twitter accounts. Based on the purpose of the Twitter accounts they can be classified as Personal, Official, and Campaign accounts.

  • Personal Account: A Twitter account used by the members of Congress to tweet their personal thoughts can be referred to as a personal account. A majority of these accounts might have a creation date prior to when they were elected to the Congress. For example, Marco Rubio, a Senator from Florida created his Twitter account @marcorubio in August, 2008 while he was sworn in to Congress on 2011-01-03.
Screenshot for the Personal Twitter account of Sen. Marco Rubio from Florida. The account was created in August, 2008 while he was sworn in to Congress on 2011-01-03.
  • Official Account: A Twitter account used by a member of Congress or their staff to tweet official information for the general public about the member's activity is referred to as an official account. A majority of these accounts' creation dates will be close to the date on which the member of Congress was elected. For example, Marco Rubio, a Senator from Florida, has a Twitter account @senrubiopress which has a creation date of December, 2010, while he was sworn in to Congress on 2011-01-03.
Screenshot for the Official Twitter account of Sen. Marco Rubio from Florida. The account was created in December, 2010 while he was sworn in to Congress on 2011-01-03.
  • Campaign Account: A Twitter account used by a member of Congress for election campaigning is referred to as a campaign account. For example, Rep. Greg Gianforte from Montana has a Twitter account, @gregformontana, which contains tweets related to his re-election campaigns.
Twitter Screenshot for the Campaign account of Rep. Greg Gianforte from Montana which contains tweets related to his re-election campaigns.
Twitter Screenshot for the Personal account of Rep. Greg Gianforte from Montana which has personal tweets from him. 

One Seat - Multiple Members - Multiple Twitter Handles: When creating our data set of Twitter handles for members of Congress over a period of time, we can have a seat in Congress which is held by different members, with different Twitter accounts, at different points during the tenure of Congress. An example from the 115th Congress is the Alabama Senator situation between January 2017 and July 2018. On February 9, 2017, Jeff Sessions resigned as senator and was succeeded by the Alabama Governor's appointee, Luther Strange. Following the special election, Luther Strange left office on January 3, 2018 to make way for Doug Jones as the Senator from Alabama. Now, who do we include as the Senator from Alabama for the 115th Congress? We might decide to include all of them based on the dates they joined or left office, but if this analysis is done a year later, who will provide the historical information for the Congress that is currently in session? As of now, all the sources we analyzed try to provide the most recent information rather than historical information about the current Congress and its members over the entire tenure.
  
Alabama Senate seat situation between January 2017 and July 2018. It highlights the issue in context of Social Feed Manager's 115th Congress tweet dataset.  
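One way to keep this history is to store each seat as a list of dated terms rather than a single current holder.  The Python sketch below is only an illustration of that idea: the transition dates are the ones given in the text above, the start and end of the 115th Congress (2017-01-03 and 2019-01-03) are filled in as the outer bounds, and ALABAMA_SEAT and holder_on are hypothetical names.  A real data set would attach Twitter handles to each term.

```python
from datetime import date

# Half-open [start, end) terms for one Alabama Senate seat during the 115th Congress.
# Transition dates follow the text; the outer bounds are the Congress start and end.
ALABAMA_SEAT = [
    ("Jeff Sessions", date(2017, 1, 3), date(2017, 2, 9)),
    ("Luther Strange", date(2017, 2, 9), date(2018, 1, 3)),
    ("Doug Jones", date(2018, 1, 3), date(2019, 1, 3)),
]


def holder_on(day):
    """Return who held the seat on `day`, or None if the date is outside the 115th Congress."""
    for name, start, end in ALABAMA_SEAT:
        if start <= day < end:
            return name
    return None


print(holder_on(date(2017, 1, 15)))  # Jeff Sessions
print(holder_on(date(2017, 6, 1)))   # Luther Strange
print(holder_on(date(2018, 7, 1)))   # Doug Jones
```

With terms stored this way, tweets collected on a given date can be attributed to whoever held the seat on that date.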
One of the other issues worth mentioning is when members of Congress change their Twitter handle. An example for this scenario is when Rep. Alexandria Ocasio-Cortez from New York tweeted on 2018-12-28 about changing her Twitter handle from @ocasio2018 to @aoc. In the case of popular Twitter accounts for members of Congress, it is easy to discover a change of handle, but for a member of Congress who is not popular on Twitter, the change might go unnoticed for quite some time.

Screenshot of memento for @Ocasio2018
Screenshot of memento which shows the announcement for change of Twitter handle from @Ocasio2018 to @aoc 
Screenshot of @aoc

Twitter Handle Data Set for the 116th Congress

  • We have created a data set of Twitter handles for the 116th Congress which resolves the issues of CSPAN, Tweet Congress, and @unitedstates (GitHub repository).
  • We have Twitter handles for all the current 537 members of Congress who are on Twitter, which is every current member except for one delegate from Guam who does not have a Twitter account.
  • Unlike other sources, our data set does not include any members of Congress who are not a part of the 116th Congress.
  • In case of conflicts of Twitter handles for members of Congress from different sources under investigation, we chose accounts which were personally managed by the member of Congress (Personal Twitter Account) over accounts which were managed by their teams or used for campaign purposes (Official or Campaign Accounts). The reason for choosing personal accounts over official or campaign accounts is that some of the members of Congress explicitly mention in the Twitter biography of their personal accounts that all the tweets are their own, which is not reflected in the Twitter biography of their official or campaign accounts.
Twitter Screenshot of the Personal account for Rep. Seth Moulton where he states that all the tweets are his own in his Twitter bio.

| Name | Position | WSDL Data set | CSPAN | @unitedstates + TweetCongress |
|---|---|---|---|---|
| Chris Murphy | Sen. | @ChrisMurphyCT | @ChrisMurphyCT | @senmurphyoffice |
| Marco Rubio | Sen. | @marcorubio | @marcorubio | @SenRubioPress |
| James Inhofe | Sen. | @JimInhofe | @JimInhofe | @InhofePress |
| Julia Brownley | Rep. | @RepBrownley | @RepBrownley | @JuliaBrownley26 |
| Seth Moulton | Rep. | @Sethmoulton | @Sethmoulton | @teammoulton |
| Earl Blumenauer | Rep. | @repblumenauer | @repblumenauer | @BlumenauerMedia |

Members of the 116th Congress who have different Twitter handles in different sources. The WSDL data set prefers personal Twitter handles over official Twitter handles.
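Disagreements like the ones in the table can be found mechanically by comparing the per-source lists. The sketch below is only an illustration of that comparison, not the procedure we actually used; the source keys and the `find_conflicts` function are hypothetical names, the handles are copied from the table, and handles are compared case-insensitively because Twitter treats them that way.

```python
from typing import Dict

# Handles copied from the table above; the dictionary keys are illustrative labels.
SOURCES: Dict[str, Dict[str, str]] = {
    "cspan": {
        "Chris Murphy": "@ChrisMurphyCT",
        "Seth Moulton": "@Sethmoulton",
    },
    "unitedstates_tweetcongress": {
        "Chris Murphy": "@senmurphyoffice",
        "Seth Moulton": "@teammoulton",
    },
}


def find_conflicts(sources: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return {member: {source: handle}} for members whose handles differ across sources."""
    members = set()
    for handles in sources.values():
        members.update(handles)

    conflicts = {}
    for member in sorted(members):
        per_source = {name: handles[member]
                      for name, handles in sources.items() if member in handles}
        # Twitter handles are case-insensitive, so normalize before comparing.
        if len({handle.lower() for handle in per_source.values()}) > 1:
            conflicts[member] = per_source
    return conflicts
```

Each reported conflict still needs a manual decision (personal versus official or campaign account), since that distinction is not visible from the handle alone.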

Conclusion


None of the three sources (Tweet Congress, the @unitedstates GitHub repository, and CSPAN) has full coverage of the Twitter handles for the members of the 116th Congress. One member of Congress does not have a Twitter account, and two additional members do not have their Twitter handles present in any of the sources. No source provides historical information about the members of Congress over the entire tenure of the Congress, as all of them focus on recency. Creating a data set of Twitter handles for members of Congress seems an easy task at first glance, but it is considerably more difficult because the sources disagree for multiple reasons when the study spans a period of time. We share a data set of Twitter handles for the 116th Congress built by combining all the lists.

https://github.com/oduwsdl/US-Congress

----
Mohammed Nauman Siddique
@m_nsiddique

Monday, March 18, 2019

2019-03-18: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages

Figure 1: Mixed language blocks on a memento of a Twitter timeline. Highlighted with blue colored box for Portuguese, orange for English, and red for Urdu. Dotted border indicates the template present in the original HTML response while blocks with solid borders indicate lazily loaded content.

Would you be surprised if I were to tell you that Twitter is a multi-lingual website, supporting 47 different international languages? How about if I were to tell you that a usual Twitter timeline page can not only contain tweets in whatever languages the owner of the handle chooses to tweet in, but can also show the navigation bar and various sidebar blocks in many different languages simultaneously? Surprised now? Well, while it makes no sense, it can actually happen in web archives when a memento of a timeline is accessed, as shown in Figure 1. Spoiler alert! Cookies are to blame, once again.

Last month, I was investigating a real life version of "Ron Burgundy will read anything on the teleprompter (Anchorman)" and "Chatur's speech (3 Idiots)" moments, when I noticed something that caught my eye. I was looking at a memento (i.e., a historical version of a web page) of Pratik Sinha's Twitter timeline from the Internet Archive. Pratik is the co-founder of Alt News (an Indian fact checking website) and the person who edited an internal document of the IT Cell of BJP (the current ruling party of India), which was then copy-pasted and tweeted by some prominent handles of the party. Tweets on his timeline are generally in English, but the archived page's template language was not English (although I did not request the page in any specific language). However, this was not surprising to me, as we had already investigated the reason behind this template language behavior last year and found that HTTP cookies were causing it. After I had spent a minute or so on the page, a small notice appeared in the main content area, right above the list of tweets, suggesting that there were 20 more tweets, but the message was in Urdu, a Right-to-Left (RTL) language, very different from the language used in the page's navigation bar. Because Urdu is my first language, this immediately alerted me that something was not quite right. Upon further investigation, I found that the page was composed of three different languages (Portuguese, English, and Urdu), as highlighted in Figure 1 (here I am not talking about the language of the tweets themselves).

What Can Deface a Composite Memento?


This defaced composite memento is a serious archival replay problem, as it shows a web page that perhaps never existed. While the individual representations all separately existed on the live web, they were never combined in the page as it is replayed by the web archive. In the Web Science and Digital Libraries Research Group, we have uncovered a couple of causes in the past that can yield defaced composite mementos. One of them is live-leakage (also known as Zombies), for which Andy Jackson proposed we should use Content-Security-Policy, Ada Lerner et al. took a security-centric approach that was deployed by the Internet Archive's Wayback Machine, and we proposed Reconstructive as a potential solution using Service Worker. The other known cause is temporal violations, on which Scott Ainsworth is working for his PhD research. However, this mixed-language Twitter timeline issue can be explained by neither zombies nor temporal violations.

Anatomy of a Twitter Timeline


To uncover the cause, I further investigated the anatomy of a Twitter timeline page and the various network requests it makes when accessed live or from a web archive, as illustrated in Figure 2. Currently, when a Twitter timeline is loaded anonymously (without logging in), the page is returned with a block containing a brief description of the user, a navigation bar (containing a summary of the numbers of tweets, followers, etc.), a sidebar block to encourage visitors to create a new account, and an initial set of tweets. The page also contains empty placeholders for some sidebar blocks such as related users to follow, globally trending topics, and recent media posted on that timeline. Apart from loading page requisites, the page also makes some follow-up XHR requests to populate these blocks. When the page is active (i.e., the browser tab is focused) it polls for new tweets every 30 seconds and for global trends every 5 minutes. Successful responses to these asynchronous XHR requests contain data in JSON format, but instead of providing language-independent, bare-bones structured data to rendering templates on the client-side, they contain some server-side rendered, encoded markup. This markup is then decoded on the client-side and directly injected into the corresponding empty placeholders (or used to replace any existing content), and then the block is set to visible. This server-side partial markup rendering needs to know the language of the parent page in order to use phrases translated into the corresponding language and yield a consistent page.

Figure 2: An active Twitter timeline page asynchronously populates related users and recent media blocks then polls for new tweets every 30 seconds and global trends every 5 minutes.

How Does Twitter's Language Internationalization Work?


From our past investigation we know that Twitter handles languages in two primary ways, a query parameter and a cookie header. In order to fetch a page in a specific language (from their 47 currently supported languages) one can either add a "?lang=<language-code>" query parameter in the URI (e.g.,
https://twitter.com/ibnesayeed?lang=ur
for Urdu) or send a Cookie header containing the "lang=<language-code>" name/value pair. A URI query parameter takes precedence in this case and also sets the "lang" Cookie accordingly (overwriting any existing value) for all the subsequent requests until overwritten again explicitly. This works well on the live site, but has some unfortunate consequences when a memento of a Twitter timeline is replayed from a web archive, causing this hodgepodge illustrated in Figure 1 (area highlighted by dotted border indicates the template served in the initial HTML response while areas surrounded with solid border were lazily loaded). This mixed-language rendering does not happen when a memento of a timeline is loaded with an explicit language query parameter in the URI as illustrated in Figures 3, 4, and 5 (the "lang" query parameter is highlighted in the archival banner and also the lazily loaded blocks from each language that corresponds to the blocks in Figure 1). In this case, all the subsequent XHR URIs also contain the explicit "lang" query parameter.
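The precedence described above can be written down as a small function. This is only a sketch of the observed behavior, not Twitter's implementation; `resolve_language` and `DEFAULT_LANG` are hypothetical names, and the English default is an assumption standing in for whatever Twitter chooses when neither signal is present.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import parse_qs, urlsplit

DEFAULT_LANG = "en"  # assumption: stand-in for the server's own default choice


def resolve_language(uri: str, cookies: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (response language, new value for the "lang" cookie or None)."""
    query = parse_qs(urlsplit(uri).query)
    if "lang" in query:
        lang = query["lang"][0]
        # The query parameter wins and also overwrites the cookie.
        return lang, lang
    if "lang" in cookies:
        # No query parameter: fall back to the previously set cookie.
        return cookies["lang"], None
    return DEFAULT_LANG, None
```

The second element of the result is what makes the behavior sticky: once a "?lang=ur" URI has been fetched, every later request that carries the cookie is answered in Urdu, whether or not it asks for a language.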

Figure 3: A memento of a Twitter timeline explicitly in Portuguese.

Figure 4: A memento of a Twitter timeline explicitly in English.

Figure 5: A memento of a Twitter timeline explicitly in Urdu. The direction of the page is Right-to-Left (RTL), as a result, sidebar blocks are moved to the left hand side.

To understand the issue, consider the following sequence of events during the crawling of a Twitter timeline page. Suppose we begin a fresh crawling session and start by fetching the https://twitter.com/ibnesayeed page without any specific language code supplied. Depending on the geo-location of the crawler or other factors, Twitter might return the page in a specific language, for instance, in English. The crawler extracts links to all the page requisites and hyperlinks to add them to the frontier queue. The crawler may also attempt to extract URIs of potential XHR or other JS-initiated requests, which might add URIs like:
https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module
and
https://twitter.com/i/related_users/28631536
(and various other lazily loaded resources) to the frontier queue. The HTML page also contains 47 language-specific alternate links (and one x-default hreflang) in its markup (with "?lang=<language-code>" style parameters). These alternate links will also be added to the frontier queue of the crawler in some order. When these language-specific links are fetched by the crawler, the lang Cookie will be set, overwriting any prior value. Now, suppose https://twitter.com/ibnesayeed?lang=ur was fetched before the "/i/trends" data; it would set the language for any subsequent requests to be served in Urdu. When the data for the global trends block is fetched, Twitter's server will return server-side rendered markup in Urdu, which will be injected into the page that was initially served in English. This will cause the header of the block to say "دنیا بھر کے میں رجحانات" instead of "Worldwide trends". Here, I would take a long pause of silence to express my condolences on the brutal murder of a language with more than 100 million speakers worldwide by a platform as big as Twitter. The Urdu translation of this phrase, appearing in such a prominent place on the page, is nonsense and grammatically wrong. Twitter, if you are listening, please change it to something like "عالمی رجحانات" and get an audit of other translated phrases. Now, back to the original problem; the following is a walk-through of the scenario described above.

$ curl --silent "https://twitter.com/ibnesayeed" | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">
$ curl --silent -c /tmp/twitter.cookie "https://twitter.com/ibnesayeed?lang=ur" | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">
$ grep lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0 lang ur
$ curl --silent -b /tmp/twitter.cookie "https://twitter.com/ibnesayeed" | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">
$ curl --silent -b /tmp/twitter.cookie "https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module" | jq
{
  "module_html": "<div class=\"flex-module trends-container context-trends-container\">\n  <div class=\"flex-module-header\">\n    \n    <h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>\n  </div>\n  <div class=\"flex-module-inner\">\n    <ul class=\"trend-items js-trends\">\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#PiDay\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/PiDay?src=tren&amp;data_id=tweet%3A1106214111183020034\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#PiDay</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Google employee sets new record for calculating π to 31.4 trillion digits\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#SaveODAAT\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/SaveODAAT?src=tren&amp;data_id=tweet%3A1106252880921747457\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#SaveODAAT</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Netflix cancels One Day at a Time after three seasons\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  
context-trend-item\"\n    data-trend-name=\"Beto\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=Beto&amp;src=tren&amp;data_id=tweet%3A1106142158023786496\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Beto</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Beto O’Rourke announces 2020 presidential bid\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#AvengersEndgame\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/AvengersEndgame?src=tren&amp;data_id=tweet%3A1106169765830295552\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#AvengersEndgame</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Marvel dropped a new Avengers: Endgame trailer\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"12 Republicans\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%2212%20Republicans%22&amp;src=tren\"\n        
data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">12 Republicans</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          6,157 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#NationalAgDay\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/NationalAgDay?src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#NationalAgDay</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          6,651 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Kyle Guy\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%22Kyle%20Guy%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Kyle Guy</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          1,926 ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"#314Day\"\n    data-trends-id=\"1025618545345384837\"\n    
data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/hashtag/314Day?src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#314Day</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          12 ہزار ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Tillis\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=Tillis&amp;src=tren&amp;data_id=tweet%3A1106266707230777344\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Tillis</span>\n\n      \n      <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          Senate votes to block Trump&#39;s border emergency declaration\n        </div>\n    </a>\n\n</li>\n\n        <li class=\"trend-item js-trend-item  context-trend-item\"\n    data-trend-name=\"Bikers for Trump\"\n    data-trends-id=\"1025618545345384837\"\n    data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n    \n  >\n\n    <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n        href=\"/search?q=%22Bikers%20for%20Trump%22&amp;src=tren\"\n        data-query-source=\"trend_click\"\n        \n      >\n      <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Bikers for Trump</span>\n\n      \n      <div 
class=\"js-nav trend-item-context js-ellipsis\"></div>\n        <div class=\"js-nav trend-item-stats js-ellipsis\">\n          16.8 ہزار ٹویٹس\n        </div>\n    </a>\n\n</li>\n\n    </ul>\n  </div>\n</div>\n",
  "personalized": false,
  "woeid": 1
}

Here, I started by fetching my Twitter timeline without specifying any language in the URI or via cookies. The response was returned in English. I then fetched the same page with an explicit "?lang=ur" query parameter and saved any returned cookies in the "/tmp/twitter.cookie" file. This illustrated that the response was indeed returned in Urdu. I then checked the saved cookie file to see if it contained a "lang" cookie, which it did, with a value of "ur". I then used the saved cookie file to fetch the main timeline page again, but without an explicit "?lang=ur" query parameter, to illustrate that Twitter's server respects the cookie and returns the response in Urdu. Finally, I fetched the global trends data while using the saved cookies and illustrated that the response contains JSON-serialized HTML markup with the Urdu header text in it as
"<h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>"
under the "module_html" JSON key. The original response is encoded using Unicode escapes, but we used the jq utility here to pretty-print the JSON and decode the escaped markup for easier illustration.
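The same check can be done programmatically when many captured responses need to be inspected. The following is a minimal sketch using only the Python standard library; `TrendHeaderParser` and `trend_header` are hypothetical names, and the sample payload is a heavily shortened stand-in for the response shown above (it keeps only the header markup and the two scalar keys).

```python
import json
from html.parser import HTMLParser

HEADER_UR = "دنیا بھر کے میں رجحانات"

# Shortened stand-in for the "/i/trends" response; json.dumps escapes the
# non-ASCII text as \uXXXX sequences, much like the original response.
SAMPLE = json.dumps({
    "module_html": '<h3><span class="trend-location js-trend-location">'
                   + HEADER_UR + "</span></h3>",
    "personalized": False,
    "woeid": 1,
})


class TrendHeaderParser(HTMLParser):
    """Collect the text of the first span carrying the js-trend-location class."""

    def __init__(self):
        super().__init__()
        self._in_header = False
        self.header = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "js-trend-location" in classes:
            self._in_header = True

    def handle_data(self, data):
        if self._in_header and self.header is None:
            self.header = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_header = False


def trend_header(payload: str):
    """Decode the JSON payload and return the header text of the trends block."""
    parser = TrendHeaderParser()
    parser.feed(json.loads(payload)["module_html"])
    parser.close()
    return parser.header
```

Comparing the extracted header against the template language of the parent page is one way to flag a mismatch without rendering the page.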

Understanding Cookie Violations


When fetching a single page (and all its page requisites) at a time, this problem (let's name it a cookie violation) might not happen as often. However, when crawling is done on a large scale, completely preventing such unfortunate misalignment in the frontier queue is almost impossible, especially since the "lang" cookie is set for the root path of the domain and affects every resource from the domain.
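The effect of frontier ordering can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not a model of any real crawler: `fake_server` is a hypothetical stand-in for Twitter that only implements the "lang" behavior described earlier (with English assumed as the default), and `crawl` fetches the frontier in order with one shared cookie jar.

```python
from typing import Dict, List, Tuple
from urllib.parse import parse_qs, urlsplit


def fake_server(uri: str, jar: Dict[str, str]) -> str:
    """Toy stand-in for Twitter: return the response language and update the jar."""
    query = parse_qs(urlsplit(uri).query)
    if "lang" in query:
        jar["lang"] = query["lang"][0]   # query parameter overwrites the cookie
    return jar.get("lang", "en")         # assumption: English when nothing is set


def crawl(frontier: List[str]) -> List[Tuple[str, str]]:
    """Fetch URIs in order with one shared cookie jar; record each response language."""
    jar: Dict[str, str] = {}
    return [(uri, fake_server(uri, jar)) for uri in frontier]


FRONTIER = [
    "https://twitter.com/ibnesayeed",
    "https://twitter.com/ibnesayeed?lang=ur",
    "https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module",
]
```

With this ordering the timeline is captured in English while the trends block is captured in Urdu. Swapping the last two entries yields a consistent English pair, which is why the outcome depends on how the frontier happens to be ordered.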

The root cause here can more broadly be described as lossy state information being utilized when replaying, from archives, a stateful resource representation that originally involved content negotiation based on cookies or other headers. Most of the popular archival replay systems (e.g., OpenWayback, PyWB, and even our own InterPlanetary Wayback) do not perform any content negotiation when serving a memento other than on the Accept-Datetime header (which is not part of the original crawl-time interaction, but a means to add the time dimension to the web). Traditional archival crawlers (such as Heritrix) mostly interacted with web servers by using only URIs without any custom request headers that might affect the returned response. This means that, generally, a canonicalized URI along with the datetime of the capture was sufficient to identify a memento. However, cookies are an exception to this assumption as they are needed for some sites to behave properly, hence cookie management support was added to these crawlers a long time ago. Cookies can be used for tracking, client-side configurations, key/value stores, and authentication/authorization session management, but in some cases they can also be used for content negotiation (as is the case with Twitter). When cookies are used for content negotiation, the server should advertise it in the "Vary" header, but Twitter does not. Accommodating cookies at capture/crawl time, but not utilizing them at replay time, has this consequence of cookie violations, resulting in defaced composite mementos. Similarly, in aggregated personal web archiving, which is the PhD research topic of Mat Kelly, not utilizing session cookies (or other forms of authorization headers) at replay time can result in a serious security vulnerability of private content leakage. 
In modern headless browser-based crawlers there might even be some custom headers that a site utilizes in XHR (or fetch API) for content negotiation, which should be considered when indexing the content for replay (or filtering at replay time from a subset). Ideally, a web archive should behave like an HTTP proxy/cache when it comes to content negotiation, but it may not always be feasible.
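To make the proxy/cache analogy concrete, here is a minimal sketch of how an HTTP cache distinguishes variants: the lookup key combines the URI with the values of the request headers named in the response's Vary header. The `variant_key` function is a hypothetical name, not part of any replay system mentioned here; the sketch ignores "Vary: *" and leaves out the capture datetime that an archive would also need in the key.

```python
from typing import Dict, Tuple


def variant_key(uri: str, request_headers: Dict[str, str], vary: str) -> Tuple:
    """Build a lookup key from the URI plus the request headers named in Vary."""
    # Header names are case-insensitive, so normalize both sides.
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return (uri,) + tuple((name, lowered.get(name, "")) for name in names)


URI = "https://twitter.com/ibnesayeed"
urdu = {"Cookie": "lang=ur"}
english = {"Cookie": "lang=en"}

# With "Vary: Cookie" the two representations get distinct keys.
distinct = variant_key(URI, urdu, "Cookie") != variant_key(URI, english, "Cookie")

# Without a Vary header they collapse into one key, even though the
# server returned different representations.
collapsed = variant_key(URI, urdu, "") == variant_key(URI, english, "")
```

Since the "Vary" header is not advertised in this case, a system relying on it alone would still collapse the Urdu and English variants, which is part of why this may not always be feasible.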

What Should We Do About It?


So, should we include cookies in the replay index and only return a memento if the cookies in the request headers match? Well, that will be a disaster as it will cause an enormous number of false negatives (i.e., mementos that are present in an archive and should be returned, but won't be). Perhaps we can canonicalize cookies and only index ones that are authentication/authorization session-related or used for content negotiation. However, identifying such cookies will be difficult and will require some heuristic analysis or machine learning, because these are opaque strings and their names are decided by the server application (rather than using any standardized names).

Even if we can somehow sort this issue out, there are even bigger problems in making it work. For example, how do we get the client to send suitable cookies in the first place? How will the web archive know when to send a "Set-Cookie" header? Should the client follow the exact path of interactions with pages as the crawler did when a set of pages was captured, in order to set appropriate cookies?

Let's ignore session cookies for now and only focus on the content negotiation related cookies. Also, let's relax the cookie matching condition further by only filtering mementos if a Cookie header is present in a request, and otherwise ignoring cookies from the index. This means the replay system can send a Set-Cookie header if the memento in question was originally observed with a Set-Cookie header and expect to see it in the subsequent requests. Sounds easy? Welcome to the cookie collision hell. Cookies from various domains will need to be rewritten to set the domain name of the web archive that is serving the memento. As a result, the same cookie names from various domains served over time from the same archive will step on each other (it's worth mentioning that often a single web page has page requisites from many different domains). Even the web archive can have some of its own cookies independent of the memento being served.

We can attempt to solve this collision issue by rewriting the path of cookies and prefixing it with the original domain name to limit the scope (e.g., change
"Set-Cookie: lang=ur; Domain=twitter.com; Path=/"
to
"Set-Cookie: lang=ur; Domain=web.archive.org; Path=/twitter.com/"
). This is not going to work because the client will not send this cookie unless the requested URI-M path has a prefix of "/twitter.com/", but the root path of Twitter is usually rewritten as something like "/web/20190214075028/https://twitter.com/" instead. If the same rewriting rule is used in the cookie path, then the unique 14-digit datetime path segment will block the cookie from being sent with subsequent requests that have a different datetime (which is almost always the case after an initial redirect). Unfortunately, the cookie Path attribute does not support wildcard paths like "/web/*/https://twitter.com/".
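The reason these rewritten paths fail follows directly from how clients match cookie paths. The sketch below implements the path-match rule from RFC 6265 (section 5.1.4) as a standalone function; `path_matches` is a hypothetical name, and the URI-M paths are built from the datetimes that appear in this post.

```python
def path_matches(request_path: str, cookie_path: str) -> bool:
    """RFC 6265 path-match: a plain prefix comparison with no wildcard support."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # The cookie path must end at a "/" boundary of the request path.
        if cookie_path.endswith("/"):
            return True
        if request_path[len(cookie_path)] == "/":
            return True
    return False


TIMELINE = "/web/20190214075028/https://twitter.com/ibnesayeed"
TRENDS = "/web/20190227220450/https://twitter.com/i/trends"

# Prefixing with the original domain never matches a URI-M path.
domain_prefixed = path_matches(TIMELINE, "/twitter.com/")

# Including the datetime matches only captures with the same datetime.
same_datetime = path_matches(TIMELINE, "/web/20190214075028/https://twitter.com/")
other_datetime = path_matches(TRENDS, "/web/20190214075028/https://twitter.com/")

# "*" is compared literally, so a wildcard path matches nothing useful.
wildcard = path_matches(TRENDS, "/web/*/https://twitter.com/")
```

Only a path as broad as "/web/" matches both URI-Ms, and at that point the cookie is no longer scoped to the original domain.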

Another possibility could be prefixing the name of the cookie with the original domain [and path] (with some custom encoding and unique-enough delimiters) and then setting the path to the root of the replay (e.g., change the above example to
"Set-Cookie: twitter__com___lang=ur; Domain=web.archive.org; Path=/web/"
), which the replay server understands how to decode and apply properly. I am not aware of any other attributes of cookies that can be exploited to annotate them with additional information. The downside of this approach is that if the client relies on these cookies for certain functionality, then the changed names will affect it.
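A replay server using this scheme needs an encoding that it can reverse. The following is a minimal sketch that reproduces the example name above; the choice of "__" for dots and "___" as the delimiter is inferred from that example, the function names are hypothetical, and the sketch assumes the domain contains no underscores and no leading or trailing dots (paths are not handled).

```python
from typing import Tuple

DOT = "__"         # stands in for "." in the original domain
DELIMITER = "___"  # separates the encoded domain from the original cookie name


def encode_cookie_name(domain: str, name: str) -> str:
    """Prefix a cookie name with its original domain, e.g. for replay-time Set-Cookie."""
    return domain.replace(".", DOT) + DELIMITER + name


def decode_cookie_name(encoded: str) -> Tuple[str, str]:
    """Recover (original domain, original cookie name) from an encoded name."""
    # Split on the first delimiter only, so names that themselves start
    # with underscores (such as "_twitter_sess") survive the round trip.
    prefix, name = encoded.split(DELIMITER, 1)
    return prefix.replace(DOT, "."), name
```

On an incoming request, the replay server would decode each cookie name and keep only those whose domain matches the original URI of the requested memento.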

Additionally, an archival replay system should also rewrite the cookie expiration time to a short-lived future value (irrespective of the original value, which could be in the past or in the very distant future); otherwise, the growing pile of cookies from many different pages will increase the request size significantly over time. Moreover, incorporating cookies in replay systems will have some consequences for cross-archive aggregated memento reconstruction.

In our previous post about another cookie related issue, we proposed that explicitly expiring cookies (and garbage collecting cookies older than a few seconds) may reduce the impact. We also proposed that distributing crawl jobs of the URIs from the same domain in smaller sandboxed instances could minimize the impact. I think these two approaches can be helpful in mitigating this mixed-language issue as well. However, it is worth noting that these are crawl-time solutions, which will not solve the replay issues of existing mementos.

Dissecting the Composite Memento


Now, back to the memento of Pratik's timeline from the Internet Archive. The page is archived primarily in Portuguese. When it is loaded in a web browser that can execute JavaScript, the page makes subsequent asynchronous requests to populate various blocks as it does on the live site. The recent media block is not archived, so it does not show up. The related users block is populated in Portuguese (this block is generally populated immediately after the main page is loaded and does not get a chance to be updated later, hence it is unlikely to load a version in a different language). The closest successful memento of the global trends data is loaded from
https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=module
(which is in English). As the page starts to poll for new tweets for the account, it first finds the closest memento at
https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856
(which is in Urdu). This adds a notification bar above the main content area that suggests there are 20 new tweets available (clicking on this bar will insert those twenty tweets in the timeline, as the necessary markup is already returned in the response, waiting for a user action). I found the behavior of the page to be inconsistent due to intermittent issues, but reloading the page a few times and waiting for a while helps. In the subsequent polling attempts the latent_count parameter changes from "0" to "20" (this indicates how many new tweets are loaded and ready to be inserted) and the min_position parameter changes from "1095942934640377856" to "1100819673937960960" (these are IDs of the most recent tweets loaded so far). Every other parameter generally remains the same in the successive XHR calls made every 30 seconds. If one waits long enough on this page (while the tab is still active), occasionally another successful response arrives that updates the new tweets notification from 20 to 42 (but in a language other than Urdu). To see if there were any other clues that could explain why the banner was inserted in Urdu, I investigated the HTTP response as shown below (the payload is decoded, pretty-printed, and truncated for ease of inspection):

$ curl --silent -i "https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856"
HTTP/2 200 
server: nginx/1.15.8
date: Fri, 15 Mar 2019 04:25:14 GMT
content-type: text/javascript; charset=utf-8
x-archive-orig-status: 200 OK
x-archive-orig-x-response-time: 36
x-archive-orig-content-length: 995
x-archive-orig-strict-transport-security: max-age=631138519
x-archive-orig-x-twitter-response-tags: BouncerCompliant
x-archive-orig-x-transaction: 00becd1200f8d18b
x-archive-orig-x-content-type-options: nosniff
content-encoding: gzip
x-archive-orig-set-cookie: fm=0; Max-Age=0; Expires=Mon, 11 Feb 2019 13:21:45 GMT; Path=/; Domain=.twitter.com; Secure; HTTPOnly, _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCIRWidxoAToMY3NyZl9p%250AZCIlYzlmNGViODk4ZDI0YmI0NzcyMTMyMzA3M2M5ZTRjZDI6B2lkIiU2ODFi%250AZjgzYjMzYjEyYzk1NGNlMDlmYzRkNDIzZTY3Mg%253D%253D--22900f43bec575790847d2e75f88b12296c330bc; Path=/; Domain=.twitter.com; Secure; HTTPOnly
x-archive-orig-expires: Tue, 31 Mar 1981 05:00:00 GMT
x-archive-orig-server: tsa_a
x-archive-orig-last-modified: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report
x-archive-orig-x-connection-hash: bca4678d59abc86b8401176fd37858de
x-archive-orig-pragma: no-cache
x-archive-orig-cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
x-archive-orig-date: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-frame-options: 
cache-control: max-age=1800
x-archive-guessed-content-type: application/json
x-archive-guessed-encoding: utf-8
memento-datetime: Mon, 11 Feb 2019 13:21:45 GMT
link: <https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="original", <https://web.archive.org/web/timemap/link/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timegate", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="first memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="next memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="last memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org
x-archive-src: liveweb-20190211133005/liveweb-20190211132143-wwwb-spn01.us.archive.org.warc.gz
x-app-server: wwwb-app23
x-ts: ----
x-location: All
x-cache-key: httpsweb.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=moduleUS
x-page-cache: MISS

{
  "max_position": "1100819673937960960",
  "has_more_items": true,
  "items_html": "\n      <li class=\"js-stream-item stream-item stream-item\n\" data-item-id=\"1100648521127129088\"\nid=\"stream-item-tweet-1100648521127129088\"\ndata-item-type=\"tweet\"\n data-suggestion-json=\"{&quot;suggestion_details&quot;:{},&quot;tweet_ids&quot;:&quot;1100648521127129088&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}\"> ... [REDACTED] ... </li>",
  "new_latent_count": 20,
  "new_tweets_bar_html": "  <button class=\"new-tweets-bar js-new-tweets-bar\" data-item-count=\"20\" style=\"width:100%\">\n        دیکھیں 20 نئی ٹویٹس\n\n  </button>\n",
  "new_tweets_bar_alternate_html": []
}

While many web archives are good at exposing original response headers via X-Archive-Orig-* headers in mementos, I don't know of any web archive (yet) that exposes the corresponding original request headers as well (I propose using something like X-Archive-Request-Orig-* headers). By looking at the above response we can understand the structure of how the new tweets notification works on a Twitter timeline, but it does not answer why the response was in Urdu (as highlighted in the value of the "new_tweets_bar_html" JSON key). Based on my assessment and experiment above, I think that the corresponding request should have a header like "Cookie: lang=ur; Domain: twitter.com; Path=/", which could be verified if the corresponding WARC file were available.
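To make the distinction between replayed and archive-generated headers concrete, here is a minimal Python sketch. It assumes the response headers have already been collected into a dictionary; the function names split_memento_headers and summarize_poll_payload are my own and are not part of any archive's tooling. The sample values are abridged from the response above.

```python
import json
import re

ORIG_PREFIX = "x-archive-orig-"  # prefix the archive puts on replayed origin headers


def split_memento_headers(headers):
    """Separate the origin server's replayed headers from the archive's own."""
    original, archive = {}, {}
    for name, value in headers.items():
        key = name.lower()
        if key.startswith(ORIG_PREFIX):
            original[key[len(ORIG_PREFIX):]] = value
        else:
            archive[key] = value
    return original, archive


def summarize_poll_payload(body):
    """Pull out the fields of the polling response discussed in the text."""
    data = json.loads(body)
    # Drop the markup so only the human-readable banner text remains.
    bar_text = re.sub(r"<[^>]+>", "", data.get("new_tweets_bar_html", "")).strip()
    return {
        "new_latent_count": data.get("new_latent_count"),
        "max_position": data.get("max_position"),
        "bar_text": bar_text,
    }


# Abridged sample taken from the curl output above.
sample_headers = {
    "x-archive-orig-status": "200 OK",
    "x-archive-orig-server": "tsa_a",
    "memento-datetime": "Mon, 11 Feb 2019 13:21:45 GMT",
}
sample_body = json.dumps({
    "max_position": "1100819673937960960",
    "new_latent_count": 20,
    "new_tweets_bar_html": "  <button class=\"new-tweets-bar js-new-tweets-bar\" "
                           "data-item-count=\"20\">\n        دیکھیں 20 نئی ٹویٹس\n\n  </button>\n",
})

original, archive = split_memento_headers(sample_headers)
summary = summarize_poll_payload(sample_body)
```

Nothing in either group tells us what the crawler sent, which is exactly the gap that request-side headers would fill.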

Cookie Violations Experiment on the Live Site


Finally, I attempted to recreate this language hodgepodge on the live site on my own Twitter timeline. I followed the steps below and ended up with the page shown in Figure 6 (which contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have all 47 supported languages).

  1. Open your Twitter timeline in English by explicitly supplying the "?lang=en" query parameter in a browser tab (it can be an incognito window) without logging in; let's call it Tab A
  2. Open another tab in the same window and load your timeline without any "lang" query parameter (it should show your timeline in English); let's call it Tab B
  3. Switch to Tab A, change the value of the "lang" parameter to one of the 47 supported language codes, and load the page to update the "lang" cookie (which will be reflected in all the tabs of the same window)
  4. From a different browser (one that does not share cookies with the above tabs) or a different device, log in to your Twitter account (if not logged in already) and retweet something
  5. Switch to Tab B and wait for a notification to appear announcing one new tweet in the language selected in Tab A (it may take a little over 30 seconds)
  6. If you want to add more languages, click on the notification bar (which will insert the new tweet in the current language) and repeat from step 3; otherwise, continue
  7. To see the global trends block of Tab B in a different language, perform step 3 with the desired language code, switch back to Tab B, and wait until it changes (it may take a little over 5 minutes)

Figure 6: Mixed language illustration on Twitter's live website. It contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have all 47 supported languages.
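
The steps above can be modeled with a toy simulation. The Python sketch below is not Twitter's code; it only assumes what the experiment suggests, namely that an explicit "lang" query parameter overwrites a cookie shared by all tabs, and that background polling requests carry no "lang" parameter and are therefore localized from whatever the cookie holds at that moment. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CookieJar:
    """Cookie store shared by every tab of one browser window."""
    lang: str = "en"


@dataclass
class Tab:
    jar: CookieJar
    fragments: list = field(default_factory=list)  # (page part, language) pairs

    def load(self, lang=None):
        """Full page load; an explicit ?lang= value overwrites the shared cookie."""
        if lang is not None:
            self.jar.lang = lang
        self.fragments = [("page", self.jar.lang)]

    def poll(self, part):
        """Background XHR with no lang parameter; localized from the cookie."""
        self.fragments.append((part, self.jar.lang))


def run_experiment():
    jar = CookieJar()
    tab_a, tab_b = Tab(jar), Tab(jar)
    tab_a.load("en")               # step 1
    tab_b.load()                   # step 2: no lang parameter, cookie says "en"
    tab_a.load("ur")               # step 3: cookie changes for both tabs
    tab_b.poll("new_tweets_bar")   # step 5: Tab B's banner arrives in Urdu
    tab_a.load("hi")               # step 3 again with another language
    tab_b.poll("trends")           # step 7: Tab B's trends block arrives in Hindi
    return tab_b.fragments


mixed = run_experiment()
```

Tab B was never reloaded, yet it ends up holding fragments in three languages, which mirrors the mixed page in Figure 6. A crawler that reuses one cookie jar across many captures is in the same position as Tab B.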


Conclusions


With the above experiment on the live site I am confident in my assessment that a cookie violation could be one reason why a composite memento would be defaced. How common this issue is in Twitter's mementos and on other sites is still an open question. While I do not know of a silver-bullet solution to this issue yet, I think it can potentially be mitigated to some extent for future mementos by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling task for URLs of the same domain across many small sandboxed instances. Investigating options for filtering responses by matching cookies needs more rigorous research.
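
As an illustration of the first mitigation, the following Python sketch caps the lifetime of a cookie before a crawler stores it. It is only a sketch under assumptions: the function name cap_cookie_lifetime, the 60-second limit, and the sample "lang" cookie are mine, and a real crawler would need to apply this inside its own cookie handling.

```python
from http.cookies import SimpleCookie


def cap_cookie_lifetime(set_cookie_value, max_seconds):
    """Return the cookies from a Set-Cookie value with Max-Age capped."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    capped = []
    for morsel in cookie.values():
        try:
            lifetime = int(morsel["max-age"])
        except ValueError:
            # No (or unparsable) Max-Age: treat it as needing the cap.
            lifetime = max_seconds
        morsel["max-age"] = min(lifetime, max_seconds)
        morsel["expires"] = ""  # drop any absolute expiry so Max-Age governs
        capped.append(morsel.OutputString())
    return capped


# Hypothetical long-lived language cookie, capped to one minute.
result = cap_cookie_lifetime("lang=ur; Max-Age=31536000; Path=/", 60)
```

With a short enough cap, a language cookie set while capturing one page would expire before it could leak into the capture of an unrelated page, at the cost of losing state that some sites legitimately need across requests.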

--
Sawood Alam