Saturday, August 3, 2019

2019-08-03: Searching Web Archives for Unattributed Deleted Tweets From Politwoops

Tweet URL: https://twitter.com/derekwillis/status/1127234631865118731

On May 11, 2019, Derek Willis, who works at ProPublica and maintains the Politwoops project, tweeted a list of deleted tweet ids found by Politwoops that could not be attributed to any Twitter handle it tracks. This was an opportunity for us to revisit our interest in using web archives to uncover deleted tweets. Although we were unable to find any of the deleted tweet ids provided by Politwoops in the web archives, we are documenting the process that led us to this conclusion.

Politwoops  

Politwoops is a web service that tracks deleted tweets of elected public officials and candidates running for office in the USA and 55 other countries. Politwoops USA is supported by ProPublica.

Creating Twitter handles list for the 116th Congress 

In a previous post, we discussed the challenges involved in creating a data set of Twitter handles for the members of Congress and provided a data set of Twitter handles for the 116th Congress. A member of Congress can have multiple Twitter accounts, which can be categorized into official, personal, and campaign accounts. We chose to build a data set of official Congressional Twitter accounts rather than personal or campaign accounts because we did not want to track personal tweets from members of Congress. For this reason, our data set has a one-to-one mapping between a member of Congress and their Twitter handle, listing all 537 current members of Congress with their official Twitter handles. Politwoops, however, has a one-to-many mapping between a member of Congress and their Twitter handles because it tracks all the Twitter handles for each member. We therefore expanded our data set for the 116th Congress by adding the Twitter handles that Politwoops tracks but that were missing from our data set. For example, our data set has @RepAOC as the Twitter handle for Rep. Alexandria Ocasio-Cortez, while Politwoops lists both @AOC and @RepAOC as her Twitter handles.
Figure 1: Screenshot of  Rep. Alexandria Ocasio-Cortez's Politwoops page highlighting the two handles (@AOC, @RepAOC) Politwoops  tracks for her

Creating the President, the Vice-President, and Committee Twitter handles list

Politwoops USA tracks members of Congress, the President, and the Vice-President. ProPublica provides the list of Twitter handles tracked by Politwoops in the data sources list of the ProPublica Congress API. We also found that Politwoops tracks a subset of committee Twitter handles that are not advertised in that data sources list. With no complete list of the committee Twitter handles tracked by Politwoops, we used the CSPAN list of committee handles.
Figure 2: CSPAN Twitter Committee List showing the SASC Majority committee's Twitter handle, @SASCMajority
Figure 3: Politwoops returns a 404 for SASC Majority committee's Twitter handle, @SASCMajority
Figure 4: Screenshot of the @HouseVetAffairs committee Twitter handle being tracked by Politwoops

List of Different Approaches Used to Find Deleted Tweets using the Web Archives   

Internet Archive CDX Server API

The Internet Archive CDX Server API can be used to list all the mementos in the Internet Archive's index for a URL or URL prefix. The URL match scope option provided by the CDX Server API lets us broaden a search beyond an exact URL. In our study, we used the URL match scope of "prefix".
The URL http://web.archive.org/cdx/search/cdx?url=https://twitter.com/repaoc&matchType=prefix searches for all the URLs in the Internet Archive with the prefix https://twitter.com/repaoc. Using this approach, we received all the URL variants that exist in the Internet Archive's index for @RepAOC.
Excerpt of the response received from the Internet Archive's CDX Server API for @RepAOC
com,twitter)/repaoc 20190108184114 https://twitter.com/RepAOC text/html 200 GBB2ADFZOLTFQAPQACVT2XFVBVSEEHT5 42489
com,twitter)/repaoc 20190109161007 https://twitter.com/RepAOC text/html 200 SLZHJQKN25URYRWQUQI7DW5JZD5M5E6F 43004
com,twitter)/repaoc 20190109200548 https://twitter.com/RepAOC text/html 200 DWGHG6CSHBE7OETXJD3TEINEWKV372DJ 45123
com,twitter)/repaoc 20190120082837 https://twitter.com/repaoc text/html 200 JVHASBSCBHPGKCVR7GBVOYRM4H5KQYBP 53697
com,twitter)/repaoc 20190126051939 https://twitter.com/repaoc text/html 200 YRE4RPA46F7PTQNBQUMHKCLWLL2WUXE2 56420
com,twitter)/repaoc 20190202170000 https://twitter.com/RepAOC text/html 200 6VS73H6XD5T2TVRC4UJXNT2D6FCNZWMJ 55388
com,twitter)/repaoc 20190207211032 https://twitter.com/repaoc text/html 200 NQQI4UJ6TUMHS36JATOY35D7P255MEIA 56378
com,twitter)/repaoc 20190221024247 https://twitter.com/RepAOC text/html 200 K6B3P7IRHIXTZSPXRWUPSBCRZ2HCWBZB 56678
com,twitter)/repaoc 20190223102039 https://twitter.com/RepAOC text/html 200 OO2U6EUXYTGGEE2Q3ARQJ4SI4QGF2CLR 58008
com,twitter)/repaoc 20190223180906 https://twitter.com/RepAOC text/html 200 HC6RCIVTTUV6JU35PA2JZ256E7RXY2MN 56799
com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59586
com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59587
com,twitter)/repaoc 20190307011545 https://twitter.com/RepAOC text/html 200 R5PQUDWVYCZGAH3B4LVSBQXFXZ5MVXSY 59388
com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59430
com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59431
com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59498
com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59499

Example for a status URL
com,twitter)/repaoc/status/1082706172623376384 20190108201259 http://twitter.com/RepAOC/status/1082706172623376384 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 447
The tweet id from the status URL was compared with the list of deleted tweet ids provided by Politwoops. Using this approach we did not find any matching tweet ids. 
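The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: it assumes the CDX response has already been fetched as plain-text rows like those in the excerpt, and the helper names (build_prefix_query, status_ids, find_deleted) are hypothetical.

```python
import re
from urllib.parse import urlencode

# CDX Server API endpoint of the Internet Archive.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# Matches the numeric tweet id in a Twitter status URL.
STATUS_RE = re.compile(r"/status(?:es)?/(\d+)")


def build_prefix_query(handle):
    """Return a CDX query URL that prefix-matches a Twitter handle."""
    params = {"url": f"https://twitter.com/{handle}", "matchType": "prefix"}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def status_ids(cdx_lines):
    """Yield tweet ids from CDX rows whose original URL is a status URL."""
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not a well-formed CDX row
        match = STATUS_RE.search(fields[2])  # third field is the original URL
        if match:
            yield match.group(1)


def find_deleted(cdx_lines, deleted_ids):
    """Return the archived tweet ids that also appear in the deleted list."""
    return set(status_ids(cdx_lines)) & set(deleted_ids)
```

Tweet ids are kept as strings so that they can be compared directly with ids read from a text file. Note that urlencode percent-encodes the url parameter.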

From mementos of the 116th Congress

For this approach, we fetched all the mementos from web archives for the 116th Congress between 2019-01-03 and 2019-05-15 using MemGator, a Memento aggregator service.
For example, we queried multiple web archives for Rep. Karen Bass's Twitter handle, @RepKarenBass, to fetch all the mementos for her Twitter profile page. All the embedded tweets from the memento were parsed and compared with the deleted list of tweet ids from Politwoops. Using this approach we did not find any matching tweet ids. 
Example of a URI-M  in CDXJ format


20190201043735 {"uri": "http://web.archive.org/web/20190201043735/https://twitter.com/RepKarenBass", "rel": "memento", "datetime": "Fri, 01 Feb 2019 04:37:35 GMT"}
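A line in this format can be split into its 14-digit capture timestamp and its JSON attributes. The sketch below is an illustration under assumptions rather than the code used in the study: the function names are hypothetical, and lines that do not start with a digit are skipped on the assumption that they are TimeMap metadata rather than memento entries.

```python
import json
from datetime import datetime, timezone


def parse_timemap_line(line):
    """Split one CDXJ line into (capture datetime in UTC, attributes dict)."""
    timestamp, _, payload = line.strip().partition(" ")
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), json.loads(payload)


def mementos_in_range(lines, start, end):
    """Return the URI-Ms of mementos captured between start and end (inclusive)."""
    uris = []
    for line in lines:
        line = line.strip()
        if not line or not line[0].isdigit():
            continue  # blank line or assumed metadata line
        captured, attrs = parse_timemap_line(line)
        # "memento", "first memento", and "last memento" all qualify.
        if start <= captured <= end and "memento" in attrs.get("rel", ""):
            uris.append(attrs["uri"])
    return uris
```

Calling mementos_in_range with a start of 2019-01-03 and an end of 2019-05-15 would keep only the captures in the date range used for the 116th Congress.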

Figure 5: Screenshot of the memento for Rep. Karen Bass's Twitter profile page with 20 embedded tweets
Output upon parsing the fetched mementos
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827923375880359937Timestamp: 1486227289|||TweetText: Sometimes the best way to stand up is to sit down. Happy Birthday Rosa Parks. #OurStory #BlackHistoryMonthpic.twitter.com/fjPMeD3RzX
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827593256988860417Timestamp: 1486148583|||TweetText: I urge the Gov't of Cameroon to respect the civil and human rights of all of its citizens. See my full statement: http://bass.house.gov/media-center/press-releases/rep-bass-condemns-intimidation-against-english-speaking-population …
TweetType: RT|||ScreenName: RepKarenBass|||TweetId: 827292997100376064Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
TweetType: OTR|||ScreenName: RepBarbaraLee|||TweetId: 827285964674441216Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
TweetType: OT|||ScreenName: RepBarbaraLee|||TweetId: 827201943323938816Timestamp: 1486055286|||TweetText: This month is National Children’s Dental Health Month (NCDHM). This year's slogan is "Choose Tap Water for a Sparkling Smile"pic.twitter.com/gk1cj8oTK9
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196902273929217Timestamp: 1486054084|||TweetText: On the growing list of things I shouldn't have to defend my stance on, add #UCBerkeley, 1 of our nation's most prestigious pub. universities
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196521347166209Timestamp: 1486053993|||TweetText: .@realDonaldTrump: #UCBerkeley developed immunotherapy for cancer!
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196386512871425Timestamp: 1486053961|||TweetText: .@realDonaldTrump: Do you like Vitamin K? Discovered/synthesized at #UCBerkeley
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196102554296320Timestamp: 1486053894|||TweetText: .@realDonaldTrump What's your stance on painkillers? Beta-endorphins invented at #UCBerkeleyhttps://twitter.com/realDonaldTrump/status/827112633224544256 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826960463590207488Timestamp: 1485997713|||TweetText: Happy to see Judge Birotte of LA continue the fight towards ending Pres. Trump’s exec. order.http://www.latimes.com/local/lanow/la-me-ln-federal-order-travel-ban-20170201-story.html …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826877783787847681Timestamp: 1485978000|||TweetText: This morning, I was happy to attend @MENTORnational's Capitol Hill Day, where mentors advocate for services for all youth. Thank you!pic.twitter.com/EyVgDSIvuE
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826860675007930368Timestamp: 1485973921|||TweetText: A civil & women's rights activist, Dorothy Height helped black women throughout America succeed. #OurStory #BlackHistoryMonth #NewStamppic.twitter.com/v8wnHFpgMu
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826833993874042880Timestamp: 1485967560|||TweetText: Let's not turn our backs on the latest refugees and potential citizens just because they come from Africa. More: https://bass.house.gov/media-center/press-releases/rep-bass-pens-letter-urging-president-trump-rescind-travel-ban …pic.twitter.com/J9veQNSpJu
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826822912057413633Timestamp: 1485964918|||TweetText: Trump's listening session is w people he knows and should be "listening" to all the time---campaign surrogates, supporters, employees
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826799517295058944Timestamp: 1485959340|||TweetText: 57 years ago, four Black college students sat at a lunch counter and asked for lunch. We will not go back. #OurStory #BlackHistoryMonthpic.twitter.com/ER00yv1q7B
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826606703928078336Timestamp: 1485913370|||TweetText: 7 in 10 Americans do NOT support @POTUS relentless quest to strike down Roe v Wade.  Where does #Gorsuch stand? #SCOTUS
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826579567376637952Timestamp: 1485906900|||TweetText: Proud to stand w/ my Foreign Affairs colleagues and defend dissenting diplomats..http://www.politico.com/story/2017/01/trump-immigration-ban-state-department-dissent-democrats-234433 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826547056235933697Timestamp: 1485899149|||TweetText: Treasury nominee #Mnuchin denied that his company engaged in robo-signing, foreclosing on Americans without proper review #RejectMnuchin
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826474625831890945Timestamp: 1485881880|||TweetText: Few cities on this planet have benefited so handsomely from immigration as LA. Read the @TrumanProject letter: http://ow.ly/oOhF308waXJ
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826426234422829056Timestamp: 1485870343|||TweetText: Today is the day! #GetCoveredhttps://twitter.com/JoaquinCastrotx/status/826416237223755777 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826275582883270656Timestamp: 1485834425|||TweetText: Pres. Trump has replaced Yates as standing AG for standing up for millions. You can't replace us all.http://www.cbsnews.com/amp/news/trump-fires-acting-attorney-general-sally-yates/?client=safari …
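Records in the format above can be read back into structured fields before comparing their tweet ids with the deleted list. The following is a rough sketch, assuming the field layout is exactly as shown in the output (including the missing separator between TweetId and Timestamp, which the pattern treats as optional). The names RECORD_RE, ParsedTweet, and parse_record are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Field layout taken from the output above; the "|||" between TweetId and
# Timestamp is optional because the records shown omit it.
RECORD_RE = re.compile(
    r"TweetType: (?P<type>\w+)\|\|\|"
    r"ScreenName: (?P<screen_name>\w+)\|\|\|"
    r"TweetId: (?P<tweet_id>\d+)(?:\|\|\|)?"
    r"Timestamp: (?P<timestamp>\d+)\|\|\|"
    r"TweetText: (?P<text>.*)",
    re.DOTALL,
)


class ParsedTweet(NamedTuple):
    tweet_type: str
    screen_name: str
    tweet_id: str
    timestamp: int
    text: str


def parse_record(line: str) -> Optional[ParsedTweet]:
    """Parse one output record, or return None if it does not match."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        return None
    return ParsedTweet(
        match["type"],
        match["screen_name"],
        match["tweet_id"],
        int(match["timestamp"]),
        match["text"],
    )
```

The set of parsed tweet_id values can then be intersected with the set of deleted tweet ids from Politwoops.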

From mementos of the 115th Congress

For this approach, we reused locally stored TimeMaps and mementos from the 115th Congress, which we collected between 2017-01-20 and 2019-01-03. The list of Twitter handles for the 115th Congress was obtained from the data set of 115th Congressional tweet ids released by Social Feed Manager. The requests for mementos from the web archives were made by expanding the URI-R for each Twitter handle with the language and with_replies arguments (thanks to Sawood Alam for suggesting the language variations).
For example, we queried multiple web archives for Doug Jones's Twitter handle, @dougjones, by expanding with the language and with_replies arguments as shown below:

https://twitter.com/dougjones
Twitter supports 47 different language variations and multiple arguments such as with_replies. Upon searching for the URI-R https://twitter.com/dougjones, the web archives return all the mementos for the exact URI-R without any language variations or arguments.

Excerpt of the TimeMap response received for https://twitter.com/dougjones

20110210134919 {"uri": "http://web.archive.org/web/20110210134919/http://twitter.com:80/dougjones", "rel": "first memento", "datetime": "Thu, 10 Feb 2011 13:49:19 GMT"}
20180205201909 {"uri": "http://web.archive.org/web/20180205201909/https://twitter.com/DougJones", "rel": "memento", "datetime": "Mon, 05 Feb 2018 20:19:09 GMT"}
20180306132212 {"uri": "http://wayback.archive-it.org/all/20180306132212/https://twitter.com/DougJones", "rel": "memento", "datetime": "Tue, 06 Mar 2018 13:22:12 GMT"}
20180912165539 {"uri": "http://wayback.archive-it.org/all/20180912165539/https://twitter.com/DougJones", "rel": "memento", "datetime": "Wed, 12 Sep 2018 16:55:39 GMT"}
Upon searching for the URI-R https://twitter.com/dougjones?lang=en, the web archives return all the mementos for the language variation "en".

TimeMap response received for https://twitter.com/dougjones?lang=en

20190424140424 {"uri": "http://web.archive.org/web/20190424140424/https://twitter.com/dougjones?lang=en", "rel": "first memento", "datetime": "Wed, 24 Apr 2019 14:04:24 GMT"}
20190501165834 {"uri": "http://web.archive.org/web/20190501165834/https://twitter.com/dougjones?lang=en", "rel": "memento", "datetime": "Wed, 01 May 2019 16:58:34 GMT"}
20190509164649 {"uri": "http://web.archive.org/web/20190509164649/https://twitter.com/dougjones?lang=en", "rel": "last memento", "datetime": "Thu, 09 May 2019 16:46:49 GMT"}
Many mementos in the web archives are for Twitter handle URLs that carry the language and with_replies arguments. Therefore, for each Twitter handle we queried both the profile URL and the with_replies URL, each on its own and with 47 different language variations. In total we created 96 URLs for each Twitter handle:
https://twitter.com/dougjones (1 URL)
https://twitter.com/dougjones/with_replies (1 URL)
https://twitter.com/dougjones?lang=en (47 URLs for 47 languages)
https://twitter.com/dougjones/with_replies?lang=en (47 URLs for 47 languages)
Total: 96 URLs for each URI-R
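The count of 96 follows from two base URLs (the profile page and its with_replies page), each queried once without a language argument and once per language: 2 + 2 × 47 = 96. A minimal sketch of the expansion, assuming the list of language codes is supplied by the caller. The function name is hypothetical, and the codes in the comment are an illustrative subset, not the full list of 47.

```python
def expand_handle_urls(handle, languages):
    """Return the profile and with_replies URLs, bare and per language."""
    profile = f"https://twitter.com/{handle}"
    bases = [profile, f"{profile}/with_replies"]
    urls = list(bases)  # the two URLs without a language argument
    for base in bases:
        urls.extend(f"{base}?lang={code}" for code in languages)
    return urls


# Illustrative subset of language codes; the study used 47 variations.
# expand_handle_urls("dougjones", ["en", "es", "fr"]) returns 8 URLs.
```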

Example for different language variation URLs:
...

The parsed embedded tweets from the mementos were compared with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids.
We also had locally stored mementos for the 115th Congress from 2017-01-01 to 2018-06-30. The data set of Twitter handles for this collection was created by taking a Wikipedia page snapshot of the current members of Congress on July 4, 2018 and using the CSPAN Twitter list on members of Congress and Politwoops to get all the Twitter handles. Upon parsing the embedded tweets from the mementos, we compared the parsed tweets with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids. 

From mementos of the President, the Vice-President and the Committee Twitter handles list

For this analysis we fetched all the mementos for the President, the Vice-President, and committee handles between 2019-01-03 and 2019-06-30. Upon fetching the mementos and parsing embedded tweets, we compared the parsed tweets with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids.  

Upon completion of our analysis, we learned how to extract the timestamp from any tweet id. Snowflake is the service Twitter uses to generate unique ids for tweets and other objects within Twitter, such as lists, users, and collections. Snowflake generates unsigned 64-bit integers which consist of:
  • timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
  • configured machine id - 10 bits - gives us up to 1024 machines
  • sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
We have created a web service, TweetedAt, to extract the timestamp from deleted tweet ids. Using TweetedAt, we found the timestamps of all the deleted tweet ids provided by Derek Willis from Politwoops. Of the 107 Politwoops deleted tweet ids, only six fell within the 116th Congress time range and nine within the 115th Congress time range.
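The bit layout described above is enough to recover the timestamp with a shift and an addition. The sketch below is not the TweetedAt implementation; it is a short illustration that assumes Twitter's Snowflake epoch of 1288834974657 milliseconds (November 4, 2010 UTC), and the function names are hypothetical. Tweet ids issued before Snowflake was introduced do not encode a timestamp this way.

```python
from datetime import datetime, timezone

# Custom Snowflake epoch in milliseconds since the Unix epoch (assumption
# stated in the lead-in: Twitter's value, 2010-11-04 UTC).
TWITTER_EPOCH_MS = 1288834974657


def decode_snowflake(tweet_id):
    """Return (timestamp in ms since Unix epoch, machine id, sequence number)."""
    sequence = tweet_id & 0xFFF              # lowest 12 bits
    machine_id = (tweet_id >> 12) & 0x3FF    # next 10 bits
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS  # remaining high bits
    return timestamp_ms, machine_id, sequence


def tweet_datetime(tweet_id):
    """Return the creation time of a Snowflake tweet id as a UTC datetime."""
    timestamp_ms, _, _ = decode_snowflake(tweet_id)
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
```

For example, the id of Derek Willis's tweet linked at the top of this post, 1127234631865118731, decodes to a time on 2019-05-11 UTC, and the id 827923375880359937 from the parsed output above decodes to the Unix timestamp 1486227289 recorded alongside it.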
To summarize, we were unable to find any of the deleted tweet ids provided by Derek Willis from Politwoops. We analyzed the following sources:
  • Mementos for the 116th Congress Twitter handles between 2019-01-03 and 2019-05-15.
  • Twitter handle mementos for the committees, the President and the Vice-President between 2019-01-03 and 2019-06-30. 
  • Mementos for the 115th Congress Twitter handles between 2017-01-03 and 2019-01-03. 
  • The Internet Archive CDX Server API responses on the 116th Congress Twitter handles.
There are several possible reasons for being unable to find the deleted tweet ids provided by Derek Willis from Politwoops:
  • 92 out of 107 deleted tweets were outside the date range of our analysis.
  • The mementos in a web archive are indexed by their URI-Rs. When a user changes their Twitter handle, the URI-R for the user's Twitter account also changes. For example, Rep. Nancy Pelosi used the Twitter handle @nancypelosi during the 115th Congress but changed it to @speakerpelosi in the 116th Congress. Querying the web archives for @speakerpelosi therefore returns mementos that go back only as far as the 116th Congress. To get mementos from before the 116th Congress, we need to query the web archives with the Twitter handle @nancypelosi.
  • The data set of Twitter handles for the US Congress used in our analysis has a one-to-one mapping between a seat in Congress and a member of Congress. If a seat has been held by multiple members during a Congress, the data set includes only the current member, so the Twitter handles of former members within the same Congress are missing.
We analyzed the web archives for the 115th and 116th Congress members, the President, the Vice-President, and committee Twitter handles to find the deleted tweet ids provided to us by Derek Willis from Politwoops. Although our analysis did not find a match for any of the deleted tweet ids, we will continue to investigate as we learn more. We welcome any information that might aid our analysis.

-----
Mohammed Nauman Siddique
(@m_nsiddique)
