Wednesday, August 14, 2019

2019-08-14: Building the Better Crowdsourced Study - Literature on Mechanical Turk

The XKCD comic "Study" parodies the challenges of recruiting study participants.

As part of "Social Cards Probably Provide For Better Understanding Of Web Archive Collections" (recently accepted for publication at CIKM 2019), I had to learn how to conduct user studies. One of the most challenging problems in conducting user studies is recruiting participants. Amazon's Mechanical Turk (MT) solves this problem by providing a marketplace where participants can earn money by completing studies for researchers. This blog post summarizes the lessons I have learned from other studies that have successfully employed MT. I have found parts of this information scattered throughout different bodies of knowledge, but not gathered in one place; thus, I hope it is a useful starting place for future researchers.

MT is by far the largest source of study participants, with over 100,000 available participants. MT is an automated system that facilitates the interaction of two actors: the requester and the worker. A worker signs up for an Amazon account and must wait a few days to be approved. Once approved, MT provides the worker with a list of assignments to choose from. An MT assignment is called a Human Intelligence Task (HIT). Workers perform HITs for payments anywhere from $0.01 up to $5.00 or more, and can earn as much as $50 per week completing these HITs. Workers are the equivalent of the subjects or participants found in traditional research studies.

Workers can browse HITs to complete via Amazon's Mechanical Turk.
Requesters are the creators of HITs. After a worker completes a HIT, the requester decides whether or not to accept the HIT and thus pay the worker. Requesters use the MT interface to specify the amount to be paid for a HIT, how many unique workers may complete each HIT, how much time to allot to workers, and when the HIT will no longer be available for work (expire). Also, requesters can specify that they only want workers with specific qualifications, such as age, gender, employment history, or handedness. The Master Qualification is assigned automatically by the MT system based on the behavior of workers. Requesters can also specify that they only want workers with a minimum approval rate, the percentage of a worker's prior HITs that requesters have accepted.
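These parameters also map onto Amazon's MTurk API for requesters who prefer scripting to the web interface. Below is a hedged sketch using boto3's MTurk client; all values are illustrative, and question.xml stands in for the HIT's HTML form wrapped in MT's question XML:

import boto3

# Hedged sketch: create a HIT programmatically. Values are illustrative.
client = boto3.client("mturk", region_name="us-east-1")

response = client.create_hit(
    Title="Answer questions about a web page",
    Description="A short research study",
    Reward="0.50",                     # payment per assignment, in USD
    MaxAssignments=5,                  # unique workers per HIT
    AssignmentDurationInSeconds=1800,  # time allotted to each worker
    LifetimeInSeconds=86400,           # when the HIT expires
    Question=open("question.xml").read(),
    QualificationRequirements=[{
        # System qualification for worker approval rate (>= 95%)
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    }],
)
print(response["HIT"]["HITId"])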

Requesters can create HITs using the MT interface, which provides a variety of templates.
The HITs themselves are HTML forms entered into the MT system. Requesters have much freedom within the interface to design HITs to meet their needs, even including JavaScript. Once the requester has entered the HTML into the system, they can preview the HIT to ensure that it looks and responds as expected. When the requester is done creating the HIT, they can then save it for use. HITs may contain variables for links to visualizations or other external information. When the requester is ready to publish a HIT for workers to perform, they can submit a CSV file containing the values for these variables. MT will create one HIT per row in the CSV file. Amazon will require that the requester deposit enough money into their account to pay for the number of HITs they have specified. After the requester pays for the HITs, workers can see the HIT and then begin their submissions. The requester then reviews each submission as it comes in and pays workers.
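As a small illustration of the variable mechanism, the sketch below writes such a CSV in Python for a hypothetical template variable named ${viz_url}; MT would create one HIT per row:

import csv

# Each row supplies a value for the ${viz_url} placeholder in the HIT's
# HTML template; MT creates one HIT per row of the uploaded CSV.
rows = [
    {"viz_url": "https://example.org/visualization/1"},
    {"viz_url": "https://example.org/visualization/2"},
]

with open("hit_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["viz_url"])
    writer.writeheader()
    writer.writerows(rows)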

The MT environment is different from that used in traditional user studies. MT participants can use their own devices to complete the study wherever they have a connection to the Internet. Requesters are limited in the amount of data that they can collect on MT participants. For each completed HIT, the MT system supplies the completion time and the responses provided by the MT participant. A requester may also employ JavaScript in the HIT to record additional information.

In contrast, traditional user studies allow a researcher to completely control the environment and record the participant's physical behavior. Because of these differences, some scholars have questioned the effectiveness of MT's participants. To assuage this doubt, Heer et al. reproduced the results of a classic visualization experiment. The original experiment used participants recruited through traditional methods; Heer et al. recruited participants via MT and demonstrated that the results were consistent with the original study. Kosara and Ziemkiewicz reproduced one of their previous visualization studies and found that the MT results were likewise consistent with the earlier study. Bartneck et al. conducted the same experiment with both traditionally recruited participants and MT workers, and also confirmed consistent results between these groups.

MT is not without its criticism. Fort, Adda, and Cohen raise questions on the ethical use of MT, focusing on the potentially low wages offered by requesters. In their overview of MT as a research tool, Mason and Suri further discuss such ethical issues as informed consent, privacy, and compensation. Turkopticon is a system developed by Irani and Silberman that helps workers safely voice grievances about requesters, including issues with payment and overall treatment.

In traditional user studies, the presence of the researcher may engender some social motivation to complete a task accurately. MT participants, however, are motivated to maximize their revenue over time by completing tasks quickly, leading some to not exercise the same level of care as a traditional participant. Because of these differences in motivation and environment, MT studies require specialized design. Based on the work of multiple academic studies, we have the following advice for requesters developing meaningful tasks with Mechanical Turk:
  • complex concepts, like understanding, can be broken into smaller tasks that collectively provide a proxy for the broader concept (Kittur 2008)
  • successful studies ensure that each task has questions with verifiable answers (Kittur 2008)
  • limiting participants by their acceptance score has been successful for ensuring higher quality responses (Micallef 2012, Borkin 2013)
  • participants can repeat a task – make sure each set of responses corresponds to a unique participant by using tools such as Unique Turker (Paolacci 2010)
  • be fair to participants; because MT is a competitive market for participants, they can refuse to complete a task, and thus a requester's actions lead to a reputation that causes participants to avoid them (Paolacci 2010)
  • better payment may improve results on tasks with factually correct answers (Paolacci 2010, Borkin 2013, PARC 2009) – and can address the ethical issue of proper compensation
  • being up front with participants and explaining why they are completing a task can improve their responses (Paolacci 2010) – this can also help address the issue of informed consent
  • attention questions can be useful for discouraging or weeding out malicious or lazy participants that may skew the results (Borkin 2013, PARC 2009) – see the filtering sketch after this list
  • bonus payments may encourage better behavior from participants (Kosara 2010) – and may also address the ethical issue of proper compensation
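Several of these checks (one response per participant, attention questions) can be applied when processing a downloaded results file. Below is a hedged sketch; the WorkerId and Answer.attention columns follow MT's results-file conventions, but the attention field name is a hypothetical example:

import csv

ATTENTION_ANSWER = "blue"  # expected answer to the attention question

seen_workers = set()
kept = []
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["WorkerId"] in seen_workers:
            continue  # drop repeat participation: one response per worker
        seen_workers.add(row["WorkerId"])
        if row["Answer.attention"].strip().lower() != ATTENTION_ANSWER:
            continue  # drop submissions that fail the attention check
        kept.append(row)

print(f"kept {len(kept)} of {len(seen_workers)} unique workers")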
MT provides a place to recruit participants, but recruitment is only one part of successfully conducting user experiments.

For researchers starting down the road of user studies, I recommend starting with "Methods for Evaluating Interactive Information Retrieval Systems with Users" by Diane Kelly and then circling back to the other resources noted here when developing an experiment.

-- Shawn M. Jones

Saturday, August 3, 2019

2019-08-03: Searching Web Archives for Unattributed Deleted Tweets From Politwoops

Tweet URL: https://twitter.com/derekwillis/status/1127234631865118731

On May 11, 2019, Derek Willis, who works at ProPublica and also maintains the Politwoops project, tweeted a list of deleted tweet ids found by Politwoops that could not be attributed to any Twitter handle being tracked by Politwoops. This was an opportunity for us to revisit our interest in using web archives to uncover deleted tweets. Although we were unsuccessful in finding any of the deleted tweet ids provided by Politwoops in the web archives, we document below the process that led us to this conclusion.

Politwoops  

Politwoops is a web service which tracks deleted tweets of elected public officials and candidates running for office in the USA and 55 other countries. Politwoops USA is supported by ProPublica.

Creating Twitter handles list for the 116th Congress 

In a previous post, we discussed the challenges involved in creating a data set of Twitter handles for the members of Congress and provided a data set of Twitter handles for the 116th Congress. A member of Congress can have multiple Twitter accounts, which can be categorized into official, personal, and campaign accounts. We decided to create a data set of official Congressional Twitter accounts rather than personal or campaign accounts because we did not want to track personal tweets from the members of Congress. For this reason, our data set has a one-to-one mapping between a member of Congress and their Twitter handle, listing all 537 current members of Congress with their official Twitter handles. However, Politwoops has a one-to-many mapping between a member of Congress and their Twitter handles because it tracks all the Twitter handles for a member of Congress. We expanded our data set of Twitter handles for the 116th Congress with the additional handles that Politwoops tracks. For example, our data set lists @RepAOC as the Twitter handle for Rep. Alexandria Ocasio-Cortez, while Politwoops lists both @AOC and @RepAOC as her Twitter handles.
Figure 1: Screenshot of  Rep. Alexandria Ocasio-Cortez's Politwoops page highlighting the two handles (@AOC, @RepAOC) Politwoops  tracks for her

Creating the President, the Vice-President, and Committee Twitter handles list

Politwoops USA tracks members of Congress, the President, and the Vice-President. ProPublica provides the list of Twitter handles tracked by Politwoops in the data sources list of the ProPublica Congress API. However, we found that Politwoops also tracks a subset of committee Twitter handles that are not advertised in that data sources list. With no complete list of the committee Twitter handles tracked by Politwoops, we used the CSPAN list of committee handles.
Figure 2: CSPAN Twitter Committee List showing the SASC Majority committee's Twitter handle, @SASCMajority
Figure 3: Politwoops returns a 404 for SASC Majority committee's Twitter handle, @SASCMajority
Figure 4: Screenshot of the @HouseVetAffairs committee Twitter handle being tracked by Politwoops

List of Different Approaches Used to Find Deleted Tweets using the Web Archives   

Internet Archive CDX Server API

The Internet Archive CDX Server API can be used to list all the mementos in the Internet Archive's index for a URL or a URL prefix. We can broaden our search for a URL with the URL match scope option provided by the CDX Server API. In our study, we used the URL match scope of "prefix".
The URL http://web.archive.org/cdx/search/cdx?url=https://twitter.com/repaoc&matchType=prefix searches for all the URLs in the Internet Archive with the prefix https://twitter.com/repaoc. Using this approach, we received all the different URL variants that exist in the Internet Archive's index for @RepAOC.
Excerpt of the response received from the Internet Archive's CDX Server API for @RepAOC
com,twitter)/repaoc 20190108184114 https://twitter.com/RepAOC text/html 200 GBB2ADFZOLTFQAPQACVT2XFVBVSEEHT5 42489
com,twitter)/repaoc 20190109161007 https://twitter.com/RepAOC text/html 200 SLZHJQKN25URYRWQUQI7DW5JZD5M5E6F 43004
com,twitter)/repaoc 20190109200548 https://twitter.com/RepAOC text/html 200 DWGHG6CSHBE7OETXJD3TEINEWKV372DJ 45123
com,twitter)/repaoc 20190120082837 https://twitter.com/repaoc text/html 200 JVHASBSCBHPGKCVR7GBVOYRM4H5KQYBP 53697
com,twitter)/repaoc 20190126051939 https://twitter.com/repaoc text/html 200 YRE4RPA46F7PTQNBQUMHKCLWLL2WUXE2 56420
com,twitter)/repaoc 20190202170000 https://twitter.com/RepAOC text/html 200 6VS73H6XD5T2TVRC4UJXNT2D6FCNZWMJ 55388
com,twitter)/repaoc 20190207211032 https://twitter.com/repaoc text/html 200 NQQI4UJ6TUMHS36JATOY35D7P255MEIA 56378
com,twitter)/repaoc 20190221024247 https://twitter.com/RepAOC text/html 200 K6B3P7IRHIXTZSPXRWUPSBCRZ2HCWBZB 56678
com,twitter)/repaoc 20190223102039 https://twitter.com/RepAOC text/html 200 OO2U6EUXYTGGEE2Q3ARQJ4SI4QGF2CLR 58008
com,twitter)/repaoc 20190223180906 https://twitter.com/RepAOC text/html 200 HC6RCIVTTUV6JU35PA2JZ256E7RXY2MN 56799
com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59586
com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59587
com,twitter)/repaoc 20190307011545 https://twitter.com/RepAOC text/html 200 R5PQUDWVYCZGAH3B4LVSBQXFXZ5MVXSY 59388
com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59430
com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59431
com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59498
com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59499

Example for a status URL
com,twitter)/repaoc/status/1082706172623376384 20190108201259 http://twitter.com/RepAOC/status/1082706172623376384 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 447
The tweet id from the status URL was compared with the list of deleted tweet ids provided by Politwoops. Using this approach we did not find any matching tweet ids. 
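This comparison can be scripted. Below is a minimal sketch that queries the CDX Server API for a handle's archived status URLs and checks their tweet ids against the deleted list (the deleted-ID set shown is a placeholder; the filter and fl parameters are part of the documented CDX API):

import requests

# Minimal sketch: list archived tweet status URLs for a handle via the
# Internet Archive CDX Server API and check them against deleted tweet ids.
CDX = "http://web.archive.org/cdx/search/cdx"
deleted_ids = {"1127234631865118731"}  # placeholder; use the Politwoops list

params = {
    "url": "twitter.com/repaoc",
    "matchType": "prefix",
    "filter": "original:.*/status/[0-9]+$",  # keep only tweet status URLs
    "fl": "original",                        # return only the original URL
}
for original in requests.get(CDX, params=params).text.splitlines():
    tweet_id = original.rstrip("/").split("/")[-1]  # naive id extraction
    if tweet_id in deleted_ids:
        print("found in the Internet Archive:", original)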

From mementos of the 116th Congress

For this approach, we fetched all the mementos from web archives for the 116th Congress between 2019-01-03 and 2019-05-15 using MemGator, a Memento aggregator service.
For example, we queried multiple web archives for Rep. Karen Bass's Twitter handle, @RepKarenBass, to fetch all the mementos for her Twitter profile page. All the embedded tweets from the memento were parsed and compared with the deleted list of tweet ids from Politwoops. Using this approach we did not find any matching tweet ids. 
Example of a URI-M  in CDXJ format


20190201043735 {"uri": "http://web.archive.org/web/20190201043735/https://twitter.com/RepKarenBass", "rel": "memento", "datetime": "Fri, 01 Feb 2019 04:37:35 GMT"}
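TimeMaps like the one above can be fetched programmatically. Here is a minimal sketch, assuming ODU's public MemGator instance and its CDXJ TimeMap endpoint (the aggregator URL is an assumption, not necessarily the instance used in this study):

import json
import requests

# Minimal sketch: fetch a CDXJ TimeMap from a MemGator aggregator and
# list the URI-Ms it contains.
MEMGATOR = "https://memgator.cs.odu.edu/timemap/cdxj/"  # assumed instance
uri_r = "https://twitter.com/RepKarenBass"

resp = requests.get(MEMGATOR + uri_r)
for line in resp.text.splitlines():
    if not line or line.startswith("!"):  # skip CDXJ metadata lines
        continue
    timestamp, record = line.split(" ", 1)
    memento = json.loads(record)
    print(timestamp, memento["uri"])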

Figure 5: Screenshot of the memento for Rep. Karen Bass's Twitter profile page with 20 embedded tweets
Output upon parsing the fetched mementos
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827923375880359937|||Timestamp: 1486227289|||TweetText: Sometimes the best way to stand up is to sit down. Happy Birthday Rosa Parks. #OurStory #BlackHistoryMonthpic.twitter.com/fjPMeD3RzX
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827593256988860417|||Timestamp: 1486148583|||TweetText: I urge the Gov't of Cameroon to respect the civil and human rights of all of its citizens. See my full statement: http://bass.house.gov/media-center/press-releases/rep-bass-condemns-intimidation-against-english-speaking-population …
TweetType: RT|||ScreenName: RepKarenBass|||TweetId: 827292997100376064|||Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
TweetType: OTR|||ScreenName: RepBarbaraLee|||TweetId: 827285964674441216|||Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
TweetType: OT|||ScreenName: RepBarbaraLee|||TweetId: 827201943323938816|||Timestamp: 1486055286|||TweetText: This month is National Children’s Dental Health Month (NCDHM). This year's slogan is "Choose Tap Water for a Sparkling Smile"pic.twitter.com/gk1cj8oTK9
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196902273929217|||Timestamp: 1486054084|||TweetText: On the growing list of things I shouldn't have to defend my stance on, add #UCBerkeley, 1 of our nation's most prestigious pub. universities
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196521347166209|||Timestamp: 1486053993|||TweetText: .@realDonaldTrump: #UCBerkeley developed immunotherapy for cancer!
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196386512871425|||Timestamp: 1486053961|||TweetText: .@realDonaldTrump: Do you like Vitamin K? Discovered/synthesized at #UCBerkeley
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196102554296320|||Timestamp: 1486053894|||TweetText: .@realDonaldTrump What's your stance on painkillers? Beta-endorphins invented at #UCBerkeleyhttps://twitter.com/realDonaldTrump/status/827112633224544256 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826960463590207488|||Timestamp: 1485997713|||TweetText: Happy to see Judge Birotte of LA continue the fight towards ending Pres. Trump’s exec. order.http://www.latimes.com/local/lanow/la-me-ln-federal-order-travel-ban-20170201-story.html …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826877783787847681|||Timestamp: 1485978000|||TweetText: This morning, I was happy to attend @MENTORnational's Capitol Hill Day, where mentors advocate for services for all youth. Thank you!pic.twitter.com/EyVgDSIvuE
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826860675007930368|||Timestamp: 1485973921|||TweetText: A civil & women's rights activist, Dorothy Height helped black women throughout America succeed. #OurStory #BlackHistoryMonth #NewStamppic.twitter.com/v8wnHFpgMu
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826833993874042880|||Timestamp: 1485967560|||TweetText: Let's not turn our backs on the latest refugees and potential citizens just because they come from Africa. More: https://bass.house.gov/media-center/press-releases/rep-bass-pens-letter-urging-president-trump-rescind-travel-ban …pic.twitter.com/J9veQNSpJu
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826822912057413633|||Timestamp: 1485964918|||TweetText: Trump's listening session is w people he knows and should be "listening" to all the time---campaign surrogates, supporters, employees
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826799517295058944|||Timestamp: 1485959340|||TweetText: 57 years ago, four Black college students sat at a lunch counter and asked for lunch. We will not go back. #OurStory #BlackHistoryMonthpic.twitter.com/ER00yv1q7B
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826606703928078336|||Timestamp: 1485913370|||TweetText: 7 in 10 Americans do NOT support @POTUS relentless quest to strike down Roe v Wade.  Where does #Gorsuch stand? #SCOTUS
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826579567376637952|||Timestamp: 1485906900|||TweetText: Proud to stand w/ my Foreign Affairs colleagues and defend dissenting diplomats..http://www.politico.com/story/2017/01/trump-immigration-ban-state-department-dissent-democrats-234433 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826547056235933697|||Timestamp: 1485899149|||TweetText: Treasury nominee #Mnuchin denied that his company engaged in robo-signing, foreclosing on Americans without proper review #RejectMnuchin
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826474625831890945|||Timestamp: 1485881880|||TweetText: Few cities on this planet have benefited so handsomely from immigration as LA. Read the @TrumanProject letter: http://ow.ly/oOhF308waXJ
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826426234422829056|||Timestamp: 1485870343|||TweetText: Today is the day! #GetCoveredhttps://twitter.com/JoaquinCastrotx/status/826416237223755777 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826275582883270656|||Timestamp: 1485834425|||TweetText: Pres. Trump has replaced Yates as standing AG for standing up for millions. You can't replace us all.http://www.cbsnews.com/amp/news/trump-fires-acting-attorney-general-sally-yates/?client=safari …

From mementos of the 115th Congress

For this approach, we reused locally stored TimeMaps and mementos from the 115th Congress, which we collected between 2017-01-20 and 2019-01-03. The list of Twitter handles for the 115th Congress was obtained from the data set of 115th Congressional tweet ids released by Social Feed Manager. We requested mementos from the web archives by expanding the URI-R for each Twitter handle with language and with_replies arguments (thanks to Sawood Alam for suggesting the language variations).
For example, we queried multiple web archives for Doug Jones's Twitter handle, @dougjones, by expanding with the language and with_replies arguments as shown below:

https://twitter.com/dougjones
Twitter supports 47 different language variations and multiple arguments such as with_replies. Upon searching for the URI-R https://twitter.com/dougjones, the web archives return all the mementos for the exact URI-R without any language variations or arguments.

Excerpt of the TimeMap response received for https://twitter.com/dougjones

20110210134919 {"uri": "http://web.archive.org/web/20110210134919/http://twitter.com:80/dougjones", "rel": "first memento", "datetime": "Thu, 10 Feb 2011 13:49:19 GMT"}
20180205201909 {"uri": "http://web.archive.org/web/20180205201909/https://twitter.com/DougJones", "rel": "memento", "datetime": "Mon, 05 Feb 2018 20:19:09 GMT"}
20180306132212 {"uri": "http://wayback.archive-it.org/all/20180306132212/https://twitter.com/DougJones", "rel": "memento", "datetime": "Tue, 06 Mar 2018 13:22:12 GMT"}
20180912165539 {"uri": "http://wayback.archive-it.org/all/20180912165539/https://twitter.com/DougJones", "rel": "memento", "datetime": "Wed, 12 Sep 2018 16:55:39 GMT"}
Upon searching for the URI-R https://twitter.com/dougjones?lang=en, the web archives return all the mementos for the language variation "en".

TimeMap response received for https://twitter.com/dougjones?lang=en

20190424140424 {"uri": "http://web.archive.org/web/20190424140424/https://twitter.com/dougjones?lang=en", "rel": "first memento", "datetime": "Wed, 24 Apr 2019 14:04:24 GMT"}
20190501165834 {"uri": "http://web.archive.org/web/20190501165834/https://twitter.com/dougjones?lang=en", "rel": "memento", "datetime": "Wed, 01 May 2019 16:58:34 GMT"}
20190509164649 {"uri": "http://web.archive.org/web/20190509164649/https://twitter.com/dougjones?lang=en", "rel": "last memento", "datetime": "Thu, 09 May 2019 16:46:49 GMT"}
Many mementos in the web archives capture Twitter handle URLs with language and with_replies arguments. Therefore, for each Twitter handle we queried both the profile URL and its with_replies variant, each without a language argument and with all 47 language variations:
https://twitter.com/dougjones?lang=en (47 URLs for the 47 languages)
Total: 96 URLs for each URI-R (2 base URLs × 48 variants each)

Example for different language variation URLs:
...
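A minimal sketch of generating these URL variants follows; LANGS stands in for Twitter's 47 language codes (only a few are shown here), which would bring the total to 96 URLs per handle:

# Sketch: generate the profile and with_replies URL variants for a handle.
LANGS = ["en", "es", "fr", "de", "ja"]  # stand-in; the full list has 47 codes

def url_variants(handle):
    bases = [f"https://twitter.com/{handle}",
             f"https://twitter.com/{handle}/with_replies"]
    urls = []
    for base in bases:
        urls.append(base)  # variant without a language argument
        urls.extend(f"{base}?lang={lang}" for lang in LANGS)
    return urls

# With all 47 language codes this yields 2 * (1 + 47) = 96 URLs.
print(len(url_variants("dougjones")))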

The parsed embedded tweets from the mementos were compared with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids.
We also had locally stored mementos for the 115th Congress from 2017-01-01 to 2018-06-30. The data set of Twitter handles for this collection was created by taking a Wikipedia page snapshot of the current members of Congress on July 4, 2018 and using the CSPAN Twitter list on members of Congress and Politwoops to get all the Twitter handles. Upon parsing the embedded tweets from the mementos, we compared the parsed tweets with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids. 

From mementos of the President, the Vice-President and the Committee Twitter handles list

For this analysis we fetched all the mementos for the President, the Vice-President, and committee handles between 2019-01-03 and 2019-06-30. Upon fetching the mementos and parsing embedded tweets, we compared the parsed tweets with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids.  

Upon completion of our analysis, we learned how to extract the timestamp from any tweet id. Snowflake is the service Twitter uses to generate unique ids for all tweets and other objects within Twitter like lists, users, collections, etc. Snowflake generates unsigned 64-bit integers which consist of:
  • timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
  • configured machine id - 10 bits - gives us up to 1024 machines
  • sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
We have created a web service, TweetedAt, to extract the timestamp from deleted tweet ids. Using TweetedAt, we found the timestamps of all the deleted tweet ids provided by Derek Willis from Politwoops. Of the 107 Politwoops deleted tweet ids, only six fall within the 116th Congress time range and nine within the 115th Congress time range.
To summarize, we were unable to find any of the deleted tweet ids provided by Derek Willis from Politwoops. We analyzed the sources given below:
  • Mementos for the 116th Congress Twitter handles between 2019-01-03 and 2019-05-15.
  • Twitter handle mementos for the committees, the President and the Vice-President between 2019-01-03 and 2019-06-30. 
  • Mementos for the 115th Congress Twitter handles between 2017-01-03 and 2019-01-03. 
  • The Internet Archive CDX Server API responses on the 116th Congress Twitter handles.
There are several possible reasons for being unable to find the deleted tweet ids provided by Derek Willis from Politwoops:
  • 92 out of 107 deleted tweets were outside the date range of our analysis.
  • The mementos in a web archive are indexed by their URI-Rs. When a user changes their Twitter handle, the original resource URI-R for the user's Twitter account also changes. For example, Rep. Nancy Pelosi used the Twitter handle, @nancypelosi, during the 115th Congress but changed it to @speakerpelosi in the 116th Congress. Now querying the web archives for the mementos for Rep. Nancy Pelosi with her Twitter handle, @speakerpelosi, returns the earliest mementos from the 116th Congress. In order to get mementos prior to the 116th Congress, we need to query the web archives with Twitter handle, @nancypelosi. 
  • The data set of Twitter handles for the US Congress used in our analysis has a one-to-one mapping between a seat in Congress and a member of Congress. If a seat has been held by multiple members during a Congress's tenure, the data set includes only the current member, thus omitting the Twitter handles of former members within the same Congress.
We analyzed the web archives for the 115th and the 116th Congress members, the President, the Vice-President, and committee Twitter handles for finding the deleted tweet ids provided to us by Derek Willis from Politwoops. Despite being unable to find any match for the deleted tweet ids from our analysis, we will continue to investigate as we learn more.  We welcome any information that might aid our analysis.

-----
Mohammed Nauman Siddique
(@m_nsiddique)

2019-08-03: TweetedAt: Finding Tweet Timestamps for Pre and Post Snowflake Tweet IDs

Figure 1: Screenshot from TweetedAt service showing timestamp for a deleted tweet from @SpeakerPelosi
On May 11, 2019, Derek Willis from Politwoops shared a list of deleted tweet IDs which could not be attributed to any Twitter handle tracked by them. We tried multiple techniques to find the deleted tweet IDs in the web archives, but we were unsuccessful in finding any of them within the time range of our analysis. During our investigation, we learned of Snowflake, a service used by Twitter to generate unique IDs. We used Snowflake to extract the timestamps from the deleted tweet IDs. Of the 107 deleted tweet IDs shared with us, only seven were in the time range of our initial analysis. In this post, we describe TweetedAt, a web service and library to extract the timestamps of post-Snowflake IDs and estimate timestamps for pre-Snowflake IDs.

Previous implementations of Snowflake in different programming languages, such as Python, Ruby, PHP, and Java, support extracting the timestamp of a Snowflake tweet ID, but none of them estimate timestamps for pre-Snowflake IDs.

The reasons for implementing TweetedAt are:
  • It is the only web service which allows users to find the timestamp of Snowflake tweet IDs and estimate tweet timestamps for pre-Snowflake Tweet IDs.
  • The Twitter developer API has access rate limits, which act as a bottleneck when finding timestamps for a data set of tweet IDs. This bottleneck is not present in TweetedAt because we do not interact with Twitter's developer API to find the timestamps.
  • Deleted, suspended, and private tweets do not have their metadata accessible from Twitter's developer API. TweetedAt is the solution to finding the timestamps for any of these inaccessible tweets. 

Snowflake


In 2010, Twitter migrated its database from MySQL to Cassandra. Unlike MySQL, Cassandra does not support sequential ID generation, so Twitter announced Snowflake, a service to generate unique IDs for all the tweet IDs and other objects within Twitter like lists, users, collections, etc. Snowflake generates unsigned 64-bit integers which consist of:
  • timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
  • configured machine ID - 10 bits - gives us up to 1024 machines
  • sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)

According to Twitter's post on Snowflake, the tweet IDs are k-sorted within a second bound but the millisecond bound cannot be guaranteed. We can extract the timestamp for a tweet ID by right shifting the tweet ID by 22 bits and adding the Twitter epoch time of 1288834974657.  
Python code to get UTC timestamp of a tweet ID
from datetime import datetime

def get_tweet_timestamp(tid):
    offset = 1288834974657              # Twitter's Snowflake epoch, in ms
    tstamp = (tid >> 22) + offset       # top 41 bits hold the timestamp
    utcdttime = datetime.utcfromtimestamp(tstamp / 1000)
    print(str(tid) + " : " + str(tstamp) + " => " + str(utcdttime))

Twitter released Snowflake on November 4, 2010, but Twitter itself has been around since March 2006. Pre-Snowflake IDs do not have timestamps encoded in them, but we can estimate their timestamps using 2362 tweet IDs with known timestamps as ground truth.

Estimating tweet timestamps for pre-Snowflake tweet IDs

TweetedAt estimates the timestamps for pre-Snowflake IDs with an approximate error of 1 hour. For our implementation, we collected 2362 tweet IDs and their timestamps at a daily interval between March 2006 and November 2010 to create a ground truth data set, which is used for estimating the timestamp of any tweet ID prior to Snowflake. Using a weekly interval for the ground truth data set instead resulted in an approximate error of 3 hours and 23 minutes.
Batch cURL command to find first tweet  
msiddique@wsdl-3102-03:/$ curl -Is "https://twitter.com/foobarbaz/status/[0-21]"| grep "^location:"
location: https://twitter.com/jack/status/20
location: https://twitter.com/biz/status/21
The ground truth data set ranges from tweet ID 20 to 29700859247. The first non-404 tweet ID, found using the cURL batch command above, is 20. We found a memento of @nytimes in the Internet Archive, captured close to the Snowflake release date, which contains the pre-Snowflake ID 29548970348. We varied the digits of the tweet ID 29548970348 using the cURL batch command to uncover the largest non-404 tweet ID known to us, 29700859247.
Figure 2: Exponential tweet growth rate in pre-Snowflake time range

Figure 3: Semi-log scale of tweet growth in pre-Snowflake time range


Figure 4: Pre-Snowflake time range graph showing two close points on the curve (upper bound and lower bound) and a point between them for which the timestamp is to be estimated. Each point on the graph is represented by a tuple of Tweet Timestamp (T) and Tweet ID (I).
As shown in Figure 4, assuming the two points are very close on the curve, the segment between them is approximately linear.

We know the tweet ID (I) of a tweet and want to estimate its timestamp (T). With (T_lower, I_lower) and (T_upper, I_upper) as the nearest ground truth points below and above I, T can be estimated by linear interpolation:

T = T_lower + (I − I_lower) × (T_upper − T_lower) / (I_upper − I_lower)
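A minimal sketch of this interpolation, assuming the ground truth is a list of (tweet ID, Unix timestamp) pairs sorted by ID (the two pairs below are illustrative placeholders, not the actual 2362-point data set):

import bisect
from datetime import datetime, timezone

# Illustrative two-point ground truth; TweetedAt uses 2362 daily points.
GROUND_TRUTH = [
    (20, 1142974214),           # tweet ID 20, March 21, 2006
    (29700859247, 1288830000),  # largest known pre-Snowflake ID (approx.)
]

def estimate_timestamp(tweet_id):
    ids = [i for i, _ in GROUND_TRUTH]
    pos = bisect.bisect_right(ids, tweet_id)
    pos = max(1, min(pos, len(GROUND_TRUTH) - 1))  # clamp to a valid segment
    (i_lo, t_lo), (i_hi, t_hi) = GROUND_TRUTH[pos - 1], GROUND_TRUTH[pos]
    # Linear interpolation between the two nearest ground truth points
    t = t_lo + (tweet_id - i_lo) * (t_hi - t_lo) / (i_hi - i_lo)
    return datetime.fromtimestamp(t, tz=timezone.utc)

print(estimate_timestamp(966426142))  # accurate only with dense ground truth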
We tested the pre-Snowflake timestamp estimation formula on 1000 random tweet IDs generated between the minimum and maximum tweet ID; the test set showed an approximate mean error of 45 minutes. We also created a weekly test data set of 1932 pre-Snowflake tweet IDs, which showed an approximate mean error of 59 minutes. Figure 5 shows that after 2006 the half-yearly mean error stays within 60 minutes.
Summary of error difference between the estimated timestamp and the true Tweet timestamp (in minutes) generated on 1000 pre-Snowflake random Tweet IDs
The estimation formula could be replaced by a segmented curve fit of the graph shown in Figure 2, which would reduce the program size by excluding the 2362 data points.
Figure 5: Box plot of error range for Pre-Snowflake IDs conducted over a weekly test set.
Summary of error difference between the estimated timestamp and the true Tweet timestamp (in minutes) generated on weekly pre-Snowflake random Tweet IDs

Estimating the timestamp of a deleted pre-Snowflake ID

Figure 7 shows a pre-Snowflake deleted tweet from @barackobama, which can be validated by the cURL response for the tweet ID. The timestamp of the tweet in the memento is in Pacific Daylight Time (GMT-7). Upon converting the timestamp to GMT, it changes from Sun, 19 October 2008 10:41:45 PDT to Sun, 19 October 2008 17:41:45 GMT. Figure 8 shows TweetedAt returning the estimated timestamp of Sun, 19 October 2008 17:29:27 GMT, which is off by approximately 12 minutes.

Figure 7: Memento from Internet Archive for @barackobama having a pre-Snowflake deleted tweet ID  
cURL response for @barackobama deleted Tweet ID

msiddique@wsdl-3102-03:~/WSDL_Work/Twitter-Diff-Tool$ curl -IL https://twitter.com/barackobama/966426142
HTTP/1.1 301 Moved Permanently
location: /barackobama/lists/966426142
...

HTTP/1.1 404 Not Found
content-length: 6329
last-modified: Wed, 31 Jul 2019 22:00:56 GMT
...

Figure 8: TweetedAt timestamp response for @barackobama's pre-Snowflake deleted tweet ID 966426142 which is off  by 12 minutes

To summarize, we released TweetedAt, a service to find the timestamp of any tweet ID from 2006 through today. We created a ground truth data set of pre-Snowflake IDs collected at a daily interval for estimating the timestamp of any tweet ID prior to Snowflake (November 4, 2010). We tested our pre-Snowflake timestamp estimation formula on 1000 random test data points and observed an approximate mean error of 45 minutes, and on 1932 weekly-collected test data points with an approximate mean error of 59 minutes.

Related Links