Tuesday, May 14, 2019

2019-05-14: Back to Pennsylvania - Artificial Intelligence for Data Discovery and Reuse (AIDR 2019)

AIDR 2019
The 2019 Artificial Intelligence for Data Discovery and Reuse conference, supported by the National Science Foundation, was held at Carnegie Mellon University, Pittsburgh, PA, from May 13 to May 15, 2019. It is called a conference, but it is more like a workshop. There were only plenary meetings (and a small poster session), and the presentations were not all about frontiers of research. Many of them were research reviews in which the speakers tried to connect their work with "data reuse". The presenters came from various domains, from text mining to computer vision, from medical imaging to self-driving cars, etc. Another difference from regular CS conferences is that presenters were accepted based only on the abstracts they submitted. The full papers are submitted later.

Because CiteSeerX collects a lot of data from the Web, and our group does a lot of work on information extraction and classification and reuses a lot of data for training AI models, Dr. Lee Giles recommended that I give a presentation. My title was "CiteSeerX: Reuse and Discovery for Scholarly Big Data". In general, the talk was well received. One person asked how we plan to collect annotations from authors and readers by crowdsourcing. My answer was to take advantage of the CiteSeerX platform, but we need to collect more papers (especially more recent papers) and build better author profiles before sending out the requests. I will compile everything into a 4-page paper.

In my 1 1/2 days at CMU, I listened to two keynotes. The first was given by Tom Mitchell, one of the pioneers of machine learning and the chair of the machine learning department. His talk was on "Discovery from Brain Image Data". I had previously attended a webinar by him on a similar topic. His research is on connecting natural language with brain activity, studying how brains react to spoken-language stimuli. Here are some takeaways: (1) it takes about 400 ms for the brain to fully take in a word such as "coffee"; (2) the reaction happens in different regions of the brain and it is dynamic (changing over time). The data was collected using fMRI from several people, and there was quite a bit of work to denoise the fMRI signals to filter out other ongoing activities.

The second keynote was given by Glen de Vries, the president of a startup company called Medidata. Glen talked about how Medidata improves drug testing confidence by using synthetic data. The presentation was given in a very professional way (like a TED presentation), but Dr. Lee Giles made a comment that he was using a statistical method called "boosting" and Glen agreed.

Another interesting talk was given by Natasha Noy from Google. Her talk was about the recently launched search engine called "Google Dataset Search". According to Natasha, the idea was proposed in one of her blog posts in 2017. The search engine went online in September 2018. Unfortunately, because it was not well advertised, very few people know about it. I personally learned about it only two weeks ago. The search engine uses the crawled data from Google. The backend uses basic methods to identify public tools annotated with the schema in schema.org, which defines a comprehensive list of fields for metadata of semantic entities. I explored this schema in 2016. The schema can be used for CiteSeerX, replacing Dublin Core, but it does not cover semantic typed entities and relationships. So currently, it is good for metadata management. The datasets indexed were also limited to certain domains. Another interesting data search engine was Auctus, a dataset search engine tailored for data augmentation. It searches data using data as input.

Other interesting talks were:
  • Dr. Cornelia Caragea gave two presentations, one on "keyphrase extraction" (she is an expert in this field) and one on "web archiving", in collaboration with Mark Phillips of UNT.
  • Matias Carrasco Kind, an astronomer, talked about "Searching for similarities and anomalies in galaxy images".
At the conference, I met with Dr. C. Lee Giles and Dr. Cornelia Caragea. All of us were very glad to see each other. We had a very pleasant dinner in a restaurant called "spoon". I had a lunch conversation with Dr. Beth Plale, an NSF program director. She gave me some good suggestions for how to survive as a tenure-track professor. I also had brief conversations with Natasha Noy of Google AI and Martin Klein of Los Alamos National Lab.

Overall, the conference experience was very good and I learned a lot by listening to top speakers from CMU. The registration fee was low and they served breakfast, lunch, and a banquet (which I could not attend). The city of Pittsburgh is still cool and windy, but I felt that I am quite used to it because I lived in Pennsylvania for 14 years! The Cathedral of Learning reminds me of the good old days when I was visiting my friend Dr. Jintao Liu. He used to be a graduate student at UPitt and is now a professor at Tsinghua University. By the way, the SuperShuttle service was not very good. The front desk canceled my trip from the airport to my hotel because she wasn't able to contact the driver. I had to take a taxi. I used Uber on the way back. It was quick and inexpensive.

Jian Wu

Monday, May 6, 2019

2019-05-06: Twitter broke my scrapers

Fig. 1: The old tweet DIV showing four (data-tweet-id, data-conversation-id, data-screen-name, and tweet-text) attributes with meaningful names. These attributes are absent in the new tweet DIV (Fig. 2).
On April 23, 2019, my Twitter desktop layout changed. I initially thought a glitch caused me to see the mobile layout on my desktop instead of the standard desktop layout, but I soon learned this was no accident. I was part of a subset of Twitter users who were not given the option to opt in to try the new layout.
New desktop look 
While others might have focused on the cosmetic or functional changes, my immediate concern was to understand the extent of the structural changes to the Twitter DOM. So I immediately opened my Google Chrome Developer Tools to inspect the Twitter DOM, and I was displeased to learn that the changes to the layout seeped beyond the cosmetic new looks of the icons into the DOM. This meant that I would have to rewrite all my research applications built to scrape data from the old Twitter layout.
Old Twitter desktop look
At the moment, I am unsure if it would be possible to extract all the data previously accessible from the old layout. It is important to note that scraping goes against Twitter's Terms of Service, and Twitter offers an API that fulfills some requests, removing the need for scraping in those cases. However, the Twitter API is limited in search and, most importantly, does not offer a method for extracting all tweets from a conversation. Extracting tweets from a conversation is a task fundamental to my PhD research, so I scrape Twitter privately for research. In this blogpost, I will use the tweet below to highlight some of the major changes to the Twitter DOM, specifically the tweet DIV, by comparing the old and new layouts.
Fig. 2: In the new tweet DIV, semantic items (e.g., the four semantic items in Fig. 1) are absent or obscured.
Old Tweet DIV vs New Tweet DIV
The most consequential (to me) structural difference between the old and new tweet DIVs is that the old tweet DIV includes many attributes with meaningful names while the new tweet DIV does not. In fact, in the old tweet layout, the fundamental unit, the tweet, was explicitly labeled a "tweet" by a DIV with the class name "tweet", unlike in the new layout. Let us consider the difference between the old and new tweet DIVs from the perspective of the four important attributes marked in Fig. 1:
  1. data-tweet-id: In the old layout, data-tweet-id (which contains the tweet ID, a string that uniquely identifies a tweet) was explicitly marked. In the new layout, the data-tweet-id attribute is absent.
  2. data-conversation-id: This attribute, absent in the new layout and present in the old layout, is responsible for chaining tweets, and is thus required for identifying tweets in a reply or conversation thread. A tweet that is a reply includes the tweet ID of its parent tweet as the value of the data-conversation-id attribute.
  3. data-screen-name: The data-screen-name attribute labels the Twitter handle of the tweet author. This attribute is marked explicitly in the old tweet DIV, but not in the new tweet DIV.
  4. tweet-text: Within the old tweet DIV, the DIV with class name, "tweet-text," marks the text of the tweet, but in the new tweet DIV, there is no such semantic label for the tweet-text.
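
To make the four items above concrete, here is a minimal sketch of how they could be read from markup shaped like the old tweet DIV. The HTML snippet, the handle, and the IDs are made up for illustration (they are assumptions, not real Twitter output), and the parser uses only Python's standard library rather than any particular scraping tool.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the old tweet DIV described above.
# The handle, IDs, and text are invented for illustration.
OLD_LAYOUT_HTML = """
<div class="tweet" data-tweet-id="1111" data-conversation-id="1111" data-screen-name="example_user">
  <div class="tweet-text">Parent tweet</div>
</div>
<div class="tweet" data-tweet-id="2222" data-conversation-id="1111" data-screen-name="other_user">
  <div class="tweet-text">A reply to the parent</div>
</div>
"""


class OldTweetParser(HTMLParser):
    """Collects the four semantic items from old-layout tweet DIVs."""

    def __init__(self):
        super().__init__()
        self.tweets = []
        self._text_tag = None  # tag name of the open tweet-text element
        self._depth = 0        # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self._text_tag is not None:
            # Inside the tweet text: only track nesting of the same tag.
            if tag == self._text_tag:
                self._depth += 1
            return
        if tag == "div" and "tweet" in classes:
            self.tweets.append({
                "tweet_id": attr.get("data-tweet-id"),
                "conversation_id": attr.get("data-conversation-id"),
                "screen_name": attr.get("data-screen-name"),
                "text": "",
            })
        elif "tweet-text" in classes and self.tweets:
            self._text_tag = tag
            self._depth = 1

    def handle_endtag(self, tag):
        if self._text_tag is not None and tag == self._text_tag:
            self._depth -= 1
            if self._depth == 0:
                self._text_tag = None

    def handle_data(self, data):
        if self._text_tag is not None:
            self.tweets[-1]["text"] += data


parser = OldTweetParser()
parser.feed(OLD_LAYOUT_HTML)
# Tweets that share a data-conversation-id belong to the same thread.
thread = [t for t in parser.tweets if t["conversation_id"] == "1111"]
```

Because the new tweet DIV carries none of these names, a sketch like this has nothing to key on there, which is the practical cost of the change.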
The new Twitter layout is still under development, so it comes as no surprise that I discovered a glitch. I noticed that reloading my timeline caused Twitter to load and then quickly remove sponsored tweets from my timeline. This happens too fast to capture with a screenshot, so I recorded my screen to capture the glitch (Fig. 3).
Fig. 3: New Twitter layout glitch showing the loading and subsequent removal of sponsored tweets
It is not clear if the structural changes to the Twitter DOM are merely coincidental with the rollout of the new layout or if the removal of semantic attributes is part of an intentional effort to discourage scraping. Whatever the actual reason, the consequence is obvious - scraping Twitter has just gotten harder.

-- Alexander C. Nwala (@acnwala)

Friday, May 3, 2019

2019-05-03: Selected Conferences and Orders in WS, DL, IR, DS, NLP, AI

The time when research works should be done is usually less predictable than homework. You may submit a paper next year, but you cannot submit your homework the next year. Even if there is a target deadline, the results may not be delivered on time. Even if the results are ready, the papers may not be in good shape, especially for papers written by students. Even if papers are submitted, they can be rejected. Therefore, it is usually useful to decide where to submit the work next.

I used to struggle to find the next deadline for my work, so I compiled this timeline, sorted by month. The deadlines are not intended to be accurate because they change every year. They can also be extended. The deadlines may vary depending on the submission type: full paper, short paper, poster, etc. The focus is on the approximate chronological order in which the deadlines happen. One can always visit the conferences' websites for the exact deadlines. The list is also not intended to be exhaustive, as it focuses on popular conferences. I also do not want the list to be too crowded, but it can be updated by adding new conferences.

The list below is made for people in the Web Science Digital Libraries Group (WS-DL) at ODU, but it can be generalized to researchers working in Web Science, Digital Libraries, Information Retrieval, Data Science, Natural Language Processing, and Artificial Intelligence to better plan where research works can be disseminated. 


  • JCDL (full/short/poster) (January 25, 2019)
  • SIGIR (full/short) (January 28, 2019)
  • KDD (full/short) (February 3, 2019)
  • ICDAR (full) (February 15, 2019)
  • IJCAI (full) (February 15, 2019)
  • ACM Web Science (full/short/poster) (February 18, 2019)
  • ACL (full/short) (March 4, 2019)
  • COLING (full) (March 16, 2018)
  • ISWC (full) (April 3, 2019) 
  • ECML-PKDD (full) (April 5, 2019)
  • DocEng (full) (April 9, 2019)
  • TPDL (full/short/poster) (April 15, 2019) 
  • IRI (full) (May 2, 2019)
  • ICTIR (full/short) (May 15, 2019)
  • DocEng (short) (May 21, 2019)
  • EMNLP (full/short) (May 21, 2019)
  • CIKM (full/short) (May 22, 2019) 
  • NIPS (full) (May 23, 2019)
  • ICDM (full) (June 5, 2019)
  • K-CAP (full/short) (June 22, 2019)

  • WSDM (full) (August 15, 2018)
  • IEEE Big Data (full) (August 19, 2019), poster due later
  • AAAI, IAAI (full) (September 5, 2018)
  • iConference (full/short/poster) (September 10, 2018)
  • ECIR (full/short) (October 1, 2019)
  • SDM (full) (October 12, 2018)
  • WWW (full/short) (November 5, 2018)
  • ICWS (full/short) (December 1, 2018)
  • NAACL-HLT (full/short) (December 10, 2018) 

Jian Wu 

Wednesday, April 17, 2019

2019-04-17: Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives

On March 11, 2019 in the NBA, the Utah Jazz hosted their Northwest Division rivals, the Oklahoma City Thunder.  During the game, a Utah fan (Shane Keisel) and an Oklahoma City player (Russell Westbrook) engaged in a verbal exchange, with the player stating the fan was directing racist comments to him and the fan admitting to heckling but denying that his comments were racist.  The event was well documented (see, for example, this Bleacher Report article), and the following day the fan received a lifetime ban from all events at the Vivint Smart Home Arena and the player received a $25k fine from the NBA.

Disclaimer: I have no knowledge of what the fan said during the game, nor do I have an opinion regarding the appropriateness of the respective penalties.  My interest is that after the game, the fan gave at least one interview with a TV station reporter in which he exposed his identity.  That set off a rapidly evolving series of events with both real and fake Twitter accounts, which we unravel with the aid of multiple web archives.  The initial analysis was performed by Justin Whitlock as a project in my CS 895 "Web Archiving Forensics" class; prior to Justin proposing it as a project topic, my only knowledge of this event was via the Daily Show.

First, let's establish a timeline of events.  The timeline is a little complicated because although the game was played in the Mountain time zone, most media reports are relative to Eastern time, and the web crawlers report their time in UTC (or GMT).  Furthermore, daylight saving time began on Sunday, March 10, and the game was played on Monday, March 11.  This means there is a four hour differential between UTC and EDT, and a six hour differential between UTC and MDT.  Although most events occur after the switch to daylight saving time, some events occur before it (when there was a five hour differential between UTC and EST).
  • 2019-03-12T01:00:00Z -- the game is scheduled to begin at March 11, 9pm EDT (March 12, 1am UTC).  An NBA game will typically last 2--2.5 hours, and at least one tweet shows Westbrook talking to someone in the bleachers midway through the second quarter (there may be other videos in circulation as well).
  • 2019-03-12T03:58:00Z -- based on the empty seats and the timestamp on the tweet (11:58pm EDT), the post-game interview with a KSL reporter embedded above reveals the fan's name and face.  The uncommon surname of "Keisel" combined with a closeup of his face enables people to quickly find his Twitter account: "@skeisel391".
  • 2019-03-12T04:57:34Z -- Within an hour of the KSL interview being posted, Keisel's Twitter account is "protected". This means we can see his banner and avatar photos and his account metadata, but not his tweets.
  • 2019-03-12T12:23:42Z -- Less than 9 hours after the KSL interview, his Twitter account is "deleted". No information is available from his account at this time.
  • 2019-03-12T15:29:47Z -- Although his Twitter account is deleted, the first page (i.e., first 20 tweets) is still in Google's cache and someone has pushed Google's cached version of the page into a web archive.  The banner of the web archive (archive.is) obscures the banner inserted by Google's cache, but a search of the source code of http://archive.is/K6gP4 reveals: 
    "It is a snapshot of the page as it appeared on Mar 6, 2019 11:29:08 GMT." 
In other words, an archived version of Google's cached page reveals Keisel's tweets (the most recent 20 tweets anyway) from nearly a week before (i.e., 2019-03-06T11:29:08Z) the game on March 11, 2019.
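
The offsets used in the timeline above can be checked with a short sketch. It uses fixed UTC offsets taken from the text (EDT = UTC-4, MDT = UTC-6) rather than a time zone database, so it is only valid for dates after the switch on March 10, 2019; the helper name `to_utc_z` is made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets from the text, valid after DST began on 2019-03-10.
EDT = timezone(timedelta(hours=-4), "EDT")
MDT = timezone(timedelta(hours=-6), "MDT")


def to_utc_z(local_dt):
    """Format an offset-aware datetime as an ISO 8601 UTC string ending in Z."""
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Scheduled tip-off: March 11, 9pm EDT, which is 7pm MDT in Utah.
tipoff_from_edt = to_utc_z(datetime(2019, 3, 11, 21, 0, tzinfo=EDT))
tipoff_from_mdt = to_utc_z(datetime(2019, 3, 11, 19, 0, tzinfo=MDT))

# Timestamp on the tweet with the KSL interview: 11:58pm EDT.
interview = to_utc_z(datetime(2019, 3, 11, 23, 58, tzinfo=EDT))

print(tipoff_from_edt)  # 2019-03-12T01:00:00Z
print(tipoff_from_mdt)  # 2019-03-12T01:00:00Z
print(interview)        # 2019-03-12T03:58:00Z
```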

Although Keisel quickly protected and then ultimately deleted his account, until it was deleted his photos and account metadata were available and allowed a number of fake accounts to proliferate.  The most successful fake is "@skeiseI391", which is also now deleted but stayed online until at least 2019-03-17T04:18:48Z.  "@skeiseI391" replaces the lowercase L ("l") with an uppercase I ("I").  Depending on the font of your browser, the two characters can be all but indistinguishable (here they are side-by-side: lI).  I'm not sure who created this account, but we discovered it in this tweet, where the user provides not only screen shots but also a video of scrolling and clicking through the @skeiseI391 account before it was deleted.
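
When the two handles look identical on screen, comparing them character by character makes the substitution visible. This is a small illustrative sketch, not a tool used in the original analysis:

```python
real = "skeisel391"
fake = "skeiseI391"

# List every position where the two handles differ, with the code points.
diffs = [
    (i, a, f"U+{ord(a):04X}", b, f"U+{ord(b):04X}")
    for i, (a, b) in enumerate(zip(real, fake))
    if a != b
]
print(diffs)  # [(6, 'l', 'U+006C', 'I', 'U+0049')]
```

The only difference is at index 6: a lowercase L (U+006C) in the real handle and an uppercase I (U+0049) in the fake one.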

The video has significant engagement: originally posted at 2019-03-12T10:55:00Z, it now has greater than 1k RTs, 3k likes, and 381k views.  There are many other accounts circulating these screen shots: some of which are provably true, some of which are provably false, and some of which cannot be verified using public web archives.  The screen shots have had an impact in the news as well, showing up in, among others, The Root, News One, and BET.  BET even quoted a provably fake tweet in the headline of their article:

This article's headline references a fake tweet.
The Internet Archive has mementos (archived web pages) for both the fake @skeiseI391 and the real @skeisel391 accounts, but the Twitter account metadata (e.g., when the account was created, how many followers, how many tweets) is in Chinese for the fake account and in Kannada for the real account.  This is admittedly confusing, but is a result of how the Internet Archive's crawler and Twitter's cookies interact; see our research group's posts from 2018-03 and 2019-03 on these topics for further information.  Fortunately, archive.is does not have the same problems with cookies, so we use their mementos for the following screen shots (two from the real account at archive.is and one from the fake account at archive.is).

real account, 2019-03-06T11:29:08Z (Google cache)
real account, 2019-03-12T04:57:34Z
From the account metadata, we can see this was not an especially active account: established in October 2011, it has 202 total tweets, 832 likes, following 51 accounts, and from March 6 to March 12, it went from 41 to 53 followers.  The geographic location is set to "Utah, USA", and the bio has no linked URL and has three flag emojis.

fake account; note the difference in the account metadata
The fake account has notably different metadata: the bio has only two flag emojis, plus a link to "h.cm", a page for a parked domain that appears to have never had actual content (the Internet Archive has mementos back to 2012). Furthermore, this account is far more active with 7k tweets, 23k likes, 1500 followers and following 1300 accounts, all since being created in August 2018.

Twitter allows users to change their username (or "handle") without losing followers, old tweets, etc.  Since the handle is reflected in the URL and web archives only index by URL, we cannot know what the original handle of the fake @skeiseI391 account was, but at some point after the game the owner changed from the original handle to "skeiseI391".  Since the account is no longer live, we cannot use the Twitter API to extract more information about the account (e.g., followers and following, tweets prior to the game), but given the link to a parked/spam web page and the high level of engagement in a short amount of time, this was likely a burner bot account designed to amplify legitimate accounts (cf. "The Follower Factory"), and then was adapted for this purpose.

We can pinpoint when the fake @skeiseI391 account was changed.  By examining the HTML source from the IA mementos of the fake and real accounts, we can determine the URLs of the profile images:

Real: https://pbs.twimg.com/profile_images/872289541541044225/X6vI_-xq_400x400.jpg

Fake: https://pbs.twimg.com/profile_images/1105325330347249665/YHcWGvYD_400x400.jpg

Both images are 404 now, but they are archived at those URLs in the Internet Archive:

Archived real image, uploaded 2017-06-07T03:08:07Z
Archived fake image, uploaded 2019-03-12T04:29:09Z
Also note that the tool used to download the real image and then upload as the fake image maintained the circular profile pic instead of the original square.

For those familiar with curl, I include just a portion of the command line interface that shows the original "Last-Modified" HTTP response header from twitter.com.  It is those dates that record when the image changed at Twitter; these are separate from the dates from when the image was archived at the Internet Archive.  The relevant response headers are shown below:

Real image:
$ curl -I http://web.archive.org/web/20190312045057/https://pbs.twimg.com/profile_images/872289541541044225/X6vI_-xq_400x400.jpg
HTTP/1.1 200 OK
Server: nginx/1.15.8
Date: Wed, 17 Apr 2019 15:12:02 GMT
Content-Type: image/jpeg

X-Archive-Orig-last-modified: Wed, 07 Jun 2017 03:08:07 GMT

Memento-Datetime: Tue, 12 Mar 2019 04:50:57 GMT

Fake image:
$  curl -I http://web.archive.org/web/20190312061306/https://pbs.twimg.com/profile_images/1105325330347249665/YHcWGvYD_400x400.jpg
HTTP/1.1 200 OK
Server: nginx/1.15.8
Date: Wed, 17 Apr 2019 15:13:21 GMT
Content-Type: image/jpeg

X-Archive-Orig-last-modified: Tue, 12 Mar 2019 04:29:09 GMT

Memento-Datetime: Tue, 12 Mar 2019 06:13:06 GMT

The "Memento-Datetime" response header is when the Internet Archive crawled/created the memento (real = 2019-03-12T04:50:57Z; fake = 2019-03-12T06:13:06Z), and the "X-Archive-Orig-last-modified" response header is the Internet Archive echoing the "Last-Modified" response header it received from twitter.com at crawl time.  From this we can establish that the image was uploaded to the fake account at 2019-03-12T04:29:09Z, not quite 30 minutes before we can establish that the real account was set to "protected" (2019-03-12T04:57:34Z).
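
The intervals can be computed directly from the header values quoted above. This sketch parses the HTTP dates with Python's standard library; the variable names are made up, and the "protected" timestamp is the one from the timeline earlier in this post.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Header values copied from the curl output for the fake image.
fake_last_modified = parsedate_to_datetime("Tue, 12 Mar 2019 04:29:09 GMT")
fake_memento = parsedate_to_datetime("Tue, 12 Mar 2019 06:13:06 GMT")

# When the real account was observed as "protected" (from the timeline).
real_protected = datetime(2019, 3, 12, 4, 57, 34, tzinfo=timezone.utc)

# Time between the fake image upload and the real account going protected.
upload_to_protected = real_protected - fake_last_modified
# Time between the fake image upload and the Internet Archive's crawl.
upload_to_crawl = fake_memento - fake_last_modified

print(upload_to_protected)  # 0:28:25
print(upload_to_crawl)      # 1:43:57
```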

We've presented a preponderance of evidence that the account @skeiseI391 is fake and that the fake account is responsible for the "come at me _____ boy" tweet referenced in multiple news outlets.  But what about some of the other screen shots referenced in social media and the news?  Are they real?  Are they photoshopped?  Are they from other, yet-to-be-uncovered fake accounts?

First, any tweet that is a reply to another tweet will be difficult to verify with web archives unless we know the direct URL of the original tweet or the reply itself (e.g., twitter.com/[handle]/status/[many numbers]).  Unfortunately, the deep links for individual tweets are rarely crawled and archived for less popular accounts.  While the top level page will be crawled and the most recent 20 tweets included, one has to be logged in to Twitter to see the tweets included in the "Tweets & replies" tab, and public web archives are not logged in when they crawl so those contents are typically not available.  As such, it is hard to establish via web archives if the screen shot of the reply below is real or fake.  The original thread is still on the live web, but of the 45 replies, two of them are marked "This Tweet is unavailable".  One of those could be a reply from the real @skeisel391, but we don't have enough information to definitively rule if that is true.  The particular tweet shown below ("#poorloser") is of interest because even though it was from nearly a year ago, it would contradict the "we were having fun" attitude from the KSL interview.  Other screen shots that appear as replies will be similarly difficult to uncover using web archives.

This could be a real reply, but with web archives it is difficult to establish provenance of reply tweets.
The tweet below is more difficult to establish, since it does not appear to be a reply and the datetime that it was posted (2018-10-06T16:11:00Z) falls within the date range of the memento of the page in the Google cache, which has tweets from 2019-02-27 to 2018-10-06.  The use of "#MAGA" is in line with what we know Keisel has tweeted (at least 7 of the 20 tweets are clearly conservative / right-wing).  At first glance it appears that the memento covers tweets all the way back to 2018-10-04, since a retweet with that timestamp appears as the 20th and final tweet on the page, and thus a tweet from 2018-10-06 should appear before the one with a timestamp of 2018-10-04.  But retweeting a tweet does not reset the timestamp; for example, if I tweeted something yesterday and you retweet it today, your retweet will show my timestamp of yesterday.  So although the last timestamp shown on the page is 2018-10-04, the 19th tweet on the page is from Keisel and shows a timestamp of 2018-10-06.  So it's possible that the retweet occurred on 2018-10-06 and the tweet below just missed being included in the 20 most recent tweets (i.e., it was the 21st most recent tweet).  The screen shot shows a time of "11:11am", and in the HTML source of Google's cached page, for the 19th tweet it has:

title="8:11 AM - 6 Oct 2018"

Which would suggest that the screen shot happened after the 19th tweet, but without time zone information we can't reliably sequence the tweets.  Depending on the GeoIP of Google's crawler, Twitter would set the "8:11 AM" value relative to that timezone.  It's tempting to think it's in California and thus PST, but we can't be certain.  Regardless, there's no way to know the default time zone of the presumed client in the screen shot.

We cannot definitively establish the provenance of this tweet.
Bing's cache also has a copy of Keisel's page, and it covers a period of 2018-09-14 to 2018-03-27.  Unfortunately, that leaves a coverage gap from 2018-10-06 to 2018-09-14, inclusive, and if the "#MAGA" tweet is real it could fall between the coverage provided by Google's cache and Bing's cache.
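
As a rough check of that gap, the two coverage ranges can be compared as dates. This is only a day-granularity sketch with made-up names; because a cached page holds just the most recent 20 tweets, the boundary days themselves are only partially covered, which is why a tweet from 2018-10-06 can still be missing.

```python
from datetime import date

# (oldest, newest) tweet dates visible in each cached page, from the text.
google_cache = (date(2018, 10, 6), date(2019, 2, 27))
bing_cache = (date(2018, 3, 27), date(2018, 9, 14))


def coverage_gap(older, newer):
    """Return the (start, end) dates bounding the gap, or None if none exists."""
    if older[1] >= newer[0]:
        return None  # the ranges touch or overlap
    return older[1], newer[0]


gap = coverage_gap(bing_cache, google_cache)
gap_days = (gap[1] - gap[0]).days
print(gap_days)  # 22
```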

This leaves three scenarios to account for the above "#MAGA" tweet and why we don't have a cached copy of it:
  1. Keisel deleted this tweet on or before March 6, 2019 in anticipation of the game on March 11, 2019.  While not impossible, it does not seem probable because it would require someone taking a screen shot of the tweet prior to the KSL interview.  Since the real @skeisel391 account was not popular (~200 tweets, < 50 followers), this seems like an unlikely scenario.
  2. Someone photoshopped or otherwise created a fake tweet.  Given the existence of the fake @skeiseI391 account (and other fake accounts), this cannot be ruled out.  If it is a fake, it does not appear to have the same origin as the fake @skeiseI391 account.  
  3. The screen shot is legitimate and we are simply unlucky that the tweet in question fell in the coverage gap between the Google cache and the Bing cache, just missing appearing on the page in Google's cache.
I should note that in the process of extending Justin's analysis we came across this thread from sports journalist @JonMHamm, where he uncovered the fake @skeiseI391 account and also looked at the page in Google's cache, although he was unaware that the earliest date it establishes is 2018-10-06 and not 2018-10-04.  He also vouches for a contact that claims to have seen the "#MAGA" tweet while it was still live, but that's not something I can independently verify.

In summary, of the three primary tweets offered as evidence, we can reach the following conclusions:
  1. "come at me _____ boy" -- this tweet is definitively fake.
  2. "#poorloser" -- this tweet is a reply, and in general reply tweets will not appear in public web archives, so web archives cannot help us evaluate this tweet.
  3. "#MAGA" -- this tweet is either faked, or it falls in the gap between what appears in the Google cache and what appears in the Bing cache; using web archives we cannot definitively determine which explanation is more likely.
We welcome any feedback, additional cache sources, deep links to individual tweets, evidence that these tweets were ever embedded in HTML pages, or any additional forensic evidence.  I thank Justin Whitlock for the initial analysis, but I take responsibility for any errors (including the persistent fear of incorrectly computing time zone offsets).

Finally, in the future please don't just take a screen shot; push it to multiple web archives.


Note: There are other fake Twitter accounts, for example: @skeisell391 (two lowercase L's),  @skeisel_ (trailing underscore), but they are not well-executed and I have omitted them from the discussion above.  

Monday, April 1, 2019

2019-04-01: Creating a data set for 116th Congress Twitter handles

Senators from Alabama in the 115th Congress

Any researcher conducting research on Twitter and the US Congress might think, "How hard could it be to create a data set of Twitter handles for the members of Congress?" At any given time, we know the number of members in the US Congress and we also know who the current members are. At this point, creating a data set of Twitter handles for the members of Congress might seem like an easy task, but it turns out to be a lot more challenging than expected. We present the challenges involved in creating a data set of Twitter handles for the members of the 116th US Congress and provide such a data set.

Brief about the US Congress

The US Congress is a bicameral legislature comprising the Senate and the House of Representatives. The Congress consists of:

  • 100 senators, two from each of the fifty states.
  • 435 representatives, seats are distributed by population across the fifty states.
  • 6 non-voting members from the District of Columbia and US territories which include American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and US Virgin Islands.
Every US Congress is consecutively numbered and has a term of two years. The current US Congress is the 116th Congress which began on 2019-01-03 and will end on 2021-01-03.       

Previous Work on Congressional Twitter

Since the inception of social media, Congress members have aggressively used it as a medium of communication with the rest of the world. Previous researchers have completed their US Congress Twitter handles data set by both using other lists and manually adding to them. 

Jennifer Golbeck et al. in their papers "Twitter Use by the US Congress" (2010) and "Congressional twitter use revisited on the platform's 10-year anniversary" (2018) used Tweet Congress to build their data set of Twitter handles for the members of Congress. An important highlight from their 2018 paper is that every member of Congress has a Twitter account. Libby Hemphill in "What's congress doing on twitter?" talks about the manual creation of a list of 380 Twitter handles for the US Congress, which were used for collecting tweets in the winter of 2012. Theresa Loraine Cardenas in "The Tweet Delete of Congress: Congress and Deleted Posts on Twitter" (2013) used Politwoops to create the list of Twitter handles for members of Congress. Jihui Lee et al. in their paper "Detecting Changes in Congressional Twitter Networks over Time" used the community-maintained GitHub repository from @unitedstates to collect Twitter data for 369 of the 435 representatives from the 114th US Congress. Libby Hemphill and Matthew A. Shapiro in their paper "Appealing to the Base or to the Moveable Middle? Incumbents' Partisan Messaging Before the 2016 U.S. Congressional Elections" (2018) also used the community-maintained GitHub repository from @unitedstates.
Screenshot from Tweet Congress

Twitter Handles of the 116th Congress 

January 3, 2019 marked the beginning of the 116th United States Congress, with 99 freshman members. It has already been two months since the new Congress was sworn in. Now, let us review Tweet Congress and the GitHub repository @unitedstates to check how up-to-date these sources are with the Twitter handles for the current members of Congress. We also review the CSPAN Twitter lists for the members of Congress in our analysis.

Tweet Congress 

Tweet Congress is an initiative from the Sunlight Foundation with help from Twitter to create a transparent environment which allows easy conversation between lawmakers and voters in real time. It was launched in 2011. It lists all the members of Congress and their contact information. The service also provides visualizations and analytics for Congressional accounts.     

@unitedstates (GitHub Repository)

It is a community-maintained GitHub repository which has a list of the members of the United States Congress from 1789 to present, congressional committees from 1973 to present, current committee memberships, and information about all the presidents and vice-presidents of the United States. The data is available in YAML, JSON, and CSV formats.

CSPAN (Twitter List)

CSPAN maintains Twitter lists for the 116th US Representatives and US Senators. The Representatives list has 482 Twitter accounts while the Senators list has 114 Twitter accounts. 

Combining Lists  

We used the Wikipedia page on the 116th Congress as our gold-standard data for the current members of Congress. The data from Wikipedia was collected on 2019-03-01. Correspondingly, the data from CSPAN, @unitedstates (GitHub Repository), and Tweet Congress was also collected on 2019-03-01. We then manually compiled a CSV file with the members of Congress and the presence of their Twitter handles in each of the sources. We compiled the list manually largely because of discrepancies in the names of the members of Congress across the sources under consideration.
  • Some of the members of Congress use diacritic characters. For example, Wikipedia and Tweet Congress have the name of a representative from New York as Nydia_Velázquez, while Twitter and the @unitedstates repository have her name as Nydia Velazquez.
Screenshot from Wikipedia showing Nydia Velazquez, representative from New York using diacritic characters

Screenshot from Twitter for Rep. Nydia Velazquez from New York without diacritic characters
  • Some of the members of Congress have abbreviated middle names or suffixes in their names. For example, Wikipedia has the name of a representative from Tennessee as Mark E. Green while Tweet Congress has his name as Mark Green.
Screenshot from Wikipedia for Rep. Mark Green from Tennessee with his middle name

Screenshot from Twitter for Rep. Mark Green from Tennessee without his middle name
Screenshot from Tweet Congress for Rep. Mark Green from Tennessee without his middle name
Screenshot from Wikipedia for Rep. Chuck Fleischmann from Tennessee using his nick name
Screenshot from Twitter for Rep. Chuck Fleischmann from Tennessee using his nick name
Screenshot from Tweet Congress for Rep. Chuck Fleischmann from Tennessee using his given name
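The name discrepancies described above could be reduced before matching by normalizing each name into a comparison key. The following is a minimal sketch, not the procedure used for this post (the list here was compiled manually); the function name normalize_name and the exact rules are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return a lowercase, accent-free key for comparing names across sources.

    Hypothetical helper for illustration; the rules below are assumptions.
    """
    # Treat underscores (as in Wikipedia page titles) as spaces.
    name = name.replace("_", " ")
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop single-letter middle initials such as "E." in "Mark E. Green".
    stripped = re.sub(r"\b[A-Za-z]\.\s*", "", stripped)
    # Collapse whitespace and lowercase.
    return " ".join(stripped.split()).lower()


print(normalize_name("Nydia_Velázquez"))  # nydia velazquez
print(normalize_name("Nydia Velazquez"))  # nydia velazquez
print(normalize_name("Mark E. Green"))    # mark green
```

Nicknames are not covered by this kind of normalization. A nickname versus a given name, as in the Chuck Fleischmann screenshots, would still need a hand-maintained alias table or a manual check.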

What did we learn from our analysis?

As of 2019-03-01, 538 of the 541 seats in the US Congress were filled, with three representative positions vacant. The three vacant positions are the third and ninth Congressional Districts of North Carolina and the twelfth Congressional District of Pennsylvania. Of the 538 members of Congress, 537 have Twitter accounts, while the non-voting member from Guam, Michael San Nicolas, has no Twitter account.

| Name | Position | Joined Congress | CSPAN | @unitedstates | TweetCongress | Remark |
|---|---|---|---|---|---|---|
| Collin Peterson | Rep. | 1991-01-03 | F | F | F | @collinpeterson |
| Greg Gianforte | Rep. | 2017-06-21 | F | F | F | @GregForMontana |
| Gregorio Sablan | Del. | 2019-01-03 | F | T | T | |
| Rick Scott | Sen. | 2019-01-08 | T | !T | F | |
| Tim Kaine | Sen. | 2013-01-03 | T | !T | F | |
| James Comer | Rep. | 2016-11-08 | T | !T | F | |
| Justin Amash | Rep. | 2011-01-03 | T | !T | F | |
| Lucy Clay | Rep. | 2001-01-03 | T | !T | F | |
| Bill Cassidy | Sen. | 2015-01-03 | T | !T | T | |

Members of the 116th Congress whose Twitter handles are missing from one or more of the sources. T represents both name and Twitter handle present, !T represents name present but Twitter handle missing, and F represents both the name and Twitter handle missing.
  • CSPAN has Twitter handles for 534 of the 537 members of Congress, with two representatives and a non-voting member missing from its list. The absentees from the list are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), and Delegate Gregorio Sablan (@Kilili_Sablan).
  • The GitHub repository @unitedstates has Twitter handles for 529 of the 537 members of Congress, with five representatives and three senators missing from its data set. The absentees from the repository are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), Sen. Rick Scott (@SenRickScott), Sen. Tim Kaine (@timkaine), Rep. James Comer (@KYComer), Rep. Justin Amash (@justinamash), Rep. Lucy Clay (@LucyClayMO1), and Sen. Bill Cassidy (@SenBillCassidy).
  • Tweet Congress has Twitter handles for 530 of the 537 members of Congress, with five representatives and two senators missing. The absentees are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), Sen. Rick Scott (@SenRickScott), Sen. Tim Kaine (@timkaine), Rep. James Comer (@KYComer), Rep. Justin Amash (@justinamash), and Rep. Lucy Clay (@LucyClayMO1).
The combined list of Twitter handles for the members of Congress from all the sources has two representatives missing, namely Collin Peterson, who has been a representative from Minnesota since 1991-01-03, and Greg Gianforte, who has been a representative from Montana since 2017-06-21. The combined list also has six members of Congress whose Twitter handles differ between sources.
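The coverage counts reported above follow from simple set arithmetic over the T / !T / F flags in the table. The sketch below recomputes them. The dictionary layout and the helper name missing_by_source are illustrative assumptions, and only the members listed in the table are encoded; every other member is assumed to be present in all three sources.

```python
MEMBERS_ON_TWITTER = 537
SOURCES = ("CSPAN", "@unitedstates", "TweetCongress")

# Flags copied from the table: "T" = name and handle present,
# "!T" = name present but handle missing, "F" = both missing.
STATUS = {
    "Collin Peterson": ("F", "F", "F"),
    "Greg Gianforte": ("F", "F", "F"),
    "Gregorio Sablan": ("F", "T", "T"),
    "Rick Scott": ("T", "!T", "F"),
    "Tim Kaine": ("T", "!T", "F"),
    "James Comer": ("T", "!T", "F"),
    "Justin Amash": ("T", "!T", "F"),
    "Lucy Clay": ("T", "!T", "F"),
    "Bill Cassidy": ("T", "!T", "T"),
}


def missing_by_source(status):
    """Map each source to the set of members whose handle it lacks."""
    missing = {source: set() for source in SOURCES}
    for name, flags in status.items():
        for source, flag in zip(SOURCES, flags):
            if flag != "T":  # both "!T" and "F" mean no usable handle
                missing[source].add(name)
    return missing


missing = missing_by_source(STATUS)
covered = {source: MEMBERS_ON_TWITTER - len(names) for source, names in missing.items()}
missing_everywhere = set.intersection(*missing.values())

print(covered)  # {'CSPAN': 534, '@unitedstates': 529, 'TweetCongress': 530}
print(sorted(missing_everywhere))  # ['Collin Peterson', 'Greg Gianforte']
```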

| Name | Position | Joined Congress | CSPAN | @unitedstates + TweetCongress |
|---|---|---|---|---|
| Chris Murphy | Sen. | 2013-01-03 | @ChrisMurphyCT | @senmurphyoffice |
| Marco Rubio | Sen. | 2011-01-03 | @marcorubio | @SenRubioPress |
| James Inhofe | Sen. | 1994-11-16 | @JimInhofe | @InhofePress |
| Julia Brownley | Rep. | 2013-01-03 | @RepBrownley | @JuliaBrownley26 |
| Seth Moulton | Rep. | 2015-01-03 | @Sethmoulton | @teammoulton |
| Earl Blumenauer | Rep. | 1996-05-21 | @repblumenauer | @BlumenauerMedia |

Members of the 116th Congress who have different Twitter handles in different sources
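When comparing handles across sources programmatically, it helps to remember that Twitter handles are case-insensitive, so differences in capitalization alone should not be counted as a disagreement. Here is a minimal sketch. The helper name handles_differ is an assumption, and the last record uses a made-up casing variation purely to show a non-conflict.

```python
def handles_differ(a: str, b: str) -> bool:
    """Compare two handles, ignoring case and a leading '@'."""
    return a.lstrip("@").lower() != b.lstrip("@").lower()


# (name, handle in CSPAN, handle in @unitedstates + TweetCongress)
records = [
    ("Chris Murphy", "@ChrisMurphyCT", "@senmurphyoffice"),
    ("Seth Moulton", "@Sethmoulton", "@teammoulton"),
    # Hypothetical casing variation, not taken from any of the sources.
    ("Amy Klobuchar", "@amyklobuchar", "@AmyKlobuchar"),
]

conflicts = [name for name, cspan, other in records if handles_differ(cspan, other)]
print(conflicts)  # ['Chris Murphy', 'Seth Moulton']
```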

Possible reasons for disagreement in creating a Members of Congress Twitter handles data set

Scenarios involved in creating Twitter handles for members of Congress when done over a period of time

One Seat - One Member - One Twitter Handle: When creating our data set of Twitter handles for members of Congress over a period of time, the perfect situation is one seat in Congress held by one member for the entire tenure of the Congress, and that member has exactly one Twitter account. For example, Amy Klobuchar, senator from Minnesota, has only one Twitter account, @amyklobuchar.

Google search screenshot for Sen. Amy Klobuchar's Twitter account
Twitter screenshot for Sen. Amy Klobuchar's Twitter account

One Seat - One Member - No Twitter Handle: When creating our data set of Twitter handles for members of Congress over a period of time, we have one seat in Congress held by one member for the entire tenure of the Congress, and that member does not have a Twitter account. For example, Michael San Nicolas, delegate from Guam, has no Twitter account.

Screenshot from Congressman Michael San Nicolas page showing a Twitter link for HouseDems Twitter account while the rest of the social media icons are linked to his personal accounts

One Seat - One Member - Multiple Twitter Handles: When creating our data set of Twitter handles for members of Congress over a period of time, we have one seat in Congress held by one member for the entire tenure of the Congress, and that member has more than one Twitter account. Based on their purpose, these Twitter accounts can be classified as Personal, Official, and Campaign accounts.

  • Personal Account: A Twitter account used by a member of Congress to tweet personal thoughts can be referred to as a personal account. A majority of these accounts might have been created before the member was elected to Congress. For example, Marco Rubio, a Senator from Florida, created his Twitter account @marcorubio in August 2008, while he was sworn in to Congress on 2011-01-03.
Screenshot for the Personal Twitter account of Sen. Marco Rubio from Florida. The account was created in August 2008, while he was sworn in to Congress on 2011-01-03.
  • Official Account: A Twitter account used by a member of Congress or their staff to tweet official information for the general public about the member's activity is referred to as an official account. The creation dates of a majority of these accounts will be close to the date on which the member of Congress was elected. For example, Marco Rubio, a Senator from Florida, has a Twitter account @senrubiopress which was created in December 2010, while he was sworn in to Congress on 2011-01-03.
Screenshot for the Official Twitter account of Sen. Marco Rubio from Florida. The account was created in December 2010, while he was sworn in to Congress on 2011-01-03.
  • Campaign Account: A Twitter account used by a member of Congress for election campaigning is referred to as a campaign account. For example, Rep. Greg Gianforte from Montana has a Twitter account @gregformontana which contains tweets related to his re-election campaigns.
Twitter Screenshot for the Campaign account of Rep. Greg Gianforte from Montana which contains tweets related to his re-election campaigns.
Twitter Screenshot for the Personal account of Rep. Greg Gianforte from Montana which has personal tweets from him. 

One Seat - Multiple Members - Multiple Twitter Handles: When creating our data set of Twitter handles for members of Congress over a period of time, we can have a seat in Congress which is held by different members, each with a different Twitter account, at different points during the tenure of the Congress. An example from the 115th Congress is the Alabama Senate seat between January 2017 and July 2018. On February 9, 2017, Jeff Sessions resigned as senator and was succeeded by the Alabama Governor's appointee, Luther Strange. Following the special election, Luther Strange left office on January 3, 2018, to make way for Doug Jones as the Senator from Alabama. Now, who do we include as the Senator from Alabama for the 115th Congress? Even if we decide to include all of them based on the dates they joined or left office, when the analysis spans a year, who will provide us with all the historical information for the Congress currently in session? As of now, all the sources we analyzed try to provide the most recent information rather than historical information about the current Congress and its members over the entire tenure.
Alabama Senate seat situation between January 2017 and July 2018. It highlights the issue in context of Social Feed Manager's 115th Congress tweet dataset.  
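One way to keep the historical information that the sources drop is to store each seat as a list of dated tenures and to look up the holder for a given day. The following is a minimal sketch under stated assumptions: the Tenure class and holder_on function are hypothetical names, the hand-over dates are the ones given in this post, and 2017-01-03 is used as the start of the 115th Congress. A Twitter handle field could be added to each tenure in the same way.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Tenure:
    member: str
    start: date           # first day in the seat (inclusive)
    end: Optional[date]   # day the seat was handed over (exclusive); None = still in office


# Alabama Senate seat during the 115th Congress, using the dates from this post.
ALABAMA_SEAT = [
    Tenure("Jeff Sessions", date(2017, 1, 3), date(2017, 2, 9)),
    Tenure("Luther Strange", date(2017, 2, 9), date(2018, 1, 3)),
    Tenure("Doug Jones", date(2018, 1, 3), None),
]


def holder_on(seat: List[Tenure], day: date) -> Optional[str]:
    """Return the member holding the seat on the given day, if any."""
    for tenure in seat:
        if tenure.start <= day and (tenure.end is None or day < tenure.end):
            return tenure.member
    return None


print(holder_on(ALABAMA_SEAT, date(2017, 6, 1)))  # Luther Strange
print(holder_on(ALABAMA_SEAT, date(2018, 7, 1)))  # Doug Jones
```

With this layout, tweets collected on a given date can be attributed to whoever held the seat on that date, instead of only to the most recent holder.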
Another issue worth mentioning is members of Congress changing their Twitter handles. An example of this scenario is Rep. Alexandria Ocasio-Cortez from New York, who tweeted on 2018-12-28 about changing her Twitter handle from @ocasio2018 to @aoc. For members of Congress with popular Twitter accounts, it is easy to discover a change of handle, but for a member of Congress who is not popular on Twitter, the change might go unnoticed for quite some time.

Screenshot of memento for @Ocasio2018
Screenshot of memento which shows the announcement for change of Twitter handle from @Ocasio2018 to @aoc 
Screenshot of @aoc

Twitter handle data set for the 116th Congress

  • We have created a data set of Twitter handles for the 116th Congress which resolves the issues of CSPAN, Tweet Congress, and @unitedstates (GitHub repository).
  • We have Twitter handles for all 537 current members of Congress who are on Twitter. The only member not covered is the delegate from Guam, who does not have a Twitter account.
  • Unlike other sources, our data set does not include anyone who is not a member of the 116th Congress.
  • In case of conflicting Twitter handles for a member of Congress across the sources under investigation, we chose accounts which were personally managed by the member of Congress (Personal Twitter Account) over accounts which were managed by their teams or used for campaign purposes (Official or Campaign Accounts). We chose personal accounts over official or campaign accounts because some members of Congress explicitly state in the Twitter biography of their personal accounts that all the tweets are their own, which is not stated in the biography of their official or campaign accounts.
Twitter Screenshot of the Personal account for Rep. Seth Moulton where he states that all the tweets are his own in his Twitter bio.

| Name | Position | WSDL Data set | CSPAN | @unitedstates + TweetCongress |
|---|---|---|---|---|
| Chris Murphy | Sen. | @ChrisMurphyCT | @ChrisMurphyCT | @senmurphyoffice |
| Marco Rubio | Sen. | @marcorubio | @marcorubio | @SenRubioPress |
| James Inhofe | Sen. | @JimInhofe | @JimInhofe | @InhofePress |
| Julia Brownley | Rep. | @RepBrownley | @RepBrownley | @JuliaBrownley26 |
| Seth Moulton | Rep. | @Sethmoulton | @Sethmoulton | @teammoulton |
| Earl Blumenauer | Rep. | @repblumenauer | @repblumenauer | @BlumenauerMedia |

Members of the 116th Congress who have different Twitter handles in different sources. The WSDL data set has personal Twitter handles over official Twitter handles
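The rule of preferring personal accounts can be expressed as a small priority lookup once each candidate handle has been labeled with an account type. This is a sketch only. The labels were assigned by hand for this post, the name choose_handle is hypothetical, the relative order of official and campaign accounts is an assumption (the post only states that personal accounts win), and the "official" label on @teammoulton is assumed for illustration.

```python
# Lower number = higher priority. Only "personal first" comes from the post;
# the order of the remaining two types is an assumption.
PRIORITY = {"personal": 0, "official": 1, "campaign": 2}


def choose_handle(candidates):
    """Pick one handle from a list of (handle, account_type) pairs."""
    return min(candidates, key=lambda pair: PRIORITY[pair[1]])[0]


rubio = [("@SenRubioPress", "official"), ("@marcorubio", "personal")]
# The account type of @teammoulton is an assumed label.
moulton = [("@Sethmoulton", "personal"), ("@teammoulton", "official")]

print(choose_handle(rubio))    # @marcorubio
print(choose_handle(moulton))  # @Sethmoulton
```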


None of the three sources (Tweet Congress, the @unitedstates GitHub repository, and CSPAN) has full coverage of the Twitter handles for the members of the 116th Congress. One member of Congress does not have a Twitter account, and two more do not have their Twitter handles present in any of the sources. No source provides historical information about the members of Congress over the entire tenure of the Congress, as all the sources focus on recency. Creating a data set of Twitter handles for members of Congress seems an easy task at first glance, but it is a lot more difficult because the sources disagree for multiple reasons when the study spans a period of time. We share a data set of Twitter handles for the 116th Congress, built by combining all the lists.


Mohammed Nauman Siddique