Monday, July 18, 2016

2016-07-18: Tweet Visibility Dynamics in a Tweet Conversation Graph

We conducted another study in the same spirit as the first, as part of our research (funded by IMLS) to build collections for stories or events. This time we sought to understand how to extract not just a single tweet, but the conversation to which the tweet belongs. We explored how the visibility of tweets in a conversation graph changes based on the tweet selected.

A need for archiving tweet conversations
Archiving tweets usually involves collecting tweets associated with a given hashtag. Even though this provides a "clean" way of collecting tweets about the event associated with the hashtag, something important is often missed: conversations. Not all tweets about a particular topic will have the given hashtag, including portions of a threaded conversation, even if the initial tweet contained the hashtag. This is unfortunate because conversations may provide contextual information about tweets.
Consider the following tweet by @THEHermanCain, which contains #TrumpSpeechinFourWords:
Using #TrumpSpeechinFourWords, @THEHermanCain's tweet is collected. However, tweets which replied to his tweet but did not include the hashtag will be excluded from the tweet collection, so the conversation is lost:

Tweets in the conversation without #TrumpSpeechinFourWords will be excluded from tweet collection
I consider conversations an important aspect of the collective narrative. Before we can archive conversations, we need to understand their nature and structure.
Fig 1: A Hypothetical Tweet Conversation Graph consisting of 8 tweets. An arrowhead points in the direction of a reply. For example, t8 replied to t5.
It all began when we started collecting tweets about the Ebola virus. After collecting the tweets, Dr. Nelson expressed an interest in seeing not just the collected tweets, but the collected tweets in the context of the tweet conversations they belong to. For example, if through our tweet collection process we collected tweet t8 (Fig. 1), we were interested at the very least in discovering t5 (which t8 replied to), t2 (which t5 replied to), and t1 (which t2 replied to). A more ambitious goal was to discover the entire graph which contained t8 (Fig. 1: t1 - t8). In order to achieve this, I began by attempting to understand the nature of the tweet graph from two perspectives: the browser's view of the tweets and the Twitter API's view.

Fig 2: Root, Parent and Child tweets.
  1. Root tweet: a tweet which is not a reply to another tweet, but which may be replied to by other tweets. For example, t1 (Fig. 1).
  2. Parent tweet: a tweet with replies, called children. A parent tweet can also be a child of the tweet it replied to. For example, t2 (Fig. 1) is a parent to t4 - t6, but a child of t1.
  3. Child tweet: a tweet which is a reply to another tweet. The tweet it replied to is called its parent. For example, t8 (Fig. 1) is the child of t5.
  4. Ancestor tweets: all the tweets which precede a tweet in its chain of replies, that is, its parent, its parent's parent, and so on up to the root. For example, the ancestor tweets of t8 are t1, t2 and t5.
  5. Descendant tweets: all the tweets which follow a parent tweet, that is, its children, its children's children, and so on. For example, the descendants of t2 are t4, t5, t6 and t8.
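The definitions above can be made concrete with a small sketch. It encodes the hypothetical graph of Fig. 1 as a child-to-parent map. The text only states the parents of t2, t4, t5, t6 and t8; the parents given to t3 and t7 below are assumptions made for illustration, as are the function names.

```python
# Child -> parent map for the hypothetical graph in Fig. 1.
# None marks the root tweet.
PARENT = {
    "t1": None,
    "t2": "t1",
    "t3": "t1",  # assumption: parent not stated in the text
    "t4": "t2",
    "t5": "t2",
    "t6": "t2",
    "t7": "t3",  # assumption: parent not stated in the text
    "t8": "t5",
}


def children(tweet):
    """Tweets that replied directly to `tweet`."""
    return sorted(t for t, p in PARENT.items() if p == tweet)


def ancestors(tweet):
    """Parent, parent's parent, ... up to the root, nearest first."""
    chain = []
    parent = PARENT[tweet]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def descendants(tweet):
    """All tweets reachable by following replies downward from `tweet`."""
    found = []
    for child in children(tweet):
        found.append(child)
        found.extend(descendants(child))
    return sorted(found)


print(ancestors("t8"))    # ['t5', 't2', 't1']
print(descendants("t2"))  # ['t4', 't5', 't6', 't8']
```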
Tweet visibility dynamics in a tweet conversation graph - Twitter API's perspective:
The API provides an entry called in_reply_to_status_id in a tweet's JSON. With this entry, every tweet in the chain of replies before a tweet can be retrieved. However, it does not let you get the tweets which are replies to the current tweet. For example, if we selected tweet t1 (a root tweet), since t1 did not reply to another tweet (it has no parent), we will not be able to retrieve any other tweet through the API, because we can only retrieve tweets in one direction (Fig. 3 left). If we selected tweet t2, the in_reply_to_status_id of t2 points to t1, so we can retrieve t1 (Fig. 3 right).

Fig 3: Through the API, from t1, no tweets can be retrieved; from t2, we can retrieve its parent tweet, t1
t5's in_reply_to_status_id points to t2, so we retrieve t2 and then t1 (Fig. 4 left). From t8, we retrieve t5, which retrieves t2, which retrieves t1 (Fig. 4 right). So with the last tweet in a conversation reply chain, we can get all of the tweet's parents (the parent and the parent's ancestors).

Fig 4: Through the API, from t8 we can retrieve t5, and from t5 we can retrieve t2, and from t2 we can retrieve t1

To summarize the API's view of tweets in a conversation, given a selected tweet, we can see the parent tweets (plus parent ancestors - above), but NOT children tweets (plus children descendants - below), and NOT sibling tweets (sideways).
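As a minimal sketch of this one-directional view, the function below follows in_reply_to_status_id upward until it reaches a root. The `fetch_tweet` callable is a hypothetical stand-in for whatever client performs the lookup by ID; here it is backed by an in-memory dictionary rather than the real API, and the IDs are made up.

```python
def reply_chain(tweet_id, fetch_tweet):
    """Return the ancestors of `tweet_id`, nearest parent first.

    `fetch_tweet` is a hypothetical callable (an assumption, not a Twitter
    client function) that maps a tweet ID to a dict shaped like tweet JSON.
    """
    chain = []
    seen = {tweet_id}  # guards against looping on malformed data
    parent_id = fetch_tweet(tweet_id).get("in_reply_to_status_id")
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        chain.append(parent_id)
        parent_id = fetch_tweet(parent_id).get("in_reply_to_status_id")
    return chain


# Stand-in data: made-up IDs mirroring t1, t2, t5 and t8 from Fig. 1.
FAKE_TWEETS = {
    1: {"id": 1, "in_reply_to_status_id": None},
    2: {"id": 2, "in_reply_to_status_id": 1},
    5: {"id": 5, "in_reply_to_status_id": 2},
    8: {"id": 8, "in_reply_to_status_id": 5},
}

print(reply_chain(8, FAKE_TWEETS.__getitem__))  # [5, 2, 1]
print(reply_chain(1, FAKE_TWEETS.__getitem__))  # []
```

A real lookup can fail partway up the chain (for example, if a parent tweet is no longer available), which this sketch does not handle.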
Tweet visibility dynamics in a tweet conversation graph - browser's perspective:
By browsing Twitter, we observed that given a selected tweet in a conversation chain, we can see the tweets it replied to (parent and parent's ancestors), as well as the tweet's replies (children and children's descendants). For example, given t8, we will be able to retrieve t5, t2, and t1 just like the API (Fig. 5).
Fig 5: From t8 we can access t5, t2 and t1

However, unlike the API, if we have t1, we are able to retrieve t2 - t8, since t1 is the root tweet (Fig. 6).

Fig 6: From t1 we can access t2 - t8
To summarize the browser's view of tweets in a conversation, given a selected tweet, we can see the parent tweets (plus parent ancestors - above) and children tweets (plus children descendants - below), but NOT sibling tweets (sideways).
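The two views can be compared side by side in a short sketch. It models the graph of Fig. 1 as a child-to-parent map; the parents of t3 and t7 are not stated in the text and are assumed here, and the function names are introduced only for illustration.

```python
# Child -> parent map for Fig. 1; None marks the root.
PARENT = {
    "t1": None, "t2": "t1", "t4": "t2", "t5": "t2", "t6": "t2", "t8": "t5",
    "t3": "t1", "t7": "t3",  # assumptions: these two parents are not stated
}


def above(tweet):
    """Parent and the parent's ancestors."""
    seen = set()
    parent = PARENT[tweet]
    while parent is not None:
        seen.add(parent)
        parent = PARENT[parent]
    return seen


def below(tweet):
    """Children and the children's descendants."""
    seen = set()
    for child, parent in PARENT.items():
        if parent == tweet:
            seen |= {child} | below(child)
    return seen


def api_view(tweet):
    """Tweets reachable through in_reply_to_status_id: upward only."""
    return above(tweet)


def browser_view(tweet):
    """Tweets visible when browsing: upward and downward, never sideways."""
    return above(tweet) | below(tweet)


print(sorted(api_view("t5")))      # ['t1', 't2']
print(sorted(browser_view("t5")))  # ['t1', 't2', 't8']
print(sorted(api_view("t1")))      # []
```

In both views the siblings of t5 (t4 and t6) stay out of reach when t5 is the selected tweet.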
Our findings are summarized in the following slides:

Methods for extracting tweet conversations
1. Scraping: Twitter discourages scraping, as outlined in its Terms of Service: "...NOTE: crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited...". Therefore, the description provided here for extracting a tweet conversation based on scraping is purely academic. Based on the visibility dynamics of a tweet from the browser's perspective, the best starting position for collecting a tweet conversation is the root. Consequently, find the root, then access the children from the root. However, if you are only interested in the conversation surrounding a single tweet, then given that tweet, its parent (plus parent ancestors) and children (plus children descendants) are available for extraction from the browser.
2. API Method 1: This method, which is based on the API's tweet visibility, can only get the parent (plus parent ancestors). Given a tweet, get the tweet's parent (by accessing its in_reply_to_status_id). When you get the parent, get the parent's parent (etc.) through the same method until you reach the root tweet.
3. API Method 2: This method was initially described to me by Sawood Alam and later independently implemented by Ed Summers. It uses the Twitter search API. Here is Ed Summers' description:
Twitter's API doesn't allow you to get replies to a particular tweet. Strange but true. But you can use Twitter's Search API to search for tweets that are directed at a particular user, and then search through the results to see if any are replies to a given tweet. You probably are also interested in the replies to any replies as well, so the process is recursive. The big caveat here is that the search API only returns results for the last 7 days. So you'll want to run this sooner rather than later.
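The recursion Ed Summers describes can be sketched as follows. The `search_mentions` callable is hypothetical: it stands in for a Search API query for tweets directed at a user and is assumed to return a list of tweet dicts. The field names id, in_reply_to_status_id and user.screen_name follow the tweet JSON; the helper `make_tweet` and the sample conversation are made up.

```python
def find_replies(tweet, search_mentions):
    """Recursively collect replies to `tweet`.

    `search_mentions(screen_name)` is a hypothetical callable (an assumption)
    returning recent tweets directed at that user, as tweet-JSON-like dicts.
    """
    replies = []
    for candidate in search_mentions(tweet["user"]["screen_name"]):
        if candidate.get("in_reply_to_status_id") == tweet["id"]:
            replies.append(candidate)
            # Replies to the reply are found the same way.
            replies.extend(find_replies(candidate, search_mentions))
    return replies


def make_tweet(tweet_id, screen_name, parent_id=None):
    """Build a minimal tweet-like dict for the example below."""
    return {
        "id": tweet_id,
        "user": {"screen_name": screen_name},
        "in_reply_to_status_id": parent_id,
    }


# Made-up conversation: bob and carol reply to alice, alice replies to bob.
root = make_tweet(1, "alice")
MENTIONS = {
    "alice": [
        make_tweet(2, "bob", 1),
        make_tweet(3, "carol", 1),
        make_tweet(9, "dave", 7),  # directed at alice, but a different tweet
    ],
    "bob": [make_tweet(4, "alice", 2)],
}

found = find_replies(root, lambda name: MENTIONS.get(name, []))
print([t["id"] for t in found])  # [2, 4, 3]
```

Each level of the recursion issues another search, so a real implementation would likely cache results per user and respect the search rate limit.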
Informal time analysis of extracting tweets
We also carried out a simple informal analysis (as opposed to an asymptotic analysis based on Big-O) to estimate how long (in seconds) it might take to extract tweets by using the Twitter API vs. the browser (by responsibly scraping Twitter). This analysis only counts the number of requests issued in order to access tweets.
Informal time analysis for extracting tweets with the API:
The statuses API access point (used to get tweets by ID) imposes a rate limit of 180 requests per 15 minutes (1 request every 5 seconds). Given a tweet t(i) at position i in a chain of tweets, each of the i-1 previous tweets costs one request, so the amount of time (seconds) to get the previous tweets in the conversation chain is:
5(i-1) seconds.
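As a worked example of this estimate, the sketch below derives the 5-second spacing from the stated rate limit and applies 5(i-1). The function name and the reading of i as a 1-based position in the chain (the root at i = 1) are assumptions.

```python
RATE_LIMIT_REQUESTS = 180            # requests allowed per window (from the text)
RATE_LIMIT_WINDOW_SECONDS = 15 * 60  # 15-minute window

SECONDS_PER_REQUEST = RATE_LIMIT_WINDOW_SECONDS / RATE_LIMIT_REQUESTS  # 5.0


def api_chain_seconds(i):
    """Estimated seconds to fetch the i-1 tweets preceding tweet t(i)."""
    if i < 1:
        raise ValueError("i is a 1-based position in the chain")
    return SECONDS_PER_REQUEST * (i - 1)


print(api_chain_seconds(1))  # 0.0  (a root tweet has no ancestors)
print(api_chain_seconds(4))  # 15.0 (three ancestors, one request each)
```

In Fig. 1, t8 sits at position 4 of its chain (t1, t2, t5, t8), so retrieving its three ancestors is estimated at 15 seconds.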
Informal time analysis for extracting tweets with the browser:
Consider a scraping implementation in which we retrieve tweets as follows:
  1. Load Twitter webpage for a tweet
  2. Sleep randomly based on value of δ in [1, δ], where δ > 1
  3. Scroll to load new tweet content until we reach maxScrollForSingleRequest, (maxScrollForSingleRequest > 0). Exit when no new content loads.
  4. Repeat 3.
Based on the implementation described above, given a tweet t(i) with a maximum sleep time represented by a random variable δ in [1, δ] seconds, and a constant maxScrollForSingleRequest, which represents the maximum number of scrolls we make per request, the estimated amount of time to get the conversation is at most:
E[δ] + (E[δ] × maxScrollForSingleRequest) seconds; where E[δ] = (1+δ)/2
since δ ~ U{1, δ}; that is, δ follows the discrete uniform distribution and E[δ] is its expected value.
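The bound above can be evaluated directly. In this sketch, `max_sleep` is a name introduced here for the upper limit of the sleep interval (written δ in the text), and the example values are arbitrary. Reading the formula as one sleep after the page load plus one sleep per scroll is an interpretation, not something the text states outright.

```python
def expected_sleep(max_sleep):
    """E[delta] for delta ~ U{1, ..., max_sleep}: (1 + max_sleep) / 2."""
    return (1 + max_sleep) / 2


def browser_seconds_upper_bound(max_sleep, max_scroll_for_single_request):
    """E[delta] + E[delta] * maxScrollForSingleRequest, in seconds."""
    e = expected_sleep(max_sleep)
    return e + e * max_scroll_for_single_request


# Check the closed form against a direct average over {1, ..., 5}.
values = range(1, 6)
assert expected_sleep(5) == sum(values) / len(values)

print(browser_seconds_upper_bound(5, 10))  # 33.0
```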

Our findings are of particular consequence to tweet archivists, who should understand the visibility dynamics of the tweet conversation graph.
