Wednesday, March 14, 2018

2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Due to Twitter's API, we have limited ability to collect historical data for a user's followers. The information for when one account starts following another is unavailable. Tracking the popularity of an account and how it grows cannot be done without that information. Another pitfall is when an account is deleted, Twitter does not provide data about the account after the deletion date. It is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to be archived, then a follower count for a specific date can be collected. 

The previous method to determine followers over time is to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation date of the follower is the lower bound for when they could have started following the account under observation. Its correctness is dependent on new accounts immediately following the account under observation to get an accurate lower bound. The order Twitter returns followers is subject to unannounced change, so it can't be depended on to work long term. That will not show when an account starts losing followers, because it only returns users still following the account. This tool will help accurately gather and plot the follower count based on mementos, or archived web pages, collected from the Internet Archive to show growth rates, track deleted accounts, and help pinpoint when an account might have bought bots to increase follower numbers.

I improved on a Python script, created by Orkun Krand, that collects the followers for a specific Twitter username from the mementos found in the Internet Archive. The code can be found on Github. Through the historical pages kept in the Internet Archive, the number of followers can be observed for a specific date of the collected memento. This script collects the follower count by identifying various CSS Selectors associated with the follower count for most of the major layouts Twitter has implemented. If a Twitter page isn't popular enough to warrant being archived, or too new, then no data can be collected on that user.

This code is especially useful for investigating users that have been deleted from Twitter. The Russian troll @Ten_GOP, impersonating the Tennessee GOP was deleted once discovered. However, with the Internet Archive we can still study its growth rate while it was active and being archived. 
In February 2018, there was an outcry as conservatives lost, mostly temporarily, thousands of followers due to Twitter suspending suspected bot accounts. This script enables investigating users who have lost followers, and for how long they lost them. It is important to note that the default flag to collect one memento a month is not expected to have the granularity to view behaviors that typically happen on a small time frame. To correct that, the flag [-e] to collect all mementos for an account should be used. The republican political commentator @mitchellvii lost followers in two recorded incidences. In January 2017 from the 1st to the 4th, @mitchellvii lost 1270 followers. In April 2017 from the 15th to the 17th, @mitchellvii lost 1602 followers. Using only the Twitter API to collect follower growth would not show this phenomenon.



Dependencies:

  • Python 3
  • R* (to create graph)
  • bs4
  • urllib
  • archivenow* (push to archive)
  • datetime* (push to archive)
*optional

How to run the script:

$ git clone https://github.com/oduwsdl/FollowerCountHistory.git
$ cd FollowerCountHistory
$ ./FollowerHist.py [-h] [-g] [-e] [-p | -P] <twitter-username-without-@> 

Output: 

The program will create a folder named <twitter-username-without-@>. This folder will contain two .csv files. One, labeled <twitter-username-without-@>.csv, will contain the dates collected, the number of followers for that date, and the URL for that memento. The other, labeled <twitter-username-without-@>-Error.csv, will contain all the dates of mementos where the follower count was not collected and will list the reason why. All file and folder names are named after the Twitter username provided, after being cleaned to ensure system safety.

If the flag [-g] is used, then the script will create an image <twitter-username-without-@>-line.png of the data plotted on a line chart created by the follower_count_linechart.R script. An example of that graph is shown as the heading image for the user @USAGym, the official USA Olympic gymnastics team. The popularity of the page changes with the cycle of the Summer Olympics, evidenced by most of the follower growth occurring in 2012 and 2016.

Example Output:

./FollowerHist.py -g -p USAGym
USAGym
http://web.archive.org/web/timemap/link/http://twitter.com/USAGym
242 archive points found
20120509183245
24185
20120612190007
...
20171221040304
250242
20180111020613
250741
Not Pushing to Archive. Last Memento Within Current Month.
null device 
          1 

cd usagym/; ls
usagym.csv  usagym-Error.csv  usagym-line.png

How it works:

$ ./FollowerHist.py --help

usage: FollowerHist.py [-h] [-g] [-p | -P] [-e] uname

Follower Count History. Given a Twitter username, collect follower counts from
the Internet Archive.

positional arguments:

uname       Twitter username without @

optional arguments:

-h, --help  show this help message and exit
-g          Generate a graph with data points
-p          Push to Internet Archive
-P          Push to all archives available through ArchiveNow
-e          Collect every memento, not just one per month

First, the timemap, the list of all mementos for that URI, is collected for http://twitter.com/username. Then, the script collects the dates from the timemap for each memento. Finally, it dereferences each memento and extracts the follower count if all the following apply:
    1. A previously created .csv of the name the script would generate does not contain the date.
    2. The memento is not in the same month as a previously collected memento, unless [-e] is used.
    3. The page format can be interpreted to find the follower count.
    4. The follower count number can be converted to an Arabic numeral.
A .csv is created, or appended to, to contain the date, number of followers, and memento URI for each collected data point.
A error .csv is created, or appended, with the date, number of followers, and memento URI for each data point that was not collected. This will contain repeats if run repeatedly because it will not delete the old entries while writing the new errors in.

If the [-g] flag is used, a .png of the line chart will be created "<twitter-username-without-@>-line.png".
If the [-p] flag is used, the URI will be pushed to the Internet Archive to create a new memento if there is no current memento.
If the [-P] flag is used, the URI will be pushed to all archives available through archivenow to create new mementos if there is no current memento in Internet Archive.
If the [-e] flag is used, every memento will be collected instead of collecting just one per month.

As a note for future use, if the Twitter layout undergoes another change, the code will need to be updated to continue successfully collecting data.

Special thanks to Orkun Krand, whose work I am continuing.
--Miranda Smith (@mir_smi)


No comments:

Post a Comment