2020-08-05: Historical Twitter Follower Count Via Web Archives
Figure 1: Line charts of absolute Twitter follower growth for the Twitter accounts @berniesanders (top left), @joebiden (top right), @marwilliamson (bottom left), and @petebuttigieg (bottom right).
In our previous posts, Twitter Follower Growth for the 2020 Democratic Candidates and Revisiting Twitter Follower Growth for the 2020 Democratic Candidates, we used Twitter follower growth between 2019-01-01 and 2019-04-18 as a proxy for the popularity of the Democratic candidates. We found an 80% similarity between the list of top 10 candidates based on absolute increase in Twitter follower count and the final 10 candidates remaining in the 2020 Democratic presidential nomination race based on their campaign withdrawal date. To collect the historical Twitter follower counts of the Democratic candidates, we parsed the follower count values from the candidates' Twitter mementos. Miranda Smith, in her post Twitter Follower Count History via the Internet Archive, released a follower count parser that accepts a Twitter handle as input and returns the historical Twitter follower counts by parsing mementos from the Internet Archive. In this post, we release a new version of the follower count parsing code, which we used in our posts on the 2020 Democratic candidates.
Web Archives and Twitter
The Twitter API returns only the current follower count for a Twitter handle. To perform a longitudinal study of the follower count for a given Twitter handle using the Twitter API, we would have to actively record the follower count over the entire study period. Since the Twitter API does not provide historical data, we instead use the web archives, whose mementos of Twitter account pages can be scraped for useful Twitter information, a technique first introduced by Miranda Smith in her post "Twitter Follower Count History via the Internet Archive". Figure 2 shows a memento from the Internet Archive for the Twitter account @berniesanders in which the language of the template (followers, number of tweets, etc.) is Bulgarian. Sawood Alam has explained the role of cookies in the archiving of non-English mementos ("Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages" and "Cookies Are Why Your Archived Twitter Page Is Not in English"). To parse the follower count from the Bulgarian-language memento, we need to use translators.
Beginning on July 15, 2019, Twitter rolled out a new user interface and on July 1, 2020 stopped supporting the prior user interface. Figures 3 and 4 show that the migration of Twitter user interface from legacy theme to the new user interface had implications on the archiving of Twitter mementos (post by Kritika Garg and Himarsha Jayanetti). Alexander Nwala, in his post "Twitter broke my scrapers", highlights the issues involved with web scraping and impact of the new Twitter user interface.
Figure 2: Screenshot of a memento for @berniesanders on June 7, 2020 in Bulgarian with the Twitter legacy theme user interface from the Internet Archive
Figure 3: Screenshot of a memento for @berniesanders on June 20, 2020 from the Internet Archive with a "browser not supported" message (Figure 3 and Figure 4 both link to the same URL)
Figure 4: Screenshot of a memento for @berniesanders on June 20, 2020 from the Internet Archive with a "something went wrong" message (Figure 3 and Figure 4 both link to the same URL)
Twitter Follower Count via the Web Archives
Follower Count History is a Python module that collects Twitter follower count from the web archives using MemGator for a given Twitter handle. The module extracts follower count by identifying various CSS Selectors that match the follower count element on the historical Twitter mementos for almost every major overhaul their page layout has gone through. The program collects all of the memento data points by default.
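As a rough illustration of matching the follower count element, the sketch below uses only the Python standard library instead of bs4 and handles a single layout. The markup in SAMPLE (an anchor with data-nav="followers" wrapping a span that carries a data-count attribute) and the names FollowerCountParser and extract_follower_count are assumptions made for this example; they are not the module's actual selector list.

```python
from html.parser import HTMLParser


class FollowerCountParser(HTMLParser):
    """Find a follower count in one assumed legacy-theme-like layout."""

    def __init__(self):
        super().__init__()
        self._in_followers_link = False
        self.follower_count = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("data-nav") == "followers":
            # Entering the (assumed) followers navigation link.
            self._in_followers_link = True
        elif tag == "span" and self._in_followers_link and "data-count" in attrs:
            # Keep the first numeric attribute found inside that link.
            if self.follower_count is None:
                self.follower_count = int(attrs["data-count"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_followers_link = False


def extract_follower_count(html):
    """Return the follower count as an int, or None if the layout does not match."""
    parser = FollowerCountParser()
    parser.feed(html)
    return parser.follower_count


# Hypothetical markup written for this example, not copied from a memento.
SAMPLE = """
<li class="ProfileNav-item ProfileNav-item--followers">
  <a class="ProfileNav-stat" data-nav="followers" href="/JoeBiden/followers">
    <span class="ProfileNav-label">Followers</span>
    <span class="ProfileNav-value" data-count="4048208">4.05M</span>
  </a>
</li>
"""
print(extract_follower_count(SAMPLE))
```

Reading a numeric attribute, where a layout offers one, avoids depending on the localized label text. A real parser would need one such rule per page layout and would fall through to the next rule when one returns None.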
Changes from Version 1.0 to Version 2.0
- Version 2.0 parses mementos from multiple web archives (Internet Archive, Archive-It, Library of Congress, etc.) using the memento aggregator service, MemGator. Version 1.0 used only Internet Archive to parse mementos.
- Version 2.0 plots six line charts showing growth in Twitter follower count (absolute and relative) and growth in daily new followers (absolute and relative). Version 1.0 plotted a single line chart of historical follower count from the Python code.
- Version 2.0 supports JSON and CSV as output formats. Version 1.0 supported only CSV.
- Version 2.0 does not use ArchiveNow to push Twitter pages to multiple web archives. This feature existed in Version 1.0.
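To make the MemGator step in the first item above more concrete, here is a minimal sketch of building a TimeMap request URL and pulling the mementos out of a link-format response. The base URL of the public MemGator instance, the /timemap/link/ path, and the helper names timemap_url and parse_mementos are assumptions for illustration; the sketch makes no network request and is not the module's own code. The sample text reuses link entries from an Archive-It response.

```python
import re
from datetime import datetime

# One link entry: <URI> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def timemap_url(uri_r, base="https://memgator.cs.odu.edu"):
    """Build a link-format TimeMap URL (base and path are assumptions)."""
    return f"{base}/timemap/link/{uri_r}"


def parse_mementos(link_text):
    """Return sorted (datetime, URI-M) pairs for entries whose rel includes 'memento'."""
    mementos = []
    for uri, params in LINK_RE.findall(link_text):
        fields = dict(PARAM_RE.findall(params))
        if "memento" in fields.get("rel", "").split() and "datetime" in fields:
            when = datetime.strptime(fields["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when, uri))
    return sorted(mementos)


SAMPLE = (
    '<https://twitter.com/joebiden>; rel="original", '
    '<https://wayback.archive-it.org/all/timemap/link/https://twitter.com/joebiden>; '
    'rel="timemap"; type="application/link-format", '
    '<https://wayback.archive-it.org/all/20071030201109/https://twitter.com/joebiden>; '
    'rel="prev first memento"; datetime="Tue, 30 Oct 2007 20:11:09 GMT", '
    '<https://wayback.archive-it.org/all/20071030204619/https://twitter.com/joebiden>; '
    'rel="memento"; datetime="Tue, 30 Oct 2007 20:46:19 GMT"'
)

for when, uri_m in parse_mementos(SAMPLE):
    print(when.isoformat(), uri_m)
```

A regular expression is used rather than splitting on commas because the datetime values themselves contain commas.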
Dependencies
- Python 3
- bs4
- requests
- warcio
- R* (Optional: To create graphs)
$ git clone https://github.com/oduwsdl/FollowerCountHistory.git
$ cd FollowerCountHistory
$ pip install -r requirements.txt
$ ./fch.py [-h] [--st] [--et] [--freq] [-f] <Twitter handle/URL>
Command to create graphs from follower output CSV files
$ Rscript twitterFollowerCount.R <CSV file path>
We have published a Docker image at oduwsdl/fch with the tag "2.0", which can be used to run this tool as follows:
$ docker container run --rm -it -v <Output Directory>:/app -u $(id -u):$(id -g) oduwsdl/fch:2.0 [options] <Twitter Handle>
Options
Follower Count History (fch)
positional arguments:
thandle Enter a Twitter handle/ URL
optional arguments:
-h, --help show this help message and exit
--st Memento start datetime (YYYYMMDDHHMMSS)
--et Memento end datetime (YYYYMMDDHHMMSS)
--freq Sampling frequency of mementos(in seconds)
-f Output file path (Supported Extensions: JSON and CSV)
- --st option: Sets the start time for the analysis in the ISO 8601 fourteen-digit variation (YYYYMMDDHHMMSS). The default value is the Twitter creation date (20060321120000).
- --et option: Sets the end time for the analysis in the ISO 8601 fourteen-digit variation (YYYYMMDDHHMMSS). The default value is the current datetime.
- --freq option: Sets the sampling rate of the mementos and accepts a value in seconds. By default, all the mementos are collected.
- -f option: Writes the historical Twitter follower count to a CSV or JSON file. By default, the output is a CSV of historical Twitter follower count.
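The interplay of the start time, end time, and sampling frequency can be sketched as follows. This is one plausible reading, not the tool's confirmed algorithm: it assumes the time range is divided into windows of the given number of seconds counted from the start time, and that the first memento in each window is kept. The function name sample_mementos and the sample datetimes are hypothetical.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # fourteen-digit datetime, YYYYMMDDHHMMSS


def sample_mementos(datetimes, start, end, freq=0):
    """Keep datetimes within [start, end], at most one per freq-second window.

    freq=0 keeps every memento in range. The windowing rule is an assumption.
    """
    lo, hi = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    kept, last_window = [], None
    # Fourteen-digit strings sort chronologically when sorted as text.
    for stamp in sorted(datetimes):
        when = datetime.strptime(stamp, FMT)
        if not lo <= when <= hi:
            continue
        window = int((when - lo).total_seconds() // freq) if freq else None
        if freq and window == last_window:
            continue  # this window already has a memento
        kept.append(stamp)
        last_window = window
    return kept


stamps = [
    "20191215000000", "20200101001959", "20200115000000", "20200131120028",
    "20200215000000", "20200301001210", "20200405000000",
]
print(sample_mementos(stamps, "20200101000000", "20200331000000", freq=2592000))
```

With a frequency of 2592000 seconds (30 days), this keeps one memento from each 30-day window inside the requested range.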
How it works
- To collect the Twitter follower count of a Twitter handle from the account creation date to the current date (output is a CSV):
$ ./fch.py <Twitter Handle/ URL>
- To collect the Twitter follower count between a start datetime and an end datetime with a sampling frequency of 2592000 seconds (30 days):
$ ./fch.py --st=20200101000000 --et=20200331000000 --freq=2592000 <Twitter Handle>
MementoDatetime,URIM,FollowerCount,AbsGrowth,RelGrowth,AbsPerGrowth,RelPerGrowth,AbsFolRate,RelFolRate
20200101001959,https://web.archive.org/web/20200101001959/https://twitter.com/JoeBiden,4048208,0,0,0,0,0,0
20200131120028,https://web.archive.org/web/20200131120028/https://twitter.com/joebiden,4142510,94302,94302,2.33,2.33,0.0358,0.0358
20200301001210,https://web.archive.org/web/20200301001210/https://twitter.com/JoeBiden/,4202148,153940,59638,3.8,1.44,0.0297,0.02339
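The derived columns in the CSV sample above can be reproduced with a short calculation. The formulas below are inferred from the sample values only and are not taken from the tool's source: "Abs" columns are measured against the first memento, "Rel" columns against the previous memento, the percentage columns are rounded to two decimals, and the rate columns match followers gained per second of elapsed time rounded to five decimals. The function name growth_rows is hypothetical.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"


def growth_rows(samples):
    """samples: list of (14-digit datetime, follower count), oldest first.

    Assumes non-zero follower counts; formulas are inferred from sample output.
    """
    rows = []
    first_when, first_count = datetime.strptime(samples[0][0], FMT), samples[0][1]
    prev_when, prev_count = first_when, first_count
    for stamp, count in samples:
        when = datetime.strptime(stamp, FMT)
        abs_growth = count - first_count   # against the first memento
        rel_growth = count - prev_count    # against the previous memento
        abs_secs = (when - first_when).total_seconds()
        rel_secs = (when - prev_when).total_seconds()
        rows.append({
            "MementoDatetime": stamp,
            "FollowerCount": count,
            "AbsGrowth": abs_growth,
            "RelGrowth": rel_growth,
            "AbsPerGrowth": round(100 * abs_growth / first_count, 2),
            "RelPerGrowth": round(100 * rel_growth / prev_count, 2),
            "AbsFolRate": round(abs_growth / abs_secs, 5) if abs_secs else 0,
            "RelFolRate": round(rel_growth / rel_secs, 5) if rel_secs else 0,
        })
        prev_when, prev_count = when, count
    return rows


rows = growth_rows([
    ("20200101001959", 4048208),
    ("20200131120028", 4142510),
    ("20200301001210", 4202148),
])
for row in rows:
    print(row)
```

For the third memento, for example, the absolute growth is 4202148 - 4048208 = 153940 followers over 5183531 seconds, which gives the 0.0297 shown in the AbsFolRate column.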
$ ./fch.py -f=output.json --st=20200101000000 --et=20200331000000 --freq=2592000 <Twitter Handle>
Sample JSON Output
[{
"MementoDatetime": "20200101001959",
"URIM": "https://web.archive.org/web/20200101001959/https://twitter.com/JoeBiden",
"FollowerCount": 4048208
}, {
"MementoDatetime": "20200131120028",
"URIM": "https://web.archive.org/web/20200131120028/https://twitter.com/joebiden",
"FollowerCount": 4142510
}, {
"MementoDatetime": "20200301001210",
"URIM": "https://web.archive.org/web/20200301001210/https://twitter.com/JoeBiden/",
"FollowerCount": 4202148
}]
We use R to create the plots for the follower count CSV files.
$ Rscript twitterFollowerCount.R <twitter-username-without-@>
Example of @petebuttigieg
We collected the historical Twitter follower count for the Twitter account, @PeteButtigieg, with a monthly sampling rate between the account creation date and the current date (2020-07-25).
Command to fetch the follower count and plot R graph
$ ./fch.py -f=pete.csv --freq=2592000 petebuttigieg
$ Rscript twitterFollowerCount.R pete.csv
The R script generates six graphs for each handle which show the absolute and relative growth of Twitter followers in numbers and percentage and the absolute and relative daily new follower growth rate for the Twitter account. Figure 5 shows the absolute growth of Twitter followers in percentage and number for the Twitter account, @petebuttigieg. Figure 6 shows the daily new Twitter follower rate for the Twitter account, @petebuttigieg with respect to the first memento. Figure 7 shows the daily new Twitter follower rate for the Twitter account, @petebuttigieg with respect to the previous memento.
Figure 5: Graph showing the absolute growth (in numbers and percentage) of Twitter followers for the Twitter account, @PeteButtigieg, since the first memento capture from December 2012 to July 2020 with a monthly sampling rate.
Figure 6: Graph showing the relative daily new Twitter follower rate for the Twitter account, @PeteButtigieg, since the first memento capture from December 2012 to July 2020 with a monthly sampling rate.
Figure 7: Graph showing the absolute daily new Twitter follower rate for the Twitter account, @PeteButtigieg, since the first memento capture from December 2012 to July 2020 with a monthly sampling rate.
Archival Soft Error Codes
Figure 8 shows a case of a soft error code, which occurs when a web archive responds with a status code of 200 for a memento but the content of the returned memento is an error message. In a brief analysis, we observed this archival soft error behavior from Archive-It for mementos between 2006 and 2008.
Figure 8: An Archive-It memento from 2007 responding with an error code 500
cURL response for the Archive-It memento from Figure 8
msiddique@wsdl-3102-03:~$ curl -I https://wayback.archive-it.org/all/20071030204619/https://twitter.com/joebiden
HTTP/1.1 200 OK
server: Apache-Coyote/1.1
content-security-policy-report-only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
memento-datetime: Tue, 30 Oct 2007 20:46:19 GMT
link: <https://twitter.com/joebiden>; rel="original", <https://wayback.archive-it.org/all/timemap/link/https://twitter.com/joebiden>; rel="timemap"; type="application/link-format", <https://wayback.archive-it.org/all/https://twitter.com/joebiden>; rel="timegate", <https://wayback.archive-it.org/all/20071030201109/https://twitter.com/joebiden>; rel="prev first memento"; datetime="Tue, 30 Oct 2007 20:11:09 GMT", <https://wayback.archive-it.org/all/20071030204619/https://twitter.com/joebiden>; rel="memento"; datetime="Tue, 30 Oct 2007 20:46:19 GMT", <https://wayback.archive-it.org/all/20071031060332/https://twitter.com/joebiden>; rel="next memento"; datetime="Wed, 31 Oct 2007 06:03:32 GMT", <https://wayback.archive-it.org/all/20200801090225/https://twitter.com/joebiden>; rel="last memento"; datetime="Sat, 01 Aug 2020 09:02:25 GMT"
set-cookie: JSESSIONID=265A771FACE02E814C2FEAE89D9EF493; Path=/; HttpOnly
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-server: hi
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-via: 1.0 twitter.com
x-archive-orig-cache-control: max-age=300
x-archive-orig-expires: Tue, 30 Oct 2007 20:51:19 GMT
x-archive-orig-content-length: 122
x-archive-orig-date: Tue, 30 Oct 2007 20:46:19 GMT
content-type: text/html;charset=utf-8
content-length: 16839
date: Mon, 03 Aug 2020 21:06:48 GMT
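Because the status line alone does not reveal a soft error, a parser has to inspect the memento content as well. The sketch below shows one possible heuristic, assuming that error pages can be recognized by a few text patterns. The patterns and the function name is_soft_error are illustrative guesses, not the checks used by Follower Count History, and a page that merely quotes such a phrase would be flagged by mistake.

```python
import re

# Illustrative patterns only; real error pages may be worded differently.
ERROR_PATTERNS = [
    re.compile(r"\b(?:error|status)\s*(?:code)?\s*:?\s*5\d\d\b", re.I),
    re.compile(r"internal server error", re.I),
]


def is_soft_error(status_code, body):
    """Return True when a 200 response carries error-page content."""
    if status_code != 200:
        return False  # a hard error, already visible from the status line
    return any(pattern.search(body) for pattern in ERROR_PATTERNS)


print(is_soft_error(200, "<h1>Error 500</h1>"))
print(is_soft_error(200, "<span data-count='42'>42</span>"))
```

A memento flagged this way could be skipped instead of being recorded as a data point with a missing follower count.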
Conclusion
We have released version 2.0 of the Follower Count History. It uses MemGator to collect mementos from multiple web archives and returns the historical Twitter follower count in JSON as well as CSV file format. The CSV output can be used to plot line charts which show the absolute and relative follower growth and the absolute and relative daily new follower growth rate for a Twitter handle.
Link for the GitHub Repository: https://github.com/oduwsdl/FollowerCountHistory/
Acknowledgement
I would like to thank Miranda Smith and Orkun Krand for their preliminary work on the Historical Twitter Follower code. I would also like to thank Sawood Alam for providing feedback on the code.
Update
2020-08-11: Released the code on PyPI: https://pypi.org/project/fch/
Mohammed Nauman Siddique (@m_nsiddique)