Tuesday, May 7, 2013

2013-05-07: Who Is Archiving Your Tweets?

Who is archiving your tweets?

You're probably thinking "the Library of Congress".  And you're right: they have been since 2010 (see the announcements from Twitter and LC).  But LC is currently providing access only to researchers, and the scale of the archive makes access challenging (see LC's January 2013 white paper that provides a status update on the project).

To say I think this joint project between LC and Twitter is exciting and important is an understatement; I could go on about the scholarly importance, the cultural and technological record, the phenomenon of social media, etc.  So I was surprised (but in retrospect, should not have been) when almost immediately afterwards projects like noloc.org surfaced so you could opt out of the archiving of your public tweets.

However, while you might be able to prevent LC from archiving your tweets, companies like Topsy are archiving them, or at least some of them.  Topsy is one of my new, favorite sites in part because they archive tweets; not necessarily because archiving them is the right thing to do (tm), but presumably because: 1) it allows them to build interesting services on top of the tweets, and 2) deleting them is probably more work than not deleting them*.  Hany SalahEldeen and I began exploring Topsy in the context of his research on temporal intent in social media link sharing.

Although I think they think their primary business model is searching the social web, to me the most interesting services are generating the retweet and link neighborhoods for tweets.  For example:


is Topsy's page about me, and:


provides all the various tweets that linked to: cottagelabs.com/news/meeting-the-oaipmh-use-case-with-resourcesync.  Topsy will promote your status to "influential" or "highly influential", presumably based on a mix of followers and/or retweets (e.g., Clay Shirky is "highly influential" with 274k followers, but Farrah is "influential" with fewer than 500 followers but presumably many retweets).  Tweets about links can be "interesting", and that appears to be based on a tweet having different text from the HTML title of the target link.

Let's look at some examples of how Topsy is archiving at least some tweets. This September 28, 2012 story on BBC.com cited our TPDL 2012 paper, but also started with a nice quote for motivation and context:
On January 28 2011, three days into the fierce protests that would eventually oust the Egyptian president Hosni Mubarak, a Twitter user called Farrah posted a link to a picture that supposedly showed an armed man as he ran on a "rooftop during clashes between police and protesters in Suez". I say supposedly, because both the tweet and the picture it linked to no longer exist.  Instead they have been replaced with error messages that claim the message – and its contents – "doesn’t exist".
It's true: although the user "Farrah" still exists, she has deleted many of her tweets from during the Egyptian Revolution.  For example, the tweet and the picture linked to in the tweet are 404:

But if we prepend the twitpic URI with "topsy.com/" to get:


we see the original tweet, and a small but not full-size version of the image:

And it is not just that tweet & image; there are many others as well (twitter URI, twitpic URI):


Topsy will also archive the tweets marked for deletion from the LC archive.  For example, tweets like this from the author of noloc.org no longer exist and presumably were deleted before inclusion in the LC archive:


But we can see these tweets continue to exist in Topsy:


Note that the above link is a relative offset, so the actual tweets might scroll off that page.  This reflects a limitation of the service, at least with respect to being an actual archive: it offers only a limited window (at least for the free service) of 100 pages of 10 tweets each.  For active accounts this 1,000-tweet window will scroll by quickly.  For example, the right-wing politics site twitchy.com was giddy when a White House staffer mistook/misspelled "congenital" as "congenial" in this now-deleted June 29, 2012 tweet:


But at the time of this writing, 100 pages back only takes you to January 29, 2013 so we can't see if Topsy has archived this tweet.

In my recent presentation at the 2013 IIPC meeting, I mentioned the zombie movie trope of not using the word "zombie" to describe zombies (i.e., no one in a zombie movie has ever heard of zombies).  I drew the parallel of not "using the a-word" -- perhaps the best, commercially viable archives don't use the word "archive".  I don't believe Topsy markets its services as an "archive", but that is what it is providing (modulo the 100 page limitation as well as not supporting archival protocols like Memento).  On the other hand, the word "archive" denotes a certain level of permanency, and who knows if Topsy will survive in the marketplace?  This list from tweetsmarter.com has a number of social media companies, many of which are now defunct.  If Topsy goes under, most likely its extensive archives will disappear as well.  True, most of the material won't be missed, but historically important material, such as Farrah's live tweeting of the Egyptian Revolution, will disappear with it.  And since it is not clear how to monetize archives, and with actual archives such as WebCite running a donation campaign, we should be reminded that what we perceive as "archives" are really just web sites.  So who will archive the archives?


Edit: See also http://ws-dl.blogspot.com/2013/05/2013-05-21-update-about-archiving-tweets.html

* = I don't have any details about how Topsy is designed, their business relationship with Twitter, or anything of the like.  Nor have I paid for a "pro" account or anything like that.  All observations are from my position of being outside and looking in.


  1. Thanks for this post. Your readers might also be interested in a useful hack by @mhawksey to allow users to archive their own, using Google Drive. There's a post about how I started using it here: http://peterwebster.wordpress.com/2013/04/20/what-use-is-a-personal-tweet-archive/

  2. It appears Topsy has changed their API. URLs of the form:


    no longer work, but deleting the scheme (i.e., "http://") from the target URL appears to do the right thing:
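
The URL rewriting described in this post (prepend "topsy.com/" to the target URI, after dropping the scheme) can be sketched in a few lines of Python. This is only an illustration of the string manipulation, based on my observations from the outside; the function name, the constant, and the example.com URLs are hypothetical, and nothing here is taken from any documented Topsy API.

```python
# Prefix as written in the post; treating it as a fixed constant is an assumption.
TOPSY_PREFIX = "topsy.com/"


def strip_scheme(url: str) -> str:
    """Remove a leading "http://" or "https://" from url, if present."""
    for scheme in ("http://", "https://"):
        if url.lower().startswith(scheme):
            return url[len(scheme):]
    return url


def topsy_lookup_url(target_url: str) -> str:
    """Build the Topsy lookup URL for target_url (hypothetical helper)."""
    return TOPSY_PREFIX + strip_scheme(target_url)


if __name__ == "__main__":
    # example.com stands in for a real twitpic or news URL.
    print(topsy_lookup_url("http://example.com/some/page"))
```

Stripping the scheme before prepending means the same helper works whether or not the target URL was copied with its "http://" prefix.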