Monday, January 23, 2017

2017-01-23: Finding URLs on Twitter - A simple recommendation


A prompt from Twitter indicating no search results
As part of a research experiment, I needed to find URLs embedded in tweets. Most of the URLs were much older than 7 days, so the Twitter search API was not an option: it searches only a sample of tweets published in the past 7 days. I used Twitter's web search service instead.
I began the experiment by pasting URLs from tweets into the search box on twitter.com:
Searching Twitter for a URL by pasting the URL into the search box
I noticed I was able to find some URLs embedded in tweets, but this was not always the case. Based on my observations, finding the URLs was not correlated with the age of the tweet. I discussed this observation with Ed Summers and he recommended adding a "url:" prefix to the URL before searching. For example, if the search URL is: 
      "http://www.cnn.com", 
he recommended searching for
      "url:http://www.cnn.com"
I observed that prepending the "url:" prefix to search URLs improved my search success rate. For example, the search URL "http://www.motherjones.com/environment/2016/09/dakota-access-pipeline-protest-timeline-sioux-standing-rock-jill-stein" was found only with the "url:" prefix.
Example of a URL that was found only with the "url:" prefix
Example of a URL that was not found with the "url:" prefix, but was found without it
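To make the two query forms concrete, here is a minimal sketch that builds both search links for a given URL. The function name web_search_links and the "https://twitter.com/search?q=" address pattern are my assumptions for illustration; I ran the searches by pasting queries into the search box.

```python
from urllib.parse import quote

# Assumed address pattern for the twitter.com web search page.
SEARCH_PAGE = "https://twitter.com/search?q="

def web_search_links(url):
    """Return search links for the prefixed and the plain form of a URL."""
    queries = {"with_prefix": "url:" + url, "without_prefix": url}
    # Percent-encode the whole query so ':' and '/' survive as query text.
    return {name: SEARCH_PAGE + quote(q, safe="") for name, q in queries.items()}

links = web_search_links("http://www.cnn.com")
for name, link in links.items():
    print(name, link)
```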
Based on these observations, and considering that there was no apparent protocol switching or URL canonicalization, I scaled up the experiment to gain better insight into this search behavior. I wanted to know the proportion of URLs that are:
  1. found exclusively with the "url:" prefix
  2. found exclusively without the "url:" prefix
  3. found with and without the "url:" prefix (both 1 and 2).
I issued 3,923 URL queries to Twitter and observed the following counts:
  1. Count of URLs found exclusively with the "url:" prefix: 1,519
  2. Count of URLs found exclusively without the "url:" prefix: 129
  3. Count of URLs found with and without the "url:" prefix (both 1 and 2): 853
  4. Count of URLs not found: 1,422
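The counts above can be turned into hit rates with a few lines of arithmetic. This is a small sketch; the variable names are mine, and the only inputs are the four counts listed above.

```python
# Counts reported above.
only_with_prefix = 1519
only_without_prefix = 129
found_both_ways = 853
not_found = 1422

total = only_with_prefix + only_without_prefix + found_both_ways + not_found

# A URL is a hit for a method if that method finds it, alone or not.
hit_rate_with_prefix = (only_with_prefix + found_both_ways) / total
hit_rate_without_prefix = (only_without_prefix + found_both_ways) / total
hit_rate_either = (total - not_found) / total

print(f"total queries:        {total}")
print(f"with 'url:' prefix:   {hit_rate_with_prefix:.1%}")
print(f"without prefix:       {hit_rate_without_prefix:.1%}")
print(f"either form:          {hit_rate_either:.1%}")
```

The gap between the "with prefix" and "either form" rates corresponds to the 129 URLs that only the plain search found.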
My initial non-automated tests gave the false impression that the "url:" prefix was the only consistent method to find all URLs embedded in tweets. These test results show that even though the "url:" prefix search method exhibits a higher hit rate, it is not sufficient on its own.
Consequently, to find a URL "U" via Twitter web search, I recommend beginning the search with "url:U". If "U" is not found, search for U without the prefix, because trying both forms gives a higher hit rate than either form alone.
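The recommendation can be sketched as a small fallback routine. The function find_url and the search callable are hypothetical: search stands in for whatever issues a query to Twitter web search and returns a list of matching tweets, which is not shown here.

```python
def find_url(url, search):
    """Search for 'url:U' first, then plain U.

    `search` is a hypothetical callable: it takes a query string and
    returns a (possibly empty) list of matching tweets.
    Returns (query_that_matched, results), or (None, []) if neither form hits.
    """
    for query in ("url:" + url, url):
        results = search(query)
        if results:
            return query, results
    return None, []

# Usage with a stand-in search backed by a dictionary (illustrative data only).
fake_index = {
    "url:http://example.com/a": ["tweet 1"],
    "http://example.com/b": ["tweet 2"],
}

def fake_search(query):
    return fake_index.get(query, [])

print(find_url("http://example.com/a", fake_search))
print(find_url("http://example.com/b", fake_search))
print(find_url("http://example.com/c", fake_search))
```

Trying the prefixed form first keeps the number of queries low, since that form matched more often in my tests.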
--Nwala
