2025-08-10: Who Cares About All Those Old goo.gl Links Anyway?


11 of the 26 goo.gl URLs for data sets surveyed in Yin & Berger (2017)


In a previous post, I estimated that at least 1.3M goo.gl URLs have already been saved by the Internet Archive's Wayback Machine ahead of Google turning off its goo.gl URL shortening service.  Thanks to the efforts of Archive Team and others, that number will surely grow in the coming weeks before the shutdown.  And Google has already announced plans to keep the links that have recently been used. But all of this raises the question: "who cares about all those old goo.gl links anyway?"  In this post, I examine a single technical paper from 2017 that has 26 goo.gl URLs, one (1/26) of which is scheduled to be deprecated in two weeks.  Assuming this loss rate (1/26) holds for all the goo.gl URLs indexed in Google Scholar, then at least 4,000 goo.gl URLs from the scholarly record will be lost.


In our discussions for the Tech Friend article, Shira Ovide shared with me "When to use what data set for your self-driving car algorithm: An overview of publicly available driving datasets", a survey paper published by Yin & Berger at ITSC 2017 in Japan (preprint at ResearchGate).  I can't personally speak to the quality of the paper or its utility in 2025, but it's published at an IEEE conference and according to Google Scholar it has over 100 citations, so for the sake of argument I'm going to consider this a "good" paper and assume that, as a survey, it is still of interest some 8 years later.


109 citations for Yin & Berger on 2025-08-09 (live web link). 


The paper surveys 27 data sets that can be used to test and evaluate self-driving cars.  Of those 27 data sets, 26 are directly on the web (the paper describing the BAE Systems data set has the charming chestnut "contact the author for a copy of the data").  For the 26 data sets that are on the web, the authors link not to the original URL, such as:


http://www.gavrila.net/Datasets/Daimler_Pedestrian_Benchmark_D/daimler_pedestrian_benchmark_d.html


but to the much shorter:


https://goo.gl/l3U2Wc


Presumably, Yin & Berger used the shortened links for ease and uniformity of typesetting.  Especially in the two-column IEEE conference template, it is much easier to typeset the 21-character goo.gl URL than the 98-character gavrila.net URL. But the convenience of the 77-character reduction comes with a loss of semantics: if the gavrila.net URL rotted (e.g., became 404, or the domain was lost), then by visual inspection of the original URL, we would know to do a search engine query for "daimler pedestrian benchmark", and if the data set is still on the live web at a different URL, we have a very good chance of (re)discovering its new location (see Martin Klein's 2014 dissertation for a review of techniques).  But if goo.gl shuts down, and all we're left with in the 2017 conference paper is the string "l3U2Wc", then we don't have the semantic clues we need to find the new location, nor do we have the original URL with which to discover a copy in a web archive, such as the Internet Archive's Wayback Machine.
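To make the "loss of semantics" point concrete, here is a minimal sketch that counts the characters in the two URLs above and pulls candidate search terms out of a URL's last path segment.  The function name search_terms_from_url and its splitting rules are my own illustrative assumptions; this is not one of the rediscovery techniques reviewed in Klein's dissertation.

```python
import re
from urllib.parse import urlsplit


def search_terms_from_url(url: str) -> str:
    """Guess a search-engine query from the last non-empty path segment.

    Hypothetical helper for illustration only: it drops a trailing file
    extension and splits on underscores, hyphens, plus signs, and dots.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return ""
    # Remove an extension such as ".html" from the final segment.
    stem = re.sub(r"\.[A-Za-z0-9]+$", "", segments[-1])
    words = [w for w in re.split(r"[_\-+.]+", stem) if w]
    return " ".join(words)


long_url = ("http://www.gavrila.net/Datasets/Daimler_Pedestrian_Benchmark_D/"
            "daimler_pedestrian_benchmark_d.html")
short_url = "https://goo.gl/l3U2Wc"

# Character counts for the original URL, the short URL, and the savings.
print(len(long_url), len(short_url), len(long_url) - len(short_url))

# The long URL yields human-readable words; the short URL yields only
# its opaque identifier.
print(search_terms_from_url(long_url))
print(search_terms_from_url(short_url))
```

The long URL gives words a person could paste into a search engine, while the short URL gives back nothing but its identifier.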


Fortunately, http://www.gavrila.net/Datasets/Daimler_Pedestrian_Benchmark_D/daimler_pedestrian_benchmark_d.html is still on the live web. 



Let's consider another example that is not on the live web. The short URL:


https://goo.gl/07Us6n


redirects to:


https://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/ 


Which is currently 404:

https://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/ (via https://goo.gl/07Us6n) is now 404 on the live web. 


From inspection of the 404 URL, we can guess that "Caltech Pedestrians" is a good SE query, and the data appears to be available from multiple locations, including the presumably now canonical URL https://data.caltech.edu/records/f6rph-90m20.  (The webmaster at vision.caltech.edu should use mod_rewrite to redirect to data.caltech.edu, but that's a discussion for another time). 



The Google SERP for "Caltech Pedestrians": it appears the data set is in multiple locations on the live web.  


 

https://data.caltech.edu/records/f6rph-90m20 is presumably now the canonical URL and is still on the live web.


Even if all the caltech.edu URLs disappeared from the live web, fortunately the Wayback Machine has archived the original URL.  The Wayback Machine has archived the new data.caltech.edu URL as well, though it appears to be far less popular (so far, only 8 copies of the data.caltech.edu URL vs. 310 copies of the original vision.caltech.edu URL).



https://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/ is well archived at the Wayback Machine. 


This 2017-03-29 archived version is probably close to the state of the page at the time it was cited by Yin & Berger in 2017.


The new data.caltech.edu URL is archived, but less so (so far). 
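The archive lookups above were done by hand in the Wayback Machine's web interface.  As a rough sketch of how a similar check might be scripted, the following uses the Internet Archive's availability endpoint (https://archive.org/wayback/available) to ask for the capture closest to a given date.  The function names are hypothetical, and the response handling assumes the JSON shape that endpoint documents ("archived_snapshots" containing a "closest" entry); treat it as a starting point rather than a verified tool.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build the availability-API request URL (no network access).

    `timestamp` is an optional YYYYMMDD[hhmmss] string; when given, the
    API is asked for the capture closest to that moment.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_API + "?" + urlencode(params)


def closest_snapshot(url: str, timestamp: str = "") -> Optional[str]:
    """Return the URL of the closest archived capture, or None.

    Performs a live HTTP request, so the result depends on the state of
    the archive at the time of the call.
    """
    with urlopen(availability_query(url, timestamp), timeout=30) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

For example, calling closest_snapshot("https://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/", "20170329") would request the capture nearest to the date discussed above.  Counting captures (the 8 vs. 310 comparison) would need a different interface, such as the CDX server, which is not shown here.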


When the 26 goo.gl URLs are resolved, 18 of them successfully terminate in an HTTP 200 OK.  The other eight ended with the following response codes or conditions:



Although marked as "404" above, goo.gl/0R8XX6 resolves to an HTTP 200 OK, but that HTTP 200 response is actually for an interstitial page saying that this URL was not accessed in late 2024, and thus will be sunsetted on 2025-08-25.  Appending the argument "?si=1" to bypass the interstitial page results in a redirection to the 3dvis.ri.cmu.edu page, and that URL is 404.  Fortunately, the page is archived at the Wayback Machine.  For those in the community, perhaps there is enough context to rediscover this data set, but the first several hits for the query "CMU Visual Localization Dataset" do not return anything that is obvious to me as the right answer (perhaps the second hit subsumes the original data set?).



The reference to http://goo.gl/0R8XX6 in Yin & Berger (2017). 




A Google query for "CMU Visual Localization Dataset" on 2025-08-10; perhaps the data set we seek is included in the second hit? 


https://goo.gl/0R8XX6 did not win the popularity contest in late 2024, and will cease working on 2025-08-25. It appears that dereferencing the URL now (August 2025) will not save it. 



Dereferencing https://goo.gl/0R8XX6?si=1 yields http://3dvis.ri.cmu.edu/data-sets/localization/, which no longer resolves (which is technically not an HTTP event, since there is not a functioning HTTP server to respond). 
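The distinction between an HTTP response and a host that no longer resolves can be shown in a short script.  This is a minimal sketch, not the procedure used to produce the results in this post: the function names and outcome labels are my own, the only goo.gl-specific detail is the "si=1" argument described above, and it issues a single request per call without following redirects so that each hop can be inspected.

```python
import http.client
import socket
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def bypass_interstitial(short_url: str) -> str:
    """Return the URL with si=1 added to its query string (idempotent)."""
    parts = urlsplit(short_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if ("si", "1") not in query:
        query.append(("si", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_once(url: str, timeout: float = 30.0):
    """Issue one GET without following redirects.

    Returns (outcome, status, detail), where outcome is one of:
      "http"               - a server answered; status is the HTTP code and
                             detail is the Location header (or None)
      "dns-failure"        - the host name did not resolve (no HTTP event)
      "connection-failure" - resolved, but no usable HTTP response
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("GET", target)
        resp = conn.getresponse()
        return ("http", resp.status, resp.getheader("Location"))
    except socket.gaierror as exc:
        # Name resolution failed: there is no server to send a status code.
        return ("dns-failure", None, str(exc))
    except (OSError, http.client.HTTPException) as exc:
        return ("connection-failure", None, str(exc))
    finally:
        conn.close()
```

Calling fetch_once(bypass_interstitial("https://goo.gl/0R8XX6")) and then calling fetch_once again on any returned Location value would walk the redirect chain one hop at a time.  What it returns will depend on when it is run, especially after the 2025-08-25 sunset.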


https://3dvis.ri.cmu.edu/data-sets/localization/ was frequently archived between 2015 and 2018.



https://3dvis.ri.cmu.edu/data-sets/localization/ as archived on 2015-02-19.


So under the current guidance, one of the 26 goo.gl URLs (https://goo.gl/0R8XX6) in Yin & Berger (2017) will cease working in about two weeks, and it's not immediately obvious that the paper provides enough context to refind the original data set. This is compounded by the fact that the original host, 3dvis.ri.cmu.edu, no longer resolves.  Fortunately, the Wayback Machine appears to have the site archived (I have not dived deeper to verify that all the data has been archived; cf. our Web Science 2025 paper).  


2025-08-03 Google Scholar search for "goo.gl"


Here, we've only examined one paper, so the next natural question would be "how many other papers are impacted?"  A search for "goo.gl" at Google Scholar a week ago estimated 109,000 hits. Surely some of those hits include simple mentions of "goo.gl" as a service and don't necessarily have shortened links.  On the other hand, URL shorteners are well understood and probably don't merit extended discussion, so I'm willing to believe that nearly all of the 109k hits have at least one shortened URL in them; the few that do not are likely balanced by papers like Yin & Berger (2017), which has 26 shortened URLs.


For simplicity, let's assume there are 109,000 shortened URLs indexed by Google Scholar.  Let's also assume that the sunset rate (1/26, or roughly 4%) for the URLs in Yin & Berger (2017) also holds for the collection.  That would yield 109,000 * 0.04 = 4,360 shortened URLs to be sunsetted on 2025-08-25.  Admittedly, these are crude approximations, but saying there are "at least 4,000 shortened URLs that will disappear in about two weeks" passes the "looks right" test, and if forced to guess, I would bet that the actual number is much larger than 4,000.  Are all 4,000 "important"? Are all 4,000 unfindable on the live web? Are all 4,000 archived?  I have no idea, and I suppose time will tell.  As someone who has devoted much of my career to preserving the web, especially the scholarly web, I find that deprecating goo.gl feels like an unforced error in order to save "dozens of dollars".
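The back-of-the-envelope estimate can be written out as a few lines of arithmetic.  The inputs are the figures from this post (109,000 Google Scholar hits, 1 of 26 URLs sunsetted); the one-short-URL-per-hit assumption is the simplification stated above, and the variable names are only for illustration.

```python
scholar_hits = 109_000      # Google Scholar estimate for "goo.gl"
urls_per_hit = 1            # simplifying assumption: one short URL per hit
sunsetted, sampled = 1, 26  # observed in Yin & Berger (2017)

total_short_urls = scholar_hits * urls_per_hit

# Using the rate rounded to 4%, as in the text.
rounded_estimate = round(total_short_urls * 0.04)

# Using the unrounded rate of 1/26 (about 3.85%).
unrounded_estimate = round(total_short_urls * sunsetted / sampled)

print(f"rate = {sunsetted / sampled:.4f}")
print(f"estimate at 4%:   {rounded_estimate:,}")
print(f"estimate at 1/26: {unrounded_estimate:,}")
```

Either way the estimate lands above 4,000, so the "at least 4,000" claim does not depend on rounding 1/26 up to 4%.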


–Michael 





A gist with the URLs and HTTP responses is available.

