2025-08-03: The Wayback Machine Has Archived at Least 1.3M goo.gl URLs

The interstitial page for https://goo.gl/12XGLG, telling the user that Google will soon abandon this shortened URL.  



Last year, Google announced it intended to deprecate its URL shortener, goo.gl, and just last week they announced the final shutdown date of August 25. I was quoted in Tech Friend, a Washington Post newsletter by Shira Ovide, joking that the move "would save Google dozens of dollars."  Then last Friday, Google announced a slight update: links that showed some activity in "late 2024" will continue to redirect.


To be sure, the shutdown isn't about saving money, or at least not the direct cost of maintaining the service. goo.gl stopped accepting new shortening requests in 2019 but continued to redirect existing shortened URLs, and maintaining a server with a static mapping of shortened URLs to their full URLs has a negligible hardware cost. The real reason is likely that nobody within Google wants to be responsible for maintaining the service: engineers in tech companies get promoted for innovation on new and exciting projects, not for maintaining infrastructure and sunsetted products.  URL shorteners are largely a product of the bad old days of social media, and their functionality has since been absorbed by the platforms themselves (e.g., Twitter's t.co service, added ca. 2011).  URL shorteners still have their place: I use bitly's custom URL service to create mnemonic links for Google Docs (e.g., https://bit.ly/Nelson-DPC2025 instead of https://docs.google.com/presentation/d/1j6k9H3fA1Q540mKefkyr256StaAD6SoQsJbRuoPo4tI/edit?slide=id.g2bc4c2a891c_0_0#slide=id.g2bc4c2a891c_0_0). URL shorteners proliferated for a while, and most of them have since gone away; the 301works.org project at the Internet Archive has archived many, but not all, of their mappings.


When Shira contacted me, one of the things she wanted to know was the scale of the problem. A Hacker News thread offered various estimates: about 60k articles in Google Scholar contain the string "goo.gl", and another commenter claimed that a Google search for "site:goo.gl" returned 9.6M links (though my version of Google no longer shows result set size estimates).


2025-08-03 Google Scholar search for "goo.gl"


2025-08-03 Google search for "goo.gl"



Curious and not satisfied with those estimates, I started poking around to see what the Internet Archive's Wayback Machine has.  These numbers were taken on 2025-07-25, and will surely increase soon based on Archive Team's efforts.  


First, not everyone knows that you can search URL prefixes in the Wayback Machine with the "*" character.  I first did a search for "goo.gl/a*", then "goo.gl/aa*", and so on, until I got something less than the maximum of 10,000 hits per response.


https://web.archive.org/web/*/goo.gl/a* 


https://web.archive.org/web/*/goo.gl/aa* 


https://web.archive.org/web/*/goo.gl/aaa* 

https://web.archive.org/web/*/goo.gl/aaaa* 


We could repeat with "b", "bb", "bbb", "bbbb", etc. but that would take quite a while. Fortunately, we can use the CDX API to get a complete response and then process it locally.   


The full command line session is shown below, and then I'll step through it:


% curl "http://web.archive.org/cdx/search/cdx?url=goo.gl/*" > goo.gl

% wc -l goo.gl

 3974539 goo.gl

% cat goo.gl | awk '{print $3}' | sed "s/https://" | sed "s/http://" | sed "s/?.*//" | sed "s/:80//" | sed "s/www\.//" | sort | uniq > goo.gl.uniq

% wc -l goo.gl.uniq

 1374191 goo.gl.uniq


The curl command accesses the CDX API, searching for all URLs prefixed with "goo.gl/*", and saves the response in a file called "goo.gl".  


The first wc command shows that there are 3.9M lines in a single response (i.e., pagination was not used).  Although not listed above, we can take a peek at the response with the head command:


% head -10 goo.gl

gl,goo)/ 20091212094934 http://goo.gl:80/ text/html 404 2RG2VCBYD2WNLDQRQ2U5PI3L3RNNVZ6T 298

gl,goo)/ 20091217094012 http://goo.gl:80/? text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1003

gl,goo)/ 20100103211324 http://goo.gl/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1166

gl,goo)/ 20100203080754 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1010

gl,goo)/ 20100207025800 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1006

gl,goo)/ 20100211043957 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1001

gl,goo)/ 20100217014043 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 999

gl,goo)/ 20100224024726 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1000

gl,goo)/ 20100228025750 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1003

gl,goo)/ 20100304130514 http://goo.gl:80/ text/html 200 HLSTSF76S2N6NDBQ4ZPPQFECB4TKXVCF 1008


The file has seven space-separated columns: the URL in SURT format (a normalized form of the URL), the datetime of the capture, the actual URL encountered, the MIME type, the HTTP status code, the content digest, and the record length. The above response shows that the top-level URL, goo.gl, was archived many times (as you would expect), and the first time was on 2009-12-12 at 09:49:34 UTC.
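For illustration, the gist of the SURT transformation (reverse the host labels, append the path, lowercase everything) can be sketched in the shell. This `surt` helper is a hypothetical toy, not the Wayback Machine's actual implementation: real SURT canonicalization also handles ports, query strings, "www." prefixes, and other rules this sketch ignores.

```shell
# Toy SURT conversion: reverse the dot-separated host labels,
# append the path, and lowercase the result.
# Simplified sketch only -- real SURT does much more canonicalization.
surt() {
  hp=${1#*://}                  # drop the scheme
  host=${hp%%/*}                # hostname
  path=/${hp#*/}                # path (assumes the URL has one)
  rev=$(echo "$host" | awk -F. '{for (i = NF; i > 1; i--) printf "%s,", $i; print $1}')
  echo "${rev})${path}" | tr '[:upper:]' '[:lower:]'
}

surt "https://goo.gl/0003bR"    # prints: gl,goo)/0003br
```

Note how the lowercasing collapses any case distinctions in the path.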


The third command listed above takes the 3.9M line output file and uses awk to select only the third column (the URL, not the SURT). The first two sed commands remove the scheme (https and http) from the URL, the third sed command removes any URL arguments, the fourth removes any port 80 remnants, and the fifth removes any unnecessary "www." prefixes. The result is then sorted (even though the input should already be sorted, we sort it again just to be sure) and run through the uniq command to remove duplicate URLs.
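As a sanity check, the same normalization can be applied to a single sample URL (the sed expressions below mirror the pipeline above, combined into one invocation for brevity). Note that stripping "https:" rather than "https://" leaves a leading "//", which is harmless for counting purposes:

```shell
# Apply the pipeline's normalization steps to one sample URL:
# strip the scheme, then the query string, then :80 and www. remnants
echo "https://goo.gl/0003bR?d=1" \
  | sed "s/https://; s/http://; s/?.*//; s/:80//; s/www\.//"
# prints: //goo.gl/0003bR
```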


We process the URLs and not the SURT form of the URLs because in short URLs, capitalization in the path matters.  For example, "goo.gl/003br" and "goo.gl/003bR" are not the same URL – the "r" vs. "R" matters. 


goo.gl/003br --> http://www.likemytweets.com/tweet/217957944678031360#217957944678031360%23like 


and


goo.gl/003bR --> http://www.howtogeek.com/68999/how-to-tether-your-iphone-to-your-linux-pc/ 
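Conveniently, the sort | uniq step in the pipeline compares lines byte-for-byte (at least under the C locale), so case-variant short codes like these survive deduplication. A quick check:

```shell
# sort | uniq treats "003br" and "003bR" as distinct lines;
# LC_ALL=C avoids any locale-dependent collation surprises
printf 'goo.gl/003br\ngoo.gl/003bR\n' | LC_ALL=C sort | uniq | wc -l
# prints 2: both case variants survive
```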


We remove the URL arguments because, although they are technically different URLs, the "?d=1" (show destination) and "?si=1" (remove interstitial page) arguments shown in the grep output below don't alter the destination URLs.


% grep -i "003br" goo.gl | head -10

gl,goo)/0003br 20250301150956 https://goo.gl/0003bR application/binary 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 239

gl,goo)/0003br 20250301201105 https://goo.gl/0003BR application/binary 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 328

gl,goo)/0003br?d=1 20250301150956 https://goo.gl/0003bR?d=1 text/html 200 YS7M3IHIYA4PGO37JKUZBPMX3WDCK5QW 591

gl,goo)/0003br?d=1 20250301201104 https://goo.gl/0003BR?d=1 text/html 200 GSJJBSKEC2AULCMM3VLZZ4R7L37X65T7 718

gl,goo)/0003br?si=1 20250301150956 https://goo.gl/0003bR?si=1 application/binary 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 237

gl,goo)/0003br?si=1 20250301201105 https://goo.gl/0003BR?si=1 application/binary 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 325

gl,goo)/003br 20250228141837 https://goo.gl/003br application/binary 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 273

gl,goo)/003br 20250228141901 https://goo.gl/003bR application/binary 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 281

gl,goo)/003br2 20250302155101 https://goo.gl/003BR2 text/html 200 JO23EZ66WLVAKLHZQ57RS4WEN3LTFDUH 587

gl,goo)/003br2?d=1 20250302155100 https://goo.gl/003BR2?d=1 text/html 200 IQ6K5GU46N3TY3AIZPOYP4RLWZC4GEIT 623



The last wc command shows that there are 1.3M unique URLs, after the URL scheme and arguments have been stripped.  


If you want to keep the arguments to the goo.gl URLs, you can do:


% cat goo.gl | awk '{print $3}' | sed "s/https://" | sed "s/http://" | sed "s/:80//" | sed "s/www\.//" | sort | uniq > goo.gl.args

% wc -l goo.gl.args 

 3518019 goo.gl.args


And the Wayback Machine has 3.5M unique goo.gl URLs if you include arguments. Unsurprisingly, that is roughly 2.5X the 1.3M URLs without arguments, consistent with many short URLs appearing in three forms: bare, with "?d=1", and with "?si=1".


Not all of those 1.3M (or 3.5M) URLs are syntactically correct.  A sharp eye will catch that the first screenshot, for https://web.archive.org/web/*/goo.gl/a*, includes a URL with an emoji. That URL is obviously not syntactically valid, does not actually exist, and is thus not archived:


https://web.archive.org/web/20240429092824/http://goo.gl/a%F0%9F%91%88 does not exist. 



Still, even with a certain number of incorrect URLs, they are surely a minority and would not meaningfully change the count of 1.3M (or 3.5M) unique goo.gl URLs archived at the Wayback Machine.


Shira noted in her article that Common Crawl (CC) told her that they estimated 10M URLs were impacted. I'm not sure how they arrived at that number, especially since the Wayback Machine's number is much lower. Perhaps there are CC crawls that have yet to be indexed, or are excluded from replay by the Wayback Machine, or they were including arguments ("d=1", "si=1"), or something else that I haven't considered. Perhaps my original query to the CDX API contained an error or a paginated response that I did not account for. 


In summary, thankfully the Internet Archive is preserving the web, which includes shortened URLs.  But also, shame on Google for shutting down a piece of web infrastructure that they created, walking away from at least 1.3M URLs they minted, and transferring this function to a third party with far fewer resources.  The cost to maintain this service is trivial, even in terms of engineer time; the real cost is intra-company prestige, which is a terrible reason to deprecate a service. And I suppose shame on us, as a culture and more specifically as a community, for not valuing investments in infrastructure and maintenance.


Google's concession of maintaining recently used URLs is not as useful as it may seem at first glance.  Yes, surely many of these goo.gl URLs redirect to URLs that are either now dead or are/were of limited importance.  But we don't know which ones are still useful, and recent usage (i.e., popularity) does not necessarily imply importance.  In my next blog post, I will explore some of the shortened URLs in technical publications, including a 2017 conference survey paper recommended by Shira Ovide that used goo.gl URLs, presumably for space reasons, to link to 27 different datasets.  



–Michael




