Saturday, December 20, 2014

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes its CMS, the old links break and Jon has to manually refind all the new links and update his page. For example, the old URI:

http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html

is currently:

http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html

The same content had at least one other URI as well, from at least 2009--2012:

http://www.infoworld.com/d/developer-world/xquery-and-power-learning-example-924

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles. In fairness, InfoWorld is still redirecting the second URI to the current URI.



And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then stopped; currently the original URI returns a 404.
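Checks like these are easy to script. The following is a minimal sketch using only the Python standard library; the helper names `head` and `classify` are mine, not anything from Jon's post, and the labels returned by `classify` are just illustrative wording. `head` sends a single HEAD request without following redirects, so the status code and `Location` header can be inspected directly.

```python
import http.client
from urllib.parse import urlsplit


def head(uri, timeout=10):
    """Send one HEAD request (redirects are NOT followed).

    Returns (status_code, value_of_Location_header_or_None).
    """
    parts = urlsplit(uri)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def classify(status, location=None):
    """Turn one response into a short label (illustrative wording only)."""
    if status in (301, 302, 303, 307, 308) and location:
        return "redirects to " + location
    if status in (404, 410):
        return "gone (link rot)"
    if 200 <= status < 300:
        return "resolves directly"
    return "other (%d)" % status
```

Calling `classify(*head(uri))` for each of the three InfoWorld URIs above would show which ones still redirect and which have rotted; the answers will of course depend on when the check is run.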



Jon's approach is to give up on tracking the different URIs for his hundreds of articles and instead submit a combination of metadata (title and author) and the "site:" operator to a search engine to locate the current URI (side note: this approach is really similar to OpenURL). For example, the link for the article above would become:

http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22
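Building such a link is mechanical. Here is a small sketch, assuming the query is simply the "site:" operator followed by the quoted author and the quoted title; the function name `search_link` and its parameters are hypothetical, and the Bing endpoint is taken from the example URI above.

```python
from urllib.parse import urlencode


def search_link(site, author, title, endpoint="http://www.bing.com/search"):
    """Build a search engine URI from the site: operator plus quoted metadata."""
    # Quote author and title so they are matched as exact phrases.
    query = 'site:%s "%s" "%s"' % (site, author, title)
    # urlencode percent-encodes ':' and '"' and turns spaces into '+'.
    return endpoint + "?" + urlencode({"q": query})


link = search_link(
    "infoworld.com",
    "jon udell",
    "XQuery and the power of learning by example",
)
print(link)
```

This reproduces the URI shown above. Note that a title that itself contains double quotes would break the phrase match, so a real implementation would need to handle that case.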

Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).   
  • One solution for linking to a search engine (SE) for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the last point, including the original URI (and its publishing date), the SE URI, and the archived URI would result in HTML in which a single link carries all three URIs.
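A rough sketch of what generating such a link could look like follows. The attribute names `data-originalurl`, `data-versiondate`, and `data-versionurl` reflect my reading of the "Missing Link" proposal and should be checked against that document. The publishing date is an assumption inferred from the `06/11/15` path segment of the original URI, and the 14-digit Wayback Machine timestamp is a placeholder, not a verified capture. The function name `robust_link` is hypothetical.

```python
from html import escape


def robust_link(text, search_uri, original_uri, version_date, archived_uri):
    """Render one <a> element that carries the SE, original, and archived URIs."""
    attrs = [
        ("href", search_uri),                # where a click goes today
        ("data-originalurl", original_uri),  # key for looking up web archives
        ("data-versiondate", version_date),  # when the original was published
        ("data-versionurl", archived_uri),   # an archived copy
    ]
    rendered = " ".join(
        '%s="%s"' % (name, escape(value, quote=True)) for name, value in attrs
    )
    return "<a %s>%s</a>" % (rendered, escape(text))


markup = robust_link(
    text="XQuery and the power of learning by example",
    search_uri="http://www.bing.com/search?q=site%3Ainfoworld.com"
               "+%22jon+udell%22"
               "+%22XQuery+and+the+power+of+learning+by+example%22",
    original_uri="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html",
    version_date="2006-11-15",  # assumption: inferred from the URI path
    # Placeholder timestamp; substitute the timestamp of a real capture.
    archived_uri="http://web.archive.org/web/20061115000000/"
                 "http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html",
)
print(markup)
```

A browser ignores the data-* attributes and simply follows the `href`, while a script or browser extension that understands them can offer the original or archived URI as alternatives.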



I posted a comment saying that a search engine's robots.txt file would prevent archives like the Internet Archive from archiving the search engine result pages (SERPs), and thus from discovering (and archiving) the new URIs themselves. In an email conversation Martin made the point that rewriting the link to point to a search engine assumes that the search engine's URI structure isn't going to change (anyone want to bet how many links to msn.com or live.com queries are still working?). It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that's not always true for general web pages, whose titles often change (see "Is This A Good Title?").
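The robots.txt point can be illustrated with the standard library's robots.txt parser. This is a sketch only: the robots.txt body below is a made-up example of a rule that blocks result pages, not a copy of any real search engine's file, and the `ia_archiver` user agent string and the `may_crawl` helper are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks all crawlers from result pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

SERP = ("http://www.bing.com/search?q=site%3Ainfoworld.com"
        "+%22jon+udell%22"
        "+%22XQuery+and+the+power+of+learning+by+example%22")


def may_crawl(robots_txt, user_agent, uri):
    """Return True if robots_txt allows user_agent to fetch uri."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, uri)


print(may_crawl(ROBOTS_TXT, "ia_archiver", SERP))
print(may_crawl(ROBOTS_TXT, "ia_archiver", "http://www.bing.com/"))
```

Under a rule like this, a well-behaved archival crawler never fetches the SERP, so it never sees the link to the article's current URI.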

In summary, Jon's use of SERPs as interstitial pages to combat link rot is an interesting solution to a common problem, at least for those who wish to maintain publication (or similar) lists. While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them and betting on the long-term stability of search engines. The solution we need is a method to include more than one URI per HTML link, such as the one proposed in the "Missing Link" document.

--Michael