Thursday, July 5, 2012

2012-07-05: Web Crawler Animation

In this post I'm revisiting a publication from the pre-blog era that has really cool animations.  Most of my work is at the protocol and architecture level (e.g., PMH, ORE, Memento, ResourceSync) and while I enjoy that, it does leave me with a serious case of visualization envy that was made worse by attending a Tufte lecture ca. 2004.  While we don't have anything close to Minard's "Napoleon's March to Moscow", we do have a couple of things of which I'm especially proud.

One of the things I find myself showing people every month or two is Joan Smith's animations of web crawlers visiting a series of synthetic web sites over the course of a year (February 2007 -- February 2008).  Joan's dissertation was on web servers assisting the task of digital preservation, both by enumerating the valid URIs at a web site and by providing preservation metadata about the resource representations there.  One of the sub-questions in the URI enumeration section was: "will all resources at a site be visited by conventional web crawlers?"

Conventional wisdom at the time held that web crawlers did not go "deep" into a site, preferring instead a broad skim of the "surface" that sampled only some of its pages.  To test this prevailing notion, we built synthetic web sites that were simultaneously 100 pages wide (i.e., "/1", "/2", "/3", ... , "/100") and 100 pages deep (i.e., "/1", "/1/2", "/1/2/3", ... , "/1/2/3/4/.../100") and watched how Google, Yahoo, and MSN (this was pre-Bing) crawled them (Joan had some clever code for generating synthetic web sites, but I don't recall her releasing it).  We actually had four sites:
  • two depth linking policies:
    • bread crumb: "/1" links to "/1/2" which links to "/1/2/3" etc.
    • buffet: "/1", "/1/2", "/1/2/3", etc. are all linked from the root page
  • two hosting policies:
    • hosted on .com machines
    • hosted on .edu machines
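Joan's actual site generator was never released, so as a rough illustration only, here is a small sketch of how the two depth linking policies above could be enumerated (the function and variable names are mine, not from the paper):

```python
# Sketch of the two depth linking policies for a 100-deep synthetic site.
# Not Joan Smith's actual generator (which was never released); names are
# illustrative assumptions.

DEPTH = 100

def deep_paths(depth=DEPTH):
    """Return the nested paths /1, /1/2, ..., /1/2/.../depth."""
    parts, paths = [], []
    for i in range(1, depth + 1):
        parts.append(str(i))
        paths.append("/" + "/".join(parts))
    return paths

def links(policy, depth=DEPTH):
    """Map each page to the deep links it exposes under a linking policy."""
    paths = deep_paths(depth)
    if policy == "buffet":
        # the root page links to every level directly
        return {"/": paths}
    if policy == "breadcrumb":
        # each level links only to the next level down
        out = {"/": [paths[0]]}
        for parent, child in zip(paths, paths[1:]):
            out[parent] = [child]
        return out
    raise ValueError(policy)

print(deep_paths(3))                    # ['/1', '/1/2', '/1/2/3']
print(links("breadcrumb", 3)["/1/2"])   # ['/1/2/3']
```

The difference matters to a crawler: under "buffet" every deep page is one hop from the root, while under "bread crumb" reaching level 100 requires following 100 consecutive links.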
For each of the four sites, we created graphs showing how each robot (G, Y, M) crawled the site on each day.  There were CDF-like graphs that showed what percentage of the site had been crawled over time, but the real insight comes from watching the daily crawl graphs combined into an animated image.  Here's all you need to know: each site is a 100x100 set of pages; white means not crawled, grey means crawled, a blue dot means a regular HTTP GET on a page, and a red dot means a conditional HTTP GET (i.e., the crawler remembers that it has visited that page before).
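The blue/red distinction maps onto standard HTTP revalidation semantics: a conditional GET carries an If-Modified-Since header, and the server answers 304 Not Modified (no body) when the page hasn't changed.  A minimal sketch of the server-side decision, assuming If-Modified-Since-based revalidation (function names are illustrative):

```python
# Minimal sketch of server-side conditional-GET handling: a "red dot"
# crawler sends If-Modified-Since and may get 304 back with no body.
from email.utils import parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    """Return (status, body) for a GET on a page last changed at
    last_modified (an RFC 2822 date string, as in HTTP headers)."""
    if if_modified_since is not None:
        if parsedate_to_datetime(if_modified_since) >= parsedate_to_datetime(last_modified):
            return 304, None                 # Not Modified: body not re-sent
    return 200, "<html>page content</html>"  # full response

lm = "Wed, 21 Feb 2007 07:28:00 GMT"
print(respond(lm))          # regular GET  -> (200, '<html>page content</html>')
print(respond(lm, lm))      # conditional  -> (304, None)
```

This is why MSN never issuing a conditional GET (noted below) is interesting: it re-downloaded full page bodies on every revisit instead of revalidating cached copies.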

In the above graph (click on it to see the animations), the Google crawler rips through the .com buffet site, visiting all pages in about 3 weeks. 
In the above graph, the Google crawler marches more deliberately through the .com bread crumb site, but still covers the entire site in about 6 weeks.  In this case the site structure influences how long it takes for Google to crawl the site, but Google crawls everything regardless.
The above graph shows that the Yahoo crawler is far more tentative.  On the .com buffet site, for about the first two weeks the crawler revisits only the top-level page.  It takes about 10 weeks before it goes beyond the first level, and it doesn't get serious about crawling until about 17 weeks in, and even then it goes only about 55 levels down.  That eventually stretches to about 75 levels down, but even at the end of the year there are significant sections of the site that the crawler has not explored.
In the above graph, the MSN crawler on the .edu buffet site doesn't get serious about crawling the site until about 20 weeks into the test (also, it never does a conditional GET).  What we don't know is if that 20 week period was a part of a crawling strategy (e.g., "this site has to stick around long enough before we'll take it seriously"), or if 20 weeks in reflected a change in crawling strategy or increased crawling capacity on the part of MSN. 

That is a typical problem with this kind of black-box experiment: without setting up many different web sites, it is hard to generalize from the behavior we saw.  And even if it was an accurate description of the crawling strategies of the big three search engines, it was accurate only for the period during which we took data (February 2007 -- February 2008).  All three could have changed their strategies in March 2008 and we would not have known.  But it did answer our question: we could build large web sites (100x100) and every page would get crawled by someone, if not everyone, and presumably crawlers only get more aggressive over time.

The article made a small buzz when it came out but has received only a single citation as of this writing, presumably because it addresses more of an engineering audience.  It was never intended to be a primary result of Joan's research (we chose D-Lib Magazine as a venue well suited to animations), but the simple yet effective animations have turned out to be the most fun part, and they assuage my visualization envy.

Above I've selected four of the twelve animations; I highly recommend spending the time for all the animations in tables 3 and 6.  The full citation for the work is:
Joan A. Smith, Michael L. Nelson, Site Design Impact on Robots: An Examination of Search Engine Crawler Behavior at Deep and Wide Websites, D-Lib Magazine, 14(3/4), 2008. 
