Saturday, March 1, 2014

2014-03-01 Domains per page over time

A few days ago, I read an interesting blog post by Peter Bengtsson. Peter is sampling web pages and computing basic statistics on the number of domains (RFC 3986 Host) required to completely render each page. Not surprisingly, the mean is quite high: 33. Also not surprisingly, he has found pages that depend on more than 100 different domains.
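
To make "domains per page" concrete, here is a minimal Python sketch of one way such a count might be made, assuming the full list of URIs a page loads is already in hand. The function name `count_domains` and the example URIs are hypothetical, chosen only for illustration; this is not the code used in either study.

```python
from urllib.parse import urlsplit

def count_domains(uris):
    """Return the number of distinct hosts among the given absolute URIs."""
    hosts = set()
    for uri in uris:
        # .hostname is the lowercased RFC 3986 host, or None if the URI has none
        host = urlsplit(uri).hostname
        if host:
            hosts.add(host)
    return len(hosts)

# Hypothetical page: its own URI plus three embedded resources.
page_uris = [
    "http://example.com/index.html",
    "http://example.com/style.css",
    "http://cdn.example.net/logo.png",
    "http://ads.example.org/banner.js",
]
print(count_domains(page_uris))  # prints 3
```

Note that this treats each distinct host as its own domain, so `www.example.com` and `example.com` would count separately.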

This started me thinking about how this has changed over time. Over the course of my research I have acquired a corpus of composite mementos (archived web pages and all their embedded images, CSS, etc.) dating from 1996 to 2013. So, I did a little number crunching. What I suspected and confirmed is that the number of domains has increased over time and that the rate of increase has also increased. This is reflected in the median domains data shown in Figure 1.
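
The per-year median behind a plot like this could be computed along the following lines. This is only a sketch under the assumption that each composite memento has been reduced to a (year, domain count) pair; the name `median_by_year` and the sample values are made up and are not data from the corpus.

```python
from collections import defaultdict
from statistics import median

def median_by_year(records):
    """Group (year, domain_count) pairs by year and return {year: median}."""
    by_year = defaultdict(list)
    for year, domains in records:
        by_year[year].append(domains)
    return {year: median(counts) for year, counts in sorted(by_year.items())}

# Made-up illustrative values, not measurements.
sample = [(1996, 1), (1996, 1), (1996, 2), (2013, 3), (2013, 5), (2013, 9)]
print(median_by_year(sample))  # prints {1996: 1, 2013: 5}
```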

Note that the median shown (3) is a fraction of Peter's (25). I believe there are two major reasons for this. First, our current process for recomposing composite mementos from web archives does not run JavaScript, thus it only finds static URIs. Second, Peter's sample appears to be heavy on media sites, which tend to aggregate information, social media, and advertising from a multitude of other sites. On the other hand, our sample of 4,000 URIs and 82,425 composite mementos might be larger than Peter's sample and is probably more diverse. This difference is immaterial; direct comparability with Peter's results is not required to examine change in domains over time.

Figure 1 also shows the median resources (more precisely, the median unique URIs) required to recompose composite mementos. Median resources also increased over time. Furthermore, as shown in Figure 2, domains clearly increase as resources increase. This correlation seems to weaken as the number of resources increases. Note, however, that above 250 resources, the data is quite thin.

Another question that comes to mind is "what is the occurrence frequency of composite mementos at each resource level?" Figure 3 shows an ECDF for the data. Although it is hard to tell from the figure, 99% (81,387) of our composite mementos use 100 resources or fewer and 90% use 43 or fewer. Indeed, only 34 (0.0412%) have more than 300 resources.
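
For readers unfamiliar with ECDFs, the sketch below shows how the empirical cumulative distribution, and the smallest resource level covering a given share of mementos, might be computed. It assumes one resource count per composite memento; the helper names `ecdf` and `smallest_level_covering` and the sample counts are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right

def ecdf(values):
    """Return F, where F(x) is the fraction of values less than or equal to x."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("ecdf needs at least one value")
    return lambda x: bisect_right(data, x) / n

def smallest_level_covering(values, fraction):
    """Smallest observed value v such that F(v) >= fraction."""
    data = sorted(values)
    n = len(data)
    for rank, value in enumerate(data, start=1):
        if rank / n >= fraction:
            return value
    raise ValueError("fraction must be at most 1 and values non-empty")

# Made-up resource counts for ten mementos, not data from the corpus.
counts = [5, 12, 20, 43, 43, 60, 100, 100, 250, 310]
F = ecdf(counts)
print(F(100))                                # fraction using 100 resources or fewer
print(smallest_level_covering(counts, 0.5))  # level covering half the mementos

# The share quoted above: 34 of 82,425 composite mementos.
print(f"{34 / 82425:.4%}")
```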

Also interesting is the distribution of composite mementos with respect to number of domains, which is shown in Figure 4. Here 97.5% of our composite mementos use at most 10 domains. It only takes 14 domains to cover 99% of our composite mementos.

Clearly, the number of resources and domains per web page has increased over time, and the rate of increase has accelerated. These results are not directly comparable to Peter Bengtsson's, but I suspect that, were he to use a 17-year sample, the same patterns would emerge. I was half tempted to plug the 4,000 URIs from our sample into Peter's Number of Domains page to see what happens; unfortunately, I don't have the time available. Still, the results would be very interesting.

—Scott G. Ainsworth

March 2 update: minor grammatical corrections.
