Saturday, March 1, 2014

2014-03-01: Starting my research internship at NUS

Well, I made it! I am finally on the green fine island. After a long trip from Norfolk International Airport to Washington DC Dulles, then 23 hours in the air (except for a refueling pit stop at Tokyo Narita Airport), I landed at Changi Airport in Singapore.

To give you some context, I was invited to spend a semester at the National University of Singapore and work with Dr. Min Yen Kan in the WING research group. The purpose was to work in a common area of interest that helps me progress in the final leg of my PhD marathon, and to increase the collaboration between our WS-DL lab and WING, yielding a reputable paper (or more?). In short, I am a WING this semester! So buckle up!

With jet lag being a miserable companion the first couple of days, I decided not to take the first day off to rest and settle in, and instead went directly to the university. Or maybe it was my excitement? I will never confess.

At NUS I did the regular paperwork and met my colleague and fellow research partner for the next couple of months, Ms. Tao Chen. Tao showed me around the university and labs and gave me pointers on what to expect around here.

The next day I met with Dr. Kan and we discussed all the logistics of my arrival and possible ideas, to zero in on the angle I am going to focus on. We also discussed points of collaboration on side projects with Tao and Jun Ping, a fellow researcher at WING.

I will be working in the lab on the 5th floor of AS6, the building next to COM1 in the School of Computing. My desk is next to a huge window spanning half of the wall and overlooking an adjacent forest with singing birds! I guess I am a very happy PhD student now!

The journey starts now. Let's see what I can do in the next couple of months while working with Asia's finest. Wish me luck!

-- Hany M. SalahEldeen

2014-03-01: Domains per page over time

A few days ago, I read an interesting blog post by Peter Bengtsson. Peter is sampling web pages and computing basic statistics on the number of domains (RFC 3986 Host) required to completely render each page. Not surprisingly, the mean is quite high: 33. Also not surprisingly, he has found pages that depend on more than 100 different domains.
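As a rough illustration of what "number of domains" means here, the sketch below counts the distinct RFC 3986 hosts in a list of resource URIs for one page. It is not Peter's code or ours; the function name `count_domains` and the sample URIs are made up for the example, and it assumes the list of resources has already been collected.

```python
from urllib.parse import urlsplit


def count_domains(resource_uris):
    """Return the number of distinct hosts among the given URIs."""
    hosts = set()
    for uri in resource_uris:
        # .hostname is the lowercased host without port or userinfo;
        # it is None for relative URIs, which are skipped.
        host = urlsplit(uri).hostname
        if host:
            hosts.add(host)
    return len(hosts)


# Hypothetical resource list for a single page (not real data).
page_resources = [
    "http://example.com/index.html",
    "http://example.com/style.css",
    "http://cdn.example.net/lib.js",
    "https://ADS.example.org:8080/banner.png",
    "images/logo.png",  # relative URI, no host of its own
]

print(count_domains(page_resources))  # prints 3
```

Treating each full host name as a domain keeps the count simple; grouping hosts by registered domain would need a public suffix list and is left out here.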

This started me thinking about how this has changed over time. Over the course of my research I have acquired a corpus of composite mementos (archived web pages and all their embedded images, CSS, etc.) dating from 1996 to 2013. So, I did a little number crunching. What I suspected and confirmed is that the number of domains has increased over time and that the rate of increase has also increased. This is reflected in the median domains data shown in Figure 1.
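The per-year medians can be computed with a simple group-by. The following is only a sketch of that idea, assuming each composite memento has already been reduced to a (year, domain count) pair; the function name and the sample values are invented for illustration and are not taken from the corpus.

```python
from collections import defaultdict
from statistics import median


def median_domains_by_year(observations):
    """Map each year to the median domain count observed in that year.

    `observations` is an iterable of (year, domain_count) pairs.
    """
    by_year = defaultdict(list)
    for year, domain_count in observations:
        by_year[year].append(domain_count)
    return {year: median(counts) for year, counts in sorted(by_year.items())}


# Invented sample values, one pair per composite memento.
sample = [
    (1996, 1), (1996, 1), (1996, 2),
    (2005, 2), (2005, 3),
    (2013, 3), (2013, 5), (2013, 8),
]

for year, med in median_domains_by_year(sample).items():
    print(year, med)
```

The same function works for median resources if the second element of each pair is the count of unique URIs instead.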

Note that the median shown (3) is a fraction of Peter's (25). I believe there are two major reasons for this. First, our current process for recomposing composite mementos from web archives does not run JavaScript, thus it only finds static URIs. Second, Peter's sample appears to be heavy on media sites, which tend to aggregate information, social media, and advertising from a multitude of other sites. On the other hand, our sample of 4,000 URIs and 82,425 composite mementos might be larger than Peter's sample and is probably more diverse. This difference is immaterial; direct comparability with Peter's results is not required to examine change in domains over time.

Figure 1 also shows the median resources (more precisely, the median unique URIs) required to recompose composite mementos. Median resources also increased over time. Furthermore, as shown in Figure 2, domains clearly increase as resources increase. This correlation seems to weaken as the number of resources increases. Note, however, that above 250 resources, the data is quite thin.

Another question that comes to mind is "what is the occurrence frequency of composite mementos at each resource level?" Figure 3 shows an ECDF for the data. Although it is hard to tell from the figure, 99% (81,387) of our composite mementos use 100 resources or fewer and 90% use 43 or fewer. Indeed, only 34 (0.0412%) have more than 300 resources.

Also interesting is the distribution of composite mementos with respect to number of domains, which is shown in Figure 4. Here 97.5% of our composite mementos use at most 10 domains. It only takes 14 domains to cover 99% of our composite mementos.
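For readers who want to reproduce this kind of summary on their own data, here is a minimal sketch of an ECDF and of the "how many domains cover X% of mementos" question. The function names and the ten sample counts are hypothetical; the figures quoted above come from the full corpus, not from this snippet.

```python
import math


def ecdf(values):
    """Return sorted (x, fraction of values <= x) pairs, one per distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        # Emit a point only at the last occurrence of each distinct value.
        if i == n or ordered[i] != x:
            points.append((x, i / n))
    return points


def smallest_value_covering(values, fraction):
    """Smallest x such that at least `fraction` of the values are <= x."""
    ordered = sorted(values)
    k = math.ceil(fraction * len(ordered))
    return ordered[max(k, 1) - 1]


# Invented per-memento domain counts (not real data).
domain_counts = [1, 1, 2, 3, 3, 3, 4, 10, 2, 1]

for x, frac in ecdf(domain_counts):
    print(x, frac)

print(smallest_value_covering(domain_counts, 0.5))  # prints 2
```

Reading a threshold off the sorted list, rather than off a plot, avoids the "hard to tell from the figure" problem when the curve flattens near 100%.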

Clearly, the number of resources and domains per web page has increased over time, and the rate of increase has accelerated. These results are not directly comparable to Peter Bengtsson's, but I suspect that were he to use a 17-year sample, the same patterns would emerge. I was half tempted to plug the 4,000 URIs from our sample into Peter's Number of Domains page to see what happens; unfortunately, I don't have the time available. Still, the results would be very interesting.

—Scott G. Ainsworth

March 2 update: minor grammatical corrections.