Posts

Showing posts from March, 2014

2014-03-01: Starting my research internship at NUS

Well, I made it! I am finally on the green fine island. After a long trip from Norfolk International Airport to Washington Dulles, then 23 hours in the air broken only by a refueling pit stop at Tokyo's Narita Airport, I landed at Changi Airport in Singapore.

To give you some context, I was invited to spend a semester at the National University of Singapore and work with Dr. Min Yen Kan in the WING research group. The purpose was to work in a common area of interest that would help me progress in the final leg of my PhD marathon and increase the collaboration between our WS-DL lab and WING, yielding a reputable paper (or more?). In short, I am a WING this semester! So buckle up!

With jet lag being a miserable companion for the first couple of days, I decided not to take the first day off to rest and settle in, and instead went directly to the university. Or maybe it was my excitement? I will never confess.

At NUS I did the regular paperwork and met my colleague and fellow research partner for the next couple of m…

2014-03-01: Domains per page over time

A few days ago, I read an interesting blog post by Peter Bengtsson. Peter is sampling web pages and computing basic statistics on the number of domains (RFC 3986 Host) required to completely render the page. Not surprisingly, the mean is quite high: 33. Also not surprisingly, he has found pages that depend on more than 100 different domains.

This started me thinking about how this has changed over time. Over the course of my research I have acquired a corpus of composite mementos (archived web pages and all their embedded images, CSS, etc.) dating from 1996 to 2013. So, I did a little number crunching. What I suspected and confirmed is that the number of domains has increased over time and that the rate of increase has also increased. This is reflected in the median domains data shown in Figure 1.
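For readers who want to try this kind of counting themselves, here is a minimal Python sketch. It is not the code behind Figure 1. It assumes each page is available as a (year, list of resource URIs) pair, treats a "domain" as the host component of each URI, and uses the hypothetical names `count_domains`, `median_domains_by_year`, and `sample_pages` with made-up sample data.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlsplit


def count_domains(resource_uris):
    """Count the distinct hosts among a page's resource URIs."""
    hosts = set()
    for uri in resource_uris:
        # urlsplit(...).hostname is the lowercased host, or None if absent
        host = urlsplit(uri).hostname
        if host:
            hosts.add(host)
    return len(hosts)


def median_domains_by_year(pages):
    """pages: iterable of (year, [resource URIs]) pairs (an assumed layout)."""
    counts = defaultdict(list)
    for year, uris in pages:
        counts[year].append(count_domains(uris))
    return {year: median(values) for year, values in sorted(counts.items())}


# Made-up sample data, for illustration only
sample_pages = [
    (1996, ["http://example.com/", "http://example.com/logo.gif"]),
    (2013, ["http://example.org/",
            "http://cdn.example.net/app.js",
            "http://img.example.net/a.png"]),
]

if __name__ == "__main__":
    for year, value in median_domains_by_year(sample_pages).items():
        print(year, value)
```

Counting hosts with a set means a page that loads ten images from one server still counts as a single domain, which matches the per-page definition used above.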

Note that the median shown (3) is a fraction of Peter's (25). I believe there are two major reasons for this. First, our current process for recomposing composite mementos from web archive…