2012-08-31: Benchmarking LANL's SiteStory

On August 17th, 2012, Los Alamos National Laboratory's Herbert Van de Sompel announced the release of the anticipated transactional web archiver called SiteStory.

The ODU WS-DL research group (in conjunction with The MITRE Corporation) performed a series of studies to measure the effect of SiteStory on Web server performance. We found that SiteStory does not significantly affect content server performance while it is performing transactional archiving. Time per Web page access slows from 0.076 seconds to 0.086 seconds when the content server is under load, and from 0.15 seconds to 0.21 seconds when the page has many embedded and changing resources.
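To put those four timings side by side, the short Python sketch below computes the absolute and relative slowdown for each case. The function name `overhead` and the labels are illustrative choices of ours; only the four timing values come from the measurements quoted above.

```python
def overhead(baseline_s, with_sitestory_s):
    """Return (extra seconds per page, percent slowdown) for one scenario."""
    delta = with_sitestory_s - baseline_s
    return delta, 100.0 * delta / baseline_s

# Timings (seconds per page access) quoted in the paragraph above.
scenarios = [
    ("server under load", 0.076, 0.086),
    ("many embedded, changing resources", 0.15, 0.21),
]

for label, base, archived in scenarios:
    delta, pct = overhead(base, archived)
    print(f"{label}: +{delta:.3f} s per page ({pct:.1f}% slower)")
```

In both cases the added cost is a small fraction of a second per page access.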

A sneak peek at how SiteStory affects server performance is provided below. Please see the technical report for a full description of these results. But first, let's compare the archival behaviors of transactional and conventional Web archives.

Crawler and user visits generate archived copies of a changing page.

The figure above depicts a typical sequence of page changes and accesses. The scenario assumes an arbitrary page, P, that changes at irregular intervals. On this timeline, P changes at points C1, C2, C3, C4, and C5 at times t2, t6, t8, t10, and t13, respectively. A user requests P at points O1, O2, and O3 at times t3, t5, and t11, respectively. A Web crawler (which captures representations for storage in a Web archive) visits P at points V1 and V2 at times t4 and t9, respectively. Since O1 occurs after change C1, the transactional archive (TA) makes an archived copy of C1. When O2 occurs, P has not changed since O1, so no new copy is made because one already exists. The crawler's visit V1 captures C1 and stores a copy in the Web archive. In servicing V1, an unoptimized TA will store another copy of C1 at t4, whereas an optimized TA could detect that no change has occurred and not store another copy of C1.

Change C2 occurs at time t6, and C3 occurs at time t8. P is not accessed between t6 and t8, so C2 is lost: an archived copy exists in neither the TA nor the Web crawler's archive. However, one can argue that a change no entity observed does not need to be archived. Change C3 is archived during the crawler's visit V2, and the TA also archives C3. After C4, a user accesses P at O3, creating an archived copy of C4 in the TA. In the scenario depicted in the figure, the TA will hold C1, C3, and C4, while a conventional archive will hold only C1 and C3. Change C2 was never served to any client (human or crawler) and is thus not archived by either system. Change C5 will be captured by the TA when P is next accessed.

The example in the above figure demonstrates a transactional archive's ability to capture one copy of each version of a page that a client observes; versions that no client requests are not captured.
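The scenario can be replayed in a few lines of Python. This is only a sketch of the reasoning above, not SiteStory's implementation: the time labels t2, t3, ... are treated as plain integers, the TA is assumed to be the "optimized" kind that skips versions it already holds, and the function names are ours.

```python
from bisect import bisect_right

# Timeline from the scenario: change label -> time index.
changes = {"C1": 2, "C2": 6, "C3": 8, "C4": 10, "C5": 13}
user_requests = [3, 5, 11]   # O1, O2, O3
crawler_visits = [4, 9]      # V1, V2

def version_at(t):
    """Label of the most recent change at or before time t (None before C1)."""
    labels = sorted(changes, key=changes.get)
    times = [changes[label] for label in labels]
    i = bisect_right(times, t)
    return labels[i - 1] if i else None

def archived_versions(access_times):
    """Versions stored by an archive that captures what is served at each
    access and skips versions it already holds."""
    stored = []
    for t in sorted(access_times):
        version = version_at(t)
        if version is not None and version not in stored:
            stored.append(version)
    return stored

# The TA sees every access (users and crawler); the crawler sees only its own.
ta_archive = archived_versions(user_requests + crawler_visits)
crawler_archive = archived_versions(crawler_visits)
not_yet_in_ta = [c for c in changes if c not in ta_archive]

print(ta_archive)        # ['C1', 'C3', 'C4']
print(crawler_archive)   # ['C1', 'C3']
print(not_yet_in_ta)     # ['C2', 'C5']
```

C2 is missing because nothing requested P between t6 and t8; C5 is missing only because no access has happened yet after t13.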

Los Alamos National Laboratory has developed SiteStory, an open-source transactional Web archive. First, mod_sitestory is installed on the Apache server that contains the content to be archived. When the Apache server builds the response for the requesting client, mod_sitestory sends a copy of the response to the SiteStory Web archive, which is deployed as a separate entity. This Web archive then provides Memento-based access to the content served by the Apache server with mod_sitestory installed, and the SiteStory Web archive is discoverable from the Apache web server using standard Memento conventions.
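As a rough illustration of what "Memento-based access" means for a client, the sketch below builds the `Accept-Datetime` request header that the Memento protocol uses for datetime negotiation. It makes no network call, and it says nothing about SiteStory's actual endpoints, which are not described here; the helper name `memento_request_headers` is ours.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_request_headers(when):
    """Build the Accept-Datetime header for a timezone-aware datetime.

    Memento expresses the desired datetime as an HTTP date in GMT.
    """
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    as_gmt = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_gmt, usegmt=True)}

# Ask for the page as it looked on the day SiteStory was announced.
headers = memento_request_headers(datetime(2012, 8, 17, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])
```

A client would send this header to a TimeGate for the original resource, which then redirects to the archived copy closest to the requested datetime.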

Sending a copy of the HTTP response to the archive is an additional task for the Apache Web server, and this task must not come at too great a performance penalty to the Web server. The goal of this study is to quantify the additional load mod_sitestory places on the Apache Web server whose content is being archived.

ApacheBench (ab) was used to gather throughput statistics for a server while SiteStory was actively archiving content and to compare them with statistics for the same server when SiteStory was not running. The figures below, taken from the technical report, show that SiteStory does not hinder a server's ability to provide content to users in a timely manner.

Total run time for the ab test with 10,000 connections and 1 concurrency.

Total run time for the ab test with 10,000 connections and 100 concurrency.

Total run time for the ab test with 216,000 connections and 1 concurrency.

Total run time for the ab test with 216,000 connections and 100 concurrency.
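For readers who want to run a similar comparison, the Python sketch below assembles ab command lines for the four configurations in the captions above. It assumes that "connections" corresponds to ab's `-n` option (total number of requests) and that concurrency corresponds to `-c`; the target URL is a placeholder, and the exact options used in the study are documented in the technical report, not here.

```python
def ab_command(url, requests, concurrency):
    """Return an ab invocation as an argument list (suitable for subprocess.run)."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]

TARGET = "http://localhost/"  # hypothetical server under test

# (total requests, concurrency) pairs taken from the figure captions.
configs = [(10_000, 1), (10_000, 100), (216_000, 1), (216_000, 100)]
commands = [ab_command(TARGET, n, c) for n, c in configs]

for cmd in commands:
    print(" ".join(cmd))
```

Each command would be run twice against the same server, once with mod_sitestory enabled and once without, and the total run times compared.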

To test the effect of sites with large numbers of embedded resources, 100 HTML pages were constructed with Page 0 containing 0 embedded images, Page 1 containing 1 embedded image, ..., Page n containing n embedded images. As expected, larger resources take longer to serve to a requesting user. SiteStory's overhead is also greater for larger resources, as depicted in the figures below.
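A test set like this is easy to generate. The Python sketch below builds 100 pages in memory, where page n embeds n images; the file names, image paths, and markup are invented for illustration, and we assume the pages are numbered 0 through 99.

```python
def build_page(n):
    """Return HTML for a page that embeds exactly n images."""
    images = "\n".join(
        f'    <img src="images/img_{i}.png" alt="image {i}">' for i in range(n)
    )
    return (
        "<html>\n"
        f"  <head><title>Page {n}</title></head>\n"
        "  <body>\n"
        f"{images}\n"
        "  </body>\n"
        "</html>\n"
    )

# Hypothetical file names mapped to page contents: page_0.html ... page_99.html.
pages = {f"page_{n}.html": build_page(n) for n in range(100)}

print(len(pages), pages["page_3.html"].count("<img "))
```

Writing each entry of `pages` to the server's document root, along with the referenced image files, would give one URL per page for benchmarking.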

As depicted in these figures, SiteStory does not significantly hinder a server, and it adds the ability to actively archive the content that server delivers. More details on these graphs can be found in the technical report, which has been posted to arXiv.org:

Justin F. Brunelle, Michael L. Nelson, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Technical Report 1209.1811v1, 2012.

--Justin F. Brunelle