Saturday, September 29, 2012

2012-09-29: Data Curation, Data Citation, ResourceSync

During September 10-11, 2012 I attended the UNC/NSF workshop "Curating for Quality: Ensuring Data Quality to Enable New Science" in Arlington.  The workshop invited about 20 researchers involved in all aspects of data curation and solicited position papers in one of four broad topics:
  1. data quality criteria and contexts
  2. human and institutional factors
  3. tools for effective and painless curation
  4. metrics
Although the majority of the discussion was about science data, my position paper was about the importance of archiving the web: in short, treating the web as the corpus that should be retained for future research.  The pending workshop report will have a full list of participants and their papers, but in the meantime I've uploaded to arXiv my paper, "A Plan for Curating 'Obsolete Data or Resources'", which is a summary version of the slides I presented at the Web Archiving Cooperative meeting this summer.

Also to be included in the workshop report are the results of various breakout sessions.  The sessions I participated in addressed questions such as: how contextual information should be archived with the data (cf. "preservation description information" and "knowledge base" from OAIS), how much of a university's institutional overhead goes to institutional repositories and archiving capability ("put everything in the cloud" is neither an informed nor an acceptable answer), and how to handle versioning and diff/patch in large data sets (tools like Galaxy and Google Refine were mentioned in the larger discussion).

(2012-10-23 edit: the final workshop report is now available.)

A nice complement to the Data Curation workshop was the NISO workshop "Tracking it Back to the Source: Managing and Citing Research Data", held in Denver on September 24.  This one-day workshop focused on how to cite and link to scientific data sets (a topic that came up several times in the UNC workshop as well).  While I applaud the move to make data sets first-class objects in the scholarly communication infrastructure, I always feel there is an unstoppable momentum to "solve" the problem by simply saying "use DOIs" (e.g., DataCite), while ignoring the hard issues of what exactly a DOI refers to (see: ORE Primer) and how to version what it might point to (see: Memento), as well as the minor quibble that DOIs aren't actually URIs (look it up: "doi" is not in the registry).  In short, DOIs are a good start, but they just push the problem one level down instead of solving it.  Highlights from the workshop included a ResourceSync+Memento presentation from Herbert Van de Sompel and "Data Equivalence" by Mark Parsons of the NSIDC.

After the NISO workshop, there was a two-day ResourceSync working group meeting (September 25-26) in Denver.  We made a great deal of progress on the specification; the pre-meeting (0.1) version of the specification is no longer valid.  Many issues are still being considered and I won't cover the details here, but the main result is that the ResourceSync format will no longer be based on Sitemaps.  We were all disappointed to have to make that break, but Martin Klein did a nice set of experiments (to be released later) showing that, despite Sitemaps being superficially suitable for the job, there were just too many areas where their primary focus of advertising URIs to search engines inhibited the more nuanced use of advertising resources that have changed.


Thursday, September 27, 2012

2012-09-27: NFL Referee Kerfuffle

For the first three weeks of the 2012 NFL season, replacement officials have refereed the games due to an ongoing labor dispute between the referees and the NFL. Every fan of a team that has been on the losing side of a call has voiced their opinion on the abilities of the replacement referees. Even Jon Stewart had something to say about the labor dispute.

This past Monday night during the Seahawks - Packers game, a controversial call essentially determined the winner of the game. This call unleashed a flood of angry recriminations and complaints directed at the replacement referees and the NFL. This was somewhat amusing to me, as the people complaining seem to forget about all of the mistakes the regular referees appeared to make in all of the previous years. In 2008, one of the best referees in the NFL, Ed Hochuli, made a rather horrendous call. I have to give him respect for owning up to it and apologizing. NFL fans have always complained about the officiating, warranted or not.

Seeing as how I have been collecting NFL statistics for a number of years, I decided to see what the data could tell me about the replacement referees' performance. First, I wanted to see if there was a disparity in the number of penalties called by the referees during the first three weeks of this year compared to the first three weeks of other years.

Year Mean penalties per game
2002 13.2609
2003 15.4783
2004 14.2609
2005 15.5870
2006 12.3261
2007 11.4583
2008 12.3617
2009 12.3333
2010 13.1489
2011 13.0417
2012 13.6250

The average number of penalties appears to be consistent with the previous decade.
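As a sketch of the comparison above, each year's mean can be computed from a list of per-game penalty totals. The game totals below are invented for illustration; they are not the actual dataset behind the table.

```python
# Hypothetical sketch: compute the mean penalties per game for the
# first three weeks of a season. The per-game totals below are made
# up for illustration; the real dataset covers 46-48 games per season.

def mean_penalties(games):
    """games: list of total penalties called in each game (both teams)."""
    return sum(games) / len(games)

weeks_1_to_3 = [13, 15, 11, 14, 16, 12, 13, 14]  # illustrative subset
print(round(mean_penalties(weeks_1_to_3), 4))
```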

One concern that I read about was that the replacement referees were local and would favor the home team. Indeed, one referee was removed from his assignment after some of his Facebook posts described him as a Saints fan. Just one more reason to watch what you release on social media.
So, have the home teams done better this year than in other years?

Year Home Wins
2002 23
2003 25
2004 26
2005 29
2006 21
2007 30
2008 28
2009 25
2010 27
2011 31
2012 31

Looking at the number of home team wins in the first three weeks of each season shows that the 31 wins in 2012, while a little higher than average and exactly equal to last year's total, are nowhere close to a statistical anomaly. This led me to wonder what Vegas thinks about the whole situation. The collective intelligence of the NFL fan population, as realized in the Vegas spread, has been the focus of much of my research.
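To put a rough number on "nowhere close to a statistical anomaly", one can ask how likely 31 or more home wins would be under an assumed home-win probability. The sketch below uses an exact binomial tail; the 48-game count (three 16-game weeks) and the 0.57 home-win rate are assumptions for illustration, not figures taken from the post's dataset.

```python
# Sketch: probability of seeing at least k home wins in n games if the
# home team wins each game independently with probability p. Both
# n = 48 and p = 0.57 are illustrative assumptions.
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(binom_tail(48, 31, 0.57), 3))
```

A tail probability well above conventional significance thresholds would support the "not an anomaly" reading.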

ESPN and other sites are reporting that over $100 million was lost as a result of the controversial call on Monday night. How far off from reality has the Vegas line been this year compared to the past 20 years?

This figure shows the average difference and standard deviation between the Vegas betting line and the actual margin of victory over the first three weeks of each year. Negative numbers indicate that either the visitor performed better than expected or the home team was favored more than it should have been. 2012 is a little less than -2, so the argument could be made that NFL fans favored the home teams a little more than they should have. The key takeaway is that the results are well within normal values, and maybe even a little more consistent than any other season in the past two decades.
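The quantity behind the figure can be sketched as follows: for each game, subtract the Vegas line from the actual home margin of victory, then summarize the differences. The (line, margin) pairs below are invented for illustration (chosen so the mean lands near the -2 mentioned above); they are not real lines or scores.

```python
# Sketch of the figure's computation: mean and standard deviation of
# (actual home margin of victory) minus (Vegas line for the home team).
# Negative differences mean the home team underperformed the spread.
from statistics import mean, stdev

games = [  # (vegas_line_for_home, actual_home_margin) -- illustrative
    (-3.0, -7), (4.5, 2), (-7.0, -10), (2.5, 0), (6.0, 7), (-1.0, -2),
]

diffs = [margin - line for line, margin in games]
print(round(mean(diffs), 2), round(stdev(diffs), 2))
```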

This little experiment did not really prove anything other than that the first three weeks of this season have not been statistically different from any other season in the past decade. Things being what they are, people will still find something to complain about, and the search for someone to blame will always be successful.

The news is reporting that starting tonight the regular referees will be back and all will be well with the world. The question I leave you with is after the next controversial call, who will they blame?

-- Greg Szalkowski

Monday, September 3, 2012

2012-08-31: Benchmarking LANL's SiteStory

On August 17th, 2012, Los Alamos National Laboratory's Herbert Van de Sompel announced the release of the anticipated transactional web archiver called SiteStory.

The ODU WS-DL research group (in conjunction with The MITRE Corporation) performed a series of studies to measure the effect of SiteStory on web server performance. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.

A sneak peek at how SiteStory affects server performance is provided below. Please see the technical report for a full description of these results. But first, let's compare the archival behaviors of transactional and conventional Web archives.

Crawler and user visits generate archived copies of a changing page.

A visual representation of a typical page change and user access scenario is depicted in the above figure. This scenario assumes an arbitrary page that will be called P changes at inconsistent intervals. This timeline shows page P changes at points C1, C2, C3, C4, and C5 at times t2, t6, t8, t10, and t13, respectively. A user makes a request for P at points O1, O2, and O3 at times t3, t5, and t11, respectively. A Web crawler (that captures representations for storage in a Web archive) visits P at points V1 and V2 at times t4 and t9, respectively. Since O1 occurs after change C1, an archived copy of C1 is made by the transactional archive (TA). When O2 is made, P has not changed since O1 and therefore an archived copy is not made, since one already exists. The Web crawler's visit V1 captures C1 and stores a copy in the Web archive. In servicing V1, an unoptimized TA will store another copy of C1 at t4, while an optimized TA could detect that no change has occurred and not store another copy of C1.

Change C2 occurs at time t6, and C3 occurs at time t8. There was no access to P between t6 and t8, which means C2 is lost -- an archived copy exists in neither the TA nor the Web crawler's archive. However, the argument can be made that if no entity observed the change, should it be archived? Change C3 occurs and is archived during the crawler's visit V2, and the TA will also archive C3. After C4, a user accesses P at O3, creating an archived copy of C4 in the TA. In the scenario depicted in Figure 1, the TA will have changes C1, C3, and C4, while a conventional archive will only have C1 and C3. Change C2 was never served to any client (human or crawler) and is thus not archived by either system. Change C5 will be captured by the TA when P is accessed next.

The example in the above figure demonstrates a transactional archive's ability to capture each user-observed version of a page exactly once, although it cannot capture versions unseen by users.
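The capture logic described above can be replayed as a small simulation. The event times follow the figure; the code is a sketch of the optimized-TA behavior (deduplicating unchanged versions), not SiteStory's actual implementation.

```python
# Replay the figure's timeline in time order and record which versions
# a transactional archive (TA) and a crawler-driven archive capture.
events = [  # (time, kind, label)
    (2, "change", "C1"), (3, "user", "O1"), (4, "crawl", "V1"),
    (5, "user", "O2"), (6, "change", "C2"), (8, "change", "C3"),
    (9, "crawl", "V2"), (10, "change", "C4"), (11, "user", "O3"),
    (13, "change", "C5"),
]

current, ta, crawler = None, [], []
for _, kind, label in sorted(events):
    if kind == "change":
        current = label                     # page P now serves this version
    elif kind == "user":
        if current and current not in ta:   # optimized TA: skip duplicates
            ta.append(current)
    elif kind == "crawl":
        if current:
            crawler.append(current)         # conventional archive copy
            if current not in ta:           # crawler access also hits the TA
                ta.append(current)

print("TA:", ta)            # ['C1', 'C3', 'C4']
print("Crawler:", crawler)  # ['C1', 'C3']
```

C2 appears in neither list (no one accessed it), and C5 is still pending the next access, matching the scenario above.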

Los Alamos National Laboratory has developed SiteStory, an open-source transactional Web archive. First, mod_sitestory is installed on the Apache server that contains the content to be archived. When the Apache server builds the response for the requesting client, mod_sitestory sends a copy of the response to the SiteStory Web archive, which is deployed as a separate entity. This Web archive then provides Memento-based access to the content served by the Apache server with mod_sitestory installed, and the SiteStory Web archive is discoverable from the Apache web server using standard Memento conventions.

Sending a copy of the HTTP response to the archive is an additional task for the Apache Web server, and this task must not come at too great a performance penalty to the Web server. The goal of this study is to quantify the additional load mod_sitestory places on the Apache Web server to be archived.

ApacheBench (ab) was used to gather the throughput statistics of a server when SiteStory was actively archiving content and compare those statistics to those of the same server when SiteStory was not running. The below figures from the technical report show that SiteStory does not hinder a server's ability to provide content to users in a timely manner.

Total run time for the ab test with 10,000 connections and 1 concurrency.

Total run time for the ab test with 10,000 connections and 100 concurrency.

Total run time for the ab test with 216,000 connections and 1 concurrency.

Total run time for the ab test with 216,000 connections and 100 concurrency.

To test the effect of sites with large numbers of embedded resources, 100 HTML pages were constructed, with Page 0 containing 0 embedded images, Page 1 containing 1 embedded image, ..., and Page n containing n embedded images. As expected, larger resources take longer to serve to a requesting user. SiteStory is affected more by larger resources, as depicted in the below figures.
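A test set like that can be generated with a short script. The file names and image tags below are assumptions about the setup, not the actual test harness used in the study.

```python
# Sketch: generate page0.html .. page99.html, where page n embeds
# n image tags. Directory and file names are illustrative assumptions.
import os

def make_pages(outdir, count=100):
    os.makedirs(outdir, exist_ok=True)
    for n in range(count):
        imgs = "\n".join(f'<img src="img{i}.jpg">' for i in range(n))
        html = f"<html><body><h1>Page {n}</h1>\n{imgs}\n</body></html>"
        with open(os.path.join(outdir, f"page{n}.html"), "w") as f:
            f.write(html)

make_pages("test_pages")
```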

As depicted in these figures, SiteStory does not significantly hinder a server, and it adds the ability to actively archive content served from a server. More details on these graphs can be found in the technical report, which has been posted to arXiv:

Justin F. Brunelle, Michael L. Nelson, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Technical Report 1209.1811v1, 2012.

--Justin F. Brunelle