Posts

2012-11-06: TPDL 2012 Conference

It all started last April, on the 9th to be exact, when I received an email from Dr. George Buchanan delivering the good news: my paper had been accepted at the annual international conference on Theory and Practice of Digital Libraries (TPDL 2012). As Program Chair, Dr. Buchanan sent me the reviews and feedback for my paper, entitled “Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?”, which paved the way for the following months of preparation to present it. When I submitted the paper, Dr. Nelson also gave me permission to submit my PhD proposal for consideration at the conference's Doctoral Consortium. Scoring my second goal, Dr. Birger Larsen and Dr. Stefan Gradmann sent me a delightful email announcing the committee's acceptance of my proposal, and I was invited to present my work at the consortium a day before the conference. The hat-trick came a few weeks before

2012-10-24: NFL Power Rankings Week 8

After running the R script for the week 8 rankings, the first thing that struck me was the disparity in node size between the AFC on the left side of our graph and the NFC on the right side. Two weeks ago we wrote that the NFC West has been dominant so far this year. The NFC West has the best combined record, and its aggregate point differential puts the others to shame. However, it is not just the West division: the entire NFC has dominated and outperformed the AFC at every turn. CBS Sports rates the NFC as head and shoulders above the AFC this year. Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser to the winner, weighted by the margin of victory. In the PageRank model

2012-10-11: NFL Power Rankings Week 6

It is now five weeks into the 2012 season, and the season is starting to come into focus. A topic of many online discussions is this year's performance of the NFC West division compared to last year's. The NFC West is one of the best-performing divisions so far this year, which is a far cry from last year. They are certainly doing well in our ranking system. Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser to the winner, weighted by the margin of victory. In the PageRank model, each link from a webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j. This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a nu
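The loser-votes-for-winner model described in this excerpt can be sketched in a few lines. This is a minimal illustration in Python, not the R script used for the actual rankings: the function name `rank_teams`, the damping factor of 0.85, the even redistribution of rank from undefeated teams, and the sample scores are all assumptions made for the example.

```python
from collections import defaultdict


def rank_teams(games, damping=0.85, iterations=100):
    """Weighted PageRank over a season graph (illustrative sketch).

    games: iterable of (winner, loser, margin) tuples, margin > 0.
    Returns a dict mapping each team to its rank; ranks sum to 1.
    """
    teams = sorted({team for w, l, _ in games for team in (w, l)})
    n = len(teams)

    # votes[loser][winner] = total margin of victory on the loser -> winner edge
    votes = {team: defaultdict(float) for team in teams}
    for winner, loser, margin in games:
        votes[loser][winner] += margin

    rank = {team: 1.0 / n for team in teams}
    for _ in range(iterations):
        new_rank = {team: (1.0 - damping) / n for team in teams}
        for loser in teams:
            total = sum(votes[loser].values())
            if total == 0:
                # Assumption: a team with no losses has no outgoing edges,
                # so its rank is spread evenly over all teams.
                for team in teams:
                    new_rank[team] += damping * rank[loser] / n
            else:
                # Each loser passes rank to the teams that beat it,
                # in proportion to the margin of victory.
                for winner, weight in votes[loser].items():
                    new_rank[winner] += damping * rank[loser] * weight / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical results: (winner, loser, margin of victory)
    sample = [("A", "B", 7), ("A", "C", 3), ("B", "C", 10)]
    for team, score in sorted(rank_teams(sample).items(), key=lambda kv: -kv[1]):
        print(team, round(score, 3))
```

Power iteration is used here to keep the sketch dependency-free; a graph library's PageRank routine with edge weights would serve the same purpose.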

2012-10-10: Zombies in the Archives

Image provided from http://www.taxhelpattorney.com/
In our current research, the WS-DL group has observed leakage in archived sites. Leakage occurs when archived resources include current content. I enjoy referring to such occurrences as "zombie" resources (which is appropriate given the upcoming Halloween holiday). That is to say, these resources are expected to be archived ("dead") but still reach into the current Web. In the examples below, this reach into the live Web is caused by URIs contained in JavaScript not being rewritten to be relative to the Web archive; the page in the archive is not pulling from the past archived content but is "reaching out" (zombie-style) from the archive to the live Web. We provide two examples with humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past will include advertisements from the present Web. 2008 memento of cnn.com f

2012-09-29: Data Curation, Data Citation, ResourceSync

On September 10-11, 2012, I attended the UNC/NSF Workshop "Curating for Quality: Ensuring Data Quality to Enable New Science" in Arlington. The workshop invited about 20 researchers involved with all aspects of data curation and solicited position papers on one of four broad topics: data quality criteria and contexts; human and institutional factors; tools for effective and painless curation; and metrics. Although the majority of the discussion was about science data, my position paper was about the importance of archiving the web: in short, treating the web as the corpus that should be retained for future research. The pending workshop report will have a full list of participants and their papers, but in the meantime I've uploaded my paper to arXiv, "A Plan for Curating `Obsolete Data or Resources'", which is a summary version of the slides I presented at the Web Archiving Cooperative meeting this summer. To be included in the worksh

2012-09-27: NFL Referee Kerfuffle

For the first three weeks of the 2012 NFL season, replacement officials have refereed the games due to an ongoing labor dispute between the referees and the NFL. Every fan of a team that has been on the losing side of a call has voiced an opinion on the abilities of the replacement referees. Even Jon Stewart had something to say about the labor dispute. This past Monday night during the Seahawks-Packers game, a controversial call essentially determined the winner of the game. This call opened the floodgates of angry recriminations and complaints directed at the replacement referees and the NFL. This was somewhat amusing to me, as the people complaining seem to forget all of the mistakes the regular referees appeared to make in previous years. In 2008, one of the best referees in the NFL, Ed Hochuli, made a rather horrendous call. I have to give him respect for owning up to it and apologizing. NFL fans have always complained about the offic

2012-08-31: Benchmarking LANL's SiteStory

On August 17th, 2012, Los Alamos National Laboratory's Herbert Van de Sompel announced the release of the anticipated transactional web archiver called SiteStory: "Very excited to announce the release of our SiteStory transactional archive solution #memento mementoweb.github.com/SiteStory/" (Herbert, @hvdsomp, August 17, 2012). The ODU WS-DL research group (in conjunction with The MITRE Corporation) performed a series of studies to measure the effect of SiteStory on web server performance. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources. A sneak peek at how SiteStory affects server performance is provided below. Please see the technical report for a full descri
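To put the reported timings in relative terms, here is a small sketch that uses only the four figures quoted in this excerpt. The function name `relative_slowdown` and the scenario labels are hypothetical, chosen for illustration.

```python
def relative_slowdown(before_s, after_s):
    """Return the fractional increase in response time from before_s to after_s."""
    return (after_s - before_s) / before_s


# Seconds per Web page access, as quoted in the excerpt.
reported = {
    "content server under load": (0.076, 0.086),
    "many embedded and changing resources": (0.15, 0.21),
}

for scenario, (before_s, after_s) in reported.items():
    print(f"{scenario}: {relative_slowdown(before_s, after_s):.1%} slower")
```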

2012-08-20: MS Thesis: An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication

I am pleased to report on the successful completion of my Master's Degree thesis entitled "An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication". The problem that I hoped to resolve with the study is one that plagues software like Archive Facebook even to this day: when the hierarchy of a social media website changes, tools created to preserve content on that site tend to break. By conforming these tools to a specification that is set up to represent the hierarchy of the target social media websites, these tools become adaptive without the need for continuous maintenance on the part of the developer. The study also explored and enumerated various aspects of personal web archiving that prevent the field from taking advantage of the tools, procedures and mediums that are widely used in conventional web archiving. In addition to simply identifying the problem, I also created a Google Chrome extension, W

2012-08-10: MS Thesis - Visualizing Digital Collections at Archive-It

Archive-It is a subscription web archiving service, provided by the Internet Archive, that allows institutions and users to create, maintain, and view digital collections of web resources. The current interface of Archive-It is largely text-based, supporting drill-down navigation using lists of URIs. While this interface provides good searching capabilities, it is not very efficient for browsing. This was our motivation for thinking about new visualizations to make it easy for users to browse Archive-It collections. This work, "Visualizing Digital Collections at Archive-It", was the subject of a recent MS thesis by Kalpesh Padia (who is continuing his Ph.D. studies at NC State University) and a JCDL 2012 short paper by Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle. In order to provide a better visual experience to users of Archive-It collections, we implemented six different visualizations (treemap, time cloud, bubble chart, image plot, timeline, and wo

2012-07-28: Four WS-DL Classes Offered in Fall 2012

The WS-DL group is pleased to offer four classes in the Fall 2012 semester: one undergraduate, one undergraduate/graduate, and two upper-level graduate classes.
CS 495: Python and Web Mining. The instructor for this class will be Hany SalahEldeen, and the course content will focus on data mining and machine learning, similar to the "Collective Intelligence" class offered in Spring 2009, with an additional introduction to programming in Python.
CS 418/518: Web Programming. The instructor for this class will be Dr. Weigle, and the course content will focus on programming in a LAMP environment. The most recent offering of this course was taught in Fall 2010.
CS 795/895: Information Visualization. The instructor for this class will be Dr. Weigle, and it will be a continuation of the class taught in Fall 2011 (see the semester project gallery for a sampling of topics). 2012-08-27 Edit: this class has been moved to Spring 2012.
CS 895: Web-Based Inform

2012-07-27: Digital Preservation 2012 Trip Report

Digital Preservation 2012 was held July 24-25 at the Sheraton Pentagon City in Arlington, Virginia. Previously known as the NDSA/NDIIPP (@ndsa2 / @ndiipp) Partner Meetup (see our trip report from 2011), this year's meeting had the theme "access to digital content under stewardship". Presentations covered the full range of digital preservation topics. Four representatives from the ODU Web Sciences and Digital Libraries group attended to present their research and recent work in various areas of the field.
WS-DL's Contributions to Digital Preservation 2012
Mat Kelly (@machawk1) presented a demo of a Google Chrome extension he developed called WARCreate, extending the initial poster/demo presented during WS-DL's trip to JCDL this past June. WARCreate allows a user to create a Web ARChive (WARC) file from any viewable webpage. Mat's main focus was on preserving content behind aut