Sunday, April 5, 2015

2015-04-05: From Student To Researcher...

In 2010, I decided to return to the Old Dominion University Computer Science Department to improve my employment opportunities. After taking some classes, I realized that I wanted to do more than take classes and earn a Master's Degree; I also wanted to contribute knowledge, like the authors of the many research papers I had read during my courses.

My Master's Thesis is titled "Avoiding Spoilers On MediaWiki Fan Sites Using Memento".   I came to the topic via a strange route.

During Dr. Nelson's Introduction to Digital Libraries course, we built a digital library based on a single fictional universe.  I chose the television show Lost, and specifically archived Lostpedia, a site that my wife and I used while watching and discussing the show.  We realized that fans were updating Lostpedia while episodes aired.  This highlighted the idea that wiki revisions created prior to the episode obviously did not contain information about that episode, and emphasized that episodes led to wiki revisions.

A few years later, while I was watching Game of Thrones, a discussion about the show came up at work. I realized that some of us had seen the previous night's episode while others had not. We wanted to use the Game of Thrones Wiki to continue our conversation, but realized that those who had not seen the episode would easily encounter spoilers. By this point, I was quite familiar with Memento, had used Memento for Chrome, and was working on the Memento MediaWiki Extension. The idea of using Memento to avoid spoilers was born.

The figure above exhibits the Naïve Spoiler Concept. The concept is that wiki revisions created before a given episode should not contain spoilers for it: the episode has not yet revealed the information, so fans could not have written about it. Conversely, wiki revisions created after a given episode will likely contain spoilers, since episodes cause fans to write wiki revisions.

It turned out that there was more to avoiding spoilers in fan wiki sites than merely using Memento and the Naïve Spoiler Concept.  Most TimeGates use a heuristic that is not reliable for avoiding spoilers, so I proposed a new one and demonstrated why the existing heuristic was insufficient by calculating the probability of encountering a spoiler using the current heuristic.  I also used the Memento MediaWiki Extension to demonstrate this new heuristic in action.  In this way I was able to develop a Computer Science Master's Thesis on the topic.

Mindist (minimum distance) is the heuristic used by most TimeGates. It works well for a sparse archive, because the memento closest to the datetime you have requested is often the best one available. Wikis have access to every revision, which allows us to use a new heuristic, minpast (minimum distance in the past, that is, minimum distance without going over the given date). Using records from fan wikis, I showed that, if one is trying to avoid spoilers, there can be as much as a 66% chance of encountering a spoiler when using the Wayback Machine or a Memento TimeGate that uses mindist. I also analyzed Wayback Machine logs for requests and found that 19% of those requests ended up in the future. From these studies, it was clear that using minpast directly on wikis was the best way to avoid spoilers.
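To make the difference between the two heuristics concrete, here is a minimal Python sketch. It is not the code used by any TimeGate or by the Memento MediaWiki Extension; the function names and the example datetimes are made up for illustration, and it assumes the list of memento (or revision) datetimes is already known.

```python
from bisect import bisect_right
from datetime import datetime


def mindist(mementos, requested):
    """Return the memento datetime closest to `requested`, past or future."""
    return min(mementos, key=lambda m: abs(m - requested))


def minpast(mementos, requested):
    """Return the latest memento at or before `requested`, or None if none exists."""
    ordered = sorted(mementos)
    idx = bisect_right(ordered, requested)  # count of mementos <= requested
    return ordered[idx - 1] if idx > 0 else None


# Hypothetical page history: one revision well before an episode airs,
# and one written shortly after it airs.
revisions = [
    datetime(2015, 4, 1, 12, 0),   # written before the episode
    datetime(2015, 4, 6, 0, 30),   # written after the episode (spoilers)
]

# The reader asks for the page as it was at noon on the day the episode airs.
requested = datetime(2015, 4, 5, 12, 0)

closest = mindist(revisions, requested)   # the post-episode revision
safe = minpast(revisions, requested)      # the pre-episode revision
print("mindist:", closest)
print("minpast:", safe)
```

In this example mindist selects the revision written after the episode, because it is only about half a day away from the requested datetime, while minpast never returns anything later than the requested datetime.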

While I was examining fan wikis for spoilers, I also had the opportunity to compare wiki revisions with mementos recorded by the Internet Archive. Using this information, I was able to reveal how the Internet Archive's sparsity is changing over time. Because wikis keep track of every revision, we can see which updates the Internet Archive missed.

In the figure above, we see a timeline for each wiki page included in the study. The X-axis shows time and the Y-axis consists of an identifier for each wiki page. Darker colors indicate more missed updates by the Internet Archive. We see that the colors are getting lighter, meaning that the Internet Archive has become more aggressive in recording pages.
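One way to count missed updates can be sketched as follows. This is only an illustration under a stated assumption, not necessarily the definition used in the thesis: a revision counts as missed if the archive recorded no memento during the period when that revision was the current version of the page. The function name and the example datetimes are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def count_missed_updates(revisions, mementos):
    """Count revisions never captured while they were the current version.

    Assumption: revision i is captured if at least one memento falls in
    [revisions[i], revisions[i + 1]); the last revision stays current
    indefinitely, so any later memento captures it.
    """
    revs = sorted(revisions)
    mems = sorted(mementos)
    missed = 0
    for i, start in enumerate(revs):
        end = revs[i + 1] if i + 1 < len(revs) else None
        j = bisect_left(mems, start)  # index of first memento at or after start
        captured = j < len(mems) and (end is None or mems[j] < end)
        if not captured:
            missed += 1
    return missed


# Hypothetical data: four wiki revisions, two archive captures.
wiki_revisions = [datetime(2015, 1, d) for d in (1, 3, 5, 9)]
archive_mementos = [datetime(2015, 1, d) for d in (2, 10)]

missed_count = count_missed_updates(wiki_revisions, archive_mementos)
print(missed_count)
```

Here the revisions from January 3 and January 5 were each replaced before the archive visited the page again, so they are the ones counted as missed.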

Below are the slides for the presentation, available on my SlideShare account, followed by the video of my defense posted to YouTube.  The full document of my Master's Thesis is available here.

Thanks to Dr. Irwin Levinstein and Dr. Michele Weigle for serving on my committee. Their support has been invaluable during this process. Were it not for Dr. Levinstein, I would not have been able to become a graduate student. Were it not for Dr. Weigle's wonderful Networking class, I would not have been able to draw some of the conclusions necessary to complete this thesis.

Much of the thanks goes to my advisor, Dr. Michael L. Nelson, who spent hours discussing these concepts with me, helping correct my assumptions and assessments when I erred, while praising the experience when I came up with something original and new.  His patience and devotion not only to the area of study, but also the art of mentoring, led me down the path of success.

In the process of creating this thesis, I also created a technical report which can be referenced using the BibTeX code below.

So, what is next?  Do I use wikis to study the problem of missed updates in more detail? Do I study the use of the naïve spoiler concept in another setting?  Or do I do something completely different?

I realize that I have merely begun my journey from student to researcher, but I know even more now that I will enjoy the path I have chosen.

--Shawn M. Jones, Researcher
