2014-07-08: Potential MediaWiki Web Time Travel for Wayback Machine Visitors
Over the past year, I've been working on the Memento MediaWiki Extension. In addition to trying to produce a decent product, we've also been trying to build support for the Memento MediaWiki Extension at WikiConference USA 2014. Recently, we've reached out via Twitter to raise awareness and find additional supporters.
To that end, we attempt to answer two questions:
- The Memento extension provides the ability to access the page revision closest to, but not later than, the datetime specified by the user. As mentioned in an earlier blog post, the Internet Archive only has access to the revisions of articles that existed at the time it crawled, but a wiki can access every revision. How effective is the Wayback Machine at ensuring that visitors gain access to pages close to the datetimes they desire?
- How many visitors of the Wayback Machine could benefit from the use of the Memento MediaWiki Extension?
Answering the second question gives us an idea of the potential user base for the Memento MediaWiki Extension.
Thanks to Yasmin AlNoamany's work in "Who and What Links to the Internet Archive", we have access to 766 GB of (compressed) anonymized Internet Archive logs in a common Apache format. Each log file represents a single day of access to the Wayback Machine. We can use these logs to answer these questions.
Effectiveness of accessing closest desired datetime in the Wayback Machine
How effective is the Wayback Machine at ensuring that visitors gain access to pages close to the datetimes they desire?
To answer the first question, I ran a shell command over a single log file to find an English Wikipedia page that could serve as an example to trace through the logs. That command was used only to search for an answer to this first question.
From that command, I found a Wayback Machine capture of a Wikipedia article about the Gulf War. The logs were anonymized, so of course I couldn't see the actual IP address of the visitor, but I was able to follow the path of referrers back to see what path the user took as they browsed via the Wayback Machine.
We see that the user engages in a Dive pattern, as defined in Yasmin AlNoamany's "Access Patterns for Robots and Humans in Web Archives".
- http://web.archive.org/web/20071218235221/angel.ap.teacup.com/gamenotatsujin/24.html
- http://web.archive.org/web/20080112081044/http://angel.ap.teacup.com/gamenotatsujin/259.html
- http://web.archive.org/web/20071228223131/http://angel.ap.teacup.com/gamenotatsujin/261.html
- http://web.archive.org/web/20071228202222/http://angel.ap.teacup.com/gamenotatsujin/262.html
- http://web.archive.org/web/20080105140810/http://angel.ap.teacup.com/gamenotatsujin/263.html
- http://web.archive.org/web/20071228202227/http://angel.ap.teacup.com/gamenotatsujin/264.html
- http://web.archive.org/web/20071228223136/http://angel.ap.teacup.com/gamenotatsujin/267.html
- http://web.archive.org/web/20071228223141/http://angel.ap.teacup.com/gamenotatsujin/268.html
- http://web.archive.org/web/20080102052100/http://en.wikipedia.org/wiki/Gulf_War
The point of this exercise was not to read this Japanese blog that the user was initially interested in. From this series of referrers, we see that the end user chose the original URI with a datetime of 2007/12/18 23:52:21 (from the 20071218235221 part of the archive.org URI). It is the best we can do to determine which Accept-Datetime they would have chosen if they were using Memento. What they actually got at the end was an article with a Memento-Datetime of 2008/01/02 05:21:00.
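As a minimal sketch of how the datetimes above can be read out of the capture URIs, the snippet below parses the 14-digit timestamp that the Wayback Machine embeds in each URI. The function name `parse_wayback_uri` is hypothetical, and treating the timestamp as UTC is an assumption made for this illustration.

```python
import re
from datetime import datetime, timezone

# Capture URIs have the shape /web/YYYYMMDDhhmmss/<original URI>.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_uri(uri):
    """Return (capture datetime, original URI) for a Wayback capture URI.

    Hypothetical helper; the timestamp is assumed to be in UTC.
    """
    match = WAYBACK_RE.match(uri)
    if match is None:
        raise ValueError(f"not a Wayback Machine capture URI: {uri}")
    stamp, original = match.groups()
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, original


# First and last URIs from the referrer chain above.
first, _ = parse_wayback_uri(
    "http://web.archive.org/web/20071218235221/angel.ap.teacup.com/gamenotatsujin/24.html"
)
last, _ = parse_wayback_uri(
    "http://web.archive.org/web/20080102052100/http://en.wikipedia.org/wiki/Gulf_War"
)
print(first.isoformat())
print(last.isoformat())
print(last - first)  # how far the visitor drifted from the datetime they started with
```

Comparing the two values shows the drift of roughly two weeks between the datetime the visitor started with and the capture they ended up on.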
The Internet Archive produced a page that maps to revision 181419148 (1 January 2008), rather than revision 178800602 (19 December 2007), which is the closest revision to what the visitor actually desired.
What did the user miss out on by getting the more recent version of the article? The old revision discusses how the Gulf War was the last time the United States used battleships in war, but an editor in between decided to strike this information from the article. The old revision listed different countries in the Gulf War coalition than the new revision.
So, seeing as the Internet Archive's Wayback Machine slides the user from date to date, they end up getting a different revision than they originally desired. This algorithm makes sense in an archival environment like the Wayback Machine, where the mementos are sparse.
The Memento MediaWiki Extension has access to all revisions, meaning that the user can get the revision closest to the date they want.
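The idea of picking the revision closest to, but not later than, a desired datetime can be sketched with a binary search over a sorted revision history. This is only an illustration, not the extension's actual implementation; the function name, revision IDs, and timestamps below are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def closest_not_after(revisions, desired):
    """Return the latest (timestamp, revision_id) pair with timestamp <= desired.

    `revisions` must be sorted by timestamp in ascending order.
    Returns None when every revision is newer than `desired`.
    """
    timestamps = [timestamp for timestamp, _ in revisions]
    index = bisect_right(timestamps, desired)
    return revisions[index - 1] if index else None


# Hypothetical revision history for one article (IDs and times are made up).
revisions = [
    (datetime(2007, 12, 1, 8, 0, 0), 101),
    (datetime(2007, 12, 17, 14, 30, 0), 102),
    (datetime(2008, 1, 1, 9, 15, 0), 103),
]

desired = datetime(2007, 12, 18, 23, 52, 21)
print(closest_not_after(revisions, desired))
```

Because a wiki holds every revision, a lookup like this always has a dense history to choose from, whereas an archive can only choose among the captures it happened to make.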
Potential Memento MediaWiki Extension Users at the Internet Archive
How many visitors of the Wayback Machine could benefit from the use of the Memento MediaWiki Extension?
The second question involves discovering how many visitors are using the Wayback Machine for browsing Wikipedia when they could be using the Memento MediaWiki Extension.
We processed these logs in several stages to find the answer, using different scripts and commands than the one used earlier.
First, a simple grep command was run on each log file. The variable $inputfile was the compressed log file, and the $outputfile was stored in a separate location.
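For readers who want to reproduce this first pass without grep, a rough Python equivalent might look like the sketch below. It is not the command we ran: the function names are hypothetical, and both the gzip compression and the `wikipedia.org` search term are assumptions made for the illustration.

```python
import gzip


def matching_lines(lines, needle):
    """Yield only the lines that contain `needle` (a plain substring test, like fixed-string grep)."""
    for line in lines:
        if needle in line:
            yield line


def filter_log(inputfile, outputfile, needle="wikipedia.org"):
    """Copy matching lines from a gzip-compressed log to a plain-text file.

    Assumes gzip compression; returns the number of lines kept.
    """
    kept = 0
    with gzip.open(inputfile, "rt", encoding="utf-8", errors="replace") as source, \
            open(outputfile, "w", encoding="utf-8") as destination:
        for line in matching_lines(source, needle):
            destination.write(line)
            kept += 1
    return kept
```

Streaming line by line keeps memory use flat, which matters on a small virtual machine working through hundreds of gigabytes.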
Considering we are looping through 766 GB of data, this took quite some time to complete on our dual-core 2.4 GHz virtual machine with 2 GB of RAM.
As Yasmin AlNoamany showed in "Who and What Links to the Internet Archive", wikipedia.org is the biggest referrer to the Internet Archive, but we wanted to count only direct requests for archived Wikipedia pages. Wikipedia links to the Internet Archive to keep its article references from becoming dead links, so these logs contain many entries whose only connection to Wikipedia is the referrer. We needed to exclude those entries.
We ran a simple Python script on each of the 288 output files returned from the first pass, stripping out all referrers containing the string 'wikipedia.org'.
Python was used because it offered better performance than merely using a combination of sed, grep, and awk to achieve the same goal.
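A referrer-stripping pass along these lines could be sketched as follows. This is not the original script: it assumes the logs follow the Apache combined log format (the referrer is the first quoted field after the status code and response size), and the function name and sample line are hypothetical.

```python
import re

# Assumed Apache combined log format:
# host ident user [time] "request" status size "referrer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<head>\S+ \S+ \S+ \[[^\]]+\] "(?:[^"\\]|\\.)*" \S+ \S+ )'
    r'"(?P<referrer>(?:[^"\\]|\\.)*)"'
    r'(?P<tail>.*)$'
)


def strip_referrer(line, needle="wikipedia.org"):
    """Replace the referrer field with "-" when it contains `needle`.

    Lines that do not parse, or whose referrer does not match, are returned unchanged.
    """
    body = line.rstrip("\n")
    ending = line[len(body):]  # preserve the original line ending
    match = LOG_RE.match(body)
    if match and needle in match.group("referrer"):
        return match.group("head") + '"-"' + match.group("tail") + ending
    return line


# Hypothetical log entry: the request is for example.org, only the referrer is Wikipedia.
sample = (
    '0.0.0.0 - - [01/Jan/2012:00:00:00 +0000] '
    '"GET /web/2008/http://example.org/ HTTP/1.1" 200 512 '
    '"http://en.wikipedia.org/wiki/Gulf_War" "ExampleAgent/1.0"\n'
)
print(strip_referrer(sample), end="")
```

After this pass, a line mentions wikipedia.org only if the request itself was for an archived Wikipedia page.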
Once we had stripped the referrers from the processed log data, we could count the requests for Wikipedia with another script, run with wikipedia.org as the search term. Because we had removed the referrers, only actual requests for wikipedia.org should remain.
Because each log file represents one day of activity, this script gives us a CSV containing a date and a count of how many wikipedia.org requests occur for that day.
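The counting step can be illustrated with the sketch below, which is not the original script. The function names, the date labels, and the abbreviated log lines are hypothetical; in practice each day's lines would come from one referrer-stripped log file.

```python
import csv
import sys


def count_requests(lines, term):
    """Count the log lines that still mention `term` after referrer stripping."""
    return sum(1 for line in lines if term in line)


def daily_counts(logs_by_day, term):
    """Return sorted (day, count) rows; `logs_by_day` maps a date label to that day's lines."""
    return [(day, count_requests(lines, term)) for day, lines in sorted(logs_by_day.items())]


def write_csv(rows, handle):
    """Write the rows as CSV with a date,count header."""
    writer = csv.writer(handle)
    writer.writerow(["date", "count"])
    writer.writerows(rows)


# Hypothetical two-day sample with abbreviated log lines.
sample = {
    "2012-01-02": ['"GET /web/2008/http://en.wikipedia.org/wiki/Gulf_War HTTP/1.1"'],
    "2012-01-01": [
        '"GET /web/2008/http://en.wikipedia.org/wiki/Main_Page HTTP/1.1"',
        '"GET /web/2008/http://example.org/ HTTP/1.1"',
        '"GET /web/2007/http://de.wikipedia.org/wiki/Golfkrieg HTTP/1.1"',
    ],
}
write_csv(daily_counts(sample, "wikipedia.org"), sys.stdout)
```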
Now that we have a list of counts from each day, it is easy to take the numbers from the count column in this CSV and find the mean. Again, enter Python, because it was simple and easy.
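Computing the mean from such a CSV takes only a few lines. The sketch below is illustrative rather than the original script; the `date` and `count` column names and the sample values are assumptions.

```python
import csv
from statistics import mean


def mean_daily_count(csv_lines, column="count"):
    """Return the mean of the integer `column` across all rows of a CSV with a header row."""
    reader = csv.DictReader(csv_lines)
    return mean(int(row[column]) for row in reader)


# Hypothetical CSV contents, one row per daily log file.
sample_csv = [
    "date,count",
    "2012-01-01,120",
    "2012-01-02,80",
    "2012-01-03,100",
]
print(mean_daily_count(sample_csv))
```

The same function works on an open file handle, since `csv.DictReader` accepts any iterable of lines.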
It turns out that the Wayback Machine, on average, receives 131,438 requests for Wikipedia articles each day.
If we perform the same exercise for the other popular wiki sites on the web, we get the results shown in the table below.
Wiki Site | Mean number of daily requests to the Wayback Machine |
---|---|
*.wikipedia.org (All Wikipedia sites) | 131,438 |
*.wikimedia.org (Wikimedia Commons) | 26,721 |
*.wikia.com (All Wikia sites) | 9,574 |
So, there are a potential 168,000 Memento requests per day that could benefit if these wikis used the Memento MediaWiki Extension.
On top of that, these logs represent a snapshot in time for the Wayback Machine only. The Internet Archive has other methods of access that were not included in this study, so the actual number of potential Memento requests per day is likely much higher.
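For anyone checking the arithmetic, the 168,000 figure is simply the sum of the three daily means from the table, rounded to the nearest thousand:

```python
# Mean daily Wayback Machine requests per wiki family, taken from the table above.
daily_means = {
    "*.wikipedia.org": 131_438,
    "*.wikimedia.org": 26_721,
    "*.wikia.com": 9_574,
}

total = sum(daily_means.values())
print(total)             # 167733
print(round(total, -3))  # 168000
```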
Summary
We have established support for two things:
- the Memento MediaWiki Extension will produce results closer to the date requested than the Wayback Machine
- there are a potential 168,000 Memento requests per day that could benefit from the Memento MediaWiki Extension
--Shawn M. Jones