Posts

2018-05-15: Archives Unleashed: Toronto Datathon Trip Report

The Archives Unleashed team (pictured below) hosted a two-day datathon, April 26-27, 2018, at the University of Toronto’s Robarts Library. This time around, Shawn Jones and I were selected to represent the Web Science and Digital Libraries (WSDL) research group from Old Dominion University. This event was the first in a series of four planned datathons to give researchers, archivists, computer scientists, and many others the opportunity to get hands-on experience with the Archives Unleashed Toolkit (AUT) and provide valuable feedback to the team. The AUT facilitates analysis and processing of web archives at scale, and the datathons are designed to help participants find ways to incorporate these tools into their own workflows. Check out the Archives Unleashed team on Twitter and their website to find other ways to get involved and stay up to date with the work they’re doing. Archives Unleashed datathon organizers (left to right): Nich Worby, Ryan Deschamps, Ia

2018-05-04: An exploration of URL diversity measures

Fig. 1: Animal portraits by Morten Koldby

Recently, as part of a research effort to describe a collection of URLs, I was faced with the problem of identifying a quantitative measure that indicates how many different kinds of URLs there are in a collection. In other words, what is the level of diversity in a collection of URLs? Ideally, a diversity measure should produce a normalized value between 0 and 1. A value of 0 means no diversity, for example, a collection of duplicate URLs (Fig. 2, first row, first column). In contrast, a diversity value of 1 indicates maximum diversity, meaning all the URLs are different (Fig. 2, first row, last column):
1. http://www.cnn.com/path/to/story?p=v
2. https://www.vox.com/path/to/story
3. https://www.foxnews.com/path/to/story
Surprisingly, I did not find a standard URL diversity measure in the Web Science community, so I introduced the WSDL diversity index (described below). I acknowledge there may be other URL diversity measures in the Web Sc

2018-04-30: A High Fidelity MS Thesis, To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages

It is hard to believe that the time has come for me to write a wrap-up blog post about the adventure that was my master's degree and the thesis that got me to this point. If you follow this blog with any regularity, you may remember two posts of mine that were the genesis of my thesis topic:
2017-01-20: CNN.com has been unarchivable since November 1st, 2016
2017-03-09: A State Of Replay or Location, Location, Location
Bonus points if you can guess the general topic of the thesis from the titles of those two blog posts. However, it is OK if you cannot, as I will give an oh-so-brief TL;DR. The replay problems with cnn.com were, sadly, your typical here-today, gone-tomorrow replay issues involving this little thing, that I have come to , known as JavaScript. What we also found out, when replaying mementos of cnn.com from the major web archi