Posts

Showing posts from May, 2018

2018-05-15: Archives Unleashed: Toronto Datathon Trip Report

The Archives Unleashed team (pictured below) hosted a two-day datathon on April 26-27, 2018, at the University of Toronto’s Robarts Library. This time around, Shawn Jones and I were selected to represent the Web Science and Digital Libraries (WSDL) research group from Old Dominion University. This event was the first in a series of four planned datathons that give researchers, archivists, computer scientists, and many others the opportunity to get hands-on experience with the Archives Unleashed Toolkit (AUT) and provide valuable feedback to the team. The AUT facilitates analysis and processing of web archives at scale, and the datathons are designed to help participants find ways to incorporate these tools into their own workflows. Check out the Archives Unleashed team on Twitter and their website to find other ways to get involved and stay up to date with the work they’re doing. Archives Unleashed datathon organizers (left to right): Nich Worby, Ryan Deschamps, Ia

2018-05-04: An exploration of URL diversity measures

Fig. 1: Animal portraits by Morten Koldby

Recently, as part of a research effort to describe a collection of URLs, I was faced with the problem of identifying a quantitative measure that indicates how many different kinds of URLs there are in a collection. In other words, what is the level of diversity in a collection of URLs? Ideally, a diversity measure should produce a normalized value between 0 and 1. A value of 0 means no diversity, for example, a collection of duplicate URLs (Fig. 2, first row, first column). In contrast, a diversity value of 1 indicates maximum diversity, where all URLs are different (Fig. 2, first row, last column):

1. http://www.cnn.com/path/to/story?p=v
2. https://www.vox.com/path/to/story
3. https://www.foxnews.com/path/to/story

Surprisingly, I did not find a standard URL diversity measure in the Web Science community, so I introduced the WSDL diversity index (described below). I acknowledge there may be other URL diversity measures in the Web Sc
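
To make the two endpoints described in this excerpt concrete, here is a minimal Python sketch of one possible normalization. It is an illustrative assumption, not necessarily the WSDL diversity index the post goes on to define: the function name `url_diversity` and the formula (unique count minus one, divided by total count minus one) are chosen only because they return 0 for a collection of duplicates and 1 when every URL is different. URLs are compared as exact strings, with no canonicalization.

```python
def url_diversity(urls):
    """Hypothetical normalized diversity score in [0, 1].

    Assumption: diversity = (unique - 1) / (total - 1), using exact
    string comparison. This is a sketch for illustration only.
    """
    total = len(urls)
    if total < 2:
        # A collection with zero or one URL has no diversity to measure.
        return 0.0
    unique = len(set(urls))
    return (unique - 1) / (total - 1)


duplicates = ["http://www.cnn.com/path/to/story?p=v"] * 3
all_different = [
    "http://www.cnn.com/path/to/story?p=v",
    "https://www.vox.com/path/to/story",
    "https://www.foxnews.com/path/to/story",
]

print(url_diversity(duplicates))     # 0.0
print(url_diversity(all_different))  # 1.0
```

A stricter or looser notion of "different" (for example, comparing only hostnames) would change the result for the same collection, so the comparison key is a design choice rather than a given.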