2018-05-04: An exploration of URL diversity measures

Fig. 1: Animal portraits by Morten Koldby

Recently, as part of a research effort to describe a collection of URLs, I faced the problem of identifying a quantitative measure that indicates how many different kinds of URLs there are in a collection. In other words, what is the level of diversity in a collection of URLs? Ideally, a diversity measure should produce a normalized value between 0 and 1. A value of 0 means no diversity, for example, a collection of duplicate URLs (Fig. 2, first row, first column). In contrast, a value of 1 indicates maximum diversity, meaning all the URLs are different (Fig. 2, first row, last column):

1. http://www.cnn.com/path/to/story?p=v
2. https://www.vox.com/path/to/story
3. https://www.foxnews.com/path/to/story

Surprisingly, I did not find a standard URL diversity measure in the Web Science community, so I introduced the WSDL diversity index (described below). I acknowledge there may be oth...
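To make the two endpoints concrete, here is a minimal Python sketch of one measure that satisfies them: 0 for a collection of duplicates and 1 for a collection of all-different URLs. It is an illustration only and is not presented as the WSDL diversity index. The function name `unique_ratio_diversity` is hypothetical, URLs are compared as exact strings with no normalization, and returning 0 for collections with fewer than two URLs is an assumption.

```python
def unique_ratio_diversity(urls):
    """Return (U - 1) / (N - 1), where N is the number of URLs and U is
    the number of distinct URL strings.

    Illustrative measure only (an assumption, not the author's index).
    """
    n = len(urls)
    if n < 2:
        # Assumption: an empty or single-URL collection has no diversity.
        return 0.0
    unique = len(set(urls))  # exact string comparison, no URL normalization
    return (unique - 1) / (n - 1)


# A collection of duplicate URLs: no diversity.
duplicates = ["http://www.cnn.com/path/to/story?p=v"] * 3

# The three example URLs from the text: all different, maximum diversity.
all_different = [
    "http://www.cnn.com/path/to/story?p=v",
    "https://www.vox.com/path/to/story",
    "https://www.foxnews.com/path/to/story",
]

print(unique_ratio_diversity(duplicates))
print(unique_ratio_diversity(all_different))
```

Subtracting 1 from both the numerator and the denominator is what pins the all-duplicates case to exactly 0; the plain ratio U / N would bottom out at 1 / N instead.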