Posts

Showing posts from September, 2017

2017-09-19: Carbon Dating the Web, version 4.0

With this release of Carbon Date, we are introducing new features to track testing and enforce standard Python formatting conventions. This version is dubbed Carbon Date v4.0. We've also decided to switch from MementoProxy to the MemGator aggregator tool built by Sawood Alam. Of course, with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow our team to catch and address these issues more quickly than before, as explained below. The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has changed its API to allow only 30-day trials with 1,000 requests per month unless someone wants to pay. We also discovered a few more use cases for Pubdate extraction by applying Pubdate to the mementos retrieved from MemGator. By default, MemGator provides t

2017-09-13: Pagination Considered Harmful to Archiving

Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015
Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015
https://web.archive.org/web/20151030092546/https://www.usnews.com/education/best-global-universities/rankings

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages which comprised the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (or 28%) had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursadar and Shipman (2017). Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews,