2016-10-03: Summary of "Finding Pages on the Unarchived Web"

by: Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, and Arjen P. de Vries

Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2014)

In this paper, the authors detail their approach to recovering pages of the unarchived Web based on the links and anchor text of crawled pages. The data used came from the 2012 Dutch Web archive at the National Library of the Netherlands (KB), totaling about 38 million webpages. The collection was selected by the library based on categories related to Dutch history and social and cultural heritage, and each website in it is categorized with a UNESCO classification code. The authors address three research questions: (1) Can we recover a significant fraction of unarchived pages? (2) How rich are the representations of those unarchived pages? (3) Are these representations rich enough to characterize their content? The link extraction used Hadoop MapReduce and Apache Pig to process all archived webpages, with JSoup used to extract links from their content. ...
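To make the extraction step more concrete, below is a minimal JSoup sketch of pulling (source URL, target URL, anchor text) triples out of an archived page's HTML. This is an illustration of the general technique, not the authors' actual code: the class name, the example URL, and the inline HTML string are all hypothetical, and in the real pipeline the HTML would come from archived records processed by the MapReduce/Pig jobs.

// Hypothetical sketch: extract link triples from one page's HTML with JSoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) {
        String pageUrl = "http://example.nl/";  // hypothetical source page URL
        String html = "<a href='/geschiedenis'>Dutch history</a>";  // stand-in for archived HTML

        // Parse with the page URL as the base URI so relative hrefs
        // can be resolved to absolute target URLs.
        Document doc = Jsoup.parse(html, pageUrl);

        // Each <a href> yields a (source, target, anchor text) triple; the
        // anchor text acts as a tiny description of the target page, which
        // may itself be unarchived.
        for (Element link : doc.select("a[href]")) {
            String target = link.attr("abs:href");  // resolved absolute URL
            String anchor = link.text();            // anchor text
            System.out.println(pageUrl + " -> " + target + " [" + anchor + "]");
        }
    }
}

Aggregating these triples by target URL is what lets anchor text from many archived pages serve as a surrogate representation for a page that was never crawled.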