2020-02-14: ACM Computing Surveys publication: Change Detection and Notification of Web Pages: A Survey

I'm very excited to announce our recent publication in the prestigious "ACM Computing Surveys" journal.

Vijini Mallawaarachchi, Lakmal Meegahapola, Roshan Madhushanka, Eranga Heshan, Dulani Meedeniya, and Sampath Jayarathna. "Change Detection and Notification of Web Pages: A Survey." ACM Computing Surveys (CSUR) 53, no. 1 (2020): 1-35

An arXiv copy is available at https://arxiv.org/abs/1901.02660

In this survey, we review the main aspects of change detection and notification systems and the techniques used for each aspect, along with the current challenges and areas for improvement in this field of research.

This project began as part of early work at the Texas A&M University Center for the Study of Digital Libraries (CSDL) on building a topic-modeling-based change detection classifier for ACM conference proceedings. These initial results were presented at ACM Hypertext 2016 and at the IEEE Big Data Special Session on Data Mining 2016. The work was later expanded to look at a server-based scheduler optimization framework, as part of a final-year undergraduate research project at the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka. Two of the undergraduate student authors are currently PhD students, at the Australian National University (Vijini Mallawaarachchi) and at EPFL, Switzerland (Lakmal Meegahapola).

Free PDF copies of the following related publications are available at https://www.cs.odu.edu/~sampath/

Lakmal Meegahapola, Roshan Alwis, Eranga Nimalarathna, Vijini Mallawaarachchi, Dulani Meedeniya, and Sampath Jayarathna. "Random Forest Classifier based Scheduler Optimization for Search Engine Web Crawlers", 7th International Conference on Software and Computer Applications, Kuantan, Malaysia, February 8-10, 2018.

Lakmal Meegahapola, Roshan Alwis, Eranga Nimalarathna, Vijini Mallawaarachchi, Dulani Meedeniya, and Sampath Jayarathna. "Detection of Change Frequency in Web Pages to Optimize Server-based Scheduling", 17th International Conference on Advances in ICT for Engineering Regions, Colombo, Sri Lanka, September 07-08, 2017.

Lakmal Meegahapola, Roshan Alwis, Eranga Nimalarathna, Vijini Mallawaarachchi, Dulani Meedeniya, and Sampath Jayarathna. "Adaptive Technique for Web Page Change Detection using Multi-threaded Crawlers", IEEE 7th International Conference on Innovative Computing Technology, Luton, UK, August 16-18, 2017.


Lakmal Meegahapola, Roshan Alwis, Eranga Nimalarathna, Vijini Mallawaarachchi, Dulani Meedeniya, and Sampath Jayarathna. "Optimizing Change Detection in Distributed Digital Collections", IEEE/ACIS 18th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, Ishikawa, Japan, June 26-28, 2017.

Lakmal Meegahapola, Roshan Alwis, Eranga Nimalarathna, Vijini Mallawaarachchi, Dulani Meedeniya, and Sampath Jayarathna. "Change Detection Optimization in Frequently Changing Web Pages", 3rd International Multidisciplinary Engineering Research Conference, Moratuwa, Sri Lanka, May 29-31, 2017.

Sampath Jayarathna, and Faryaneh Poursardar. "Change Detection and Classification of Digital Collections", IEEE Big Data Conference Special Session on Intelligent Data Mining, Dec. 5-8, 2016.

Luis Meneses, Sampath Jayarathna, Richard Furuta, and Frank Shipman. "Analyzing the Perceptions of Change in a Distributed Collection of Web Documents", The ACM Hypertext Conference, Halifax, Canada, July 10-13, 2016. pp. 273-278. 

Figure 1: Organization of the survey and the aspects discussed

Figure 1 summarizes the organization of the survey and the aspects it discusses. We mainly focus on techniques used in web crawler scheduling, change detection, and change frequency estimation, where extensive research has been carried out. We identify four main research directions in our survey. The first focuses on improving the architecture of Change Detection and Notification (CDN) systems so that computing and temporal resources can be used efficiently while overcoming the limitations of traditional server-based and client-based architectures. The second focuses on improving change-detection algorithms to track webpage changes quickly and with high accuracy. The third focuses on identifying the change frequency of webpages and designing optimized crawler schedules, so that computing resources are used efficiently by deploying crawlers only when required. The final direction is developing and improving methods and algorithms that detect changes in dynamic and JavaScript-rendered webpages and can efficiently handle large amounts of data.

In addition, we compared different features of twelve popular CDN systems that are publicly available at present. The comparison shows that most of the systems support checks at fixed intervals, but not checks at random intervals. These systems can be further improved by introducing intelligent crawling schedules that optimize the crawling process by crawling each webpage at its estimated change frequency.
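To make the idea of crawling at an estimated change frequency concrete, here is a minimal Python sketch. It is an illustration only, not a method taken from the survey: the `CrawlHistory` record, the naive rate estimate, and the one-hour and one-week bounds are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CrawlHistory:
    """Hypothetical record of past visits to a single webpage."""
    changes_seen: int      # visits on which the content differed from the previous copy
    observed_hours: float  # total time span covered by those visits


def estimated_changes_per_hour(history: CrawlHistory) -> float:
    """Naive estimate: detected changes divided by the observation time.

    Assumption: at most one change happens between two visits. If a page
    changes more often than it is crawled, this underestimates the rate.
    """
    if history.observed_hours <= 0:
        return 0.0
    return history.changes_seen / history.observed_hours


def next_crawl_interval_hours(
    history: CrawlHistory,
    min_hours: float = 1.0,    # assumed lower bound, to avoid hammering a server
    max_hours: float = 168.0,  # assumed upper bound (one week) for static pages
) -> float:
    """Schedule the next crawl roughly one expected change from now."""
    rate = estimated_changes_per_hour(history)
    if rate == 0:
        return max_hours
    return min(max(1.0 / rate, min_hours), max_hours)
```

For example, a page on which 6 changes were detected over 48 hours gets an interval of 8 hours, while a page that never changed falls back to the weekly maximum. A fixed-interval checker would crawl both pages equally often.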

Moving forward, our group at the WS-DL Lab (Alexander Nwala, Yasith Jayawardana, Gavindya Jayawardena, Jian Wu, Michael Nelson, Sampath Jayarathna) and collaborators from Penn State (C. Lee Giles) are working towards a history-aware crawl scheduler for academic web resources such as CiteSeerX.

-- Sampath
