2023-12-29: Paper Summary: “A Systematic Review of Reproducibility Research in Natural Language Processing”
In the quest for scientific progress, the replicability crisis has become an increasingly prominent concern. To characterize and analyze this issue in natural language processing (NLP), Anya Belz, Shubham Agarwal, Anastasia Shimorina, and Ehud Reiter conducted a systematic review, "A Systematic Review of Reproducibility Research in Natural Language Processing" (Belz et al., 2021). The review examines the challenges, initiatives, and divergent terminology found in the NLP reproducibility literature. It also explores reproduction studies conducted under varying conditions and the impact of seemingly minor code differences on the reliability and generalizability of NLP results.
Data and Methodology
The research team employed a structured, systematic review process, searching the ACL Anthology for papers containing the terms "reproduc" or "replica". This initial search yielded 47 papers, of which 12 were excluded after inspection (Figure 01). An additional 25 papers were found in non-ACL NLP/ML sources, bringing the total to 60 papers for the survey; a further 7 papers from other fields were also examined. Their objective was to categorize the research, identify common terminology, and present a holistic overview of the current state of reproducibility in NLP.
Figure 01: The 35 papers selected from the ACL Anthology search, by year and venue [Source: Belz et al. (2021), Figure 1]
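The selection step above amounts to a simple substring filter over paper titles. The following sketch illustrates the idea; the titles are invented for illustration and are not from the actual survey:

```python
# Hypothetical sketch of the survey's keyword filter: keep papers whose
# titles contain the substring "reproduc" or "replica" (case-insensitive).
# These example titles are made up for illustration.
titles = [
    "Reproducibility in Neural Machine Translation",
    "A Replication Study of Sentiment Classifiers",
    "Attention Is All You Need",
]

keywords = ("reproduc", "replica")
selected = [t for t in titles if any(k in t.lower() for k in keywords)]
# The first two titles match; the third does not.
```

In the actual review, a matching step like this was followed by manual inspection, which is how 12 of the 47 initially matched papers came to be excluded.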
Findings of the Study
Their findings capture the current status of reproducibility in NLP and point the community toward a more standardized and transparent future. The key findings are categorized as follows:
(A) Terminology and Frameworks:
The study discussed the complex terminology used in NLP reproducibility research. Notably, terms like "reproducibility" and "replicability" lack consensus definitions, as shown in Table 01.
Table 01: Different definitions proposed for ‘reproducibility’, ‘replicability’, and ‘repeatability’
The authors propose adopting the precision-oriented definitions from the International Vocabulary of Metrology (VIM), which also align with the definitions given by the ACM (2020). This recommendation seeks to establish common ground for effective communication within the NLP research community (Table 02).
The authors identified three distinct categories of reproducibility research: reproduction under the same conditions (repeatability), reproduction under varied conditions (reproducibility), and multi-test and multi-lab studies.
(i) Reproduction Under Same Conditions (Repeatability):
The study discussed the difficulties researchers face in achieving system sameness. Notably, when comparing original and reproduction score pairs, worse outcomes were more frequent than better ones. Even minor code differences, such as small variations in model parameters and settings (rare-word thresholds, treatment of ties, or case normalization), were found to influence performance. This underscores the need for diligent attention to detail in code reproducibility.
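One of the "minor" differences named above, case normalization, is easy to demonstrate: the same system outputs can receive very different scores depending on a single preprocessing flag. The metric and data below are a minimal invented example, not taken from the paper:

```python
# Hypothetical illustration: a seemingly minor preprocessing choice
# (lowercasing before comparison) changes an exact-match accuracy score.
def exact_match_accuracy(predictions, references, lowercase=False):
    """Fraction of predictions that exactly match their reference."""
    if lowercase:
        predictions = [p.lower() for p in predictions]
        references = [r.lower() for r in references]
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Invented system outputs and references.
preds = ["The cat sat", "a Dog ran", "birds fly"]
refs = ["the cat sat", "a dog ran", "fish swim"]

# Identical outputs, two different "reproductions" of the metric:
score_cased = exact_match_accuracy(preds, refs, lowercase=False)   # 0/3
score_uncased = exact_match_accuracy(preds, refs, lowercase=True)  # 2/3
```

If the original authors lowercased and a reproduction did not (or vice versa), the reported scores would differ even with a bit-identical system, which is exactly the kind of discrepancy the survey highlights.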
(ii) Reproduction Under Varied Conditions (Reproducibility):
This category covers studies that deliberately vary conditions while aiming for similar results, and the authors provide examples of such studies. The approach recognizes the complexity of real-world applications and the need for adaptable models that perform consistently across diverse scenarios.
(iii) Multi-test and Multi-lab Studies:
The study discussed multi-test and multi-lab studies, offering insights into the challenges and advantages of coordinating multiple reproduction studies. The authors highlight the REPROLANG project (Branco et al., 2020) as an exemplary multi-lab study, showcasing the collaborative effort needed to address reproducibility challenges effectively.
Key Takeaways
The study provides a wealth of statistics illustrating the challenges in the field. Only 14.03% of original and reproduction score pairs were identical, highlighting how rarely exact reproduction is achieved in NLP research. The majority of non-identical reproduction results were worse than the originals, underscoring the sensitivity of code details: roughly 3/5 of score pair differences were greater than +/-1%, and about 1/4 were greater than +/-5% (Figure 2). These statistics serve as a sobering reminder that the path to reproducibility is fraught with complexity.
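The +/-1% and +/-5% thresholds above can be read as relative differences between original and reproduced scores. The sketch below shows one plausible way to classify a score pair; the threshold scheme is my reading of the survey's statistics, and the score pairs are invented for illustration:

```python
# Hypothetical sketch: classifying original-vs-reproduction score pairs
# by relative difference, mirroring the survey's +/-1% and +/-5% bands.
# The score pairs below are invented, not taken from the paper.
def relative_diff_pct(original, reproduction):
    """Signed difference as a percentage of the original score."""
    return 100.0 * (reproduction - original) / original

pairs = [(0.850, 0.850), (0.700, 0.694), (0.600, 0.540)]

for orig, repro in pairs:
    d = relative_diff_pct(orig, repro)
    if d == 0:
        label = "identical"
    elif abs(d) <= 1:
        label = "within +/-1%"
    elif abs(d) <= 5:
        label = "between +/-1% and +/-5%"
    else:
        label = "beyond +/-5%"
    print(f"{orig:.3f} -> {repro:.3f}: {d:+.2f}% ({label})")
```

Under this scheme the three invented pairs fall into the "identical", "within +/-1%", and "beyond +/-5%" bands respectively, illustrating how a single percentage-difference measure supports the kind of aggregate statistics the survey reports.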
The key takeaway from this review extends beyond statistics to a recognition of the challenges inherent in NLP research. The authors call for a shift in focus toward addressing code differences, with a dual emphasis on reliability and generalizability. The study's categorization of reproduction studies and its call to address more complex questions resonate with ongoing discussions in the NLP research community.
Conclusion
The review (Belz et al., 2021) explores the complexities of reproducibility in NLP, documenting a strikingly low percentage of identical reproductions. Its categorization of reproducibility research into three types (repeatability, reproducibility, and multi-test and multi-lab studies) adds clarity to the multifaceted nature of the challenges. The proposal to adopt precision-oriented definitions from the International Vocabulary of Metrology (VIM) addresses the NLP community's lack of consensus on how reproducibility should be defined and measured, and reflects a forward-looking emphasis on a common language for the field. By recognizing the need for precise terminology, addressing code differences, and prioritizing reliability and generalizability, the study deepens our understanding of reproducibility in NLP and points toward more robust and credible research practices. In essence, the review not only exposes the challenges but also sets the stage for a more standardized and transparent future in NLP research.
References
ACM, 2020. Artifact Review and Badging - Current. URL https://www.acm.org/publications/policies/artifact-review-and-badging-current (accessed 2023-12-23).
Belz, A., Agarwal, S., Shimorina, A., Reiter, E., 2021. A Systematic Review of Reproducibility Research in Natural Language Processing, in: Merlo, P., Tiedemann, J., Tsarfaty, R. (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Presented at the EACL 2021, Association for Computational Linguistics, Online, pp. 381–393. DOI: 10.18653/v1/2021.eacl-main.29
Branco, A., Calzolari, N., Vossen, P., Van Noord, G., van Uytvanck, D., Silva, J., Gomes, L., Moreira, A., Elbers, W., 2020. A Shared Task of a New, Collaborative Type to Foster Reproducibility: A First Exercise in the Area of Language Science and Technology with REPROLANG2020, in: Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference. Presented at the LREC 2020, European Language Resources Association, Marseille, France, pp. 5539–5545.
Rougier, N.P., Hinsen, K., Alexandre, F., Arildsen, T., Barba, L.A., Benureau, F.C.Y., Brown, C.T., Buyl, P. de, Caglayan, O., Davison, A.P., Delsuc, M.-A., Detorakis, G., Diem, A.K., Drix, D., Enel, P., Girard, B., Guest, O., Hall, M.G., Henriques, R.N., Hinaut, X., Jaron, K.S., Khamassi, M., Klein, A., Manninen, T., Marchesi, P., McGlinn, D., Metzner, C., Petchey, O., Plesser, H.E., Poisot, T., Ram, K., Ram, Y., Roesch, E., Rossant, C., Rostami, V., Shifman, A., Stachelek, J., Stimberg, M., Stollmeier, F., Vaggi, F., Viejo, G., Vitay, J., Vostinar, A.E., Yurchak, R., Zito, T., 2017. Sustainable computational science: the ReScience initiative. PeerJ Comput. Sci. 3, e142. DOI: 10.7717/peerj-cs.142
Schloss, P.D., 2018. Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research. mBio 9, DOI: 10.1128/mbio.00525-18
Wieling, M., Rawee, J., van Noord, G., 2018. Reproducibility in Computational Linguistics: Are We Willing to Share? Computational Linguistics 44, 641–649. DOI: 10.1162/coli_a_00330
– Rochana R. Obadage