Posts

Showing posts with the label Language Model

2022-09-13: A Hybrid Classifier to Extract URLs linking to Open Access Datasets and Software for Computational Reproducibility Study

It has become common practice to include open access datasets and software (OADS) in computational research publications. Emily Escamilla discusses the increasing trend of referencing web and software repository platforms in scholarly publications in her blog post "One in Five arXiv Articles Reference GitHub". OADS are essential resources for replicating computational experiments and making the work more transparent. They are also crucial for building repositories that support computational reproducibility. Manually examining a large number of research papers to extract URLs linking to OADS is time-consuming and labor-intensive, so an automatic approach is needed to identify and extract OADS-URLs (URLs linking to OADS) from scientific papers. We proposed a hybrid OADSClassifier, consisting of a heuristic and a supervised learning model, to automatically identify OADS-URLs in a research paper. The classifier achieves a best F1 of 0.92. The source code is av...
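To make the heuristic half of such a hybrid pipeline concrete, here is a minimal Python sketch of one plausible first stage: pull URLs out of paper text and keep those whose host is a known repository platform. This is an illustration under stated assumptions, not the OADSClassifier implementation. The host list `KNOWN_OADS_HOSTS`, the regular expression, and the function names are all hypothetical, and the supervised model that would judge the remaining URLs is not shown.

```python
import re
from urllib.parse import urlparse

# Hypothetical host list for illustration only; not taken from the post.
KNOWN_OADS_HOSTS = {"github.com", "gitlab.com", "zenodo.org", "figshare.com"}

# Simple pattern: an http(s) URL runs until whitespace or a closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return all http(s) URLs in text, with trailing punctuation stripped."""
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


def heuristic_is_oads(url):
    """Guess whether a URL points to OADS based on its host alone."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host in KNOWN_OADS_HOSTS


def candidate_oads_urls(text):
    """Keep only the URLs that the host heuristic accepts."""
    return [url for url in extract_urls(text) if heuristic_is_oads(url)]
```

In a hybrid setup along these lines, URLs that the host heuristic cannot decide would be passed, together with their surrounding sentence, to a trained classifier rather than discarded.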