2026-05-24: Paper Summary: Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents

Figure 1: EnSU Architecture. This is Figure 2 in our paper.

This blog post summarizes our paper, "Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents," (preprint) published in the 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’25). Many scholarly papers include URLs that point to open-access datasets and software (OADS), but the URL string alone rarely tells us what the link refers to. In this paper, we present EnSU, an ensemble of three complementary models for classifying these URLs using the surrounding citation context. EnSU assigns each URL to one of six categories: four that jointly reflect the resource type (dataset vs. software) and the resource provider (authors vs. third parties), plus two catch-all categories for projects and general links. On our OADS-1K dataset, EnSU achieves a macro-average F1-score of up to 0.90 on a stratified 80/20 split and a mean macro-average F1-score of 0.89 across five-fold cross-validation. We also report that EnSU outperforms the best single-model classifier by 20%.

Introduction

Computational reproducibility depends on being able to access the same data and software used in a published study [1]. In practice, authors often share or cite such resources through URLs embedded in the paper text. These links can be valuable evidence for tracking and preserving OADS, but they are also challenging to index at scale.

A core obstacle is that URLs are semantically underspecified: a repository URL might host code, data, both, or something else, and the intended meaning is often expressed only in the nearby prose. We argue that moving from coarse URL detection to fine-grained classification is important for better metadata and discoverability, including distinguishing whether a resource is contributed by the paper’s authors or reused from elsewhere.

Problem Statement: What "Context-based URL Classification" Means

In this work, context-based URL classification means that given a URL that appears in a scholarly document, we classify it using the citation context around the URL, not the URL string alone.

Expanded Context Representation

We represent a URL’s textual context as a three-sentence window: the sentence immediately before the URL’s sentence, the target sentence that contains the URL, and the sentence immediately after.

If the preceding or trailing sentence is missing (for example, at a paragraph boundary), the “expanded context” reduces to the sentences that exist.
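As a rough illustration of the window logic (a sketch, not the code from our repository), the snippet below builds the expanded context from a paragraph that has already been split into sentences. The function names `expanded_context` and `to_model_input`, and the example paragraph, are hypothetical and chosen only for this post.

```python
from typing import Dict, List, Optional


def expanded_context(sentences: List[str], target_idx: int) -> Dict[str, Optional[str]]:
    """Return the preceding, target, and trailing sentences around target_idx.

    A neighbor that does not exist (paragraph boundary) is returned as None.
    """
    if not 0 <= target_idx < len(sentences):
        raise IndexError("target_idx is outside the paragraph")
    preceding = sentences[target_idx - 1] if target_idx > 0 else None
    trailing = sentences[target_idx + 1] if target_idx + 1 < len(sentences) else None
    return {"preceding": preceding, "target": sentences[target_idx], "trailing": trailing}


def to_model_input(ctx: Dict[str, Optional[str]]) -> str:
    """Join whichever sentences exist into one string, in reading order."""
    parts = [ctx["preceding"], ctx["target"], ctx["trailing"]]
    return " ".join(p for p in parts if p is not None)


# Hypothetical paragraph; the URL sentence is at index 1.
paragraph = [
    "We evaluate on a public benchmark.",
    "The data are available at https://example.org/data.",
    "All splits follow the original release.",
]
print(to_model_input(expanded_context(paragraph, 1)))
```

When the URL sentence opens or closes a paragraph, the same call simply returns a two-sentence context.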

Dataset: OADS-1K

For training and evaluation, we compile OADS-1K, which contains 1,129 manually annotated samples. Each sample includes a URL-containing target sentence together with its expanded context. The annotation process considered six categories, listed below.

Output Labels (Six Categories)

We classify each URL into one of six categories:

Third-Party Dataset: points to a dataset hosted by someone other than the paper’s authors.
Third-Party Software: points to software, tools, or code hosted by someone other than the paper’s authors.
Author Provided Dataset: points to a dataset created and shared by the paper’s authors.
Author Provided Software: points to software, tools, or code created and shared by the paper’s authors.
Project: points to a project website or repository that contains both data and software/tools.
General URL: points to something other than a dataset, software/tool, or project.
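One way to picture the label space is as a small enumeration in which four labels carry a (resource type, provider) pair and the remaining two are catch-alls. The sketch below is illustrative only; the names `URLCategory`, `FACETS`, and `describe` are assumptions made for this post and do not come from our code.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class URLCategory(Enum):
    THIRD_PARTY_DATASET = "Third-Party Dataset"
    THIRD_PARTY_SOFTWARE = "Third-Party Software"
    AUTHOR_PROVIDED_DATASET = "Author Provided Dataset"
    AUTHOR_PROVIDED_SOFTWARE = "Author Provided Software"
    PROJECT = "Project"
    GENERAL_URL = "General URL"


# (resource type, provider) for the four faceted labels; None for the catch-alls.
FACETS: Dict[URLCategory, Tuple[Optional[str], Optional[str]]] = {
    URLCategory.THIRD_PARTY_DATASET: ("dataset", "third party"),
    URLCategory.THIRD_PARTY_SOFTWARE: ("software", "third party"),
    URLCategory.AUTHOR_PROVIDED_DATASET: ("dataset", "authors"),
    URLCategory.AUTHOR_PROVIDED_SOFTWARE: ("software", "authors"),
    URLCategory.PROJECT: (None, None),
    URLCategory.GENERAL_URL: (None, None),
}


def describe(category: URLCategory) -> str:
    """Render a label together with its facets, if it has any."""
    resource, provider = FACETS[category]
    if resource is None:
        return f"{category.value}: catch-all label"
    return f"{category.value}: {resource}, provided by {provider}"


for category in URLCategory:
    print(describe(category))
```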

Figure 2: Distribution of Samples Across Different Subject Categories from CORD-19, ETD, and arXiv in OADS-1K. This is Figure 1 in our paper.

We build OADS-1K from 1,574 scholarly documents published between 2016 and 2022, drawn from three publicly available sources: CORD-19 [3], Electronic Theses and Dissertations (ETDs) [2], and arXiv [4]. The sampling prioritizes documents that contain at least two URLs. We note that the resulting set contains many biomedical and computer science scholarly documents, which is consistent with the underlying corpora and with the prevalence of data and software links in those fields.

Manual Extraction and URL Context Normalization

We extract contexts by visually inspecting each PDF and recording the target sentence and its expanded context. While many URLs appear inline, others show up in footnotes or reference sections. In those cases, we first substitute the citation marker with the full footnote or reference entry (including the URL), and then extract the surrounding sentences.

Annotation Process

Two graduate student annotators label all samples and reach 92% consensus. When they disagree, a third annotator with relevant expertise helps adjudicate.

When the target sentence and expanded context do not provide enough information to determine the category, annotators are instructed to follow the URL and inspect the linked content. We give an example where the context suggests the link is a dataset repository but does not reveal whether it is author-provided or third-party; the annotators then cross-reference paper authors with repository contributors to decide the final label.

Class Distribution and Examples

Table 1: Examples of URLs with target sentences and expanded contexts for each URL category. This is Table 1 in our paper. In this table, we represent the preceding sentence with <preceding>...</preceding>, the sentence containing the URL with <target>...</target>, and the trailing sentence with <trailing>...</trailing>.

Table 1 provides both class proportions and representative examples. The dataset contains all six categories, but the "Project" class is notably smaller than the others.

Method

The central design choice is to ensemble complementary models rather than rely on a single classifier. We motivate this by noting that URL contexts can be subtle and that different modeling choices capture different signals.

We build EnSU, an ensemble of three classifiers:

Supervised Contrastive Learning (SCL) [6] classifier
SciBERT-based classifier
BertGCN [5] classifier

We then combine their predictions through majority voting, with a deterministic tie-breaking rule (see Fig. 1). If two models agree on the category, we take that shared label. If all three disagree, we output the BertGCN prediction as it is the strongest individual model among the three.
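The voting rule is small enough to write out. The following is a minimal sketch of the rule as described above, with a hypothetical function name `ensemble_vote`; it takes the three predicted labels for one URL and is not the exact code from our repository.

```python
from collections import Counter


def ensemble_vote(scl: str, scibert: str, bertgcn: str) -> str:
    """Majority vote over three labels; fall back to BertGCN on a three-way split."""
    label, count = Counter([scl, scibert, bertgcn]).most_common(1)[0]
    # With three voters, a count of 2 or 3 is a majority; a count of 1 means
    # all three models disagree, so the BertGCN prediction is used.
    return label if count >= 2 else bertgcn


print(ensemble_vote("Project", "Author Provided Software", "Project"))
print(ensemble_vote("Project", "General URL", "Third-Party Dataset"))
```

Because there are exactly three voters, two-way ties cannot occur, so the fallback only fires when every model predicts a different category.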

1) Supervised Contrastive Learning (SCL) Model

The SCL component is motivated by data scarcity. In addition to the standard cross-entropy objective, we use supervised contrastive learning, which encourages representations of same-class examples to be closer in embedding space than representations from different classes.

In practice, we start from a pretrained encoder and optimize a weighted mix of cross-entropy and supervised contrastive objectives. We also discuss the temperature term in the contrastive loss, which influences how sharply the model separates hard negatives.
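To make the objective concrete, here is a dependency-free sketch of a supervised contrastive term and its weighted mix with cross-entropy, in the spirit of [6]. The function names, the default `temperature` and `lam` values, and the choice to average over anchors are all assumptions for illustration; this post does not state the values we used, and a real implementation would operate on encoder outputs in a deep learning framework.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supervised_contrastive_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float = 0.3,  # placeholder value, not taken from the paper
) -> float:
    """Supervised contrastive loss on normalized embeddings, averaged over anchors."""
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled similarity of the anchor to every other example.
        sims = {
            k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(n)
            if k != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(ce_loss: float, scl_loss: float, lam: float = 0.5) -> float:
    """Weighted mix: (1 - lam) * cross-entropy + lam * contrastive."""
    return (1.0 - lam) * ce_loss + lam * scl_loss


# Toy 2-D embeddings: two tight clusters.
points = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
print(supervised_contrastive_loss(points, [0, 0, 1, 1]))  # labels match clusters
print(supervised_contrastive_loss(points, [0, 1, 0, 1]))  # labels cut across clusters
```

The loss is small when same-class examples sit together and large when they do not. Lowering the temperature sharpens the softmax over similarities, so the negatives closest to the anchor dominate the penalty.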

In the experiments, we compare several pretrained encoders within the SCL framework and select SPECTER because it outperformed other language models, such as BERT, SciBERT, and DistilBERT, in context-based URL classification (Table 2).

2) SciBERT-based Model

The SciBERT model is a conventional transformer classifier: we fine-tune SciBERT and add a linear classification head to predict the six URL categories from the concatenated context input.

3) BertGCN Model

BertGCN augments a BERT-style encoder with a graph convolutional network (GCN) over a corpus-level graph. We build a graph containing document nodes (one per OADS-1K sample), word nodes (vocabulary terms), word-word edges weighted by positive pointwise mutual information (PPMI), and document-word edges weighted by TF-IDF.
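The two kinds of edge weights can be sketched in plain Python. This is an illustration under stated assumptions rather than our graph-building code: co-occurrence is counted over fixed-size sliding windows, only positive PMI values become edges, and TF-IDF uses one common variant (relative term frequency times log inverse document frequency). The function names, the window size, and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def ppmi_edges(docs: List[List[str]], window: int = 3) -> Dict[Tuple[str, str], float]:
    """Word-word weights: PMI over sliding windows, keeping positive values only."""
    windows: List[List[str]] = []
    for tokens in docs:
        if len(tokens) <= window:
            windows.append(tokens)
        else:
            windows.extend(tokens[i:i + window] for i in range(len(tokens) - window + 1))
    word_count: Counter = Counter()
    pair_count: Counter = Counter()
    for w in windows:
        vocab = sorted(set(w))  # count each word once per window
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    n = len(windows)
    edges = {}
    for (a, b), c in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with window-based probabilities.
        pmi = math.log(c * n / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


def tfidf_edges(docs: List[List[str]]) -> Dict[Tuple[int, str], float]:
    """Document-word weights: relative term frequency times log inverse document frequency."""
    doc_freq = Counter(word for tokens in docs for word in set(tokens))
    edges = {}
    for d, tokens in enumerate(docs):
        for word, count in Counter(tokens).items():
            edges[(d, word)] = (count / len(tokens)) * math.log(len(docs) / doc_freq[word])
    return edges


# Toy corpus of three tokenized contexts.
corpus = [["code", "on", "github"], ["data", "on", "zenodo"], ["code", "and", "data"]]
print(ppmi_edges(corpus))
print(tfidf_edges(corpus))
```

Word pairs that co-occur less often than chance get no edge, which keeps the word-word part of the graph sparse.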

On OADS-1K, we report a graph with 1,129 document nodes, 14,956 word nodes, and 969,423 edges. The adjacency matrix occupies 11.16 MB and is generated in 0.377 seconds on a server with 48 CPUs and 32 GB RAM.

We then fine-tune a BERT encoder and train a two-layer GCN to propagate label information across the graph structure, jointly optimizing the components with cross-entropy.

Experimental Setup

We evaluate on OADS-1K using a stratified 80/20 train-test split, and we also report five-fold cross-validation. The primary metric is macro-average F1, alongside precision and recall.
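For readers less familiar with the metric, the sketch below computes macro-average F1 from scratch: F1 is computed per class and then averaged without weighting, so a small class such as "Project" counts as much as a large one. The function name and the toy labels are illustrative and not taken from our evaluation scripts.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example with two classes.
truth = ["Project", "Project", "General URL", "General URL"]
preds = ["Project", "General URL", "General URL", "General URL"]
print(macro_f1(truth, preds, ["Project", "General URL"]))
```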

We compare EnSU against several baselines: the individual ensemble components (SCL, SciBERT, BertGCN); an LLM-based few-shot classifier (using GPT-4 and Claude 3.7 Sonnet in the described setup); and OADSClassifier [7], a prior hybrid approach that combines heuristic and learning-based components and is adapted here from binary detection to the six-way classification setting.

For the LLM baseline, we use a few-shot prompt with category definitions and labeled examples, set the temperature to 0, and generate five independent predictions per input. We then take a majority vote, and if there is no majority, we fall back to the class with the highest averaged logit probabilities. We report 94% consensus across the five runs.
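The aggregation over the five LLM runs can be sketched as follows. This is an illustration only: it assumes each run yields a predicted label plus a per-class probability dictionary, and it treats "majority" as more than half of the runs. The function name `aggregate_llm_runs` and the example values are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def aggregate_llm_runs(run_labels: List[str], run_probs: List[Dict[str, float]]) -> str:
    """Majority label over the runs; otherwise the label with the highest mean probability."""
    label, count = Counter(run_labels).most_common(1)[0]
    if count > len(run_labels) / 2:
        return label
    # No strict majority: average each class's probability across runs.
    classes = {c for probs in run_probs for c in probs}
    mean_prob = {
        c: sum(probs.get(c, 0.0) for probs in run_probs) / len(run_probs)
        for c in classes
    }
    # Sorting first makes the result deterministic if two means are equal.
    return max(sorted(mean_prob), key=mean_prob.get)


# Made-up example: a 2-2-1 split, so the fallback decides.
labels = ["Project", "Project", "General URL", "General URL", "Third-Party Software"]
probs = [
    {"Project": 0.6, "General URL": 0.4},
    {"Project": 0.5, "General URL": 0.5},
    {"Project": 0.1, "General URL": 0.9},
    {"Project": 0.2, "General URL": 0.8},
    {"Third-Party Software": 0.5, "General URL": 0.3, "Project": 0.2},
]
print(aggregate_llm_runs(labels, probs))
```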

Results

Table 2: F1-scores for SCL with different language models using the target sentence and expanded context as input. This is Table 2 in our paper.

We evaluated several pre-trained language models to choose the best encoder for the SCL classifier. As shown in Table 2, SPECTER performs best, achieving the highest macro-average F1 score of 0.85 and outperforming the other models.

Table 3: Performance metrics (Precision (P), Recall (R), and F1-score) for different input combinations evaluated with EnSU. Input 1: Target sentence. Input 2: Target sentence with expanded context. This is Table 3 in our paper.

To test whether surrounding sentences matter, we compare a target-only input against the expanded context window (Table 3). With expanded context, EnSU’s macro F1 increases from 0.88 to 0.90. This matches our observation that the cues needed to interpret a URL often sit just outside the sentence that contains it.

Table 4: Macro F1-scores for different URL classifiers evaluated on an 80/20 stratified split of the OADS-1K dataset. This is Table 4 in our paper. “Claude” refers to Claude 3.7 Sonnet, and “SCL” stands for Supervised Contrastive Learning.

Table 4 shows that the proposed EnSU classifier performs best overall, with a macro F1 score of 0.90. It consistently leads in key categories such as "Author Provided Software," "Project," and "Author Provided Dataset," significantly outperforming baseline methods, including LLM-based approaches.

We report that EnSU’s improvement over the strongest individual model (BertGCN) is statistically significant under a paired Student’s t-test (t(4) = -4.8107, p = 0.0086).
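For context, a paired t-test over five folds has 4 degrees of freedom, which is where t(4) comes from. The sketch below computes the statistic from paired per-fold scores; the function name is hypothetical and the numbers are toy values chosen for easy arithmetic, not results from the paper. Obtaining the p-value additionally requires the t distribution (for example from a statistics library), which is omitted here.

```python
import math
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Toy per-fold values, NOT results from the paper.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
ensemble = [2.0, 4.0, 5.0, 4.0, 6.0]
t, dof = paired_t_statistic(baseline, ensemble)
print(f"t({dof}) = {t:.4f}")
```

With the baseline passed first, a negative t means the second system scored higher on average.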

Data Efficiency

To study data efficiency, we train with 25%, 50%, 75%, and 100% of the available training data and track how performance changes.

Figure 3: Test F1-scores of SciBERT, SCL, BertGCN, and EnSU across different training data sizes (25%, 50%, 75%, 100%). This is Figure 4 in our paper.

Figure 3 shows a comparison of F1 scores for SciBERT, SCL, BertGCN, and EnSU on the same test set as the training set size increases from 25% to 100%.

Runtime

On the 230-sample test set, we report a total runtime of 37.85 seconds for EnSU, which works out to roughly 0.165 seconds per sample. We present this as evidence that the approach is practical for larger-scale processing.

Error Analysis

Figure 4: Confusion matrix showing the performance of EnSU on the OADS-1K dataset. This is Figure 5 in our paper.

Figure 4 shows the confusion matrix for EnSU on OADS-1K, summarizing which classes are most often confused.

EnSU benefits from combining multiple models, which helps it handle difficult cases where individual classifiers fail. For example, when one model mislabels a project URL as author-provided software, the others correctly identify it, and the ensemble’s majority vote recovers the correct label. The confusion matrix shows strong overall performance, especially for "Project" and "Author Provided Software," but also reveals recurring challenges. The most common errors arise when author-created datasets are hosted on well-known repositories, or when URLs linking to instructional pages that mention software are mistaken for actual software links. These cases highlight how subtle language cues and overlapping mentions of data and tools can still confuse the model.

Limitations and Future Work

In our paper, we emphasize that OADS-1K is relatively small and that the Project category is underrepresented. In addition, the dataset excludes cases where the target sentence contains multiple URLs.

For future work, we plan to expand and balance the dataset, study URLs that appear with limited surrounding text, and explore LLM-agent approaches that inspect the linked content to help determine the URL type.

Availability

The dataset and source code are publicly available on GitHub: https://github.com/lamps-lab/EnSU. If you use this repository, please cite our JCDL ’25 paper: https://doi.org/10.1109/JCDL67857.2025.00031

Acknowledgment

This work was supported in part by the Institute of Museum and Library Services under grant LG-256694-OLS-24.

References:

[1] National Academies of Sciences, Engineering, and Medicine, “Reproducibility and Replicability in Science,” 2019.

[2] S. Uddin et al., “Building a large collection of multi-domain electronic theses and dissertations,” in 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2021, pp. 6043–6045.

[3] L. L. Wang et al., “CORD-19: The COVID-19 open research dataset,” in Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, K. Verspoor et al., Eds., Online, Jul. 2020. [Online].

[4] M. Färber, “Analyzing the GitHub repositories of research papers,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, pp. 491–492.

[5] Y. Lin, Y. Meng, X. Sun, Q. Han, K. Kuang, J. Li, and F. Wu, “BertGCN: Transductive text classification by combining GNN and BERT,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., Online, Aug. 2021, pp. 1456–1462. [Online].

[6] B. Gunel, J. Du, A. Conneau, and V. Stoyanov, “Supervised contrastive learning for pre-trained language model fine-tuning,” arXiv preprint arXiv:2011.01403, 2020.

[7] L. Salsabil et al., “A study of computational reproducibility using URLs linking to open access datasets and software,” in Companion Proceedings of the Web Conference 2022, New York, NY, USA, 2022, pp. 784–788. [Online].

-- Lamia Salsabil (@liya_lamia)
