2022-09-29: Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision
In this blog post, I will talk about our recent paper "Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision", published at the ACM Symposium on Document Engineering (DocEng). In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus of more than 4,500 automatically annotated sentences containing theory/model mentions. We compared four deep learning architectures and found that RoBERTa-BiLSTM-CRF performs best, with a precision as high as 89.72%. The code and data are publicly available on GitHub. You can also check the slides.
Introduction
Scientific literature has grown exponentially over the past decades. To digest the literature more quickly, readers can review abstracts and high-level key phrases, but these do not provide enough detail. Theory and model names extracted from the body text offer a finer-grained view. While abstracts and key phrases have been explored frequently in the literature, there has been limited work on theory and model name extraction. In addition, extracted theories and models can help build knowledge graphs, power search features of academic search engines, and support literature analyses such as studies of domain development and innovation composition.
Theory entity extraction has not been extensively explored, largely due to a lack of data. There is no existing labeled dataset for extracting theory entities in the social and behavioral sciences (SBS). Crowdsourcing is not appropriate because annotation requires expertise in a specific domain, and it is hard to find a large number of annotators with such expertise; manual annotation by experts is time-consuming.
As a result, we propose to use distant supervision to address the data sparsity problem of theory extraction. In our work, we use an existing database, such as Wikipedia, to collect instances of entity mentions, and then we use these instances to automatically generate our training data.
Ground Truth Data Construction
Figure 1: Pipeline for the construction of the ground truth data.
As shown in the diagram above, we use a pipeline to automatically generate training data. We obtain text from a set of papers in PDF format, which must be converted into XML in order to extract the body text. The body text is then segmented into sentences, and the sentences are indexed in Elasticsearch. Meanwhile, we obtain theory entities from existing sources such as Wikipedia. These entities are used to query the index and find the sentences that contain them, and the matched sentences are automatically annotated in BIO format.
To be specific, the pipeline is composed of the following five modules:
- Web Scraping
  - Use a web scraper to obtain theory and model entities from Wikipedia webpages.
  - A heuristic filter keeps phrases ending with head words such as “theory”, “model”, “concept”, etc.
- Obtaining Body Text
  - Use GROBID to convert PDF documents into XML format and keep the body text.
- Sentence Segmentation
  - Use Stanza to segment the body text of papers into sentences.
  - This yields about 870,000 sentences from the 2,400 papers.
- Elasticsearch
  - The sentences are indexed by Elasticsearch.
- Automatic Annotation
  - The seed theory mentions obtained in Web Scraping are used to query the Elasticsearch index.
  - Matched sentences are annotated in the BIO (Begin, Inside, Outside) schema.
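The filtering, querying, and annotation steps above can be sketched as follows. This is a minimal illustration, not the paper's released code: the head-word list is a small subset, tokenization is simplified to whitespace splitting, and the Elasticsearch field name `sentence` is an assumption.

```python
# Heuristic filter: keep phrases ending with a theory/model head word.
# (Illustrative subset of the head-word list.)
HEAD_WORDS = {"theory", "model", "concept"}

def is_theory_phrase(phrase):
    return phrase.lower().split()[-1] in HEAD_WORDS

def build_query(entity):
    # Elasticsearch match_phrase request body; the field name
    # "sentence" is an illustrative assumption.
    return {"query": {"match_phrase": {"sentence": entity}}}

def bio_annotate(sentence, entities):
    """Tag whitespace tokens with B/I/O labels for any entity mention."""
    tokens = sentence.split()
    labels = ["O"] * len(tokens)
    lowered = [t.lower().strip(".,;:") for t in tokens]
    for entity in entities:
        ent = entity.lower().split()
        n = len(ent)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == ent:
                labels[i] = "B"
                labels[i + 1:i + n] = ["I"] * (n - 1)
    return list(zip(tokens, labels))
```

For example, `bio_annotate("We test prospect theory in two studies.", ["prospect theory"])` tags "prospect" with B, "theory" with I, and every other token with O.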
The parent sample was obtained from the Defense Advanced Research Projects Agency (DARPA) ‘Systematizing Confidence in Open Research and Evidence’ (SCORE) program, and contains approximately 30,000 articles published from 2009 to 2018 in 62 major SBS journals in psychology, economics, politics, management, education, etc.
We obtain the text for labeling from a random sample of 2400 SBS papers. The ground truth dataset contains 4534 sentences with 550 unique theory mentions automatically annotated by the pipeline. Some examples of the training data are shown below:
Figure 2: Sentences containing highlighted theory names in ground truth data.
Models and Evaluation
After constructing the ground truth data, we proceed to build our models. We compare four deep neural network architectures: BiLSTM, BiLSTM-CRF, Transformer, and GCN.
- BiLSTM: The BiLSTM architecture analyzes the contextual dependency of each token in both the forward and backward directions, and then assigns each token a label based on probability scores for each tag.
- BiLSTM-CRF: The BiLSTM can be combined with a Conditional Random Field (CRF) layer, which labels a token based on its own features as well as the features and labels of nearby tokens.
- Transformer: A transformer model predicts the labels of all tokens simultaneously, based on the features of neighboring tokens, using a multi-head attention mechanism.
- GCN: A Graph Convolutional Network (GCN) is a type of convolutional network that operates on graph-structured data.
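To make the CRF layer's role concrete, here is a minimal Viterbi decoder for a linear-chain CRF over BIO tags. The emission scores would come from the BiLSTM; both the emission and transition scores below are hand-set illustrative numbers, not learned parameters.

```python
TAGS = ["B", "I", "O"]

# TRANSITION[prev][cur]: the large penalty on O -> I encodes the
# BIO constraint that an I tag must follow a B or I tag.
TRANSITION = {
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": -1.0, "I": 1.0, "O": 0.0},
    "O": {"B": 0.5, "I": -10.0, "O": 0.5},
}

def viterbi(emissions):
    """emissions: one {tag: score} dict per token; returns the best tag path."""
    scores = dict(emissions[0])  # best score of any path ending in each tag
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: scores[p] + TRANSITION[p][cur])
            new_scores[cur] = scores[prev] + TRANSITION[prev][cur] + emit[cur]
            ptrs[cur] = prev
        backptrs.append(ptrs)
        scores = new_scores
    # Backtrack from the best final tag.
    path = [max(TAGS, key=scores.get)]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

This is why the CRF helps: even when a token's own emission scores are ambiguous, the transition scores steer the decoder toward a globally consistent tag sequence (e.g., never emitting an I right after an O).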
Among all the architectures, BiLSTM-CRF combined with the pre-trained language model RoBERTa obtained the best performance, with an F1 score of 77.21%. The Transformer does not perform as well as it does in other tasks, likely due to the limited size of our training data.
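Since F1 is the harmonic mean of precision and recall, the two reported numbers also pin down the implied recall. A quick check (values from the paper; the inversion is standard algebra):

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def implied_recall(f1_score, precision):
    # Invert F1 = 2PR / (P + R) for R.
    return f1_score * precision / (2 * precision - f1_score)

recall = implied_recall(0.7721, 0.8972)  # roughly 0.678
```

So a precision of 89.72% together with an F1 of 77.21% implies a recall of roughly 67.8%: the distantly supervised tagger is notably more precise than it is exhaustive.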
When applied to a small dataset, all theory names extracted from these sentences were new. In particular, about 42% contain head words that were not in the heuristic filter. Examples of the extraction results are shown below. The first two examples contain head words from the heuristic filter, so they are partially new; the next three are completely new. In the last one, “advance” is a verb, so the extracted entity does not make sense.
Conclusion
- We proposed a trainable framework that extracts theory and model mentions from scientific papers using distant supervision.
- We created a new benchmark corpus consisting of 4534 annotated sentences from papers in SBS domains. This dataset can be used to train and evaluate future models for theory extraction.
- We compared several NER neural architectures and investigated their dependency on pre-trained language models. The empirical results indicated that the RoBERTa-BiLSTM-CRF architecture achieved the best performance with an F1 score of 77.21% and a precision of 89.72%.
References
Xin Wei, Lamia Salsabil, and Jian Wu. Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision. In: Proceedings of the 22nd ACM Symposium on Document Engineering (DocEng 2022), September 20–23, 2022. Virtual Event.

-- Xin Wei (@Xin9Xin)