2022-09-29: Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision
In this blog post, I will talk about our recent paper "Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision", published at the ACM Symposium on Document Engineering (DocEng). In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus of more than 4,500 automatically annotated sentences containing theory/model mentions. We compared four deep learning architectures and found that RoBERTa-BiLSTM-CRF performs best, with a precision as high as 89.72%. The code and data are publicly available on GitHub. You can also check the slides.
Introduction
Scientific literature has grown exponentially over the past decades. To digest the literature more quickly, readers can review abstracts and high-level key phrases, but these do not provide enough detail. Theory and model names extracted from the body text offer a finer-grained view. While abstracts and key phrases have been explored frequently in the literature, there has been limited work on theory and model name extraction. In addition, extracted theories and models can help build knowledge graphs, power search features of academic search engines, and support literature analyses such as studies of domain development and innovation composition.
Theory entity extraction has not been extensively explored, largely due to a lack of data. There is no existing labeled dataset for extracting theory entities in the social and behavioral sciences (SBS). Crowdsourcing is not appropriate because annotation requires expertise in a specific domain, and it is hard to find a large number of annotators with such expertise; manual annotation by experts is time-consuming.
As a result, we propose to use distant supervision to address the data sparsity problem of theory extraction. In our work, we use an existing database, such as Wikipedia, to collect instances of entity mentions, and then we use these instances to automatically generate our training data.
Ground Truth Data Construction
Figure 1: Pipeline for the construction of the ground truth data.
As shown in the diagram above, we use a pipeline to automatically generate training data. We obtain text from a set of papers in PDF format, which must be converted into XML in order to extract the body text. The body text is then segmented into sentences, and the sentences are indexed in Elasticsearch. Meanwhile, we obtain theory entities from existing sources such as Wikipedia. These entities are used to query the index and find the sentences that contain them, and the matched sentences are automatically annotated in BIO format.
To be specific, the pipeline is composed of the following five modules:
- Web Scraping
  - Use a web scraper to obtain theory and model entities from Wikipedia webpages.
  - A heuristic filter keeps phrases ending with head words such as “theory”, “model”, “concept”, etc.
- Obtaining Body Text
  - Use GROBID to convert PDF documents into XML format and keep the body text.
- Sentence Segmentation
  - Use Stanza to segment the body text of papers into sentences.
  - This yields about 870,000 sentences from the 2,400 papers.
- Elasticsearch
  - The sentences are indexed by Elasticsearch.
- Automatic Annotation
  - The seed theory mentions obtained in Web Scraping are used to query the Elasticsearch index.
  - Matched sentences are annotated in the BIO (Begin, Inside, Outside) schema.
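The filtering, querying, and annotation steps above can be sketched as follows. This is a minimal illustration, not the paper's released code: the head-word list is a small subset, tokenization is simplified to whitespace splitting, and the Elasticsearch field name `sentence` is an assumption.

```python
# Heuristic filter: keep phrases ending with a theory/model head word.
# (Illustrative subset of the head-word list.)
HEAD_WORDS = {"theory", "model", "concept"}

def is_theory_phrase(phrase):
    return phrase.lower().split()[-1] in HEAD_WORDS

def build_query(entity):
    # Elasticsearch match_phrase request body; the field name
    # "sentence" is an illustrative assumption.
    return {"query": {"match_phrase": {"sentence": entity}}}

def bio_annotate(sentence, entities):
    """Tag whitespace tokens with B/I/O labels for any entity mention."""
    tokens = sentence.split()
    labels = ["O"] * len(tokens)
    lowered = [t.lower().strip(".,;:") for t in tokens]
    for entity in entities:
        ent = entity.lower().split()
        n = len(ent)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == ent:
                labels[i] = "B"
                labels[i + 1:i + n] = ["I"] * (n - 1)
    return list(zip(tokens, labels))
```

For example, `bio_annotate("We test prospect theory in two studies.", ["prospect theory"])` tags "prospect" with B, "theory" with I, and every other token with O.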
The parent sample was obtained from the Defense Advanced Research Projects Agency (DARPA) ‘Systematizing Confidence in Open Research and Evidence’ (SCORE) program, and contains approximately 30,000 articles published from 2009 to 2018 in 62 major SBS journals in psychology, economics, politics, management, education, etc.
We obtain the text for labeling from a random sample of 2400 SBS papers. The ground truth dataset contains 4534 sentences with 550 unique theory mentions automatically annotated by the pipeline. Some examples of the training data are shown below:
Figure 2: Sentences containing highlighted theory names in ground truth data.
Models and Evaluation
After constructing the ground truth data, we proceed to build our models. We compare four deep neural network architectures: BiLSTM, BiLSTM-CRF, Transformer, and GCN.
- BiLSTM: The BiLSTM architecture analyzes the contextual dependency of each token in both the forward and backward directions, and then assigns each token a label based on probability scores for each tag.
- BiLSTM-CRF: The BiLSTM can be combined with a Conditional Random Field (CRF) layer, which labels a token based on its own features as well as the features and labels of nearby tokens.
- Transformer: A transformer model predicts the labels of all tokens simultaneously, based on the features of neighboring tokens, using a multi-head attention mechanism.
- GCN: A Graph Convolutional Network (GCN) is a type of convolutional network that operates on graph-structured data.
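To make the CRF layer's role concrete, here is a minimal Viterbi decoder for a linear-chain CRF over BIO tags. The emission scores would come from the BiLSTM; both the emission and transition scores below are hand-set illustrative numbers, not learned parameters.

```python
TAGS = ["B", "I", "O"]

# TRANSITION[prev][cur]: the large penalty on O -> I encodes the
# BIO constraint that an I tag must follow a B or I tag.
TRANSITION = {
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": -1.0, "I": 1.0, "O": 0.0},
    "O": {"B": 0.5, "I": -10.0, "O": 0.5},
}

def viterbi(emissions):
    """emissions: one {tag: score} dict per token; returns the best tag path."""
    scores = dict(emissions[0])  # best score of any path ending in each tag
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: scores[p] + TRANSITION[p][cur])
            new_scores[cur] = scores[prev] + TRANSITION[prev][cur] + emit[cur]
            ptrs[cur] = prev
        backptrs.append(ptrs)
        scores = new_scores
    # Backtrack from the best final tag.
    path = [max(TAGS, key=scores.get)]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

This is why the CRF helps: even when a token's own emission scores are ambiguous, the transition scores steer the decoder toward a globally consistent tag sequence (e.g., never emitting an I right after an O).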
Among all the architectures, BiLSTM-CRF combined with the pre-trained language model RoBERTa obtained the best performance, with an F1 score of 77.21%. The Transformer does not perform as well as it does in other tasks, likely due to the limited size of our training data.
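Since F1 is the harmonic mean of precision and recall, the two reported numbers also pin down the implied recall. A quick check (values from the paper; the inversion is standard algebra):

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def implied_recall(f1_score, precision):
    # Invert F1 = 2PR / (P + R) for R.
    return f1_score * precision / (2 * precision - f1_score)

recall = implied_recall(0.7721, 0.8972)  # roughly 0.678
```

So a precision of 89.72% together with an F1 of 77.21% implies a recall of roughly 67.8%: the distantly supervised tagger is notably more precise than it is exhaustive.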
When applied to a small dataset, all theory names extracted from these sentences were new. In particular, about 42% contain head words that were not in the heuristic filter. Examples of the extraction results are shown below. The first two examples contain head words from the heuristic filter, so they are partially new; the next three are completely new. In the last one, “advance” is a verb, so the extracted entity does not make sense.
Conclusion
- We proposed a trainable framework that extracts theory and model mentions from scientific papers using distant supervision.
- We created a new benchmark corpus consisting of 4534 annotated sentences from papers in SBS domains. This dataset can be used to train and evaluate future models for theory extraction.
- We compared several NER neural architectures and investigated their dependency on pre-trained language models. The empirical results indicated that the RoBERTa-BiLSTM-CRF architecture achieved the best performance with an F1 score of 77.21% and a precision of 89.72%.
References
Xin Wei, Lamia Salsabil, and Jian Wu. Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision. In: Proceedings of the 22nd ACM Symposium on Document Engineering (DocEng 2022), September 20–23, 2022. Virtual Event.

-- Xin Wei (@Xin9Xin)