2021-02-09: The 1st AAAI Workshop on Scientific Document Understanding (SDU 2021)

The AAAI-21 Workshop on Scientific Document Understanding (SDU) was co-located with the 35th AAAI Conference on Artificial Intelligence. It is another workshop focusing on scientific documents, following the Workshop on Scholarly Document Processing (SDP), which was last held at EMNLP 2020.

The workshop accepted 23 papers, comprising 14 long papers and 9 short papers, on the following topics:
  • Information extraction: 7
  • Information veracity or significance: 5
  • Biomedical text processing: 4
  • Scientific image processing: 3
  • Document classification: 2
  • Summarization: 1
  • Reading comprehension: 1
I co-authored two papers accepted by the AAAI-21 SDU workshop:

Retraction Prediction Paper

The last two decades have seen growing concern in the scientific community about the integrity of published works, reflected in an increasing number of retracted papers. The figure below illustrates the upward trend in retractions over recent years.

The retraction prediction paper casts the problem as binary classification, separating papers that should be retracted from those that should not, using classical machine learning methods. The data comes from the Retraction Watch database and includes 8,087 retractions spanning multiple disciplines, such as health sciences (5,396 papers), social sciences (2,651 papers), and humanities (366 papers). More than one subject may be listed for a given paper.

The paper extracted the following features for classification:
  1. Lead author university ranking
  2. Journal impact score SJR
  3. Citation Next: the average number of citations of a published work in the first 3-5 years after it was published.
  4. Citation velocity: the average rate at which the paper is cited in recent years, excluding self-citations.
  5. Citation and reference intents
  6. Whether the paper is open access
  7. Subject area
  8. Country of the primary author's affiliation
  9. The number of references
  10. The number of authors
  11. Sample size derived from p-value extraction
  12. Acknowledgments
  13. Self-citations
  14. Word embedding of the abstract: the authors tried Doc2vec, BioSenVec, SciBERT, and TFIDF (dimension reduced by SVD)
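Feature 11 derives sample size from p-value extraction. The paper does not describe the extraction method, so the sketch below only illustrates the first step with a hypothetical regular expression for p-value statements:

```python
import re

# Hypothetical pattern for p-value statements such as "p < .05" or "P = 0.012".
# The paper's actual extraction method is not described; this is only a sketch.
P_VALUE_RE = re.compile(r"\bp\s*(?:<=|<|=|>)\s*(0?\.\d+)", re.IGNORECASE)

def extract_p_values(text):
    """Return all p-values mentioned in `text` as floats."""
    return [float(m) for m in P_VALUE_RE.findall(text)]
```

For example, `extract_p_values("We found p < .05 and P = 0.012.")` returns `[0.05, 0.012]`.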

Ten-fold cross-validation achieved F1 = 71.37%. An ablation analysis found that the following features do not provide any separability and thus could not be used for retraction prediction: self-citations, university ranking, and open access. The features with the strongest predictive power appear to be the country of the primary author's affiliation, SJR, and the TFIDF embedding of the abstract (dimension reduced to 15 by SVD).
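The evaluation setup (TFIDF reduced by SVD, a classical classifier, 10-fold cross-validation with F1) can be sketched with scikit-learn. The toy corpus, the choice of random forest, and the reduced dimensionality (5 instead of the paper's 15, to fit the tiny vocabulary) are all stand-in assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in corpus; the real study used abstracts of 8,087 papers.
abstracts = ([f"flawed statistics duplicated figures trial {i}" for i in range(20)] +
             [f"robust replication of established effect {i}" for i in range(20)])
labels = [1] * 20 + [0] * 20  # 1 = retracted, 0 = not retracted

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5, random_state=0),  # the paper reduced TFIDF to 15 dims
    RandomForestClassifier(n_estimators=100, random_state=0),
)
scores = cross_val_score(model, abstracts, labels, cv=10, scoring="f1")
print(round(scores.mean(), 2))
```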

Patent Label Extraction Paper

The patent label extraction paper proposed an algorithm to extract labels in patent figures. Patent figures may contain one or multiple subfigures, each of which has an associated label. To separate the text from the figure, the paper proposed an alpha-shape-based method. An alpha-shape is the smallest polygon that encloses all the points (foreground pixels) in an area, similar to a convex hull, but with an alpha-allowance for non-convexity (Edelsbrunner, Kirkpatrick, and Seidel 1983).

The pipeline to generate the alpha-shapes is:
  1. Binarize the input image
  2. Fill closed regions
  3. Detect label candidates
  4. Remove dashed lines
  5. Generate alpha-shapes
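As an illustrative sketch (not the paper's implementation), the final step can be realized with the classic Delaunay-based construction: keep triangles whose circumradius is at most 1/alpha, and take edges that belong to exactly one kept triangle as the alpha-shape boundary:

```python
import numpy as np
from scipy.spatial import Delaunay

def alpha_shape_edges(points, alpha):
    """Return boundary edges of the alpha-shape of 2-D points.

    Keeps Delaunay triangles whose circumradius <= 1/alpha; edges that
    belong to exactly one kept triangle form the boundary.
    """
    tri = Delaunay(points)
    edge_counts = {}
    for ia, ib, ic in tri.simplices:
        a, b, c = points[ia], points[ib], points[ic]
        # Side lengths and circumradius via Heron's formula.
        la = np.linalg.norm(b - c)
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        s = (la + lb + lc) / 2.0
        area = np.sqrt(max(s * (s - la) * (s - lb) * (s - lc), 1e-12))
        circumradius = la * lb * lc / (4.0 * area)
        if circumradius <= 1.0 / alpha:
            for e in ((ia, ib), (ib, ic), (ic, ia)):
                key = tuple(sorted(e))
                edge_counts[key] = edge_counts.get(key, 0) + 1
    return [e for e, count in edge_counts.items() if count == 1]
```

For pixel data, `points` would be the foreground-pixel coordinates of a label candidate region; alpha controls how tightly the shape hugs the glyphs.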
The method was evaluated on a small ground-truth set consisting of 100 USPTO design patents approved in January 2020, containing 126 figure labels in 100 figure files. The method was also compared with open-source and commercial tools. The comparison result is shown below:

It is seen that the alpha-shape-based method achieved near state-of-the-art performance, with F1=0.91. The paper also discusses one limitation of the method: its lack of adaptability, i.e., human intervention is needed to adjust hyperparameters in order to achieve the above performance.

After the paper was presented, we tried Amazon Rekognition OCR, which achieved F1=0.93 on the same dataset. Although this is still below Google's result, Rekognition is much easier to use than the Google Vision API.

The workshop presented many interesting papers, outlined below.

Towards A Robust Method for Understanding the Replicability of Research (by Ben Gelman [Two-Six])

The paper proposed a framework to predict the replicability of a psychological paper. The pipeline is illustrated below:

The authors used "Automator" to extract text from PDFs by converting them to RTF and then to HTML. In "Semantic Tagging", the authors developed a random forest model that tags each sentence with one of six categories: Introduction, Methodology, Results, Discussion, Research Practices, and References. They used the Universal Sentence Encoder to encode each sentence into a dense vector. The ground truth includes 81k labeled sentences extracted from 838 papers. The F1 of the above semantic tags was between 0.56 and 0.84.
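The semantic-tagging step can be sketched as a sentence classifier. This is not the authors' code: the Universal Sentence Encoder is replaced here by a TF-IDF vectorizer as a lightweight stand-in, and the three training sentences are invented:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Invented toy training data; the real model was trained on 81k labeled sentences.
train_sentences = [
    "Prior work has examined this effect",      # Introduction
    "Participants completed a questionnaire",   # Methodology
    "The main effect was significant",          # Results
]
train_tags = ["Introduction", "Methodology", "Results"]

tagger = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
tagger.fit(train_sentences, train_tags)
print(tagger.predict(["Participants completed a survey"]))
```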

The information extraction step is more involved. The authors extracted three types of information: language quality of paragraphs, methodological information, and statistical tests, summarized below:

Language quality of paragraphs
  • Flesch Reading Ease (Wikipedia)
  • TextBlob (Tutorial)
  • AllenNLP (Demo)
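The Flesch Reading Ease score above is a fixed formula over sentence, word, and syllable counts. A generic sketch with a naive vowel-group syllable counter (the authors likely used an off-the-shelf implementation):

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences)
    - 84.6*(syllables/words), with a naive syllable counter."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    # Approximate syllables as runs of vowels, at least 1 per word.
    syllables = sum(max(len(re.findall(r"[aeiouy]+", w.lower())), 1)
                    for w in words)
    n = max(len(words), 1)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)
```

Higher scores mean easier text; short monosyllabic sentences score above 100.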

Methodological information (example phrases in quotes)
  • Sample count
  • Sample noun
  • Sample detail, e.g., "AMT, in China"
  • Exclusion count
  • Exclusion reason, e.g., "could not describe a partner transgression"
  • Experiment reference, e.g., "this study"
  • Experimental condition, e.g., "control condition"
  • Experimental variable/factor, e.g., "reaction time, sleep quality"
  • e.g., "questionnaire, sem"
Statistical tests

  • 25 statistical tests and values, e.g., p, R, R², d, F-test, t-test, mean, median, standard deviation, confidence levels, odds ratio, non-significance, etc.
  • Extracted elements are clustered by proximity
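Clustering extracted elements by proximity can be sketched as grouping character-offset spans whose gaps fall under a threshold. The threshold value and span representation here are assumptions; the paper does not report them:

```python
def cluster_by_proximity(spans, max_gap=50):
    """Group (start, end) character spans separated by at most `max_gap`.

    Illustrative sketch of proximity clustering; the real threshold
    is an assumption, not reported by the paper.
    """
    clusters = []
    for span in sorted(spans):
        if clusters and span[0] - clusters[-1][-1][1] <= max_gap:
            clusters[-1].append(span)  # close to the previous cluster
        else:
            clusters.append([span])    # start a new cluster
    return clusters
```

Spans extracted from the same statistical report (e.g., a t-value and its p-value) end up in one cluster, while mentions from different paragraphs separate.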
The ground truth was compiled from about 150 PDFs from the Journal of Experimental Psychology. The replicability judgment was based on the percentage of known replications relative to known failures: for example, if an experiment failed 3 times and replicated 5 times, it was labeled replicable. The authors used a random forest model with 5,000 estimators (higher than usual) and a max depth of 3. The testing corpus includes 11 psychological papers (it is unclear how they were selected, and the test set is quite small). The model correctly predicted 10 out of 11 papers.
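The labeling rule and the reported model configuration can be written down directly. The majority threshold in the labeling function is an assumption (the paper states only that the label is percentage-based):

```python
from sklearn.ensemble import RandomForestClassifier

def replicability_label(replications, failures):
    """Label an experiment replicable when known replications outnumber
    known failures. The >50% threshold is an assumed reading of the
    paper's percentage-based rule."""
    return replications / (replications + failures) > 0.5

# Reported model configuration: 5000 trees, each limited to depth 3.
model = RandomForestClassifier(n_estimators=5000, max_depth=3)
```

With 5 replications and 3 failures, `replicability_label(5, 3)` returns `True`, matching the example in the paper.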

There are other interesting papers, briefly summarized below:

Importance Assessment in Scholarly Networks (by Saurav Manchanda [UMN])

  • The paper proposed a new index called the Content Informed Index (CII) that quantitatively measures the impact of scholarly papers by weighing the edges of the citation network. The basic idea is that each paper can be represented as a set of historical concepts H and a set of novel concepts N. The contribution of a cited paper P_j towards the citing paper P_i is the set of concepts C_ji = C_j ∩ H_i. The task is to quantify the extent to which C_ji contributes towards H_i. 
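The edge-level quantity C_ji = C_j ∩ H_i can be sketched with plain sets. The coverage-fraction score below is an illustrative assumption; the paper's actual quantification of the contribution is more involved:

```python
def contribution(cited_concepts, citing_historical):
    """Sketch of CII's per-edge contribution: the concepts of the cited
    paper P_j that appear among the citing paper's historical concepts
    H_i. The coverage fraction used as a score here is an assumption,
    not the paper's actual weighting."""
    shared = cited_concepts & citing_historical  # C_ji = C_j ∩ H_i
    return shared, len(shared) / max(len(citing_historical), 1)
```

For example, if P_j contributes {"lstm", "attention"} and H_i = {"attention", "crf"}, the shared set is {"attention"} and it covers half of H_i.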

A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification (by Xiangci Li [UT-Dallas, UCLA]) 

  • The paper is based on a previously published paper titled "Fact or Fiction: Verifying Scientific Claims" (Wadden et al. 2020). Wadden's paper proposed a dataset called SciFact that contains a set of claims and the evidence annotated on scientific papers. It also proposed a framework called VeriSci that automatically searches for evidence rationale sentences for a given claim in a scientific paper. Li's paper proposed a multitask learning model to improve on Wadden's work. The authors computed a sequence of contextualized sentence embeddings from a BERT model and jointly trained the model on rationale selection and stance prediction. 
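The joint setup can be sketched as two classification heads sharing one sentence representation. This toy PyTorch module is only a structural illustration: the encoder is omitted (a random embedding stands in for BERT output), and the head dimensions are assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Toy sketch: a shared sentence encoding (stand-in for BERT
    contextualized sentence embeddings) feeds two heads trained jointly
    on rationale selection and stance prediction. Dimensions are
    illustrative assumptions."""

    def __init__(self, hidden=768):
        super().__init__()
        self.rationale = nn.Linear(hidden, 2)  # sentence is a rationale or not
        self.stance = nn.Linear(hidden, 3)     # supports / refutes / not enough info

    def forward(self, sentence_embeddings):
        return self.rationale(sentence_embeddings), self.stance(sentence_embeddings)
```

Joint training would sum the losses of the two heads so gradients from both tasks shape the shared encoder.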

Graph Knowledge Extraction of Causal, Comparative, Predictive, and Proportional Associations in Scientific Claims with a Transformer-Based Model (by Ian H. Magnusson [Sift])

  • This paper proposed a software framework that performs knowledge graph extraction from sentence-level scientific claims. The graph contains a set of pre-defined entities and relationships. The paper is based on a previous dataset called SciERC (Luan et al. 2018) and the transformer-based model SpERT (Eberts and Ulges 2020). The paper extends the SpERT model with multi-label attributes such as causation, correlation, comparison, increase, decrease, and indicates, and also performs span-based classification. The entity types include factors, evidence, association, epistemic, magnitudes, and qualifiers. The relations include arg0, arg1, comp_to, modifies, q+, and q-. The authors annotated 299 sentences of scientific claims from social and behavioral science (SBS), PubMed, and CORD-19 papers. They reported 10-fold cross-validation results for their best-performing models: micro-averaged F1 of 80.88% (entities), 83.81% (attributes), and 63.75% (relations). 
Other papers of interest to me include:
  • On Generating Extended Summaries of Long Documents (by Sajad Sotudeh [Georgetown])
  • Identifying Used Methods and Datasets in Scientific Publications (by Michael Färber [KIT])
  • AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21 (by Danqing Zhu)
  • Probing the SpanBERT Architecture to interpret Scientific Domain Adaptation Challenges for Coreference Resolution (by Hari Timmapathini)
The proceedings will be published online at CEUR-WS.org. 

-- Jian Wu