2021-02-09: The 1st AAAI Workshop on Scientific Document Understanding (SDU 2021)
The AAAI-21 Workshop on Scientific Document Understanding (SDU) was co-located with the 35th AAAI Conference on Artificial Intelligence. This is another workshop focusing on scientific documents, following the Workshop on Scholarly Document Processing (SDP), which was last held at EMNLP 2020.
The workshop accepted 23 papers (14 long and 9 short) on the following topics:
- Information extraction: 7
- Information veracity or significance: 5
- Biomedical text processing: 4
- Scientific image processing: 3
- Document classification: 2
- Summarization: 1
- Reading comprehension: 1
- Understanding and Predicting Retractions of Published Work (by Sai Ajay Modukuri [PSU])
- Recognizing Figure Labels in Patents (by Ming Gong [Dayton])
Retraction Prediction Paper
The retraction prediction paper attempts binary classification of papers into those that should be retracted and those that should not, using classical machine learning methods. The data comes from the Retraction Watch database and includes 8087 retractions across multiple disciplines, such as health sciences (5396 papers), social sciences (2651 papers), and humanities (366 papers); more than one subject may be listed for a given paper. The features include:
- Lead author university ranking
- Journal impact score SJR
- Citation Next: the average number of citations of a published work in the first 3-5 years after it was published.
- Citation velocity: the average rate at which the paper is cited in recent years, excluding self-citations.
- Citation and reference intents
- Whether the paper is open access
- Subject area
- Country of the primary author's affiliation
- The number of references
- The number of authors
- Sample size derived from p-value extraction
- Acknowledgments
- Self-citations
- Word embedding of the abstract: the authors tried Doc2vec, BioSenVec, SciBERT, and TFIDF (dimension reduced by SVD)
The 10-fold cross-validation achieved an F1 of 71.37%. The paper performed an ablation analysis and found that the following features do not achieve any separability and thus could not be used for retraction prediction: self-citations, university ranking, and open access. The features with the strongest predictive power appear to be the country of the primary author's affiliation, SJR, and the abstract embedding (TFIDF reduced to 15 dimensions by SVD).
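A minimal sketch of such a pipeline, assuming scikit-learn, a random-forest model, and hypothetical column names (papers.csv, sjr, n_references, n_authors); the TFIDF-plus-SVD abstract embedding follows the paper's description:

```python
# Sketch: binary retraction classifier with TFIDF+SVD abstract features
# plus tabular metadata, evaluated with 10-fold cross-validated F1.
# The file name, column names, and RandomForest choice are illustrative assumptions.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("papers.csv")  # hypothetical file with these columns
X = df[["abstract", "sjr", "n_references", "n_authors"]]
y = df["retracted"]  # 1 = retracted, 0 = not retracted

features = ColumnTransformer([
    # Abstract embedding: TFIDF reduced to 15 dimensions by SVD, as in the paper.
    ("abstract", Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000)),
        ("svd", TruncatedSVD(n_components=15)),
    ]), "abstract"),
    # Numeric metadata features pass through unchanged.
    ("meta", "passthrough", ["sjr", "n_references", "n_authors"]),
])

clf = Pipeline([("features", features),
                ("model", RandomForestClassifier(n_estimators=300))])

scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print(f"10-fold CV F1: {scores.mean():.4f}")
```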
Patent Label Extraction Paper
The patent label extraction paper proposed an algorithm to extract labels from patent figures. Patent figures may contain one or more subfigures, each with an associated label. To separate the text from the figure, the paper proposed an alpha-shape-based method. An alpha-shape is the smallest polygon that encloses all the points (foreground pixels) in an area, similar to a convex hull but with an alpha-allowance for non-convexity (Edelsbrunner, Kirkpatrick, and Seidel 1983). The pipeline steps are listed below; a sketch of the alpha-shape test follows the list.
- Binarize the input image
- Fill closed regions
- Identify label candidates
- Remove dashed lines
- Generate alpha-shapes
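A minimal sketch of the alpha-shape test, assuming the common Delaunay-based construction in which a triangle is kept when its circumradius is below 1/alpha (the paper's exact formulation may differ):

```python
# Sketch of an alpha-shape over a set of 2D foreground-pixel coordinates.
# Convention assumed here: keep Delaunay triangles whose circumradius < 1/alpha;
# the boundary of the kept triangles forms the alpha-shape.
import numpy as np
from scipy.spatial import Delaunay

def alpha_shape_triangles(points, alpha):
    """Return the Delaunay triangles (as index triples) kept by the alpha test."""
    tri = Delaunay(points)
    kept = []
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        # Side lengths, then circumradius R = abc / (4 * area) via Heron's formula.
        la, lb, lc = np.linalg.norm(b - c), np.linalg.norm(a - c), np.linalg.norm(a - b)
        s = (la + lb + lc) / 2.0
        area = max(s * (s - la) * (s - lb) * (s - lc), 0.0) ** 0.5
        if area > 0 and (la * lb * lc) / (4.0 * area) < 1.0 / alpha:
            kept.append(simplex)
    return np.array(kept)

pts = np.random.rand(200, 2)  # stand-in for foreground pixel coordinates
print(len(alpha_shape_triangles(pts, alpha=5.0)), "triangles kept")
```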
After the paper was presented, we tried Amazon Rekognition OCR, which achieved an F1 of 0.93 on the same dataset. Although this is still below Google's result, Rekognition is much easier to use than the Google Vision API.
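For reference, a minimal boto3 call for the Rekognition experiment (AWS credentials and the patent_figure.png path are assumptions):

```python
# Sketch of running Amazon Rekognition OCR on a patent figure (requires AWS
# credentials; image size limits apply for the Bytes API).
import boto3

client = boto3.client("rekognition")
with open("patent_figure.png", "rb") as f:
    resp = client.detect_text(Image={"Bytes": f.read()})

# Keep word-level detections; figure labels are short tokens like "FIG. 2".
words = [d["DetectedText"] for d in resp["TextDetections"] if d["Type"] == "WORD"]
print(words)
```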
Towards A Robust Method for Understanding the Replicability of Research (by Ben Gelman [Two-Six])
The authors used "Automator" to extract text from the PDFs, converting them to RTF and then HTML. In "Semantic Tagging", the authors developed a random forest model that tags each sentence with one of six categories: Introduction, Methodology, Results, Discussion, Research Practices, and References. They use the Universal Sentence Encoder to encode each sentence into a dense vector. The ground truth includes 81k labeled sentences extracted from 838 papers. The F1 of the semantic tags ranged from 0.56 to 0.84.
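A minimal sketch of this tagging setup, assuming TensorFlow Hub's public Universal Sentence Encoder and scikit-learn's random forest; the toy sentences and labels stand in for the 81k-sentence ground truth:

```python
# Sketch of the semantic-tagging setup: Universal Sentence Encoder embeddings
# fed to a random-forest classifier over the six section categories.
import tensorflow_hub as hub
from sklearn.ensemble import RandomForestClassifier

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["We recruited twenty-four students.",
             "Table 2 reports the regression results."]
labels = ["Methodology", "Results"]  # toy stand-ins for the labeled corpus

X = embed(sentences).numpy()  # 512-dimensional dense vectors
clf = RandomForestClassifier(n_estimators=200).fit(X, labels)
print(clf.predict(embed(["Participants were paid $1."]).numpy()))
```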
The information extraction step is more involved. The authors extracted three types of information: language quality of paragraphs, methodological information, and statistical tests, tabulated below:
Language quality of paragraphs

| Metrics | Models |
| --- | --- |
| Readability | Flesch Reading Ease (Wikipedia) |
| Subjectivity | TextBlob (Tutorial) |
| Sentiment | AllenNLP (Demo) |
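A hedged sketch of these paragraph metrics, using textstat for Flesch Reading Ease and TextBlob for subjectivity; TextBlob's polarity stands in here for the AllenNLP sentiment model the authors used:

```python
# Sketch of the paragraph-quality metrics. textstat computes Flesch Reading
# Ease; TextBlob gives subjectivity and (as a lightweight stand-in for
# AllenNLP) sentiment polarity.
import textstat
from textblob import TextBlob

paragraph = ("Participants completed the questionnaire in the control "
             "condition and reaction times were recorded.")

blob = TextBlob(paragraph)
print("readability :", textstat.flesch_reading_ease(paragraph))
print("subjectivity:", blob.sentiment.subjectivity)
print("sentiment   :", blob.sentiment.polarity)
```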
Methodological information

| Property | Example |
| --- | --- |
| Sample count | twenty-four |
| Sample noun | students |
| Sample detail | AMT, in china |
| Compensation | $1 |
| Exclusion count | SE |
| Exclusion reason | could not describe a partner transgression |
| Experiment reference | this study |
| Experimental condition | control condition |
| Experimental variable/factor | reaction time, sleep quality |
| Method/material | questionnaire, sem |
Statistical tests
- 25 statistical tests and values, e.g., p, R, R2, d, F-tests, T-test, mean, median, standard deviation, confidence levels, odds ratio, non-significance, etc.
- Cluster extracted elements by proximity
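A minimal sketch of this extraction, assuming regex patterns for a few of the 25 value types and a simple character-distance window for the proximity clustering:

```python
# Sketch of statistical-test extraction: regex-match reported values, then
# cluster matches whose positions fall within a character-distance window.
# The patterns cover only a few of the paper's 25 value types.
import re

TEXT = "The effect was significant (t(23) = 2.41, p = .024, d = 0.49)."
PATTERNS = {
    "t_test": r"t\(\d+\)\s*=\s*-?\d+\.?\d*",
    "p_value": r"p\s*[<=>]\s*\.?\d+\.?\d*",
    "cohens_d": r"\bd\s*=\s*-?\d+\.?\d*",
}

matches = sorted(
    (m.start(), name, m.group())
    for name, pat in PATTERNS.items()
    for m in re.finditer(pat, TEXT)
)

# Group matches whose start positions are within 40 characters of each other.
clusters, window = [], 40
for pos, name, text in matches:
    if clusters and pos - clusters[-1][-1][0] <= window:
        clusters[-1].append((pos, name, text))
    else:
        clusters.append([(pos, name, text)])
print(clusters)
```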
Importance Assessment in Scholarly Networks (by Saurav Manchanda [UMN])
- The paper proposed a new index called the Content Informed Index (CII) that quantitatively measures the impact of scholarly papers by weighing the edges of the citation network. The basic idea is that each paper can be represented as a set of historical concepts H and a set of novel concepts N. The contribution of a cited paper P_j towards the citing paper P_i is the set of concepts C_ji = C_j ∩ H_i, where C_j is the set of concepts in P_j. The task is to quantify the extent to which C_ji contributes towards H_i.
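A speculative sketch of the edge-weighting idea under simplifying assumptions (toy concept sets, overlap-fraction weights, and PageRank as the ranking step; the paper's actual quantification may differ):

```python
# Sketch of the CII idea: weight citation edge i -> j by the share of the
# citing paper's historical concepts H_i contributed by the cited paper's
# concepts C_j, then rank papers on the weighted citation graph.
import networkx as nx

papers = {
    "P1": {"H": {"bert", "attention"}, "N": {"span-pretraining"}},
    "P2": {"H": {"attention"},         "N": {"bert"}},
    "P3": {"H": set(),                 "N": {"attention"}},
}
citations = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]  # citing -> cited

G = nx.DiGraph()
for citing, cited in citations:
    concepts_j = papers[cited]["H"] | papers[cited]["N"]   # C_j
    overlap = concepts_j & papers[citing]["H"]             # C_ji = C_j ∩ H_i
    weight = len(overlap) / max(len(papers[citing]["H"]), 1)
    G.add_edge(citing, cited, weight=weight)

print(nx.pagerank(G, weight="weight"))
```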
A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification (by Xiangci Li [UT-Dallas, UCLA])
- The paper builds on a previously published paper titled "Fact or Fiction: Verifying Scientific Claims" (Wadden et al. 2020). Wadden's paper proposed a dataset called SciFact, which contains a set of claims and the evidence annotated on scientific papers, and a framework called VeriSci that automatically searches for evidence rationale sentences given a claim and a scientific paper. Li's paper proposed a multitask learning model to improve on Wadden's work: the authors computed a sequence of contextualized sentence embeddings from a BERT model and jointly trained the model on rationale selection and stance prediction.
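A minimal sketch of the joint training setup, with random tensors standing in for the BERT sentence embeddings; dimensions, label sets, and the mean-pooling for stance are assumptions:

```python
# Sketch of the multitask head: a per-sentence rationale-selection head and a
# per-abstract stance head trained jointly over shared sentence embeddings.
import torch
import torch.nn as nn

hidden = 768
rationale_head = nn.Linear(hidden, 2)   # rationale / not-rationale per sentence
stance_head = nn.Linear(hidden, 3)      # SUPPORTS / REFUTES / NOT-ENOUGH-INFO

sent_emb = torch.randn(10, hidden)      # stand-in for BERT sentence embeddings
rationale_logits = rationale_head(sent_emb)          # (10, 2)
stance_logits = stance_head(sent_emb.mean(dim=0))    # pooled abstract -> (3,)

# Joint loss over both tasks (random labels as placeholders).
loss = (nn.functional.cross_entropy(rationale_logits, torch.randint(0, 2, (10,)))
        + nn.functional.cross_entropy(stance_logits.unsqueeze(0),
                                      torch.randint(0, 3, (1,))))
loss.backward()  # gradients flow into both heads jointly
```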
Graph Knowledge Extraction of Causal, Comparative, Predictive, and Proportional Associations in Scientific Claims with a Transformer-Based Model (by Ian H. Magnusson [Sift])
- This paper proposed a software framework that extracts a knowledge graph from sentence-level scientific claims. The graph contains a set of pre-defined entities and relationships. The paper builds on a previous dataset called SciERC (Luan et al. 2018) and the transformer-based model SpERT (Eberts and Ulges 2020). The paper extends the SpERT model with multi-label attributes such as causation, correlation, comparison, increase, decrease, and indicates, and also performs span-based classification. The entity types include factors, evidence, association, epistemic, magnitudes, and qualifiers. The relations include arg0, arg1, comp_to, modifies, q+, and q-. The authors annotated 299 sentences of scientific claims from social and behavioral science (SBS), PubMed, and CORD-19 papers. They reported 10-fold cross-validation results for their best-performing models: micro-averaged F1 of 80.88% (entities), 83.81% (attributes), and 63.75% (relations).
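A minimal sketch of the attribute extension: entity types are mutually exclusive while attributes like causation and correlation can co-occur on one span, so a sigmoid multi-label head sits beside the softmax entity head (all sizes are assumptions):

```python
# Sketch of a multi-label attribute head over span representations: softmax
# for the mutually exclusive entity types, independent sigmoids for the
# attributes that may co-occur on a single span.
import torch
import torch.nn as nn

hidden, n_entity_types, n_attributes = 768, 6, 6
entity_head = nn.Linear(hidden, n_entity_types)   # softmax: one type per span
attr_head = nn.Linear(hidden, n_attributes)       # sigmoid: multi-label

span_emb = torch.randn(4, hidden)  # stand-in for pooled span representations
entity_probs = entity_head(span_emb).softmax(dim=-1)
attr_probs = torch.sigmoid(attr_head(span_emb))   # each attribute independently
print(entity_probs.shape, attr_probs.shape)
```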
- On Generating Extended Summaries of Long Documents (by Sajad Sotudeh [Georgetown])
- Identifying Used Methods and Datasets in Scientific Publications (by Michael Färber [KIT])
- AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21 (by Danqing Zhu)
- Probing the SpanBERT Architecture to interpret Scientific Domain Adaptation Challenges for Coreference Resolution (by Hari Timmapathini)