2021-01-11: Trip report on SDP 2020 (First Workshop on Scholarly Document Processing) @ EMNLP 2020

The First Workshop on Scholarly Document Processing (SDP 2020) was co-located with EMNLP 2020 on Nov. 19, 2020. The workshop is dedicated to research on mining scientific literature. SDP 2020 included a research track and three shared tasks: the 6th CL-SciSumm (which has also run in previous years), the 1st LongSumm, and the 1st LaySumm. The research track received 34 submissions, of which seven were accepted as full papers and two as short papers; there were also 11 posters. In addition, 18 papers were submitted to the shared task track.


Shared task 1: LongSumm. Guy Feigenblat from IBM Research AI introduced this task. Most work on summarization generates short summaries, but for scientific articles researchers often need a longer, more detailed summary. This task therefore focuses on generating long summaries of academic articles. IBM has already built such a system, IBM Science Summarizer, and is continuing to improve it. The training and testing data can be found in their GitHub repo.

Figure 1: IBM Science Summarizer


IBM Science Summarizer generates long summaries of academic articles. Rather than producing a single paragraph for the whole paper, it works section by section: for each section, such as the introduction, dataset, or design, it generates a summary independently. You can then read them one by one and get an overview of each part of the paper.
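The section-by-section idea is easy to prototype. Below is a minimal sketch (not IBM's actual system) that summarizes each section of an already-parsed paper independently with an off-the-shelf Hugging Face summarization model; the `paper_sections` contents and the model choice are illustrative assumptions.

```python
# Minimal sketch of section-wise long summarization (illustrative only,
# not IBM Science Summarizer). Assumes the paper has already been parsed
# into a {section_title: section_text} mapping.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

paper_sections = {
    "Introduction": "Scientific articles keep getting longer, and short abstracts "
                    "often omit details that researchers need. We study how to "
                    "produce long, structured summaries of such articles.",
    "Dataset": "We collect 1,000 articles with author-written section notes and "
               "use them as training data for section-level summarization.",
}

section_summaries = {}
for title, text in paper_sections.items():
    # Summarize each section independently; very long sections would need
    # truncation or chunking to fit the model's input limit.
    out = summarizer(text, max_length=80, min_length=20, do_sample=False)
    section_summaries[title] = out[0]["summary_text"]

# Concatenating the per-section summaries yields one long, structured summary.
long_summary = "\n\n".join(f"{t}: {s}" for t, s in section_summaries.items())
print(long_summary)
```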


Gharebagh et al. (GUIR @ LongSumm 2020) provide another solution to this task. They propose a multi-task approach that outperforms BERT-based summarization models, as well as a two-stage model based on BART.


Shared task 2: CL-SciSumm 2020 (the 6th Computational Linguistics Scientific Document Summarization Shared Task). This task aims to summarize a paper (called the Reference Paper, or RP) by analyzing the papers that cite it (called Citing Papers, or CPs). The general steps are as follows:

  1. Identify the spans of text (cited text spans) in the RP that most accurately reflect the citing papers.
  2. Identify which facet of the RP each cited text span belongs to.
  3. Generate a summary of the RP from all the cited text spans.

 The training and testing data can be found in their GitHub repo.
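As a concrete illustration of step 1, the sketch below ranks the RP's sentences by TF-IDF cosine similarity to a citing sentence and takes the top-ranked ones as the cited text span. This is only a simple baseline written for illustration, with toy data, not an official CL-SciSumm baseline.

```python
# Illustrative baseline for step 1 (cited text span identification):
# rank the RP's sentences by TF-IDF cosine similarity to the citing sentence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

rp_sentences = [  # sentences of the Reference Paper (toy example)
    "We propose a neural model for citation-based summarization.",
    "The dataset contains 1,000 annotated reference papers.",
    "Our model outperforms the TF-IDF baseline by 3 ROUGE points.",
]
citing_sentence = "Smith et al. report a 3-point ROUGE gain over a TF-IDF baseline."

vec = TfidfVectorizer().fit(rp_sentences + [citing_sentence])
sims = cosine_similarity(vec.transform([citing_sentence]),
                         vec.transform(rp_sentences))[0]

# Take the top-k most similar RP sentences as the predicted cited text span.
for i in sims.argsort()[::-1][:2]:
    print(f"{sims[i]:.2f}  {rp_sentences[i]}")
```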


Shared task 3: CL-LaySumm 2020 (the 1st Computational Linguistics Lay Summary Challenge Shared Task). This task aims to generate a summary of an academic paper for the general public (a lay summary). It differs from SciSumm in that the lay summary must not only be representative of the content but also be readable by a lay audience with little technical background. It should avoid excessive technical detail, and the target length is 70-100 words. A small sample dataset can be found in their GitHub repo.


Here is a list of papers tackling these shared tasks:



Keynote 1 was given by Kuansan Wang from Microsoft Research on their findings in “Mitigating scholarly corpus biases with citations: A case study on CORD-19”. This is exactly the dataset our paper focuses on, and he is from one of the institutions that produce the dataset, so I was especially interested in this talk and will describe it in detail here.


CORD-19 (the COVID-19 Open Research Dataset) is a dataset created by the Allen Institute for AI together with several other institutions. It includes the papers directly associated with the topic of COVID-19. The authors compare the properties of this dataset with all potentially relevant papers in the scholarly literature and identify the biases introduced by the document collection process in terms of the distribution over subject fields, journal importance, and author networks.


The potentially relevant papers are found by following citations and references. Three extended datasets are built, depending on how far the expansion goes in both directions (papers that cite and papers that are cited). As shown in Figure 2, they collect datasets at hop 1, hop 3, and hop 11 in both directions; since the number of papers barely changes after hop 11, hop 11 yields roughly the maximal set of papers they want. These datasets are called CORD-19E, CORD-19I, and CORD-19C, respectively, and they serve as the "potentially relevant papers in the scholarly literature" against which the (potentially biased) CORD-19 dataset is compared.

Figure 2: The datasets generated at different expansion steps (by Kuansan Wang)
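The hop-based expansion is essentially a breadth-first traversal of the citation graph in both directions. Here is a minimal sketch, assuming the graph is available as in-memory citation and reference adjacency maps (the actual study works over a full scholarly graph):

```python
# Sketch of the hop-based expansion: starting from the CORD-19 seed set,
# repeatedly add every paper that cites, or is cited by, the current set.
# `citations` and `references` are assumed to be dicts mapping a paper id
# to the ids of its citing papers and referenced papers, respectively.
def expand(seed_ids, citations, references, hops):
    """Return the set of papers reachable within `hops` steps,
    following citation links in both directions."""
    current = set(seed_ids)
    for _ in range(hops):
        frontier = set()
        for pid in current:
            frontier.update(citations.get(pid, ()))   # papers citing pid
            frontier.update(references.get(pid, ()))  # papers pid cites
        new = frontier - current
        if not new:  # fixed point reached (observed around hop 11)
            break
        current |= new
    return current

# cord19e = expand(cord19_ids, citations, references, hops=1)
# cord19i = expand(cord19_ids, citations, references, hops=3)
# cord19c = expand(cord19_ids, citations, references, hops=11)
```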


Figure 3 shows how topic coverage differs across the four datasets; CORD-19, CORD-19E, CORD-19I, and CORD-19C correspond to iterations 0, 1, 3, and 11, respectively. For CORD-19, almost 90% of the papers come from biology and medicine, but as more relevant papers from the broader scholarly literature are included, biology and medicine end up accounting for less than 30% of the papers. The bias is especially pronounced when comparing CORD-19 with CORD-19C. This is one of the clearest biases of the CORD-19 dataset relative to the extended datasets.

Figure 3: Topic coverage of the CORD-19 dataset compared with the extended datasets (by Kuansan Wang)


Figure 4 shows the distribution of papers over publication years in these datasets. Here, too, we can see a clear bias in CORD-19 relative to the extended datasets CORD-19E, CORD-19I, and CORD-19C: CORD-19 has steep jumps in 2003 and 2019 and is skewed toward recent years, a trend not seen in the three extended datasets.


Figure 4: Distribution of articles over publication year in the four datasets (by Kuansan Wang)
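A minimal sketch of how comparisons like Figures 3 and 4 could be reproduced, assuming a table with one row per (dataset, paper) pair and hypothetical `dataset`, `field`, and `year` columns:

```python
# Sketch of the bias comparison behind Figures 3 and 4 (toy data only).
import pandas as pd

papers = pd.DataFrame({
    "dataset": ["CORD-19", "CORD-19", "CORD-19C", "CORD-19C"],
    "field":   ["Medicine", "Biology", "Chemistry", "Medicine"],
    "year":    [2020, 2003, 1998, 2020],
})  # toy rows; the real tables hold millions of papers

# Figure 3: share of each subject field within each dataset.
field_share = (papers.groupby("dataset")["field"]
                     .value_counts(normalize=True)
                     .rename("share"))
print(field_share)

# Figure 4: distribution of publication years within each dataset.
year_dist = (papers.groupby("dataset")["year"]
                   .value_counts(normalize=True)
                   .sort_index())
print(year_dist)
```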


In conclusion, the CORD-19 dataset shows a strong tilt toward recent work and toward specific topics. The paper points this out and suggests that research based on CORD-19 should take these biases into account in order to produce statistically sound results.


The long and short papers were presented at the workshop. Most of them were recorded in advance and presented online.

I will highlight the ones I found most impressive here.

Bhambhoria et al.: A Smart System to Generate and Validate Question Answer Pairs for COVID-19 Literature is the first one that impressed me. They generate and validate question-answer (QA) pairs for the COVID-19 literature. QA systems for COVID-19 were first prompted by the Kaggle competition, though work is not limited to that competition, and different groups have offered solutions such as clustering, BERT, and hybrid models. However, the lack of annotated data is still a problem for this dataset. This paper addresses it by generating QA pairs with transformer and rule-based models (silver data) and then engaging subject matter experts (SMEs) to annotate and verify the QA pairs through a web application (gold data). This step turns silver data into high-quality gold data, and this is the first work that creates QA pairs with quality verification by SMEs.
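To make the silver-data idea concrete, here is a minimal rule-based sketch that turns sentences containing a named entity into cloze-style question-answer pairs. It is my own illustration under simple assumptions, not the transformer-plus-rules pipeline from the paper, which additionally routes the pairs to SMEs for verification.

```python
# Minimal rule-based sketch of silver QA-pair generation: turn sentences that
# contain a named entity into cloze-style questions with the entity as answer.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def silver_qa_pairs(text):
    pairs = []
    for sent in nlp(text).sents:
        for ent in sent.ents:
            # Blank out the entity to form a cloze question; keep it as the answer.
            question = sent.text.replace(ent.text, "____", 1)
            pairs.append({"question": question, "answer": ent.text, "type": ent.label_})
    return pairs

sample = ("Remdesivir was evaluated in a randomized trial of 1,063 patients. "
          "The study was led by the National Institutes of Health in 2020.")
for qa in silver_qa_pairs(sample):
    print(qa)
```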


Another paper I found interesting is Satish et al.: The impact of preprint servers in the formation of novel ideas. They try to answer the following question with a Bayesian method: where do novel ideas in biomedical research appear first, in preprints or in traditional journals? They use a Bayesian approach (BAND) to measure the frequency of new terms and phrases. By pinpointing when a new term emerges, they can see when a new idea appears and compare journals (primarily represented by PubMed) with preprints (primarily represented by bioRxiv, medRxiv, and so on). Among existing ways to measure novelty, this approach is new and insightful, and it may be adopted in future research on related topics.
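The core comparison can be illustrated without the full Bayesian machinery: for each candidate term, find its earliest dated occurrence among preprints and among journal articles, then see which venue type came first. A toy sketch under that simplification (not the BAND model itself), with made-up terms and dates:

```python
# Toy illustration of the preprint-vs-journal comparison: for each term, find
# its earliest occurrence in each venue type and see where it appeared first.
# The actual paper uses a Bayesian approach (BAND) rather than this shortcut.
import pandas as pd

occurrences = pd.DataFrame({
    "term":  ["term-A", "term-A", "term-B", "term-B"],
    "venue": ["preprint", "journal", "journal", "preprint"],
    "date":  pd.to_datetime(["2019-05-01", "2019-11-12",
                             "2020-02-03", "2020-01-20"]),
})  # made-up rows; real occurrences would come from PubMed, bioRxiv, medRxiv, etc.

first_seen = (occurrences.groupby(["term", "venue"])["date"]
                          .min()
                          .unstack("venue"))
first_seen["first_in"] = (first_seen["preprint"] < first_seen["journal"]).map(
    {True: "preprint", False: "journal"})
print(first_seen)
```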


Here is the list of long and short papers presented at the workshop:

Long papers:

Short papers:



- Xin Wei
