2020-08-05: Trip report to WOSP: the 8th Workshop on Mining Scientific Publications


The International Workshop on Mining Scientific Publications (WOSP) started in 2012. Its main theme is the use of Natural Language Processing (NLP) and text mining tools to aid knowledge creation and improve the process by which research is done. This year, because of the COVID-19 pandemic, the entire workshop moved online and was held on August 5th, right after JCDL 2020. The workshop was organized by CORE, The Open University, UK, in collaboration with Oak Ridge National Laboratory (ORNL), Tennessee, US.


  1. Kuansan Wang (MSR Outreach Academic Services). The presentation was titled "Mitigating document collection biases with citations: A case study on CORD-19". It addressed a very important problem: document collection bias when conducting statistical analyses of distributions over subject fields, journal importance, and author networks. Kuansan's group used three citation network traversal algorithms to select three additional corpora of documents. By comparing the properties of the corpora selected by these algorithms, they arrived at two interesting findings. First, CORD-19 has a strong tilt in favor of recent articles and uneven coverage of topical fields and publication venues. As a result, it may not be suitable for identifying critical knowledge or assessing journal importance. Second, it does not appear to exhibit biases in describing research collaborations in terms of team sizes or geolocations.
  2. Neil Smalheiser (University of Illinois at Chicago, Department of Psychiatry). His presentation was titled "LBD: Beyond the ABCs" (see his paper with the same title, published in JASIST in 2012). Here, LBD stands for literature-based discovery, a particular type of text mining that identifies nontrivial implicit assertions within documents. The biggest challenge of LBD has been framed in terms of finding hypotheses that are novel, nontrivial, and likely to be true. Traditionally, this problem has been approached with the so-called ABC model (Swanson 1988, Swanson & Smalheiser 1997). For example, given the assertion "A affects B" appearing in one article and "B affects C" appearing in a different article, one can derive the implicit assertion "A affects C", which represents a potential hypothesis. Establishing gold standards for LBD systems is crucial but difficult. For more details about LBD, see the recent survey paper by Thilakaratne et al. (2019).
  3. David Jurgens (Assistant Professor, University of Michigan). His presentation topic was "Citation Classification for Behavioral Analysis of a Scientific Field". David analyzed the purposes of citations and studied how citation purpose correlates with discourse structure and publication venue. He also demonstrated that the way a paper cites related work is predictive of its citation count. Finally, he used changes in citation roles to show that the field of NLP has undergone a systematic change in its citation practices to become a rapid discovery science. A paper with the same title can be found on arXiv (Jurgens et al. 2016).
  4. Anne Lauscher (University of Mannheim). Her presentation title was "The Special Case of Scientific Argumentation: Analyzing Scitorics." She presented ArguminSci, a tool for analyzing argumentation and other rhetorical aspects of scientific writing, which the authors collectively dub "scitorics". Its main focus is the fine-grained argumentative analysis of scientific text through the identification of argument components. The functionality of ArguminSci is accessible via three interfaces: a command line tool, a RESTful application programming interface, and a web application. A paper titled "ArguminSci: A Tool for Analyzing Argumentation and Rhetorical Aspects in Scientific Writing" has been published.
  5. Allan Hanbury (Professor for Data Intelligence and Head of the E-Commerce Research Unit, TU Wien, Austria). His presentation title was "Supporting Systematic Reviews in Medicine." 
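
The ABC model mentioned in the LBD talk (item 2 above) can be illustrated with a few lines of Python. This is only a minimal sketch of the basic idea, not a system from the talk: the function name `abc_hypotheses`, the `(source, target, article_id)` tuple format, and the toy article identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def abc_hypotheses(assertions):
    """Derive candidate "A affects C" hypotheses from (A, B, article_id) triples.

    Hypothetical helper for illustration; real LBD systems also need to rank
    and filter candidates, which is the hard part.
    """
    # Map each term to the (target, article) pairs it is asserted to affect.
    affects = defaultdict(set)
    for source, target, article in assertions:
        affects[source].add((target, article))

    # Links that are already stated explicitly are not new hypotheses.
    known = {(source, target) for source, target, _ in assertions}

    hypotheses = set()
    for a, targets in affects.items():
        for b, article_ab in targets:
            for c, article_bc in affects.get(b, ()):
                # The two assertions must come from different articles,
                # and A -> C must not be stated anywhere already.
                if article_ab != article_bc and a != c and (a, c) not in known:
                    hypotheses.add((a, b, c))
    return sorted(hypotheses)


# Toy input: A->B and B->C come from different articles, D->E does not.
toy_assertions = [
    ("A", "B", "article-1"),
    ("B", "C", "article-2"),
    ("A", "D", "article-1"),
    ("D", "E", "article-1"),
]
candidates = abc_hypotheses(toy_assertions)
print(candidates)
```

On this toy input the only candidate is the A, B, C chain. Even this tiny example shows why gold standards matter: the code happily generates candidates, but it says nothing about which ones are novel, nontrivial, and likely to be true.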

The workshop featured 3 long papers and 4 short papers. Our paper titled "SmartCiteCon: Implicit Citation Context Extraction from Academic Literature Using Supervised Learning" was selected as one of the long papers. I could not attend all four sessions because of timing issues. Below, I briefly summarize two presentations I attended.

Long paper

Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and CORA (by Mark Grennan and Joeran Beel)

In this work, the authors compare the CRF-based citation parsing models implemented in GROBID when trained on organic versus synthesized citation strings. The latter were drawn from the 1 billion synthesized citation strings in the GIANT dataset. The authors made several interesting observations. Specifically, they found that synthetic and organic reference strings are equally suited for training GROBID (F1 = 0.74). They also found that adding more parsing fields improves performance, even if these fields are not available in the testing dataset. They further suggest that testing data for citation parsing should include datasets not drawn from the same parent sample as the training data.
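
To make the F1 figure more concrete, here is a minimal sketch of how a micro-averaged, field-level F1 score could be computed for a citation parser. This is my own assumption about a plausible evaluation setup, not the authors' script or GROBID's built-in evaluation; the function name `field_f1`, the exact-match criterion, and the toy reference are all hypothetical.

```python
def field_f1(gold, predicted):
    """Micro-averaged F1 over extracted fields, using exact string match.

    gold, predicted: lists of dicts mapping field name -> extracted string,
    one dict per reference string, in the same order.
    """
    tp = fp = fn = 0
    for gold_fields, pred_fields in zip(gold, predicted):
        # A predicted field counts as correct only if it matches the gold value.
        for field, value in pred_fields.items():
            if gold_fields.get(field) == value:
                tp += 1
            else:
                fp += 1
        # A gold field that is missing or wrong in the prediction is a miss.
        for field, value in gold_fields.items():
            if pred_fields.get(field) != value:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: one reference string with a wrong year and a missing title.
gold = [{"author": "Smith", "year": "2019", "title": "Parsing"}]
predicted = [{"author": "Smith", "year": "2018"}]
score = field_f1(gold, predicted)
print(round(score, 2))
```

For an out-of-sample evaluation in the spirit of the paper's recommendation, `gold` would come from a dataset that does not share a parent sample with the training data.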

The paper investigates citation parsing from a different perspective, studying the difference between organic and synthesized data. The motivation is legitimate and strong. The paper also addresses questions on the biases in training and testing datasets. 

The idea of training a citation parsing model on synthesized data is not new. Dr. Ed Fox discussed this idea with me during JCDL 2018. We then started a project on it in collaboration with Min-Yen Kan (NUS). We submitted a draft to JCDL 2019, but it was rejected because the work was half-baked. The main obstacle was finding a suitable tokenizer for citation strings and an annotator to automatically assign correct labels to tokens. It seems that Beel's group has solved this problem, but it is not clear why they did not train a neural model on top of this big dataset.

The paper is published on arXiv. The senior author of the paper, Joeran Beel, is well known for his work on mining scholarly big data.

Short paper

Representing and Reconstructing PhySH: Which Embedding Competent? (by Xiaoli Chen and Zhixiong Zhang)

The paper compares three types of representations for generating the vectors used to derive the hierarchical relationships between PhySH terms: word embeddings, graph embeddings, and Poincaré embeddings. For each type, the authors used different implementations and calculated the mean rank and mean average precision (MAP). They concluded that Poincaré embeddings (especially the PyTorch implementation) reconstruct PhySH from hypernym-hyponym relations better than the other embedding models. However, none of the three types of representations did a good job of reconstructing PhySH from free text.
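
For readers unfamiliar with the two metrics, here is a minimal sketch of how mean rank and MAP could be computed for a reconstruction task like this one. It is a simplified version written under my own assumptions (ranking against all other terms, hypothetical function name, made-up 1-D coordinates), not the paper's evaluation code.

```python
def mean_rank_and_map(terms, relations, distance):
    """Compute mean rank and mean average precision (MAP).

    terms:     list of all terms in the vocabulary
    relations: dict mapping a term to the set of its true related terms
    distance:  function (u, v) -> float, smaller means more closely related
    """
    ranks, average_precisions = [], []
    for u, true_set in relations.items():
        # Rank every other term by increasing distance from u.
        candidates = sorted((t for t in terms if t != u),
                            key=lambda t: distance(u, t))
        hits, precisions = 0, []
        for rank, t in enumerate(candidates, start=1):
            if t in true_set:
                hits += 1
                ranks.append(rank)
                precisions.append(hits / rank)
        if precisions:
            average_precisions.append(sum(precisions) / len(precisions))
    return sum(ranks) / len(ranks), sum(average_precisions) / len(average_precisions)


# Toy 1-D "embedding"; the coordinates are made up for illustration.
position = {"physics": 0.0, "optics": 1.0, "lasers": 1.5, "biology": 5.0}
relations = {"lasers": {"optics"}, "optics": {"physics"}}
mean_rank, mean_ap = mean_rank_and_map(
    list(position), relations, lambda u, v: abs(position[u] - position[v])
)
print(mean_rank, mean_ap)
```

A lower mean rank and a higher MAP both mean that the true hypernyms sit closer to each term in the embedding space than unrelated terms do.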

The research topic of learning hierarchical concept representations is interesting. However, since the Poincaré embedding was originally designed to represent hierarchical structures, it is somewhat expected that it beats the other two types of embeddings.

The paper used the abstracts and titles in the APS dataset, which contains more than half a million articles. However, the paper does not address how to deal with non-ASCII characters, such as those in math formulae, which are very prevalent in physical science papers.
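
To see why the Poincaré ball suits hierarchies, it helps to look at its distance function. The sketch below implements the standard Poincaré distance between two points inside the unit ball; the example points are arbitrary values I picked for illustration, and nothing here is taken from the paper's code.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))


# Two pairs of points with the same Euclidean gap (0.1),
# one pair near the origin and one pair near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(near_origin < near_boundary)
```

The same Euclidean gap corresponds to a much larger distance near the boundary than near the origin. This gives the space more and more room toward the edge, much like a tree that has more nodes at each deeper level, so general terms can sit near the center and specific terms near the boundary.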

Xiaoli Chen and Zhixiong Zhang are from the Chinese Academy of Sciences in China.

All papers of WOSP will be published in ACL proceedings.

Besides WOSP, there are other workshops whose themes are aligned with NLP and text mining on scholarly documents, such as the Workshop for Scientific Document Processing (SDP), co-located with EMNLP 2020. It seems that SDP will replace previous workshops including BIRNDL, BIR, and CLBib.

-- Jian Wu

