2021-11-02: Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2021) Workshop Trip Report


                                                             (source: https://2021.jcdl.org/)

Due to the global pandemic, the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2021) was held online during September 27-30, 2021, organized by the School of Information Sciences of the University of Illinois at Urbana-Champaign. Members of the Web Science and Digital Libraries Research Group (WSDL) attended the workshops, tutorials, paper sessions, posters, and demonstrations virtually via Zoom. Muntabir Choudhury wrote an excellent trip report covering the JCDL 2021 main conference.

The JCDL workshop on “Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2021)”, held on September 30, 2021, was organized by Dr. Chengzhi Zhang, Dr. Philipp Mayr, Dr. Wei Lu, and Dr. Yi Zhang. Its aim was to draw the attention of computer and information science scholars specializing in relevant fields (information extraction, text mining, NLP, etc.) to the open problems in the extraction and evaluation of knowledge entities from scientific documents.

The EEKE2021 workshop was divided into four sessions. Three papers were presented in each of the first three sessions; Session 4 was the longest, with six paper presentations.

Session 1: Entity Extraction and Application

Session 1 started with the presentation of the long paper “ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts” by Anastasia Zhukova, from the University of Wuppertal, Germany.

This paper proposed an unsupervised, Wiktionary-based automated entity annotator that extracts domain entities from German texts. Anastasia Zhukova explained that the entities are extracted by deriving the most domain-representative terms among those annotated automatically. This work addresses a core problem of Named Entity Recognition (NER): the expensive and labor-intensive human annotation required to create domain-specific NER datasets. ANEA replaces this time-consuming task by automating the categorization and annotation of the most representative terms.

The second paper presented in this session was “Joint Entity and Relation Extraction from Scientific Documents: Role of Linguistic Information and Entity Types” by Debarshi Kumar Sanyal, from the Indian Institute of Technology, India. The goal of this paper is to automatically extract entities and their relationships from a scientific abstract using deep neural networks. The authors used a pre-trained transformer, BERT, to obtain contextual embeddings of the tokens, and ScispaCy to generate part-of-speech (POS) tags for the input sentence. A fusion module then adds the BERT embedding of each token to the embedding of its POS tag. Finally, a shallow entity classifier identifies the entities, and a shallow relation classifier classifies the relationship between every pair of entities.
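The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors stand in for real BERT output, and the tag set, embedding size, and names (`fuse`, `pos_table`, `POS_VOCAB`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input sentence with hand-written POS tags.
tokens = ["BERT", "improves", "entity", "recognition"]
pos_tags = ["PROPN", "VERB", "NOUN", "NOUN"]

EMBED_DIM = 8  # assumed size; real BERT embeddings are much larger

# Stand-in for the contextual token embeddings a BERT encoder would return.
token_embeddings = rng.normal(size=(len(tokens), EMBED_DIM))

# One embedding row per POS tag (learned in a real model, random here).
POS_VOCAB = {"PROPN": 0, "VERB": 1, "NOUN": 2}
pos_table = rng.normal(size=(len(POS_VOCAB), EMBED_DIM))


def fuse(token_embeddings, pos_tags, pos_vocab, pos_table):
    """Add each token's POS-tag embedding to its token embedding."""
    pos_ids = np.array([pos_vocab[tag] for tag in pos_tags])
    return token_embeddings + pos_table[pos_ids]


fused = fuse(token_embeddings, pos_tags, POS_VOCAB, pos_table)
print(fused.shape)  # one fused vector per token
```

Element-wise addition keeps the dimensionality unchanged, so the downstream entity and relation classifiers see vectors of the same size as the original token embeddings.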

The last paper presented in this session was “Classification of URLs Citing Research Artifacts in Scholarly Documents based on Distributed Representations”, which describes methods for classifying URLs mentioned in scholarly papers into three categories:

  • Tool - a program, software, toolkit, etc.

  • Dataset - experimental data, observation data

  • Other - not research artifacts

The citation context of each URL is central to their work. They used two different approaches for the distributed representation of URLs:

  • Considering each URL as a word

  • Considering each element of a URL as a word
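The difference between the two representations can be sketched as two tokenization strategies. This is only an illustration: the report does not say how the authors split a URL into elements, so the splitting rule below (break on any non-alphanumeric character) and the example URL are assumptions.

```python
import re


def url_as_word(url: str) -> list[str]:
    """Approach 1: the whole URL is a single vocabulary item."""
    return [url]


def url_as_elements(url: str) -> list[str]:
    """Approach 2: each element of the URL is its own word.

    Assumed rule: split on runs of non-alphanumeric characters.
    """
    return [part for part in re.split(r"[^A-Za-z0-9]+", url) if part]


example_url = "https://github.com/example/toolkit"  # hypothetical URL
print(url_as_word(example_url))
print(url_as_elements(example_url))
```

With the first strategy, a URL only gets a useful embedding if it appears often enough in the training corpus; with the second, rare URLs can still share elements such as `github` with other URLs.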

The authors acknowledged that their approach does not discriminate the “data” category well, and they plan to extend this work to multi-label classification in the future.

Session 2: Keyword Extraction and Application

The theme of Session 2 was keyword extraction and its applications. The session started with a presentation by Liangping Ding, from the University of Chinese Academy of Sciences, China, on “Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature”. The authors developed a deep learning-based keyphrase extraction engine for Chinese scientific literature. They argued that the training corpora behind existing Chinese keyphrase extraction engines are small and not diverse, and that the models are not openly and widely accessible to researchers. In contrast, the authors asserted that their engine is built on large-scale training data from multiple domains. They conducted experiments on five models: four based on sequence labeling and one based on a span prediction formulation.


They deployed the keyphrase extraction model as a service and built a keyphrase extraction engine for Chinese scientific literature that is accessed through API calls.
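To make the sequence labeling formulation concrete, here is a minimal sketch of how per-token labels can be turned back into keyphrases. It assumes a simple B/I/O tagging scheme and uses a hand-written English example; the tagging scheme, the function name `decode_bio`, and the `separator` parameter are illustrative assumptions, not details from the paper.

```python
def decode_bio(tokens, tags, separator=" "):
    """Collect keyphrases from B (begin), I (inside), O (outside) tags."""
    phrases = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; close any phrase still open.
            if current:
                phrases.append(separator.join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open phrase, ends the phrase.
            if current:
                phrases.append(separator.join(current))
            current = []
    if current:
        phrases.append(separator.join(current))
    return phrases


tokens = ["deep", "learning", "improves", "keyphrase", "extraction"]
tags = ["B", "I", "O", "B", "I"]  # hypothetical model output
print(decode_bio(tokens, tags))  # ['deep learning', 'keyphrase extraction']
```

For character-level Chinese input, passing `separator=""` would join the characters of each phrase without spaces.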

Aofei Chang, from Peking University, China, presented the paper “Keyword Extraction and Technology Entity Extraction for Disruptive Technology Policy Texts” in Session 2. He started the presentation by discussing the importance of disruptive technologies.

He then discussed text collection, relevance judgement, and the construction of keyword extraction algorithms for disruptive technology texts. He also showed a comparison of the algorithms used (YAKE, TextRank, KeyBERT).

The last paper in Session 2 was “Extracting Domain Entities from Scientific Papers Leveraging Author Keywords” by Jiabin Peng, from Nanjing University of Science and Technology, China. In this paper, the authors proposed a method for extracting domain-based entities from scientific papers using author keywords. The core purpose of the paper is to reduce the dependency of current extraction methods on manually annotated corpora and to improve their ability to generalize.

Their experimental results show that SVM performs best among the five models, with an F1-score of 0.753.
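For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a set of extracted entities against a gold set; the entity names are made up for illustration and are unrelated to the paper's data or its reported score.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted entity sets.
gold = {"support vector machine", "conditional random field",
        "word embedding", "knowledge graph"}
predicted = {"support vector machine", "word embedding", "neural network"}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```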


Session 2 ended with Keynote 2, "Entity Summarization: Where We Are and What Lies Ahead?", by Gong Cheng. The presentation was largely based on his survey paper "Entity Summarization: State of the Art and Future Challenges". He opened with an example: a wiki offers summaries of entities, such as Illinois, so a reader does not need to read everything about them. But how can we summarize all entities?

The survey covers the current state of research and development aimed at the summarization problem. The authors identified more than 30 technical features used in this field and grouped them into two types: generic and specific. Gong explained a few generic features, including property scoring, statistical informativeness, ontological informativeness, and diversity. Specific features are used when a task requires domain knowledge, context awareness, or personalization; examples include query relevance and entity interdependence.

Gong noted that current methods are mainly unsupervised and combine multiple technical features using various frameworks. The authors also found some deep learning methods, but these are far from perfect because of the lack of training data.

Gong pointed out several future directions: the use of semantics, human factors, machine and deep learning, non-extractive methods, and interactive methods. 

Session 3: Knowledge Graph and Application

Session 3 kicked off with the presentation "Detecting Cross-Language Plagiarism using Open Knowledge Graphs" by Johannes Stegmüller, from the University of Wuppertal, Germany. The authors proposed a new multilingual retrieval model, Cross-Language Ontology-Based Similarity Analysis (CL-OSA), for identifying cross-language plagiarism. They showed that CL-OSA outperforms current methods at retrieving candidate documents from large, topically diverse test corpora that include distant language pairs. Johannes described how plagiarism detection works in a multilingual setting and gave a brief, informative overview of CL-OSA.

The second paper presented in Session 3 was “A PICO-based Knowledge Graph for Representing Clinical Evidence” by Yongmei Bai, from Peking University, Beijing, China. In this work, the authors introduced a method for generating PICO-based knowledge graphs from clinical trials. They used clinical trials about COVID-19 from ClinicalTrials.gov as the raw data and constructed two knowledge graphs, CTKG and CTRKG. These knowledge graphs support queries and batch exports, and they provide data for clinical evidence based on PICO.
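As a rough illustration of what querying a PICO-based graph can look like, the sketch below stores Population, Intervention, Comparison, and Outcome facts as subject-predicate-object triples and filters them by pattern. The trial identifiers, predicate names, and values are all hypothetical, and this is not the schema of CTKG or CTRKG.

```python
# Hypothetical PICO triples: (subject, predicate, object).
TRIPLES = [
    ("trial:example-1", "hasPopulation", "adults with COVID-19"),
    ("trial:example-1", "hasIntervention", "drug A"),
    ("trial:example-1", "hasComparison", "placebo"),
    ("trial:example-1", "hasOutcome", "time to recovery"),
    ("trial:example-2", "hasPopulation", "adults with COVID-19"),
    ("trial:example-2", "hasIntervention", "drug B"),
    ("trial:example-2", "hasComparison", "standard care"),
    ("trial:example-2", "hasOutcome", "28-day mortality"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def interventions_for_population(triples, population):
    """List interventions studied in trials with the given population."""
    trials = {s for s, _, _ in match(triples, predicate="hasPopulation", obj=population)}
    return sorted(o for s, _, o in match(triples, predicate="hasIntervention") if s in trials)


print(interventions_for_population(TRIPLES, "adults with COVID-19"))  # ['drug A', 'drug B']
```

A production knowledge graph would normally live in a graph database or triple store with a proper query language; the list-of-tuples form is used here only to keep the example self-contained.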

Session 3 ended with a short paper presentation by Chuanming Yu on “A knowledge graph completion model integrating entity description and network structure”. This work proposes a knowledge graph completion model that incorporates not only the entity relationship representation but also the entity description and network structure. The authors conducted experiments on several datasets (FB15K, WN18, FB15K237, and WN18RR) to support their claims about the proposed model.

Session 4: Poster/ Greeting Notes of EEKE2021

Session 4 was all about poster paper presentations, which ran in parallel in six different breakout rooms. We joined one presentation by Tohida Rehman on “Automatic Generation of Research Highlights from Scientific”. The authors of this paper proposed deep neural network-based models to automatically generate research highlights from scientific abstracts.

They used three deep learning-based models to generate the highlights:

  • Sequence-to-sequence (seq2seq) model

  • Pointer-generator network

  • Pointer-generator network with coverage mechanism
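The report does not detail these models, but the idea behind the last two can be sketched using the commonly cited formulation of pointer-generator networks. At each decoding step, the final word distribution mixes a generation distribution over the vocabulary with a copy distribution given by attention over the source tokens, and the coverage mechanism penalizes attending again to source positions that were already attended. The numbers below are toy values, and the sketch ignores out-of-vocabulary source words for simplicity.

```python
import numpy as np


def final_distribution(p_gen, vocab_dist, attention, source_ids):
    """Mix generating from the vocabulary with copying from the source.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on
    source positions whose token is w.
    """
    final = p_gen * np.asarray(vocab_dist, dtype=float)
    # np.add.at accumulates correctly when a token id repeats in the source.
    np.add.at(final, np.asarray(source_ids), (1.0 - p_gen) * np.asarray(attention, dtype=float))
    return final


def coverage_loss(attention, coverage):
    """Penalize overlap between current attention and accumulated past attention."""
    return float(np.minimum(attention, coverage).sum())


# Toy decoding step: vocabulary of 4 words, source of 2 tokens (ids 3 and 1).
vocab_dist = np.array([0.5, 0.3, 0.2, 0.0])
attention = np.array([0.6, 0.4])
source_ids = [3, 1]

dist = final_distribution(0.7, vocab_dist, attention, source_ids)
print(dist)        # word 3 gets probability only through copying
print(dist.sum())  # still a valid distribution

# Coverage: past attention already focused on the first source token.
print(coverage_loss(attention, coverage=np.array([0.9, 0.1])))
```

In the toy step, word 3 has zero generation probability yet ends up with a non-zero final probability, which is what lets the model reproduce source terms it would not otherwise generate.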


The EEKE2021 workshop ended with the keynote presentation by the co-chairs of EEKE2021 (Chengzhi Zhang, Philipp Mayr, Wei Lu, Yi Zhang) about the future of the workshop. We found the workshop very informative and learnt about some fresh ideas. We hope the EEKE workshop will be held again in the future to facilitate further research and welcome fresh research ideas.

-- Lamia Salsabil (@liya_lamia), Sami Uddin (@usersami7)