2023-08-11: Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023) Trip Report

 



The Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023) and the 3rd AI + Informetrics (AII2023) was held as part of the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2023 in Santa Fe, New Mexico, USA, June 26-30, 2023.

This workshop focuses on the extraction and evaluation of knowledge entities from scientific documents, as well as other problems at the intersection of AI and informetrics. It covers issues in knowledge entity application, such as constructing knowledge entity graphs and roadmaps and modeling the functions of knowledge entity citations. It also addresses AI + Informetrics in a wide range of practical scenarios, including cohering AI and informetrics to bridge cross-disciplinary gaps from theoretical or practical perspectives, AI-empowered informetric models, and models in information management for empirical needs in real-world cases.

The workshop was hybrid, with both onsite and online presentations. There were one poster session and three paper sessions this year, with 3 posters and 9 full papers presented. Two world-renowned researchers, @C. Lee Giles and @Scott Cunningham, also gave insightful keynotes.

Session 1 was a poster session with three 10-minute poster presentations. The first poster, An Approach for Identifying Complementary Patents Based on Deep Learning, presented by Jinzhu Zhang, proposes to analyze the relationship between patents from the perspective of complementarity. Dayu Yan gave the second presentation on their paper, Functional Structure Recognition of Scientific Documents in Information Science, which proposes using prompts to recognize the structure and function of texts in academic papers. They compared four methods (prompt, SciBERT, LSTM, and textCNN) and found the prompt-based method most suitable for structure-function recognition. The third paper, Linkages among Science, Technology, and Industry, presented by Zhen Liu, proposes to construct a heterogeneous network based on citations among articles, patents, and drugs to describe the flow of knowledge among them. They use the key-route main path method to find the linkages and the Search Path Link Count (SPLC) to measure the importance of each edge. Figure 1 gives a rough description of the link types in the network they obtained. The results include various developmental paths, which can be summarized into three main development modes: (a) pushed simultaneously by science and technology; (b) pushed by science; (c) pushed by technology.


Figure 1: Link types in Linkages among Science, Technology, and Industry (slide 8)
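For readers unfamiliar with SPLC, the edge weights can be computed in one pass over a topologically sorted citation DAG. Below is a minimal sketch using networkx, with a hypothetical toy graph (node names invented); edges point in the direction of knowledge flow, from cited to citing.

```python
import networkx as nx

def splc_weights(g: nx.DiGraph) -> dict:
    """Search Path Link Count for every edge of a citation DAG.

    SPLC(u, v) = (# paths ending at u that may start at any ancestor,
                  including the trivial path starting at u itself)
               * (# paths from v to any sink).
    """
    order = list(nx.topological_sort(g))
    r = {}  # r[u]: paths ending at u, counting every vertex as a possible start
    for u in order:
        r[u] = 1 + sum(r[w] for w in g.predecessors(u))
    n = {}  # n[v]: paths from v to any sink
    for v in reversed(order):
        succ = list(g.successors(v))
        n[v] = 1 if not succ else sum(n[x] for x in succ)
    return {(u, v): r[u] * n[v] for u, v in g.edges}

# Toy science -> technology -> industry chain
g = nx.DiGraph([("paper_A", "patent_B"), ("paper_A", "patent_C"),
                ("patent_B", "drug_D"), ("patent_C", "drug_D")])
print(splc_weights(g))  # both patent -> drug edges score highest here
```

The key-route main path is then traced by repeatedly extending the highest-weight edges into a path.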



Session 2 had three 20-minute presentations of papers on Entity Extraction and Applications. Chunli Liu presented the first paper in this session. Their paper, The Relationship of Interdisciplinarity, Entity Features and Clinical Translation Potential of COVID-19 Papers, uses binomial regression to measure the clinical translation intensity of COVID-19 articles published in 2021 and the impact of interdisciplinarity and biological entity features on that intensity. Specifically, clinical translation intensity is measured by whether a paper is cited in clinical trials or clinical guidelines. Results show that interdisciplinary research requires more time and perseverance to overcome challenges, and that different biological entities have different impacts on clinical translation intensity.
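As a rough illustration of the modeling setup, here is a minimal binomial (logistic) regression sketch with statsmodels; the variable names and the tiny dataset are invented for illustration and are not the authors' actual features.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-paper data: the outcome is 1 if the paper is cited in a
# clinical trial or guideline, 0 otherwise; predictors are an
# interdisciplinarity score and a biological-entity count.
df = pd.DataFrame({
    "clinically_cited":    [1, 0, 1, 1, 0, 0, 1, 0],
    "interdisciplinarity": [0.8, 0.6, 0.4, 0.9, 0.2, 0.3, 0.7, 0.5],
    "bio_entity_count":    [12, 7, 5, 15, 4, 6, 9, 8],
})

X = sm.add_constant(df[["interdisciplinarity", "bio_entity_count"]])
model = sm.GLM(df["clinically_cited"], X, family=sm.families.Binomial()).fit()
print(model.summary())  # coefficients estimate each feature's effect
```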

The second paper in this session is LLM-based Entity Extraction Is Not for Cybersecurity, presented by Maxime Würsch from the CYD (Cyber-Defence Campus). They pointed out that large language models (LLMs) have been little evaluated for bibliometric search: the usual practice is to first extract entities and then compare them in an embedding space, but the performance of LLM-based entity extraction in this setting had not been assessed. Their experiments on arXiv computer science papers show that cosine similarity between entity embeddings does not cluster themes well. Among all the methods tested, only roberta-large-conll03 (NER) gives a reasonably good clustering (Figure 2). They conclude that LLM-based entity extraction is not well suited for concept-oriented bibliometrics on scientific articles.


Figure 2: Clustering results with RoBERTa large conll03 (NER) (slide 10 in LLM-based Entity Extraction Is Not for Cybersecurity)
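The embed-and-compare step they evaluate is easy to reproduce. Here is a minimal sketch assuming a sentence-transformers encoder; the model name, entities, and cluster count are arbitrary choices for illustration, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical entities already extracted from paper abstracts
entities = ["intrusion detection", "anomaly detection",
            "graph neural network", "message passing"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder
emb = model.encode(entities, normalize_embeddings=True)

print(cosine_similarity(emb))  # pairwise entity similarities

# Cluster on cosine distance; "metric" was called "affinity" in older sklearn
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average").fit_predict(emb)
print(dict(zip(entities, labels)))
```

The paper's finding is that clusters produced this way from LLM-extracted entities do not line up well with actual research themes.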



The next paper, Characterizing Emerging Technologies of Global Digital Humanities Using Scientific Method Entities, presented by Shaojian Li, proposes a ChatGPT-based semi-automatic pipeline (Figure 3) to extract scientific method entities (SMEs) from papers in the field of digital humanities. SMEs such as pattern recognition, data representation, and data modeling are used as proxies for emerging technologies. The extraction results can be used to show the technology evolution trend over the years (Figure 4, Figure 5).


Figure 3: ChatGPT-based semi-automatic pipeline (slide 3 in Characterizing Emerging Technologies of Global Digital Humanities Using Scientific Method Entities)


Figure 4: Top 10 SMEs in years 2013-2022 (slide 9 in Characterizing Emerging Technologies of Global Digital Humanities Using Scientific Method Entities)



Figure 5: Count of top 10 SMEs from 2011 to 2022 (slide 10 in Characterizing Emerging Technologies of Global Digital Humanities Using Scientific Method Entities)
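The heart of such a pipeline is an extraction prompt per document, with manual cleaning afterwards (hence "semi-automatic"). Below is a minimal sketch using the openai Python client; the prompt wording and model choice are assumptions for illustration, not the authors' actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Extract the scientific method entities (methods, techniques, "
          "algorithms, tools) from the following abstract. "
          "Return a comma-separated list.\n\n{abstract}")

def extract_smes(abstract: str) -> list[str]:
    # Hypothetical prompt and model; outputs still need manual review
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(abstract=abstract)}],
    )
    return [e.strip() for e in resp.choices[0].message.content.split(",")]

print(extract_smes("We apply topic modeling and named entity recognition "
                   "to map themes across digitized medieval manuscripts."))
```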


After session 2, Scott W. Cunningham gave a keynote on Scientometrics in the Era of Large Language Models. He talked about the development of large language models and their use in working with scientific knowledge, and shared his expectations about future opportunities in this field.

Session 3 had four 20-minute presentations of papers on AI + Informetrics.

The first paper, Identifying Potential Sleeping Beauties Based on Dynamic Time Warping Algorithm and Citation Curve Benchmarking, was presented by Yu Chen. It proposes using dynamic time warping (DTW) to identify sleeping beauties more efficiently by comparison against a benchmark sleeping beauty citation curve. A sleeping beauty (SB) is a publication that goes unnoticed for a long time and then suddenly attracts a lot of attention. DTW finds the alignment path that minimizes the distance between two sequences, so potential SBs can be identified as the papers whose citation curves have the smallest DTW distance to the benchmark curve.
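The DTW recurrence itself is only a few lines. The sketch below scores a candidate paper's citation curve against a benchmark SB curve; the citation counts are invented for illustration.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three neighboring alignments
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(d[n, m])

# Hypothetical yearly citation counts: dormant for years, then a late surge
benchmark_sb = np.array([0, 1, 0, 0, 1, 0, 2, 15, 40, 60])
candidate    = np.array([1, 0, 1, 1, 0, 1, 3, 10, 30, 55])  # SB-like
ordinary     = np.array([5, 12, 18, 14, 9, 6, 4, 3, 2, 1])  # early peak

print(dtw_distance(candidate, benchmark_sb))  # small distance -> potential SB
print(dtw_distance(ordinary, benchmark_sb))   # large distance -> not an SB
```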

The second paper is Scientific knowledge combination in networks: new perspectives on analyzing knowledge absorption and integration, presented by Hongshu Chen. They proposed a framework based on KeyBERT, TF-IDF, and SciBERT to construct knowledge networks as a proxy for knowledge absorption and integration (Figure 6). They used 124 Nobel Prize papers in physics as the dataset for the empirical study. The framework is shown in Figure 7. The results show that the average knowledge absorption efficiency is 0.14 and the average knowledge integration efficiency is 0.09.


Figure 6: Illustration of knowledge absorption and integration (slide 2 in Scientific knowledge combination in networks: new perspectives on analyzing knowledge absorption and integration)


Figure 7: Framework for analyzing knowledge network as well as knowledge absorption and integration within it (slide 4 in Scientific knowledge combination in networks: new perspectives on analyzing knowledge absorption and integration)
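As a small illustration of the keyword-extraction building block, here is a KeyBERT sketch; the document text is invented, and using SciBERT as the underlying encoder is shown as one plausible configuration rather than the authors' exact setup.

```python
from keybert import KeyBERT

# Hypothetical abstract; KeyBERT ranks candidate phrases by how similar their
# embeddings are to the whole document's embedding
doc = ("We measure how Nobel-winning physics papers absorb knowledge from "
       "prior work and integrate it into novel combinations.")

kw_model = KeyBERT(model="allenai/scibert_scivocab_uncased")  # SciBERT backbone
keywords = kw_model.extract_keywords(
    doc, keyphrase_ngram_range=(1, 2), top_n=5)
print(keywords)  # [(phrase, score), ...] -> candidate nodes of a knowledge network
```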


The third paper in this session is UnScientify: Detecting Scientific Uncertainty in Scholarly Full Text, presented by Panggih Kusuma Ningrum. It proposes a system that identifies scientific uncertainty in scholarly text at the sentence level (examples in Figure 8). As shown in Figure 9, the workflow is based on pattern checking and heuristic methods. They have also built the system into a freely accessible demo.


Figure 8: Examples of scientific uncertainty (slide 8 in UnScientify: Detecting Scientific Uncertainty in Scholarly Full Text)


Figure 9: Scientific uncertainty identification workflow (slide 12 in UnScientify: Detecting Scientific Uncertainty in Scholarly Full Text)
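A toy version of a sentence-level pattern checker is easy to sketch; the hedging cues below are illustrative only and are not the UnScientify pattern set.

```python
import re

# A handful of hedging cues of the kind a pattern-based checker might use
HEDGE_PATTERNS = [
    r"\bmay\b", r"\bmight\b", r"\bcould\b", r"\bpossibl[ey]\b",
    r"\b(seem|appear)s?\s+to\b", r"\bsuggest(s|ed)?\s+that\b",
    r"\bit\s+is\s+(unclear|unknown)\s+whether\b",
    r"\bremains?\s+to\s+be\s+(seen|determined)\b",
]

def is_uncertain(sentence: str) -> bool:
    """Sentence-level flag: does any hedging cue match?"""
    return any(re.search(p, sentence, re.IGNORECASE) for p in HEDGE_PATTERNS)

print(is_uncertain("These results suggest that the effect may be indirect."))  # True
print(is_uncertain("We trained the model for 10 epochs."))                     # False
```

The actual system combines pattern checks like these with additional heuristic rules.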


After session 3, the other keynote speaker, C. Lee Giles, gave a speech on Large Language Models and Information Extraction. He went through the development of machine learning and the current progress of large language models from BERT to GPT-4, and discussed the most up-to-date LLM-based pipelines for information extraction.

Session 4 had three 20-minute presentations of papers on topics related to Entity Extraction.

The first paper in this session is ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning, presented by @Xin Wei. It proposes ClaimDistiller, a framework based on supervised contrastive learning, to identify claims in academic papers. Comparing contrastive learning with transfer learning and plain model training, they found that contrastive learning achieves better performance with less training data and time. The proposed framework achieves the best performance, with F1 = 87.45%, on the SciCE dataset.


Figure 10: Illustration of contrastive learning (slide 9 in ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning)
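For readers unfamiliar with supervised contrastive learning, below is a minimal PyTorch sketch of a generic supervised contrastive (SupCon) loss, not necessarily ClaimDistiller's exact objective: embeddings with the same label are pulled together, and embeddings with different labels are pushed apart.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over a batch of sentence embeddings.

    features: (N, D) encoder outputs; labels: (N,), e.g. 1 = claim, 0 = not.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))          # drop self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~eye  # same-label pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos.sum(1)
    has_pos = pos_count > 0                            # anchors with positives
    pos_log_prob = log_prob.masked_fill(~pos, 0.0).sum(1)
    return (-pos_log_prob[has_pos] / pos_count[has_pos]).mean()

# Toy batch: four sentence embeddings, two claims and two non-claims
feats = torch.randn(4, 128)
labels = torch.tensor([1, 1, 0, 0])
print(supcon_loss(feats, labels))
```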



The second paper in this session is Forecasting Future Topic Trends in the Blockchain Domain: Using Graph Convolutional Network, presented by Yejin Park. It proposes an approach that integrates topic modeling and graph convolutional networks (GCNs) to predict upcoming topic trends within the blockchain field. The overall workflow is shown in Figure 11. An A3T-GCN model is trained to forecast topic trends. The empirical results show significant month-specific trends (Figure 12): the word count in January was remarkably high every year.



Figure 11: The Overall Schematic Research Workflow (slide 5 in Forecasting Future Topic Trends in the Blockchain Domain: Using Graph Convolutional Network)



Figure 12: Seasonality of paper documents (slide 16 in Forecasting Future Topic Trends in the Blockchain Domain: Using Graph Convolutional Network)
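A3T-GCN couples graph convolutions with a recurrent unit and an attention mechanism. The stripped-down temporal-GCN sketch below (attention omitted, all shapes and data hypothetical) shows the basic idea: convolve topic features over the topic graph at each month, then run a GRU over the resulting sequence to predict the next value.

```python
import torch
import torch.nn as nn

class TinyTGCN(nn.Module):
    """Minimal temporal GCN: shared graph convolution per time step + GRU."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (T, N, F) topic features per month; adj: (N, N) normalized adjacency
        h = torch.relu(adj @ self.lin(x))      # graph convolution at each step
        h = h.permute(1, 0, 2)                 # (N, T, H): one sequence per topic
        out, _ = self.gru(h)
        return self.head(out[:, -1]).squeeze(-1)  # next-month score per topic

# Toy run: 12 months, 5 topics, 8 features; adjacency from topic co-occurrence
x = torch.randn(12, 5, 8)
adj = torch.softmax(torch.randn(5, 5), dim=1)  # stand-in normalized adjacency
print(TinyTGCN(8, 16)(x, adj).shape)           # torch.Size([5])
```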



The third paper in this session is How does AI assist scientific research domains? Evidence based on 26 millions research articles, presented by Jiangen He. This paper draws a big picture of how AI is applied in scientific research across domains. They examined 26,408,350 articles published from 2000 to 2019 in the Web of Science (WoS) database; AI methods were obtained from 779,467 articles listed on the "Papers with Code" website. From these papers they derive a number of observations on AI methods, including the number of AI methods in a single paper (Figure 13), the number of publications in different disciplines (Figure 14), and the frequencies of AI methods under different categories (Figure 15).


Figure 13: Number of AI methods in a single paper (slide 6 in How does AI assist scientific research domains? Evidence based on 26 millions research articles)


Figure 14: Number of publications in different disciplines (slide 8 in How does AI assist scientific research domains? Evidence based on 26 millions research articles)


Figure 15: Frequencies of AI methods under different categories (slide 9 in How does AI assist scientific research domains? Evidence based on 26 millions research articles)



- Xin


List of other trip reports for JCDL2023:



ACM / IEEE Joint Conference on Digital Libraries (JCDL' 23) Doctoral Consortium Trip Report
