2021-04-13: Trip report: 1st International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment

The 1st International Workshop on Scientific Knowledge (Sci-K) was a one-day workshop co-located with the 2021 Web Conference and, because of Covid-19, held online on April 13, 2021. Eleven papers were presented at the workshop. I had a paper co-authored with Dr. Sarah M. Rajtmajer (assistant professor, Pennsylvania State University), Dr. C. Lee Giles (professor, Pennsylvania State University), and Sree Sai Teja Lanka (MS, graduated from Pennsylvania State University in 2021). Below, I briefly summarize the keynote given by Dr. Staša Milojević and our paper titled "Extraction and Evaluation of Statistical Information from Social and Behavioral Science Papers".

Keynote: Capturing tectonic shifts in contemporary science

Dr. Milojević is an associate professor of informatics at Indiana University Bloomington. Much of what she presented at the workshop falls in the broad domain of the science of science, which elucidates the dynamics of science as a social and intellectual endeavor. This is an emerging research topic. Other researchers working in this direction include Dr. Dashun Wang (Northwestern University) and Dr. Albert-László Barabási (Northeastern University), who recently published a book titled "The Science of Science".

One of the questions that Dr. Milojević's research tries to answer is whether the exponential growth of publications indicates an exponential growth of knowledge. Her group uses a measure called "cognitive extent", which is proxied by the concepts appearing in scientific paper titles (Milojević 2015, Journal of Informetrics). They found:

  1. The cognitive extent in physics and astronomy has been expanding rapidly since the 1960s;
  2. Small research teams cover a greater cognitive extent than large teams in the fields studied.
I have a background in astronomy, and conclusion (1) is consistent with the situation in that field. Several important discoveries were made in the 1960s, including quasars (1963), the cosmic microwave background radiation (1965), pulsars (1967), and interstellar molecules (1969); see the Timeline of Astronomy on Wikipedia for reference.

Another interesting contribution is a measure of knowledge interdisciplinarity: the extent to which a field/author/paper draws on knowledge from distinct fields. The bar chart below shows the level of interdisciplinarity of 13 academic fields. The distribution is roughly consistent with my understanding. For example, mathematics and astronomy have relatively low interdisciplinarity, while computer science, biology, and psychology tend to incorporate more methods and applications from other fields.

The third point made in the keynote was that modern science features the rise of collaboration. The figure below clearly shows the increasing share of knowledge production by large science teams in astronomy (Milojević 2014, PNAS). Specifically, in the 1960s, about 90% of publications were produced by individual authors and small teams, but by the 2000s, the fraction of publications contributed by individual authors had fallen to about 10%. Most modern experimental fields have a "Big Science" component.

Another interesting point made by Dr. Milojević was the inverse correlation between research team size and the cognitive extent of scientific output. Single authors, pairs of authors, and small teams cover the largest intellectual territory, nearly as large as that of the entire field, while large teams cover significantly smaller cognitive territory. This is consistent with a recent finding published in Nature that small teams tend to do disruptive work while large teams tend to do developmental work.

Towards the end of the talk, the speaker presented a very interesting phenomenon: a rising fraction of authors never become a lead author (see the figure below) (Milojević, Radicchi & Walsh 2018, PNAS). The study covered astronomy, ecology, and robotics; I am not sure whether the trend also applies to computer science.

Extraction and Evaluation of Statistical Information from Social and Behavioral Science Papers

The presenter was Dr. Sarah M. Rajtmajer, assistant professor of information sciences and technology at the Pennsylvania State University. The major contribution is a fast and accurate extractor of p-values and associated statistics from social and behavioral science (SBS) papers. This work is part of DARPA's SCORE program, which aims at automatically predicting confidence scores for research claims in SBS papers. As a participant, the Penn State team collaborated with ODU, Texas A&M, and Rutgers to develop a synthetic market model that replaces the traditional machine learning classifier/regressor with a market simulator. The details of the market model are out of the scope of this paper, but the input to the model is a set of paper-level and claim-level features extracted from scientific papers, of which the p-value is an important one.

This work uses a set of elaborate regular expressions to match p-values in plain text converted from PDF documents (see Table 3 of Lanka's paper). These regular expressions can extract p-values reported with 10 typical forms of test statistics, namely:
  1. T-Test
  2. F-Test
  3. Correlation
  4. Chi-Square
  5. Z-Test
  6. Q-Test
  7. Logistic regression
  8. b-Test
  9. d-Test
  10. Hazard ratio
From a matched p-value expression, we can extract the following features:
  1. the p-value itself, e.g., 0.001
  2. the sign of the p-value, e.g., >, =, or <
  3. the sample size, applicable to selected statistics
  4. the number of hypotheses tested, assumed to be the number of p-value expressions.
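To illustrate the idea, here is a minimal sketch of this kind of extraction. This is not the authors' actual implementation (their full expressions are in Table 3 of Lanka's paper); the simplified pattern below only handles t-test and F-test reports such as "t(48) = 2.31, p < .05":

```python
import re

# Simplified, hypothetical pattern: matches a t- or F-statistic with its
# degrees of freedom, followed by a p-value expression with its sign.
STAT_PVALUE = re.compile(
    r"(?P<test>[tF])\s*\((?P<df>[\d,\s]+)\)\s*=\s*(?P<stat>-?\d+\.?\d*)"  # e.g., t(48) = 2.31
    r"\s*,\s*p\s*(?P<sign>[<>=])\s*(?P<p>\.?\d*\.?\d+)"                   # e.g., p < .05
)

def extract_pvalues(text):
    """Return (test type, sign, p-value) tuples found in plain text."""
    return [
        (m.group("test"), m.group("sign"), float(m.group("p")))
        for m in STAT_PVALUE.finditer(text)
    ]

sample = "The effect was significant, t(48) = 2.31, p < .05, but F(2, 97) = 1.02, p = 0.364 was not."
print(extract_pvalues(sample))  # → [('t', '<', 0.05), ('F', '=', 0.364)]
```

The real extractor needs one such pattern per test statistic (and per notational variant), which is exactly why the paper's rule set grows complicated.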
The paper evaluates the rule-based model on a set of 300 papers from 10 disciplines. The first evaluation counts the number of p-values extracted against (1) the p-values appearing in the PDF and (2) the p-values appearing in the text converted from the PDF. The latter excludes errors introduced by the text converter and thus should be higher. StatCheck was used as a baseline. The regular expression extractor achieved an accuracy of 83.3-91.5% when evaluated against p-values appearing in the text (case (2) above) and 79.0-90.2% when evaluated against p-values appearing in the PDF (case (1) above). Both beat StatCheck, which achieved an accuracy of 69.3-85.9%.

The proposed method is unsupervised, so it does not need any training data. It takes advantage of the relatively consistent patterns in which p-values appear in SBS papers. However, regular expressions are rigid and do not generalize well to new patterns, and their complexity makes them hard for future efforts to improve. The extraction and parsing also rely on the text converter, which may incorrectly convert non-ASCII characters. Future efforts could leverage computer vision techniques to capture p-values directly from PDF images.

-- Jian Wu