2024-08-31: Improving LLM Discernibility in Scientific News – A Thesis Journey

My Process: From the Field to PhD Candidacy

I was born and raised in Budapest, Hungary, and this journey has been unbelievable. To be honest, I mainly came to Old Dominion University in 2020 to play football, but then I took Dr. Jian Wu's CS 450 Database Concepts course, which changed the entire direction of my journey. Before the course started, he advertised an undergraduate data science research assistant position at LAMP-SYS (Lab for Applied Machine Learning and Natural Language Processing Systems), which I applied for at the beginning of 2022. I have been working with him since, resulting in the publication of two papers to date. On July 18th, 2024, I defended my Master's Thesis, titled "Who Wrote the Scientific News? Improving the Discernibility of LLMs to Human-Written Scientific News". I am now enrolled in the ODU Computer Science PhD program to continue my research in Natural Language Processing under the supervision of Dr. Wu, Dr. Jiang, and Dr. Ashok. I am also glad that my thesis defense was accepted as my PhD Candidacy Exam.

Background

This journey began in December of 2022, when I developed a custom web crawler to download the HTML of every scientific news article on ScienceAlert. We extracted valuable information from these HTML files, such as the title, author, date, category, scientific news text, and more, and I managed to collect metadata from 23,346 scientific articles. We discovered that most ScienceAlert articles contain links whose URLs point to domains that publish research papers. Figure 1 illustrates the process flow of the main methodology developed in this study. In the Figure, News Article A represents the ScienceAlert news article, written by a human journalist, while News Article B is the LLM-generated news article.

Figure 1. Process flow for evaluating human-written news against LLM-generated news
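
As a rough illustration of the extraction step described above, here is a minimal sketch using BeautifulSoup, assuming a locally saved article page; the file name and the meta-tag selectors are illustrative guesses, not the exact fields or code used by the thesis crawler.

```python
# A minimal sketch of the metadata-extraction step; the selectors below are
# illustrative assumptions, not the exact ones used in the thesis pipeline.
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    """Pull basic fields (title, author, date, body text, outbound links) from an article page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        # Many news sites expose author/date via <meta> tags; exact names vary per site.
        "author": (soup.find("meta", attrs={"name": "author"}) or {}).get("content"),
        "date": (soup.find("meta", attrs={"property": "article:published_time"}) or {}).get("content"),
        "text": " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p")),
        # Candidate links to the underlying research paper (DOI, journal domains, etc.).
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

with open("article.html", encoding="utf-8") as f:  # hypothetical saved page
    print(extract_metadata(f.read())["title"])
```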

The Rise of GPT

Traditionally, token-based evaluation metrics are used in NLP to assess the quality of machine-generated text by comparing it with a reference text. These standard automatic evaluation metrics include ROUGE-1, ROUGE-2, ROUGE-L, BLEU, and METEOR. They rely on lexical similarity, focusing on individual tokens or words rather than larger units like sentences or paragraphs. As a result, they fail to capture meaning, especially in complex texts like scientific news. Semantic-based methods like BERTScore, as well as evaluations using LLMs, provide a deeper understanding by embedding texts into high-dimensional spaces where semantic similarity is captured.
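
To make the contrast concrete, here is a minimal sketch, assuming the rouge-score and bert-score Python packages are installed; the two sentences are invented examples. The lexical metrics penalize the paraphrase for its low word overlap, while the embedding-based metric credits the preserved meaning.

```python
# Contrast a token-based metric (ROUGE) with a semantic one (BERTScore)
# on a single reference/candidate pair. Example strings are made up.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Researchers found that the new alloy conducts heat twice as fast."
candidate = "A newly developed alloy was shown to transfer heat at double the rate."

# Lexical overlap: shared unigrams, bigrams, and longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

# Semantic similarity: compares contextual embeddings, so paraphrases score higher.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```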

Since GPT-3.5 was released on November 30th, 2022, many editorial offices have started using it to write news articles. Our work is a first step toward discerning the origin of an article: our results show that we were able to tune large language models (LLMs) to successfully differentiate between human-written and AI-generated news text. When we first started evaluating the articles generated by GPT, we instructed GPT to act as a journalist, a persona we named Journalist GPT. Its articles received much higher scores than the ground truth, the human-written news articles, each of which is linked to a research paper abstract. Although scoring can be useful for quick evaluations, we showed in this work that it is not always reliable for comprehensive assessments. Pairwise comparison, judging by the experimental performance across different settings, offers a more accurate and reliable way to evaluate news articles.
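
For context, a direct-scoring call in the Journalist GPT spirit might look like the sketch below, using the openai Python client (v1+); the prompt wording, 1-10 scale, and model name are assumptions for illustration, not the exact prompts or settings from the thesis.

```python
# A hedged sketch of a direct-scoring evaluation call; prompt text and scale
# are illustrative assumptions, not the thesis prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_article(abstract: str, article: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are an experienced science journalist evaluating news coverage."},
            {"role": "user",
             "content": (
                 "Given the research abstract below, rate the news article from 1 to 10 "
                 "for accuracy, clarity, and faithfulness. Reply with a single number.\n\n"
                 f"Abstract:\n{abstract}\n\nNews article:\n{article}"
             )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Example usage (with your own texts):
# print(score_article(abstract_text, news_text))
```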

Experiments and Results

Throughout the study, we conducted experiments to test three hypotheses:

  1. Hypothesis 1: The scores generated by GPT are highly correlated with standard evaluation metrics.
  2. Hypothesis 2: Direct scoring is a reliable method to evaluate and discern human-written and LLM-generated news articles.
  3. Hypothesis 3: Pairwise comparison provides a more consistent and reliable way of evaluating the quality of text.

Figure 2. Kendall correlation coefficients testing Hypothesis 1
To test Hypothesis 1, GPT-4 was prompted to generate an overall score for a given GPT-3.5-generated news article. We then calculated the traditional evaluation metrics and the Kendall correlation coefficients between the traditional scores and the GPT-4-generated scores. The heat maps in Figure 2 do not exhibit strong correlations between the GPT-4-generated scores and the traditional metrics, which refuted Hypothesis 1 and further justified using GPT-4 scores independently of traditional text summarization metrics.
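
For reference, the Kendall correlation itself can be computed with scipy; the score lists below are dummy placeholders standing in for GPT-4 scores and one traditional metric over the same set of articles.

```python
# Minimal sketch of the correlation check between GPT-4 scores and a
# traditional metric; the values are hypothetical placeholders.
from scipy.stats import kendalltau

gpt4_scores = [8, 6, 9, 7, 5, 8]                        # hypothetical GPT-4 overall scores
rouge_l_scores = [0.21, 0.34, 0.18, 0.29, 0.40, 0.25]   # hypothetical ROUGE-L F1 values

tau, p_value = kendalltau(gpt4_scores, rouge_l_scores)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")    # a weak tau means low rank agreement
```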

Figure 3. Correct Term Frequency vs LLM-generated Term Frequency
One of the main reasons we refuted Hypothesis 2 was GPT's inconsistency in evaluating news text. This inconsistency is illustrated in Figure 3, which compares the correct term frequency distribution for a given text with the term frequency distribution reported by the LLM. From the Figure, we can observe that GPT-4 does not appear capable of correctly counting n-gram frequencies and using them as a condition for assigning scores. Direct scores are rigid and can be meaningless, whereas pairwise comparison methods do not depend on exact scores. LLMs' preference for LLM-generated content can be countered with a new prompting style that asks the model to discern which article is human-written and which is LLM-generated. Pairwise comparison is also a qualitative comparison, so it does not rely on rubrics, only on guides given to the model. Table 1 highlights the importance of both instruction tuning and providing examples to the LLM when using pairwise comparison evaluation methods. By showing the performance of open-weight models, we demonstrate that instruction tuning can be utilized by anyone, thereby enhancing its effectiveness and accessibility for other researchers.
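
A guided, few-shot pairwise comparison prompt might look roughly like the sketch below; the guide text, the placeholder example pair, and the model name are illustrative assumptions rather than the exact instructions or shots used in the experiments.

```python
# A hedged sketch of guided few-shot pairwise comparison: a guide describes
# the task, a few-shot example demonstrates it, and the model answers A or B.
from openai import OpenAI

client = OpenAI()

GUIDE = (
    "You will see two news articles, A and B, written about the same research paper. "
    "Exactly one was written by a human journalist. Answer with only 'A' or 'B' for "
    "the human-written article."
)

# One illustrative shot; the real experiments may use different or more examples.
FEW_SHOT = [
    {"role": "user",
     "content": "Article A: <human-written example>\nArticle B: <LLM-generated example>\nWhich is human-written?"},
    {"role": "assistant", "content": "A"},
]

def pick_human_written(article_a: str, article_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": GUIDE}, *FEW_SHOT,
                  {"role": "user",
                   "content": f"Article A: {article_a}\nArticle B: {article_b}\nWhich is human-written?"}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```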

Discussion

Table 1. Guided few-shot pairwise comparison performance

In the GPT-3 paper, OpenAI showed that the performance of LLMs can be improved significantly by providing examples to the model. Our study supports this claim, as shown in the results presented in Table 1. Specifically, the 70-billion-parameter LLaMA 3 model beats GPT-4 across all settings. The table also shows that the open-weight Mistral 7B model won the direct pairwise comparison, beating GPT-3.5 (20B) across all settings.

This study revealed that combining few-shot examples with guides produces better results than using either one alone. The findings also emphasize the utility of pairwise comparison methods over direct scoring when the goal is to distinguish LLM-generated news from human-written news. In the future, we would like to conduct more extensive human studies.

Reflecting on this journey, I realize how transformative it has been, from arriving as a student-athlete to finding my passion in research. This path has not only shaped my academic career but also deepened my relationship with ODU and its people. I'm excited to continue this journey, pushing the boundaries of what's possible in NLP.

The slides can be viewed here:

~Dominik Soós
