Introduction

We analyzed the sentiment of 100 tweets using three sentiment analysis tools (TextBlob, VADER, and a RoBERTa-base model) and six human raters. To measure agreement, we calculated Cohen’s Kappa for each pair of raters (including both humans and tools) and Fleiss’ Kappa for multiple raters. The results? Let’s just say consensus was hard to find. Even the human raters struggled to agree, so we took a majority vote among them and compared it with the tools. Notably, the RoBERTa-base model showed the best alignment with human rating.

Our dataset consists of 100 tweets collected using the keyword “Site C, Khayelitsha” to study residents' perceptions of safety and security in Khayelitsha Township, South Africa, as part of the Minerva Research Initiative Grant awarded by the U.S. Department of Defense in 2022. We then collected the sentiment labels by running the sentiment analysis tools listed below and also by gathering data from six human raters. This data is available in a GitHub repository. Unfortunately, the project that made this work possible has been cancelled effective at the end of February, 2025.

The Sentiment Analysis Tools

We explored three sentiment analysis tools:

TextBlob: TextBlob is a Python library for processing textual data. The sentiment analysis feature in TextBlob returns a namedtuple, Sentiment(polarity, subjectivity), where polarity ranges from [-1.0, 1.0] and subjectivity ranges from [0.0, 1.0]. You can explore the tweet values in this notebook.

VADER: A rule-based model specifically developed for sentiment analysis of social media text. This is published in ICWSM-14 (2014) under the title “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text”.

RoBERTa-base model: This is a model trained on ~58M tweets and fine tuned for sentiment analysis with the TweetEval benchmark in 2020. Recent updates to this model and more information can be found at TweetNLP.

Measuring Inter-Rater Agreement: Cohen’s Kappa

Cohen's kappa coefficient is a statistic that is used to measure inter-rater agreement between two raters for categorical items. The coefficient ranges from -1 to 1, with 1 indicating perfect agreement and 0 and less indicating no agreement. Values can be interpreted as: <0 (poor agreement), 0.01-0.20 (slight), 0.21-0.40 (fair), 0.41-0.60 (moderate), 0.61-0.80 (substantial), and 0.81-1.00 (almost perfect).

We first calculated the Cohen’s Kappa for each pair of raters (Table 1). The kappa values between human raters and tools vary, with some raters showing moderate agreement (e.g., Rater1 and Rater6 with a kappa of 0.5245) and others showing fair agreement (e.g., Rater1 and Rater2 with a kappa of 0.2802). The tools (TextBlob, VADER, and RoBERTa) generally have slight or fair agreement with the human raters, indicating that the tools are less consistent with human raters.

Table 1: Cohen’s Kappa - Positive, Negative, Neutral

The highest agreement between any two raters (human or tool) is highlighted in red.

P/N/Neu	Rater1	Rater6	Rater2	Rater3	Rater4	Rater5	TextBlob	VADER	RoBERTa
Rater1	x	x	x	x	x	x	x	x	x
Rater6	0.5245	x	x	x	x	x	x	x	x
Rater2	0.2802	0.2746	x	x	x	x	x	x	x
Rater3	0.3438	0.3290	0.1974	x	x	x	x	x	x
Rater4	0.4267	0.4474	0.2490	0.3831	x	x	x	x	x
Rater5	0.4374	0.5762	0.2742	0.4930	0.5269	x	x	x	x
TextBlob	0.2761	0.3187	0.25*	0.2824	0.2311	0.3175	x	x	x
VADER	0.3244	0.4860	0.1008	0.3274	0.2364	0.3186	0.4543	x	x
RoBERTa	0.7230	0.5909	0.3743	0.4608	0.4870	0.5698	0.4564	0.4008	x

Exploring Binary Sentiment Classification for Higher Agreement

We hypothesized that classifying sentiment as "negative vs. non-negative" (or even "positive vs. non-positive") might yield higher agreement than the traditional "Positive, Negative, Neutral" categories. This is because humans are generally more adept at identifying distinct positive or negative sentiments, while neutral statements tend to be more ambiguous and prone to subjective interpretation. To test our theory, we calculated the Cohen’s Kappa for Negative vs. Non-Negative (Table 2) and Positive vs. Non-Positive (Table 3).

The kappa values between raters for Negative vs. Non Negative show significantly higher agreement compared to the three-class sentiment classification in Table 1. For example, in Table 2, Rater1 and Rater6 have a kappa value of 0.8010, indicating almost perfect agreement. RoBERTa exhibits the highest agreement with the raters (for example, RoBERTa with Rater1 having a almost perfect kappa value of 0.8562), which is similar to what we saw in the three-class classification, but with noticeably stronger agreement in the binary negative vs. non-negative classification. VADER and TextBlob still show slight and fair agreement, especially VADER (for example, VADER with Rater3 having a kappa value of -0.0128 indicating poor agreement), which suggests it struggles with more nuanced sentiment identification.

Table 2: Cohen’s Kappa - Negative vs. Non Negative

The highest agreement between any two raters (human or tool) is highlighted in red.

N/nonN	Rater1	Rater6	Rater2	Rater3	Rater4	Rater5	TextBlob	VADER	RoBERTa
Rater1	x	x	x	x	x	x	x	x	x
Rater6	0.8010	x	x	x	x	x	x	x	x
Rater2	0.2710	0.2958	x	x	x	x	x	x	x
Rater3	0.3467	0.2601	0.0310	x	x	x	x	x	x
Rater4	0.6058	0.5455	0.3510	0.3794	x	x	x	x	x
Rater5	0.6077	0.5843	0.2743	0.4881	0.5661	x	x	x	x
TextBlob	0.3892	0.3116	0.0847	0.4120	0.2677	0.4163	x	x	x
VADER	0.6038	0.4672	-0.0128	0.4536	0.2428	0.3857	0.4400	x	x
RoBERTa	0.8562	0.7423	0.3385	0.5007	0.6574	0.6606	0.4711	0.5676	x

As shown in Table 3, the kappa values for the Positive vs. Non-Positive classification also show a moderate increase in agreement compared to the three-class classification in Table 1, but the values are generally lower than the Negative vs. Non-Negative classification in Table 2. For example, Rater 1 and Rater 6 have slight agreement with a kappa value of 0.3429. However, as shown in Table 2, they have almost perfect agreement (kappa value of 0.8010) for the Negative vs. Non-Negative classification. RoBERTa continues to show better performance in both binary classifications (with kappa values like 0.5118 with Rater1), but there is still some variation in how it compares with other raters. VADER and TextBlob show weaker agreement in the positive vs. non-positive classification compared to negative vs. non-negative.

Table 3: Cohen’s Kappa - Positive vs. Non Positive

The highest agreement between any two raters (human or tool) is highlighted in red.

P/nonP	Rater1	Rater6	Rater2	Rater3	Rater4	Rater5	TextBlob	VADER	RoBERTa
Rater1	x	x	x	x	x	x	x	x	x
Rater6	0.3429	x	x	x	x	x	x	x	x
Rater2	0.5027	0.4292	x	x	x	x	x	x	x
Rater3	0.4304	0.5325	0.4813	x	x	x	x	x	x
Rater4	0.2821	0.5015	0.3362	0.4588	x	x	x	x	x
Rater5	0.4393	0.7465	0.4656	0.6298	0.5950	x	x	x	x
TextBlob	0.1627	0.2982	0.3056	0.2466	0.1256	0.2478	x	x	x
VADER	0.1553	0.5439	0.2024	0.2796	0.2911	0.3651	0.5228	x	x
RoBERTa	0.5118	0.5325	0.5505	0.5452	0.3987	0.6298	0.4749	0.3246	x

In summary, the binary classification approach shows improved agreement compared to the three-class system, the Positive vs. Non-Positive classification does not result in as strong an agreement as the Negative vs. Non-Negative case. This suggests that negative sentiment is more distinct and easier to identify than positive sentiment in our dataset.

Multi-Rater Agreement: Fleiss’ Kappa

Fleiss's kappa is a generalization of Cohen's kappa for more than two raters. In Table 4, we calculated Fleiss' Kappa to assess inter-rater agreement across multiple raters and tools for various sentiment classification tasks: Positive/Negative/Neutral, Negative/Non-Negative, and Positive/Non-Positive. Fleiss' Kappa, unlike Cohen's Kappa, is specifically designed for multiple raters and gives a measure of the consistency of the classifications.

Table 4: Fleiss' Kappa - Inter-rater agreement multiple raters/tools

The highest kappa value for each of the three classifications is highlighted in red.

	Positive/Negative/ Neutral	Negative/Not Negative	Positive/Not Positive
All Humans	0.3854	0.4541	0.4845
All Tools	0.4300	0.4925	0.4300
All Humans with VADER Tool	0.3542	0.4234	0.4190
All Humans with TextBlob Tool	0.3514	0.4148	0.3923
All Humans with RoBERTa Tool	0.4279	0.5060	0.4970
All Humans & Tools	0.3764	0.4473	0.3987

Majority Voting: Aligning Human Raters for Better Consistency

As shown in Table 4, the individual human raters had varying interpretations of sentiment, leading to moderate inter-rater agreement at best. To address this inconsistency and improve reliability, we applied a majority voting approach, aggregating the most common sentiment label among human raters. This allowed us to establish a more stable consensus, which we then used to compare against sentiment analysis tools.

We noticed that applying majority voting among human raters significantly improved consistency, making their collective judgment more aligned with automated tools.

Table 5: Cohen’s kappa - Majority of Human Votes vs. each Tool

The highest kappa value for each of the three classifications is highlighted in bold.

	Positive/Negative/Neutral	Negative/Not Negative	Positive/Not Positive
Majority vs. VADER	0.3963	0.4960	0.3697
Majority vs. TextBlob	0.3684	0.4050	0.2922
Majority vs. RoBERTa	0.7340	0.8761	0.6101

RoBERTa shows the almost perfect agreement with the majority vote of human raters, particularly in Negative/Non-Negative (0.8761). It also shows substantial agreement in Positive/Negative/Neutral (0.7340) and Positive/Not Positive (0.6101). This indicates that RoBERTa is highly consistent with the human majority, especially when distinguishing between negative and non-negative sentiments, and also performs substantially in classifying positive/negative/neutral and positive/non-positive sentiment. VADER and TextBlob show fair/moderate agreement, with VADER performing moderately in Negative/Not Negative (0.4960) but still demonstrating lower kappa values compared to RoBERTa. TextBlob performs the worst, especially in Positive/Not Positive (0.2922), showing fair agreement with the majority vote of human raters.

Key Takeaways

Binary sentiment classification improves agreement

Categorizing sentiment as Negative vs. Non-Negative or Positive vs. Non-Positive resulted in higher agreement among human raters and between humans and sentiment analysis tools.
Negative vs. Non-Negative classification showed the strongest agreement, suggesting that negative sentiment is more distinct and easier to identify than positive sentiment.

RoBERTa emerges as the most reliable tool in matching human sentiment classifications, especially after applying majority voting, as it consistently shows substantial/almost perfect agreement with the majority vote across all sentiment categories. Also, RoBERTa’s consistency was particularly strong in Negative vs. Non-Negative classification (Kappa = 0.8761), indicating its ability to detect negative sentiment reliably.

VADER and TextBlob show weaker performance

VADER and TextBlob exhibit less agreement, particularly in Positive/Not Positive, indicating they might struggle with identifying positive sentiments as clearly as negative or non-negative ones.
TextBlob showed the lowest agreement across all categories, indicating that its sentiment classification approach may be less aligned with human perception.

We believe that RoBERTa is more effective than VADER and TextBlob for sentiment analysis because it applies transformer-based deep learning to understand the context of words. As opposed to VADER and TextBlob that rely on fixed rules and lexicons, RoBERTa is trained on a large dataset and fine-tuned through TweetEval, which makes it more effective at capturing richer sentiments, sarcasm, and slang. RoBERTa’s performance reminds us that the future of sentiment analysis lies in models that truly understand the complexity of human emotions in text.

Acknowledgments

The advice given by Dr. Michael Nelson and Dr. Michele Weigle has been a great help in compiling this blog post. I am grateful for their guidance and support, but the responsibility for any errors remains my own.

- - Himarsha R. Jayanetti (@HimarshaJ)

Search This Blog

Web Science and Digital Libraries Research Group

2025-03-26: A Battle of Opinions: Tools vs. Humans (and Humans vs. Humans) in Sentiment Analysis

Measuring Inter-Rater Agreement: Cohen’s Kappa

Exploring Binary Sentiment Classification for Higher Agreement

Multi-Rater Agreement: Fleiss’ Kappa

Majority Voting: Aligning Human Raters for Better Consistency

Key Takeaways

Comments

Post a Comment