As Natural Language Processing (NLP) evolves, Large Language Models (LLMs) are capable of interpreting text and extracting meaningful information. However, during my recent experimentation with Named Entity Recognition (NER) tasks, I found that different LLMs often produce varying outputs, much like humans offering diverse answers based on their unique perspectives. This leads to an interesting challenge: how can we unify the inconsistent entity list generated by multiple LLMs into a consistent and reliable output? If you want to replicate these experiments, you can access the GitHub repository.

Figure 1. Differing Entity Extraction Results from Two LLMs for the Same Text Input.

The Challenge

LLMs Capture Entities Differently

For this experiment, I used Ollama, an open-source project that serves as a powerful and user-friendly platform for running LLMs on a local machine. Specifically, I employed Llama 3.1 8B (Meta), Phi 3.5 8B (Microsoft), and Gemma 2 9B (Google). During this exploration, I noticed that these LLMs refer to entities in slightly different ways, even when presented with the same text, prompt, and instructions.

For example, consider the sentence, “Alice met Bob in New York City.” Llama 3.1 might recognize the entities “Alice” and “New York,” while Phi 3.5 identifies “Bob” and “New York City.” Meanwhile, Gemma 2 might identify “Alice” and “NYC” (instead of New York City). This demonstrates how “New York,” “New York City,” and “NYC” refer to the same entity but are represented differently, creating a disjointed output that makes consistency in the NER task challenging.

Theoretically, this problem can be generalized: LLM 1 might recognize entities A and B, while LLM 2 identifies entities B’ and C. Here, B and B’ represent the same entity, leading to disjointed outputs. This observation led to the research question: How can we create a unified list of entities that captures “Alice,” “Bob,” and “New York City/NYC” across models? Or, put differently, how can we create a unified list of entities that captures A, B/B’, and C across models?

Using Cosine Similarity for Entity Unification

A short-term solution I explored involves text similarity techniques, specifically the Sentence-BERT (SBERT) model. SBERT is a pre-trained model designed for sentence embeddings, which allows the capture of the semantic similarity between different entities. For example, consider the entities “New York,” “New York City,” “NYC,” and “House.” SBERT generates embeddings (numerical representations) for each entity based on their meaning. I will use the model all-MiniLM-L6-v2 in this experiment to generate contextual embeddings.

Using cosine similarity, I compare embeddings from the different entities to measure their similarity on a scale from -1 (entirely dissimilar) to 1 (identical). For instance, using the all-MiniLM-L6-v2 model, the cosine similarity between “New York City” and “New York” is 0.948, and “New York City” and “NYC” is 0.918, indicating a high degree of similarity, while the similarity between “New York City” and “House” is 0.369, showing that they are distinct. By setting a threshold (e.g., 0.7), I can merge entities that exceed this threshold—such as “New York City,” “New York,” and “NYC”—while keeping distinct entities separate like “House.” This approach allows us to build a unified list that integrates all perspectives while ensuring consistency.

Figure 2. Example of all-Mini-LM-L6-v2 by comparing “New York City” with “New York,” “NYC,” and “House”

Experiment Setup

For the experiment, I used a text that hypothetically illustrates a real-world scenario involving political figures and judicial decisions: “President Joe Biden criticized the Supreme Court’s decision on the redistricting of the South Carolina district by issuing a public statement highlighting concerns about racial discrimination.”

In reality, the focus of the news was on the Supreme Court's ruling, which allowed South Carolina’s racially discriminatory congressional map to stand. Civil rights organizations, such as the ACLU, responded to the ruling, stating that the map unfairly targeted Black voters and diluted their electoral influence. The ACLU emphasized that this decision undermines principles of equality and justice protected under the Equal Protection Clause.

It is important to clarify that President Biden did not make such a statement. The hypothetical text was crafted to represent a scenario, while the actual reactions centered on the judicial ruling and the concerns raised by civil rights advocates.

I employed a system prompt to guide the LLMs in extracting entities: “You are a helpful information extraction system. Consider the following criteria: 1. Do not use acronyms to refer to entities. 2. Only output entities that are in the passage (i.e., keep the exact text of the entity). 3. If you detect two entities together, consider them as one entity (e.g., ‘Colombia, Barranquilla’ should be considered a single entity of type Location).” This prompt aimed to ensure consistency and precision in entity identification.

The task prompt given to the models was: “Given a passage, your task is to extract all entities and identify their entity types. The output should be in a list of tuples of the following format: [("entity 1", "type of entity 1"),...].” This prompt, taken from UniversalNER, was used across different LLMs, including Llama 3.1, Phi 3.5, and Gemma 2, to compare their entities’ output.

Results

Entities Extraction using LLMs

Once the instruction and prompt were set up, I prompted the LLMs and processed each response to create a list of entities—such as persons, organizations, and locations, among others—identified by each model. As shown in Figure 3, the outputs from the models differed slightly, but the pattern specified in the prompt ensured consistency in the general structure: [("entity 1", "type of entity 1"),...].” To extract these entities from the LLM responses, I used regular expressions (regex) to address slight variations in the output format.

Figure 3. Llama 3.1, Phi 3.5, and Gemma 2 answer for NER tasks.

Regex was necessary because each model’s output format varied slightly, requiring a consistent method to accurately extract the entity tuples containing the entity name and type. As shown in Figure 4, I wrote a script to ensure I could standardize the extracted information despite formatting differences.

Figure 4. Script of regex to extract tuples of entities.

Figure 5 illustrates the processed output, showing how different models recognized and categorized entities. The results were further analyzed to highlight entity recognition and classification variations across the models.

Figure 5. The output of LLMs entities after applying a regex script to extract tuples of entities.

Towards a Unified Entity List

The process of creating a unified list of entities involves combining outputs generated by multiple models, such as Gemma 2, Llama 3.1, and Phi 3.5. First, I generated SBERT embeddings for each entity to enable a consistent representation of textual data in vector form. Using these embeddings, I applied cosine similarity to measure how closely related the entities were. For entities with a similarity score above a specific threshold (e.g., 0.7), I merged them to form a unified list.

The cosine similarity threshold was crucial in determining whether two entities should be merged. As shown in Figure 6, the cosine similarity scores vary, and a threshold of 0.7 achieves a balance between over-merging (incorrectly combining distinct entities) and under-merging (failing to recognize similar entities as the same). During the merging process for “Person” entities, I ensured that the complete version of the name was retained, addressing issues with partial names or different formats of the same entity. After merging the entities, I reviewed the unified list to address any remaining redundancies, ensuring that each entity was unique and accurately represented, especially for names that could be ambiguous.

Figure 6. Cosine similarity heatmap comparing entity extraction responses from three language models (Llama 3.1, Phi 3.5, and Gemma 2). The heatmap illustrates similarities between entities such as “Joe Biden,” “Supreme Court,” and “South Carolina.” The variation in similarity scores reflects differences in how models classify entities and resolve distinctions like “district,” “redistricting,” and “Supreme Court decision.”

The experiment revealed how each LLM provided different interpretations of the same text. For example, in the extracted entities, Llama 3.1 identified “Joe Biden” as a Person, “Supreme Court” as an Organization, and “South Carolina” along with “district” as Locations. Phi 3.5, on the other hand, identified “President Joe Biden” as a Person—at least in this example, it is relevant to the role—, “South Carolina district” as a Location & Political Subdivision, and added more descriptive entities such as “Supreme Court district decision” categorized as a Legal Entity related to Government Judiciary Body, and ‘redistricting’ as a Political Process. Meanwhile, Gemma 2 recognized “Joe Biden” as a Person, “Supreme Court” as an Organization, and “South Carolina” with some variations in capitalization or format. Among the models, Phi 3.5 notably captured a richer and more comprehensive set of entities, going beyond basic entities to include more descriptive and nuanced categories. This model’s ability to identify entities like “Supreme Court district decision” as a Legal Entity related to a Government Judiciary Body and “redistricting” as a Political Process demonstrates its advanced understanding of context and semantics.

This variation in entity recognition highlights the inherent challenge of different LLMs perceiving and categorizing the same information differently. By applying a similarity-based approach, I merged these outputs into a unified list, capturing all perspectives while minimizing redundancy. For example, merging ‘Joe Biden’ and ‘President Joe Biden’ under a single entity ensured consistency while acknowledging the temporal context associated with such titles. These designations often reflect specific roles tied to different points in time—before becoming president, Joe Biden held positions such as vice president, senator, and candidate. Recognizing and preserving this temporal nuance is critical for accurately representing the evolution of an entity’s identity.

However, the results still have room for improvement. In some cases, entities that are similar but not identical are merged incorrectly, which compromises the overall accuracy of the list. This highlights a critical limitation of the current approach—while it is effective at standardizing outputs, it occasionally overlooks these nuanced distinctions. Refining the similarity threshold and incorporating additional context features, such as temporal or role-based attributes, could further enhance the precision and reliability of the results, paving the way for more robust and contextually aware entity extraction systems.

Figure 7. Unified entity list after applying cosine similarity. The output shows the final entity classification, where similar terms have been consolidated and categorized.

If you want to replicate this experimentation, make sure to run first “llm_ner_extraction.py”—Figure 8— and then “llm_ner_agreement.py”—Figure 9.

Figure 8. Script “llm_ner_extraction.py”: Works to extract entities from different LLMs.

Figure 9. Script “llm_ner_agreement.py: Works to reconcile all entities by using cosine similarity.”

Conclusions

The potential of LLMs in entity recognition is vast, but like humans, they come with varying perspectives that can complicate the extraction process. This suggests that certain LLMs may offer deeper insights, which can be particularly valuable for complex information extraction tasks that require more than just surface-level entity recognition. By using techniques like SBERT and cosine similarity, we can unify these diverse outputs into a more coherent and reliable set of data, leveraging the strengths of models like Phi 3.5 while harmonizing them with other perspectives.

This is just the beginning of what’s possible when combining multiple LLM outputs. I look forward to refining this process further and exploring NER tasks in domain-specific applications, such as Modeling & Simulation. If you’re interested in learning more about this, check out my blog post, “2025-01-22: From Narrative to Conceptualization: The Role of Large Language Models in Modeling & Simulation.” Fine-tuning similarity thresholds and incorporating additional contextual features could enhance the accuracy and utility of unified entity lists, paving the way for more sophisticated NLP applications.

- Brian Llinás (bllin001)

Search This Blog

Web Science and Digital Libraries Research Group

2025-01-16: Do Large Language Models Agree on Entity Extraction?