If you've worked with graphs or business data, you're likely familiar with entity resolution and named entity recognition (NER).
At its core, entity resolution resolves two descriptions of the same entity into a single structure representing that entity. In graphs, this often takes the form of merging multiple nodes that describe the same entity into a single node.
Entity resolution becomes a crucial step in graph processing when precision matters, and let's be honest: working with graphs is a little harder than working with vectors alone, so if you're working with graphs we can assume you care about accuracy.
Say you have nodes with the names "IBM" and "International Business Machines". We agree that these two nodes refer to the same entity. If a user mentioned "IBM" in their query, they might only get the "IBM" node and its associated relationships, missing any information stored under "International Business Machines". With only half of the context retrieved, the query result would be less precise.
However, if these nodes had been merged, you would always get the full context when querying, no matter how many alternative spellings of the entity exist in the graph. That's why entity resolution matters.
Entity resolution is big business, and there are entire companies whose sole purpose is to solve this problem. Possibly the most well-known company focused on this product is Senzing, a portfolio company of Mithril Capital, one of Peter Thiel's investment arms. Palantir also offer an entity resolution service as part of the Gotham platform, as do AWS through their Data Matching Service and Azure through their Language Service.
There are many ways to do entity resolution yourself that are suitable for graphs, without having to shell out tens of thousands of dollars.
The most common of these methods uses Levenshtein Distance, but this simple string-matching approach comes with some drawbacks.
Firstly, matching node names with Levenshtein Distance scales quadratically as the number of nodes in your graph increases. If you have 500,000 nodes in your graph, the total complexity is not just O(n²), it's O(n²m²), where n is the number of nodes and m is the string length. That works out to roughly 125 billion pairwise comparisons and, with a string length of 10 characters, around 12.5 trillion character-level operations. While Levenshtein Distance might be fine for 1,000 nodes, it is not suitable for large graphs. Even with blocking and parallelism, the time and compute requirements are prohibitive.
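If you want to sanity-check those figures yourself, the back-of-the-envelope arithmetic is simple (this assumes a brute-force all-pairs comparison and a uniform string length of 10 characters):

n = 500_000   # nodes in the graph
m = 10        # average name length in characters

pairs = n * (n - 1) // 2     # unique node pairs to compare
ops_per_pair = m * m         # the Levenshtein DP table is m x m
total_ops = pairs * ops_per_pair

print(f"{pairs:,} comparisons")    # ~125 billion
print(f"{total_ops:,} operations") # ~12.5 trillion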
Secondly, as in the example above, the names might not always be close spellings of each other. The similarity threshold you'd need to match "IBM" and "International Business Machines" is so permissive that your graph would be almost unusable after applying a Levenshtein Distance-based approach to entity resolution.
The first step you should take for entity resolution is coreference resolution. This is a pre-processing step that expands acronyms to their full form and rewrites alternative spellings of the same entity to a single, uniform name.
If your text contains both "IBM" and "International Business Machines", coreference resolution rewrites both mentions to a single canonical name, say "IBM Corporation", and uses it everywhere else the entity is mentioned.
You can use tools like spaCy's neuralcoref for this, but it is unmaintained and you will be wading deep into dependency hell. The most recent Python version compatible with the library is 3.6, and at the time of writing the most recent commit to the repository was four years ago, fixing a typo. Use at your own risk.
A much better approach is to fine-tune a small language model like meta-llama/Llama-3.1-8B-Instruct to perform coref for you, then run it on a local GPU, AWS Batch, SageMaker endpoints, LoRAX, or anywhere else you can optimise for batching and throughput. This simple precursory step will improve everything your other entity resolution steps do.
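As a rough illustration of what that looks like in practice, here is a minimal sketch using the transformers library; the model id, prompt, and generation settings are placeholders you would swap for your own fine-tuned checkpoint and serving setup:

from transformers import pipeline

# Hypothetical coref-tuned checkpoint; replace with your own fine-tuned model
generator = pipeline(
    "text-generation",
    model="your-org/llama-3.1-8b-coref",
    device_map="auto",
)

PROMPT = (
    "Rewrite the following text so that every mention of the same entity "
    "uses one canonical name, expanding acronyms where possible.\n\n{text}"
)

def resolve_coreferences(text: str) -> str:
    out = generator(
        PROMPT.format(text=text),
        max_new_tokens=512,
        do_sample=False,
        return_full_text=False,
    )
    return out[0]["generated_text"]

Running documents through something like this before extraction means the graph builder only ever sees one name per entity, which makes every downstream resolution step easier.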
A semantically relevant option for entity resolution involves using FAISS and sentence transformers to understand the actual meaning within nodes. This approach converts node properties into dense vector embeddings that capture semantic relationships, meaning it can understand that "IBM Corporation" and "International Business Machines" refer to the same entity even when they appear in different contexts or have different connections. FAISS then enables extremely fast similarity search across these embeddings using approximate nearest neighbour techniques, making it practical to find semantically similar nodes even in graphs with millions of entities.
When selecting a small language model, the key is finding the right balance between computational efficiency and task performance. Models like all-MiniLM-L6-v2
have emerged as particularly effective choices, offering strong semantic understanding while requiring minimal computational resources compared to larger models like BERT or RoBERTa. The "L6" in MiniLM-L6 indicates it has only 6 layers (compared to BERT's 12 or 24), yet through careful knowledge distillation and architecture optimization, it maintains impressive performance on text similarity tasks while being significantly faster and lighter. This makes it especially suitable for production environments where resource constraints are important but semantic accuracy can't be compromised. Its 384-dimensional embeddings provide sufficient expressiveness for most semantic tasks while keeping memory usage manageable, and its ability to run efficiently on CPUs makes it accessible for organizations without extensive GPU infrastructure.
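To make that concrete, here is a minimal sketch of encoding the two names from earlier with sentence-transformers and comparing them; the exact score will vary by model version, but the point is that the comparison happens in embedding space rather than on raw characters:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Two spellings of the same entity, encoded into 384-dimensional vectors
embeddings = model.encode(
    ["IBM Corporation", "International Business Machines"],
    normalize_embeddings=True,
)

# Cosine similarity between the two embeddings
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.2f}")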
Once similar entity groups have been identified, whether through semantic similarity or structural patterns, they need to be merged into consolidated nodes that preserve the richness of the original data. This merging process involves carefully combining the properties of related nodes, handling any conflicts in their data, and ensuring all relationships are properly redirected to the newly merged entities. Typically, the most complete node (the one with the most properties) serves as a base, with additional properties and alternative values being incorporated from other nodes in the group. The real challenge lies in maintaining the integrity of the graph during this consolidation - relationships need to be updated to point to the new merged nodes, duplicate relationships need to be removed, and any dangling references must be cleaned up. This transforms what might have started as a messy, redundant graph into a clean, consolidated network where each real-world entity is represented exactly once.
Depending on the entity resolution task at hand and the shape of your data, you might want to use an algorithm that takes into account the structure of the graph to assist you in finding similar entities that should be merged.
Node2Vec doesn't look at the meaning of the information within the node as much as the structure of the nodes and edges to help you identify similar patterns. For example, if you are analysing a social network knowledge graph and want to know which users have similar behavioural patterns, Node2Vec is your best friend. It works by performing random walks through the graph to create "sentences" of nodes, which are then fed into a Skip-gram model to create vector embeddings where structurally similar nodes end up close together. This approach means Node2Vec can identify users who play similar roles in the network - like finding bridge users who connect different communities or central figures in social groups - even if these users have completely different characteristics or interests. If your graph similarity is dependent on the structure of the graph, you should familiarise yourself with Node2Vec.
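If you want to try it, the node2vec package (built on networkx and gensim) will get you structural embeddings in a few lines; the graph and parameters below are purely illustrative:

import networkx as nx
from node2vec import Node2Vec

# Toy graph standing in for the structure of your knowledge graph
G = nx.karate_club_graph()

# Random walks over the graph become "sentences" for a Skip-gram model
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)
model = node2vec.fit(window=10, min_count=1)

# Structurally similar nodes end up with nearby embeddings
print(model.wv.most_similar(str(0), topn=5))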
GraphSAGE (SAmple and aggreGatE) offers an important advancement over Node2Vec as it's an inductive learning approach, meaning it can handle dynamic graphs and new nodes. While Node2Vec creates embeddings by walking through the graph and learning embeddings for specific nodes it saw during training, GraphSAGE learns how to generate embeddings by sampling and aggregating features from neighbors. This means GraphSAGE can create embeddings using a learned aggregation function, handle new nodes that appear after training, and importantly, uses node features along with structure – making it more versatile for real-world applications where graphs are constantly evolving.
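For completeness, a minimal GraphSAGE encoder in PyTorch Geometric looks something like the sketch below; the dimensions and layer count are arbitrary, and in practice you would train it with a task-appropriate objective such as link prediction or unsupervised neighbourhood sampling:

import torch
from torch_geometric.nn import SAGEConv

class SageEncoder(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        # Each layer aggregates features sampled from a node's neighbours
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # x: [num_nodes, in_dim] node features, edge_index: [2, num_edges]
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

encoder = SageEncoder(in_dim=128, hidden_dim=64, out_dim=32)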
We won't worry about these two structural approaches as they are generally more suitable for specialised graphs, or high-level analysis of large graphs as opposed to the granular resolution of smaller graphs that we are interested in. But you should know they exist and are available tools in your toolbox.
To demonstrate these ideas, let's create a simple module where we can pass in graphs and have clusters of similar nodes returned. We will then merge the nodes in these clusters into a single node.
First, let's set up our EntityResolutionManager class, including the sentence transformer model MiniLM, which we've chosen for its good balance of speed and accuracy.
from typing import Dict, List, Optional
import faiss
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

class EntityResolutionManager:
    def __init__(self,
                 model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',
                 device: Optional[str] = None):
        # Use the requested device, or fall back to GPU when available
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        # Initialize the sentence transformer model
        self.model = SentenceTransformer(model_name, device=self.device)
        # Set up FAISS GPU resources if we're running on CUDA
        if self.device == 'cuda':
            self.res = faiss.StandardGpuResources()
Next we'll define methods to extract each node's text and encode it. Since graph structure is not important to us in a text-to-graph pipeline, we won't worry about using GraphSAGE or Node2Vec, though this setup can easily be augmented with those steps if structure matters to your application.
    def get_node_text(self, node: Node) -> str:
        """Convert a node's properties to a text representation."""
        return f"name: {node.name} | type: {node.type}"

    def encode_nodes(self, nodes: List[Node], batch_size: int = 1024) -> np.ndarray:
        """Convert nodes into vector embeddings."""
        texts = [self.get_node_text(node) for node in nodes]
        return self.model.encode(texts, batch_size=batch_size, normalize_embeddings=True)
Next, let's find similar entities in our graph using FAISS (GPU-backed where available). The threshold parameter (default 0.85) determines how similar two nodes need to be to count as a match, and k caps how many neighbours we retrieve for each node.
    def find_similar_entities(self, embeddings: np.ndarray, threshold: float = 0.85, k: int = 10) -> List[List[int]]:
        """Find groups of similar entities using FAISS."""
        num_nodes = len(embeddings)
        # Set up the FAISS index; inner product on normalised vectors is cosine similarity
        index = faiss.GpuIndexFlatIP(self.res, embeddings.shape[1]) if self.device == 'cuda' else faiss.IndexFlatIP(embeddings.shape[1])
        index.add(embeddings)
        # Search for the k nearest neighbours of every node
        D, I = index.search(embeddings, min(k, num_nodes))
        # Group similar entities
        groups = []
        processed = set()
        for i in range(num_nodes):
            if i in processed:
                continue
            similar = [j for j, sim in zip(I[i], D[i])
                       if sim > threshold and j != i and j not in processed]
            if similar:
                groups.append([i] + similar)
                processed.update(similar)
        return groups
Finally, we cluster the similar nodes into groups along with their similarity scores.
    def resolve_entities(self, graphs: List[Graph], threshold: float = 0.85) -> List[EntityGroup]:
        """Main entity resolution process."""
        # Combine all graphs into one (combine_graphs is a simple helper not shown here)
        combined_graph = self.combine_graphs(graphs)
        # Convert nodes to embeddings
        embeddings = self.encode_nodes(combined_graph.nodes)
        # Find similar groups
        similar_groups = self.find_similar_entities(embeddings, threshold)
        # Create entity groups with similarity scores
        # (calculate_group_similarity is a helper, e.g. the mean pairwise similarity within the group)
        return [EntityGroup(
            nodes=[combined_graph.nodes[idx] for idx in group],
            similarity=calculate_group_similarity(embeddings[group])
        ) for group in similar_groups]
This will return entity groups, each comprising a set of nodes and their similarity score.
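Putting the resolution side together, usage looks roughly like this (graphs here stands for whatever list of Graph objects your extraction pipeline produced):

resolver = EntityResolutionManager()

# graphs: List[Graph] produced earlier in your pipeline
entity_groups = resolver.resolve_entities(graphs, threshold=0.85)

for group in entity_groups:
    names = [node.name for node in group.nodes]
    print(f"{names} -> similarity {group.similarity:.2f}")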
Now let's set up our merging class.
import uuid
from concurrent.futures import ThreadPoolExecutor

class EntityMerger:
    def merge_entity_groups(self, entity_groups: List[EntityGroup]) -> Graph:
        # Only merge groups that actually contain more than one node
        valid_groups = [g for g in entity_groups if len(g.nodes) > 1]
        # Process groups in parallel
        with ThreadPoolExecutor() as executor:
            list(executor.map(self._merge_group, valid_groups))
We'll pass groups of nodes to our class, which will merge the groups.
    def _merge_group(self, group: EntityGroup):
        # Pick the most detailed node as our base
        base_node = max(group.nodes, key=lambda n: len(n.properties))
        merged_id = str(uuid.uuid4())
        # Create new merged node (appending it to self.graph.nodes is omitted here for brevity)
        merged_node = Node(
            id=merged_id,
            name=base_node.name,
            type=base_node.type,
            properties=self._merge_properties(group.nodes)
        )
        # Record which original ids now map to the merged node, for relation rewiring later
        for node in group.nodes:
            self.node_id_mapping[node.id] = merged_id
We'll use a simplistic way of merging the properties into lists for demonstration purposes, though in a production use-case, a small language model is very useful in rewriting the properties, particularly when you have many nodes with different descriptions or text-based properties that are important to your retrieval use case.
    def _merge_properties(self, nodes: List[Node]) -> Dict:
        merged_props = {}
        for node in nodes:
            for key, value in node.properties.items():
                if key not in merged_props:
                    merged_props[key] = value
                elif merged_props[key] != value:
                    # Handle conflicts by creating lists of values
                    if isinstance(merged_props[key], list):
                        merged_props[key].append(value)
                    else:
                        merged_props[key] = [merged_props[key], value]
        return merged_props
We also need to reassign our relationships so that all of the connections to and from the nodes we are merging are connected to the single merged node.
    def _update_relations(self):
        updated_relations = []
        seen_relations = set()  # Track unique relations
        for relation in self.graph.relations:
            start_id = self.node_id_mapping.get(relation.start_node_id, relation.start_node_id)
            end_id = self.node_id_mapping.get(relation.end_node_id, relation.end_node_id)
            # Repoint the relation at the merged nodes
            relation.start_node_id = start_id
            relation.end_node_id = end_id
            # Avoid duplicate relations
            relation_key = f"{start_id}:{relation.type}:{end_id}"
            if relation_key not in seen_relations:
                updated_relations.append(relation)
                seen_relations.add(relation_key)
        self.graph.relations = updated_relations
Finally, it's a good idea to clean up the graph so there are no orphaned nodes and no dangling relationships that don't touch a node on either end.
    def _cleanup_graph(self):
        valid_node_ids = {node.id for node in self.graph.nodes}
        # Keep only relationships whose endpoints both still exist
        valid_relations = [
            relation for relation in self.graph.relations
            if relation.start_node_id in valid_node_ids
            and relation.end_node_id in valid_node_ids
        ]
        self.graph.relations = valid_relations
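To tie the merging side together, a typical call sequence looks like the sketch below; this assumes an __init__ (not shown above) that stores the combined graph on self.graph and initialises self.node_id_mapping to an empty dict, and in practice merge_entity_groups would call the update and cleanup steps itself:

merger = EntityMerger(combined_graph)      # assumed constructor, see note above
merger.merge_entity_groups(entity_groups)  # merge each group into a single node
merger._update_relations()                 # repoint relationships at the merged nodes
merger._cleanup_graph()                    # drop any dangling relationships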
And that's it!
In this example, we've created a simple module to demonstrate how you might do entity resolution using FAISS and sentence transformers. In a production use case, you will want to augment each step so that it is more suitable to your dataset and use case, but overall, this approach to entity resolution produces a significantly better graph and does it much faster than if you were using simple string matching.