If you've spent any amount of time working with RAG (retrieval augmented generation), you'll know that vector-only RAG quickly hits a performance ceiling.
When using a vector-only RAG system in production, you will run into a recurring set of problems, many of which a graph RAG system can resolve.
When you use a graph, your text is parsed against an ontology into triples. A triple consists of a subject, a predicate, and an object. The knowledge is encoded in the relationships, which lends a level of compression to the data. The triples are then embedded and stored in a vector database, while the graph itself is stored in a graph database.
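To make this concrete, here is a minimal illustration of how a sentence decomposes into triples. The sentence and the triples are invented for the example; they are not output from duohub's parser.

```python
# Illustrative only: a sentence decomposed into (subject, predicate, object) triples.
sentence = "Marie Curie won the Nobel Prize in Physics in 1903."

triples = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Nobel Prize in Physics", "awarded_in", "1903"),
]

# Each triple is embedded and stored in a vector database, while the
# nodes and edges are written to a graph database.
for subject, predicate, obj in triples:
    print(f"({subject}) -[{predicate}]-> ({obj})")
```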
Upon retrieval, your query is embedded and run against the vector database to return a list of relevant triples. The graph database is then queried to expand on these triples, providing additional context and relationships that are relevant to the query. Finally, the graph results are processed into a subgraph that represents context from your graph database that is relevant to your search.
This results in a more accurate and contextually rich response from the LLM.
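To make that retrieval flow concrete, here is a toy, fully in-memory sketch of the same steps. The keyword-overlap "vector search", the adjacency map standing in for a graph database, and the triples themselves are illustrative stand-ins, not duohub internals.

```python
from collections import defaultdict

# Toy "ingested" triples.
triples = [
    ("Donald Trump", "commented_on", "NATO funding"),
    ("Donald Trump", "is_candidate_in", "US election"),
    ("US election", "has_implications_for", "Ukraine"),
    ("Ukraine", "receives", "US military aid"),
]

# 1. Vector search stand-in: score triples by keyword overlap with the query.
def vector_search(query, top_k=2):
    words = set(query.lower().split())
    return sorted(
        triples,
        key=lambda t: -len(words & set(" ".join(t).lower().split())),
    )[:top_k]

# 2. Graph database stand-in: an adjacency map from each entity to its triples.
adjacency = defaultdict(list)
for s, p, o in triples:
    adjacency[s].append((s, p, o))
    adjacency[o].append((s, p, o))

# 3. Expand the seed triples one hop out to build the subgraph.
def expand(seed_triples):
    subgraph = set(seed_triples)
    for s, _, o in seed_triples:
        subgraph.update(adjacency[s])
        subgraph.update(adjacency[o])
    return subgraph

query = "What are the implications for Ukraine if Donald Trump wins the election?"
for triple in sorted(expand(vector_search(query))):
    print(triple)
```

Note how the expansion step pulls in the Ukraine triple even though the top vector hits only mention Donald Trump and the election; that extra hop through the graph is where graph RAG earns its keep.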
The duohub API supports three retrieval modes: subgraph, assisted, and facts & assisted.
Each of these retrieval modes has its own use case, and the performance of each mode is different.
You can use them accordingly:
```python
from duohub import Duohub

client = Duohub()

# Subgraph
client.query(query="What did you do in Paris last week?")

# Assisted
client.query(query="What did you do in Paris last week?", assisted=True)

# Facts & Assisted
client.query(query="What did you do in Paris last week?", facts=True, assisted=True)
```
Subgraph mode only returns the subgraph results. This is the fastest mode. However, the context can be quite large in size, which is not always ideal for a conversational system because large contexts can absorb attention from the LLM, especially if some information is irrelevant.
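If you do feed subgraph results straight to an LLM, a common pattern is to fold them into the system prompt. The snippet below assumes the response from `client.query` serialises usefully with `str()`; treat that as a placeholder for however you flatten the payload, and the GPT-4o call is just one possible downstream model.

```python
from duohub import Duohub
from openai import OpenAI

memory = Duohub()
llm = OpenAI()

question = "What did you do in Paris last week?"

# Subgraph mode: fetch the relevant subgraph for the question.
# Assumption: the response serialises to a useful string for prompting.
context = str(memory.query(query=question))

response = llm.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using this memory context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```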
Assisted mode is best for voice AI systems. It returns a single sentence that is relevant to the query. If you are using a framework like Pipecat, where the user's input is transcribed in real time and parsed into phrases in the pipeline, this is the best mode to use.
For example, if the user asks, "What are the implications for Ukraine if Donald Trump wins the election?", Pipecat will split the transcription into several shorter phrases as they arrive, and assisted mode will return context for each phrase.
The user will not notice any latency as most assisted mode queries are answered in under 300 ms, and because Pipecat splits up the query into phrases, most of the context will already have been provided by the time the user has finished speaking.
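In practice, this means firing an assisted query for each phrase as it is transcribed rather than waiting for the full utterance. The sketch below is framework-agnostic and does not use Pipecat's actual APIs; the phrase list is invented for illustration.

```python
from duohub import Duohub

client = Duohub()

# Illustrative phrases, roughly as a real-time transcription might chunk them.
phrases = [
    "What are the implications",
    "for Ukraine",
    "if Donald Trump wins the election?",
]

# Query assisted mode for each phrase as it arrives, accumulating
# single-sentence context for the next LLM turn.
context = [client.query(query=phrase, assisted=True) for phrase in phrases]
```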
We currently use the Mixtral 8x7b model for assisted mode as we have found it to be the best balance between speed and quality.
Facts mode is ideal for chat interfaces where the user might be passing in a complete question or statement, as opposed to voice AI systems where the user's input is transcribed in real time and chunked into phrases.
We use an OpenAI GPT-4o tool call to extract three facts from the subgraph results.
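For illustration, here is roughly what extracting a fixed number of facts via a tool call looks like. This is a generic sketch of the pattern using the OpenAI Python SDK, not duohub's actual implementation, and `subgraph_text` is a placeholder for your serialised subgraph results.

```python
import json
from openai import OpenAI

llm = OpenAI()

subgraph_text = "..."  # placeholder: serialised subgraph results

# A tool whose schema forces the model to return exactly three facts.
tools = [{
    "type": "function",
    "function": {
        "name": "return_facts",
        "description": "Return the three facts most relevant to the user's query.",
        "parameters": {
            "type": "object",
            "properties": {
                "facts": {
                    "type": "array",
                    "items": {"type": "string"},
                    "minItems": 3,
                    "maxItems": 3,
                },
            },
            "required": ["facts"],
        },
    },
}]

response = llm.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract facts from the provided subgraph."},
        {"role": "user", "content": subgraph_text},
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "return_facts"}},
)

facts = json.loads(response.choices[0].message.tool_calls[0].function.arguments)["facts"]
print(facts)
```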
The following benchmarks were conducted in London, UK, using a 5G network. This is relevant because our data centres are distributed all over the world, including but not limited to London, New York, California, Frankfurt, Tokyo, and Sydney.
Your query will always be routed to the nearest available data centre. Since this benchmark was conducted in London, the results should be representative of a query routed to the London data centre. If, for some reason, our London data centre was down, this query would have been routed to the next nearest available data centre, in this case Frankfurt.
For this benchmark, we ingested three articles from The Guardian into a graph.
After ingestion, we queried the graph using each of the 3 retrieval modes from a local Python IDE.
```bash
pip install duohub
```
The following code runs each of the modes sequentially and prints the latency for each mode.
```python
import time

from duohub import Duohub

# Initialize the Duohub client
client = Duohub()

# Define the query
query = "What are the implications for Ukraine if Donald Trump wins the election?"

# Function to measure latency
def measure_latency(mode=None):
    start_time = time.time()
    if mode == 'assisted':
        client.query(query=query, assisted=True)
    elif mode == 'facts':
        client.query(query=query, facts=True, assisted=True)
    else:  # Default to subgraph
        client.query(query=query)
    end_time = time.time()
    latency = (end_time - start_time) * 1000  # Convert to milliseconds
    return latency

# Run queries and measure latency
latency_subgraph = measure_latency()
latency_assisted = measure_latency(mode='assisted')
latency_facts = measure_latency(mode='facts')

# Print results
print(f"Subgraph: {latency_subgraph:.2f} ms")
print(f"Assisted: {latency_assisted:.2f} ms")
print(f"Facts: {latency_facts:.2f} ms")
```
Latency results for this query are as follows:
```
Subgraph: 42.66 ms
Assisted: 194.12 ms
Facts: 1089.38 ms
```
As you can see, the subgraph mode is the fastest, followed by the assisted mode, and then the facts mode.
Personally, I dislike the word "competition" because it implies a zero-sum game. I will, however, compare duohub's graph RAG to the other RAG systems out there which do more or less the same thing.
The two main players in the graph RAG space are Zep and Graphlit.
Zep is a popular memory layer for AI. While duohub allows you to ingest arbitrary data into the graph, Zep is primarily used to ingest conversation history into a graph.
We have found them to be an excellent product, which we use ourselves for conversation facts.
Zep returns an answer in 247 ms on average. This is partially because Zep generates facts in the background, providing you with a payload of existing facts when you request it. However, if you wish to ingest your own data into Zep, you will quickly encounter limitations.
Graphlit is a relatively new player in the graph RAG space. It is a managed service that allows you to ingest your own data into a graph and then query it with an LLM.
However, most queries take more than 10 seconds to return a response. Graphlit is not suitable for real-time applications.
If you need a low-latency, high-accuracy graph RAG API, duohub is the best option. Subgraph queries typically return in under 50 ms, and assisted queries return in under 300 ms, more realistically around 180 ms.
If you are working with voice AI systems, you should use the assisted mode, which is the fastest.
If you are working with chat interfaces, you should use the facts mode, which is the most accurate.