If you have spent any time working with RAG (retrieval-augmented generation), you will know that vector-only RAG hits a performance ceiling quite quickly.
When using a vector-only RAG system in production, you will have to resolve these problems:
1.
2.
3.
Using a graph RAG system can resolve many of these problems.
When you use a graph, your text is parsed against an ontology into triples. A triple contains a subject, predicate and object. The knowledge is encoded into the relationships, thus lending a level of compression to the data. The triples are then embedded and stored in a vector database, while the graph is stored in a graph database.
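To make the idea of a triple concrete, here is a minimal sketch of how one might be represented in Python. The `Triple` class, its `as_text` helper, and the example facts are all hypothetical and used only for illustration; they are not duohub's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One fact from the graph: subject -[predicate]-> object."""
    subject: str
    predicate: str
    object: str

    def as_text(self) -> str:
        # Flatten the triple into a sentence-like string, the form that could be embedded.
        return f"{self.subject} {self.predicate} {self.object}"


# Hypothetical triples; not taken from any real graph.
triples = [
    Triple("Alice", "visited", "Paris"),
    Triple("Alice", "works_at", "Acme"),
    Triple("Paris", "located_in", "France"),
]

texts = [t.as_text() for t in triples]
print(texts)
```

Each string in `texts` is what would go to the embedding model, while the triples themselves define the nodes and edges stored in the graph database.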
Upon retrieval, your query is embedded and run against the vector database to return a list of relevant triples. The graph database is then queried to expand on these triples, providing additional context and relationships that are relevant to the query. Finally, the graph results are processed into a subgraph that represents context from your graph database that is relevant to your search.
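The two retrieval steps can be sketched with a toy in-memory example. This is only an illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the `TRIPLES` list stands in for both databases, and one-hop expansion is an assumed expansion strategy. None of these names come from the duohub API.

```python
import math
import re
from collections import Counter

# Hypothetical triples standing in for an ingested graph.
TRIPLES = [
    ("Alice", "visited", "Paris"),
    ("Paris", "located_in", "France"),
    ("Alice", "works_at", "Acme"),
    ("Bob", "lives_in", "Berlin"),
]


def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a learned model.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_search(query, top_k=1):
    # Step 1: rank triples by similarity between the query and the triple text.
    q = embed(query)
    ranked = sorted(TRIPLES, key=lambda t: cosine(q, embed(" ".join(t))), reverse=True)
    return ranked[:top_k]


def expand(seeds):
    # Step 2: one-hop expansion, pulling in every triple that shares an entity with a seed.
    entities = {e for s, _, o in seeds for e in (s, o)}
    return [t for t in TRIPLES if t[0] in entities or t[2] in entities]


seeds = vector_search("Who visited Paris?")
subgraph = expand(seeds)
print(subgraph)
```

The vector search finds the entry point into the graph, and the expansion step adds related facts that the query text alone would not have matched.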
This results in a more accurate and contextually rich response from the LLM.
The duohub API supports 3 retrieval modes:
1. Subgraph
2. Assisted
3. Facts
Each of these retrieval modes has its own use case, and the performance of each mode is different.
You can use them accordingly:
from duohub import Duohub

client = Duohub()

# Subgraph
client.query(query="What did you do in Paris last week?")

# Assisted
client.query(query="What did you do in Paris last week?", assisted=True)

# Facts & Assisted
client.query(query="What did you do in Paris last week?", facts=True, assisted=True)
Subgraph mode returns only the subgraph results. This is the fastest mode. However, the context can be quite large, which is not always ideal for a conversational system: a large context can dilute the LLM's attention, especially if some of the information is irrelevant.
Assisted mode is best for voice AI systems. It returns a single sentence that is relevant to the query. If you are using a framework like Pipecat, where the user's input is transcribed in real time and parsed into phrases in the pipe, this is the best mode to use.
For example, if the user is asking the question, "What are the implications for Ukraine if Donald Trump wins the election?", Pipecat might split this up into the following phrases:
The assisted mode will return context for each phrase, for example:
The user will not notice any latency as most assisted mode queries are answered in under 300 ms, and because Pipecat splits up the query into phrases, most of the context will already have been provided by the time the user has finished speaking.
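The phrase-by-phrase pattern can be sketched as follows. This is a sketch under assumptions: `collect_context` and `fake_assisted_query` are hypothetical helpers, the phrase split is illustrative rather than Pipecat's actual output, and the stub returns a plain string because the shape of the real `client.query` response is not described here.

```python
from typing import Callable, List


def collect_context(phrases: List[str], query_fn: Callable[[str], str]) -> List[str]:
    """Query once per transcribed phrase and keep the non-empty answers, in order."""
    context = []
    for phrase in phrases:
        answer = query_fn(phrase)
        if answer:
            context.append(answer)
    return context


# Hypothetical stand-in for an assisted-mode call such as
# client.query(query=phrase, assisted=True); it makes no network request.
def fake_assisted_query(phrase: str) -> str:
    return f"[context for: {phrase}]"


# Illustrative split of the question; real phrase boundaries depend on the transcriber.
phrases = [
    "What are the implications for Ukraine",
    "if Donald Trump wins the election?",
]

context = collect_context(phrases, fake_assisted_query)
print(context)
```

In a real pipeline each call would be issued as soon as its phrase arrives, so the context accumulates while the user is still speaking.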
We currently use the Mixtral 8x7B model for assisted mode, as we have found it to offer the best balance between speed and quality.
Facts mode is ideal for chat interfaces where the user might be passing in a complete question or statement, as opposed to voice AI systems where the user's input is transcribed in real time and chunked into phrases.
We use an OpenAI GPT-4o tool call to extract 3 facts from the subgraph results.
The following benchmarks were conducted in London, UK, using a 5G network. This is relevant because our data centres are distributed all over the world, including but not limited to London, New York, California, Frankfurt, Tokyo, and Sydney.
Your query will always be routed to the nearest available data centre. Since this benchmark was conducted in London, the results should be representative of a query routed to the London data centre. If, for some reason, our London data centre had been down, this query would have been routed to the next nearest available data centre, in this case Frankfurt.
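The routing behaviour described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `nearest_available` function is hypothetical, "nearest" is taken to mean great-circle distance (real routing may well be based on network latency instead), and only a subset of locations is listed, with approximate city coordinates.

```python
import math

# Approximate city coordinates (latitude, longitude) for a subset of the locations named above.
DATA_CENTRES = {
    "London": (51.5074, -0.1278),
    "Frankfurt": (50.1109, 8.6821),
    "New York": (40.7128, -74.0060),
    "Tokyo": (35.6762, 139.6503),
    "Sydney": (-33.8688, 151.2093),
}


def haversine_km(a, b):
    # Great-circle distance between two (lat, lon) points in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_available(client_location, unavailable=()):
    # Assumption: "nearest" means geographic distance; real routing may use network latency.
    candidates = {name: loc for name, loc in DATA_CENTRES.items() if name not in unavailable}
    return min(candidates, key=lambda name: haversine_km(client_location, candidates[name]))


london_client = (51.5074, -0.1278)
print(nearest_available(london_client))
print(nearest_available(london_client, unavailable={"London"}))
```

With every location available, a client in London resolves to London; marking London as unavailable makes the same client fall back to Frankfurt, matching the failover described above.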
For this benchmark, we ingested 3 articles from The Guardian into a graph.
The articles are as follows:
After ingestion, we queried the graph using each of the 3 retrieval modes from a local Python IDE.
pip install duohub
The following code runs each of the modes sequentially and prints the latency for each mode.
import time

from duohub import Duohub

# Initialize the Duohub client
client = Duohub()

# Define the query
query = "What are the implications for Ukraine if Donald Trump wins the election?"


# Function to measure latency
def measure_latency(mode=None):
    # perf_counter is a monotonic, high-resolution clock suited to timing intervals
    start_time = time.perf_counter()
    if mode == 'assisted':
        client.query(query=query, assisted=True)
    elif mode == 'facts':
        client.query(query=query, facts=True, assisted=True)
    else:  # Default to subgraph
        client.query(query=query)
    end_time = time.perf_counter()
    latency = (end_time - start_time) * 1000  # Convert to milliseconds
    return latency


# Run queries and measure latency
latency_subgraph = measure_latency()
latency_assisted = measure_latency(mode='assisted')
latency_facts = measure_latency(mode='facts')

# Print results
print(f"Subgraph: {latency_subgraph:.2f} ms")
print(f"Assisted: {latency_assisted:.2f} ms")
print(f"Facts: {latency_facts:.2f} ms")
Latency results for this query are as follows:
Subgraph: 42.66 ms
Assisted: 194.12 ms
Facts: 1089.38 ms
As you can see, the subgraph mode is the fastest, followed by the assisted mode, and then the facts mode.
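A single timed request can be skewed by network jitter, so if you want to reproduce these numbers yourself it may help to repeat each query and look at the spread. The following is a minimal sketch of that idea; the `benchmark` helper is hypothetical and the `time.sleep` call is only a stand-in workload, not a duohub request.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Call fn `runs` times and summarise the latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Stand-in workload; swap in e.g. lambda: client.query(query=query) to time a real request.
stats = benchmark(lambda: time.sleep(0.01), runs=5)
print({name: round(value, 2) for name, value in stats.items()})
```

The median is usually a steadier summary than a single run, and the gap between min and max gives a rough sense of how noisy the connection is.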
Personally, I dislike the word "competition" because it implies a zero-sum game. I will, however, compare duohub's graph RAG to the other RAG systems out there which do more or less the same thing.
The two main players in the graph RAG space are Zep and Graphlit.
Zep is a popular memory layer for AI. While duohub allows you to ingest arbitrary data into the graph, Zep is primarily used to ingest conversation history into a graph.
We have found it to be an excellent product, and we use it ourselves for conversation facts.
Zep returns an answer in 247 ms on average. This is partially because Zep generates facts in the background, providing you with a payload of existing facts when you request it. However, if you wish to ingest your own data into Zep, you will quickly encounter limitations.
Graphlit is a relatively new player in the graph RAG space. It is a managed service that allows you to ingest your own data into a graph, and then query it using an LLM.
However, most queries take more than 10 seconds to return a response. Graphlit is not suitable for real-time applications.
If you need a low-latency, high-accuracy graph RAG API, duohub is the best option. When working with subgraph queries, you can expect results in under 50 ms, and when working with assisted queries, you can expect results in under 300 ms, though more realistically around 180 ms.
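If you are working with voice AI systems, you should use the assisted mode, which returns a concise answer quickly enough for real-time conversation (subgraph mode is faster, but returns a much larger context).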
If you are working with chat interfaces, you should use the facts mode, which is the most accurate.