How to improve query latency with Qdrant cross-region distribution

In a previous post, we looked at how to set up a multi-region distributed Qdrant cluster on AWS. In this post, we will look at how to improve query latency by using listener nodes.

How do Qdrant distributed node queries work?

When you query a Qdrant cluster, the query first goes to the replica node that you are connected to. To keep reads consistent across the cluster, Qdrant then polls the other replicas via the leader node.

This polling happens over TCP/UDP on port 6335, which adds network latency if your primary node is in a different region from your replicas.

Imagine you are querying a replica in eu-west-2 (London) and your primary node is in us-east-1 (N. Virginia). The query will be sent to the replica in eu-west-2 and then to the leader node in us-east-1 which will poll the other replicas to ensure consistency. Network latency alone can add 600ms to your query time.

This is obviously not ideal if you want the low-latency queries that you see in Qdrant's benchmark results.

Qdrant latency benchmark results

How to use listener nodes to improve query latency

Listener nodes are a type of node in Qdrant that do not participate in read/query operations. They are used to offload the polling from the primary nodes in your cluster. The listener node still accepts write operations, but ignores read operations.

In the diagram below, we have set up a primary node in an unnamed region, with a listener and a replica in each of the other regions. In this example, we are running the listener and the replica node on the same EC2 instance within a Docker network.

Qdrant listener configuration

When you query the replica node in eu-west-2, it only polls the listener node on the same EC2 instance in eu-west-2 instead of the primary node in us-east-1.

This ensures that you get a consistent read from the replica node in eu-west-2 without the network latency of polling the primary node in us-east-1.

While I have started both the listener and replica nodes on the same EC2 instance, you could also set this up as parallel ECS tasks.
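As a rough sketch of the setup described above, the helper below builds the `docker run` arguments for a replica and a listener that join an existing cluster. It assumes Qdrant's convention of overriding config keys through `QDRANT__SECTION__KEY` environment variables, with the listener role set through the `storage.node_type` key. The container names, Docker network name and bootstrap address are hypothetical placeholders, not values from this post.

```python
def qdrant_docker_command(name, network, bootstrap_uri=None, listener=False,
                          image="qdrant/qdrant"):
    """Build a `docker run` argument list for one Qdrant cluster node.

    All names passed in are illustrative; adjust them to your deployment.
    """
    cmd = [
        "docker", "run", "-d",
        "--name", name,
        "--network", network,
        # Enable distributed mode on this node.
        "-e", "QDRANT__CLUSTER__ENABLED=true",
    ]
    if listener:
        # Assumed mapping of the `storage.node_type: Listener` config key.
        cmd += ["-e", "QDRANT__STORAGE__NODE_TYPE=Listener"]
    # Advertise this node's peer-to-peer address on the internal port 6335.
    cmd += [image, "./qdrant", "--uri", f"http://{name}:6335"]
    if bootstrap_uri:
        # Join an existing cluster through a peer that is already running.
        cmd += ["--bootstrap", bootstrap_uri]
    return cmd


# Hypothetical primary address in us-east-1.
primary = "http://primary.example.internal:6335"
replica_cmd = qdrant_docker_command("replica-eu", "qdrant-net", primary)
listener_cmd = qdrant_docker_command("listener-eu", "qdrant-net", primary, listener=True)
print(" ".join(listener_cmd))
```

The two argument lists can be passed to `subprocess.run` on the EC2 instance, or translated into two container definitions if you run them as ECS tasks instead.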

Learn more about Qdrant listener nodes

To learn more about Qdrant listener nodes, you can refer to the official documentation.

Improve query latency with quantisation and RAM-only mode

In addition to using listener nodes, you can also improve query latency by using quantisation and RAM-only mode. Quantisation is a technique that reduces the precision of the vectors in your collection, which in turn reduces the amount of data that needs to be stored on disk or in memory. RAM-only mode ensures that the vectors are stored in RAM, which further reduces the query latency.
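To make the idea of reduced precision concrete, here is a minimal pure-Python sketch of scalar quantisation: each float component is mapped to a one-byte code between 0 and 255, and outliers beyond a chosen quantile are clipped. This illustrates the general technique only and is not Qdrant's internal implementation; the function names are hypothetical.

```python
def quantile_bounds(values, quantile=0.99):
    """Return (lo, hi) bounds that cover the central `quantile` share of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Number of values to skip at each end of the sorted list.
    skip = round((1.0 - quantile) / 2.0 * last)
    return ordered[skip], ordered[last - skip]


def quantize(vector, lo, hi):
    """Map floats in [lo, hi] to integer codes 0..255, clipping outliers."""
    scale = (hi - lo) / 255.0 or 1.0  # avoid dividing by zero when lo == hi
    return [min(255, max(0, round((x - lo) / scale))) for x in vector]


def dequantize(codes, lo, hi):
    """Recover approximate float values from the one-byte codes."""
    scale = (hi - lo) / 255.0
    return [lo + code * scale for code in codes]


vector = [-1.0, -0.5, 0.25, 0.5, 1.0]
codes = quantize(vector, -1.0, 1.0)
approx = dequantize(codes, -1.0, 1.0)
```

Each component now takes one byte instead of the four used by a 32-bit float, at the cost of a small rounding error per component.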

Using quantisation

Qdrant's documentation on quantisation is excellent, and I would recommend referring to it for all of the configuration options available.

As with RAM-only mode, quantisation options are set when you create the collection.

I have personally found that the following configuration is a good starting point:

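from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    HnswConfigDiff,
    OptimizersConfigDiff,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

# Placeholder values: replace with your own host, collection name and embedding size.
host = "localhost"
port = 6333
qdrant_collection = "my_collection"
vector_size = 768

client = QdrantClient(host=host, port=port)
client.create_collection(
    collection_name=qdrant_collection,
    vectors_config=VectorParams(
        size=vector_size,
        distance=Distance.COSINE,
    ),
    # Denser HNSW graph: better recall at the cost of memory and build time.
    hnsw_config=HnswConfigDiff(
        m=64,
        ef_construct=100,
    ),
    optimizers_config=OptimizersConfigDiff(
        default_segment_number=4,
    ),
    # Store vectors as int8 and keep the quantised copies in RAM.
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        ),
    ),
    # Three copies of each shard; a write succeeds once two replicas acknowledge it.
    replication_factor=3,
    write_consistency_factor=2,
)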

Using RAM-only mode

There are three main options for optimising storage for latency. These are covered in detail in Qdrant's documentation.

The options are:

  • High speed search with low memory usage
  • High precision search with low memory usage
  • High precision search with high memory usage

I prefer the last of these for low-latency queries.

As per Qdrant's example in the documentation, the high precision search with high memory usage option keeps the vectors in RAM (always_ram=True). Combined with quantisation, this avoids disk reads at query time and can give you the fastest queries available.

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="{collection_name}",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        ),
    ),
)
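Before committing to keeping vectors in RAM, it helps to estimate how much memory they need. The sketch below is back-of-the-envelope arithmetic only: it counts the raw vector components and ignores the HNSW graph, payloads and any other overhead. The collection size of one million vectors is a hypothetical figure; the dimension of 768 matches the example above.

```python
def vector_memory_bytes(num_vectors, dim, bytes_per_component):
    """Raw storage for the vector components alone, in bytes."""
    return num_vectors * dim * bytes_per_component


num_vectors = 1_000_000  # hypothetical collection size
dim = 768                # vector size used in the example above

float32_bytes = vector_memory_bytes(num_vectors, dim, 4)  # original vectors
int8_bytes = vector_memory_bytes(num_vectors, dim, 1)     # quantised copies

gib = 1024 ** 3
print(f"float32: {float32_bytes / gib:.2f} GiB")
print(f"int8:    {int8_bytes / gib:.2f} GiB")
```

Under these assumptions the quantised copies need a quarter of the memory of the original vectors, which is what makes keeping them in RAM affordable.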

Conclusion

Putting it all together, using listener nodes, quantisation, and RAM-only mode should give you the best query latency for your Qdrant cluster. In most cases, I have found that with these optimisations a query returns in under 4ms in a distributed, cross-region Qdrant cluster. By contrast, using just primaries and replicas, don't be surprised if you see query times of around 650ms.
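If you want to check numbers like these against your own cluster, a small timing helper is enough. The sketch below uses only the standard library and times any callable you give it; `measure_latency_ms` and `run_query` are hypothetical names, and `run_query` would wrap whatever search call you make against your cluster.

```python
import statistics
import time


def measure_latency_ms(run_query, repeats=50, warmup=5):
    """Time repeated calls to `run_query` and summarise latency in milliseconds."""
    # Warm-up calls are not recorded, so connection setup does not skew results.
    for _ in range(warmup):
        run_query()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


# Stand-in workload; replace the lambda with your own query call.
stats = measure_latency_ms(lambda: sum(range(1000)), repeats=20, warmup=2)
print(stats)
```

Running it once against a replica with a local listener and once against a replica without one, from a client in the same region, gives a like-for-like comparison.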

Happy querying!

Copyright © 2024 homogeneous ai pty ltd. All rights reserved.