Vector Quantization: Scale Search & Generative AI Applications

Mai Nguyen and Henry Weller

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview:

  • Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements.

  • Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search.

View the documentation to get started.

We are excited to announce a robust set of vector quantization capabilities in MongoDB Atlas Vector Search. These capabilities will reduce vector sizes while preserving performance, enabling developers to build powerful semantic search and generative AI applications with more scale—and at a lower cost. In addition, unlike relational or niche vector databases, MongoDB’s flexible document model—coupled with quantized vectors—allows for greater agility in testing and deploying different embedding models quickly and easily.

Support for scalar quantized vector ingestion is now generally available, and will be followed by several new releases in the coming weeks. Read on to learn how vector quantization works and visit our documentation to get started!

The challenges of large-scale vector applications

While the use of vectors has opened up a range of new possibilities, such as content summarization and sentiment analysis, natural language chatbots, and image generation, unlocking insights within unstructured data can require storing and searching through billions of vectors—which can quickly become infeasible.

Vectors are effectively arrays of floating-point numbers that represent unstructured information in a way that computers can understand. An application may store anywhere from a few hundred to billions of these arrays, and as the number of vectors increases, so does the index size required to search over them. As a result, large-scale vector-based applications using full-fidelity vectors often have high processing costs and slow query times, hindering their scalability and performance.

Vector quantization for cost-effectiveness, scalability, and performance

Vector quantization, a technique that compresses vectors while preserving their semantic similarity, offers a solution to this challenge. Imagine converting a full-color image into grayscale to reduce storage space on a computer. This involves simplifying each pixel's color information by grouping similar colors into primary color channels or "quantization bins," and then representing each pixel with a single value from its bin. The binned values are then used to create a new grayscale image that is smaller but retains most of the original detail, as shown in Figure 1.

Figure 1: Illustration of quantizing an RGB image into grayscale
This image is an illustration of quantizing an RGB image into grayscale. On the left side is a photo of a puppy in normal color. In the middle is that same photo in RGB examples. And then on the right is a grayscale version of the photo.

Vector quantization works similarly, by shrinking full-fidelity vectors into fewer bits to significantly reduce memory and storage costs without compromising the important details. Maintaining this balance is critical, as search and AI applications need to deliver relevant insights to be useful.

Two effective quantization methods are scalar (converting each floating-point value into an integer) and binary (converting each floating-point value into a single bit, 0 or 1). Current and upcoming quantization capabilities will empower developers to maximize the potential of Atlas Vector Search.
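To make the two methods concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method Atlas Vector Search or any embedding provider uses internally: the function names, the fixed [-1.0, 1.0] value range, and the "positive value becomes 1" rule are all choices made for this example.

```python
from typing import List


def scalar_quantize(vector: List[float], lo: float, hi: float) -> List[int]:
    """Map each float in [lo, hi] to an int8 value in [-128, 127].

    Values outside [lo, hi] are clipped first. The range is an assumption
    of this sketch; real systems usually derive it from the data.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scale = 255.0 / (hi - lo)
    quantized = []
    for x in vector:
        clipped = min(max(x, lo), hi)
        quantized.append(int(round((clipped - lo) * scale)) - 128)
    return quantized


def binary_quantize(vector: List[float]) -> bytes:
    """Keep one bit per dimension (1 if the value is positive, else 0).

    Bits are packed eight per byte, most significant bit first.
    """
    packed = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


# Toy 8-dimensional embedding, made up for illustration.
embedding = [0.12, -0.53, 0.98, -0.07, 0.0, 0.44, -0.91, 0.30]

print(scalar_quantize(embedding, lo=-1.0, hi=1.0))  # one small integer per dimension
print(binary_quantize(embedding).hex())             # one bit per dimension
```

Scalar quantization keeps a coarse version of each value's magnitude, while binary quantization keeps only its sign, which is why the binary form is much smaller but loses more information.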

The most impactful benefit of vector quantization is increased scalability and cost savings through reduced computing resources and efficient processing of vectors. And when combined with Search Nodes (MongoDB’s dedicated, memory-optimized infrastructure that isolates semantic search and generative AI workloads so they can scale independently), vector quantization can further reduce costs and improve performance, even at the highest volume and scale, unlocking more use cases.

“Cohere is excited to be one of the first partners to support quantized vector ingestion in MongoDB Atlas,” said Nils Reimers, VP of AI Search at Cohere. “Embedding models, such as Cohere Embed v3, help enterprises see more accurate search results based on their own data sources. We’re looking forward to providing our joint customers with accurate, cost-effective applications for their needs.”

In our tests, BSON-type vectors (BSON is MongoDB’s JSON-like binary serialization format for efficient document storage) reduced storage size by 66% compared to full-fidelity vectors, from 41 GB to 14 GB. As shown in Figures 2 and 3, the tests also show significant memory reduction (73% to 96% less) and latency improvements with quantized vectors. Scalar quantization preserves recall performance, and binary quantization maintains recall when combined with rescoring, a process that evaluates a small subset of the quantized outputs against full-fidelity vectors to improve the accuracy of the search results.

Figure 2: Significant storage reduction + good recall and latency performance with quantization on different embedding models
This image is a table of vector index size and latency for three test groups (full-fidelity vectors, scalar quantization, and binary quantization) on two datasets: 200k documents with OpenAI embedding models and 3 million documents with a Cohere embedding model. Full-fidelity vectors had a 1.2 GB vector index and 13 ms latency on 200k documents, and a 12 GB vector index and 26 ms latency on 3 million documents. Scalar quantization had a 0.32 GB vector index and 11 ms latency on 200k documents, and a 3.2 GB vector index and 19 ms latency on 3 million documents. Binary quantization had a 0.05 GB vector index and 12 ms latency on 200k documents, and a 0.5 GB vector index on 3 million documents; both index sizes are a 96% reduction from the full-fidelity vectors test.
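The index-size reductions reported in Figure 2 are close to what the raw element widths alone would predict. The sketch below is only back-of-the-envelope arithmetic: it counts bytes per vector and ignores any index overhead, and the 1024-dimension figure and function names are assumptions chosen for illustration.

```python
# Bits used per dimension by each representation.
BITS_PER_DIMENSION = {"float32": 32, "int8": 8, "int1": 1}


def vector_bytes(num_dimensions: int, dtype: str) -> int:
    """Raw bytes needed to store one vector, rounded up to whole bytes."""
    bits = num_dimensions * BITS_PER_DIMENSION[dtype]
    return (bits + 7) // 8


def reduction_vs_float32(num_dimensions: int, dtype: str) -> float:
    """Fraction of raw storage saved relative to float32."""
    full = vector_bytes(num_dimensions, "float32")
    return 1.0 - vector_bytes(num_dimensions, dtype) / full


# Illustrative dimension count; substitute your embedding model's value.
for dtype in ("int8", "int1"):
    print(dtype, f"{reduction_vs_float32(1024, dtype):.1%}")
```

Going from 32 bits to 8 bits per dimension saves 75% of the raw vector bytes, and going to 1 bit saves about 97%, which is in line with the 73% and 96% index-size reductions measured in the tests.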

Figure 3: Remarkable improvement in recall performance for binary quantization when combining with rescoring
This image is a graph of improvement in recall performance for binary quantization when combining with rescoring. The Y axis of the graph represents average recall over 50 queries, while the X axis represents num candidates. There are 4 lines on the graph, each representing a different type of queries. The line representing binary, in red, starts near 0,0 and stays below 0.6 on the graph across all num candidates, putting it as the lowest line on the graph. The float ANN line, in blue, starts near the top of the Y axis at 0 num candidates and moves in a level line across the graph, same goes for the scalar line, in orange, which comes in just below the float ANN. The binary + rescoring line starts towards the bottom of the Y axis at 0 num candidates, but gradually increases the more the graph moves right.
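The rescoring idea behind Figure 3 can be sketched as a two-stage search: rank everything cheaply with binary vectors, then re-rank only the top candidates with the full-fidelity vectors. The code below is a toy, brute-force illustration, not how Atlas Vector Search implements it; the function names, the Hamming-distance first stage, the dot-product second stage, and the sample data are all assumptions for this example.

```python
from typing import List


def to_bits(vector: List[float]) -> int:
    """Binary-quantize a vector into an integer bit mask (1 if value > 0)."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two bit masks differ."""
    return bin(a ^ b).count("1")


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def binary_candidates(query: List[float], corpus_bits: List[int], k: int) -> List[int]:
    """Stage 1: cheap ranking by Hamming distance on binary vectors."""
    query_bits = to_bits(query)
    order = sorted(range(len(corpus_bits)), key=lambda i: hamming(query_bits, corpus_bits[i]))
    return order[:k]


def search_with_rescoring(query: List[float], corpus: List[List[float]],
                          num_candidates: int, limit: int) -> List[int]:
    """Stage 2: re-rank the binary candidates using full-fidelity vectors."""
    corpus_bits = [to_bits(v) for v in corpus]  # precomputed once in practice
    candidates = binary_candidates(query, corpus_bits, num_candidates)
    rescored = sorted(candidates, key=lambda i: dot(query, corpus[i]), reverse=True)
    return rescored[:limit]


# Toy data, made up for illustration.
corpus = [
    [0.9, 0.1, 0.2, 0.1],
    [0.1, 0.9, 0.1, 0.2],
    [0.6, 0.5, -0.1, 0.3],
    [-0.7, -0.2, 0.4, -0.5],
]
query = [0.1, 1.0, 0.05, 0.1]

binary_only = binary_candidates(query, [to_bits(v) for v in corpus], 2)
rescored = search_with_rescoring(query, corpus, num_candidates=3, limit=2)
print("binary only:", binary_only)
print("binary + rescoring:", rescored)
```

In this toy data, the binary stage cannot tell the first two documents apart because they have identical sign patterns. Widening the candidate set and rescoring with the original floats recovers the ordering that full-fidelity search would give, which mirrors the trend in Figure 3 where recall improves as the number of candidates grows.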

In addition, thanks to its lower cost, vector quantization makes more advanced, multi-vector use cases practical that would otherwise have been too computationally taxing or cost-prohibitive to implement. For example, vector quantization can help users:

  • Easily A/B test different embedding models using multiple vectors produced from the same source field during prototyping. MongoDB’s document model—coupled with quantized vectors—allows for greater agility at lower costs. The flexible document schema lets developers quickly deploy and compare embedding models’ results without the need to rebuild the index or provision an entirely new data model or set of infrastructure.

  • Further improve the relevance of search results or context for large language models (LLMs) by incorporating vectors from multiple sources of relevance, such as different source fields (product descriptions, product images, etc.) embedded with the same or different models.
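As a rough sketch of the A/B testing idea above, a single document can carry embeddings from two models in separate fields, and a query can target either field by changing the search path. The document shape, the field names `embedding_model_a` and `embedding_model_b`, the index name `vector_index`, and the helper function are hypothetical; the stage keys follow the `$vectorSearch` aggregation stage as we understand it, so check the documentation before relying on them.

```python
from typing import Any, Dict, List

# Hypothetical document: one source field, embeddings from two models.
# Vectors are toy-sized here; real embeddings have many more dimensions.
product = {
    "name": "Trail running shoe",
    "description": "Lightweight shoe with a grippy outsole.",
    "embedding_model_a": [0.12, -0.53, 0.98],
    "embedding_model_b": [0.40, 0.08, -0.77, 0.31],
}


def vector_search_stage(path: str, query_vector: List[float],
                        index_name: str = "vector_index",
                        num_candidates: int = 100,
                        limit: int = 10) -> Dict[str, Any]:
    """Build a $vectorSearch stage that searches one embedding field."""
    return {
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": query_vector,
            "numCandidates": num_candidates,
            "limit": limit,
        }
    }


# The query must be embedded with the same model as the field it searches.
pipeline_a = [vector_search_stage("embedding_model_a", [0.10, -0.50, 0.95])]
pipeline_b = [vector_search_stage("embedding_model_b", [0.38, 0.10, -0.80, 0.29])]

print(pipeline_a[0]["$vectorSearch"]["path"])
print(pipeline_b[0]["$vectorSearch"]["path"])
```

Running both pipelines against the same collection (for example with a driver's `aggregate` method) and comparing the results is one way to evaluate two embedding models side by side without a separate data model for each.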

How to get started, and what’s next

Now, with support for the ingestion of scalar quantized vectors, developers can import and work with quantized vectors from their embedding model providers of choice (such as Cohere, Nomic, Jina, Mixedbread, and others)—directly in Atlas Vector Search. Read the documentation and tutorial to get started.

And in the coming weeks, additional vector quantization features will equip developers with a comprehensive toolset for building and optimizing applications with quantized vectors:

  • Support for ingestion of binary quantized vectors will enable further reduction of storage space, allowing for greater cost savings and giving developers the flexibility to choose the type of quantized vectors that best fits their requirements.

  • Automatic quantization and rescoring will provide native capabilities for scalar quantization as well as binary quantization with rescoring in Atlas Vector Search, making it easier for developers to take full advantage of vector quantization within the platform.
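For a sense of what automatic quantization could look like in practice, the sketch below builds a vector index definition as a Python dictionary. It assumes the index definition accepts a `quantization` option with the values `none`, `scalar`, and `binary` on a vector field; the helper function, the field name `embedding`, and the dimension count are hypothetical, so treat this as an illustration and confirm the exact option names in the documentation.

```python
from typing import Any, Dict

# Quantization modes assumed for this sketch; verify against the docs.
ALLOWED_QUANTIZATION = {"none", "scalar", "binary"}


def vector_field(path: str, num_dimensions: int,
                 similarity: str = "cosine",
                 quantization: str = "none") -> Dict[str, Any]:
    """Describe one vector field of a vector search index definition."""
    if quantization not in ALLOWED_QUANTIZATION:
        raise ValueError(f"unsupported quantization mode: {quantization!r}")
    return {
        "type": "vector",
        "path": path,
        "numDimensions": num_dimensions,
        "similarity": similarity,
        "quantization": quantization,
    }


# Hypothetical index over an "embedding" field with scalar quantization.
index_definition = {
    "fields": [vector_field("embedding", 1024, quantization="scalar")]
}

print(index_definition)
```

With automatic quantization, the application would keep writing full-fidelity vectors and leave the compression to the index, in contrast to ingestion of pre-quantized vectors, where the embedding provider or the application does the quantizing.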

With support for quantized vectors in MongoDB Atlas Vector Search, you can build scalable and high-performing semantic search and generative AI applications with flexibility and cost-effectiveness. Check out the documentation and tutorial to get started.

Head over to our quick-start guide to get started with Atlas Vector Search today.