Vector Quantization: Scale Search & Generative AI Applications

Mai Nguyen and Henry Weller
October 7, 2024 | Updated: March 5, 2025
#genAI #Vector Search

This post is also available in: Deutsch, Français, Español, Português, Italiano, 한국어, 简体中文.

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview:

Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements.

Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search.

View the documentation to get started.

We are excited to announce a robust set of vector quantization capabilities in MongoDB Atlas Vector Search. These capabilities will reduce vector sizes while preserving performance, enabling developers to build powerful semantic search and generative AI applications with more scale—and at a lower cost. In addition, unlike relational or niche vector databases, MongoDB’s flexible document model—coupled with quantized vectors—allows for greater agility in testing and deploying different embedding models quickly and easily.

Support for scalar quantized vector ingestion is now generally available, and will be followed by several new releases in the coming weeks. Read on to learn how vector quantization works and visit our documentation to get started!

The challenges of large-scale vector applications

While the use of vectors has opened up a range of new possibilities, such as content summarization and sentiment analysis, natural language chatbots, and image generation, unlocking insights within unstructured data can require storing and searching through billions of vectors—which can quickly become infeasible.

Vectors are effectively arrays of floating-point numbers representing unstructured information in a way that computers can understand (ranging from a few hundred to billions of arrays), and as the number of vectors increases, so does the index size required to search over them. As a result, large-scale vector-based applications using full-fidelity vectors often have high processing costs and slow query times, hindering their scalability and performance.

Vector quantization for cost-effectiveness, scalability, and performance

Vector quantization, a technique that compresses vectors while preserving their semantic similarity, offers a solution to this challenge. Imagine converting a full-color image into grayscale to reduce storage space on a computer. This involves simplifying each pixel's color information by grouping similar colors into primary color channels or "quantization bins," and then representing each pixel with a single value from its bin. The binned values are then used to create a new grayscale image with smaller size but retaining most original details, as shown in Figure 1.

Figure 1: Illustration of quantizing an RGB image into grayscale

This image is an illustration of quantizing an RGB image into grayscale. On the left side is a photo of a puppy in normal color. In the middle is that same photo in RGB examples. And then on the right is a grayscale version of the photo.

Vector quantization works similarly, by shrinking full-fidelity vectors into fewer bits to significantly reduce memory and storage costs without compromising the important details. Maintaining this balance is critical, as search and AI applications need to deliver relevant insights to be useful.

Two effective quantization methods are scalar (converting a float point into an integer) and binary (converting a float point into a single bit of 0 or 1). Current and upcoming quantization capabilities will empower developers to maximize the potential of Atlas Vector Search.

The most impactful benefit of vector quantization is increased scalability and cost savings through reduced computing resources and efficient processing of vectors. And when combined with Search Nodes—MongoDB’s dedicated infrastructure for independent scalability through workload isolation and memory-optimized infrastructure for semantic search and generative AI workloads— vector quantization can further reduce costs and improve performance, even at the highest volume and scale to unlock more use cases.

"Cohere is excited to be one of the first partners to support quantized vector ingestion in MongoDB Atlas,” said Nils Reimers, VP of AI Search at Cohere. “Embedding models, such as Cohere Embed v3, help enterprises see more accurate search results based on their own data sources. We’re looking forward to providing our joint customers with accurate, cost-effective applications for their needs.”

In our tests, compared to full-fidelity vectors, BSON-type vectors—MongoDB’s JSON-like binary serialization format for efficient document storage—reduced storage size by 66% (from 41 GB to 14 GB). And as shown in Figures 2 and 3, the tests illustrate significant memory reduction (73% to 96% less) and latency improvements using quantized vectors, where scalar quantization preserves recall performance and binary quantization’s recall performance is maintained with rescoring–a process of evaluating a small subset of the quantized outputs against full-fidelity vectors to improve the accuracy of the search results.

Figure 2: Significant storage reduction + good recall and latency performance with quantization on different embedding models

This image is a table displaying storage size and latency times for different amounts of documents and test groups. The test is divided into three groups, which are Full-Fidelity Vectors, Scalar Quantization, and Binary Quantization. Then, there are two different groups for the number of total documents, one being 200k docs on OpenAI embedding models, and the other being 3 million docs on Cohere embedding model. For the data, the full-fidelity vectors test on 200k docs had a vector index size of 1.2 GB and a latency of 13ms, and a 12GB vector index size and 26ms latency on the 3 million docs test. The Scalar Quantization test had a vector index size of .32 GB and 11ms latency on the 200k docs test, and a 3.2 GB vector index size and 19ms latency on the 3 million docs test. Finally, the binary quantization had a .05 GB vector index size on the 200k docs test (a 96% reduction from other tests) along with a 12ms latency, and then a .5 GB vector index size on 3 million docs test, representing a 96% reduction from the Full-Fidelity Vectors test.

Figure 3: Remarkable improvement in recall performance for binary quantization when combining with rescoring

This image is a graph of improvement in recall performance for binary quantization when combining with rescoring. The Y axis of the graph represents average recall over 50 queries, while the X axis represents num candidates. There are 4 lines on the graph, each representing a different type of queries. The line representing binary, in red, starts near 0,0 and stays below 0.6 on the graph across all num candidates, putting it as the lowest line on the graph. The float ANN line, in blue, starts near the top of the Y axis at 0 num candidates and moves in a level line across the graph, same goes for the scalar line, in orange, which comes in just below the float ANN. The binary + rescoring line starts towards the bottom of the Y axis at 0 num candidates, but gradually increases the more the graph moves right.

In addition, thanks to the reduced cost advantage, vector quantization facilitates more advanced, multiple vector use cases that would have been too computationally-taxing or cost-prohibitive to implement. For example, vector quantization can help users:

Easily A/B test different embedding models using multiple vectors produced from the same source field during prototyping. MongoDB’s document model—coupled with quantized vectors—allows for greater agility at lower costs. The flexible document schema lets developers quickly deploy and compare embedding models’ results without the need to rebuild the index or provision an entirely new data model or set of infrastructure.
Further improve the relevance of search results or context for large language models (LLMs) by incorporating vectors from multiple sources of relevance, such as different source fields (product descriptions, product images, etc.) embedded within the same or different models.

How to get started, and what’s next

Now, with support for the ingestion of scalar quantized vectors, developers can import and work with quantized vectors from their embedding model providers of choice (such as Cohere, Nomic, Jina, Mixedbread, and others)—directly in Atlas Vector Search. Read the documentation and tutorial to get started.

And in the coming weeks, additional vector quantization features will equip developers with a comprehensive toolset for building and optimizing applications with quantized vectors:

Support for ingestion of binary quantized vectors will enable further reduction of storage space, allowing for greater cost savings and giving developers the flexibility to choose the type of quantized vectors that best fits their requirements.

Automatic quantization and rescoring will provide native capabilities for scalar quantization as well as binary quantization with rescoring in Atlas Vector Search, making it easier for developers to take full advantage of vector quantization within the platform.

With support for quantized vectors in MongoDB Atlas Vector Search, you can build scalable and high-performing semantic search and generative AI applications with flexibility and cost-effectiveness. Check out these resources to get started documentation and tutorial.

Head over to our quick-start guide to get started with Atlas Vector Search today.

← Previous

MongoDB.local London 2024: Better Applications, Faster

This post is also available in: Deutsch , Français , Español , Português , Italiano , 한국어 , 简体中文 . Since we kicked off MongoDB’s series of 2024 events in April, we’ve connected with thousands of customers, partners, and community members in cities around the world—from Mexico City to Mumbai. Yesterday marked the nineteenth stop of the 2024 MongoDB.local tour, and we had a blast welcoming folks across industries to MongoDB.local London, where we discussed the latest technology trends, celebrated customer innovations, and unveiled product updates that make it easier than ever for developers to build next-gen applications. Over the past year, MongoDB’s more than 50,000 customers have been telling us that their needs are changing. They’re increasingly focused on three areas: Helping developers build faster and more efficiently Empowering teams to create AI-powered applications Moving from legacy systems to modern platforms Across these areas, there’s a common need for a solid foundation: each requires a resilient, scalable, secure, and highly performant database. The updates we shared at MongoDB.local London reflect these priorities. MongoDB is committed to ensuring that our products are built to exceed our customers’ most stringent requirements, and that they provide the strongest possible foundation for building a wide range of applications, now and in the future. Indeed, during yesterday’s event, Sahir Azam, MongoDB’s Chief Product Officer, discussed the foundational role data plays in his keynote address. He also shared the latest advancement from our partner ecosystem, an AI solution powered by MongoDB, Amazon Web Services, and Anthropic that makes it easier for customers to deploy gen AI customer care applications. MongoDB 8.0: The best version of MongoDB ever The biggest news at .local London was the general availability of MongoDB 8.0 , which provides significant performance improvements, reduced scaling costs, and adds additional scalability, resilience, and data security capabilities to the world’s most popular document database. Architectural optimizations in MongoDB 8.0 have significantly reduced memory usage and query times, and MongoDB 8.0 has more efficient batch processing capabilities than previous versions. Specifically, MongoDB 8.0 features 36% better read throughput, 56% faster bulk writes, and 20% faster concurrent writes during data replication. In addition, MongoDB 8.0 can handle higher volumes of time series data and can perform complex aggregations more than 200% faster—with lower resource usage and costs. Last (but hardly least!), Queryable Encryption now supports range queries, ensuring data security while enabling powerful analytics. For more on MongoDB.local London’s product announcements—which are designed to accelerate application development, simplify AI innovation, and speed developer upskilling—please read on! Accelerating application development Improved scaling and elasticity on MongoDB Atlas capabilities New enhancements to MongoDB Atlas’s control plane allow customers to scale clusters faster, respond to resource demands in real-time, and optimize performance—all while reducing operational costs. First, our new granular resource provisioning and scaling features—including independent shard scaling and extended storage and IOPS on Azure—allow customers to optimize resources precisely where needed. Second, Atlas customers will experience faster cluster scaling with up to 50% quicker scaling times by scaling clusters in parallel by node type. Finally, MongoDB Atlas users will enjoy more responsive auto-scaling, with a 5X improvement in responsiveness thanks to enhancements in our scaling algorithms and infrastructure. These enhancements are being rolled out to all Atlas customers, who should start seeing benefits immediately. IntelliJ plugin for MongoDB Announced in private preview, the MongoDB for IntelliJ Plugin is designed to functionally enhance the way developers work with MongoDB in IntelliJ IDEA, one of the most popular IDEs among Java developers. The plugin allows enterprise Java developers to write and test Java queries faster, receive proactive performance insights, and reduce runtime errors right in their IDE. By enhancing the database-to-IDE integration, JetBrains and MongoDB have partnered to deliver a seamless experience for their shared user-base and unlock their potential to build modern applications faster. Sign up for the private preview here . MongoDB Copilot Participant for VS Code (Public Preview) Now in public preview, the new MongoDB Participant for GitHub Copilot integrates domain-specific AI capabilities directly with a chat-like experience in the MongoDB Extension for VS Code . The participant is deeply integrated with the MongoDB extension, allowing for the generation of accurate MongoDB queries (and exporting them to application code), describing collection schemas, and answering questions with up-to-date access to MongoDB documentation without requiring the developer to leave their coding environment. These capabilities significantly reduce the need for context switching between domains, enabling developers to stay in their flow and focus on building innovative applications. Multicluster support for the MongoDB Enterprise Kubernetes Operator Ensure high availability, resilience, and scale for MongoDB deployments running in Kubernetes through added support for deploying MongoDB and Ops Manager across multiple Kubernetes clusters. Users now have the ability to deploy ReplicaSets, Sharded Clusters (in public preview), and Ops Manager across local or geographically distributed Kubernetes clusters for greater deployment resilience, flexibility, and disaster recovery. This approach enables multi-site availability, resilience, and scalability within Kubernetes, capabilities that were previously only available outside of Kubernetes for MongoDB. To learn more, check out the documentation . MongoDB Atlas Search and Vector Search are now generally available via the Atlas CLI and Docker The local development experience for MongoDB Atlas is now generally available. Use the MongoDB Atlas CLI and Docker to build with MongoDB Atlas in your preferred local environment, and easily access features like Atlas Search and Atlas Vector Search throughout the entire software development lifecycle. The Atlas CLI provides a unified and familiar terminal-based interface that allows you to deploy and build with MongoDB Atlas in your preferred development environment, locally or in the cloud. If you build with Docker, you can also now use Docker and Docker Compose to easily integrate Atlas in your local and continuous integration environments with the Atlas CLI . Avoid repetitive work by automating the lifecycle of your development and testing environments and focus on building application features with full-text search, AI and semantic search, and more. Simplifying AI innovation Reduce costs and increase scale in Atlas Vector Search We announced vector quantization capabilities in Atlas Vector Search . By reducing memory (by up to 96%) and making vectors faster to retrieve, vector quantization allows customers to build a wide range of AI and search applications at higher scale and lower cost. Generally available now, support for scalar quantized vector ingestion lets customers seamlessly import and work with quantized vectors from their embedding model providers of choice—directly in Atlas Vector Search. Coming soon, additional vector quantization features, including automatic quantization, will equip customers with a comprehensive toolset for building and optimizing large-scale AI and search applications in Atlas Vector Search. Additional integrations with popular AI frameworks Ship your next AI-powered project faster with MongoDB, no matter your framework or LLM of choice. AI technologies are advancing rapidly, making it important to build and scale performant applications quickly, and to use your preferred stack as your requirements and available technologies evolve. MongoDB’s enhanced suite of integrations with LangChain, LlamaIndex, Microsoft Semantic Kernel, AutoGen, Haystack, Spring AI, the ChatGPT Retrieval Plugin, and more make it easier than ever to build the next generation of applications on MongoDB . Advancing developer upskilling New MongoDB Learning Badges Faster to achieve and more targeted than a certification, MongoDB's free Learning Badges show your commitment to continuous learning and to proving your knowledge about a specific topic. Follow the learning path, gain new skills, and get a digital badge to show off on LinkedIn. Check out the two new gen AI learning badges! Building gen AI Apps : Learn to create innovative gen AI apps with Atlas Vector Search, including retrieval-augmented generation (RAG) apps. Deploying and Evaluating gen AI Apps : Take your apps from creation to full deployment, focusing on optimizing performance and evaluating results. Learn more To learn more about MongoDB’s recent product announcements and updates, check out our What’s New product announcements page and all of our blog posts about product updates . Happy building!

October 3, 2024

Next →

Next-Generation Mobility Solutions with Agentic AI and MongoDB Atlas

Driven by advancements in vehicle connectivity, autonomous systems, and electrification, the automotive and mobility industry is currently undergoing a significant transformation. Vehicles today are sophisticated machines, computers on wheels, that generate massive amounts of data, driving demand for connected and electric vehicles. Automotive players are embracing artificial intelligence (AI), battery electrical vehicles (BEVs), and software-defined vehicles (SDVs) to maintain their competitive advantage. However, managing fleets of connected vehicles can be a challenge. As cars get more sophisticated and are increasingly integrated with internal and external systems, the volume of data they produce and receive greatly increases. This data needs to be stored, transferred, and consumed by various downstream applications to unlock new business opportunities. This will only grow: the global fleet management market is projected to reach $65.7 billion by 2030, growing at a rate of almost 10.8% annually. A 2024 study conducted by Webfleet showed that 32% of fleet managers believe AI and machine learning will significantly impact fleet operations in the coming years; optimizing route planning and improving driver safety are the two most commonly cited use cases. As fleet management software providers continue to invest in AI, the integration of agentic AI can significantly help with things like route optimization and driver safety enhancement. For example, AI agents can process real-time traffic updates and weather conditions to dynamically adjust routes, ensuring timely deliveries while advising drivers on their car condition. This proactive approach contrasts with traditional reactive methods, improving vehicle utilization and reducing operational and maintenance costs. But what are agents? In short, they are operational applications that attempt to achieve goals by observing the world and acting upon it using the data and tools the application has at its disposal. The term "agentic" denotes having agency, as AI agents can proactively take steps to achieve objectives without constant human oversight. For example, rather than just reporting an anomaly based on telemetry data analysis, an agent for a connected fleet could autonomously cross-check that anomaly against known issues, decide whether it's critical or not, and schedule a maintenance appointment all on its own. Why MongoDB for agentic AI Agentic AI applications are dynamic by nature as they require the ability to create a chain of thought, use external tools, and maintain context across their entire workflow. These applications generate and consume diverse data types, including structured and unstructured data. MongoDB’s flexible document model is uniquely suited to handle both structured and unstructured data as vectors. It allows all of an agent’s context, chain-of-thought, tools metadata, and short-term and long-term memory to be stored in a single database. This means that developers can spend more time on innovation and rapidly iterate on agent designs without being constrained by rigid schemas of a legacy relational database. Figure 1. Major components of an AI agent. Figure 1 shows the major components of an AI agent. The agent will first receive a task from a human or via an automated trigger, and will then use a large language model (LLM) to generate a chain of thought or follow a predetermined workflow. The agent will use various tools and models during its run and store/retrieve data from a memory provider like MongoDB Atlas . Tools: The agent utilizes tools to interact with the environment. This can contain API methods, database queries, vector search, RAG application, anything to support the model Models: can be a large language model (LLM), vision language model (VLM), or a simple supervised machine learning model. Models can be general purpose or specialized, and agents may use more than one. Data: An agent requires different types of data to function. MongoDB’s document model allows you to easily model all of this data in one single database. An agentic AI spans a wide range of functional tools and context. The underlying data structures evolve throughout the agentic workflow and as an agent uses different tools to complete a task. It also builds up memory over time. Let us list down the typical data types you will find in an agentic AI application. Data types: Agent profile: This contains the identity of the agent. It includes instructions, goals and constraints. Short-term memory: This holds temporary, contextual information—recent data inputs or ongoing interactions—that the agent uses in real-time. For example, short-term memory could store sensor data from the last few hours of vehicle activity. In certain agentic AI frameworks like Langgraph, short term memory is implemented through a checkpointer. The checkpointer stores intermediate states of the agent’s actions and/or reasoning. This memory allows the agent to seamlessly pause and resume operations. Long-term memory: This is where the agent stores accumulated knowledge over time. This may include patterns, trends, logs and historical recommendations and decisions. By storing each of these data types into rich, nested documents in MongoDB, AI developers can create a single-view representation of an agent’s state and behavior. This enables fast retrieval and simplifies development. In addition to the document model advantage, building agentic AI solutions for mobility requires a robust data infrastructure. MongoDB Atlas offers several key advantages that make it an ideal foundation for these AI-driven architectures. These include: Scalability and flexibility: Connected Car platforms like fleet management systems need to handle extreme data volumes and variety. MongoDB Atlas is proven to scale horizontally across cloud clusters, letting you ingest millions of telemetry events per minute and store terabytes of telemetry data with ease. For example, the German company ZF uses MongoDB to process 90,000 vehicle messages per minute (over 50 GB of data per day) from hundreds of thousands of connected cars. The flexibility of the document model accelerates development and ensures your data model stays aligned with the real-world entities it represents. Built-in vector search: AI agents require a robust set of tools to work with. One of the most widely used tools is vector search, which allows agents to perform semantic searches on unstructured data like driver logs, error codes descriptions, and repair manuals. MongoDB Atlas Vector Search allows you to store and index high-dimensional vectors alongside your documents and to perform semantic search over unstructured data. In practice, this means your AI embeddings live right next to the relevant vehicle telemetry and operational data in the database, simplifying architectures for use cases like the connected car incident advisor, in which a new issue can be matched against past issues before passing contextual information to the LLM. For more, check out this example of how an automotive OEM leverages vector search for audio based diagnostics with MongoDB Atlas Vector Search. Time series collections and real-time data processing: MongoDB Atlas is designed for real-time applications. It provides time series collections for connected car telemetry data storage, change streams, and triggers that can react to new data instantly. This is crucial for agentic AI feedback loops, where ongoing data ingestion and learning are happening continuously. Best-in-class embedding models with Voyage AI: In early 2025, MongoDB acquired Voyage AI , a leader in embedding and reranking models. Voyage AI embedding models are currently being integrated into MongoDB Atlas, which means developers will no longer need to manage external embedding APIs, standalone vector stores, or complex search pipelines. AI retrieval will be built into the database itself, making semantic search, vector retrieval, and ranking as seamless as traditional queries. This will reduce the time required for developing agentic AI applications. Agentic AI in action: Connected fleet incident advisor Figure 2 shows a list of use cases in the Mobility sector, sorted by various capabilities that an agent might demonstrate. AI agents excel at managing multi-step tasks via context management across tasks, they automate repetitive tasks better than Robotic process automation (RPA), and they demonstrate human-like reasoning by revisiting and revising past decisions. These capabilities enable a wide range of applications both during the manufacturing of a vehicle and while it's on the road, connected and sending telemetry. We will review a use case in detail below, and will see how it can be implemented using MongoDB Atlas, LangGraph, Open AI, and Voyage AI. Figure 2. Major use cases of agentic AI in the mobility and manufacturing sectors. First, the AI agent connects to traditional fleet management software and supports the fleet manager in diagnosing and advising the drivers. This is an example of a multi-step diagnostic workflow that gets triggered when a driver submits a complaint about the vehicle's performance (for example, increased fuel consumption). Figure 3 shows the sequence diagram of the agent. Upon receiving the driver complaint, it creates a chain of thought that follows a multi-step diagnostic workflow where the system ingests vehicle data such as engine codes and sensor readings, generates embeddings using the Voyage AI voyage-3-large embedding model, and performs a vector search using MongoDB Atlas to find similar past incidents. Once relevant cases are identified, those–along with selected telemetry data–are passed to OpenAI gpt-4o LLM to generate a final recommendation for the driver (for example, to pull off immediately or to keep driving and schedule regular maintenance). All data, including telemetry, past issues, session logs, agent profiles, and recommendations are stored in MongoDB Atlas, ensuring traceability and the ability to refine diagnostics over time. Additionally, MongoDB Atlas is used as a checkpointer by LangGraph, which defines the agent's workflow. Figure 3. Sequence diagram for a connected fleet advisor agentic workflow. Figure 4 shows the agent in action, from receiving an issue to generating a recommendation. So by leveraging MongoDB’s flexible data model and powerful Vector Search capabilities, we can agentic AI can transform fleet management through predictive maintenance and proactive decision-making. Figure 4. The connected fleet advisor AI agent in action. To set up the use case shown in this article, please visit our GitHub repository . And to learn more about MongoDB’s role in the automotive industry, please visit our manufacturing and automotive webpage . Want to learn more about why MongoDB is the best choice for supporting modern AI applications? Check out our on-demand webinar, “ Comparing PostgreSQL vs. MongoDB: Which is Better for AI Workloads? ” presented by MongoDB Field CTO, Rick Houlihan.

April 4, 2025