Why Vector Quantization Matters for AI Workloads
Key takeaways
As vector embeddings scale into millions, memory usage and query latency surge, leading to inflated costs and poor user experience.
By storing embeddings in reduced-precision formats (int8 or binary), you can dramatically cut memory requirements and speed up retrieval.
Voyage AI's quantization-aware embedding models are specifically tuned to handle compressed vectors without significant loss of accuracy.
MongoDB Atlas streamlines the workflow by handling the creation, storage, and indexing of compressed vectors, enabling easier scaling and management.
MongoDB is built for change, allowing users to effortlessly scale AI workloads as resource demands evolve.
Organizations are now scaling AI applications from proofs of concept to production systems serving millions of users. This shift creates scalability, latency, and resource challenges for mission-critical applications leveraging recommendation engines, semantic search, and retrieval-augmented generation (RAG) systems.
At scale, minor inefficiencies compound and become major bottlenecks, increasing latency, memory usage, and infrastructure costs. This guide explains how vector quantization enables high-performance, cost-effective AI applications at scale.
The challenge: Scaling vector search in production
Let’s start by considering a modern voice assistant platform that combines semantic search with natural language understanding. During development, the system only needs to process a few hundred queries per day, converting speech to text and matching the resulting embeddings against a modest database of responses.
The initial implementation is straightforward: each query generates a 32-bit floating-point embedding vector that's matched against a database of similar vectors using cosine similarity. This approach works smoothly in the prototype phase—response times are quick, memory usage is manageable, and the development team can focus on improving accuracy and adding features.
However, as the platform gains traction and scales to processing thousands of queries per second against millions of document embeddings, the simple approach begins to break down.
Each incoming query now requires loading massive numbers of high-precision floating-point vectors into memory, computing similarity scores across a dramatically larger dataset, and maintaining increasingly complex vector indexes for efficient retrieval.
Without proper optimization, the system struggles as memory usage balloons, query latency increases, and infrastructure costs spiral upward. What started as a responsive, efficient prototype has become a bottleneck production system that struggles to maintain its performance requirements while serving a growing user base.
The key challenges are:
Loading high-precision 32-bit floating-point vectors into memory
Computing similarity scores across massive embedding collections
Maintaining large vector indexes for efficient retrieval
These challenges can lead to critical issues like:
High memory usage as vector databases struggle to keep float32 embeddings in RAM
Increased latency as systems process large volumes of high-precision data
Growing infrastructure costs as organizations scale their vector operations
Reduced query throughput due to computational overhead
AI workloads with tens or hundreds of millions of high-dimensional vectors (e.g., 80M+ documents at 1536 dimensions) face soaring RAM and CPU requirements. Storing float32 embeddings for these workloads can become prohibitively expensive.
Vector quantization: A path to efficient scaling
The obvious question is: How can you maintain the accuracy of your recommendations, semantic matches, and search queries, while drastically cutting down on compute and memory usage and reducing retrieval latency?
Vector quantization is how.
It helps you store embeddings more compactly, reduce retrieval times, and keep costs under control. Vector quantization offers a powerful solution to scalability, latency, and resource utilization challenges by compressing high-dimensional embeddings into compact representations while preserving their essential characteristics. This technique can dramatically reduce memory requirements and accelerate similarity computations without compromising retrieval accuracy.
What is vector quantization?
Vector quantization is a compression technique widely applied in digital signal processing and machine learning. Its core idea is to represent numerical data using fewer bits, reducing storage requirements without entirely sacrificing the data’s informative value.
In the context of AI workloads, quantization commonly involves converting embeddings—originally stored as 32-bit floating-point values—into formats like 8-bit integers. By doing so, you can substantially decrease memory and storage consumption while maintaining a level of precision suitable for similarity search tasks.
An important point to note is that quantization is especially suitable for use cases involving more than 1 million vector embeddings, such as RAG applications, semantic search, or recommendation systems that must keep operational costs under control without compromising retrieval accuracy. For smaller datasets with fewer than 1 million embeddings, the overhead of implementing quantization may outweigh its benefits.
Understanding vector quantization
Vector quantization operates by mapping high-dimensional vectors to a discrete set of prototype vectors or converting them to lower-precision formats. There are three main approaches (the sketch after this list illustrates the scalar and binary cases):
Scalar quantization: Converts individual 32-bit floating-point values to 8-bit integers, reducing memory usage of vector values by 75% while maintaining reasonable precision.
Product quantization: Compresses entire vectors at once by mapping them to a codebook of representative vectors, offering better compression than scalar quantization at the cost of more complex encoding/decoding.
Binary quantization: Transforms vectors into binary (0/1) representations, achieving maximum compression but with more significant information loss.
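To make the scalar and binary cases concrete, here is a minimal NumPy sketch of both ideas applied to a single embedding. The min-max scaling and sign-based binarization shown here are illustrative choices, not the exact algorithms a particular database or library uses internally.

```python
# Illustrative scalar (int8) and binary quantization of one embedding.
import numpy as np

rng = np.random.default_rng(42)
embedding = rng.standard_normal(1536).astype(np.float32)   # one 1536-dim float32 embedding

# Scalar quantization: map each float32 value onto the int8 range [-128, 127]
# using simple min-max scaling.
lo, hi = embedding.min(), embedding.max()
int8_vec = np.round((embedding - lo) * 255.0 / (hi - lo) - 128).astype(np.int8)

# Binary quantization: keep only the sign of each dimension (1 bit per value),
# packed into bytes for compact storage.
binary_vec = np.packbits(embedding > 0)

print(embedding.nbytes)   # 6144 bytes (float32)
print(int8_vec.nbytes)    # 1536 bytes -> 4x smaller (the 75% reduction)
print(binary_vec.nbytes)  # 192 bytes  -> 32x smaller
```

The byte counts printed at the end show the 4× and 32× reductions behind the scalar and binary compression figures discussed in this guide.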
A vector database that applies these compression techniques must effectively manage multiple data structures:
Hierarchical navigable small world (HNSW) graph for navigable search
Full-fidelity vectors (32-bit float embeddings)
Quantized vectors (int8 or binary)
When quantization is defined in the vector index, the system builds quantized vectors and constructs the HNSW graph from these compressed vectors. Both structures are placed in memory for efficient search operations, significantly reducing the RAM footprint compared to storing full-fidelity vectors alone.
The table below illustrates how different quantization mechanisms impact memory usage and disk consumption. This example focuses on HNSW indexes storing 30 GB of original float32 embeddings alongside a 0.1 GB HNSW graph structure. Our RAM usage estimates include a 10% overhead factor (1.1 multiplier) to account for JVM memory requirements with indexes loaded into page cache, reflecting typical production deployment conditions. Actual overhead may vary based on specific configurations.
Here are key attributes to consider based on the table below:
Estimated RAM usage: Combines the HNSW graph size with either full or quantized vectors, plus a small overhead factor (1.1 for index overhead).
Disk usage: Includes storage for full-fidelity vectors, the HNSW graph, and quantized vectors when applicable.
Notice that while enabling quantization increases total disk usage (full-fidelity vectors are still stored in both cases for exact nearest neighbor queries, and for rescoring in the case of binary quantization), it dramatically decreases RAM requirements and speeds up initial retrieval.
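As a rough back-of-the-envelope check, the short Python sketch below follows the estimation method described above (30 GB of float32 vectors, a 0.1 GB HNSW graph, and a 1.1 overhead multiplier). The resulting figures are illustrative approximations, not exact numbers from a real deployment.

```python
# Back-of-the-envelope RAM/disk estimates following the methodology above.
FULL_VECTORS_GB = 30.0   # original float32 embeddings
HNSW_GRAPH_GB = 0.1      # HNSW graph structure
OVERHEAD = 1.1           # ~10% index overhead

def estimate(compression_ratio):
    """RAM holds the HNSW graph plus the (possibly quantized) vectors;
    disk holds the full-fidelity vectors, the graph, and any quantized copy."""
    quantized_gb = FULL_VECTORS_GB / compression_ratio
    ram = (HNSW_GRAPH_GB + quantized_gb) * OVERHEAD
    disk = FULL_VECTORS_GB + HNSW_GRAPH_GB
    if compression_ratio > 1:
        disk += quantized_gb  # quantized copy stored alongside the originals
    return round(ram, 2), round(disk, 2)

print("float32:", estimate(1))    # ~33.11 GB RAM, ~30.1 GB disk
print("int8:   ", estimate(4))    # ~8.36 GB RAM, ~37.6 GB disk
print("binary: ", estimate(32))   # ~1.14 GB RAM, ~31.04 GB disk
```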
MongoDB Atlas Vector Search offers powerful scaling capabilities through its automatic quantization system.
As illustrated in Figure 1 below, MongoDB Atlas supports multiple vector search indexes with varying precision levels: Float32 for maximum accuracy, Scalar Quantized (int8) for balanced performance with 3.75× RAM reduction, and Binary Quantized (1-bit) for maximum speed with 24× RAM reduction.
The quantization variety provided by MongoDB Atlas allows users to optimize their vector search workloads based on specific requirements. For collections exceeding 1M vectors, Atlas automatically applies the appropriate quantization mechanism, with binary quantization particularly effective when combined with Float32 rescoring for final refinement.
Figure 1: MongoDB Atlas Vector Search Architecture with Automatic Quantization
Data flow through embedding generation, storage, and tiered vector indexing with binary rescoring.
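As a concrete illustration, here is a minimal sketch of defining a vector index with automatic quantization from Python. It assumes the PyMongo driver (4.7 or later) and the Atlas Vector Search index definition format, where the quantization option accepts none, scalar, or binary; the connection string, database, collection, field name, and dimensionality are placeholders.

```python
# Minimal sketch: create an Atlas Vector Search index with automatic
# binary quantization enabled via PyMongo.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("<ATLAS_CONNECTION_STRING>")
collection = client["mydb"]["documents"]

index_model = SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",        # field holding the embedding array
                "numDimensions": 1024,      # must match the embedding model
                "similarity": "dotProduct",
                "quantization": "binary",   # or "scalar" for int8
            }
        ]
    },
)
collection.create_search_index(model=index_model)
```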
Binary quantization with rescoring
A particularly effective strategy is to combine binary quantization with a rescoring step using full-fidelity vectors. This approach offers the best of both worlds: extremely fast lookups thanks to binary data formats, plus more precise final rankings from higher-fidelity embeddings.
Initial retrieval (Binary)
Embeddings are stored as binary to minimize memory usage and accelerate the approximate nearest neighbor (ANN) search.
Hamming distance (via XOR + population count) is used, which is computationally faster than Euclidean or cosine similarity on floats.
Rescoring
The top candidate results from the binary pass are re-evaluated using their float or int8 vectors to refine the ranking.
This step mitigates the loss of detail in binary vectors, balancing result accuracy with the speed of the initial retrieval.
By pairing binary vectors for rapid recall with full-fidelity embeddings for final refinement, you can keep your system highly performant and maintain strong relevance.
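The NumPy sketch below illustrates this two-stage pattern end to end: a fast approximate pass over packed binary codes using Hamming distance, followed by rescoring a shortlist with the full-fidelity float vectors. Atlas performs the equivalent steps internally when binary quantization with rescoring is enabled; the random data, shortlist size, and popcount implementation here are purely illustrative.

```python
# Illustrative binary retrieval + float rescoring over random vectors.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((20_000, 1024)).astype(np.float32)  # full-fidelity vectors
docs_bin = np.packbits(docs > 0, axis=1)                       # packed 1-bit codes (128 bytes each)

query = rng.standard_normal(1024).astype(np.float32)
query_bin = np.packbits(query > 0)

# Stage 1: approximate retrieval with Hamming distance (XOR, then count set bits).
# np.unpackbits stands in for a native popcount here for clarity.
hamming = np.unpackbits(docs_bin ^ query_bin, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:200]          # shortlist of 200 candidates

# Stage 2: rescore the shortlist with cosine similarity on the float vectors.
cand_vecs = docs[candidates]
scores = cand_vecs @ query / (
    np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query)
)
top_k = candidates[np.argsort(-scores)[:10]]    # final top-10 document ids
print(top_k)
```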
The need for quantization-aware models
Not all embedding models perform equally well under quantization. Models need to be specifically trained with quantization in mind to maintain their effectiveness when compressed. Some models—especially those trained purely for high-precision scenarios—suffer significant accuracy drops when their embeddings are represented with fewer bits.
Quantization-aware training (QAT) involves:
Simulating quantization effects during the training process
Adjusting model weights to minimize information loss
Ensuring robust performance across different precision levels
This is particularly important for production applications where maintaining high accuracy is crucial. Embedding models like those from Voyage AI (which recently joined MongoDB) are specifically designed with quantization awareness, making them more suitable for scaled deployments.
These models preserve more of their essential feature information even under aggressive compression. Voyage AI provides a suite of embedding models specifically designed with QAT in mind, ensuring minimal loss in semantic quality when shifting to 8-bit integer or even binary representations.
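For example, a sketch along the following lines requests compressed embeddings directly from the Voyage AI Python client. It assumes the voyageai package and the output_dtype and output_dimension options offered by models such as voyage-3-large; consult the Voyage AI documentation for the exact parameter names and supported values.

```python
# Sketch: request int8 embeddings from Voyage AI instead of float32.
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

docs = [
    "MongoDB Atlas supports automatic vector quantization.",
    "Binary quantization pairs well with float rescoring.",
]

result = vo.embed(
    docs,
    model="voyage-3-large",
    input_type="document",
    output_dtype="int8",      # int8 embeddings instead of float32
    output_dimension=1024,    # reduced dimensionality, if supported
)
print(len(result.embeddings), len(result.embeddings[0]))
```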
Figure 2: Embedding model performance comparing retrieval quality (NDCG@10) versus storage costs.
Voyage AI models (green) maintain superior retrieval quality even with binary quantization (triangles) and int8 compression (squares), achieving up to 100x storage efficiency compared to standard float embeddings (circles).
The graph above shows several important patterns that demonstrate why quantization-aware training (QAT) is crucial for maintaining performance under aggressive compression.
The Voyage AI family of models (shown in green) demonstrates strong performance in retrieval quality even under extreme compression. The voyage-3-large model demonstrates this dramatically—when using int8 precision at 1024 dimensions, it performs nearly identically to its float precision, 2048-dimensional counterpart, showing only a minimal 0.31% quality reduction despite using 8 times less storage. This showcases how models specifically designed with quantization in mind can preserve their semantic understanding even under substantial compression.
Even more impressive is how QAT models maintain their edge over larger, uncompressed models. The voyage-3-large model with int8 precision and 1024 dimensions outperforms OpenAI-v3-large (using float precision and 3072 dimensions) by 9.44% while requiring 12 times less storage. This performance gap highlights that raw model size and dimension count aren't the decisive factors; it's the intelligent design for quantization that matters.
The cost implications become truly striking when we examine binary quantization. Using voyage-3-large with 512-dimensional binary embeddings, we still achieve better retrieval quality than OpenAI-v3-large with its full 3072-dimensional float embeddings while using 200 times less storage. To put this in practical terms: what would have cost $20,000 in monthly storage can be reduced to just $100 while actually improving performance.
In contrast, models not specifically trained for quantization, such as OpenAI's v3-small (shown in gray), show a more dramatic drop in retrieval quality as compression increases. While these models perform well in their full floating-point representation (at 1x storage cost), their effectiveness deteriorates more sharply when quantized, especially with binary quantization.
For production applications where both accuracy and efficiency are crucial, choosing a model that has undergone quantization-aware training can make the difference between a system that degrades under compression and one that maintains its effectiveness while dramatically reducing resource requirements.
Read more on the Voyage AI blog.
Impact: Memory, retrieval latency, and cost
Vector quantization addresses the three core challenges of large-scale AI workloads—memory, retrieval latency, and cost—by compressing full-precision embeddings into more compact representations. Below is a breakdown of how quantization drives efficiency in each area.
Figure 3: Quantization Performance Metrics: Memory Savings with Minimal Accuracy Trade-offs
Comparison of scalar vs. binary quantization showing RAM reduction (75%/96%), query accuracy retention (99%/95%), and performance gains (>100%) for vector search operations
Memory and storage optimization
Quantization techniques dramatically reduce compute resource requirements while maintaining search accuracy for vector embeddings at scale.
Lower RAM footprint
Storage in RAM is often the primary bottleneck for vector search systems.
Embeddings stored as 8-bit integers or binary reduce overall memory usage, allowing significantly more vectors to remain in memory.
This compression directly shrinks vector indexes (e.g., HNSW), leading to faster lookups and fewer disk I/O operations.
Reduced disk usage in collections with binData
binData (binary) formats can cut raw storage needs by up to 66%.
Some disk overhead may remain when storing both quantized and original vectors, but the performance benefits justify this tradeoff.
Practical gains
3.75× reduction in RAM usage with scalar (int8) quantization.
Up to 24× reduction with binary quantization, especially when combined with rescoring to preserve accuracy.
Significantly more efficient vector indexes, enabling large-scale deployments without prohibitive hardware upgrades.
Retrieval latency
Quantization methods leverage CPU cache optimizations and efficient distance calculations to accelerate vector search operations beyond what's possible with standard float32 embeddings.
Faster similarity computations
Smaller data types are more CPU-cache-friendly, which speeds up distance calculations.
Binary quantization uses Hamming distance (XOR + popcount), yielding dramatically faster top-k candidate retrieval.
Improved throughput
With reduced memory overhead, the system can handle more concurrent queries at lower latencies.
In internal benchmarks, query performance for large-scale retrievals improved by up to 80% when adopting quantized vectors.
Cost efficiency
Vector quantization provides substantial infrastructure savings by reducing memory and computation requirements while maintaining retrieval quality through compression and rescoring techniques.
Lower infrastructure costs
Smaller vectors consume fewer hardware resources, enabling deployments on less expensive instances or tiers.
Reduced CPU/GPU time per query allows resource reallocation to other critical parts of the application.
Better scalability
As data volumes grow, memory and compute requirements don’t escalate as sharply.
Quantization-aware training (QAT) models, such as those from Voyage AI, help maintain accuracy while reaping cost savings at scale.
By compressing vectors into int8 or binary formats, you tackle memory constraints, accelerate lookups, and curb infrastructure expenses—making vector quantization an indispensable strategy for high-volume AI applications.
MongoDB Atlas: Built for changing workloads with automatic vector quantization
The good news for developers is that MongoDB Atlas supports automatic scalar and automatic binary quantization in index definitions, reducing the need for external scripts or manual data preprocessing. By quantizing at index build time and query time, organizations can run large-scale vector workloads on smaller, more cost-effective clusters.
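To illustrate how little changes on the query side, here is a minimal sketch that embeds a question with Voyage AI and runs a $vectorSearch aggregation against the index defined earlier. The connection string, collection, index, and field names are the same placeholders used in the index sketch above; quantization of the indexed and query vectors is handled on the Atlas side.

```python
# Query-side sketch: embed the question, then run $vectorSearch.
import voyageai
from pymongo import MongoClient

client = MongoClient("<ATLAS_CONNECTION_STRING>")
collection = client["mydb"]["documents"]

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
query_embedding = vo.embed(
    ["How does vector quantization reduce memory usage?"],
    model="voyage-3-large",
    input_type="query",
).embeddings[0]

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_embedding,
            "numCandidates": 200,   # ANN candidates considered before final ranking
            "limit": 10,            # number of results returned
        }
    },
    {"$project": {"_id": 0, "title": 1, "score": {"$meta": "vectorSearchScore"}}},
]
for doc in collection.aggregate(pipeline):
    print(doc)
```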
A common question developers ask is when to use quantization.
Quantization becomes most valuable once you reach substantial data volumes—on the order of a million or more embeddings. At this scale, memory and compute demands can skyrocket, making reduced memory footprints and faster retrieval speeds essential.
Examples of cases that call for quantization include:
High-volume scenarios: Datasets with millions of vector embeddings where you must tightly control memory and disk usage.
Real-time responses: Systems needing low-latency queries under high user concurrency.
High query throughput: Environments with numerous concurrent requests demanding both speed and cost-efficiency.
For smaller datasets (under 1 million vectors), the added complexity of quantization may not justify the benefits. However, for large-scale deployments, it becomes a critical optimization that can dramatically improve both performance and cost-effectiveness.
Now that we have established a strong foundation on the advantages of quantization, specifically the benefits of binary quantization with rescoring, feel free to refer to the MongoDB documentation to learn more about implementing vector quantization. You can also learn more about Voyage AI’s state-of-the-art embedding models on our product page.
February 27, 2025