Henry Weller


Why Vector Quantization Matters for AI Workloads

Key takeaways

- As vector embeddings scale into the millions, memory usage and query latency surge, leading to inflated costs and a poor user experience.
- Storing embeddings in reduced-precision formats (int8 or binary) dramatically cuts memory requirements and speeds up retrieval.
- Voyage AI's quantization-aware embedding models are specifically tuned to handle compressed vectors without significant loss of accuracy.
- MongoDB Atlas streamlines the workflow by handling the creation, storage, and indexing of compressed vectors, making scaling and management easier.
- MongoDB is built for change, allowing users to scale AI workloads as resource demands evolve.

Organizations are now scaling AI applications from proofs of concept to production systems serving millions of users. This shift creates scalability, latency, and resource challenges for mission-critical applications that rely on recommendation engines, semantic search, and retrieval-augmented generation (RAG) systems. At scale, minor inefficiencies compound into major bottlenecks, increasing latency, memory usage, and infrastructure costs. This guide explains how vector quantization enables high-performance, cost-effective AI applications at scale.

The challenge: Scaling vector search in production

Let's start by considering a modern voice assistant platform that combines semantic search with natural language understanding. During development, the system only needs to process a few hundred queries per day, converting speech to text and matching the resulting embeddings against a modest database of responses. The initial implementation is straightforward: each query generates a 32-bit floating-point embedding vector that is matched against a database of similar vectors using cosine similarity. This approach works smoothly in the prototype phase: response times are quick, memory usage is manageable, and the development team can focus on improving accuracy and adding features.

However, as the platform gains traction and scales to processing thousands of queries per second against millions of document embeddings, the simple approach begins to break down. Each incoming query now requires loading massive amounts of high-precision floating-point vectors into memory, computing similarity scores across an exponentially larger dataset, and maintaining increasingly complex vector indexes for efficient retrieval. Without proper optimization, memory usage balloons, query latency climbs, and infrastructure costs spiral upward. What started as a responsive, efficient prototype becomes a bottlenecked production system that struggles to meet its performance requirements while serving a growing user base.

The key challenges are:

- Loading high-precision 32-bit floating-point vectors into memory
- Computing similarity scores across massive embedding collections
- Maintaining large vector indexes for efficient retrieval

These challenges can lead to critical issues such as:

- High memory usage as vector databases struggle to keep float32 embeddings in RAM
- Increased latency as systems process large volumes of high-precision data
- Growing infrastructure costs as organizations scale their vector operations
- Reduced query throughput due to computational overhead

AI workloads with tens or hundreds of millions of high-dimensional vectors (e.g., 80M+ documents at 1536 dimensions) face soaring RAM and CPU requirements, and storing float32 embeddings for these workloads can become prohibitively expensive.
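To make that baseline concrete, here is a minimal NumPy sketch of the brute-force float32 approach described above. The collection size, dimensionality, and random vectors are purely illustrative; the point is that every query touches every stored vector at full precision.

```python
import numpy as np

# Illustrative scale: 100k documents with 1536-dimensional float32 embeddings.
# At the 80M+ documents mentioned above, the matrix alone would be roughly 490 GB.
NUM_DOCS, DIM = 100_000, 1536
doc_embeddings = np.random.rand(NUM_DOCS, DIM).astype(np.float32)
print(f"{doc_embeddings.nbytes / 1024**2:.0f} MiB of float32 embeddings")  # ~586 MiB

def cosine_top_k(query: np.ndarray, docs: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force cosine similarity over full-precision vectors."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query            # one dot product per stored vector
    return np.argsort(-scores)[:k]   # indices of the k most similar documents

query_embedding = np.random.rand(DIM).astype(np.float32)
top_docs = cosine_top_k(query_embedding, doc_embeddings)
```

This cost profile is exactly what stops scaling gracefully once the corpus and query volume grow.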
Vector quantization: A path to efficient scaling

The obvious question is: How can you maintain the accuracy of your recommendations, semantic matches, and search queries while drastically cutting compute and memory usage and reducing retrieval latency? Vector quantization is how. It helps you store embeddings more compactly, reduce retrieval times, and keep costs under control.

Vector quantization offers a powerful solution to scalability, latency, and resource utilization challenges by compressing high-dimensional embeddings into compact representations while preserving their essential characteristics. This technique can dramatically reduce memory requirements and accelerate similarity computations without compromising retrieval accuracy.

What is vector quantization?

Vector quantization is a compression technique widely applied in digital signal processing and machine learning. Its core idea is to represent numerical data using fewer bits, reducing storage requirements without entirely sacrificing the data's informative value. For AI workloads, quantization commonly involves converting embeddings, originally stored as 32-bit floating-point values, into formats like 8-bit integers. Doing so substantially decreases memory and storage consumption while maintaining a level of precision suitable for similarity search tasks.

Quantization is especially well suited to use cases involving more than 1 million vector embeddings, such as RAG applications, semantic search, or recommendation systems that require tight control of operational costs without compromising retrieval accuracy. Smaller datasets with fewer than 1 million embeddings may not see significant gains: at that scale, the overhead of implementing quantization can outweigh its benefits.

Understanding vector quantization

Vector quantization operates by mapping high-dimensional vectors to a discrete set of prototype vectors or converting them to lower-precision formats. There are three main approaches:

- Scalar quantization: Converts individual 32-bit floating-point values to 8-bit integers, reducing memory usage of vector values by 75% while maintaining reasonable precision.
- Product quantization: Compresses entire vectors at once by mapping them to a codebook of representative vectors, offering better compression than scalar quantization at the cost of more complex encoding/decoding.
- Binary quantization: Transforms vectors into binary (0/1) representations, achieving maximum compression but with more significant information loss.

A vector database that applies these compression techniques must effectively manage multiple data structures:

- A hierarchical navigable small world (HNSW) graph for navigable search
- Full-fidelity vectors (32-bit float embeddings)
- Quantized vectors (int8 or binary)

When quantization is defined in the vector index, the system builds quantized vectors and constructs the HNSW graph from these compressed vectors. Both structures are placed in memory for efficient search operations, significantly reducing the RAM footprint compared to storing full-fidelity vectors alone.

The table below illustrates how different quantization mechanisms impact memory usage and disk consumption. This example focuses on HNSW indexes storing 30 GB of original float32 embeddings alongside a 0.1 GB HNSW graph structure.
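Using the example figures above (30 GB of float32 vectors plus a 0.1 GB HNSW graph) and the 1.1 index-overhead multiplier described in the note that follows, a rough back-of-the-envelope estimate of the RAM needed under each option looks like this. This is a simplified sketch, not an official sizing tool.

```python
def estimated_ram_gb(vector_gb: float, graph_gb: float = 0.1, overhead: float = 1.1) -> float:
    """Estimated RAM ~= (vectors held in memory + HNSW graph) * index-overhead factor."""
    return round((vector_gb + graph_gb) * overhead, 2)

full_float32_gb = 30.0
print(estimated_ram_gb(full_float32_gb))        # 33.11 GB with full-fidelity vectors
print(estimated_ram_gb(full_float32_gb / 4))    # 8.36 GB with int8 (scalar) quantization
print(estimated_ram_gb(full_float32_gb / 32))   # 1.14 GB with binary quantization
```

These simplified figures land in the same ballpark as, though not exactly on, the 3.75x and 24x RAM reductions cited later, which also reflect index structures that are not quantized.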
Our RAM usage estimates include a 10% overhead factor (a 1.1 multiplier) to account for JVM memory requirements with indexes loaded into the page cache, reflecting typical production deployment conditions. Actual overhead may vary based on specific configurations. Here are the key attributes to consider in the table:

- Estimated RAM usage: Combines the HNSW graph size with either the full or the quantized vectors, plus a small overhead factor (1.1 for index overhead).
- Disk usage: Includes storage for the full-fidelity vectors, the HNSW graph, and the quantized vectors when applicable.

Notice that while enabling quantization increases total disk usage (you still store full-fidelity vectors, for exact nearest neighbor queries in both cases and for rescoring in the case of binary quantization), it dramatically decreases RAM requirements and speeds up initial retrieval.

MongoDB Atlas Vector Search offers powerful scaling capabilities through its automatic quantization system. As illustrated in Figure 1 below, MongoDB Atlas supports multiple vector search indexes with varying precision levels: float32 for maximum accuracy, scalar quantized (int8) for balanced performance with a 3.75x RAM reduction, and binary quantized (1-bit) for maximum speed with a 24x RAM reduction. This variety lets users optimize their vector search workloads for specific requirements. For collections exceeding 1M vectors, Atlas automatically applies the appropriate quantization mechanism, with binary quantization particularly effective when combined with float32 rescoring for final refinement.

Figure 1: MongoDB Atlas Vector Search architecture with automatic quantization. Data flows through embedding generation, storage, and tiered vector indexing with binary rescoring.

Binary quantization with rescoring

A particularly effective strategy is to combine binary quantization with a rescoring step using full-fidelity vectors. This approach offers the best of both worlds: extremely fast lookups thanks to binary data formats, plus more precise final rankings from higher-fidelity embeddings.

- Initial retrieval (binary): Embeddings are stored as binary to minimize memory usage and accelerate the approximate nearest neighbor (ANN) search. Hamming distance (via XOR plus population count) is used, which is computationally faster than Euclidean or cosine similarity on floats.
- Rescoring: The top candidate results from the binary pass are re-evaluated using their float or int8 vectors to refine the ranking. This step mitigates the loss of detail in binary vectors, balancing result accuracy with the speed of the initial retrieval.

By pairing binary vectors for rapid recall with full-fidelity embeddings for final refinement, you can keep your system highly performant while maintaining strong relevance.
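Here is a minimal NumPy sketch of that two-stage pattern. It is a simplified stand-in for Atlas Vector Search's HNSW-based implementation; the collection size and the 200-candidate shortlist are illustrative.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension: keep the sign of each value, packed into bytes (32x smaller than float32)."""
    return np.packbits(vectors > 0, axis=1)

def hamming_candidates(query_bits: np.ndarray, doc_bits: np.ndarray, n: int) -> np.ndarray:
    """Initial retrieval: XOR + population count, far cheaper than float cosine similarity."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    distances = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per document
    return np.argsort(distances)[:n]

def rescore(query: np.ndarray, docs: np.ndarray, candidates: np.ndarray, k: int) -> np.ndarray:
    """Re-rank the shortlisted candidates with full-fidelity cosine similarity."""
    cand = docs[candidates]
    scores = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-scores)[:k]]

docs = np.random.randn(100_000, 1024).astype(np.float32)
query = np.random.randn(1024).astype(np.float32)

doc_bits, query_bits = binarize(docs), binarize(query[None, :])
shortlist = hamming_candidates(query_bits, doc_bits, n=200)  # cheap binary pass
top_k = rescore(query, docs, shortlist, k=10)                # precise final ranking
```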
The need for quantization-aware models

Not all embedding models perform equally well under quantization. Models need to be trained with quantization in mind to remain effective when compressed. Some models, especially those trained purely for high-precision scenarios, suffer significant accuracy drops when their embeddings are represented with fewer bits. Quantization-aware training (QAT) involves:

- Simulating quantization effects during the training process
- Adjusting model weights to minimize information loss
- Ensuring robust performance across different precision levels

This is particularly important for production applications where maintaining high accuracy is crucial.

Embedding models like those from Voyage AI, which recently joined MongoDB, are specifically designed with quantization awareness, making them more suitable for scaled deployments. These models preserve more of their essential feature information even under aggressive compression. Voyage AI provides a suite of embedding models built with QAT in mind, ensuring minimal loss in semantic quality when shifting to 8-bit integer or even binary representations.

Figure 2: Embedding model performance comparing retrieval quality (NDCG@10) versus storage costs. Voyage AI models (green) maintain superior retrieval quality even with binary quantization (triangles) and int8 compression (squares), achieving up to 100x storage efficiency compared to standard float embeddings (circles).

The graph above shows several important patterns that demonstrate why quantization-aware training is crucial for maintaining performance under aggressive compression. The Voyage AI family of models (shown in green) retains strong retrieval quality even under extreme compression. The voyage-3-large model demonstrates this dramatically: at int8 precision and 1024 dimensions, it performs nearly identically to its float-precision, 2048-dimensional counterpart, showing only a 0.31% quality reduction despite using 8 times less storage. Models specifically designed with quantization in mind can preserve their semantic understanding even under substantial compression.

Even more impressive is how QAT models maintain their edge over larger, uncompressed models. The voyage-3-large model with int8 precision and 1024 dimensions outperforms OpenAI-v3-large (using float precision and 3072 dimensions) by 9.44% while requiring 12 times less storage. This performance gap highlights that raw model size and dimension count aren't the decisive factors; intelligent design for quantization is what matters.

The cost implications become truly striking with binary quantization. Using voyage-3-large with 512-dimensional binary embeddings, retrieval quality is still better than OpenAI-v3-large with its full 3072-dimensional float embeddings while using 200 times less storage. In practical terms, what would have cost $20,000 in monthly storage can be reduced to just $100 while actually improving performance.

In contrast, models not specifically trained for quantization, such as OpenAI's v3-small (shown in gray), show a more dramatic drop in retrieval quality as compression increases. While these models perform well in their full floating-point representation (at 1x storage cost), their effectiveness deteriorates more sharply when quantized, especially with binary quantization. For production applications where both accuracy and efficiency are crucial, choosing a model that has undergone quantization-aware training can make the difference between a system that degrades under compression and one that maintains its effectiveness while dramatically reducing resource requirements. Read more on the Voyage AI blog.

Impact: Memory, retrieval latency, and cost

Vector quantization addresses the three core challenges of large-scale AI workloads (memory, retrieval latency, and cost) by compressing full-precision embeddings into more compact representations. Below is a breakdown of how quantization drives efficiency in each area.

Figure 3: Quantization performance metrics: memory savings with minimal accuracy trade-offs. Comparison of scalar vs. binary quantization showing RAM reduction (75%/96%), query accuracy retention (99%/95%), and performance gains (>100%) for vector search operations.

Memory and storage optimization

Quantization techniques dramatically reduce compute resource requirements while maintaining search accuracy for vector embeddings at scale.

Lower RAM footprint:
- Storage in RAM is often the primary bottleneck for vector search systems. Embeddings stored as 8-bit integers or binary reduce overall memory usage, allowing significantly more vectors to remain in memory.
- This compression directly shrinks vector indexes (e.g., HNSW), leading to faster lookups and fewer disk I/O operations.

Reduced disk usage in collections with binData:
- binData (binary) formats can cut raw storage needs by up to 66%.
- Some disk overhead may remain when storing both quantized and original vectors, but the performance benefits justify this trade-off.

Practical gains:
- 3.75x reduction in RAM usage with scalar (int8) quantization.
- Up to 24x reduction with binary quantization, especially when combined with rescoring to preserve accuracy.
- Significantly more efficient vector indexes, enabling large-scale deployments without prohibitive hardware upgrades.

Retrieval latency

Quantization methods leverage CPU cache optimizations and efficient distance calculations to accelerate vector search operations beyond what's possible with standard float32 embeddings.

Faster similarity computations:
- Smaller data types are more CPU-cache-friendly, which speeds up distance calculations.
- Binary quantization uses Hamming distance (XOR plus popcount), yielding dramatically faster top-k candidate retrieval.

Improved throughput:
- With reduced memory overhead, the system can handle more concurrent queries at lower latencies.
- In internal benchmarks, query performance for large-scale retrievals improved by up to 80% when adopting quantized vectors.

Cost efficiency

Vector quantization provides substantial infrastructure savings by reducing memory and computation requirements while maintaining retrieval quality through compression and rescoring techniques.

Lower infrastructure costs:
- Smaller vectors consume fewer hardware resources, enabling deployments on less expensive instances or tiers.
- Reduced CPU/GPU time per query allows resources to be reallocated to other critical parts of the application.

Better scalability:
- As data volumes grow, memory and compute requirements don't escalate as sharply.
- Quantization-aware training (QAT) models, such as those from Voyage AI, help maintain accuracy while reaping cost savings at scale.

By compressing vectors into int8 or binary formats, you tackle memory constraints, accelerate lookups, and curb infrastructure expenses, making vector quantization an indispensable strategy for high-volume AI applications.

MongoDB Atlas: Built for Changing Workloads with Automatic Vector Quantization

The good news for developers is that MongoDB Atlas supports automatic scalar and automatic binary quantization in index definitions, removing the need for external scripts or manual data preprocessing. By quantizing at index build time and query time, organizations can run large-scale vector workloads on smaller, more cost-effective clusters.

A common question developers ask is when to use quantization. Quantization becomes most valuable once you reach substantial data volumes, on the order of a million or more embeddings. At this scale, memory and compute demands can skyrocket, making reduced memory footprints and faster retrieval speeds essential.
Examples of cases that call for quantization include:

- High-volume scenarios: Datasets with millions of vector embeddings where you must tightly control memory and disk usage.
- Real-time responses: Systems needing low-latency queries under high user concurrency.
- High query throughput: Environments with numerous concurrent requests demanding both speed and cost-efficiency.

For smaller datasets (under 1 million vectors), the added complexity of quantization may not justify the benefits. However, for large-scale deployments, it becomes a critical optimization that can dramatically improve both performance and cost-effectiveness.

Now that we have established a strong foundation on the advantages of quantization, and of binary quantization with rescoring in particular, refer to the MongoDB documentation to learn more about implementing vector quantization. You can also learn more about Voyage AI's state-of-the-art embedding models on our product page.
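As a starting point, here is a minimal sketch of what automatic quantization can look like with the Python driver. It assumes a recent PyMongo (4.7 or later for the vectorSearch index type) and an Atlas cluster; the connection string, database, collection, field names, and query vector are all illustrative placeholders.

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("<ATLAS_CONNECTION_STRING>")   # placeholder URI
collection = client["media"]["articles"]            # illustrative namespace

# Vector Search index with automatic binary quantization (rescoring is applied automatically).
collection.create_search_index(model=SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={"fields": [{
        "type": "vector",
        "path": "embedding",          # document field holding the embedding array
        "numDimensions": 1024,
        "similarity": "cosine",
        "quantization": "binary",     # or "scalar", or "none" for full fidelity
    }]},
))

query_embedding = [0.01] * 1024       # stand-in for a real embedding of the user query

# Approximate nearest-neighbor query served from the quantized index.
results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",
        "path": "embedding",
        "queryVector": query_embedding,
        "numCandidates": 200,         # shortlist size for the ANN pass
        "limit": 10,
    }},
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
])
for doc in results:
    print(doc)
```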

February 27, 2025

Binary Quantization & Rescoring: 96% Less Memory, Faster Search

We are excited to share that several new vector quantization capabilities are now available in public preview in MongoDB Atlas Vector Search: support for binary quantized vector ingestion, automatic scalar quantization, and automatic binary quantization and rescoring. Together with our recently released support for scalar quantized vector ingestion, these capabilities will empower developers to scale semantic search and generative AI applications more cost-effectively. For a primer on vector quantization, check out our previous blog post.

Enhanced developer experience with native quantization in Atlas Vector Search

Effective quantization methods, specifically scalar and binary quantization, can now be applied automatically in Atlas Vector Search. This makes it easier and more cost-effective for developers to use Atlas Vector Search to unlock a wide range of applications, particularly those requiring over a million vectors. With the new "quantization" index definition parameter, developers can choose to use full-fidelity vectors by specifying "none," or they can quantize vector embeddings by specifying the desired quantization type: "scalar" or "binary" (Figure 1). This native quantization capability supports vector embeddings from any model provider as well as MongoDB's BinData float32 vector subtype.

Figure 1: New index definition parameters for specifying automatic quantization type in Atlas Vector Search

Scalar quantization (converting a floating-point value into an integer) is generally used when it's crucial to maintain search accuracy on par with full-precision vectors. Meanwhile, binary quantization (converting a floating-point value into a single bit, 0 or 1) is more suitable for scenarios where storage and memory efficiency are paramount and a slight reduction in search accuracy is acceptable. If you're interested in learning more about this process, check out our documentation.

Binary quantization with rescoring: Balance cost and accuracy

Compared to scalar quantization, binary quantization further reduces memory usage, leading to lower costs and improved scalability, but also to a decline in search accuracy. To mitigate this, when "binary" is chosen in the "quantization" index parameter, Atlas Vector Search incorporates an automatic rescoring step: a subset of the top binary vector search results is re-ranked using their full-precision counterparts, ensuring that the final search results are highly accurate despite the initial vector compression. Empirical evidence demonstrates that incorporating a rescoring step when working with binary quantized vectors can dramatically enhance search accuracy, as shown in Figure 2 below.

Figure 2: Combining binary quantization and rescoring helps retain search accuracy by up to 95%

And as Figure 3 shows, in our tests binary quantization reduced processing memory requirements by 96% while retaining up to 95% search accuracy and improving query performance.

Figure 3: Improvements in Atlas Vector Search with the use of vector quantization

It's worth noting that even though the quantized vectors are used for indexing and search, the full-fidelity vectors are still stored on disk to support rescoring. Furthermore, retaining the full-fidelity vectors enables developers to perform exact vector search for experimental, high-precision use cases, such as evaluating the search accuracy of quantized vectors produced by different embedding model providers, as needed. For more on evaluating the accuracy of quantized vectors, please see our documentation.
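Because the full-fidelity vectors remain available, one simple way to measure what quantization costs in accuracy is to compare ANN results against an exact search over the same data. The sketch below assumes an index named vector_index built with the "quantization" parameter, plus illustrative connection details and query vector; the "exact" option of $vectorSearch runs the query against the full-fidelity vectors.

```python
from pymongo import MongoClient

client = MongoClient("<ATLAS_CONNECTION_STRING>")   # placeholder URI
collection = client["media"]["articles"]            # illustrative namespace
query_embedding = [0.01] * 1024                     # stand-in for a real model embedding

def vector_search_ids(exact: bool, limit: int = 10) -> list:
    """Run $vectorSearch either as ANN over the quantized index or as an exact (ENN) search."""
    stage = {
        "index": "vector_index",
        "path": "embedding",
        "queryVector": query_embedding,
        "limit": limit,
    }
    if exact:
        stage["exact"] = True                 # exact search against the full-fidelity vectors
    else:
        stage["numCandidates"] = 20 * limit   # ANN shortlist over the quantized vectors
    pipeline = [{"$vectorSearch": stage}, {"$project": {"_id": 1}}]
    return [doc["_id"] for doc in collection.aggregate(pipeline)]

ann_ids = vector_search_ids(exact=False)      # quantized search (with rescoring if binary)
exact_ids = vector_search_ids(exact=True)     # ground-truth ranking
recall = len(set(ann_ids) & set(exact_ids)) / len(exact_ids)
print(f"Recall@10 of quantized search vs. exact search: {recall:.0%}")
```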
So how can developers make the most of vector quantization? Here are some example use cases that can be made more efficient and scaled effectively with quantized vectors:

- Massive knowledge bases can be used efficiently and cost-effectively for analysis and insight-oriented use cases, such as content summarization and sentiment analysis. Unstructured data like customer reviews, articles, audio, and videos can be processed and analyzed at a much larger scale, at lower cost and faster speed.
- Using quantized vectors can enhance the performance of retrieval-augmented generation (RAG) applications. The efficient processing supports query performance against large knowledge bases, and the cost-effectiveness advantage enables a more scalable, robust RAG system, which can result in better customer and employee experiences.
- Developers can easily A/B test different embedding models using multiple vectors produced from the same source field during prototyping (a schema sketch follows the resource links below). MongoDB's flexible document model lets developers quickly deploy and compare embedding models' results without rebuilding the index or provisioning an entirely new data model or set of infrastructure.
- The relevance of search results or context for large language models (LLMs) can be improved by incorporating larger volumes of vectors from multiple sources of relevance, such as different source fields (product descriptions, product images, etc.) embedded with the same or different models.

To get started with vector quantization in Atlas Vector Search, see the following developer resources:

- Documentation: Vector Quantization in Atlas Vector Search
- Documentation: How to Measure the Accuracy of Your Query Results
- Tutorial: How to Use Cohere's Quantized Vectors to Build Cost-effective AI Apps With MongoDB
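To illustrate the A/B testing pattern mentioned above, here is a minimal sketch of a document carrying embeddings of the same source field from two candidate models, plus a single vector index covering both fields. Field names, dimensions, the namespace, and the placeholder vectors are illustrative assumptions.

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("<ATLAS_CONNECTION_STRING>")   # placeholder URI
collection = client["catalog"]["products"]          # illustrative namespace

description = "Wireless noise-cancelling headphones with 30-hour battery life."
model_a_embedding = [0.02] * 1024   # stand-in for model A's embedding of `description`
model_b_embedding = [0.03] * 1536   # stand-in for model B's embedding of `description`

# One document carries embeddings of the same source field from both candidate models.
collection.insert_one({
    "description": description,
    "embedding_model_a": model_a_embedding,
    "embedding_model_b": model_b_embedding,
})

# A single Vector Search index can cover both fields, each quantized independently,
# so the two models can be queried and compared side by side.
collection.create_search_index(model=SearchIndexModel(
    name="ab_test_index",
    type="vectorSearch",
    definition={"fields": [
        {"type": "vector", "path": "embedding_model_a",
         "numDimensions": 1024, "similarity": "cosine", "quantization": "scalar"},
        {"type": "vector", "path": "embedding_model_b",
         "numDimensions": 1536, "similarity": "cosine", "quantization": "scalar"},
    ]},
))
```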

December 12, 2024

Vector Quantization: Scale Search & Generative AI Applications

This post is also available in: Deutsch, Français, Español, Português, Italiano, 한국어, 简体中文.

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview:

- Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements.
- Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search.

View the documentation to get started.

We are excited to announce a robust set of vector quantization capabilities in MongoDB Atlas Vector Search. These capabilities will reduce vector sizes while preserving performance, enabling developers to build powerful semantic search and generative AI applications with more scale and at a lower cost. In addition, unlike relational or niche vector databases, MongoDB's flexible document model, coupled with quantized vectors, allows for greater agility in testing and deploying different embedding models quickly and easily. Support for scalar quantized vector ingestion is now generally available and will be followed by several new releases in the coming weeks. Read on to learn how vector quantization works, and visit our documentation to get started!

The challenges of large-scale vector applications

While the use of vectors has opened up a range of new possibilities, such as content summarization and sentiment analysis, natural language chatbots, and image generation, unlocking insights within unstructured data can require storing and searching through billions of vectors, which can quickly become infeasible. Vectors are effectively arrays of floating-point numbers representing unstructured information in a way that computers can understand (datasets can range from a few hundred to billions of such arrays), and as the number of vectors increases, so does the size of the index required to search over them. As a result, large-scale vector-based applications using full-fidelity vectors often have high processing costs and slow query times, hindering their scalability and performance.

Vector quantization for cost-effectiveness, scalability, and performance

Vector quantization, a technique that compresses vectors while preserving their semantic similarity, offers a solution to this challenge. Imagine converting a full-color image into grayscale to reduce storage space on a computer. This involves simplifying each pixel's color information by grouping similar colors into primary color channels, or "quantization bins," and then representing each pixel with a single value from its bin. The binned values are then used to create a new grayscale image that is smaller but retains most of the original detail, as shown in Figure 1.

Figure 1: Illustration of quantizing an RGB image into grayscale

Vector quantization works similarly, shrinking full-fidelity vectors into fewer bits to significantly reduce memory and storage costs without compromising the important details. Maintaining this balance is critical, as search and AI applications need to deliver relevant insights to be useful. Two effective quantization methods are scalar (converting a floating-point value into an integer) and binary (converting a floating-point value into a single bit, 0 or 1).
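As a tiny illustration of the two methods just described, here is a sketch using a simplified min/max scheme for scalar quantization and a sign-based scheme for binary quantization; these are conceptual illustrations, not the exact algorithms used internally.

```python
import numpy as np

values = np.array([0.12, -0.83, 0.41, 0.95, -0.27], dtype=np.float32)

# Scalar quantization: map each float32 value onto the int8 range (4 bytes -> 1 byte per value).
lo, hi = values.min(), values.max()
int8_values = np.round((values - lo) / (hi - lo) * 255 - 128).astype(np.int8)

# Binary quantization: keep only the sign of each value (32 bits -> 1 bit per value).
binary_values = np.packbits(values > 0)

print(int8_values)    # [   8 -128   50  127  -48]
print(binary_values)  # [176]: the five sign bits 10110 packed into a single byte
```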
Current and upcoming quantization capabilities will empower developers to maximize the potential of Atlas Vector Search. The most impactful benefit of vector quantization is increased scalability and cost savings through reduced computing resources and efficient processing of vectors. And when combined with Search Nodes, MongoDB's dedicated infrastructure for independent scalability through workload isolation and memory-optimized infrastructure for semantic search and generative AI workloads, vector quantization can further reduce costs and improve performance, even at the highest volume and scale, to unlock more use cases.

"Cohere is excited to be one of the first partners to support quantized vector ingestion in MongoDB Atlas," said Nils Reimers, VP of AI Search at Cohere. "Embedding models, such as Cohere Embed v3, help enterprises see more accurate search results based on their own data sources. We're looking forward to providing our joint customers with accurate, cost-effective applications for their needs."

In our tests, compared to full-fidelity vectors, BSON-type vectors (MongoDB's JSON-like binary serialization format for efficient document storage) reduced storage size by 66% (from 41 GB to 14 GB). And as shown in Figures 2 and 3, the tests illustrate significant memory reduction (73% to 96% less) and latency improvements using quantized vectors, where scalar quantization preserves recall performance and binary quantization's recall performance is maintained with rescoring: a process of evaluating a small subset of the quantized outputs against full-fidelity vectors to improve the accuracy of the search results.

Figure 2: Significant storage reduction + good recall and latency performance with quantization on different embedding models

Figure 3: Remarkable improvement in recall performance for binary quantization when combined with rescoring

In addition, thanks to the reduced cost advantage, vector quantization facilitates more advanced, multi-vector use cases that would previously have been too computationally taxing or cost-prohibitive to implement. For example, vector quantization can help users:

- Easily A/B test different embedding models using multiple vectors produced from the same source field during prototyping. MongoDB's document model, coupled with quantized vectors, allows for greater agility at lower costs. The flexible document schema lets developers quickly deploy and compare embedding models' results without the need to rebuild the index or provision an entirely new data model or set of infrastructure.
- Further improve the relevance of search results or context for large language models (LLMs) by incorporating vectors from multiple sources of relevance, such as different source fields (product descriptions, product images, etc.) embedded with the same or different models.

How to get started, and what's next

Now, with support for the ingestion of scalar quantized vectors, developers can import and work with quantized vectors from their embedding model providers of choice (such as Cohere, Nomic, Jina, Mixedbread, and others) directly in Atlas Vector Search. Read the documentation and tutorial to get started.
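For example, a pre-quantized int8 embedding returned by a provider can be stored as MongoDB's compact BSON binary vector subtype rather than as an array of doubles. This sketch assumes PyMongo 4.10 or later, which added the BSON vector helpers; the connection string, namespace, and embedding values are placeholders.

```python
from bson.binary import Binary, BinaryVectorDtype
from pymongo import MongoClient

client = MongoClient("<ATLAS_CONNECTION_STRING>")   # placeholder URI
collection = client["catalog"]["products"]          # illustrative namespace

# Pre-quantized int8 embedding from an embedding provider (values are stand-ins).
int8_embedding = [12, -87, 45, 3, -120, 77, 0, 56]

collection.insert_one({
    "description": "Sample product",
    # Store as the compact BSON binary vector subtype rather than an array of doubles.
    "embedding": Binary.from_vector(int8_embedding, BinaryVectorDtype.INT8),
})
```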
And in the coming weeks, additional vector quantization features will equip developers with a comprehensive toolset for building and optimizing applications with quantized vectors:

- Support for ingestion of binary quantized vectors will enable further reduction of storage space, allowing for greater cost savings and giving developers the flexibility to choose the type of quantized vectors that best fits their requirements.
- Automatic quantization and rescoring will provide native capabilities for scalar quantization as well as binary quantization with rescoring in Atlas Vector Search, making it easier for developers to take full advantage of vector quantization within the platform.

With support for quantized vectors in MongoDB Atlas Vector Search, you can build scalable and high-performing semantic search and generative AI applications with flexibility and cost-effectiveness. Check out the documentation and tutorial to get started, and head over to our quick-start guide to begin using Atlas Vector Search today.

October 7, 2024

向量量化:扩展搜索和生成式人工智能应用程序

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview: Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements. Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search. View the documentation to get started. 我们很高兴地宣布 MongoDB Atlas Vector Search 将提供一组强大的向量量化功能。这些功能在保持性能的同时还将减小向量大小,使开发者能够以更大的规模和更低的成本构建强大的语义搜索和生成式人工智能应用程序。此外,与关系型或生态位向量数据库不同,MongoDB 灵活的文档模型与量化向量相结合,可以轻松快捷地测试和部署不同嵌入模型,同时提高灵活性。 对标量量化向量注入的支持现已普遍推出,未来几周还将发布几个新版本。继续阅读以了解向量量化的工作原理, 访问我们的文档即可开始 ! 大规模向量应用程序的挑战 虽然向量的使用开辟了一系列 新的可能性 ,如内容摘要和情感分析、自然语言聊天机器人和图像生成,但要从非结构化数据中获得洞察,可能需要存储和搜索数十亿个向量,这很快就会变得不可行。 向量实际上是浮点数数组,以计算机可以理解的方式表示非结构化信息(从几百到数十亿数组不等),随着向量数量的增加,搜索向量所需的索引大小也随之增加。因此,使用全保真向量的大规模向量应用程序通常具有较高的处理成本,并且查询速度慢,从而影响了其可扩展性和性能。 向量量化可提升成本效益、可扩展性和性能 向量量化是一种可保留语义相似性的向量压缩技术,为这一挑战提供了解决方案。想象一下,将全彩图像转换为灰度图像,就能减少计算机上的存储空间。这需要将相似的颜色归入原色通道或“量化区间”,以简化每个像素的颜色信息,然后用其区间中的单个值来表示每个像素。然后使用已划分区间的值创建新的灰度图像,新图像的尺寸更小,但保留了大部分原始细节,如图 1 所示。 图 1:将 RGB 图像量化为灰度图像的示意图 向量量化的工作原理与此类似,缩小全保真向量的位数可以显著降低内存和存储成本,而不会影响重要细节。保持这种平衡至关重要,因为搜索和 AI 应用程序需要提供相关的洞察才能发挥作用。 有效的量化方法有两种:标量量化(将浮点转换为整数)和二进制量化(将浮点转换为一位 0 或 1)。现有的和即将推出的量化功能将助力开发者充分挖掘 Atlas Vector Search 的潜力。 向量量化最显著的优势是通过减少计算资源和高效处理向量提升了可扩展性并节省了成本。 与搜索节点 (MongoDB 的专用基础架构,可通过工作负载隔离性实现独立可扩展性,针对语义搜索和生成式人工智能工作负载进行了内存优化)相结合时,向量量化可进一步降低成本并提高性能,即使在最大容量和规模下也能解锁更多使用案例。 "Cohere 很高兴成为首批支持 MongoDB Atlas 量化向量注入的合作伙伴之一,”Cohere 人工智能搜索副总裁 Nils Reimers 表示。“像 Cohere Embed v3 这样的嵌入模型可帮助企业根据自己的数据源查看更准确的搜索结果。我们期待为我们的共同客户提供准确、经济实惠的应用程序,以满足他们的需求。” 在我们的测试中,与全保真向量相比, BSON 型向量 (MongoDB 的类 JSON 二进制序列化格式,用于高效文档存储)将存储空间减少了 66%(从 41 GB 减少到 14 GB)。如图 2 和图 3 所示,测试表明,使用量化向量可以显著减少内存(减少 73% 到 96%),延迟也有所改善,其中标量量化保留了召回性能,二进制量化的召回性能通过重新评分来维持(重新评分是根据全保真向量对一小部分量化输出进行评估的过程,可提高搜索结果的准确性)。 图 2:通过不同嵌入模型上的量化,存储空间显著减少,召回和延迟性能良好 图 3:与重新评分相结合时,二进制量化的召回性能显著提高 此外,由于成本方面的优势,向量量化有利于实现更先进的多向量使用案例,这类使用案例由于计算负担太重或成本太高而难以实现。例如,向量量化可以帮助用户: 在原型设计期间,使用从同一源字段生成的多个向量,轻松地对不同嵌入模型进行 A/B 测试。MongoDB 的文档模 型与量化向量相结合,能够以更低的成本实现更高的灵活性。灵活的文档模式支持开发者快速部署和比较嵌入模型的结果,而无需重建索引或预配全新的数据模型或基础架构。 通过合并来自多个相关源的向量,例如嵌入在相同或不同模型中的不同源字段(产品描述、产品图像等),进一步提高大型语言模型 (LLM) 搜索结果或上下文的相关性。 如何开始,以及下一步 现在,凭借对标量量化向量注入的支持,开发者可以直接在 Atlas Vector Search 中导入和使用量化向量,这些量化向量来自他们所选择的嵌入模型提供商(如 Cohere、Nomic、Jina、Mixedbread 等)。阅读 文档 和 教程 即可开始。 未来几周还会推出其他向量量化功能,开发者可借助这套全面的工具集,使用量化向量来构建和优化应用程序: 支持注入二进制量化向量,将进一步减少存储空间,从而节省更多成本,开发者能够灵活选择最符合其要求的量化向量类型。 自动量化和重新评分将为标量量化提供原生功能,以及在 Atlas Vector Search 中通过重新评分进行二进制量化的功能,开发者可以更轻松地充分利用平台中的向量量化功能。 MongoDB Atlas Vector Search 支持量化向量,您可以灵活构建可扩展的高性能语义搜索和生成式人工智能应用程序,并实现成本效益。查看这些资源获取入门 文档 和 教程 。 立即查看我们的 快速入门指南 ,开始使用 Atlas Vector Search。

October 7, 2024

벡터 양자화: 대규모 검색 및 생성형 인공지능 애플리케이션

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview: Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements. Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search. View the documentation to get started. MongoDB Atlas Vector Search 에 강력한 벡터 양자화 기능이 추가되었음을 발표하게 되어 기쁩니다. 이러한 기능은 성능을 유지하면서 벡터 크기를 줄여 개발자가 더 큰 규모와 더 낮은 비용으로 강력한 시맨틱 검색 및 생성형 인공지능 애플리케이션을 구축할 수 있도록 지원합니다. 또한 관계형 데이터베이스나 특정 벡터 데이터베이스와 달리 MongoDB의 유연한 문서 모델과 양자화된 벡터를 결합하면 다양한 임베딩 모델을 더욱 빠르고 쉽게 테스트하고 배포할 수 있습니다. 스칼라 양자화 벡터 수집 지원이 정식으로 제공되며, 향후 몇 주 내에 몇 가지 새로운 릴리스가 이어질 예정입니다. 벡터 양자화의 작동 방식을 알아보려면 계속 읽어보세요. 시작하려면 MongoDB 문서를 참조하세요 ! 대규모 벡터 애플리케이션의 과제 벡터를 사용하면 콘텐츠 요약, 감정 분석, 자연어 챗봇, 이미지 생성과 같은 다양한 새로운 가능성 이 열립니다. 하지만 비정형 데이터에서 인사이트를 도출하려면 수십억 개의 벡터를 저장하고 검색해야 하는 경우가 발생합니다. 이는 곧 큰 어려움에 직면할 수 있습니다. 벡터는 컴퓨터가 이해할 수 있는 방식으로 비정형 정보를 나타내는 부동 소수점 숫자 배열(수백 개에서 수십억 개의 배열)이며, 벡터의 수가 증가함에 따라 이들을 검색하는 데 필요한 인덱스 크기도 증가합니다. 대규모 벡터 기반 애플리케이션에서 고정밀 벡터를 사용하면 처리 비용이 높아지고 쿼리 시간이 느려질 수 있습니다. 이는 확장성과 성능 저하로 이어지는 경우가 많습니다. 비용 효율성, 확장성 및 성능 향상을 위한 벡터 양자화 시맨틱 유사성을 유지하면서 벡터를 압축하는 기술인 벡터 양자화는 이러한 문제에 대한 해결책을 제시합니다. 컴퓨터의 저장 공간을 줄이기 위해 풀컬러 이미지를 흑백 이미지로 변환하는 것을 생각해 보세요. 이 과정에는 유사한 색상을 기본 색상 채널 또는 "양자화 구간"으로 그룹화하여 각 픽셀의 색상 정보를 단순화한 다음 각 픽셀을 해당 구간의 단일 값으로 표현하는 작업이 포함됩니다. 그런 다음 구간 값을 사용하여 크기는 더 작지만 원본 세부 정보의 대부분을 유지하는 새로운 흑백 이미지를 만듭니다(그림 1 참조). 그림 1: RGB 이미지를 흑백으로 양자화하는 예시 벡터 양자화도 마찬가지로 고정밀 벡터를 더 적은 비트로 축소하여 중요한 세부 정보를 손상시키지 않고 메모리 및 스토리지 비용을 크게 절감합니다. 검색 및 AI 애플리케이션이 유용하려면 관련 있는 인사이트를 제공해야 하므로 이러한 균형을 유지하는 것이 매우 중요합니다. 두 가지 효과적인 양자화 방법은 스칼라(부동 소수점을 정수로 변환)와 이진(부동 소수점을 0 또는 1의 단일 비트로 변환)입니다. 현재 및 향후 제공될 양자화 기능을 통해 개발자는 Atlas Vector Search의 잠재력을 최대한 활용할 수 있습니다. 벡터 양자화의 가장 큰 이점은 컴퓨팅 리소스 감소 및 효율적인 벡터 처리를 통해 확장성이 향상되고 비용이 절감된다는 것입니다. MongoDB의 검색 노드 는 워크로드 격리 및 메모리 최적화 인프라를 통해 독립적인 확장성을 제공하는 전용 인프라입니다. 시맨틱 검색과 생성형 인공지능 워크로드에 최적화된 검색 노드와 벡터 양자화를 결합하면 최대 볼륨 및 규모에서도 비용을 더욱 절감하고 성능을 향상시켜 더 많은 사용 사례를 창출할 수 있습니다. Cohere의 AI 검색 담당 VP인 Nils Reimers는 "Cohere는 MongoDB Atlas에서 양자화된 벡터 수집을 지원하는 최초의 파트너 중 하나가 되어 기쁩니다."라고 말했습니다. "Cohere Embed v3와 같은 임베딩 모델은 기업이 자체 데이터 소스를 기반으로 더욱 정확한 검색 결과를 얻을 수 있도록 지원합니다. 양사 고객에게 필요에 맞는 정확하고 비용 효율적인 애플리케이션을 제공할 수 있기를 기대합니다." 테스트에서 고정밀 벡터와 비교했을 때 BSON 유형 벡터 (효율적인 문서 저장을 위한 MongoDB의 JSON 유사 이진 직렬화 형식)는 저장 용량을 66%(41GB에서 14GB로) 줄였습니다. 그림 2와 3에서 볼 수 있듯이, 양자화된 벡터를 사용한 테스트 결과 메모리 사용량이 73%~96% 감소하고 지연 시간이 크게 개선되었습니다. 특히, 스칼라 양자화는 재현율 성능을 유지하며, 이진 양자화는 리스코어링을 통해 재현율 성능을 유지합니다. 리스코어링은 검색 결과의 정확도를 높이기 위해 양자화된 출력의 일부를 고정밀 벡터와 비교하여 평가하는 프로세스입니다. 그림 2: 다양한 임베딩 모델에 양자화를 적용하여 스토리지 사용량을 크게 줄이면서도 우수한 재현율과 지연 시간 성능을 유지 그림 3: 리스코어링과 결합 시 이진 양자화의 재현율 성능이 크게 향상됨 또한 비용 절감 효과 덕분에 벡터 양자화는 이전에는 컴퓨팅 리소스가 너무 많이 필요하거나 비용이 많이 들어 구현하기 어려웠던 고급 다중 벡터 사용 사례를 더 쉽게 구현할 수 있도록 지원합니다. 예를 들어, 벡터 양자화는 사용자가 다음을 수행하는 데 도움이 될 수 있습니다. 프로토타이핑 중 동일한 소스 필드에서 생성된 여러 벡터를 사용하여 다양한 임베딩 모델을 쉽게 A/B 테스트할 수 있습니다. MongoDB 의 문서 모델 과 양자화된 벡터를 결합하면 더 낮은 비용으로 민첩성을 향상시킬 수 있습니다. 유연한 문서 스키마를 통해 개발자는 인덱스를 다시 빌드하거나 완전히 새로운 데이터 모델이나 인프라 세트를 프로비저닝하지 않고도 임베딩 모델 결과를 신속하게 배포하고 비교할 수 있습니다. 다양한 관련성 소스(예: 제품 설명, 제품 이미지 등)에서 추출한 벡터를 동일한 또는 다른 모델에 통합하면 대규모 언어 모델(LLM)의 검색 결과 또는 컨텍스트 관련성을 더욱 향상시킬 수 있습니다. 
시작 방법 및 향후 계획 이제 스칼라 양자화 벡터 수집이 지원되므로 개발자는 원하는 임베딩 모델 제공업체(Cohere, Nomic, Jina, Mixedbread 등)의 양자화된 벡터를 Atlas Vector Search에서 직접 가져와서 사용할 수 있습니다. 시작하 려면 문서 와 튜토리얼을 참조하세요 . 그리고 향후 몇 주 내에 추가 벡터 양자화 기능이 제공되어 개발자는 양자화된 벡터를 사용하여 애플리케이션을 구축하고 최적화하는 데 필요한 포괄적인 툴 세트를 갖추게 될 것입니다. 이진 양자화 벡터 수집 지원을 통해 저장 공간을 더욱 줄일 수 있으므로 비용을 더 절감하고 개발자는 요구 사항에 가장 적합한 유형의 양자화된 벡터를 유연하게 선택할 수 있습니다. Atlas Vector Search는 자동 양자화 및 리스코어링 기능을 통해 스칼라 양자화와 리스코어링을 사용한 이진 양자화를 기본적으로 지원합니다. 이를 통해 개발자는 플랫폼에서 벡터 양자화를 더욱 쉽게 활용할 수 있습니다. MongoDB Atlas Vector Search는 양자화된 벡터를 지원합니다. 이를 통해 확장성이 뛰어나고 비용 효율적인 고성능 시맨틱 검색 및 생성형 AI 애플리케이션을 유연하게 구축할 수 있습니다. 시작하 려면 문서 및 튜토리얼 리소스를 참조하세요 . 지금 바로 Atlas Vector Search 를 시작하려면 빠른 시작 가이드를 확인하세요.

October 7, 2024

Quantização vetorial: pesquisa de escala e aplicativos de IA generativa

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview: Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements. Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search. View the documentation to get started. Estamos muito satisfeitos em anunciar um conjunto robusto de recursos de quantização vetorial no MongoDB Atlas Vector Search . Esses recursos reduzirão o tamanho dos vetores e, ao mesmo tempo, preservarão o desempenho, permitindo que os desenvolvedores criem aplicativos avançados de pesquisa semântica e IA generativa com mais escala - e a um custo menor. Além disso, diferentemente dos bancos de dados vetoriais relacionais ou de nicho, o modelo de documento flexível do MongoDB, associado a vetores quantizados, permite maior agilidade para testar e implementar diferentes modelos de incorporação de forma rápida e fácil. O suporte à ingestão de vetores escalares quantizados já está disponível de forma geral e será seguido por várias novas versões nas próximas semanas. Continue lendo para saber como funciona a quantização de vetores e visite nossa documentação para começar! Os desafios dos aplicativos vetoriais de grande escala Embora o uso de vetores tenha aberto uma série de novas possibilidades , como resumo de conteúdo e análise de sentimentos, chatbots de linguagem natural e geração de imagens, o desbloqueio de insights em dados não estruturados pode exigir o armazenamento e a pesquisa em bilhões de vetores, o que pode se tornar inviável rapidamente. Os vetores são, na verdade, matrizes de números de ponto flutuante que representam informações não estruturadas de uma forma que os computadores possam entender (variando de algumas centenas a bilhões de matrizes) e, à medida que o número de vetores aumenta, também aumenta o tamanho do índice necessário para pesquisá-los. Como resultado, os aplicativos baseados em vetores em grande escala que usam vetores de fidelidade total geralmente têm altos custos de processamento e tempos de consulta lentos, o que prejudica sua escalabilidade e desempenho. Quantização vetorial para redução de custos, escalabilidade e desempenho A quantização de vetores, uma técnica que comprime vetores e, ao mesmo tempo, preserva sua similaridade semântica, oferece uma solução para esse desafio. Considere converter uma imagem totalmente digitalizada em escala de cinza para reduzir o espaço de armazenamento em um computador. Isso envolve a simplificação das informações de cores de cada pixel, agrupando cores semelhantes em canais de cores primárias ou "compartimentos de quantização," e, em seguida, representando cada pixel com um único valor de seu compartimento. Os valores binned são então usados para criar uma nova imagem em escala de cinza com tamanho menor, mas mantendo a maioria dos detalhes originais, conforme mostrado na Figura 1. Imagem 1: Ilustração da quantização de uma imagem GB em escala de cinza A quantização de vetores funciona de forma semelhante, diminuindo os vetores de fidelidade total em menos bits para reduzir significativamente os custos de memória e armazenamento sem comprometer os detalhes importantes. 
Manter esse equilíbrio é fundamental, pois os aplicativos de pesquisa e AI precisam fornecer insights relevantes para serem úteis. Dois métodos eficazes de quantização são o escalar (conversão de um ponto flutuante em um número inteiro) e o binário (conversão de um ponto flutuante em um único bit de 0 ou 1). Os recursos de quantização atuais e futuros capacitarão os desenvolvedores a maximizar o potencial do Atlas Vector Search. O benefício de maior impacto da quantização vetorial é o aumento da escalabilidade e da redução de custos por meio da redução de recursos de computação e do processamento eficiente de vetores. E quando combinada com o Search Nodes - a infraestrutura dedicada do MongoDB para escalabilidade independente por meio do isolamento da carga de trabalho e da infraestrutura otimizada para memória para pesquisa semântica e cargas de trabalho de IA generativas - a quantização vetorial pode reduzir ainda mais os custos e melhorar o desempenho, mesmo no volume e na escala mais altos, para desbloquear mais casos de uso. "A Cohere está satisfeita por ser um dos primeiros parceiros a apoiar a ingestão de vetores quantizados no MongoDB Atlas", disse Nils Reimer, VP de Search da AI da Cohere. “Modelos de incorporação, como o Cohere Embed v3, ajudam as empresas a ver resultados de pesquisa mais precisos com base em suas próprias fontes de dados. Estamos ansiosos para fornecer a nossos clientes em comum aplicativos precisos e econômicos para suas necessidades.” Em nossos testes, em comparação com os vetores de fidelidade total, os vetores do tipo BSON - o formato de serialização binária semelhante ao JSON do MongoDB para armazenamento eficiente de documentos - reduziram o tamanho do armazenamento em 66% (de 41 GB para 14 GB). E, conforme mostrado nas Figuras 2 e 3, os testes ilustram uma redução significativa de memória (73% a 96% menos) e melhorias de latência usando vetores quantizados, em que a quantização escalar preserva o desempenho de recuperação e o desempenho de recuperação da quantização binária é mantido com a restauração - um processo de avaliação de um pequeno subconjunto das saídas quantizadas em relação a vetores de fidelidade total para melhorar a precisão dos resultados da pesquisa. Figura 2: redução significativa do armazenamento + bom desempenho de recuperação e latência com quantização em diferentes modelos de incorporação Figura 3: Melhoria notável no desempenho de recuperação para quantização binária quando combinada com a reescalonamento Além disso, graças à vantagem de custo reduzido, a quantização vetorial facilita casos de uso de vetores múltiplos mais avançados que teriam sido muito computacionalmente taxativos ou proibitivos em termos de custo para serem implementados. Por exemplo, a quantização vetorial pode ajudar os usuários a: Fazer testes A/B facilmente com diferentes modelos de incorporação usando vários vetores produzidos a partir do mesmo campo de origem durante a criação de protótipos. O modelo de documento do MongoDB, juntamente com vetores quantizados, permite maior agilidade a custos mais baixos. O esquema flexível de documento permite que os desenvolvedores implementem e comparem rapidamente os resultados dos modelos incorporados sem a necessidade de reconstruir o índice ou provisionar um modelo de dados totalmente novo ou um conjunto de infraestruturas. 
Melhorar ainda mais a relevância dos resultados de pesquisa ou do contexto para modelos de linguagem grandes (LLMs) incorporando vetores de várias fontes de relevância, como diferentes campos de origem (descrições de produtos, imagens de produtos etc.) incorporados no mesmo modelo ou em modelos diferentes. Como começar e o que vem a seguir Agora, com suporte para a ingestão de vetores quantizados escalares, os desenvolvedores podem importar e trabalhar com vetores quantizados de seus fornecedores de modelos de incorporação de escolha (como Cohere, Nomic, Jina, Mixedbread e outros) — diretamente no Atlas Vector Search. Leia a documentação e o tutorial para começar. E, nas próximas semanas, recursos adicionais de quantização de vetores equiparão os desenvolvedores com um conjunto abrangente de ferramentas para criar e otimizar aplicativos com vetores quantizados: O suporte à ingestão de vetores binários quantizados permitirá reduzir ainda mais o espaço de armazenamento, possibilitando maior economia de custos e oferecendo aos desenvolvedores a flexibilidade de escolher o tipo de vetor quantizado que melhor se adapta às suas necessidades. A quantização e a repontuação automáticas fornecerão recursos nativos para quantização escalar, bem como para quantização binária com repontuação no Atlas Vector Search, facilitando para os desenvolvedores aproveitar ao máximo a quantização vetorial dentro da plataforma. Com suporte para vetores quantizados no MongoDB Atlas Vector Search, você pode construir pesquisa semântica escalável e de alto desempenho e aplicativos de IA generativa com flexibilidade e relação custo-eficiência. Confira estes recursos para começar a documentação e o tutorial . Acesse nosso guia de início rápido para começar a usar o Atlas Vector Search hoje mesmo.

October 7, 2024

Vector Quantization: come scalare applicazioni di ricerca e di AI Generativa

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview: Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements. Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search. View the documentation to get started. Siamo lieti di annunciare una solida serie di funzionalità di quantizzazione vettoriale in MongoDB Atlas Vector Search . Queste funzionalità ridurranno le dimensioni dei vettori preservando le prestazioni, consentendo agli sviluppatori di creare potenti applicazioni di ricerca semantica e AI generativa con maggiore scalabilità e a un costo inferiore. Inoltre, a differenza dei database vettoriali relazionali o di nicchia, il modello di documento flessibile di MongoDB, abbinato a vettori quantizzati, consente una maggiore agilità nel test e nell'implementazione di diversi modelli di incorporamento in modo rapido e semplice. Il supporto per l'inserimento di vettori quantizzati scalari è ora disponibile a livello generale e sarà seguito da diverse nuove release nelle prossime settimane. Continua a leggere per scoprire come funziona la quantizzazione vettoriale e consulta la nostra documentazione per iniziare! Le sfide delle applicazioni vettoriali su larga scala Sebbene l'uso dei vettori abbia aperto una serie di nuove possibilità , come il riepilogo dei contenuti e l'analisi del sentiment, i chatbot in linguaggio naturale e la generazione di immagini, sbloccare insight all'interno di dati non strutturati può richiedere l'archiviazione e la ricerca tra miliardi di vettori, il che può diventare rapidamente irrealizzabile. I vettori sono effettivamente degli array di numeri in virgola mobile che rappresentano informazioni non strutturate in un modo comprensibile ai computer (da poche centinaia a miliardi di array) e, con l'aumentare del numero di vettori, aumenta anche la dimensione dell'indice necessario per effettuare una ricerca su di essi. Di conseguenza, le applicazioni vettoriali su larga scala che utilizzano vettori a piena fedeltà hanno spesso costi di elaborazione elevati e tempi di interrogazione lenti, che ne ostacolano la scalabilità e le prestazioni. Quantizzazione vettoriale per economicità, scalabilità e prestazioni La quantizzazione vettoriale, una tecnica che comprime i vettori preservandone la somiglianza semantica, offre una soluzione a questa sfida. Immagina di convertire un'immagine a colori in scala di grigi per ridurre lo spazio di archiviazione su un computer. Ciò comporta la semplificazione delle informazioni sul colore di ciascun pixel raggruppando colori simili in canali di colore primari o "intervalli di quantizzazione" e quindi rappresentando ogni pixel con un singolo valore dal suo intervallo. I valori degli intervalli vengono poi utilizzati per creare una nuova immagine in scala di grigi con dimensioni più piccole, ma conservando la maggior parte dei dettagli originali, come mostrato nella Figura 1. Figura 1. 
Illustrazione della quantizzazione di un'immagine RGB in scala di grigi La quantizzazione vettoriale funziona in modo simile, riducendo i vettori a piena fedeltà in un minor numero di bit per ridurre significativamente i costi di memoria e archiviazione senza compromettere i dettagli importanti. Mantenere questo equilibrio è fondamentale, in quanto le applicazioni di ricerca e AI devono fornire insight pertinenti per essere utili. Due metodi di quantizzazione efficaci sono quello scalare (conversione di un punto float in un numero intero) e quello binario (conversione di un punto float in un singolo bit di 0 o 1). Le funzionalità di quantizzazione attuali e future consentiranno agli sviluppatori di massimizzare il potenziale di Atlas Vector Search. Il vantaggio più importante della quantizzazione vettoriale è l'aumento della scalabilità e il risparmio sui costi, grazie alla riduzione delle risorse di calcolo e all'elaborazione efficiente dei vettori. E quando viene combinata con Search Nodes , l'infrastruttura dedicata di MongoDB per la scalabilità indipendente attraverso l'isolamento del carico di lavoro e l'infrastruttura ottimizzata per la memoria per la ricerca semantica e i carichi di lavoro dell'AI generativa, la quantizzazione vettoriale può ridurre ulteriormente i costi e migliorare le prestazioni, anche al massimo volume e alla massima scalabilità, per sbloccare più casi d'uso. "Cohere è entusiasta di essere uno dei primi partner a supportare l'inserimento di vettori quantizzati in MongoDB Atlas", ha dichiarato Nils Reimers, VP of AI Search di Cohere. "L'incorporamento di modelli, come Cohere Embed v3, aiuta le aziende a visualizzare risultati di ricerca più accurati in base alle proprie fonti di dati. Non vediamo l'ora di fornire ai nostri clienti comuni applicazioni accurate e convenienti per le loro esigenze." Nei nostri test, rispetto ai vettori a piena fedeltà, i vettori di tipo BSON , il formato di serializzazione binaria simile a JSON di MongoDB per un'archiviazione efficiente dei documenti, hanno ridotto le dimensioni di archiviazione del 66% (da 41 GB a 14 GB). E come mostrato nelle Figure 2 e 3, i test illustrano una significativa riduzione della memoria (dal 73% al 96% in meno) e miglioramenti della latency utilizzando vettori quantizzati, dove la quantizzazione scalare preserva le prestazioni di richiamo e le prestazioni di richiamo della quantizzazione binaria vengono mantenute con il rescoring, un processo di valutazione di un piccolo sottoinsieme degli output quantizzati rispetto a vettori a piena fedeltà per migliorare l'accuratezza dei risultati della ricerca. Figura 2: Riduzione significativa dello spazio di archiviazione + buone prestazioni di richiamo e latency con quantizzazione su diversi modelli di incorporamento Figura 3: Notevole miglioramento delle prestazioni di richiamo per la quantizzazione binaria quando combinata con il rescoring Inoltre, grazie al vantaggio del costo ridotto, la quantizzazione vettoriale facilita casi d'uso più avanzati, a vettore multiplo, che sarebbero stati troppo onerosi dal punto di vista computazionale o proibitivi da implementare. Ad esempio, la quantizzazione vettoriale può aiutare gli utenti a: Eseguire facilmente A/B test di diversi modelli di incorporamento utilizzando più vettori prodotti dallo stesso campo sorgente durante la prototipazione. Il modello di documento di MongoDB, abbinato a vettori quantizzati, consente una maggiore agilità a costi inferiori. 
Lo schema flessibile del documento consente agli sviluppatori di distribuire e confrontare rapidamente i risultati dei modelli di incorporamento senza la necessità di ricostruire l'indice o di effettuare il provisioning di un modello di dati o di un set di infrastrutture completamente nuovo. Migliorare ulteriormente la pertinenza dei risultati di ricerca o del contesto per modelli linguistici di grandi dimensioni (LLM) incorporando vettori da più fonti di pertinenza, come diversi campi sorgente (descrizioni di prodotti, immagini di prodotti, ecc.) incorporati nello stesso modello o in modelli diversi. Come iniziare e cosa succede dopo Ora, con il supporto per l'inserimento di vettori quantizzati scalari, gli sviluppatori possono importare e lavorare con vettori quantizzati dai loro fornitori di modelli di incorporamento preferiti (come Cohere, Nomic, Jina, Mixedbread e altri), direttamente in Atlas Vector Search. Per iniziare, leggi la documentazione e il tutorial . E nelle prossime settimane, ulteriori funzionalità di quantizzazione vettoriale forniranno agli sviluppatori un set di strumenti completo per la creazione e l'ottimizzazione di applicazioni con vettori quantizzati: Il supporto per l'inserimento di vettori quantizzati binari consentirà un'ulteriore riduzione dello spazio di archiviazione, consentendo maggiori risparmi sui costi e offrendo agli sviluppatori la flessibilità di scegliere il tipo di vettori quantizzati più adatto alle loro esigenze. La quantizzazione e il rescoring automatici forniranno funzionalità native per la quantizzazione scalare e la quantizzazione binaria con rescoring in Atlas Vector Search, rendendo più facile per gli sviluppatori sfruttare appieno la quantizzazione vettoriale all'interno della piattaforma. Con il supporto per i vettori quantizzati in MongoDB Atlas Vector Search, puoi creare applicazioni di ricerca semantica e di AI generativa scalabili e ad alte prestazioni con flessibilità ed economicità. Consulta queste risorse per ottenere documentazione e tutorial introduttivi. Consulta la nostra guida rapida per iniziare con Atlas Vector Search oggi stesso.

October 7, 2024

Quantification vectorielle : recherche d’évolutivité et applications d’IA générative

Update 12/12/2024: The upcoming vector quantization capabilities mentioned at the end of this blog post are now available in public preview: Support for ingestion and indexing of binary (int1) quantized vectors: gives developers the flexibility to choose and ingest the type of quantized vectors that best fits their requirements. Automatic quantization and rescoring: provides a native mechanism for scalar quantization and binary quantization with rescoring, making it easier for developers to implement vector quantization entirely within Atlas Vector Search. View the documentation to get started. Nous sommes ravis d’annoncer le lancement d’un grand nombre de fonctionnalités avancées de quantification vectorielle dans MongoDB Atlas Vector Search . Elles réduiront la taille des vecteurs tout en préservant les performances. Les développeurs pourront donc créer de puissantes applications de recherche sémantique et d’IA générative à plus grande échelle et à moindre coût. De plus, contrairement aux bases de données vectorielles relationnelles ou de niche, le document model flexible de MongoDB, associé aux vecteurs quantifiés, permet de réaliser des tests plus agiles et de faciliter le déploiement de différents modèles d’intégration. La prise en charge de l’ingestion de vecteurs quantifiés scalaires est désormais disponible. D’autres nouveautés seront annoncées dans les semaines à venir. Poursuivez votre lecture pour découvrir le fonctionnement de la quantification vectorielle et consultez notre documentation pour commencer ! Les défis des applications vectorielles à grande échelle Bien que l’utilisation de vecteurs ait donné lieu à de nombreuses possibilités , telles que la synthèse de contenu et l’analyse des sentiments, les chatbots en langage naturel et la génération d’images, l’exploitation de données non structurées peut nécessiter le stockage et la recherche dans des milliards de vecteurs, ce qui devient une tâche difficile. Les vecteurs sont en fait des tableaux de nombres à virgule flottante. Ils représentent des informations non structurées compréhensibles par les ordinateurs (de quelques centaines à des milliards de tableaux). Plus leur nombre augmente, plus la taille de l’index nécessaire pour effectuer une recherche sur ces vecteurs s’accroît. Par conséquent, les applications vectorielles à grande échelle qui reposent sur des vecteurs de haute fidélité ont souvent des coûts de traitement élevés et des temps de requête lents, ce qui entrave leur évolutivité et leurs performances. Quantification vectorielle pour maximiser la rentabilité, l’évolutivité et les performances La quantification vectorielle, une technique qui permet de compresser les vecteurs tout en préservant leur similarité sémantique, permet de résoudre cette problématique. Imaginez convertir une image en couleurs en niveaux de gris pour réduire l’espace de stockage sur un ordinateur. Cette opération implique de simplifier les informations sur les couleurs de chaque pixel en regroupant celles similaires dans des canaux de couleurs primaires ou des « bacs de quantification », puis de représenter chaque pixel par une seule valeur de son bac. Les valeurs compartimentées sont ensuite utilisées pour créer une nouvelle image en niveaux de gris de plus petite taille tout en conservant la plupart des détails d’origine (voir figure 1). Figure 1 . illustration de la quantification d’une image RGB en niveaux de gris La quantification vectorielle fonctionne de la même manière. 
Vector quantization works in a similar way: it shrinks full-fidelity vectors down to fewer bits to significantly reduce memory and storage costs while retaining the essential information. Maintaining this balance is critical, because search and AI applications must deliver relevant insights to be useful. The two most effective methods are scalar quantization (converting a float into an integer) and binary quantization (converting a float into a single bit, 0 or 1); a short sketch of both methods appears after the figures below. Current and upcoming quantization capabilities will empower developers to get the most out of Atlas Vector Search.

The most impactful benefit of vector quantization is increased scalability and cost savings through reduced computing resources and efficient processing of vectors. And when combined with Search Nodes, MongoDB's dedicated infrastructure for independent scalability through workload isolation and memory-optimized infrastructure for semantic search and generative AI workloads, vector quantization can further reduce costs and improve performance, even at the highest volume and scale, unlocking more use cases for developers.

"Cohere is excited to be one of the first partners to support quantized vector ingestion in MongoDB Atlas," said Nils Reimers, VP of AI Search at Cohere. "Embedding models such as Cohere Embed v3 help enterprises see more accurate search results based on their own data sources. We look forward to providing our joint customers with accurate, cost-effective applications for their needs."

In our tests, compared with full-fidelity vectors, BSON-type vectors (MongoDB's JSON-like binary serialization format for efficient document storage) reduced storage size by 66% (from 41 GB to 14 GB). And as shown in Figures 2 and 3, the tests show a significant reduction in memory (73% to 96% less) and latency improvements using quantized vectors, where scalar quantization preserves recall performance and the recall performance of binary quantization is maintained with rescoring, a process of evaluating a small subset of the quantized outputs against full-fidelity vectors to improve the accuracy of the search results.

Figure 2. Significant storage reduction and good recall and latency performance with quantization across different embedding models

Figure 3. Notable improvement in recall performance for binary quantization when combined with rescoring
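To make the two methods concrete, below is a minimal sketch of scalar and binary quantization applied to a batch of embeddings, assuming NumPy. It is illustrative only: the exact scaling scheme used by Atlas Vector Search or by quantization-aware embedding providers may differ (for example, per-dimension rather than global min/max ranges).

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map float32 embeddings to int8 by linearly rescaling the observed value range."""
    v_min, v_max = vectors.min(), vectors.max()
    scale = (v_max - v_min) / 255.0              # 256 representable int8 levels
    quantized = np.round((vectors - v_min) / scale) - 128
    return quantized.astype(np.int8), float(v_min), float(scale)

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Map each float to a single bit (1 if positive, else 0) and pack 8 bits per byte."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

# Example: 1,000 synthetic 1024-dimensional float32 embeddings
embeddings = np.random.default_rng(0).normal(size=(1000, 1024)).astype(np.float32)

int8_vectors, v_min, scale = scalar_quantize(embeddings)
bit_vectors = binary_quantize(embeddings)

print(embeddings.nbytes, int8_vectors.nbytes, bit_vectors.nbytes)
# float32 -> int8 is ~4x smaller; float32 -> packed bits is ~32x smaller
```

Going from float32 to int8 cuts per-vector storage roughly 4x, and packing one bit per dimension cuts it roughly 32x, which is consistent with the memory reductions reported above.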
In addition, thanks to its reduced cost, vector quantization facilitates more advanced, multi-vector use cases that would previously have been too computationally taxing or cost-prohibitive to implement. For example, vector quantization can help users easily A/B test different embedding models by using multiple vectors produced from the same source field during prototyping. MongoDB's document model, coupled with quantized vectors, allows for greater agility at lower cost: the flexible document schema lets developers quickly deploy and compare embedding model results without needing to rebuild the index or provision an entirely new data model or set of infrastructure. It also helps further improve the relevance of search results, or of the context provided to large language models (LLMs), by incorporating vectors from multiple relevant sources, such as different source fields (product descriptions, product images, and so on) embedded with the same or different models.

How to get started, and what's next

Now, with support for the ingestion of scalar quantized vectors, developers can import and work with quantized vectors from their embedding model providers of choice (such as Cohere, Nomic, Jina, Mixedbread, and others) directly in Atlas Vector Search; a brief ingestion sketch follows at the end of this post. Read the documentation and watch the tutorial to get started. In the coming weeks, additional vector quantization features will equip developers with a comprehensive toolset for building and optimizing applications with quantized vectors: support for the ingestion of binary quantized vectors will enable further reduction of storage space, allowing for greater cost savings and giving developers the flexibility to choose the type of quantized vectors that best fits their requirements, while automatic quantization and rescoring will provide native capabilities for scalar quantization, as well as binary quantization with rescoring, in Atlas Vector Search, making it easier for developers to take full advantage of vector quantization within the platform.

With support for quantized vectors in MongoDB Atlas Vector Search, you can build scalable, high-performance, flexible, and cost-effective semantic search and generative AI applications. Check out the documentation and the tutorial to get started, and head over to our quick-start guide to start using Atlas Vector Search today.
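To make the ingestion step concrete, here is a hedged sketch of storing a provider-supplied scalar (int8) quantized embedding as a BSON vector with PyMongo. It assumes a recent PyMongo version that includes bson.binary.Binary.from_vector and BinaryVectorDtype; the connection string, namespace, and field names are placeholders rather than prescribed values.

```python
from pymongo import MongoClient
from bson.binary import Binary, BinaryVectorDtype

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["store"]["products"]                            # placeholder namespace

# int8 values as returned by an embedding provider's scalar-quantized output
# (truncated here for illustration; real vectors have hundreds of dimensions)
int8_embedding = [12, -37, 101, -8, 64, -120, 7, 33]

collection.insert_one(
    {
        "description": "Wireless noise-cancelling headphones",
        # Stored as BSON binData so Atlas Vector Search can index the quantized vector
        "embedding": Binary.from_vector(int8_embedding, BinaryVectorDtype.INT8),
    }
)
```

A corresponding vector search index on the embedding field would then declare the field's dimensionality and similarity function, as described in the documentation.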

October 7, 2024

Exact Nearest Neighbor Vector Search for Precise Retrieval

With its ability to efficiently handle high-dimensional, unstructured data, vector search delivers relevant results even when users don't know what they're looking for, and it uses machine learning models to find similar results across any data type. Rapidly emerging as a key technology for modern applications, vector search empowers developers to build next-generation search and generative AI applications faster and more easily. MongoDB Atlas Vector Search goes beyond approximate nearest neighbor (ANN) methods with the introduction of exact nearest neighbor (ENN) vector search. This capability guarantees retrieval of the absolute closest vectors to your query, eliminating the accuracy limitations inherent in ANN. In sum, ENN vector search can help you unlock a new level of precision for your search and generative AI applications, improving benchmarking and letting you move to production faster.

When exact nearest neighbor (ENN) vector search benefits developers

While ANN shines when searching across large datasets, ENN vector search offers advantages in specific scenarios:

Small-scale vector data: For datasets under 10,000 vectors, the linear time complexity of ENN vector search makes it a viable option, especially considering the added development complexity of tuning ANN parameters.

Recall benchmarking of ANN queries: ANN queries are fast, particularly as the scale of your indexed vectors increases, but it may not be easy to know whether the documents retrieved by vector relevance correspond to the guaranteed closest vectors in your index. Using ENN can provide that exact result set for comparison with your approximate result set, using Jaccard similarity or other rank-aware recall metrics (a short sketch follows this section). This gives you much greater confidence that your ANN queries are accurate, since you can build quantitative benchmarks as your data evolves.

Multi-tenant architectures: Imagine a scenario with millions of vectors categorized by tenant. You might search for the closest vectors within a specific tenant (identified by a tenant ID). In cases where the overall vector collection is large (in the millions) but the number of vectors per tenant is small (a few thousand), ANN's accuracy suffers when applying highly selective filters. ENN vector search thrives in this multi-tenant scenario, delivering precise results even with small result sets.

Example use cases

A small dataset allows for exhaustive search within a reasonable timeframe, making the exact nearest neighbor approach a viable option for finding the most similar data points and improving accuracy confidence in a number of use cases, such as:

Multi-tenant data service: You might be building a business providing an agentic service that understands your customers' data and takes actions on their behalf. When retrieving relevant proprietary data for that agent, it is critical that the right metadata filter be applied and that ENN be executed so that only the documents corresponding to the appropriate tenant IDs are retrieved.

Proof of concept development: For instance, a new recommendation engine might have a limited library compared to established ones. Here, ENN vector search can be used to recommend products to a small set of early adopters. Since the data is limited, an exhaustive search becomes practical, ensuring the user gets the most relevant recommendations from the available options.
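As a concrete illustration of the recall-benchmarking idea above, the sketch below compares the document IDs returned by an ANN query with the exact set returned by ENN, using Jaccard similarity. The result lists are placeholders; in practice they would come from running the same query twice against Atlas Vector Search, once approximately and once exactly.

```python
def jaccard_similarity(ann_ids: list, enn_ids: list) -> float:
    """Overlap between approximate and exact result sets (1.0 means perfect recall)."""
    ann, enn = set(ann_ids), set(enn_ids)
    if not ann and not enn:
        return 1.0
    return len(ann & enn) / len(ann | enn)

# Placeholder result sets; in practice these come from running the same query
# once with ANN (numCandidates tuned) and once with ENN (exact nearest neighbor).
ann_results = ["doc_1", "doc_2", "doc_4", "doc_7", "doc_9"]
enn_results = ["doc_1", "doc_2", "doc_3", "doc_7", "doc_9"]

print(f"Jaccard similarity: {jaccard_similarity(ann_results, enn_results):.2f}")  # 0.67
```

Tracking this score over time, or across numCandidates settings, turns ANN accuracy into a quantitative benchmark rather than a guess.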
How ENN vector search works on MongoDB Atlas

The ENN vector search feature in Atlas integrates seamlessly with the existing $vectorSearch stage within your Atlas aggregation pipelines. Its key characteristics include:

Guaranteed accuracy: Unlike ANN, ENN always returns the closest vectors to your query, adhering to the specified limit.

Eventual consistency: Like approximate vector search, ENN vector search follows an eventual consistency model.

Simplified configuration: Unlike approximate vector search, where tuning numCandidates is crucial, ENN vector search only requires specifying the desired limit of returned vectors.

Scalable recall evaluation: Atlas allows querying a large number of indexed vectors, facilitating the calculation of comprehensive recall sets for effective evaluation.

Fast query execution: ENN vector search can maintain sub-second latency for unfiltered queries over up to 10,000 documents. It can also provide low-latency responses for highly selective filters that narrow a broad set of documents down to 10,000 documents or fewer, ordered by vector relevance.

Build more with ENN vector search

ENN vector search can be a powerful tool when building a proof of concept for retrieval-augmented generation (RAG), semantic search, or recommendation systems powered by vector search. It simplifies the developer experience by minimizing overhead complexity and latency while giving you the flexibility to implement and benchmark precise retrieval; a minimal pipeline sketch follows at the end of this post. To explore more use cases and build applications faster, start experimenting with ENN vector search. Head over to our quick-start guide to get started with Atlas Vector Search today.
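For reference, here is a minimal sketch of the ENN configuration described above, run as an Atlas aggregation pipeline with PyMongo. The connection string, index name, field names, filter, and query vector are placeholders, and the filter assumes tenant_id is declared as a filter field in the vector search index definition.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["store"]["products"]                            # placeholder namespace

query_embedding = [0.01, -0.12, 0.33]  # truncated placeholder; use your model's output

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",          # placeholder index name
            "path": "embedding",              # placeholder vector field
            "queryVector": query_embedding,
            "exact": True,                    # ENN: guaranteed nearest neighbors, no numCandidates
            "limit": 10,
            "filter": {"tenant_id": "tenant_42"},  # optional selective pre-filter
        }
    },
    {"$project": {"description": 1, "score": {"$meta": "vectorSearchScore"}}},
]

for doc in collection.aggregate(pipeline):
    print(doc)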

June 20, 2024