Peek at your MongoDB Clusters like a Pro with Keyhole: Part 2
This is the second installment of Peek at your MongoDB Clusters like a Pro with Keyhole. In Part 1, we covered how to use Keyhole to collect cluster information, generate HTML reports, and review whether the provisioned resources are adequate to support the application. In Part 2, we'll discuss performance evaluations based on FTDC metrics and mongo logs. FTDC, short for Full Time Diagnostic Data Capture, is MongoDB's internal diagnostic data stored in a proprietary format; MongoDB records this diagnostic information every second.
In Part 2 of this blog post, you will learn how to:
- Integrate Keyhole with Grafana
- Understand Cluster Performance & Bottlenecks from FTDC data
- Identify Slow Operations from Logs
With the information Keyhole reveals from MongoDB logs and FTDC data, you should be able to identify performance bottlenecks 95% of the time. The complete code of Keyhole is available from this GitHub repository.
Integrate Keyhole with Grafana
MongoDB servers store FTDC data every second. This data includes the mongo server status and many hardware resource metrics. To display FTDC metrics visually, Keyhole reads the FTDC data and acts as a SimpleJson datasource behind a Grafana UI. Grafana is an open source analytics and monitoring solution for databases.
Install Grafana and SimpleJson Plugin
Follow Grafana's installation documentation to install Grafana on your favorite OS. For macOS users, after installing Grafana, start it with the command:
brew services start grafana

Continue configuring Grafana by navigating to http://localhost:3000 in a browser and installing the grafana-simple-json-datasource plugin. For example, on macOS you can use the commands below to install the plugin and restart the Grafana service:
grafana-cli plugins install grafana-simple-json-datasource
brew services restart grafana

Add Keyhole as Default Datasource
The next step is to add Keyhole as the default datasource. Locate the Data Sources page under Configuration, and configure the SimpleJson datasource as shown below:
You can ignore the HTTP Error Bad Gateway message for now; you see it because Keyhole has not been started yet.
Import Keyhole FTDC Analytics Template
Download the MongoDB FTDC Analytics template from GitHub. On the Dashboard page, click the New dashboard icon, then import the downloaded file. Use the exact parameters shown below; these values must match exactly so that Grafana can locate the correct datasource.
Click the Import button to complete the configuration. You should be redirected to a MongoDB FTDC Analytics dashboard with a number of blank metrics panels.
Understand Cluster Performance & Bottlenecks from FTDC data
The FTDC data files are kept in a directory called diagnostic.data under your mongo database path. Copy the entire diagnostic.data directory (or just a few of the files in it) to the computer where Grafana and Keyhole are installed. Then, start Keyhole as follows:
keyhole --web --diag ./diagnostic.data/

2019/11/25 14:10:43 reading 1 files with 300 second(s) interval
2019/11/25 14:10:43 metrics.2017-10-12T20-08-53Z-00000 blocks: 164 , time: 245.534381ms Memory Alloc = 58 MiB, TotalAlloc = 101 MiB
2019/11/25 14:10:43 1 files loaded, time spent: 245.718665ms
2019/11/25 14:10:43 Stats from 2017-10-12T20:08:54Z to 2017-10-13T04:29:23Z
2019/11/25 14:10:43 host-0 xxx-00:27018
...
2019/11/25 14:10:43 data points ready for xxx-00:27018 , time spent: 1.244537ms
…
http://localhost:3000/d/simagix-grafana/mongodb-mongo-ftdc?orgId=1&from=1507838934000&to=1507868963000

After Keyhole finishes reading all of the FTDC data files, a URL is printed at the end of the console output. Open the link in a browser to see the previously blank metrics panels filled with charts, for example:
Although these charts present only a few of the MongoDB FTDC metrics, they are enough for me to diagnose the health of a MongoDB cluster. My evaluation steps are outlined below:
WiredTiger Tickets
First, check whether the WiredTiger tickets dropped to zero at any point in time. In the WiredTiger storage engine, read and write tickets are used to control concurrency. By default, there are 128 read tickets and 128 write tickets. Below is an example of a cluster running out of available read tickets at many points in time:
If the server ran out of read tickets, check the MongoDB logs for long-running database operations. (We'll discuss how to identify slow operations later in this blog.) Resolving read-ticket exhaustion can be as simple as adding indexes to improve query performance and release read tickets more quickly. If all queries are supported by proper indexes but the problem stems from a high rate of read operations, consider adding more shards to the MongoDB cluster.
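To confirm ticket exhaustion on a live node, and to relieve pressure with a supporting index, something along the following lines can help. This is a minimal sketch: it assumes a mongod reachable on the default port, and the database, collection, and index fields are hypothetical.
# Check available WiredTiger read/write tickets
mongo --quiet --eval 'printjson(db.serverStatus().wiredTiger.concurrentTransactions)'
# Add an index (hypothetical database/collection/fields) that supports the slow
# query pattern so read tickets are released sooner
mongo mydb --quiet --eval 'db.orders.createIndex({ customerId: 1, orderDate: -1 })'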
If the server ran out of write tickets, the problem is likely related to disk performance: high disk latency or under-provisioned IOPS. Cross-reference the Disk IOPS and Disk Utilization (%) panels to see whether the disk reached its maximum provisioned IOPS.
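To cross-check disk behavior outside of FTDC, a quick iostat run on the database host shows per-device latency and utilization. A minimal sketch, assuming a Linux host with the sysstat package installed:
# Extended device statistics every 5 seconds; watch await (latency) and %util
iostat -x 5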
A growing number of queued operations in the Queues panel mirrors the WiredTiger Tickets panel: when WiredTiger runs out of tickets, subsequent requests are queued, for example:
Disk IOPS
The lines in the Disk IOPS panel normally fluctuate sharply, so you will likely see seesaw-shaped charts. However, if the IOPS of a disk device flattens into a plateau, it is very likely that the device has reached its maximum provisioned IOPS. The solution is to increase disk IOPS. If you still use spinning disks, consider upgrading to SSD or NVMe for better performance; MongoDB performs well, with a good price-performance ratio, on SATA SSDs.
The WiredTiger Tickets and Disk IOPS panels are the first two I review. The other panels can reveal different bottlenecks and/or potential problems, and different cases produce very different chart patterns.
WiredTiger Cache (GB)
The wt_cache_dirty metric, shown in the chart below, indicates data in the cache that has been modified but not yet flushed to disk. A growing amount of dirty data usually implies that the rate of writes to the database overwhelms the provisioned disk IOPS.
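To confirm cache pressure on a live node, the same numbers are available from serverStatus. A minimal sketch, assuming a mongod on the default port; the quoted field names are those reported in the WiredTiger cache section:
# Compare total cache usage against the tracked dirty bytes
mongo --quiet --eval '
var c = db.serverStatus().wiredTiger.cache;
print("cache used (GB):  " + (c["bytes currently in the cache"] / 1e9).toFixed(2));
print("cache dirty (GB): " + (c["tracked dirty bytes in the cache"] / 1e9).toFixed(2));
'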
Metrics
A high number of scan_objects with a low number of scan_keys is typically an indication of a missing index or of inefficient indexes being used by the mongo query engine. A growing number of scan_sort implies that the query engine didn't use an index key to sort; instead, it had to load the documents into memory before sorting them. In either case, we can identify the offending operations from the mongo logs - see the discussion below in Identifying Slow Operations from Logs. The chart below is an example of the Metrics panel for a cluster with a high rate of read operations.
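A quick way to see the same imbalance for a specific query pattern is an executionStats explain, which reports keys examined versus documents examined. A minimal sketch; the database, collection, filter, and sort fields are hypothetical:
# totalDocsExamined far above totalKeysExamined suggests a missing or inefficient index
mongo mydb --quiet --eval '
var r = db.orders.find({ status: "A" }).sort({ orderDate: -1 })
          .explain("executionStats").executionStats;
print("keys examined: " + r.totalKeysExamined);
print("docs examined: " + r.totalDocsExamined);
print("docs returned: " + r.nReturned);
'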
Connections
I look at conns_created_per_minute to see whether connection pools are used consistently. Ideally, when connection pools are used, only a minimal number of connections should be created per minute. If you see spikes of conns_created_per_minute like the chart below, ask your developers whether all applications use connection pools.
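On the application side, pooling is usually just a matter of the driver's connection options. Below is a minimal sketch of a connection string with explicit pool bounds; the host, credentials, and pool sizes are illustrative:
# Reuse pooled connections instead of opening a new connection per request
export MONGODB_URI="mongodb://app:secret@shard-00.example.net:27017/mydb?maxPoolSize=100&minPoolSize=10"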
CPU Usages (%)
If a system is properly provisioned, you shouldn't see the CPU pegged, i.e. 0% CPU idle. If you have slow disks and a high rate of writes, you will probably see a growing percentage of cpu_iowait. If your mongo server is a virtual machine hosted on a resource-overcommitted host, you could see a consistently high cpu_system line even though the mongo server itself has low activity, because other VMs may be using CPU cycles heavily.
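To see where CPU time actually goes on the database host, mpstat gives a per-core breakdown of user, system, iowait, and idle time. A minimal sketch, assuming a Linux host with sysstat installed:
# Per-core CPU breakdown every 5 seconds; watch %iowait and %sys
mpstat -P ALL 5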
Replication Lags (seconds)
Replication lag represents the delay between an operation on the primary and the application of that operation from the oplog to a secondary. This chart is only available when diagnosing a replica set, and it shows how far a secondary is behind the primary. High replication lag can be caused by networking issues, slow oplog application on secondary nodes, and/or insufficient write capacity. Below is an example:
Other than application tuning, better hardware (faster network switches and/or disks) is likely required to reduce replication lag. A common solution is to replace existing servers with better hardware or to add more shards. Until a new solution is in place, make sure there is a sufficient oplog window (the interval of time between the oldest and the newest entries in the oplog) to avoid losing data.
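The current lag and the oplog window can also be checked directly from the mongo shell while connected to a replica set member. A minimal sketch:
# Oplog window (log length start to end) and per-secondary lag
mongo --quiet --eval 'rs.printReplicationInfo(); rs.printSecondaryReplicationInfo()'
# Note: older mongo shells name the second helper rs.printSlaveReplicationInfo()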
Identifying Slow Operations from Logs
Keyhole, with the --loginfo flag, reads mongo logs and prints a summary of slow operations grouped by query pattern (filter). The input files can be either plain text or gzipped; being able to read gzipped files directly comes in handy. Below is an example result.
The complete usage of log analytics is as follows:
keyhole --loginfo log_file[.gz] [--collscan] [-v]

With the -v flag, Keyhole also prints the original log lines of the top 20 slow operations. In addition, with the --collscan flag, Keyhole only outputs operations missing indexes. A query pattern marked as COLLSCAN means that it didn't use an index, and adding a proper index should resolve it. On the other hand, even when an index was used, it might not be efficient enough to support the query pattern. It's quite common to find some indexes optimized for querying and others for sorting; compound indexes can be structured to optimize for both whenever possible, as sketched below.
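As a sketch of that last point, a compound index can place the equality filter field first and the sort field after it, so a single index serves both the query and the sort; the collection and field names are hypothetical:
# One compound index covering both the filter (status) and the sort (orderDate)
mongo mydb --quiet --eval 'db.orders.createIndex({ status: 1, orderDate: -1 })'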
Note that an output file with a ".enc" extension is created upon completion of log parsing. It is designed for administrators who are reluctant to, or prohibited from, copying log files off their servers. Instead, they can process the logs on the server and view the summary later with the command below:
keyhole --loginfo mongodb.log.enc

Alternatively, you can use Maobi to generate an HTML report for a more user-friendly view of the results. Below is an example report:
Recap
Combined with the methods described in Part 1, you can use the Keyhole tool to quickly identify performance bottlenecks and tune them accordingly. In summary, Keyhole helps you to:
- Verify new installations
- Collect MongoDB cluster information including configurations and statistics
- Visualize resource usages in a snapshot
- Identify performance bottlenecks and slow database operations
As always, I would love to hear from you about the Keyhole tool. Please get in touch to let me know your thoughts.