Search PDFs at Scale with MongoDB and Nomic

Luca Napoli and Richard Guo (Nomic)
April 30, 2024 | Updated: March 6, 2025
#genAI

Data is only valuable if it’s accessible. For example, storing photos, audio files, or PDFs without the ability to extract information from them is like keeping junk in your basement, thinking you might need it someday. The problem is that when the day comes, you have to dig through all that junk to find what you need.

Until now, companies have followed a similar approach to unstructured data: store everything in data lakes for future use. But whether it’s junk in a basement or data in a data lake, the result is the same: accessibility is hard or impossible.

However, the latest advancements in AI have disrupted this status quo. AI can effectively and efficiently compare similar objects by generating a vector representation, or embedding, of a data object. This capability has revolutionized industries by enabling faster and more precise search, categorization, and recommendation systems than ever before. Whether it's being used to compare text, documents, images, or complex patterns in data, embeddings allow for nuanced interpretations and connections that were impossible with traditional methods. By taking advantage of AI, users can uncover insights and make decisions with unprecedented speed and accuracy.

A particularly interesting use case is PDF search, since every company in the world deals with PDFs in one way or another. While PDFs allow portability across platforms and operating systems, most PDF readers only allow for basic exact-match queries.

Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

PDF search powered by MongoDB and Nomic

Enter MongoDB and Nomic: MongoDB Atlas Vector Search with Nomic Embed equips organizations with a powerful and affordable AI-powered search solution for large PDF collections.

Nomic is a machine learning company specializing in explainable and accessible AI. Nomic Embed is the company’s flagship text embedding model, with out-of-the-box features suitable for scalable PDF search. Its features include:

Long context: Nomic Embed breaks new ground by supporting a long context length of 8192 tokens, exceeding the standard 2048. This extended context makes the model ideal for real-world applications that involve processing large PDFs and documents.
High throughput: While achieving top performance on the MTEB embedding benchmark, Nomic Embed is smaller than similarly performing models. At only 137 million parameters and 548MB, Nomic Embed enables high-throughput embedding generation for data-heavy workflows or streaming applications.
Flexible storage: Nomic Embed provides adjustable embedding size via Matryoshka representation learning. Users can freely choose to store the first 64, 128, 256, or 512 embedding dimensions out of the full 768, depending on their project requirements. Smaller embedding sizes come at a minimal performance loss while providing lower storage costs and faster computing benefits.
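To make the flexible storage point concrete, here is a minimal sketch of what shortening a Matryoshka embedding can look like: keep the leading dimensions and rescale the result to unit length. The function name `truncate_embedding` and the sample vector are hypothetical and are not part of any Nomic API. The model's own documentation may prescribe extra steps, such as a normalization pass before truncation, so treat this as an illustration of the idea only.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` values of an embedding and rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]

# Stand-in for a 768-dimensional embedding (the values are made up).
full = [math.sin(i + 1) for i in range(768)]
small = truncate_embedding(full, 256)
```

Storing `small` instead of `full` cuts the per-vector footprint to a third, which is the storage versus accuracy trade-off described above.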

To put Nomic Embed’s abilities in context, consider a company that processes a high volume of PDFs—say 100,000 documents per month—with an average length of 20 pages each. To improve database retrieval speed, these documents can be partitioned into smaller chunks, such as 2 pages per chunk (see Figure 1 below). Assuming a full page typically contains around 500 words, each document chunk would consist of approximately 1000 words.

**Figure 1:** PDF chunking, embedding creation with Nomic, and storage into MongoDB
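The chunking step in Figure 1 can be sketched in a few lines. The example below assumes the text of each page has already been extracted into a list of strings (PDF text extraction itself is out of scope here), and the helper name `chunk_pages` is hypothetical.

```python
def chunk_pages(pages, pages_per_chunk=2):
    """Group consecutive page texts into chunks of `pages_per_chunk` pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        "\n".join(pages[i:i + pages_per_chunk])
        for i in range(0, len(pages), pages_per_chunk)
    ]

# Placeholder text for a 20-page document.
pages = [f"text of page {n}" for n in range(1, 21)]
chunks = chunk_pages(pages)
```

With 20 pages and 2 pages per chunk, this yields the 10 chunks per document used in the estimate that follows.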

Embedding models process words as numerical tokens, where a general rule of thumb is that 3/4 of a word equals 1 token. One embedding is more than sufficient to represent a document chunk in this case, as 1,000 words x 4/3 ≈ 1,333 tokens fit comfortably within Nomic Embed’s 8192-token context window.

A PDF search application for this company would require 100,000 PDFs x 10 chunks = 1,000,000 embeddings per month. Benchmarked on Nomic’s AWS SageMaker real-time inference offering on a single-GPU ml.g5.xlarge instance, the total runtime is under 4 hours, for a total of $15.60 per month. A similarly performing embedding model, such as OpenAI’s text-embedding-3-small, costs $26.66 per month to generate the same number of embeddings.
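The sizing arithmetic above can be reproduced with a short script. Every input comes from the scenario in this article except the per-token API price, which is an assumption: the article does not state the rate behind the $26.66 figure, and the value used here (USD 0.02 per million tokens) is simply one that lands close to that number.

```python
DOCS_PER_MONTH = 100_000
PAGES_PER_DOC = 20
PAGES_PER_CHUNK = 2
WORDS_PER_PAGE = 500
TOKENS_PER_WORD = 4 / 3      # rule of thumb: 3/4 word = 1 token
CONTEXT_WINDOW = 8192        # Nomic Embed context length in tokens

chunks_per_doc = PAGES_PER_DOC // PAGES_PER_CHUNK
embeddings_per_month = DOCS_PER_MONTH * chunks_per_doc
tokens_per_chunk = WORDS_PER_PAGE * PAGES_PER_CHUNK * TOKENS_PER_WORD
tokens_per_month = embeddings_per_month * tokens_per_chunk

# ASSUMPTION: price per million tokens, not stated in the article.
ASSUMED_USD_PER_MILLION_TOKENS = 0.02
api_cost = tokens_per_month / 1_000_000 * ASSUMED_USD_PER_MILLION_TOKENS

print(f"{embeddings_per_month:,} embeddings per month")
print(f"about {tokens_per_chunk:.0f} tokens per chunk (window: {CONTEXT_WINDOW})")
print(f"estimated API cost: ${api_cost:.2f} per month")
```

Changing `PAGES_PER_CHUNK` shows how the chunk size trades off the number of embeddings against tokens per embedding while the total token volume stays the same.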

Once the embeddings are stored in MongoDB Atlas, it’s possible to create an Atlas Vector Search index to unlock their potential. Building a PDF search application at this point becomes straightforward. The query text is vectorized, and the embedding is fed to Atlas Vector Search to retrieve similar vectors. The result is a list of the most semantically similar sections of the PDF relevant to the original text. This is a significant leap forward compared to a simple “ctrl-f” search, as it captures meaning rather than just keyword matches.
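As a rough sketch of the retrieval step, the snippet below builds an Atlas Vector Search index definition and a `$vectorSearch` aggregation pipeline as plain Python dictionaries. The field names (`embedding`, `text`, `source_pdf`), the index name `pdf_chunks_index`, and both helper functions are assumptions made for illustration. The query vector would normally come from embedding the user's query with the same model used for the stored chunks.

```python
def vector_index_definition(path="embedding", dims=768):
    """Index definition for one vector field, using cosine similarity."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dims,
                "similarity": "cosine",
            }
        ]
    }

def vector_search_pipeline(query_vector, index_name="pdf_chunks_index",
                           path="embedding", limit=5, num_candidates=100):
    """Aggregation pipeline that returns the chunks closest to the query vector."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return the chunk text, its source, and the similarity score.
            "$project": {
                "_id": 0,
                "text": 1,
                "source_pdf": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

# Placeholder query vector; a real one comes from the embedding model.
pipeline = vector_search_pipeline([0.0] * 768)
```

With PyMongo, a pipeline like this could be passed to `collection.aggregate(pipeline)` once the index exists on the collection.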

This process can be further improved by implementing a retrieval-augmented generation (RAG) pipeline, combining Atlas Vector Search and a large language model (LLM). As shown in Figure 2, this approach allows users to ask questions in natural language about the content of the PDF. The relevant documents are then fed to the LLM as context, and the AI is able to provide structured answers by leveraging knowledge about the data.

**Figure 2:** Retrieval Augmented Generation flow with Nomic
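The "feed relevant documents to the LLM as context" step in Figure 2 largely comes down to prompt assembly. Below is a minimal sketch, assuming each retrieved chunk is a dictionary with a `text` field. The helper `build_rag_prompt`, the prompt wording, and the character budget are all hypothetical, and the call to the LLM itself is left out.

```python
def build_rag_prompt(question, retrieved_chunks, max_chars=4000):
    """Combine retrieved chunk texts and a user question into one prompt."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk['text']}"
        if used + len(entry) > max_chars:
            break  # stay within a rough context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up retrieval results standing in for vector search output.
retrieved = [
    {"text": "Pump seals should be inspected monthly."},
    {"text": "Replace worn seals before restarting the line."},
]
prompt = build_rag_prompt("How often should pump seals be inspected?", retrieved)
```

Numbering the chunks makes it easier to ask the model to cite which passage supports its answer.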

In a nutshell, Nomic and MongoDB provide the building blocks for advanced RAG applications, equipping developers with a cost-effective and integrated toolset.

Seamless integration, supercharged search: Nomic Embeddings in MongoDB Atlas

MongoDB Atlas seamlessly ingests Nomic embeddings with its flexible document storage format. Depending on the application, embeddings and additional metadata can be neatly stored together or separately in MongoDB collections. MongoDB Atlas and Nomic Embed are both available as AWS Marketplace offerings for same-VPC deployments.
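One possible document shape for storing an embedding together with its metadata is sketched below. The field names and the helper `make_chunk_document` are assumptions chosen for illustration. The article does not prescribe a schema, and some applications may prefer to keep embeddings in a separate collection.

```python
from datetime import datetime, timezone

def make_chunk_document(source_pdf, chunk_index, text, embedding, pages_per_chunk=2):
    """Build a document holding one PDF chunk, its embedding, and metadata."""
    first_page = chunk_index * pages_per_chunk + 1
    return {
        "source_pdf": source_pdf,
        "chunk_index": chunk_index,
        "pages": [first_page, first_page + pages_per_chunk - 1],
        "text": text,
        "embedding": [float(x) for x in embedding],
        "created_at": datetime.now(timezone.utc),
    }

# Placeholder values; a real embedding would come from the embedding model.
doc = make_chunk_document("annual_report.pdf", 2, "chunk text", [0.1, 0.2, 0.3])
```

A list of such documents could then be written in bulk with PyMongo's `collection.insert_many(docs)`.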

MongoDB Atlas Stream Processing is a perfect fit for Nomic Embed’s high throughput capabilities. Incoming data streams are robustly processed and can be combined with MongoDB Database Triggers to generate embeddings for immediate downstream use. Given Nomic Embed’s lightweight nature and offline capabilities (via private or local deployments from open source), embeddings can be produced and ingested into MongoDB at extremely rapid transfer rates.

MongoDB Atlas Vector Search delivers a fast and accessible method to leverage Nomic embeddings for semantic search. MongoDB Atlas Vector Search lets you combine these fast vector search queries with traditional database queries on various metadata, providing a flexible and powerful analytics tool for data insights, user recommendations, and more.
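Combining vector search with a metadata condition might look like the sketch below, which adds a `filter` clause to the `$vectorSearch` stage and declares the filtered fields in the index definition. The field names (`embedding`, `source_pdf`, `year`), the index name, and both helpers are hypothetical.

```python
def filtered_index_definition(vector_path="embedding", dims=768,
                              filter_paths=("source_pdf", "year")):
    """Index definition with one vector field plus fields usable in filters."""
    fields = [
        {
            "type": "vector",
            "path": vector_path,
            "numDimensions": dims,
            "similarity": "cosine",
        }
    ]
    fields += [{"type": "filter", "path": p} for p in filter_paths]
    return {"fields": fields}

def filtered_search_stage(query_vector, metadata_filter,
                          index_name="pdf_chunks_index", path="embedding",
                          limit=5, num_candidates=100):
    """$vectorSearch stage restricted to documents matching `metadata_filter`."""
    return {
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": limit,
            "filter": metadata_filter,
        }
    }

# Only search chunks from one file, published in 2023 or later (made-up values).
stage = filtered_search_stage(
    [0.0] * 768,
    {"$and": [{"source_pdf": {"$eq": "annual_report.pdf"}},
              {"year": {"$gte": 2023}}]},
)
```

Further stages such as `$project` or `$lookup` can follow this stage in the same aggregation pipeline to join the results with other operational data.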

Industry use cases

PDFs are ubiquitous. In one way or another, every company in the world needs to extract and analyze PDF content to make business decisions or comply with regulations. Let’s have a look at some industry use cases:

Financial services

The financial services industry is constantly bombarded with essential updates, including market data, financial statements, and regulatory changes. Some of this information, such as financial statements, annual reports, and regulatory filings, resides in PDF format. Efficient and reliable navigation through these documents is crucial for gaining a competitive edge in investment decision-making. For example, investors scrutinize key financial metrics such as revenue growth, profit margins, and cash flow trends extracted from income statements, balance sheets, and cash flow statements. They use this information to compare companies, gauging their strategic direction, risks, and competitive positioning before investing. However, accessing and extracting data from these PDFs can be a time-consuming challenge, hindering agility in the fast-paced financial landscape. Here, semantic search for financial PDFs offers a dramatic improvement in information discovery. By leveraging semantic search technology, which interprets the intent and contextual meaning behind a search query, financial services professionals can significantly enhance their ability to find relevant information. This applies equally to the broader financial industry, including areas like market analysis, performance evaluation, and many more.

Retail

In the retail industry, the challenge of processing hundreds of thousands of invoices from numerous suppliers annually is a common scenario. Most invoices are in PDF format, and the challenge arises from the combination of invoice volume and the variability in layouts and languages from one supplier to another. This makes manual processing impractical and error-prone. The question becomes: how can retailers automate this end-to-end process efficiently and accurately? The answer lies in solutions that utilize advanced technologies like AI and PDF search capabilities. By leveraging these solutions, retailers can automatically scan invoices, extract relevant data, and validate it against purchase orders and received goods. Moreover, these solutions offer the flexibility to adapt to different invoice layouts without the need for templates, ensuring scalability and efficiency gains. With increased automation rates and improved accuracy levels, retailers can shift focus from low-value manual tasks to more strategic initiatives, accelerating their digital transformation journey and unlocking significant cost savings along the way.

Manufacturing & motion

There are vast amounts of unstructured data contained in PDFs across the Manufacturing and Automotive industries, from machine instruction booklets to production or maintenance guidelines, Six Sigma best practices, production results, and team lead annotations. All this valuable data must be shared, read, and stored manually, introducing significant friction when it comes to leveraging its full potential. With MongoDB Atlas Vector Search, manufacturing companies have the opportunity to completely revive this data and make real use of it in their day-to-day operations, all while reducing the time spent managing these manuals and having everything ready to be accessed. It is as simple as vectorizing the documents, uploading them to MongoDB Atlas, and connecting a RAG-enabled application to this data source. With this, operators in a manufacturing plant can describe a problem to a smart interface and ask how to troubleshoot it. The interface will retrieve the specific parts of the manual that show how to address the issue. Moreover, it can also retrieve notes from previous operators, team leaders, or previous troubleshooting efforts, providing a very rich context and accelerating the problem-solving process. PDF RAG-enabled applications in manufacturing open up a wide range of operational improvements that directly benefit the company's bottom line.

PDF search at scale

In today’s data-driven world, extracting insights from unstructured data like PDFs is challenging. Traditional search methods fall short, but advancements in AI, like Nomic Embed, have revolutionized PDF search. By leveraging MongoDB with Nomic Embed, organizations gain a powerful and cost-effective AI-powered solution for large PDF collections. Nomic Embed’s long context and high throughput, together with MongoDB’s seamless integration and powerful analytics, enable efficient and reliable PDF search applications. This translates to enhanced data accessibility, faster decision-making, and improved operational efficiency.

Don't waste time struggling with traditional PDF search! Apply for an innovation workshop to discuss what’s possible with our industry experts.

