
Multimodal Event Explorer with MongoDB and Voyage AI

Analyze non-standard events drawn from autonomous driving sensor output using multimodal AI, hybrid search, and a conversational agent powered by AWS Bedrock and S3, and backed by MongoDB.

Use cases: Artificial Intelligence, Internet of Things

Industries: Manufacturing & Mobility

Products: MongoDB Atlas, MongoDB Search, MongoDB Vector Search, MongoDB Voyage AI

Partners: Amazon Bedrock

Autonomous driving systems generate enormous volumes of sensor data: high-resolution images, LiDAR sweeps, radar frames, and telemetry logs. Within this flood of information, the most valuable data points are often the rarest: unusual or unexpected driving scenarios known as edge or corner cases. These data points include animals on the road, flooded intersections, unusual construction zones, and other situations that autonomous systems rarely encounter during standard validation and road testing.

Finding these rare scenarios manually is slow and costly. Data scientists often spend significant time writing custom filters and scripts to locate specific events, and such manual methods may surface only a handful of edge cases per year, while teams need to scale that number to thousands.

The Multimodal Event Explorer demonstrates one way to solve this challenge by combining MongoDB Atlas Search with Voyage AI embeddings and a conversational AI agent. The solution enables teams to:

  • Search across driving events using natural language descriptions.

  • Filter by environmental conditions, such as weather, season, and time of day.

  • Interact with a ReAct-based AI agent that reasons over the database in real time.

This solution allows engineers and data scientists to discover rare driving scenarios at fleet scale in seconds rather than weeks, accelerating model training cycles and improving the safety and reliability of autonomous driving systems.

The solution follows a tiered architecture with a clear separation between the web application, backend services, and application data platform.

Figure 1. High-level architecture of the Multimodal Event Explorer

The frontend is built with Next.js and uses LeafyGreen UI components for a MongoDB-branded experience. It provides the following main interaction surfaces:

  • A search bar with metadata filter dropdowns.

  • A results grid displaying matched driving event images.

  • A chat panel for conversational queries through the AI agent.

A FastAPI-based Python backend orchestrates the core logic. It exposes:

  • A Hybrid Search API that combines vector and full-text search.

  • A Reranker API that applies Voyage AI reranking to refine results.

  • A ReAct agent that uses AWS Bedrock (Claude) with a tool discovery registry to reason over the database.

All data resides in a single MongoDB Atlas collection. Each document contains:

  • The event image (or a reference to it in S3).

  • A text description.

  • Environmental metadata fields, such as season, weather, time of day, and rarity score.

  • A 1024-dimensional vector embedding generated by Voyage AI's voyage-multimodal-3 model.

MongoDB Atlas provides the vector search index (with scalar quantization), a full-text search index, and aggregation pipelines, all queried through a unified API.

Voyage AI provides the embedding model (voyage-multimodal-3) for generating vector representations of images, metadata, and queries, plus a reranker model (rerank-2) for improving result relevance.
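As an illustration, the following is a minimal sketch of an embedding call with the Voyage AI Python client. The model name comes from this solution, while the file path, description text, and client setup are placeholders:

import os
import voyageai
from PIL import Image

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# voyage-multimodal-3 accepts interleaved text and images in one input, so an
# image and its description share a single 1024-dimensional vector space.
image = Image.open("adas/e_00601.jpg")  # placeholder path
result = vo.multimodal_embed(
    inputs=[[image, "A foggy night scene on a rural road with low visibility."]],
    model="voyage-multimodal-3",
    input_type="document",  # use "query" when embedding a search query
)
embedding = result.embeddings[0]  # 1024 floats, stored as image_embedding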

AWS Bedrock hosts Claude models, which power the conversational AI agent for this solution. Users can access the agent via the chat panel located in the bottom-right corner of the interface.

Figure 2. Solution User Interface

The agent reasons about the user's question, decides which tool to call, executes it against the live MongoDB database, observes the result, and repeats the process until it can provide a final response.
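A hedged sketch of this loop using the AWS Bedrock Converse API might look like the following. The model ID, tool registry, and tool schemas are assumptions, and streaming is omitted for brevity:

import boto3

client = boto3.client("bedrock-runtime")

# Hypothetical registry mapping tool names to Python callables; in the real
# backend these run hybrid search, $facet stats, and scenario comparisons.
TOOLS = {
    "search_events": lambda **kwargs: {"results": []},        # stub
    "get_stats": lambda **kwargs: {"facets": {}},              # stub
    "compare_scenarios": lambda **kwargs: {"a": [], "b": []},  # stub
}

def run_agent(question, tool_config,
              model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"):  # assumed model ID
    messages = [{"role": "user", "content": [{"text": question}]}]
    while True:
        # Reason: let Claude decide whether to answer or call a tool.
        resp = client.converse(modelId=model_id, messages=messages,
                               toolConfig=tool_config)
        messages.append(resp["output"]["message"])
        if resp["stopReason"] != "tool_use":
            # Final answer: no further tool calls requested.
            return resp["output"]["message"]["content"][0]["text"]
        # Act and observe: execute each requested tool, feed results back.
        tool_results = []
        for block in resp["output"]["message"]["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                observation = TOOLS[call["name"]](**call["input"])
                tool_results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": observation}],
                }})
        messages.append({"role": "user", "content": tool_results})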

The agent uses these tools:

  1. search_events: Executes hybrid vector and text search against the MongoDB collection.

  2. get_stats: Uses a $facet aggregation that returns weather, season, and time-of-day distributions, plus rarity statistics, across the entire collection.

  3. compare_scenarios: Performs two parallel searches and returns the results side by side.

The agent streams its execution trace to the UI in real time, so every tool call and result is visible as it happens.

compare_scenarios demonstrates a Human-in-the-Loop (HITL) pattern. When Claude decides to call it, the backend pauses the stream and sends an approval prompt to the UI before the tool executes. The user must click Approve or Reject. If no response arrives within 60 seconds, the tool is automatically skipped.

Note: To trigger this flow, users can open the chat panel and click the suggested question labeled "Human in the loop demo", or type any question asking the agent to compare two driving scenarios, such as "Compare foggy vs clear weather driving scenarios".
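Behind the scenes, the approval gate can be implemented as an awaitable with a timeout. The following is a hypothetical sketch, not the actual backend code; all names are illustrative:

import asyncio

pending_approvals: dict[str, asyncio.Future] = {}

async def request_approval(tool_call_id: str, send_to_ui) -> bool:
    # Pause the agent stream and ask the UI to approve the tool call.
    future = asyncio.get_running_loop().create_future()
    pending_approvals[tool_call_id] = future
    await send_to_ui({"type": "approval_request", "tool_call_id": tool_call_id})
    try:
        return await asyncio.wait_for(future, timeout=60)  # True = Approve
    except asyncio.TimeoutError:
        return False  # no response within 60 seconds: skip the tool
    finally:
        pending_approvals.pop(tool_call_id, None)

def resolve_approval(tool_call_id: str, approved: bool) -> None:
    # Called by the approval endpoint when the user clicks Approve or Reject.
    future = pending_approvals.get(tool_call_id)
    if future is not None and not future.done():
        future.set_result(approved)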

The solution stores all event data in a single MongoDB collection, using the flexible document model to co-locate multimodal information that would typically require costly joins across multiple tables in a relational system. This results in faster queries and a simpler data model as event complexity grows. The following snippet illustrates the document model in practice:

{
  "event_id": "e_00601",
  "domain": "adas",
  "source_dataset": "autonomous-driving-dataset",
  "image_path": "adas/e_00601.jpg",
  "image_url": null,
  "image_embedding": [0.0412, -0.0183, 0.0097, "... 1021 more values ..."],
  "text_description": "A foggy night scene on a rural road with low visibility and no other vehicles in sight.",
  "metadata": {
    "season": "fall",
    "time_of_day": "night",
    "weather": "foggy",
    "environment": "rural",
    "rarity_score": 0.847,
    "source_index": 601
  },
  "embedding_metadata": {
    "model": "voyage-multimodal-3.5",
    "dimensions": 1024,
    "original_bytes": 4096,
    "quantized_bytes": 1024
  },
  "created_at": "2025-03-15T09:42:11.000Z",
  "updated_at": null
}

Each event document contains these key fields:

  • event_id: Unique identifier for the event; it matches the image filename on disk or in S3.

  • domain: Category of the event (e.g. "adas" for autonomous driving).

  • source_dataset: The originating dataset.

  • image_path: Relative path to the image on the local filesystem.

  • image_url: S3 or CloudFront URL once images are migrated to cloud storage; null in local development.

  • image_embedding: 1024-dimensional float32 vector generated by Voyage AI (voyage-multimodal-3 or 3.5).

  • text_description: Natural language description of the scene, used for full-text Atlas Search.

  • metadata: Nested object containing:

    • season: spring, summer, fall, or winter.

    • time_of_day: dawn, day, dusk, or night.

    • weather: clear, cloudy, rainy, or foggy.

    • environment: driving environment (e.g. rural).

    • rarity_score: 0–1 indicator of how uncommon the scenario is.

  • embedding_metadata: Tracks the embedding model name, vector dimensions, and byte sizes for both the original float32 and quantized int8 representations.

  • created_at: UTC timestamp of when the document was ingested.

By co-locating the image reference, text description, metadata, and vector embedding in a single document, the solution eliminates joins. A hybrid search query can match on the vector embedding and the text description simultaneously, apply pre-filters on metadata fields (season, weather, time_of_day), and return complete results in a single round-trip.
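To make this concrete, a hybrid query under this model might look like the following sketch. It assumes a pymongo database handle db, a collection named events, indexes named vector_index and text_index, and a query that has already been embedded; $rankFusion requires MongoDB 8.1 or later:

# Assumes query_text comes from the user and query_embedding was generated
# with voyage-multimodal-3; index and collection names are assumptions.
pre_filter = {"metadata.weather": "foggy", "metadata.time_of_day": "night"}

pipeline = [
    {"$rankFusion": {
        "input": {"pipelines": {
            # Semantic match on the embedding, pre-filtered on metadata.
            "vector": [{"$vectorSearch": {
                "index": "vector_index",
                "path": "image_embedding",
                "queryVector": query_embedding,
                "filter": pre_filter,
                "numCandidates": 200,
                "limit": 20,
            }}],
            # Lexical match on the text description.
            "text": [
                {"$search": {"index": "text_index",
                             "text": {"query": query_text,
                                      "path": "text_description"}}},
                {"$limit": 20},
            ],
        }},
    }},
    {"$limit": 10},
]
results = list(db.events.aggregate(pipeline))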

The $facet aggregation allows the AI agent to compute distribution statistics—such as weather breakdown, event counts by season, time-of-day histograms, and rarity score statistics (average, min, max)—across the entire collection in a single pipeline execution, with no need for multiple queries.
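A minimal sketch of such a pipeline, assuming the same events collection, could look like this:

stats_pipeline = [
    {"$facet": {
        # Each sub-facet runs over the full collection in one pass.
        "weather": [{"$group": {"_id": "$metadata.weather", "count": {"$sum": 1}}}],
        "season": [{"$group": {"_id": "$metadata.season", "count": {"$sum": 1}}}],
        "time_of_day": [{"$group": {"_id": "$metadata.time_of_day", "count": {"$sum": 1}}}],
        "rarity": [{"$group": {
            "_id": None,
            "avg": {"$avg": "$metadata.rarity_score"},
            "min": {"$min": "$metadata.rarity_score"},
            "max": {"$max": "$metadata.rarity_score"},
        }}],
    }}
]
stats = list(db.events.aggregate(stats_pipeline))[0]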

The vector search index applies scalar quantization at the index layer, compressing the 1024-dimensional float32 embeddings from 4,096 bytes down to 1,024 bytes per vector (int8). This indexing strategy reduces the in-memory vector payload by 75% and retains approximately 90% recall compared to full-fidelity search. The filter fields (domain, metadata.season, metadata.time_of_day, and metadata.weather) are declared directly in the same index definition. This setup allows metadata pre-filtering to happen inside $vectorSearch before any results are returned to the application.
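Assuming PyMongo and an index named vector_index, the index definition might be created like this sketch:

from pymongo.operations import SearchIndexModel

vector_index = SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={"fields": [
        # Scalar quantization compresses float32 -> int8 at the index layer.
        {"type": "vector", "path": "image_embedding",
         "numDimensions": 1024, "similarity": "cosine",
         "quantization": "scalar"},
        # Filter fields declared here enable pre-filtering inside $vectorSearch.
        {"type": "filter", "path": "domain"},
        {"type": "filter", "path": "metadata.season"},
        {"type": "filter", "path": "metadata.time_of_day"},
        {"type": "filter", "path": "metadata.weather"},
    ]},
)
db.events.create_search_index(model=vector_index)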

The complete source code, with a detailed README, is available in the Industry Solutions public GitHub repository. To deploy the solution, make sure you have the following prerequisites, then follow these steps.

  • Python 3.13

  • Node.js 18 or higher (LTS recommended)

  • uv for Python dependency management

  • A MongoDB Atlas cluster

  • A Voyage AI API key

  • AWS credentials with Bedrock access

Step 1: Ingest the dataset

The solution uses the MIST Autonomous Driving Dataset on HuggingFace (jongwonryu/MIST-autonomous-driving-dataset), but you can substitute your own dataset by following the setup instructions in the README. The ingestion pipeline performs the following steps:

  • Streams images from HuggingFace.

  • Applies diversity gating to ensure balanced coverage across weather, season, and time-of-day combinations.

  • Generates multimodal embeddings via Voyage AI (voyage-multimodal-3).

  • Inserts documents with Vector Search and Atlas Search indexes into MongoDB.

Run the pipeline from the backend/ directory:

uv run python services/ingestion_pipeline.py --sample-size 1000

This command streams approximately 1–2 GB and processes around 1,000 diverse images in 15–30 minutes. The default sample size is 500 if the flag is omitted.
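For illustration, the diversity gating mentioned above can be sketched as a simple per-combination quota. The cap value and function name below are hypothetical, not the pipeline's actual implementation:

from collections import Counter

def diversity_gate(events, per_combo_cap=25):
    # Cap how many images any one (weather, season, time_of_day)
    # combination contributes, so rare combinations are not crowded out.
    seen = Counter()
    for event in events:  # events streamed from HuggingFace
        combo = (event["weather"], event["season"], event["time_of_day"])
        if seen[combo] < per_combo_cap:
            seen[combo] += 1
            yield event  # keep: this combination is still under quota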

Step 2: Configure and start the backend

Copy the example environment file and populate it with your credentials:

cp backend/.env.example backend/.env

Set MONGODB_URI, DATABASE_NAME, VOYAGE_API_KEY, and optionally AWS_REGION / AWS_PROFILE.
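For reference, a populated backend/.env might look like the following; all values are placeholders:

MONGODB_URI=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/
DATABASE_NAME=<your-database-name>
VOYAGE_API_KEY=<your-voyage-api-key>
AWS_REGION=us-east-1
AWS_PROFILE=default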

Start the FastAPI server from the backend/ directory:

uv run uvicorn main:app --host 0.0.0.0 --port 8000

Step 3: Set up and start the frontend

Copy the frontend environment example to frontend/.env.local, install Node dependencies with npm install, and start the dev server with npm run dev. The frontend is accessible at http://localhost:3000.

cp frontend/EXAMPLE.env frontend/.env.local
cd frontend && npm install && npm run dev

Step 4: Deploy with Docker (optional)

For containerized deployment, run make build from the root directory. Docker Compose mounts local AWS credentials so the backend can reach Bedrock without static keys. Use make clean to stop and remove containers.

Step 5: Try the demo

Once the demo is live, submit a query like "night drive in overcast conditions" to see the pipeline in action. The pipeline executes in this order:

  1. Pre-filter: If the user selected metadata filters (season, weather, time of day), these fields are applied as pre-filters inside $vectorSearch, narrowing the candidate set for search.

  2. Hybrid search with $rankFusion: The query is embedded using voyage-multimodal-3, then a vector search against the scalar-quantized index and a full-text MongoDB Search run simultaneously. Reciprocal Rank Fusion merges the two result sets in a single aggregation pipeline.

  3. Reranking: The merged results are passed to Voyage AI's rerank-2 model, which scores each candidate against the original query text for improved precision, as sketched below.
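A minimal sketch of this reranking step with the Voyage AI Python client, assuming each candidate document carries a text_description field and a top_k of 10:

import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def rerank(query_text, candidates, top_k=10):
    # Score each fused candidate's text description against the raw query.
    docs = [c["text_description"] for c in candidates]
    reranking = vo.rerank(query=query_text, documents=docs,
                          model="rerank-2", top_k=top_k)
    # r.index points back into the candidate list, ordered by relevance_score.
    return [candidates[r.index] for r in reranking.results]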

This solution demonstrates the following key learnings:

  • Accelerate edge case discovery with hybrid search: Combining vector search and full-text search through MongoDB's $rankFusion lets teams find rare driving scenarios using natural language queries. This approach also matches specific technical terms, fault codes, or sensor IDs that pure semantic search might miss.

  • Reduce infrastructure costs with scalar quantization: MongoDB Atlas Vector Search compresses 1024-dimensional float32 vectors to int8, achieving memory savings while preserving recall. For datasets with millions of embeddings, this compression directly translates to lower hardware requirements.

  • Simplify multimodal data management with the document model: Storing images, embeddings, text descriptions, and metadata in a single MongoDB document eliminates the synchronization overhead of maintaining separate databases for operational data, a search engine, and a vector store.

  • Empower data scientists with conversational access: The ReAct-based AI agent backed by AWS Bedrock enables natural-language interaction with fleet data. Instead of writing custom aggregation queries, teams can ask questions like "Compare foggy vs. clear weather driving scenarios" and receive structured analysis with tool execution traces.

  • Improve retrieval quality with Voyage AI reranking: Adding a reranking step after hybrid search meaningfully improves result precision. Voyage AI's rerank-2 model re-scores candidates against the original query, pushing the most contextually relevant results to the top.

Author: Humza Akhtar, MongoDB
