Optimize AI Performance with MongoDB Atlas and Fireworks AI

MongoDB Atlas and Fireworks AI deliver faster AI inference, reduced costs, and efficient RAG applications.

Use cases: Gen AI, Model Performance Tuning

Industries: Financial Services

Products: MongoDB Atlas

Partners: Fireworks AI, LangChain

AI applications require both high performance and cost efficiency. Enterprises must optimize the price/performance ratio to ensure speed and efficiency gains translate into cost benefits. MongoDB and Fireworks AI provide a solution that improves performance while reducing costs. MongoDB handles data management while Fireworks AI optimizes models. Together, they enhance latency and throughput while minimizing operational costs.

This solution addresses the following topics:

  • Methods to enhance performance and reduce TCO with MongoDB and Fireworks AI

  • Caching strategies to optimize RAG with MongoDB Atlas and generative AI models

  • Fine-tuning SLMs from LLMs for faster performance with comparable response quality

  • Fireworks AI platform techniques to fine-tune models, accelerate inference, and reduce hardware requirements

  • A credit card recommendation case study with quantified improvements in latency, memory usage, and cost-effectiveness

  • Best practices for production deployment and scaling

This solution provides actionable strategies for enhanced AI performance with reduced costs, supported by practical examples and performance metrics.

MongoDB's flexible schema, efficient indexing, and distributed architecture enable organizations to scale data infrastructure on demand. Combined with Fireworks AI's model optimization capabilities, this solution improves AI performance while reducing costs.

The FireOptimizer framework integrates MongoDB with Fireworks AI's model tuning process to accelerate batch inference through the following components:

  • FireAttention: Enhances request processing and optimizes resource utilization

  • Parameter-Efficient Fine-Tuning: Uses LoRA and QLoRA methods to fine-tune models efficiently with trace or label data, reducing computational requirements

Figure 1. FireOptimizer architecture for adaptive optimization and high-quality inference

This solution provides the following benefits:

  • Improved inference speed: FireOptimizer's adaptive speculative execution provides up to 3x latency improvements for production workloads.

  • Automated optimization: FireOptimizer handles complex optimization processes automatically.

LLMs generate output one token at a time, which slows responses for long outputs. Speculative decoding accelerates this process by using a smaller draft model to propose multiple candidate tokens, which the main LLM then verifies, retaining only the predictions it agrees with.

Traditional draft models trained on generic data work well for general tasks but have lower accuracy in specialized domains like coding or financial analysis.

With Fireworks AI, adaptive speculative execution improves on this approach by using domain-specific or user-profile-customized draft models instead of generic ones. This optimization increases accuracy and hit rates (for example, from 29% to 76% in code generation), reduces inference costs, and provides latency improvements of up to 3x.
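
The following sketch illustrates the draft-then-verify loop behind speculative decoding. The draft_model.greedy_next and target_model.greedy_all methods are hypothetical stand-ins for illustration only, not Fireworks AI APIs; production systems add batching, sampling, and domain-tuned draft models.

# Conceptual sketch of speculative decoding (hypothetical model objects).
# draft_model.greedy_next(tokens): the draft model's next token.
# target_model.greedy_all(tokens, draft): in one forward pass, the target
# model's greedy token at each of the k + 1 positions covered by the draft.

def speculative_decode(target_model, draft_model, tokens, k=4, max_new_tokens=64):
    while len(tokens) < max_new_tokens:
        # Draft phase: the small model proposes k tokens autoregressively.
        draft, context = [], list(tokens)
        for _ in range(k):
            nxt = draft_model.greedy_next(context)
            draft.append(nxt)
            context.append(nxt)

        # Verify phase: a single target-model pass checks every draft position.
        verified = target_model.greedy_all(tokens, draft)  # k + 1 tokens
        for proposed, correct in zip(draft, verified):
            if proposed == correct:
                tokens.append(proposed)   # accepted at no extra target-model cost
            else:
                tokens.append(correct)    # first mismatch: keep the target's token
                break                     # and discard the rest of the draft
        else:
            tokens.append(verified[k])    # all k accepted; keep the bonus token
    return tokens

Higher hit rates (such as the 29% to 76% improvement above) mean more drafted tokens survive verification, so each expensive target-model pass yields several tokens instead of one.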

Long prompts that use between 8K and 32K tokens create performance bottlenecks for applications like document analysis and code completion. Fireworks AI's FireAttention V2 addresses this challenge by providing 12x faster processing for long-context tasks through the following features:

  • Optimized attention scaling: Reduces computational overhead for lengthy inputs

  • Multi-host deployment: Distributes workloads efficiently across GPUs

  • Advanced kernels: Streamlines operations for faster execution

FireAttention V3 extends support to AMD's MI300 GPUs as a cost-efficient alternative to NVIDIA. Performance improvements include:

  • 1.4x–1.8x higher throughput for models like LLaMA 8B and 70B

  • Up to 5.5x speed gains in low-latency scenarios

  • Improved performance through redesigned attention kernels and optimized memory usage

These capabilities enhance fine-tuning of SLMs by enabling efficient processing of long-context inputs for domain-specific tasks. The optimized attention mechanisms reduce computational overhead, support faster training cycles, and enable larger batch sizes across multi-host GPU deployments.

Fireworks AI optimizations extend beyond adaptive speculation through three techniques that maximize throughput and cost-efficiency:

  • Adaptive caching: Reuses frequent computations to skip redundant work, reducing latency by 30–50% for high-traffic workloads

  • Customizable quantization: Balances 4-/8-bit precision with model quality, doubling speeds while maintaining over 99% accuracy for batch processing tasks

  • Disaggregated serving: Tailors hardware allocation to workload type by hosting multiple lightweight model copies or sharding large models across GPUs for complex tasks

Smaller, efficient models offer fine-tuning opportunities for specialized adaptation while maintaining resource efficiency. Research continues in optimizing SLMs for cloud, on-device, and dedicated hardware deployments.

Fine-tuning techniques include:

1. Additive methods: This category introduces additional trainable parameters to existing pre-trained models without modifying the original weights.

  • Adapters: Insert small, trainable layers between model layers. Adapters learn task-specific transformations without altering pre-trained parameters.

  • Soft prompts: Trainable vector embeddings appended to input sequences. They guide model behavior toward desired tasks.

  • Prefix tuning: Adds trainable prefixes to input sequences. Prefixes learn task-specific information without core model modifications.

2. Reparameterization methods: This approach reparameterizes existing model weights using low-rank approximations to reduce the number of trainable parameters.

  • Low-rank adaptation: Approximates weight updates in attention layers using low-rank matrices. This decreases trainable parameters.

  • Quantized LoRA: Enhances LoRA with quantization techniques, reducing memory usage and computational costs.

3. Selective methods: This category selectively fine-tunes only specific pre-trained model parameters, improving computational efficiency.

  • BitFit: Fine-tunes only bias terms or other specific parameters, improving computational efficiency.

  • DiffPruning: Identifies and removes parameters that contribute minimally to model performance, reducing trainable parameters.

4. Layer-freezing strategies: These strategies freeze certain pre-trained model layers while fine-tuning others, optimizing adaptation.

  • Freeze and reconfigure: Freezes specific model layers and fine-tunes remaining layers to optimize adaptation.

  • FishMask: Uses masks to selectively freeze or fine-tune layers for specific tasks.

The most widely used of these approaches is PEFT. PEFT techniques adapt large pre-trained models to new tasks by adjusting only a small fraction of their parameters. This approach prevents overfitting, reduces computational and memory requirements compared to full fine-tuning, and mitigates catastrophic forgetting in LLMs. PEFT enables efficient model customization without full retraining, making it ideal for resource-constrained environments.

You can use PEFT LoRA techniques with trace data (generated from model interactions) and labeled data (explicitly annotated for tasks) to fine-tune smaller models that perform well on specific tasks without extensive computational resources.
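
As a concrete illustration of LoRA's parameter efficiency, the following sketch configures LoRA adapters with the open-source Hugging Face peft library. The base model name and hyperparameters are illustrative assumptions; the Fireworks AI platform used later in this solution applies the same technique as a managed service, so this local setup is not required.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM supported by transformers works here.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (matches --lora-rank 8 used later)
    lora_alpha=16,                         # scaling factor applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters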

For practical applications, this example uses a MongoDB credit card application demo. The demo illustrates MongoDB for credit scoring with predictive analytics, explains credit scoring results using generative AI, and performs credit card recommendations using RAG with LLMs. This fine-tuning example focuses on simplifying credit rating explanations using LLMs. The application includes user profile generation, product recommendations, and reranking with summarization. Find application design and source code details at the MongoDB Credit Card Application Solution Library Page.

LLMs respond slowly because each query requires computation across billions of parameters. Credit card recommendations require multiple LLM queries, resulting in 10 to 20 seconds of total response time, with each query taking 5 or more seconds. LLMs are also difficult and expensive to deploy and scale for millions of users.

  • Small Language Models: SLMs provide faster processing speeds and cost-efficiency. SLMs require less computational power, making them suitable for devices with limited resources. They deliver faster responses and lower operational costs.

  • PEFT and LoRA: PEFT and LoRA improve efficiency by optimizing parameter subsets. This approach reduces memory requirements and operational costs. MongoDB integration enhances data handling and enables efficient model tuning.

  • MongoDB: MongoDB provides data management and real-time integration that improves operational efficiency. MongoDB stores trace data as JSON and enables efficient retrieval and storage, adding value to model fine-tuning. MongoDB acts as a caching layer to avoid repeated LLM invocations for identical requests.

The Credit Card Application demo explains credit scores to customers in clear language. LLMs like Meta's LLaMA 3.1-405B generate these explanations using user profile parameters, model input features, and feature importances from the model that predicts the customer's credit score or rating. SLMs cannot consistently perform these tasks out of the box due to their limited parameters for effective reasoning and explanation. Fine-tuning the SLM with the Fireworks AI fine-tuning platform achieves the desired outcome.

The process to fine-tune an SLM uses the following workflow:

Figure 2. LLM/SLM fine-tuning process

The fine-tuning process starts with collecting relevant, task-specific data. Figure 2 shows how MongoDB Atlas caches LLM/SLM responses for specific users based on their Credit Card application inputs. Users can simulate credit profiles on the web UI. The following Python code snippet demonstrates how to set up a decorator to cache LLM/SLM responses in MongoDB Atlas:

class mdbcache:
    """Decorator that caches LLM/SLM responses in a MongoDB Atlas collection."""
    def __init__(self, function):
        self.function = function

    def __call__(self, *args, **kwargs):
        # Use the stringified arguments as the cache key.
        key = str(args) + str(kwargs)
        ele = ccol.find_one({"key": key})  # ccol: the MongoDB cache collection
        if ele:
            return ele["response"]
        # Cache miss: invoke the wrapped function and store its response.
        value = self.function(*args, **kwargs)
        ccol.insert_one({"key": key, "response": value})
        return value

@mdbcache
def invoke_llm(prompt):
    """
    Invoke the language model with the given prompt, using the cache. The
    llm.invoke method can invoke either an LLM or an SLM based on the
    Fireworks model ID provided at the start of the application.

    Args:
        prompt (str): The prompt to pass to the LLM.
    """
    response = llm.invoke(prompt)
    return response
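
The decorator above assumes two objects defined elsewhere in the application: ccol, the MongoDB Atlas collection used as the cache, and llm, the client for the Fireworks-hosted model. The following minimal sketch shows one way to initialize them; the connection string placeholders come from this demo, and the LangChain Fireworks integration is one option among several.

from pymongo import MongoClient
from langchain_fireworks import Fireworks

# Cache collection used by the mdbcache decorator (the same collection that the
# dataset-preparation step reads later).
client = MongoClient("mongodb+srv://<uid>:<pwd>@bfsi-demo.2wqno.mongodb.net/?retryWrites=true&w=majority")
ccol = client["bfsi-genai"]["cc_cache"]

# Fireworks-hosted model; reads FIREWORKS_API_KEY from the environment.
llm = Fireworks(
    model="accounts/fireworks/models/llama-v3p1-405b-instruct",  # swap for the fine-tuned SLM ID after deployment
    max_tokens=1024,
)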

As shown in the diagram, you can generate the training dataset using a simulator. This example simulates user profiles with stratified sampling to select equal samples for all three credit ratings: Good, Normal, and Poor. This demo generates around 1,300 sample responses.
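
A minimal sketch of this stratified sampling is shown below: draw an equal number of simulated profiles per credit rating so the fine-tuning data is balanced. The simulate_profile helper and its fields are hypothetical stand-ins for the demo's profile simulator.

import random

RATINGS = ["Good", "Normal", "Poor"]
SAMPLES_PER_RATING = 433          # roughly 1,300 samples in total, as in the demo

def simulate_profile(target_rating):
    """Hypothetical stand-in for the demo's user-profile simulator."""
    return {
        "credit_rating": target_rating,
        "utilization_pct": random.randint(5, 95),
        "on_time_payments_pct": random.randint(50, 100),
        "credit_age_years": random.randint(1, 30),
    }

def build_training_profiles():
    profiles = []
    for rating in RATINGS:        # equal samples per rating: stratified sampling
        profiles.extend(simulate_profile(rating) for _ in range(SAMPLES_PER_RATING))
    random.shuffle(profiles)      # avoid batches ordered by class
    return profiles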

Transform the generated responses into the format that the Fireworks AI platform supports for fine-tuning. Generate the cc_cache.jsonl file used in the fine-tuning process by running the following code:

from pymongo import MongoClient
import pandas as pd
import json

client = MongoClient("mongodb+srv://<uid>:<pwd>@bfsi-demo.2wqno.mongodb.net/?retryWrites=true&w=majority")

# Load the cached prompt/response pairs and clean up the stored keys.
df = pd.DataFrame.from_records(client["bfsi-genai"]["cc_cache"].find({}, {"_id": 0}))
df["prompt"] = df["key"].apply(lambda x: x.strip('(').strip('"').strip(")").strip("\\"))
del df["key"]
df["response"] = df["response"].apply(lambda x: x.strip())

# Initial dump of the cleaned cache (overwritten below in the chat-message format).
df.to_json("cc_cache.jsonl", orient="records", lines=True)

# Transform the cache into the message format that Fireworks AI expects for fine-tuning.
messages = []
for _, item in df.iterrows():
    messages += [{"messages": [{"role": "user", "content": item["prompt"].strip(" \\")},
                               {"role": "assistant", "content": item["response"]}]}]

with open("cc_cache.jsonl", "w") as f:
    for item in messages:
        f.write(json.dumps(item) + "\n")

After preparing the dataset and generating the cc_cache.jsonl file, fine-tune the pre-trained llama-v3p1-8b-instruct model with the following steps:

1. Install the Fireworks AI command-line tool:

   pip install firectl

2. Log in to the Fireworks AI platform:

   firectl login

3. Upload the dataset to Fireworks AI:

   firectl create dataset <dataset_name> cc_cache.jsonl

4. Start a supervised fine-tuning job with a LoRA rank of 8:

   firectl create sftj --base-model accounts/fireworks/models/llama-v3p1-8b-instruct --dataset <dataset_name> --output-model ccmodel --lora-rank 8 --epochs 1

5. Monitor the fine-tuning process on the Fireworks AI dashboard.

   Figure 3. Monitoring the fine-tuning process

6. Deploy the fine-tuned model:

   firectl deploy ccmodel

This procedure demonstrates how MongoDB and Fireworks AI integration improves AI model performance cost-effectively.

After deploying the model on the Fireworks platform as a serverless API, use the model ID (models/ft-m88hxaga-pi11m, shown in Figure 2) to invoke the fine-tuned SLM with your preferred language model framework.
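
For example, the following minimal sketch uses the fireworks-ai Python SDK, one possible framework; the LangChain integration shown earlier works as well. The account path is a placeholder for the deployed model ID reported by the platform.

import os
from fireworks.client import Fireworks

fw = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"])

response = fw.chat.completions.create(
    model="accounts/<your-account>/models/ccmodel",   # placeholder: use the deployed model ID
    messages=[{
        "role": "user",
        "content": "Explain this customer's credit rating in plain language: <user profile>",
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)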

Fine-tuning the SLM for credit card recommendations produces the following results:

  1. Response time improvement: LLM response time averages 5 seconds. SLMs reduce this to approximately 0.15 seconds, providing a 19x latency reduction.

    import time

    class tiktok:
        """
        Decorator to time the execution of a function and log the time taken.
        """
        def __init__(self, function):
            self.function = function

        def __call__(self, *args, **kwargs):
            start = time.time()
            value = self.function(*args, **kwargs)
            end = time.time()
            print(f"Time taken for {self.function.__name__}: {end - start} seconds")
            return value

    @tiktok
    @mdbcache
    def invoke_llm(prompt):
        """
        Invoke the language model with the given prompt, using the cache. The
        invoke_llm method can invoke either an LLM or an SLM based on the
        Fireworks model ID initialized at startup.

        Args:
            prompt (str): The prompt to pass to the LLM.
        """
        ...
    Model                             Inference Time 1 (s)   Inference Time 2 (s)   Inference Time 3 (s)   Average Time (s)
    llama-v3p1-405b-instruct          5.5954                 7.5936                 4.9121                 6.0337
    SLM - fine-tuned llama-v3p1-8b    0.3554                 0.0480                 0.0473                 0.1502

  2. Memory reduction: LLMs typically require 8 x 80 GB of VRAM (640 GB total). The fine-tuned SLM operates with 16 GB of VRAM, reducing memory usage by 97.5%.

  3. Hardware reduction: LLMs require high-end GPUs or multiple servers. SLMs can deploy on standard CPUs or single servers, reducing hardware costs.

  • Lower total cost of ownership: LoRA and QLoRA techniques reduce computational requirements for SLM fine-tuning. MongoDB's distributed architecture and efficient indexing scale data infrastructure on-demand, minimizing storage costs and reducing operational expenses.

  • Streamline data and AI workflows: MongoDB enables real-time data integration for AI models. Integration with Fireworks AI's fine-tuning tools creates workflows that keep models updated and improve decision-making accuracy.

  • Enhance RAG solutions: MongoDB Atlas and Fireworks AI combine to create RAG frameworks with improved data storage and retrieval. MongoDB Atlas provides scalable embedding storage while Fireworks AI offers managed LLM/SLM hosting, as illustrated in the sketch after this list.
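
The following minimal sketch shows such a RAG setup using the LangChain integrations for both products. The connection string, database, collection, and index names are placeholders, and the embedding and chat models are illustrative choices, not the only supported options.

import os
from pymongo import MongoClient
from langchain_fireworks import ChatFireworks, FireworksEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Embedding storage and retrieval in MongoDB Atlas (placeholder names).
client = MongoClient(os.environ["MONGODB_URI"])
vector_store = MongoDBAtlasVectorSearch(
    collection=client["bfsi-genai"]["documents"],
    embedding=FireworksEmbeddings(model="nomic-ai/nomic-embed-text-v1.5"),
    index_name="vector_index",
)

# Managed model hosting on Fireworks AI (an LLM, or the fine-tuned SLM).
chat = ChatFireworks(model="accounts/fireworks/models/llama-v3p1-8b-instruct")

question = "Which credit card suits a customer with a Good rating and high travel spend?"
docs = vector_store.as_retriever(search_kwargs={"k": 4}).invoke(question)
context = "\n\n".join(d.page_content for d in docs)
answer = chat.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)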

  • Wei You Pan, MongoDB

  • Ashwin Gangadhar, MongoDB

  • Peyman Parsi, MongoDB

  • Benny Chen, Fireworks AI

  • Ayaan Momin, Fireworks AI
