Optimize AI Performance with MongoDB Atlas and Fireworks AI

MongoDB Atlas and Fireworks AI deliver faster AI inference, reduced costs, and efficient RAG applications.

Use cases: Gen AI, Model Performance Tuning

Industries: Financial Services

Products: MongoDB Atlas

Partners: Fireworks AI, LangChain

AI applications require both high performance and cost efficiency. Enterprises must optimize the price/performance ratio to ensure speed and efficiency gains translate into cost benefits. MongoDB and Fireworks AI provide a solution that improves performance while reducing costs. MongoDB handles data management while Fireworks AI optimizes models. Together, they enhance latency and throughput while minimizing operational costs.

This solution addresses the following topics:

  • Methods to enhance performance and reduce TCO with MongoDB and Fireworks AI

  • Caching strategies to optimize RAG with MongoDB Atlas and generative AI models

  • Fine-tuning SLMs from LLMs for faster performance with comparable response quality

  • Fireworks AI platform techniques to fine-tune models, accelerate inference, and reduce hardware requirements

  • A credit card recommendation case study with quantified improvements in latency, memory usage, and cost-effectiveness

  • Best practices for production deployment and scaling

This solution provides actionable strategies for enhanced AI performance with reduced costs, supported by practical examples and performance metrics.

MongoDB's flexible schema, efficient indexing, and distributed architecture enable organizations to scale data infrastructure on demand. Combined with Fireworks AI's model optimization capabilities, this solution improves AI performance while reducing costs.

The FireOptimizer framework integrates MongoDB with Fireworks AI's model tuning process to accelerate batch inference through the following components:

  • FireAttention: Enhances request processing and optimizes resource utilization

  • Parameter-Efficient Fine-Tuning: Uses LoRA and QLoRA methods to fine-tune models efficiently with trace or label data, reducing computational requirements

Figure 1. FireOptimizer architecture for adaptive optimization and high-quality inference

This solution provides the following benefits:

  • Improved inference speed: FireOptimizer's adaptive speculative execution provides up to 3x latency improvements for production workloads.

  • Automated optimization: FireOptimizer handles complex optimization processes automatically.

LLMs generate output one token at a time, which slows responses for long outputs. Speculative decoding accelerates this process by using a smaller draft model to propose multiple candidate tokens, which the main LLM then verifies, retaining only the predictions it agrees with.

Traditional draft models trained on generic data work well for general tasks but have lower accuracy in specialized domains like coding or financial analysis.

With Fireworks AI, adaptive speculative execution improves on this approach by using domain-specific or user-profile-customized draft models instead of generic ones. This optimization increases accuracy and hit rates (for example, from 29% to 76% in code generation), reduces inference costs, and provides latency improvements of up to 3x.
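
The following sketch illustrates the draft-then-verify loop behind speculative decoding. The draft_model.greedy_next and target_model.greedy_all methods are hypothetical stand-ins for illustration only, not Fireworks AI APIs; production systems add batching, sampling, and domain-tuned draft models.

# Conceptual sketch of speculative decoding (hypothetical model objects).
# draft_model.greedy_next(tokens): the draft model's next token.
# target_model.greedy_all(tokens, draft): in one forward pass, the target
# model's greedy token at each of the k + 1 positions covered by the draft.

def speculative_decode(target_model, draft_model, tokens, k=4, max_new_tokens=64):
    while len(tokens) < max_new_tokens:
        # Draft phase: the small model proposes k tokens autoregressively.
        draft, context = [], list(tokens)
        for _ in range(k):
            nxt = draft_model.greedy_next(context)
            draft.append(nxt)
            context.append(nxt)

        # Verify phase: a single target-model pass checks every draft position.
        verified = target_model.greedy_all(tokens, draft)  # k + 1 tokens
        for proposed, correct in zip(draft, verified):
            if proposed == correct:
                tokens.append(proposed)   # accepted at no extra target-model cost
            else:
                tokens.append(correct)    # first mismatch: keep the target's token
                break                     # and discard the rest of the draft
        else:
            tokens.append(verified[k])    # all k accepted; keep the bonus token
    return tokens

Higher hit rates (such as the 29% to 76% improvement above) mean more drafted tokens survive verification, so each expensive target-model pass yields several tokens instead of one.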

Long prompts that use between 8K and 32K tokens create performance bottlenecks for applications like document analysis and code completion. Fireworks AI's FireAttention V2 addresses this challenge by providing 12x faster processing for long-context tasks through the following features:

  • Optimized attention scaling: Reduces computational overhead for lengthy inputs

  • Multi-host deployment: Distributes workloads efficiently across GPUs

  • Advanced kernels: Streamlines operations for faster execution

FireAttention V3 extends support to AMD's MI300 GPUs as a cost-efficient alternative to NVIDIA. Performance improvements include:

  • 1.4x–1.8x higher throughput for models like LLaMA 8B and 70B

  • Up to 5.5x speed gains in low-latency scenarios

  • Improved performance through redesigned attention kernels and optimized memory usage

These capabilities enhance fine-tuning of SLMs by enabling efficient processing of long-context inputs for domain-specific tasks. The optimized attention mechanisms reduce computational overhead, support faster training cycles, and enable larger batch sizes across multi-host GPU deployments.

Fireworks AI optimizations extend beyond adaptive speculation through three techniques that maximize throughput and cost-efficiency:

  • Adaptive caching: Reuses frequent computations to skip redundant work, reducing latency by 30–50% for high-traffic workloads

  • Customizable quantization: Balances 4-/8-bit precision with model quality, doubling speeds while maintaining over 99% accuracy for batch processing tasks

  • Disaggregated serving: Tailors hardware allocation to workload type by hosting multiple lightweight model copies or sharding large models across GPUs for complex tasks

Smaller, efficient models offer fine-tuning opportunities for specialized adaptation while maintaining resource efficiency. Research continues in optimizing SLMs for cloud, on-device, and dedicated hardware deployments.

Fine-tuning techniques include:

1. Additive methods: This category introduces additional trainable parameters to existing pre-trained models without modifying the original weights.

  • Adapters: Insert small, trainable layers between model layers. Adapters learn task-specific transformations without altering pre-trained parameters.

  • Soft prompts: Trainable vector embeddings appended to input sequences. They guide model behavior toward desired tasks.

  • Prefix tuning: Adds trainable prefixes to input sequences. Prefixes learn task-specific information without core model modifications.

2. Reparameterization methods: This approach reparameterizes existing model weights using low-rank approximations to reduce the number of trainable parameters.

  • Low-rank adaptation: Approximates weight updates in attention layers using low-rank matrices. This decreases trainable parameters.

  • Quantized LoRA: Enhances LoRA with quantization techniques, reducing memory usage and computational costs.

3. Selective methods: This category selectively fine-tunes only specific pre-trained model parameters, improving computational efficiency.

  • BitFit: Fine-tunes only bias terms or other specific parameters, improving computational efficiency.

  • DiffPruning: Identifies and removes parameters that contribute minimally to model performance, reducing trainable parameters.

4. Layer-freezing strategies: These strategies freeze certain pre-trained model layers while fine-tuning others, optimizing adaptation.

  • Freeze and reconfigure: Freezes specific model layers and fine-tunes remaining layers to optimize adaptation.

  • FishMask: Uses masks to selectively freeze or fine-tune layers for specific tasks.

The most widely used of these approaches is PEFT. PEFT techniques adapt large pre-trained models to new tasks by adjusting only a small fraction of their parameters. This approach prevents overfitting, reduces computational and memory requirements compared to full fine-tuning, and mitigates catastrophic forgetting in LLMs. PEFT enables efficient model customization without full retraining, making it ideal for resource-constrained environments.

You can use PEFT LoRA techniques with trace data (generated from model interactions) and labeled data (explicitly annotated for tasks) to fine-tune smaller models that perform well on specific tasks without extensive computational resources.
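
As a concrete illustration of LoRA's parameter efficiency, the following sketch configures LoRA adapters with the open-source Hugging Face peft library. The base model name and hyperparameters are illustrative assumptions; the Fireworks AI platform used later in this solution applies the same technique as a managed service, so this local setup is not required.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM supported by transformers works here.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (matches --lora-rank 8 used later)
    lora_alpha=16,                         # scaling factor applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters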

For practical applications, this example uses a MongoDB credit card application demo. The demo illustrates MongoDB for credit scoring with predictive analytics, explains credit scoring results using generative AI, and performs credit card recommendations using RAG with LLMs. This fine-tuning example focuses on simplifying credit rating explanations using LLMs. The application includes user profile generation, product recommendations, and reranking with summarization. Find application design and source code details at the MongoDB Credit Card Application Solution Library Page.

LLMs respond slowly because each query requires computation across billions of parameters. Credit card recommendations require multiple LLM queries, resulting in 10 to 20 seconds of total response time, with each query taking 5 or more seconds. LLMs are also difficult and expensive to deploy and scale for millions of users.

  • Small Language Models: SLMs provide faster processing speeds and cost-efficiency. SLMs require less computational power, making them suitable for devices with limited resources. They deliver faster responses and lower operational costs.

  • PEFT and LoRA: PEFT and LoRA improve efficiency by optimizing parameter subsets. This approach reduces memory requirements and operational costs. MongoDB integration enhances data handling and enables efficient model tuning.

  • MongoDB: MongoDB provides data management and real-time integration that improves operational efficiency. MongoDB stores trace data as JSON and enables efficient retrieval and storage, adding value to model fine-tuning. MongoDB acts as a caching layer to avoid repeated LLM invocations for identical requests.

The Credit Card Application demo explains credit scores to customers in clear language. LLMs like Meta's LLaMA 3.1-405B generate these explanations using user profile parameters, model input features, and feature importances from the model that predicts the customer's credit score or rating. SLMs cannot consistently perform these tasks out of the box due to their limited parameters for effective reasoning and explanation. Fine-tuning the SLM with the Fireworks AI fine-tuning platform achieves the desired outcome.

The process to fine-tune an SLM uses the following workflow:

Figure 2. LLM/SLM fine-tuning process

The fine-tuning process starts with collecting relevant, task-specific data. Figure 2 shows how MongoDB Atlas caches LLM/SLM responses for specific users based on their Credit Card application inputs. Users can simulate credit profiles on the web UI. The following Python code snippet demonstrates how to set up a decorator to cache LLM/SLM responses in MongoDB Atlas:

class mdbcache:
    """Decorator that caches LLM/SLM responses in a MongoDB Atlas collection."""
    def __init__(self, function):
        self.function = function

    def __call__(self, *args, **kwargs):
        # Use the stringified arguments as the cache key.
        key = str(args) + str(kwargs)
        ele = ccol.find_one({"key": key})  # ccol: the MongoDB cache collection
        if ele:
            return ele["response"]
        # Cache miss: invoke the wrapped function and store its response.
        value = self.function(*args, **kwargs)
        ccol.insert_one({"key": key, "response": value})
        return value

@mdbcache
def invoke_llm(prompt):
    """
    Invoke the language model with the given prompt, using the cache. The
    llm.invoke method can invoke either an LLM or an SLM based on the
    Fireworks model ID provided at the start of the application.

    Args:
        prompt (str): The prompt to pass to the LLM.
    """
    response = llm.invoke(prompt)
    return response
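
The decorator above assumes two objects defined elsewhere in the application: ccol, the MongoDB Atlas collection used as the cache, and llm, the client for the Fireworks-hosted model. The following minimal sketch shows one way to initialize them; the connection string placeholders come from this demo, and the LangChain Fireworks integration is one option among several.

from pymongo import MongoClient
from langchain_fireworks import Fireworks

# Cache collection used by the mdbcache decorator (the same collection that the
# dataset-preparation step reads later).
client = MongoClient("mongodb+srv://<uid>:<pwd>@bfsi-demo.2wqno.mongodb.net/?retryWrites=true&w=majority")
ccol = client["bfsi-genai"]["cc_cache"]

# Fireworks-hosted model; reads FIREWORKS_API_KEY from the environment.
llm = Fireworks(
    model="accounts/fireworks/models/llama-v3p1-405b-instruct",  # swap for the fine-tuned SLM ID after deployment
    max_tokens=1024,
)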

As shown in the diagram, you can generate the training dataset using a simulator. This example simulates user profiles with stratified sampling to select equal samples for all three credit ratings: Good, Normal, and Poor. This demo generates around 1,300 sample responses.
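
A minimal sketch of this stratified sampling is shown below: draw an equal number of simulated profiles per credit rating so the fine-tuning data is balanced. The simulate_profile helper and its fields are hypothetical stand-ins for the demo's profile simulator.

import random

RATINGS = ["Good", "Normal", "Poor"]
SAMPLES_PER_RATING = 433          # roughly 1,300 samples in total, as in the demo

def simulate_profile(target_rating):
    """Hypothetical stand-in for the demo's user-profile simulator."""
    return {
        "credit_rating": target_rating,
        "utilization_pct": random.randint(5, 95),
        "on_time_payments_pct": random.randint(50, 100),
        "credit_age_years": random.randint(1, 30),
    }

def build_training_profiles():
    profiles = []
    for rating in RATINGS:        # equal samples per rating: stratified sampling
        profiles.extend(simulate_profile(rating) for _ in range(SAMPLES_PER_RATING))
    random.shuffle(profiles)      # avoid batches ordered by class
    return profiles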

Transform the generated responses into the format that the Fireworks AI platform supports for fine-tuning. Generate the cc_cache.jsonl file used in the fine-tuning process by running the following code:

from pymongo import MongoClient
import pandas as pd
import json

client = MongoClient("mongodb+srv://<uid>:<pwd>@bfsi-demo.2wqno.mongodb.net/?retryWrites=true&w=majority")

# Load the cached prompt/response pairs and clean up the stored keys.
df = pd.DataFrame.from_records(client["bfsi-genai"]["cc_cache"].find({}, {"_id": 0}))
df["prompt"] = df["key"].apply(lambda x: x.strip('(').strip('"').strip(")").strip("\\"))
del df["key"]
df["response"] = df["response"].apply(lambda x: x.strip())

# Initial dump of the cleaned cache (overwritten below in the chat-message format).
df.to_json("cc_cache.jsonl", orient="records", lines=True)

# Transform the cache into the message format that Fireworks AI expects for fine-tuning.
messages = []
for _, item in df.iterrows():
    messages += [{"messages": [{"role": "user", "content": item["prompt"].strip(" \\")},
                               {"role": "assistant", "content": item["response"]}]}]

with open("cc_cache.jsonl", "w") as f:
    for item in messages:
        f.write(json.dumps(item) + "\n")

After preparing the dataset and generating the cc_cache.jsonl file, fine-tune the pre-trained llama-v3p1-8b-instruct model with the following steps:

1. Install the Fireworks AI command-line tool:

   pip install firectl

2. Log in to the Fireworks AI platform:

   firectl login

3. Upload the dataset to Fireworks AI:

   firectl create dataset <dataset_name> cc_cache.jsonl

4. Start a supervised fine-tuning job with a LoRA rank of 8:

   firectl create sftj --base-model accounts/fireworks/models/llama-v3p1-8b-instruct --dataset <dataset_name> --output-model ccmodel --lora-rank 8 --epochs 1

5. Monitor the fine-tuning process on the Fireworks AI dashboard.

   Figure 3. Monitoring the fine-tuning process

6. Deploy the fine-tuned model:

   firectl deploy ccmodel

This procedure demonstrates how MongoDB and Fireworks AI integration improves AI model performance cost-effectively.

After deploying the model on the Fireworks platform as a serverless API, use the model ID (models/ft-m88hxaga-pi11m, shown in Figure 2) to invoke the fine-tuned SLM with your preferred language model framework.
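
For example, the following minimal sketch uses the fireworks-ai Python SDK, one possible framework; the LangChain integration shown earlier works as well. The account path is a placeholder for the deployed model ID reported by the platform.

import os
from fireworks.client import Fireworks

fw = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"])

response = fw.chat.completions.create(
    model="accounts/<your-account>/models/ccmodel",   # placeholder: use the deployed model ID
    messages=[{
        "role": "user",
        "content": "Explain this customer's credit rating in plain language: <user profile>",
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)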

Fine-tuning the SLM for credit card recommendations produces the following results:

  1. Response time improvement: LLM response time averages 5 seconds. SLMs reduce this to approximately 0.15 seconds, providing a 19x latency reduction.

    import time

    class tiktok:
        """
        Decorator to time the execution of a function and log the time taken.
        """
        def __init__(self, function):
            self.function = function

        def __call__(self, *args, **kwargs):
            start = time.time()
            value = self.function(*args, **kwargs)
            end = time.time()
            print(f"Time taken for {self.function.__name__}: {end - start} seconds")
            return value

    @tiktok
    @mdbcache
    def invoke_llm(prompt):
        """
        Invoke the language model with the given prompt, using the cache. The
        invoke_llm method can invoke either an LLM or an SLM based on the
        Fireworks model ID initialized at startup.

        Args:
            prompt (str): The prompt to pass to the LLM.
        """
        ...
    Model                             Inference Time 1 (s)   Inference Time 2 (s)   Inference Time 3 (s)   Average Time (s)
    llama-v3p1-405b-instruct          5.5954                 7.5936                 4.9121                 6.0337
    SLM - fine-tuned llama-v3p1-8b    0.3554                 0.0480                 0.0473                 0.1502

  2. Memory reduction: LLMs typically require 8 x 80 GB of VRAM (640 GB total). The fine-tuned SLM operates with 16 GB of VRAM, reducing memory usage by 97.5%.

  3. Hardware reduction: LLMs require high-end GPUs or multiple servers. SLMs can deploy on standard CPUs or single servers, reducing hardware costs.

  • Lower total cost of ownership: LoRA and QLoRA techniques reduce computational requirements for SLM fine-tuning. MongoDB's distributed architecture and efficient indexing scale data infrastructure on-demand, minimizing storage costs and reducing operational expenses.

  • Streamline data and AI workflows: MongoDB enables real-time data integration for AI models. Integration with Fireworks AI's fine-tuning tools creates workflows that keep models updated and improve decision-making accuracy.

  • Enhance RAG solutions: MongoDB Atlas and Fireworks AI combine to create RAG frameworks with improved data storage and retrieval. MongoDB Atlas provides scalable embedding storage while Fireworks AI offers managed LLM/SLM hosting, as illustrated in the sketch after this list.
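
The following minimal sketch shows such a RAG setup using the LangChain integrations for both products. The connection string, database, collection, and index names are placeholders, and the embedding and chat models are illustrative choices, not the only supported options.

import os
from pymongo import MongoClient
from langchain_fireworks import ChatFireworks, FireworksEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Embedding storage and retrieval in MongoDB Atlas (placeholder names).
client = MongoClient(os.environ["MONGODB_URI"])
vector_store = MongoDBAtlasVectorSearch(
    collection=client["bfsi-genai"]["documents"],
    embedding=FireworksEmbeddings(model="nomic-ai/nomic-embed-text-v1.5"),
    index_name="vector_index",
)

# Managed model hosting on Fireworks AI (an LLM, or the fine-tuned SLM).
chat = ChatFireworks(model="accounts/fireworks/models/llama-v3p1-8b-instruct")

question = "Which credit card suits a customer with a Good rating and high travel spend?"
docs = vector_store.as_retriever(search_kwargs={"k": 4}).invoke(question)
context = "\n\n".join(d.page_content for d in docs)
answer = chat.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)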

  • Wei You Pan, MongoDB

  • Ashwin Gangadhar, MongoDB

  • Peyman Parsi, MongoDB

  • Benny Chen, Fireworks AI

  • Ayaan Momin, Fireworks AI
