Building AI and RAG Apps With MongoDB, Anyscale and PyMongo
AI has become a core consideration in modern applications, and enterprises are increasingly looking to modernize their software by incorporating AI models. However, relying on publicly available APIs and shared services will not suit every enterprise, especially those that require high scale and strict data segregation.
At the same time, more and more enterprises are moving to the cloud to build applications at scale in fast, competitive cycles, utilizing smart and robust deployment services. This is where the MongoDB Atlas data platform and the Anyscale AI compute platform come in.
Anyscale is a platform for running distributed applications using Ray, an open-source framework for building scalable applications. Integrating MongoDB Atlas with Anyscale and PyMongo can help you efficiently manage and analyze large datasets by leveraging Ray's distributed computing capabilities while using MongoDB as the backend database.
Using Anyscale for fast and efficient LLM inference and MongoDB Atlas' scalable vector indexing for contextual search, you can build and deploy highly scalable, AI-based retrieval-augmented generation (RAG) flows and agentic systems.
In this tutorial, we will cover the following main aspects of creating a scalable RAG back end for your application:
- Deploy a scalable vector store using MongoDB Atlas.
- Deploy the required embedding model and LLM on Anyscale.
- Connect the vector store with your deployed service using PyMongo and Anyscale services.
Get the MongoDB connection string and allow access from the relevant IP addresses. For this tutorial, you will use the 0.0.0.0/0 network access list. For production deployments, you can manage VPCs and private networks on the Anyscale platform and connect them securely to Atlas via VPC peering or cloud private endpoints.
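Before going further, it's worth confirming that your environment can reach the cluster. The check below is a minimal sketch; it assumes your connection string is available in a MONGODB_ATLAS_URI environment variable, which is also how it will be set up later in the workspace.

```python
import os
from pymongo import MongoClient

# Connect with the Atlas connection string (assumed to be in MONGODB_ATLAS_URI)
client = MongoClient(os.getenv("MONGODB_ATLAS_URI"))

# The "ping" admin command confirms the cluster is reachable and the credentials are valid
client.admin.command("ping")
print("Connected to MongoDB Atlas")
```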
Using the Data Explorer or a Compass connection, create the collection that will host the application context for the RAG process. Use the database name anyscale_db and the collection name stories. Once the collection is created, head to the Atlas Search tab (or the Indexes tab in Compass, using the Atlas Search toggle) and create the following vector index, named vector_index:
1 { 2 "fields": [ 3 { 4 "type": "vector", 5 "path": "embedding", 6 "numDimensions": 1024, 7 "similarity": "cosine" 8 } 9 ] 10 }
Register for an account and get free compute credits to try out the Anyscale platform.
Anyscale offers a simple-to-follow template that you can launch from within the Anyscale platform to optimally self-host the models for your RAG service, leveraging RayLLM. RayLLM makes use of best-in-class optimizations from inference libraries like vLLM and offers an OpenAI-compatible API for the deployed models.
With Anyscale, you get the flexibility to choose the hardware and autoscaling configurations for the models so you can trade off cost and performance given your use case.
For the purposes of our guide, we will deploy two models from the template: the thenlper/gte-large embedding model and the meta-llama/Llama-2-70b-chat-hf LLM.
We will use the default values, given they offer a good starting point for hardware and autoscaling configurations.
Please note that some models require approval on the Hugging Face website.
Once you deploy the above models, you should have two working services, each with an accessible API URL and a bearer token.
Let's say that the embedding deployed service is on:
```
https://XXXXXXXXXXXX.anyscaleuserdata.com
```
Take note of the embedding bearer token, as it will be used as our API key for embedding-related tasks.
The LLM is deployed to another endpoint:
```
https://YYYYYYYYYYYYY.anyscaleuserdata.com
```
Take note of the LLM bearer token, as it will be used as our API key for LLM-related tasks.
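Because both services expose an OpenAI-compatible API, you can sanity-check them with the standard OpenAI Python client before wiring them into the RAG service. The snippet below is a sketch: the base URLs are the placeholder endpoints above, and the model names (thenlper/gte-large and meta-llama/Llama-2-70b-chat-hf) are the ones used later in this tutorial.

```python
import os
import openai

# Embedding service (placeholder URL from above)
embedding_client = openai.OpenAI(
    base_url="https://XXXXXXXXXXXX.anyscaleuserdata.com/v1",
    api_key=os.getenv("EMBEDDING_API_KEY"),
)
emb = embedding_client.embeddings.create(model="thenlper/gte-large", input="hello world")
print(len(emb.data[0].embedding))  # expect 1024 dimensions

# LLM service (placeholder URL from above)
llm_client = openai.OpenAI(
    base_url="https://YYYYYYYYYYYYY.anyscaleuserdata.com/v1",
    api_key=os.getenv("LLM_API_KEY"),
)
chat = llm_client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(chat.choices[0].message.content)
```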
Create a workspace template to build our RAG service. We recommend you launch the Intro to Services template to get a running workspace that showcases how to deploy your first Anyscale service.
You can think of an Anyscale workspace as an IDE (such as Visual Studio Code) running against an elastically scalable compute cluster.
We will use this workspace to create all our services.
Let's start by specifying the dependencies and environment variables that our RAG service will require. Click on the Dependencies tab in the workspace and add the following packages:

```bash
pip install pymongo openai fastapi
```
You will need to place the API keys as environment variables under Workspace > YOUR NAME > Dependencies:
- Place the embedding bearer token in the EMBEDDING_API_KEY environment variable.
- Place the LLM bearer token in the LLM_API_KEY environment variable.
- Place your Atlas connection string (e.g., mongodb+srv://<user>:<pass>@cluster0.abcd.mongodb.net/) in the MONGODB_ATLAS_URI environment variable.

Remember to replace the values with your own cluster connection string and keys.
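As a quick safeguard, you can verify inside the workspace that all three variables are visible to Python before deploying anything. This is an optional sketch; the variable names match the ones defined above.

```python
import os

# Fail fast if any required environment variable is missing
for var in ("EMBEDDING_API_KEY", "LLM_API_KEY", "MONGODB_ATLAS_URI"):
    assert os.getenv(var), f"Missing environment variable: {var}"
print("All required environment variables are set")
```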
Now, you can build a RAG service that uses your private embedding and LLM endpoints. Both endpoints are compatible with the OpenAI client, so you can use the OpenAI client inside the MongoDB RAG service as follows.
Create a new file named main.py in the root directory of the workspace:

```python
from typing import Any
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from pymongo import MongoClient
import os
import openai
from dotenv import load_dotenv
from ray import serve
import logging

# Initialize FastAPI
fastapi = FastAPI()

# Initialize logger
logger = logging.getLogger("ray.serve")

# Data model for the request
class QueryModel(BaseModel):
    query: str


# Expose the class as a Ray Serve deployment with FastAPI ingress
@serve.deployment
@serve.ingress(fastapi)
class MongoDBRag:

    def __init__(self):
        # MongoDB connection
        client = MongoClient(os.getenv("MONGODB_ATLAS_URI"))
        db = client["anyscale_db"]
        self.collection = db["stories"]

        # AI Client initialization
        # LLM endpoint for text generation
        self.openai_llm = openai.OpenAI(
            base_url="https://YYYYYYYYYYYYY.anyscaleuserdata.com/v1",
            api_key=os.getenv('LLM_API_KEY')
        )

        # Embedding endpoint for vector embeddings
        self.openai_embedding = openai.OpenAI(
            base_url="https://XXXXXXXXXXXX.anyscaleuserdata.com/v1",
            api_key=os.getenv('EMBEDDING_API_KEY')
        )

    # FastAPI decorator defines the HTTP route for the RAG endpoint
    @fastapi.post("/rag_endpoint")
    async def rag_endpoint(self, query: QueryModel) -> Any:
        try:
            # Generate embedding for the input query
            embedding = self.openai_embedding.embeddings.create(
                model="thenlper/gte-large",
                input=query.query,
            )

            # Perform vector search in MongoDB to find relevant context
            context = list(self.collection.aggregate([
                {"$vectorSearch": {
                    "queryVector": embedding.data[0].embedding,
                    "index": "vector_index",
                    "numCandidates": 3,
                    "path": "embedding",
                    "limit": 1
                }},
                {
                    "$project": {
                        "_id": 0,
                        "embedding": 0
                    }
                }
            ]))

            # Construct prompt with context and query
            prompt = f"""Based on the following context please answer and elaborate the user query.
            context: {context}

            query: {query.query}
            Answer: """

            # Generate response using LLM
            chat_completion = self.openai_llm.chat.completions.create(
                model="meta-llama/Llama-2-70b-chat-hf",
                messages=[
                    {"role": "system", "content": "You are a helpful QA assistant"},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.7
            )

            answer = chat_completion.choices[0].message.content

            return answer

        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))

# Bind the deployment
my_app = MongoDBRag.bind()
```
This service utilizes Ray Serve integrated with FastAPI to connect MongoDB Atlas with the embedding model and the LLM services that we previously deployed.
- All state initialization (e.g., loading model weights, creating database and client connections) is defined in the __init__ method.
- HTTP routes are defined by wrapping a method with the corresponding FastAPI decorator, as we did with the rag_endpoint method.
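If it helps to see the pattern without the RAG details, the skeleton below is a minimal sketch of how Ray Serve and FastAPI fit together; the class and route names here are illustrative, not part of the tutorial's service.

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment        # makes the class a Ray Serve deployment
@serve.ingress(app)      # routes FastAPI requests to this deployment
class MyService:
    def __init__(self):
        # One-time state setup (clients, connections, model handles)
        self.greeting = "hello"

    @app.get("/greet")   # FastAPI decorator defines the HTTP route
    async def greet(self) -> str:
        return self.greeting

# bind() produces the application object that `serve run` deploys
my_app = MyService.bind()
```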
To generate a sample subset of data to test our RAG service, we suggest running the code snippet below (e.g., saved as generate_embeddings.py):

```python
from pymongo import MongoClient
import os
import openai
from dotenv import load_dotenv

client = MongoClient(os.getenv("MONGODB_ATLAS_URI"))
db = client["anyscale_db"]
collection = db["stories"]

# Embedding endpoint for vector embeddings
openai_embedding = openai.OpenAI(
    base_url="https://XXXXXXXXXXXX.anyscaleuserdata.com/v1",
    api_key=os.getenv('EMBEDDING_API_KEY')
)

def generate_data() -> str:
    # Sample data to be inserted into MongoDB
    data = [
        "Karl lives in NYC and works as a chef",
        "Rio lives in Brasil and works as a barman",
        "Jessica lives in Paris and works as a programmer."
    ]

    for info in data:
        # Generate embedding for each piece of information
        embedding = openai_embedding.embeddings.create(
            model="thenlper/gte-large",
            input=info,
        )

        # Update or insert the data with its embedding into MongoDB
        collection.update_one(
            {'story': info},
            {'$set': {'embedding': embedding.data[0].embedding}},
            upsert=True
        )

    return "Data Ingested Successfully"

generate_data()
```
Let's ingest the data to populate our stories collection with three sample documents and their embeddings:
```python
data = [
    "Karl lives in NYC and works as a chef",
    "Rio lives in Brasil and works as a barman",
    "Jessica lives in Paris and works as a programmer."
]
```
Use the following command to load the data:

```bash
python generate_embeddings.py
```
The collection is now populated with the stories and their embeddings.
Note: To scale out text embedding generation, you can follow the workspace template for computing text embeddings at scale. This template leverages best practices for batch inference of embedding models to maximize the throughput of the underlying heterogeneous cluster of CPUs and GPUs.
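For a sense of what that looks like in code, the sketch below uses Ray Data's map_batches to compute embeddings over a dataset in parallel. It is only an illustrative outline, not the template itself; the batch size, concurrency, and the reuse of the OpenAI-compatible embedding endpoint are assumptions.

```python
import os
import openai
import ray

# Build a Ray Dataset from the raw stories (could also come from files or a table)
ds = ray.data.from_items([
    {"story": "Karl lives in NYC and works as a chef"},
    {"story": "Rio lives in Brasil and works as a barman"},
    {"story": "Jessica lives in Paris and works as a programmer."},
])

class Embedder:
    def __init__(self):
        # One client per worker, pointed at the deployed embedding service
        self.client = openai.OpenAI(
            base_url="https://XXXXXXXXXXXX.anyscaleuserdata.com/v1",
            api_key=os.getenv("EMBEDDING_API_KEY"),
        )

    def __call__(self, batch):
        # Embed a whole batch of stories in a single API call
        resp = self.client.embeddings.create(
            model="thenlper/gte-large",
            input=list(batch["story"]),
        )
        batch["embedding"] = [d.embedding for d in resp.data]
        return batch

# map_batches runs Embedder instances in parallel across the cluster
embedded = ds.map_batches(Embedder, batch_size=64, concurrency=2, batch_format="pandas")
print(embedded.take(1))
```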
Now, run the service locally:
```bash
serve run main:my_app --non-blocking
```
The RAG endpoint is now active at /rag_endpoint:
```python
import requests

payload = {"query": "Main character list"}

# Perform the POST request
response = requests.post("http://localhost:8000/rag_endpoint", json=payload)

# Print the response in JSON format
print(response.json())
```
The expected output:
```
Based on the provided context, the main character list is:

1. Jessica
2. Rio
3. Karl
```
Now that you have tested the service by running it locally within the workspace, you can proceed to deploy it as a standalone Anyscale service, which is accessible to external users via a URL and bearer token.
```bash
anyscale services deploy main:my_app --name=my-rag-service
```
Once the command completes successfully, you will get the service endpoint URL and its bearer token for authorization:
```
(anyscale +2.0s) Starting new service 'my-rag-service'.
(anyscale +4.5s) Using workspace runtime dependencies env vars: {'MONGODB_ATLAS_URI': '<YOUR_URI>', 'LLM_API_KEY': '<YOURKEY>', 'EMBEDDING_API_KEY': '<YOURKEY>'}.
(anyscale +4.5s) Uploading local dir '.' to cloud storage.
(anyscale +5.5s) Including workspace-managed pip dependencies.
(anyscale +6.1s) Service 'my-rag-service' deployed.
(anyscale +6.1s) View the service in the UI: 'https://console.anyscale.com/services/service2_wx8ar3cbuf84g9u87g3yvx2cq8'
(anyscale +6.1s) Query the service once it's running using the following curl command:
(anyscale +6.1s) curl -H 'Authorization: Bearer xxxxxxx' https://my-rag-service-yyyy.xxxxxxxxxxxx.s.anyscaleuserdata.com/
```
The service you deployed will be visible under the Services tab on the Anyscale platform:
Once the service is healthy, you will be able to access it via the provided URL and credentials the same way that you did in the workspace:
```bash
curl -X POST \
  -H "Authorization: Bearer XXXX" \
  -H "Content-Type: application/json" \
  https://my-rag-service-yyyy.cld-yga7qm8f6yamqev7.s.anyscaleuserdata.com/rag_endpoint/ \
  --data-raw '{"query": "What are the main character names?" }'

"The main character names mentioned in the stories are Jessica, Karl and Rio."
```
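The same request can be made from Python, which is convenient if the RAG service will be called from another application. This is a small sketch reusing the placeholder service URL and bearer token from above.

```python
import requests

SERVICE_URL = "https://my-rag-service-yyyy.cld-yga7qm8f6yamqev7.s.anyscaleuserdata.com"
BEARER_TOKEN = "XXXX"  # the token printed by `anyscale services deploy`

response = requests.post(
    f"{SERVICE_URL}/rag_endpoint/",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    json={"query": "What are the main character names?"},
)
print(response.json())
```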
Integrating MongoDB with Anyscale and PyMongo enables powerful distributed data processing and analysis workflows. By combining MongoDB's flexible document model, the Atlas developer data platform's built-in scalability, and vector search capabilities with Anyscale's distributed computing platform, developers can build scalable RAG pipelines and AI-powered applications.
- MongoDB Atlas provides a managed database solution with vector indexing for semantic search.
- Anyscale offers a platform for running distributed Python applications using Ray.
- PyMongo allows easy interaction with MongoDB from Python.

Together, these technologies enable the building of performant RAG systems that can scale.
This tutorial walked through setting up a basic RAG service using MongoDB, Anyscale, and open-source language models. For production use cases, consider:
- Fine-tuning models on domain-specific data.
- Implementing caching and batching for improved performance (see the sketch after this list).
- Adding monitoring and observability.
- Scaling up compute resources as needed.
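As one concrete example for the caching point above, the sketch below memoizes query embeddings so repeated questions skip the embedding call. It is a simplistic, in-process illustration, assuming the same embedding client used throughout this tutorial; a production setup would more likely use a shared cache such as Redis or MongoDB itself.

```python
import os
from functools import lru_cache

import openai

embedding_client = openai.OpenAI(
    base_url="https://XXXXXXXXXXXX.anyscaleuserdata.com/v1",
    api_key=os.getenv("EMBEDDING_API_KEY"),
)

@lru_cache(maxsize=1024)
def embed_query(text: str) -> tuple:
    # Cached per process: identical queries hit the embedding service only once
    resp = embedding_client.embeddings.create(model="thenlper/gte-large", input=text)
    return tuple(resp.data[0].embedding)

# First call hits the service; the second is served from the cache
vector = embed_query("Main character list")
vector_again = embed_query("Main character list")
```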
Integrating MongoDB and Anyscale opens up exciting possibilities for building AI applications that can efficiently process and reason over large datasets. We encourage you to explore further and push the boundaries of what's possible with these powerful tools.