Using SuperDuperDB to Accelerate AI Development on MongoDB Atlas Vector Search

Duncan Blythe • 6 min read • Published Sep 18, 2024 • Updated Sep 18, 2024
Atlas • Vector Search • Python

Introduction

Are you interested in getting started with vector search and AI on MongoDB Atlas but don’t know where to start? The journey can be daunting; developers are confronted with questions such as:
  • Which model should I use?
  • Should I go with an open-source or a closed-source model?
  • How do I correctly apply my model to my data in Atlas to create vector embeddings?
  • How do I configure my Atlas vector search index correctly?
  • Should I chunk my text or apply a vectorizing model to the text directly?
  • How and where can I robustly serve my model to be ready for new searches, based on incoming text queries?
SuperDuperDB is an open-source Python project designed to accelerate AI development directly on your database and to help answer such questions. It lets developers focus on what they want to build without getting bogged down in the details of how vector search, and AI more generally, is implemented.
SuperDuperDB computes model outputs and trains models directly on the data in your database, and it has first-class support for vector search. In particular, SuperDuperDB supports both MongoDB Community and Atlas deployments.
You can follow along with the code below, but if you prefer, all of the code is available in the SuperDuperDB GitHub repository.

Getting started with SuperDuperDB

SuperDuperDB is super-easy to install using pip:
python -m pip install -U superduperdb[apis]
Once you’ve installed SuperDuperDB, you’re ready to connect to your MongoDB Atlas deployment:
from superduperdb import superduper

db = superduper("mongodb+srv://<user>:<password>@...mongodb.net/documents")
The trailing characters after the last “/” denote the database you’d like to connect to. In this case, the database is called "documents." You should make sure that the user is authorized to access this database.
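To double-check which database a connection string points at before connecting, the name can be read from the URI with Python's standard library. This is a minimal sketch: the helper name `database_from_uri` and the example host are illustrative assumptions, not part of SuperDuperDB.

```python
from urllib.parse import urlsplit


def database_from_uri(uri: str) -> str:
    """Return the database name encoded in the path of a MongoDB URI."""
    # urlsplit separates the path ("/documents") from any "?options" suffix.
    path = urlsplit(uri).path
    # Drop the leading "/" to leave only the database name.
    return path.lstrip("/")


# Hypothetical URI, for illustration only.
uri = "mongodb+srv://user:secret@cluster0.example.mongodb.net/documents?retryWrites=true"
print(database_from_uri(uri))
```

An empty result means the URI names no database, which is worth catching before handing the string to `superduper(...)`.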
The variable db is a connector that is simultaneously:
  • A database client.
  • An artifact store for AI models (stores large file objects).
  • A metadata store, storing important information about your models as they relate to the database.
  • A query interface allowing you to easily execute queries including vector search, without needing to explicitly handle the logic of converting the queries into vectors.

Connecting SuperDuperDB with AI models

Let’s see this in action.
With SuperDuperDB, developers can import model wrappers that support a variety of open-source projects as well as AI API providers, such as OpenAI. Developers may even define and program their own models.
For example, to create a vectorizing model using the OpenAI API, first set your OPENAI_API_KEY as an environment variable:
export OPENAI_API_KEY="sk-..."
Now, simply import the OpenAI model wrapper:
from superduperdb.ext.openai.model import OpenAIEmbedding

model = OpenAIEmbedding(
    identifier='text-embedding-ada-002', model='text-embedding-ada-002')
To check that this is working, you can apply the model to a single text snippet using the predict method, specifying that this is a single data point with one=True.
>>> model.predict('This is a test', one=True)
[-0.008146246895194054,
 -0.0036965329200029373,
 -0.0006024622125551105,
 -0.005724836140871048,
 -0.02455105632543564,
 0.01614714227616787,
...]
Alternatively, we can also use an open-source model (not behind an API), using, for instance, the sentence-transformers library:
import sentence_transformers
from superduperdb import vector
from superduperdb.components.model import Model

model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=vector(shape=(384,)),
    predict_method='encode',
    postprocess=lambda x: x.tolist(),
    batch_predict=True,
)
This code snippet uses the base Model wrapper, which supports arbitrary model class instances, using both open-sourced and in-house code. One simply supplies the class instance to the object parameter, optionally specifying preprocess and/or postprocess functions. The encoder argument tells Atlas Vector Search what size the outputs of the model are, and the batch_predict=True option makes computation quicker.
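To make the roles of these arguments concrete, here is a toy wrapper that mimics the flow described above: look up the named prediction method on the wrapped object, call it once per batch or once per item, then post-process each output. `ToyModel` and `LengthEncoder` are hypothetical names invented for this sketch, and the order of operations is an assumption; this is not how SuperDuperDB is implemented.

```python
from typing import Any, Callable, List, Optional


class ToyModel:
    """Hypothetical stand-in for a model wrapper (not SuperDuperDB code)."""

    def __init__(
        self,
        obj: Any,
        predict_method: str,
        postprocess: Optional[Callable[[Any], Any]] = None,
        batch_predict: bool = False,
    ) -> None:
        self.obj = obj
        self.predict_method = predict_method
        self.postprocess = postprocess
        self.batch_predict = batch_predict

    def predict(self, x: Any, one: bool = False) -> Any:
        inputs: List[Any] = [x] if one else list(x)
        method = getattr(self.obj, self.predict_method)
        if self.batch_predict:
            # One call for the whole batch.
            outputs = list(method(inputs))
        else:
            # One call per data point.
            outputs = [method([item])[0] for item in inputs]
        if self.postprocess is not None:
            outputs = [self.postprocess(out) for out in outputs]
        return outputs[0] if one else outputs


class LengthEncoder:
    """Toy 'model': maps each text to (character count, space count)."""

    def encode(self, texts: List[str]) -> List[tuple]:
        return [(len(t), t.count(" ")) for t in texts]


toy = ToyModel(LengthEncoder(), predict_method="encode",
               postprocess=list, batch_predict=True)
print(toy.predict("This is a test", one=True))
```

With a real embedding model, the batch path typically matters because one call over many texts is cheaper than many calls over one text each.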
As before, we can test the model:
>>> model.predict('This is a test', one=True)
[-0.008146246895194054,
 -0.0036965329200029373,
 -0.0006024622125551105,
 -0.005724836140871048,
 -0.02455105632543564,
 0.01614714227616787,
...]

Inserting and querying data via SuperDuperDB

Let’s add some data to MongoDB using the db connection. We’ve prepared some data from the PyMongo API to add a meta twist to this walkthrough. You can download this data with this command:
curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json
Then load the file and insert its records into the "documents" collection:
import json
from superduperdb.backends.mongodb.query import Collection
from superduperdb.base.document import Document as D

with open('pymongo.json') as f:
    data = json.load(f)

db.execute(
    Collection('documents').insert_many([D(r) for r in data])
)
You’ll see from this command that, in contrast to pymongo, superduperdb includes query objects (Collection(...)...). This allows superduperdb to pass the queries around to models, computations, and training runs, as well as save the queries for future use.
Other than this fact, superduperdb supports all of the commands that are supported by the core pymongo API.
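The idea of a query object, something that describes a query without running it, can be illustrated with a few lines of plain Python. The classes below (`ToyCollection`, `ToyQuery`) and the in-memory `store` are hypothetical stand-ins invented for this sketch; they only show why a deferred query can be saved, compared, and executed later.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ToyQuery:
    """A description of a find() call; building it touches no data."""
    collection: str
    criteria: Dict[str, Any]


class ToyCollection:
    def __init__(self, name: str) -> None:
        self.name = name

    def find(self, criteria: Optional[Dict[str, Any]] = None) -> ToyQuery:
        # Return a query object instead of hitting a database.
        return ToyQuery(self.name, dict(criteria or {}))


def execute(store: Dict[str, List[dict]], query: ToyQuery) -> List[dict]:
    """Run a saved query against an in-memory stand-in for a database."""
    docs = store.get(query.collection, [])
    return [
        d for d in docs
        if all(d.get(k) == v for k, v in query.criteria.items())
    ]


store = {
    "documents": [
        {"key": "find", "parent": None},
        {"key": "aggregate", "parent": None},
    ]
}
saved_query = ToyCollection("documents").find({"key": "find"})
print(saved_query)                  # the query is just data at this point
print(execute(store, saved_query))  # it only runs when executed
```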
Here is an example of fetching some data with SuperDuperDB:
>>> r = db.execute(Collection('documents').find_one())
>>> r
Document({
    'key': 'pymongo.mongo_client.MongoClient',
    'parent': None,
    'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
    'document': 'mongo_client.md',
    'res': 'pymongo.mongo_client.MongoClient',
    '_fold': 'train',
    '_id': ObjectId('652e460f6cc2a5f9cc21db4f')
})
You can see that the usual data from MongoDB is wrapped with the Document class.
You can recover the unwrapped document with unpack:
>>> r.unpack()
{'key': 'pymongo.mongo_client.MongoClient',
 'parent': None,
 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
 'document': 'mongo_client.md',
 'res': 'pymongo.mongo_client.MongoClient',
 '_fold': 'train',
 '_id': ObjectId('652e460f6cc2a5f9cc21db4f')}
The reason superduperdb uses the Document abstraction is that, in SuperDuperDB, you don't need to manage converting data to bytes yourself. We have a system of configurable and user-controlled types, or "Encoders," which allow users to insert, for example, images directly. (This is a topic of an upcoming tutorial!)

Configuring models to work with vector search on MongoDB Atlas using SuperDuperDB

Now that you have chosen and tested a model and inserted some data, you can configure vector search on MongoDB Atlas using SuperDuperDB. To do that, execute this command:
from superduperdb import VectorIndex
from superduperdb import Listener

db.add(
    VectorIndex(
        identifier='pymongo-docs',
        indexing_listener=Listener(
            model=model,
            key='value',
            select=Collection('documents').find(),
            predict_kwargs={'max_chunk_size': 1000},
        ),
    )
)
This command tells superduperdb to do several things:
  • Search the "documents" collection
  • Set up a vector index on our Atlas cluster, using the text in the "value" field (Listener)
  • Use the model variable to create vector embeddings
After receiving this command, SuperDuperDB:
  • Configures a MongoDB Atlas knn-index in the "documents" collection.
  • Saves the model object in the SuperDuperDB model store hosted on gridfs.
  • Applies model to all data in the "documents" collection, and saves the vectors in the documents.
  • Saves the fact that the model is connected to the "pymongo-docs" vector index.
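The third step above, applying the model to every document and saving the vectors back into the documents, can be pictured with an in-memory sketch. The function `apply_listener`, the `toy_embed` function, and the model name "toy-model" are all hypothetical; a real deployment writes to MongoDB rather than to a Python list.

```python
from typing import Callable, Dict, List


def apply_listener(
    docs: List[Dict],
    key: str,
    model_id: str,
    embed: Callable[[str], List[float]],
) -> List[Dict]:
    """Embed docs[i][key] and store the result under _outputs.<key>.<model_id>."""
    for doc in docs:
        vector = embed(doc[key])
        # Create the nested dictionaries on first use, then store the vector.
        doc.setdefault("_outputs", {}).setdefault(key, {})[model_id] = vector
    return docs


def toy_embed(text: str) -> List[float]:
    """Toy 2-dimensional 'embedding': character count and space count."""
    return [float(len(text)), float(text.count(" "))]


docs = [
    {"key": "find", "value": "Query the database."},
    {"key": "alive", "value": "Does this cursor have more data?"},
]
apply_listener(docs, key="value", model_id="toy-model", embed=toy_embed)
print(docs[0]["_outputs"])
```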
If you’d like to “reload” your model in a later session, you can do this with the load command:
>>> db.load("model", 'all-MiniLM-L6-v2')
To look at what happened during the creation of the VectorIndex, we can see that the individual documents now contain vectors:
>>> db.execute(Collection('documents').find_one()).unpack()
{'key': 'pymongo.mongo_client.MongoClient',
 'parent': None,
 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
 'document': 'mongo_client.md',
 'res': 'pymongo.mongo_client.MongoClient',
 '_fold': 'train',
 '_id': ObjectId('652e460f6cc2a5f9cc21db4f'),
 '_outputs': {'value': {'text-embedding-ada-002': [-0.024740776047110558,
    0.013489063829183578,
    0.021334229037165642,
    -0.03423869237303734,
    ...]}}}
The outputs of models are always saved in the "_outputs.<key>.<model>" path of the documents. This allows MongoDB Atlas Vector Search to know where to look to create the fast vector lookup index.
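A dotted path such as "_outputs.<key>.<model>" is read by walking one nested field per segment. The helper below is a minimal sketch of that walk over a plain Python dictionary; `get_by_path` and the toy document are assumptions made for illustration. It would not work for a model identifier that itself contains a dot.

```python
from typing import Any, Dict


def get_by_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dictionaries."""
    current: Any = doc
    for part in path.split("."):
        # Each segment selects one level of nesting.
        current = current[part]
    return current


# Toy document with a made-up 3-dimensional vector.
doc = {
    "key": "find",
    "_outputs": {"value": {"toy-model": [0.1, 0.2, 0.3]}},
}
path = "_outputs.value.toy-model"
print(get_by_path(doc, path))
```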
You can also verify that MongoDB Atlas has created a knn vector search index by logging in to your Atlas account and navigating to the search tab. It will look like this:
[Image: The MongoDB Atlas UI, showing a list of indexes attached to the documents collection.]
The green ACTIVE status indicates that MongoDB Atlas has finished processing and “organizing” the vectors so that they can be searched quickly.
If you navigate to the “...” sign under Actions and click “Edit with JSON Editor,” you can inspect the explicit index definition that was automatically configured by superduperdb:
[Image: The MongoDB Atlas cluster UI, showing the vector search index details.]
You can confirm from this definition that the index looks into the "_outputs.<key>.<model>" path of the documents in our collection.
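A vector index exists so that a search does not have to compare the query against every stored vector. For contrast, here is what the unindexed, brute-force version looks like in plain Python, using cosine similarity as the measure. The function names and toy vectors are assumptions for illustration, and the similarity function Atlas uses depends on how the index is configured.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    n: int,
) -> List[str]:
    """Return the ids of the n stored vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n]]


# Toy 2-dimensional vectors keyed by document id.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(brute_force_search([1.0, 0.1], vectors, n=2))
```

Every query here scans all stored vectors, so the cost grows linearly with the size of the collection. That is the work an index is meant to cut down.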

Querying vector search with a high-level API with SuperDuperDB

Now that our index is ready to go, we can perform some “search-by-meaning” queries using the db connection:
>>> query = 'Query the database'
>>> result = db.execute(
...    Collection('documents')
...        .like(D({'value': query}), vector_index='pymongo-docs', n=5)
...        .find({}, {'value': 1, 'key': 1})
... )
>>> for r in result:
...    print(r.unpack())

{'key': 'find', 'value': '\nQuery the database.\n\nThe filter argument is a query document that all results\nmust match. For example:\n\n`pycon\n>>> db'}
{'key': 'database_name', 'value': '\nThe name of the database this command was run against.\n\n'}
{'key': 'aggregate', 'value': '\nPerform a database-level aggregation.\n\nSee the [aggregation pipeline](https://mongodb.prakticum-team.ru/docs/manual/reference/operato'}
{'key': 'alive', 'value': '\nDoes this cursor have the potential to return more data?\n\n'}
{'key': 'pymongo.cursor.CursorType', 'value': '\n'}
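For comparison, a vector query written by hand against Atlas needs an already-computed query vector and an aggregation pipeline. The sketch below only builds such a pipeline as a Python list, using the documented `$vectorSearch` stage fields. Whether SuperDuperDB issues this stage or a different one internally is not stated in this article, and the helper name, the toy path, and the default `num_candidates` are assumptions.

```python
from typing import Dict, List, Sequence


def vector_search_pipeline(
    index: str,
    path: str,
    query_vector: Sequence[float],
    n: int,
    num_candidates: int = 100,
) -> List[Dict]:
    """Build an aggregation pipeline with a $vectorSearch stage."""
    if num_candidates < n:
        raise ValueError("num_candidates must be at least n")
    return [
        {
            "$vectorSearch": {
                "index": index,                # name of the vector index
                "path": path,                  # field holding the vectors
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": n,                    # number of results to return
            }
        },
        # Keep only the fields shown in the example output above.
        {"$project": {"_id": 0, "key": 1, "value": 1}},
    ]


# Toy 2-dimensional query vector; a real one comes from the embedding model.
pipeline = vector_search_pipeline(
    index="pymongo-docs",
    path="_outputs.value.toy-model",
    query_vector=[0.1, 0.2],
    n=5,
)
print(pipeline[0]["$vectorSearch"]["limit"])
```

The high-level `.like(...)` call hides both steps: turning the query text into a vector and assembling the search request.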
🚀 So that’s it! 🚀
You’ve now queried a vector search index on MongoDB Atlas Vector Search using a model and setup installed with SuperDuperDB. This required only a few key commands in Python, utilizing model libraries and API clients from the Python open-source ecosystem!
superduperdb has lots more to offer:
  • Developers can bring their own model, or install arbitrary models from the open-source ecosystem.
  • The cohere and anthropic APIs are also supported, in addition to openai.
  • Developers may also search through images and videos.
  • Other use cases, in addition to vanilla vector search, are supported:
    • Chat with your docs
    • Classical machine learning
    • Transfer learning
    • Vector search with arbitrary data-types
    • Much, much more…

Contributors are welcome!

SuperDuperDB is open source and permissively licensed under the Apache 2.0 license. We would like to encourage developers interested in open-source development to contribute to our discussion forums and issue boards and make their own pull requests. We'll see you on GitHub!

Become a Design Partner!

We are looking for visionary organizations we can help to identify and implement transformative AI applications for their business and products. We're offering this absolutely for free. If you would like to learn more about this opportunity, please reach out to us via email: partnerships@superduperdb.com.