Using SuperDuperDB to Accelerate AI Development on MongoDB Atlas Vector Search
Duncan Blythe • 6 min read • Published Sep 18, 2024 • Updated Sep 18, 2024
Are you interested in getting started with vector search and AI on MongoDB Atlas but don’t know where to start? The journey can be daunting; developers are confronted with questions such as:
- Which model should I use?
- Should I go with an open-source or a closed-source model?
- How do I correctly apply my model to my data in Atlas to create vector embeddings?
- How do I configure my Atlas vector search index correctly?
- Should I chunk my text or apply a vectorizing model to the text directly?
- How and where can I robustly serve my model to be ready for new searches, based on incoming text queries?
SuperDuperDB is an open-source Python project designed to accelerate AI development with the database and assist in answering such questions, allowing developers to focus on what they want to build, without getting bogged down in the details of exactly how vector search and AI more generally are implemented.
SuperDuperDB computes model outputs and trains models directly on the data in your database, and it offers first-class support for vector search. In particular, SuperDuperDB supports MongoDB Community and Atlas deployments.
You can follow along with the code below, but if you prefer, all of the code is available in the SuperDuperDB GitHub repository.
    python -m pip install -U superduperdb[apis]
Once you’ve installed SuperDuperDB, you’re ready to connect to your MongoDB Atlas deployment:
    from superduperdb import superduper

    db = superduper("mongodb+srv://<user>:<password>@...mongodb.net/documents")
The trailing characters after the last “/” denote the database you’d like to connect to. In this case, the database is called "documents." You should make sure that the user is authorized to access this database.
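As a quick sanity check before connecting, the database name can be read back out of a connection string with the standard library. This is a minimal sketch, independent of SuperDuperDB; the helper name `database_from_uri` and the example host and credentials are placeholders made up for illustration.

```python
from urllib.parse import urlsplit


def database_from_uri(uri: str) -> str:
    """Return the database name, i.e. the path segment after the last '/'."""
    # urlsplit separates the path from any '?option=value' query string.
    path = urlsplit(uri).path
    name = path.lstrip("/")
    if not name:
        raise ValueError("connection string does not name a database")
    return name


# Placeholder credentials and host, for illustration only.
example_uri = (
    "mongodb+srv://user:secret@cluster0.example.mongodb.net/documents"
    "?retryWrites=true"
)
print(database_from_uri(example_uri))
```

If the helper raises, the connection string ends at the host and no database was named.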
The variable db is a connector that is simultaneously:
- A database client.
- An artifact store for AI models (stores large file objects).
- A metadata store, storing important information about your models as they relate to the database.
- A query interface allowing you to easily execute queries including vector search, without needing to explicitly handle the logic of converting the queries into vectors.
Let’s see this in action.
With SuperDuperDB, developers can import model wrappers that support a variety of open-source projects as well as AI API providers, such as OpenAI. Developers may even define and program their own models.
For example, to create a vectorizing model using the OpenAI API, first set your OPENAI_API_KEY as an environment variable:
    export OPENAI_API_KEY="sk-..."
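Before creating the model, it can help to confirm from Python that the variable is actually visible to the process. The sketch below uses only the standard library; `require_env` and `mask` are hypothetical helper names, not part of SuperDuperDB or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the model")
    return value


def mask(secret: str, visible: int = 3) -> str:
    """Hide all but the first few characters so the key is safe to print."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)
```

For example, `print(mask(require_env("OPENAI_API_KEY")))` either prints a masked key or raises immediately, which is easier to diagnose than an authentication error later on.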
Now, simply import the OpenAI model wrapper:
    from superduperdb.ext.openai.model import OpenAIEmbedding

    model = OpenAIEmbedding(
        identifier='text-embedding-ada-002',
        model='text-embedding-ada-002',
    )
To check this is working, you can apply this model to a single text snippet using the predict method, specifying that this is a single data point with one=True:
    model.predict('This is a test', one=True)

    [-0.008146246895194054,
     -0.0036965329200029373,
     -0.0006024622125551105,
     -0.005724836140871048,
     -0.02455105632543564,
     0.01614714227616787,
     ...]
Alternatively, we can also use an open-source model (not behind an API), using, for instance, the sentence-transformers library:
    import sentence_transformers
    from superduperdb.components.model import Model
    from superduperdb import vector

    model = Model(
        identifier='all-MiniLM-L6-v2',
        object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
        encoder=vector(shape=(384,)),      # the model outputs 384-dimensional vectors
        predict_method='encode',           # method of the wrapped object used for prediction
        postprocess=lambda x: x.tolist(),  # convert array outputs to plain Python lists
        batch_predict=True,
    )
This code snippet uses the base Model wrapper, which supports arbitrary model class instances from both open-source and in-house code. One simply supplies the class instance to the object parameter, optionally specifying preprocess and/or postprocess functions. The encoder argument tells Atlas Vector Search what size the outputs of the model are (here, vectors of length 384), and the batch_predict=True option makes computation quicker by passing inputs to the model in batches.
As before, we can test the model:
    model.predict('This is a test', one=True)

    [-0.008146246895194054,
     -0.0036965329200029373,
     -0.0006024622125551105,
     -0.005724836140871048,
     -0.02455105632543564,
     0.01614714227616787,
     ...]
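To make the roles of object, predict_method, preprocess, postprocess, and batch_predict more concrete, here is a small self-contained sketch of how a wrapper of this kind might dispatch a prediction. It is an illustration under assumptions, not SuperDuperDB's actual Model implementation: `SketchModel` and `ToyEncoderModel` are hypothetical names, and the toy "embedding" is just a character count and a word count.

```python
from typing import Any, Callable, List, Optional, Tuple


class ToyEncoderModel:
    """Stand-in for a real model object such as a SentenceTransformer."""

    def encode_one(self, text: str) -> Tuple[float, float]:
        # Two toy features: character count and word count.
        return (float(len(text)), float(len(text.split())))

    def encode(self, texts: List[str]) -> List[Tuple[float, float]]:
        # Batch version: a single call handles many inputs.
        return [self.encode_one(t) for t in texts]


class SketchModel:
    """Hypothetical wrapper; not SuperDuperDB's actual Model class."""

    def __init__(
        self,
        obj: Any,
        predict_method: str,
        preprocess: Optional[Callable[[Any], Any]] = None,
        postprocess: Optional[Callable[[Any], Any]] = None,
        batch_predict: bool = False,
    ) -> None:
        self.obj = obj
        self.predict_method = predict_method
        self.preprocess = preprocess
        self.postprocess = postprocess
        self.batch_predict = batch_predict

    def predict(self, x: Any, one: bool = False) -> Any:
        inputs = [x] if one else list(x)
        if self.preprocess is not None:
            inputs = [self.preprocess(i) for i in inputs]
        method = getattr(self.obj, self.predict_method)
        if self.batch_predict:
            outputs = list(method(inputs))  # one call for the whole batch
        else:
            outputs = [method(i) for i in inputs]  # one call per input
        if self.postprocess is not None:
            outputs = [self.postprocess(o) for o in outputs]
        return outputs[0] if one else outputs


model_sketch = SketchModel(
    ToyEncoderModel(),
    predict_method="encode",
    preprocess=str.strip,
    postprocess=list,  # plays the role of x.tolist() in the snippet above
    batch_predict=True,
)
print(model_sketch.predict("  This is a test ", one=True))
```

In this sketch, the batch path makes one call to the wrapped object regardless of how many inputs there are, which is where the speed-up from batching would come from with a real model.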
Let’s add some data to MongoDB using the db connection. We’ve prepared some data from the PyMongo API to add a meta twist to this walkthrough. You can download this data with this command:
    curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json
    import json
    from superduperdb.backends.mongodb.query import Collection
    from superduperdb.base.document import Document as D

    with open('pymongo.json') as f:
        data = json.load(f)

    db.execute(
        Collection('documents').insert_many([D(r) for r in data])
    )
You’ll see from this command that, in contrast to pymongo, superduperdb includes query objects (Collection(...)...). This allows superduperdb to pass the queries around to models, computations, and training runs, as well as save the queries for future use.
Other than this fact, superduperdb supports all of the commands that are supported by the core pymongo API.
Here is an example of fetching some data with SuperDuperDB:
    r = db.execute(Collection('documents').find_one())
    r

    Document({
        'key': 'pymongo.mongo_client.MongoClient',
        'parent': None,
        'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
        'document': 'mongo_client.md',
        'res': 'pymongo.mongo_client.MongoClient',
        '_fold': 'train',
        '_id': ObjectId('652e460f6cc2a5f9cc21db4f')
    })
You can see that the usual data from MongoDB is wrapped with the Document class.
You can recover the unwrapped document with unpack:
    r.unpack()

    {'key': 'pymongo.mongo_client.MongoClient',
     'parent': None,
     'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
     'document': 'mongo_client.md',
     'res': 'pymongo.mongo_client.MongoClient',
     '_fold': 'train',
     '_id': ObjectId('652e460f6cc2a5f9cc21db4f')}
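The query objects used in these snippets (the Collection(...) expressions passed to db.execute) are built first and only run later. The following self-contained sketch illustrates that general pattern with an in-memory stand-in. `SketchQuery` and `SketchDB` are hypothetical names and bear no relation to how SuperDuperDB implements its queries internally.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SketchQuery:
    """Hypothetical query object: records what to run instead of running it."""

    collection: str
    method: str = ""
    args: Tuple[Any, ...] = ()

    def find(self, flt: Optional[Dict[str, Any]] = None) -> "SketchQuery":
        return SketchQuery(self.collection, "find", (flt or {},))

    def find_one(self, flt: Optional[Dict[str, Any]] = None) -> "SketchQuery":
        return SketchQuery(self.collection, "find_one", (flt or {},))


class SketchDB:
    """In-memory stand-in for a database; execute() interprets a query object."""

    def __init__(self, data: Dict[str, List[Dict[str, Any]]]) -> None:
        self.data = data

    def execute(self, query: SketchQuery) -> Any:
        if query.method not in ("find", "find_one"):
            raise ValueError(f"unsupported method: {query.method!r}")
        flt = query.args[0]
        matches = [
            doc
            for doc in self.data[query.collection]
            if all(doc.get(k) == v for k, v in flt.items())
        ]
        if query.method == "find_one":
            return matches[0] if matches else None
        return matches


db_sketch = SketchDB(
    {
        "documents": [
            {"key": "find", "document": "collection.md"},
            {"key": "alive", "document": "cursor.md"},
        ]
    }
)

# The query is plain data until it is executed, so it can be stored or passed on.
saved = SketchQuery("documents").find({"document": "cursor.md"})
print(db_sketch.execute(saved))
```

Because `saved` is an ordinary value, it could be handed to another component (for example, something that applies a model to the selected documents) and executed there.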
The reason superduperdb uses the Document abstraction is that, in SuperDuperDB, you don't need to manage converting data to bytes yourself. We have a system of configurable and user-controlled types, or "Encoders," which allow users to insert, for example, images directly. (This is a topic of an upcoming tutorial!)
Now that you have chosen and tested a model and inserted some data, you may configure vector search on MongoDB Atlas using SuperDuperDB. To do that, execute this command:
    from superduperdb import VectorIndex
    from superduperdb import Listener

    db.add(
        VectorIndex(
            identifier='pymongo-docs',
            indexing_listener=Listener(
                model=model,                             # model that creates the embeddings
                key='value',                             # document field containing the text
                select=Collection('documents').find(),   # documents to apply the model to
                predict_kwargs={'max_chunk_size': 1000},
            ),
        )
    )
This command tells superduperdb to do several things:
- Search the "documents" collection
- Set up a vector index on our Atlas cluster, using the text in the "value" field (Listener)
- Use the model variable to create vector embeddings
After receiving this command, SuperDuperDB:
- Configures a MongoDB Atlas knn-index in the "documents" collection.
- Saves the model object in the SuperDuperDB model store hosted on GridFS.
- Applies model to all data in the "documents" collection, and saves the vectors in the documents.
- Saves the fact that the model is connected to the "pymongo-docs" vector index.
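The third step in this list, applying the model to every selected document and saving the vectors in the documents, can be pictured with plain dictionaries. This is a rough sketch of the idea only: `apply_listener` and `toy_embed` are hypothetical names, the "embedding" is a placeholder, and the real work in SuperDuperDB happens against the database rather than an in-memory list.

```python
from typing import Any, Callable, Dict, List


def apply_listener(
    documents: List[Dict[str, Any]],
    key: str,
    model_id: str,
    predict: Callable[[Any], List[float]],
) -> List[Dict[str, Any]]:
    """Store predict(doc[key]) under doc['_outputs'][key][model_id] for each document."""
    for doc in documents:
        if key not in doc:
            continue  # nothing to embed for this document
        outputs = doc.setdefault("_outputs", {})
        outputs.setdefault(key, {})[model_id] = predict(doc[key])
    return documents


def toy_embed(text: str) -> List[float]:
    # Placeholder embedding: [number of characters, number of words].
    return [float(len(text)), float(len(text.split()))]


docs = [
    {"key": "find", "value": "Query the database."},
    {"key": "no_text"},
]
apply_listener(docs, key="value", model_id="toy-model", predict=toy_embed)
print(docs[0]["_outputs"])
```

The nesting by key and then by model identifier means several models could write outputs for the same field without overwriting each other.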
If you’d like to “reload” your model in a later session, you can do this with the load command:
    db.load("model", 'all-MiniLM-L6-v2')
To see what happened during the creation of the VectorIndex, we can fetch a document again; the individual documents now contain vectors:
    db.execute(Collection('documents').find_one()).unpack()

    {'key': 'pymongo.mongo_client.MongoClient',
     'parent': None,
     'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
     'document': 'mongo_client.md',
     'res': 'pymongo.mongo_client.MongoClient',
     '_fold': 'train',
     '_id': ObjectId('652e460f6cc2a5f9cc21db4f'),
     '_outputs': {'value': {'text-embedding-ada-002': [-0.024740776047110558,
        0.013489063829183578,
        0.021334229037165642,
        -0.03423869237303734,
        ...]}}}
The outputs of models are always saved in the "_outputs.<key>.<model>" path of the documents. This allows MongoDB Atlas Vector Search to know where to look to create the fast vector lookup index.
You can also verify that MongoDB Atlas has created a knn vector search index by logging in to your Atlas account and navigating to the search tab, where the index is listed. The green ACTIVE status indicates that MongoDB Atlas has finished processing and “organizing” the vectors so that they may be searched quickly.
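For orientation, an Atlas Search index definition that places a knnVector field at a nested path such as "_outputs.<key>.<model>" has roughly the shape built below. This is a sketch under assumptions: the definition that superduperdb actually generates is not reproduced here, the helper name `knn_index_definition` is hypothetical, and the "cosine" similarity and "dynamic" setting are illustrative choices. The dimension 1536 is the output size of text-embedding-ada-002.

```python
import json
from typing import Any, Dict


def knn_index_definition(
    path: str, dimensions: int, similarity: str = "cosine"
) -> Dict[str, Any]:
    """Build a nested search-index mapping with a knnVector field at a dotted path."""
    parts = path.split(".")
    # Innermost field: the vector itself.
    node: Dict[str, Any] = {
        "type": "knnVector",
        "dimensions": dimensions,
        "similarity": similarity,
    }
    # Wrap the vector in one 'document' level per path segment, from the inside out.
    for name in reversed(parts):
        node = {"type": "document", "fields": {name: node}}
    # The fields of the outermost wrapper become the top-level mapping.
    return {"mappings": {"dynamic": True, "fields": node["fields"]}}


definition = knn_index_definition("_outputs.value.text-embedding-ada-002", 1536)
print(json.dumps(definition, indent=2))
```

Note that splitting on "." assumes that neither the key nor the model identifier contains a dot.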
If you navigate to the “...” sign on Actions and click edit with JSON editor, then you can inspect the explicit index definition which was automatically configured by superduperdb. You can confirm from this definition that the index looks into the "_outputs.<key>.<model>" path of the documents in our collection.
Now that our index is ready to go, we can perform some “search-by-meaning” queries using the db connection:
    query = 'Query the database'
    result = db.execute(
        Collection('documents')
            # vector search: find the 5 documents whose 'value' is closest in meaning to the query
            .like(D({'value': query}), vector_index='pymongo-docs', n=5)
            # projection: return only the 'value' and 'key' fields
            .find({}, {'value': 1, 'key': 1})
    )
    for r in result:
        print(r.unpack())

    {'key': 'find', 'value': '\nQuery the database.\n\nThe filter argument is a query document that all results\nmust match. For example:\n\n`pycon\n>>> db'}
    {'key': 'database_name', 'value': '\nThe name of the database this command was run against.\n\n'}
    {'key': 'aggregate', 'value': '\nPerform a database-level aggregation.\n\nSee the [aggregation pipeline](https://mongodb.prakticum-team.ru/docs/manual/reference/operato'}
    {'key': 'alive', 'value': '\nDoes this cursor have the potential to return more data?\n\n'}
    {'key': 'pymongo.cursor.CursorType', 'value': '\n'}
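Conceptually, a "search-by-meaning" query turns the query text into a vector and returns the documents whose stored vectors are closest to it. The sketch below shows that idea with an exact, brute-force cosine-similarity ranking over hand-written toy vectors. It is not how Atlas Vector Search works internally (Atlas uses a dedicated index rather than scanning every document), and the names `cosine` and `like_sketch` as well as the "toy-model" vectors are made up for illustration.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def like_sketch(
    documents: List[Dict[str, Any]],
    query_vector: Sequence[float],
    path: Sequence[str],
    n: int,
) -> List[Dict[str, Any]]:
    """Return the n documents whose vector at `path` is most similar to the query."""
    scored = []
    for doc in documents:
        vec: Any = doc
        for part in path:  # walk down to the stored vector
            vec = vec[part]
        scored.append((cosine(query_vector, vec), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n]]


# Hand-written 3-dimensional vectors standing in for real embeddings.
toy_docs = [
    {"key": "find", "_outputs": {"value": {"toy-model": [0.9, 0.1, 0.0]}}},
    {"key": "alive", "_outputs": {"value": {"toy-model": [0.0, 0.2, 0.9]}}},
    {"key": "aggregate", "_outputs": {"value": {"toy-model": [0.6, 0.6, 0.1]}}},
]
query_vector = [1.0, 0.0, 0.0]  # in practice, the model's embedding of the query text

top = like_sketch(toy_docs, query_vector, ("_outputs", "value", "toy-model"), n=2)
print([doc["key"] for doc in top])
```

With SuperDuperDB, the step of embedding the query text is handled for you, which is why the query above passes the raw string rather than a vector.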
🚀 So that’s it! 🚀
You’ve now queried a vector search index on MongoDB Atlas Vector Search using a model and setup installed with SuperDuperDB. This required only a few key commands in Python, utilizing model libraries and API clients from the Python open-source ecosystem!
superduperdb has lots more to offer:
- Developers can bring their own model, or install arbitrary models from the open-source ecosystem.
- The Cohere and Anthropic APIs are also supported, in addition to OpenAI.
- Developers may also search through images and videos.
- Other use cases, in addition to vanilla vector search, are supported:
  - Chat with your docs
  - Classical machine learning
  - Transfer learning
  - Vector search with arbitrary data-types
- Much, much more…
SuperDuperDB is open source and permissively licensed under the Apache 2.0 license. We would like to encourage developers interested in open-source development to contribute to our discussion forums and issue boards and make their own pull requests. We'll see you on GitHub!
We are looking for visionary organizations we can help to identify and implement transformative AI applications for their business and products. We're offering this absolutely for free. If you would like to learn more about this opportunity, please reach out to us via email: partnerships@superduperdb.com.