Using SuperDuperDB to Accelerate AI Development on MongoDB Atlas Vector Search
Duncan Blythe • 6 min read • Published Sep 18, 2024 • Updated Sep 18, 2024
Are you interested in getting started with vector search and AI on MongoDB Atlas but don’t know where to start? The journey can be daunting; developers are confronted with questions such as:
- Which model should I use?
- Should I go with an open-source or a closed-source model?
- How do I correctly apply my model to my data in Atlas to create vector embeddings?
- How do I configure my Atlas vector search index correctly?
- Should I chunk my text or apply a vectorizing model to the text directly?
- How and where can I robustly serve my model to be ready for new searches, based on incoming text queries?
SuperDuperDB is an open-source Python project designed to accelerate AI development with the database and assist in answering such questions, allowing developers to focus on what they want to build, without getting bogged down in the details of exactly how vector search and AI more generally are implemented.
SuperDuperDB includes computation of model outputs and model training which directly work with data in your database, as well as first-class support for vector search. In particular, SuperDuperDB supports MongoDB community and Atlas deployments.
You can follow along with the code below, but if you prefer, all of the code is available in the SuperDuperDB GitHub repository.

First, install SuperDuperDB with `pip`:

```shell
python -m pip install -U 'superduperdb[apis]'
```
Once you’ve installed SuperDuperDB, you’re ready to connect to your MongoDB Atlas deployment:
```python
from superduperdb import superduper

db = superduper("mongodb+srv://<user>:<password>@...mongodb.net/documents")
```
The trailing characters after the last “/” denote the database you’d like to connect to. In this case, the database is called "documents." You should make sure that the user is authorized to access this database.
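The same convention can be checked with Python’s standard library. The connection string below is a made-up example, not a real cluster:

```python
from urllib.parse import urlparse

# Hypothetical connection string -- the path after the final "/" names the database.
uri = "mongodb+srv://user:secret@cluster0.example.mongodb.net/documents"

database_name = urlparse(uri).path.lstrip("/")
print(database_name)  # documents
```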
The variable `db` is a connector that is simultaneously:
- A database client.
- An artifact store for AI models (stores large file objects).
- A meta-data store, storing important information about your models as they relate to the database.
- A query interface allowing you to easily execute queries including vector search, without needing to explicitly handle the logic of converting the queries into vectors.
Let’s see this in action.
With SuperDuperDB, developers can import model wrappers that support a variety of open-source projects as well as AI API providers, such as OpenAI. Developers may even define and program their own models.
For example, to create a vectorizing model using the OpenAI API, first set your `OPENAI_API_KEY` as an environment variable:

```shell
export OPENAI_API_KEY="sk-..."
```
Now, simply import the OpenAI model wrapper:
```python
from superduperdb.ext.openai.model import OpenAIEmbedding

model = OpenAIEmbedding(
    identifier='text-embedding-ada-002', model='text-embedding-ada-002')
```
To check this is working, you can apply this model to a single text snippet using the `predict` method, specifying that this is a single data point with `one=True`:

```python
model.predict('This is a test', one=True)
```

```
[-0.008146246895194054,
 -0.0036965329200029373,
 -0.0006024622125551105,
 -0.005724836140871048,
 -0.02455105632543564,
 0.01614714227616787,
 ...]
```
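Under the hood, vector search ranks documents by the similarity of such embedding vectors. For intuition, here is a minimal cosine-similarity sketch in pure Python (the three-dimensional vectors are toy stand-ins; `text-embedding-ada-002` actually produces 1,536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings", not real model outputs.
query = [0.1, 0.9, 0.2]
doc_close = [0.12, 0.85, 0.25]
doc_far = [0.9, -0.1, 0.3]

print(cosine_similarity(query, doc_close))  # close to 1.0
print(cosine_similarity(query, doc_far))    # much lower
```

A vector index exists to answer the question "which stored vectors have the highest similarity to this query vector?" without scanning every document.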
Alternatively, we can also use an open-source model (not behind an API), using, for instance, the `sentence-transformers` library:

```python
import sentence_transformers

from superduperdb import vector
from superduperdb.components.model import Model

model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=vector(shape=(384,)),
    predict_method='encode',
    postprocess=lambda x: x.tolist(),
    batch_predict=True,
)
```
This code snippet uses the base `Model` wrapper, which supports arbitrary model class instances, using both open-source and in-house code. One simply supplies the class instance to the `object` parameter, optionally specifying `preprocess` and/or `postprocess` functions. The `encoder` argument tells Atlas Vector Search what size the outputs of the model are, and the `batch_predict=True` option makes computation quicker.

As before, we can test the model:

```python
model.predict('This is a test', one=True)
```

```
[-0.008146246895194054,
 -0.0036965329200029373,
 -0.0006024622125551105,
 -0.005724836140871048,
 -0.02455105632543564,
 0.01614714227616787,
 ...]
```
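The dispatch idea behind such a generic wrapper can be sketched in a few lines. This is a toy stand-in to illustrate how `object`, `predict_method`, and `postprocess` fit together, not SuperDuperDB’s actual implementation:

```python
class TinyWrapper:
    """Toy sketch of a generic model wrapper (not SuperDuperDB's real Model class)."""

    def __init__(self, object, predict_method='__call__', postprocess=None):
        self.object = object
        self.predict_method = predict_method
        self.postprocess = postprocess or (lambda x: x)

    def predict(self, x):
        # Look up the configured method on the wrapped object, call it,
        # then post-process the raw output.
        fn = getattr(self.object, self.predict_method)
        return self.postprocess(fn(x))


class FakeEncoder:
    """Stand-in for a sentence-transformers model; emits a fixed-size 'vector'."""

    def encode(self, text):
        return [float(len(text)), 0.0, 1.0]


toy_model = TinyWrapper(
    FakeEncoder(),
    predict_method='encode',
    postprocess=lambda v: [round(x, 3) for x in v],
)
print(toy_model.predict('This is a test'))  # [14.0, 0.0, 1.0]
```

Because the wrapper only relies on `getattr`, any object with a suitable method (here `encode`) can be plugged in without subclassing.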
Let’s add some data to MongoDB using the `db` connection. We’ve prepared some data from the PyMongo API to add a meta twist to this walkthrough. You can download this data with this command:

```shell
curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json
```
```python
import json

from superduperdb.backends.mongodb.query import Collection
from superduperdb.base.document import Document as D

with open('pymongo.json') as f:
    data = json.load(f)

db.execute(
    Collection('documents').insert_many([D(r) for r in data])
)
```
You’ll see from this command that, in contrast to `pymongo`, `superduperdb` includes query objects (`Collection(...)`). This allows `superduperdb` to pass the queries around to models, computations, and training runs, as well as save the queries for future use.

Other than this fact, `superduperdb` supports all of the commands that are supported by the core `pymongo` API.

Here is an example of fetching some data with SuperDuperDB:
```python
r = db.execute(Collection('documents').find_one())
r
```

```
Document({
  'key': 'pymongo.mongo_client.MongoClient',
  'parent': None,
  'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
  'document': 'mongo_client.md',
  'res': 'pymongo.mongo_client.MongoClient',
  '_fold': 'train',
  '_id': ObjectId('652e460f6cc2a5f9cc21db4f')
})
```
You can see that the usual data from MongoDB is wrapped with the `Document` class. You can recover the unwrapped document with `unpack`:

```python
r.unpack()
```

```
{'key': 'pymongo.mongo_client.MongoClient',
 'parent': None,
 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
 'document': 'mongo_client.md',
 'res': 'pymongo.mongo_client.MongoClient',
 '_fold': 'train',
 '_id': ObjectId('652e460f6cc2a5f9cc21db4f')}
```
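The wrapping-and-unpacking pattern itself is simple. Here is a toy sketch of the idea (not the real `superduperdb` `Document` class):

```python
class ToyDocument:
    """Toy illustration of a document wrapper with an .unpack() method."""

    def __init__(self, content):
        self.content = dict(content)

    def unpack(self):
        # Recursively strip the wrapper to recover plain Python data.
        return {k: (v.unpack() if isinstance(v, ToyDocument) else v)
                for k, v in self.content.items()}


d = ToyDocument({'key': 'find', 'meta': ToyDocument({'_fold': 'train'})})
print(d.unpack())  # {'key': 'find', 'meta': {'_fold': 'train'}}
```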
The reason `superduperdb` uses the `Document` abstraction is that, in SuperDuperDB, you don't need to manage converting data to bytes yourself. We have a system of configurable and user-controlled types, or "Encoders," which allow users to insert, for example, images directly. (This is the topic of an upcoming tutorial!)

Now that you have chosen and tested a model and inserted some data, you may configure vector search on MongoDB Atlas using SuperDuperDB. To do that, execute this command:
```python
from superduperdb import VectorIndex
from superduperdb import Listener

db.add(
    VectorIndex(
        identifier='pymongo-docs',
        indexing_listener=Listener(
            model=model,
            key='value',
            select=Collection('documents').find(),
            predict_kwargs={'max_chunk_size': 1000},
        ),
    )
)
```
This command tells `superduperdb` to do several things:
- Search the "documents" collection
- Set up a vector index on our Atlas cluster, using the text in the "value" field (`Listener`)
- Use the `model` variable to create vector embeddings
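The `predict_kwargs={'max_chunk_size': 1000}` argument controls how many documents are embedded per batch. The batching idea (a sketch of the concept, not SuperDuperDB’s internal code) looks like this:

```python
def chunked(items, max_chunk_size):
    """Yield successive slices of at most max_chunk_size items."""
    for i in range(0, len(items), max_chunk_size):
        yield items[i:i + max_chunk_size]


# 2,500 documents embedded 1,000 at a time -> three batches.
texts = [f'doc {i}' for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(texts, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Batching keeps memory bounded and, for API-backed models, keeps each request within the provider’s payload limits.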
After receiving this command, SuperDuperDB:
- Configures a MongoDB Atlas knn-index in the "documents" collection.
- Saves the model object in the SuperDuperDB model store hosted on GridFS.
- Applies the model to all data in the "documents" collection, and saves the vectors in the documents.
- Saves the fact that the model is connected to the "pymongo-docs" vector index.
If you’d like to “reload” your model in a later session, you can do this with the `load` command:

```python
db.load("model", 'all-MiniLM-L6-v2')
```
To look at what happened during the creation of the `VectorIndex`, we can see that the individual documents now contain vectors:
```python
db.execute(Collection('documents').find_one()).unpack()
```

```
{'key': 'pymongo.mongo_client.MongoClient',
 'parent': None,
 'value': '\nClient for a MongoDB instance, a replica set, or a set of mongoses.\n\n',
 'document': 'mongo_client.md',
 'res': 'pymongo.mongo_client.MongoClient',
 '_fold': 'train',
 '_id': ObjectId('652e460f6cc2a5f9cc21db4f'),
 '_outputs': {'value': {'text-embedding-ada-002': [-0.024740776047110558,
    0.013489063829183578,
    0.021334229037165642,
    -0.03423869237303734,
    ...]}}}
```
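Reading an embedding back out of such a document is plain nested-dictionary access along the `_outputs.<key>.<model>` path. The document below is a truncated stand-in; a real embedding has far more entries:

```python
# Truncated stand-in for a document returned from the collection.
doc = {
    'key': 'pymongo.mongo_client.MongoClient',
    '_outputs': {
        'value': {
            'text-embedding-ada-002': [-0.024740776047110558,
                                       0.013489063829183578,
                                       0.021334229037165642],
        }
    },
}

embedding = doc['_outputs']['value']['text-embedding-ada-002']
print(len(embedding))  # 3
```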
The outputs of models are always saved in the `"_outputs.<key>.<model>"` path of the documents. This allows MongoDB Atlas Vector Search to know where to look to create the fast vector lookup index.

You can also verify that MongoDB Atlas has created a `knn` vector search index by logging in to your Atlas account and navigating to the search tab. It will look like this:

The green ACTIVE status indicates that MongoDB Atlas has finished comprehending and “organizing” the vectors so that they may be searched quickly.
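For orientation, an auto-generated definition for the OpenAI model above might look roughly like the following, expressed here as a Python dict. The nested field names follow the `_outputs.<key>.<model>` convention, and 1536 is the output dimension of `text-embedding-ada-002`; treat the exact shape as an assumption and confirm it against your own cluster:

```python
# Sketch of an Atlas Search knnVector index definition (assumed shape --
# verify against the JSON editor in your own Atlas cluster).
index_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "_outputs": {
                "type": "document",
                "fields": {
                    "value": {
                        "type": "document",
                        "fields": {
                            "text-embedding-ada-002": {
                                "type": "knnVector",
                                "dimensions": 1536,      # output size of the model
                                "similarity": "cosine",
                            }
                        },
                    }
                },
            }
        },
    }
}
```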
If you navigate to the “...” sign on Actions and click edit with JSON editor, then you can inspect the explicit index definition which was automatically configured by `superduperdb`.

You can confirm from this definition that the index looks into the `"_outputs.<key>.<model>"` path of the documents in our collection.

Now that our index is ready to go, we can perform some “search-by-meaning” queries using the `db` connection:
db
```python
query = 'Query the database'

result = db.execute(
    Collection('documents')
        .like(D({'value': query}), vector_index='pymongo-docs', n=5)
        .find({}, {'value': 1, 'key': 1})
)

for r in result:
    print(r.unpack())
```

```
{'key': 'find', 'value': '\nQuery the database.\n\nThe filter argument is a query document that all results\nmust match. For example:\n\n`pycon\n>>> db'}
{'key': 'database_name', 'value': '\nThe name of the database this command was run against.\n\n'}
{'key': 'aggregate', 'value': '\nPerform a database-level aggregation.\n\nSee the [aggregation pipeline](https://mongodb.com/docs/manual/reference/operato'}
{'key': 'alive', 'value': '\nDoes this cursor have the potential to return more data?\n\n'}
{'key': 'pymongo.cursor.CursorType', 'value': '\n'}
```
🚀 So that’s it! 🚀
You’ve now queried a vector search index on MongoDB Atlas Vector Search using a model installed and configured with SuperDuperDB. This required only a few key commands in Python, utilizing model libraries and API clients from the Python open-source ecosystem!
`superduperdb` has lots more to offer:
- Developers can bring their own models, or install arbitrary models from the open-source ecosystem.
- The Cohere and Anthropic APIs are also supported, in addition to OpenAI.
- Developers may also search through images and videos.
- Other use cases, in addition to vanilla vector search, are supported:
  - Chat with your docs
  - Classical machine learning
  - Transfer learning
  - Vector search with arbitrary data types
  - Much, much more…
SuperDuperDB is open source and permissively licensed under the Apache 2.0 license. We would like to encourage developers interested in open-source development to contribute to our discussion forums and issue boards and make their own pull requests. We'll see you on GitHub!
We are looking for visionary organizations we can help to identify and implement transformative AI applications for their business and products. We're offering this absolutely for free. If you would like to learn more about this opportunity, please reach out to us via email: partnerships@superduperdb.com.