Set Up Your Data for Efficient Vector Storage and Ingestion in Atlas
You can convert your embeddings to BSON BinData vectors of subtype float32 or int8. We recommend the BSON binData vector subtype for the following use cases:
You need to index quantized vectors from embedding models.
You have a large number of float vectors but want to reduce the storage and WiredTiger footprint (such as disk and memory usage) in mongod.
The BinData vector format requires about three times less disk space in your cluster compared to arrays of elements. It allows you to index your vectors with alternate types such as int8 and int1 vectors, reducing the memory needed to build the Atlas Vector Search index for your collection.
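To see where a saving of roughly this size can come from, the following back-of-the-envelope sketch compares encoded sizes using only the Python standard library. It assumes the layout described in the BSON specification: an array stores each element as a type byte, a decimal index key, and an 8-byte double, while a binary vector stores a short header followed by tightly packed elements. The function names are illustrative, and the figures ignore index and storage-engine effects such as compression, so treat the result as an estimate rather than a measurement.

```python
def bson_double_array_size(num_dimensions):
    """Approximate encoded size of a BSON array of doubles, in bytes."""
    # Each element: 1 type byte + index key as decimal text + 1 NUL + 8-byte double.
    elements = sum(1 + len(str(i)) + 1 + 8 for i in range(num_dimensions))
    # The array is an embedded document: 4-byte length prefix + 1-byte terminator.
    return 4 + elements + 1


def bson_vector_size(num_dimensions, bytes_per_element):
    """Approximate encoded size of a BinData vector value, in bytes."""
    # 4-byte length + 1-byte binary subtype + 2-byte vector header (dtype, padding).
    return 4 + 1 + 2 + num_dimensions * bytes_per_element


dims = 1024
array_size = bson_double_array_size(dims)
float32_size = bson_vector_size(dims, 4)
int8_size = bson_vector_size(dims, 1)

print(f"array of doubles: {array_size} bytes")
print(f"float32 vector:   {float32_size} bytes ({array_size / float32_size:.1f}x smaller)")
print(f"int8 vector:      {int8_size} bytes ({array_size / int8_size:.1f}x smaller)")
```

Under these assumptions, a 1024-dimension float32 vector comes out a little over three times smaller than the equivalent array, and an int8 vector is smaller again.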
If you don't already have binData vectors, you can convert your embeddings to this format by using any supported driver before writing your data to a collection. This page walks you through the steps for converting your embeddings to the BinData vector subtype.
Supported Drivers
PyMongo Driver v4.10 or later supports converting embeddings to BSON BinData vector subtypes float32 and int8.
Prerequisites
To convert your embeddings to the BSON BinData vector subtype, you need the following:
An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later.
Ensure that your IP address is included in your Atlas project's access list.
An environment to run interactive Python notebooks such as Colab.
Access to an embedding model that supports byte vector output.
The following embedding model providers support both int8 and int1 binData vectors:
Embedding Model Provider    Embedding Model
Cohere                      embed-english-v3.0
Nomic                       nomic-embed-text-v1.5
Jina                        jina-embeddings-v2-base-en
Mixedbread                  mxbai-embed-large-v1
You can use any of these embedding model providers to generate binData vectors.
Procedure
Create an interactive Python notebook by saving a file with the .ipynb extension, and then perform the following steps in the notebook. The examples in this procedure use sample data and Cohere's embed-english-v3.0 model. To try the example, replace values shown in curly brackets ({ }) with valid values.
Install the required libraries.
Run the following command to install the PyMongo Driver.
pip install pymongo
You must install PyMongo driver v4.10 or later.
Example
Install PyMongo and Cohere
pip --quiet install pymongo cohere
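Before continuing, it can help to confirm that the installed driver meets the v4.10 minimum. The following is a minimal sketch using only the standard library; the helper name meets_minimum is hypothetical. It compares version numbers as integers because a plain string comparison would rank "4.9" above "4.10".

```python
import re
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(version_string, minimum=(4, 10)):
    """Return True if a 'major.minor[.patch]' version is at least `minimum`."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if not match:
        return False
    # Compare numerically: (4, 10) is newer than (4, 9).
    found = (int(match.group(1)), int(match.group(2) or 0))
    return found >= minimum


try:
    installed = version("pymongo")
    status = "OK" if meets_minimum(installed) else "too old, upgrade to 4.10 or later"
    print(f"pymongo {installed}: {status}")
except PackageNotFoundError:
    print("pymongo is not installed")
```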
Load the data for which you want to generate BSON vectors in your notebook.
Example
Sample Data to Import
data = [
    "The Great Wall of China is visible from space.",
    "The Eiffel Tower was completed in Paris in 1889.",
    "Mount Everest is the highest peak on Earth at 8,848m.",
    "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
    "The Mona Lisa was painted by Leonardo da Vinci.",
]
(Conditional) Generate embeddings from your data.
This step is required if you haven't yet generated embeddings from your data. If you've already generated embeddings, skip this step. To learn more about generating embeddings from your data, see How to Create Vector Embeddings.
Example
Generate Embeddings from Sample Data Using Cohere
import cohere

api_key = "{COHERE-API-KEY}"
co = cohere.Client(api_key)

# Request both float and int8 embeddings for the sample data
generated_embeddings = co.embed(
    texts=data,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float", "int8"]
).embeddings

float32_embeddings = generated_embeddings.float
int8_embeddings = generated_embeddings.int8
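Whichever model you use, a quick consistency check on the embeddings can catch problems before you convert them. The following sketch is one way to do that under stated assumptions: check_embeddings is a hypothetical helper, and the tiny hand-written vectors stand in for real model output so the block runs without an API key.

```python
def check_embeddings(float_vectors, int8_vectors, expected_dimensions):
    """Raise ValueError if the two embedding lists are not ready for conversion."""
    if len(float_vectors) != len(int8_vectors):
        raise ValueError("float32 and int8 lists differ in length")
    for position, (f_vec, i_vec) in enumerate(zip(float_vectors, int8_vectors)):
        if len(f_vec) != expected_dimensions or len(i_vec) != expected_dimensions:
            raise ValueError(f"embedding {position} has the wrong number of dimensions")
        # int8 values must be whole numbers in the signed 8-bit range.
        if not all(isinstance(v, int) and -128 <= v <= 127 for v in i_vec):
            raise ValueError(f"embedding {position} has values outside the int8 range")


# Stand-in vectors; real embeddings have many more dimensions.
check_embeddings([[0.1, -0.2, 0.3]], [[12, -25, 38]], expected_dimensions=3)
print("embeddings look consistent")
```

With the sample data from this procedure, the equivalent call would be check_embeddings(float32_embeddings, int8_embeddings, 1024), where 1024 matches the numDimensions value used in the index definition later on this page.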
Generate the BSON vectors from your embeddings.
You can use the PyMongo driver to convert your native vector embeddings to BSON vectors.
Example
Define and Run a Function to Generate BSON Vectors
from bson.binary import Binary, BinaryVectorDtype

def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

# For all vectors in your collection, generate BSON vectors of float32 and int8 embeddings
bson_float32_embeddings = []
bson_int8_embeddings = []
for f32_emb, int8_emb in zip(float32_embeddings, int8_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
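For intuition about what the conversion produces, the following standard-library sketch mimics the payload layout as we understand it from the BSON binary vector specification: one byte identifying the element type, one padding byte, then the elements packed in little-endian order. The marker values 0x27 and 0x03 and both helper names are assumptions made for illustration. In real code, rely on Binary.from_vector rather than packing bytes yourself.

```python
import struct

# Assumed dtype marker bytes for the vector header.
FLOAT32_DTYPE = 0x27
INT8_DTYPE = 0x03


def pack_vector_payload(values, dtype):
    """Build the bytes stored inside a vector binary: header + packed elements."""
    element_format = "f" if dtype == FLOAT32_DTYPE else "b"
    header = bytes([dtype, 0])  # dtype marker, then padding (0 for these types)
    return header + struct.pack(f"<{len(values)}{element_format}", *values)


def unpack_vector_payload(payload):
    """Reverse pack_vector_payload and return (dtype, list of values)."""
    dtype = payload[0]
    element_format, width = ("f", 4) if dtype == FLOAT32_DTYPE else ("b", 1)
    count = (len(payload) - 2) // width
    return dtype, list(struct.unpack(f"<{count}{element_format}", payload[2:]))


payload = pack_vector_payload([12, -25, 38], INT8_DTYPE)
print(len(payload))                    # 2 header bytes + 3 one-byte elements
print(unpack_vector_payload(payload))  # the dtype marker and the original values
```

The sketch shows why the format is compact: each int8 element occupies a single byte and each float32 element four bytes, with no per-element keys or type tags.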
Create documents with the BSON vector embeddings.
If you already have the BSON vector embeddings inside of documents in your collection, skip this step.
Example
Create Documents from the Sample Data
def create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, text) in enumerate(zip(bson_float32_embeddings, bson_int8_embeddings, data)):
        doc = {
            "_id": i,
            "data": text,
            "{FIELD-NAME-FOR-INT8-TYPE}": bson_int8_emb,
            "{FIELD-NAME-FOR-FLOAT32-TYPE}": bson_f32_emb,
        }
        docs.append(doc)
    return docs

documents = create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, data)
Load your data into your Atlas cluster.
You can load your data from the Atlas UI or programmatically. To learn how to load your data from the Atlas UI, see Insert Your Data. The following steps and associated examples demonstrate how to load your data programmatically by using the PyMongo driver.
Connect to your Atlas cluster.
Example
import pymongo

MONGO_URI = "{ATLAS-CONNECTION-STRING}"

def get_mongo_client(mongo_uri):
    # Establish the connection and return the client
    client = pymongo.MongoClient(mongo_uri)
    return client

if not MONGO_URI:
    print("MONGO_URI is not set")
Load the data into your Atlas cluster.
Example
client = pymongo.MongoClient(MONGO_URI)
db = client["{DB-NAME}"]
db.create_collection("{COLLECTION-NAME}")
col = db["{COLLECTION-NAME}"]
col.insert_many(documents)
Create the Atlas Vector Search index on the collection.
You can create Atlas Vector Search indexes by using the Atlas UI, Atlas CLI, Atlas Administration API, and MongoDB drivers. To learn more, see How to Index Fields for Vector Search.
Example
Create Index for the Sample Collection
import time
from pymongo.operations import SearchIndexModel

# Index both the float32 and the int8 vector fields
vector_search_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "{FIELD-NAME-FOR-FLOAT32-TYPE}",
            "similarity": "euclidean",
            "numDimensions": 1024,
        },
        {
            "type": "vector",
            "path": "{FIELD-NAME-FOR-INT8-TYPE}",
            "similarity": "euclidean",
            "numDimensions": 1024,
        }
    ]
}

search_index_model = SearchIndexModel(
    definition=vector_search_index_definition,
    name="{INDEX-NAME}",
    type="vectorSearch"
)
col.create_search_index(model=search_index_model)
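The index builds asynchronously, so a query issued immediately after creating it may return no results. One way to handle this is to poll until the index reports that it is queryable. The sketch below assumes the collection object exposes PyMongo's list_search_indexes method and that each returned index document carries a queryable flag; wait_until_queryable and its timeout parameters are hypothetical names chosen for this example.

```python
import time


def wait_until_queryable(collection, index_name, timeout_seconds=300, poll_seconds=5):
    """Poll until the named search index reports queryable, or time out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Example call once the index has been created on `col`:
# wait_until_queryable(col, "{INDEX-NAME}")
```

The function returns False instead of raising when the timeout passes, which leaves the decision about retrying or aborting to the calling notebook cell.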
Define a function to run the Atlas Vector Search queries.
The function to run Atlas Vector Search queries must perform the following actions:
Convert the query text to a BSON vector.
Define the pipeline for the Atlas Vector Search query.
Example
def run_vector_search(query_text, collection, path):
    # Embed the query text as both float and int8 vectors
    query_text_embeddings = co.embed(
        texts=[query_text],
        model="embed-english-v3.0",
        input_type="search_query",
        embedding_types=["float", "int8"]
    ).embeddings

    # Pick the embedding type that matches the indexed field
    if path == "{FIELD-NAME-FOR-FLOAT32-TYPE}":
        query_vector = query_text_embeddings.float[0]
        vector_dtype = BinaryVectorDtype.FLOAT32
    else:
        query_vector = query_text_embeddings.int8[0]
        vector_dtype = BinaryVectorDtype.INT8
    bson_query_vector = generate_bson_vector(query_vector, vector_dtype)

    pipeline = [
        {
            '$vectorSearch': {
                'index': '{INDEX-NAME}',
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': {NUMBER-OF-CANDIDATES-TO-CONSIDER},
                'limit': {NUMBER-OF-DOCUMENTS-TO-RETURN}
            }
        },
        {
            '$project': {
                '_id': 0,
                'data': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]

    return collection.aggregate(pipeline)
Run the Atlas Vector Search query.
You can run Atlas Vector Search queries programmatically. To learn more, see Run Vector Search Queries.
Example
from pprint import pprint

query_text = "tell me a science fact"
float32_results = run_vector_search(query_text, col, "{FIELD-NAME-FOR-FLOAT32-TYPE}")
int8_results = run_vector_search(query_text, col, "{FIELD-NAME-FOR-INT8-TYPE}")

print("results from float32 embeddings")
pprint(list(float32_results))
print("--------------------------------------------------------------------------")
print("results from int8 embeddings")
pprint(list(int8_results))
results from float32 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
  'score': 0.4222325384616852},
 {'data': 'The Great Wall of China is visible from space.',
  'score': 0.4112812876701355},
 {'data': 'The Mona Lisa was painted by Leonardo da Vinci.',
  'score': 0.3871753513813019},
 {'data': 'The Eiffel Tower was completed in Paris in 1889.',
  'score': 0.38428616523742676},
 {'data': 'Shakespeare wrote 37 plays and 154 sonnets during his lifetime.',
  'score': 0.37546128034591675}]
--------------------------------------------------------------------------
results from int8 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
  'score': 4.619598996669083e-07},
 {'data': 'The Great Wall of China is visible from space.',
  'score': 4.5106872903488693e-07},
 {'data': 'The Mona Lisa was painted by Leonardo da Vinci.',
  'score': 4.0036800896814384e-07},
 {'data': 'The Eiffel Tower was completed in Paris in 1889.',
  'score': 3.9345573554783186e-07},
 {'data': 'Shakespeare wrote 37 plays and 154 sonnets during his lifetime.',
  'score': 3.797164538354991e-07}]
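The two result sets rank the documents in the same order, but the int8 scores are several orders of magnitude smaller. A plausible explanation is the scale of the values: int8 components range up to 127, so distances between int8 vectors are far larger than distances between float vectors with components well below 1. The sketch below illustrates the effect on toy two-dimensional vectors. It assumes the euclidean score is normalized as 1 / (1 + squared distance); that formula is an assumption made for this illustration, not something this page states.

```python
def squared_euclidean(a, b):
    """Sum of squared component differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean_score(a, b):
    """Assumed normalization: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + squared_euclidean(a, b))


# Two unit-length float vectors pointing in similar directions.
float_a, float_b = [0.6, 0.8], [0.8, 0.6]

# The same directions scaled to the int8 range.
int8_a = [round(v * 127) for v in float_a]
int8_b = [round(v * 127) for v in float_b]

print("float score:", euclidean_score(float_a, float_b))
print("int8 score: ", euclidean_score(int8_a, int8_b))
```

Because the scaling applies to every vector alike, the relative ordering of results is largely preserved even though the absolute scores shrink, so compare scores only within the same field type.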
For an advanced demonstration of this procedure on sample data using Cohere's embed-english-v3.0 embedding model, see this notebook.