How to Create Vector Embeddings

On this page

  • Get Started
  • Prerequisites
  • Define an Embedding Function
  • Create Embeddings from Data
  • Create Embeddings for Queries
  • Considerations
  • Choosing a Method to Create Embeddings
  • Choosing an Embedding Model
  • Validating Your Embeddings
  • Troubleshooting
  • Next Steps

You can store vector embeddings alongside your other data in Atlas. These embeddings capture meaningful relationships in your data and allow you to perform semantic search and implement RAG with Atlas Vector Search.

Use the following tutorial to learn how to create vector embeddings and query them using Atlas Vector Search. Specifically, you perform the following actions:

  1. Define a function that uses an embedding model to generate vector embeddings.

  2. Create embeddings from your data and store them in Atlas.

  3. Create embeddings from your search terms and run a vector search query.
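
As a rough mental model of these three steps, the following Python sketch may help. It is not the tutorial's code: `toy_embed` is a hypothetical stand-in for an embedding model (a letter-frequency vector), and an in-memory list stands in for an Atlas collection. The sketch only illustrates the flow: embed the data, store each vector with its text, then embed the query and rank by cosine similarity.

```python
import math


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_store(texts):
    """Step 2: create an embedding per text and keep it next to the text."""
    return [{"text": t, "embedding": toy_embed(t)} for t in texts]


def search(store, query, limit=1):
    """Step 3: embed the query and return the most similar stored texts."""
    query_vector = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_vector, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:limit]]


store = build_store(["ocean waves", "mountain trail"])
top_match = search(store, "ocean")
```

In the tutorial itself, a real embedding model replaces `toy_embed`, and Atlas Vector Search replaces the in-memory ranking.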

For production applications, you typically write a script to generate vector embeddings. You can start with the sample code on this page and customize it for your use case.



To complete this tutorial, you must have the following:

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • A terminal and code editor to run your C# project.

  • .NET 8.0 or higher installed.

  • A Hugging Face Access Token or OpenAI API Key.

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • Java Development Kit (JDK) version 8 or later.

  • An environment to set up and run a Java application. We recommend that you use an integrated development environment (IDE) such as IntelliJ IDEA or Eclipse IDE to configure Maven or Gradle to build and run your project.

  • One of the following:

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • A terminal and code editor to run your Node.js project.

  • npm and Node.js installed.

  • If you're using OpenAI models, you must have an OpenAI API Key.

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • An environment to run interactive Python notebooks such as VS Code or Colab.

  • If you're using OpenAI models, you must have an OpenAI API Key.

1

In a terminal window, run the following commands to initialize your project:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings
2

In a terminal window, run the following commands:

dotnet add package MongoDB.Driver
3

Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and Hugging Face access token available to your project.

export HUGGINGFACE_ACCESS_TOKEN="<access-token>"
export ATLAS_CONNECTION_STRING="<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
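
If you assemble the connection string in code, a small helper can catch formatting mistakes early. The following Python sketch is an illustration only: the function names and the host `cluster0.example.mongodb.net` are hypothetical. It percent-encodes the username and password, because special characters such as `@` or `/` in credentials would otherwise break the URI, and it checks that a string has the SRV shape shown above.

```python
from urllib.parse import quote_plus, urlsplit


def build_srv_uri(username, password, host):
    """Build an SRV connection string with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


def looks_like_srv_uri(uri):
    """Return True if uri has the mongodb+srv://user:password@host shape."""
    parts = urlsplit(uri)
    return (
        parts.scheme == "mongodb+srv"
        and bool(parts.username)
        and bool(parts.password)
        and bool(parts.hostname)
        and parts.hostname.endswith(".mongodb.net")
    )


# Hypothetical credentials and host, for illustration only.
uri = build_srv_uri("app", "p@ss/word", "cluster0.example.mongodb.net")
```

This check is only a shape test. It does not confirm that the cluster exists or that the credentials are valid.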
4

Create a new class named AIService in a file named AIService.cs and paste the following code. This code defines an async method named GetEmbeddingsAsync that generates embeddings for an array of string inputs and returns each embedding paired with its input text. This method uses the mxbai-embed-large-v1 embedding model.

AIService.cs
namespace MyCompany.Embeddings;
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using System.Net.Http.Headers;
public class AIService
{
    private static readonly string? HuggingFaceAccessToken = Environment.GetEnvironmentVariable("HUGGINGFACE_ACCESS_TOKEN");
    private static readonly HttpClient Client = new HttpClient();
    public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
    {
        const string modelName = "mixedbread-ai/mxbai-embed-large-v1";
        const string url = $"https://api-inference.huggingface.co/models/{modelName}";
        Client.DefaultRequestHeaders.Authorization
            = new AuthenticationHeaderValue("Bearer", HuggingFaceAccessToken);
        var data = new { inputs = texts };
        var dataJson = JsonSerializer.Serialize(data);
        var content = new StringContent(dataJson, null, "application/json");
        var response = await Client.PostAsync(url, content);
        response.EnsureSuccessStatusCode();
        var responseString = await response.Content.ReadAsStringAsync();
        var embeddings = JsonSerializer.Deserialize<float[][]>(responseString);
        if (embeddings is null)
        {
            throw new ApplicationException("Failed to deserialize embeddings response to an array of floats.");
        }
        Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
        var embeddingCount = embeddings.Length;
        foreach (var value in Enumerable.Range(0, embeddingCount))
        {
            // Pair each embedding with the text used to generate it.
            documentData[texts[value]] = embeddings[value];
        }
        return documentData;
    }
}

Note

503 when calling Hugging Face models

You may occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short delay.
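
One way to "retry after a short delay" is exponential backoff. The following Python sketch shows the idea independently of any HTTP library. `ServiceUnavailable`, `call_with_retry`, and the default delays are hypothetical names and values chosen for illustration. Your own request function would raise the exception when it sees a 503 status.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error that the request function raises on an HTTP 503."""


def call_with_retry(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on ServiceUnavailable with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except ServiceUnavailable:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the caller.
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so that you can test the retry logic without actually waiting.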

1

In a terminal window, run the following commands to initialize your project:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings
2

In a terminal window, run the following commands:

dotnet add package MongoDB.Driver
dotnet add package OpenAI
3

Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and OpenAI API key available to your project.

export OPENAI_API_KEY="<api-key>"
export ATLAS_CONNECTION_STRING="<connection-string>"

Replace the <api-key> placeholder value with your OpenAI API key.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4

Create a new class named AIService in a file named AIService.cs and paste the following code. This code defines an async method named GetEmbeddingsAsync that generates embeddings for an array of string inputs and returns each embedding paired with its input text. This method uses OpenAI's text-embedding-3-small model.

AIService.cs
namespace MyCompany.Embeddings;
using OpenAI.Embeddings;
using System;
using System.Threading.Tasks;
public class AIService
{
    private static readonly string? OpenAIApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
    private static readonly string EmbeddingModelName = "text-embedding-3-small";
    public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
    {
        EmbeddingClient embeddingClient = new(model: EmbeddingModelName, apiKey: OpenAIApiKey);
        Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
        try
        {
            var result = await embeddingClient.GenerateEmbeddingsAsync(texts);
            var embeddingCount = result.Value.Count;
            foreach (var index in Enumerable.Range(0, embeddingCount))
            {
                // Pair each embedding with the text used to generate it.
                documentData[texts[index]] = result.Value[index].ToFloats().ToArray();
            }
        }
        catch (Exception e)
        {
            throw new ApplicationException(e.Message);
        }
        return documentData;
    }
}

In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.

1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project
2

In a terminal window, run the following commands:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/tmc/langchaingo/llms
3

In your project, create a .env file to store your Atlas connection string and Hugging Face access token.

HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4
  1. Create a directory in your project called common to store common code that you'll use in later steps:

    mkdir common && cd common
  2. Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that generates an embedding for each input document. The function uses the feature-extraction task with the mxbai-embed-large-v1 embedding model.

    get-embeddings.go
    package common
    import (
        "context"
        "log"
        "github.com/tmc/langchaingo/embeddings/huggingface"
    )
    func GetEmbeddings(documents []string) [][]float32 {
        hf, err := huggingface.NewHuggingface(
            huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
            huggingface.WithTask("feature-extraction"))
        if err != nil {
            log.Fatalf("failed to connect to Hugging Face: %v", err)
        }
        embs, err := hf.EmbedDocuments(context.Background(), documents)
        if err != nil {
            log.Fatalf("failed to generate embeddings: %v", err)
        }
        return embs
    }

    Note

    503 when calling Hugging Face models

    You may occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short delay.

  3. Move back into the main project root directory.

    cd ../
1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project
2

In a terminal window, run the following commands:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/milosgajdos/go-embeddings/openai
3

In your project, create a .env file to store your connection string and OpenAI API token.

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

4
  1. Create a directory in your project called common to store code you'll use in multiple steps:

    mkdir common && cd common
  2. Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.

    get-embeddings.go
    package common
    import (
        "context"
        "log"
        "github.com/milosgajdos/go-embeddings/openai"
    )
    func GetEmbeddings(docs []string) [][]float64 {
        c := openai.NewClient()
        embReq := &openai.EmbeddingRequest{
            Input:          docs,
            Model:          openai.TextSmallV3,
            EncodingFormat: openai.EncodingFloat,
        }
        embs, err := c.Embed(context.Background(), embReq)
        if err != nil {
            log.Fatalf("failed to connect to OpenAI: %v", err)
        }
        var vectors [][]float64
        for _, emb := range embs {
            vectors = append(vectors, emb.Vector)
        }
        return vectors
    }
  3. Move back into the main project root directory.

    cd ../

In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.

1
  1. From your IDE, create a Java project using Maven or Gradle.

  2. Add the following dependencies, depending on your package manager:

    If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file:

    pom.xml
    <dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with Hugging Face models -->
    <dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-hugging-face</artifactId>
    <version>0.35.0</version>
    </dependency>
    </dependencies>

    If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

    build.gradle
    dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with Hugging Face models
    implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0'
    }
  3. Run your package manager to install the dependencies to your project.

2

Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

  • If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.

    To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.

  • If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.

    To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.

Environment variables
HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>

Update the placeholders with the following values:

  • Replace the <access-token> placeholder value with your Hugging Face access token.

  • Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

    Your connection string should use the following format:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

Create a file named EmbeddingProvider.java and paste the following code.

This code defines two methods to generate embeddings for a given input using the mxbai-embed-large-v1 open-source embedding model:

  • Multiple Inputs: The getEmbeddings method accepts an array of text inputs (List<String>), allowing you to create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.

  • Single Input: The getEmbedding method accepts a single String, which represents a query you want to make against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.

EmbeddingProvider.java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
    private static HuggingFaceEmbeddingModel embeddingModel;
    private static HuggingFaceEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
            if (accessToken == null || accessToken.isEmpty()) {
                throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
            }
            embeddingModel = HuggingFaceEmbeddingModel.builder()
                    .accessToken(accessToken)
                    .modelId("mixedbread-ai/mxbai-embed-large-v1")
                    .waitForModel(true)
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }
    /**
     * Takes an array of strings and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();
        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }
    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}
1
  1. From your IDE, create a Java project using Maven or Gradle.

  2. Add the following dependencies, depending on your package manager:

    If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file:

    pom.xml
    <dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with OpenAI models -->
    <dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-open-ai</artifactId>
    <version>0.35.0</version>
    </dependency>
    </dependencies>

    If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

    build.gradle
    dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with OpenAI models
    implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0'
    }
  3. Run your package manager to install the dependencies to your project.

2

Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

  • If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.

    To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.

  • If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.

    To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.

Environment variables
OPEN_AI_API_KEY=<api-key>
ATLAS_CONNECTION_STRING=<connection-string>

Update the placeholders with the following values:

  • Replace the <api-key> placeholder value with your OpenAI API key.

  • Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

    Your connection string should use the following format:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

Create a file named EmbeddingProvider.java and paste the following code.

This code defines two methods to generate embeddings for a given input using the text-embedding-3-small OpenAI embedding model:

  • Multiple Inputs: The getEmbeddings method accepts an array of text inputs (List<String>), allowing you to create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.

  • Single Input: The getEmbedding method accepts a single String, which represents a query you want to make against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.

EmbeddingProvider.java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
    private static OpenAiEmbeddingModel embeddingModel;
    private static OpenAiEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String apiKey = System.getenv("OPEN_AI_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty.");
            }
            // Cache the model in the static field so later calls reuse it.
            embeddingModel = OpenAiEmbeddingModel.builder()
                    .apiKey(apiKey)
                    .modelName("text-embedding-3-small")
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }
    /**
     * Takes an array of strings and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();
        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }
    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}

In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.

1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y
2

Configure your project to use ES modules by adding "type": "module" to your package.json file and then saving it.

{
"type": "module",
// other fields...
}
3

In a terminal window, run the following command:

npm install mongodb @xenova/transformers
4

In your project, create a .env file to store your Atlas connection string.

ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

Note

Minimum Node.js Version Requirements

Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.

5

Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding to generate an embedding for a given input. This function uses the feature-extraction task with the nomic-embed-text-v1 embedding model.

get-embeddings.js
import { pipeline } from '@xenova/transformers';
// Function to generate embeddings for a given data source
export async function getEmbedding(data) {
    const embedder = await pipeline(
        'feature-extraction',
        'Xenova/nomic-embed-text-v1');
    const results = await embedder(data, { pooling: 'mean', normalize: true });
    return Array.from(results.data);
}
1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y
2

Configure your project to use ES modules by adding "type": "module" to your package.json file and then saving it.

{
"type": "module",
// other fields...
}
3

In a terminal window, run the following command:

npm install mongodb openai
4

In your project, create a .env file to store your Atlas connection string and OpenAI API key.

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

Note

Minimum Node.js Version Requirements

Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.

5

Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.

get-embeddings.js
import OpenAI from 'openai';
// Setup OpenAI configuration
const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY});
// Function to get the embeddings using the OpenAI API
export async function getEmbedding(text) {
    const results = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
        encoding_format: "float",
    });
    return results.data[0].embedding;
}

In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model from Nomic or a proprietary model from OpenAI.

The open-source example also includes a function to convert your embeddings to BSON binData vectors for efficient processing. Only certain embedding models support byte vector outputs. For embedding models that don't, such as models from OpenAI, enable automatic quantization when creating the Atlas Vector Search index.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.

1

Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install the dependencies:

pip install --quiet --upgrade sentence-transformers pymongo einops
2

Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:

  • Loads the nomic-embed-text-v1 embedding model.

  • Creates a function named get_embedding that uses the model to generate float32 (the default precision), int8, or int1 embeddings for a given text input.

from sentence_transformers import SentenceTransformer
# Load the embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
# Define a function to generate embeddings in multiple precisions
def get_embedding(data, precision="float32"):
    return model.encode(data, precision=precision)
3

Paste and run the following code in your notebook to create a function named generate_bson_vector that converts the full-fidelity embeddings to BSON float32, int8, and int1 vector subtypes for efficient processing of your vector data.

from bson.binary import Binary
# Generate BSON vector using `BinaryVectorDtype`
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)
4

Paste and run the following code in your notebook to create a function named create_docs_with_bson_vector_embeddings that creates the documents with embeddings that you will ingest into Atlas.

# Function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32, bson_int8, bson_int1, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32, bson_int8, bson_int1, data)):
        doc = {
            "_id": i,
            "data": text,
            "BSON-Float32-Embedding": bson_f32_emb,
            "BSON-Int8-Embedding": bson_int8_emb,
            "BSON-Int1-Embedding": bson_int1_emb,
        }
        docs.append(doc)
    return docs
5

Paste and run the following code in your notebook to test the get_embedding function.

This code generates float32, int8, and int1 embeddings for the strings foo and bar.

# Example generating embeddings for the strings "foo" and "bar"
data = ["foo", "bar"]
float32_embeddings = get_embedding(data, "float32")
int8_embeddings = get_embedding(data, "int8")
int1_embeddings = get_embedding(data, "ubinary")
print("Float32 Embedding:", float32_embeddings)
print("Int8 Embedding:", int8_embeddings)
print("Int1 Embedding (binary representation):", int1_embeddings)
Float32 Embedding: [
[-0.02980827 0.03841474 -0.02561123 ... -0.0532876
-0.0335409 -0.02591543]
[-0.02748881 0.03717749 -0.03104552 ... 0.02413219 -0.02402252 0.02810651]
]
Int8 Embedding: [
[-128 127 127 ... -128 -128 -128]
[ 126 -128 -128 ... 127 126 127]
]
Int1 Embedding (binary representation): [
[ 77 30 4 131 15 123 146 ... 159 142 205 23 119 120]
[ 79 82 208 180 45 79 209 ... 158 100 141 189 166 173]
]
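
The int1 values in this output are bytes, each holding eight one-bit dimensions. The following pure-Python sketch illustrates how such packing can work, assuming the common scheme of thresholding each value at zero and packing bits most significant bit first. The function `pack_sign_bits` is a hypothetical helper for illustration and is not necessarily how the library implements its `ubinary` precision.

```python
def pack_sign_bits(vector):
    """Pack one bit per dimension (1 if the value is > 0) into unsigned bytes.

    Bits are packed most significant bit first, eight dimensions per byte.
    """
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift the bits collected so far and append the new sign bit.
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


# Eight float dimensions collapse into a single byte.
packed = pack_sign_bits([0.5, -0.1, 0.2, 0.0, -3.0, 1.0, 1.0, -1.0])
```

Under this scheme, a vector with N dimensions needs only N / 8 bytes, which is why binary vectors are much smaller than their float32 counterparts.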
6

Paste and run the following code in your notebook to test the generate_bson_vector function.

This code converts your float32, int8, and int1 embeddings for the strings foo and bar to BSON vectors.

from bson.binary import BinaryVectorDtype
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
# Print the embeddings
print(f"The converted bson_float32_new_embedding is: {bson_float32_embeddings}")
print(f"The converted bson_int8_new_embedding is: {bson_int8_embeddings}")
print(f"The converted bson_int1_new_embedding is: {bson_int1_embeddings}")
The converted bson_float32_new_embedding is: [Binary(b'\'\x00x0\xf4\ ... x9bL\xd4\xbc', 9), Binary(b'\'\x007 ... \x9e?\xe6<', 9)]
The converted bson_int8_new_embedding is: [Binary(b'\x03\x00\x80\x7f\ ... x80\x80', 9), Binary(b'\x03\x00~\x80 ... \x7f', 9)]
The converted bson_int1_new_embedding is: [Binary(b'\x10\x00M\x1e\ ... 7wx', 9), Binary(b'\x10\x00OR\ ... \xa6\xad', 9)]
1

Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install the dependencies:

pip install --quiet --upgrade openai pymongo
2

Paste and run the following code in your notebook to create a function that generates vector embeddings by using a proprietary embedding model from OpenAI. Replace <api-key> with your OpenAI API key. This code does the following:

  • Specifies the text-embedding-3-small embedding model.

  • Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.

  • Tests the function by generating a single embedding for the string foo.

import os
from openai import OpenAI
# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()
# Define a function to generate embeddings
def get_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = openai_client.embeddings.create(input=[text], model=model).data[0].embedding
    return embedding
# Generate an embedding
get_embedding("foo")
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]

Tip

See also:

For API details and a list of available models, refer to the OpenAI documentation.
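
Before storing vectors, a quick sanity check can catch truncated or malformed responses. The following is a minimal sketch, assuming the default output size of 1536 dimensions for text-embedding-3-small; check_embedding is a hypothetical helper name, and you should adjust expected_dims if you request a different size or use another model.

```python
import math

def check_embedding(embedding, expected_dims=1536):
    """Raise ValueError if the vector has the wrong length or non-finite values."""
    if len(embedding) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(embedding)}")
    if not all(isinstance(x, float) and math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains non-float or non-finite values")
    return True

# Synthetic vector used here so the example runs without an API call.
# In the notebook you might call: check_embedding(get_embedding("foo"))
print(check_embedding([0.0] * 1536))  # True
```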

In this section, you create vector embeddings from your data using the function that you defined, and then you store these embeddings in a collection in Atlas.

Select a tab based on whether you want to create embeddings from new data or from existing data that you already have in Atlas.
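
Whatever language you select, the new-data flow reduces to pairing each text with its vector before inserting. The sketch below shows that step in Python without a database connection. The build_documents helper and the fake_embed stand-in are hypothetical names used only for illustration; replace fake_embed with the embedding function that you defined.

```python
def build_documents(texts, embed):
    """Pair each text with its embedding in the {text, embedding} document shape."""
    documents = []
    for text in texts:
        vector = embed(text)
        documents.append({"text": text, "embedding": list(vector)})
    return documents

# Stand-in embedding function (hypothetical), so the example runs offline.
def fake_embed(text):
    return [float(len(text)), 0.0, 1.0]

docs = build_documents(["foo", "bar baz"], fake_embed)
print(docs[0])  # {'text': 'foo', 'embedding': [3.0, 0.0, 1.0]}
```

The resulting list can then be passed to your driver's insert-many method, as the examples in each tab do.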

1

Create a class named DataService in a file named DataService.cs and paste the following code. This code defines an async Task named AddDocumentsAsync that adds documents to Atlas. This method uses the Collection.InsertManyAsync() C# Driver method to insert a list of BsonDocument objects. Each document contains:

  • A text field that contains the movie summary.

  • An embedding field that contains the array of floats from generating the vector embeddings.

DataService.cs
namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
private static readonly MongoClient Client = new MongoClient(ConnectionString);
private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
{
var documents = new List<BsonDocument>();
foreach( KeyValuePair<string, float[]> kvp in embeddings )
{
var document = new BsonDocument
{
{
"text", kvp.Key
},
{
"embedding", new BsonArray(kvp.Value)
}
};
documents.Add(document);
}
await Collection.InsertManyAsync(documents);
Console.WriteLine($"Successfully inserted {embeddings.Count} documents into Atlas");
documents.Clear();
}
}
2

Use the following code to generate embeddings from new data and store them in Atlas.

Specifically, this code uses the GetEmbeddingsAsync function that you defined to generate embeddings from an array of sample texts, and then calls AddDocumentsAsync to ingest them into the sample_db.embeddings collection in Atlas.

Program.cs
using MyCompany.Embeddings;
var aiService = new AIService();
var texts = new string[]
{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
};
var embeddings = await aiService.GetEmbeddingsAsync(texts);
var dataService = new DataService();
await dataService.AddDocumentsAsync(embeddings);
3
dotnet run MyCompany.Embeddings.csproj
Successfully inserted 3 documents into Atlas

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from our sample data, but you can adapt the code to work with any collection in your cluster.
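
The existing-data flow has two parts: select documents that have a non-empty summary and no embeddings field yet, then write each vector back to its document. The sketch below expresses both parts as plain Python dictionaries, without a database connection. The build_updates helper is a hypothetical name used for illustration, and it matches on _id, which is one reasonable choice among several.

```python
# Query document: summary exists and is non-empty, and no embeddings field yet.
missing_embeddings_filter = {
    "$and": [
        {"summary": {"$exists": True, "$ne": ""}},
        {"embeddings": {"$exists": False}},
    ]
}

def build_updates(documents, vectors):
    """Return one (filter, update) pair per document, matched on _id."""
    if len(documents) != len(vectors):
        raise ValueError("documents and vectors must have the same length")
    return [
        ({"_id": doc["_id"]}, {"$set": {"embeddings": vec}})
        for doc, vec in zip(documents, vectors)
    ]

updates = build_updates([{"_id": "a1", "summary": "Cozy flat"}], [[0.1, 0.2]])
print(updates[0])  # ({'_id': 'a1'}, {'$set': {'embeddings': [0.1, 0.2]}})
```

Each pair can be turned into an update-one operation and sent in a single bulk write, which avoids one round trip per document.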

1

Create a class named DataService in a file named DataService.cs and paste the following code. This code connects to your Atlas cluster and defines two methods:

  • The GetDocuments method gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field and do not yet have an embeddings field.

  • The AddEmbeddings async Task creates a new embeddings field on documents in the sample_airbnb.listingsAndReviews collection whose summary matches one of the summaries retrieved in the GetDocuments method.

DataService.cs
namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
private static readonly MongoClient Client = new MongoClient(ConnectionString);
private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
public List<BsonDocument>? GetDocuments()
{
var filter = Builders<BsonDocument>.Filter.And(
Builders<BsonDocument>.Filter.And(
Builders<BsonDocument>.Filter.Exists("summary", true),
Builders<BsonDocument>.Filter.Ne("summary", "")
),
Builders<BsonDocument>.Filter.Exists("embeddings", false)
);
return Collection.Find(filter).Limit(50).ToList();
}
public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
{
var listWrites = new List<WriteModel<BsonDocument>>();
foreach( var kvp in embeddings )
{
var filterForUpdate = Builders<BsonDocument>.Filter.Eq("summary", kvp.Key);
var updateDefinition = Builders<BsonDocument>.Update.Set("embeddings", kvp.Value);
listWrites.Add(new UpdateOneModel<BsonDocument>(filterForUpdate, updateDefinition));
}
var result = await Collection.BulkWriteAsync(listWrites);
listWrites.Clear();
return result.ModifiedCount;
}
}
2

Use the following code to generate embeddings from an existing collection in Atlas.

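Specifically, this code uses the GetDocuments method to retrieve documents, the GetEmbeddingsAsync function that you defined to generate an embedding from each document's summary, and the AddEmbeddings method to write the embeddings back to the sample_airbnb.listingsAndReviews collection in Atlas.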

Program.cs
using MyCompany.Embeddings;
var dataService = new DataService();
var documents = dataService.GetDocuments();
if (documents != null)
{
Console.WriteLine("Generating embeddings.");
var aiService = new AIService();
var summaries = new List<string>();
foreach (var document in documents)
{
var summary = document.GetValue("summary").ToString();
if (summary != null)
{
summaries.Add(summary);
}
}
try
{
if (summaries.Count > 0)
{
var embeddings = await aiService.GetEmbeddingsAsync(summaries.ToArray());
try
{
var updatedCount = await dataService.AddEmbeddings(embeddings);
Console.WriteLine($"{updatedCount} documents updated successfully.");
} catch (Exception e)
{
Console.WriteLine($"Error adding embeddings to MongoDB: {e.Message}");
}
}
}
catch (Exception e)
{
Console.WriteLine($"Error creating embeddings for summaries: {e.Message}");
}
}
else
{
Console.WriteLine("No documents found");
}
3
dotnet run MyCompany.Embeddings.csproj
Generating embeddings.
50 documents updated successfully.
1

Use the following code to generate embeddings from new data and store them in Atlas.

Specifically, this code uses the GetEmbeddings function that you defined and the MongoDB Go Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.

create-embeddings.go
package main
import (
"context"
"fmt"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
var data = []string{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
Text string
Embedding []float32
}
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(ctx, clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_db").Collection("embeddings")
embeddings := common.GetEmbeddings(data)
docsToInsert := make([]interface{}, len(embeddings))
for i, text := range data {
docsToInsert[i] = TextWithEmbedding{
Text: text,
Embedding: embeddings[i],
}
}
result, err := coll.InsertMany(ctx, docsToInsert)
if err != nil {
log.Fatalf("failed to insert documents: %v", err)
}
fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
create-embeddings.go
package main
import (
"context"
"fmt"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
var data = []string{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
Text string
Embedding []float64
}
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(ctx, clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_db").Collection("embeddings")
embeddings := common.GetEmbeddings(data)
docsToInsert := make([]interface{}, len(data))
for i, movie := range data {
docsToInsert[i] = TextWithEmbedding{
Text: movie,
Embedding: embeddings[i],
}
}
result, err := coll.InsertMany(ctx, docsToInsert)
if err != nil {
log.Fatalf("failed to insert documents: %v", err)
}
fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
2
go run create-embeddings.go
Successfully inserted 3 documents into Atlas

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from our sample data, but you can adapt the code to work with any collection in your cluster.

1

Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:

  • Connects to your Atlas cluster.

  • Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.

  • Generates embeddings from each document's summary field by using the GetEmbeddings function that you defined.

  • Updates each document with a new embeddings field that contains the embedding value by using the MongoDB Go Driver.

create-embeddings.go
package main
import (
"context"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/bson"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(ctx, clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
filter := bson.D{
{"$and",
bson.A{
bson.D{
{"$and",
bson.A{
bson.D{{"summary", bson.D{{"$exists", true}}}},
bson.D{{"summary", bson.D{{"$ne", ""}}}},
},
}},
bson.D{{"embeddings", bson.D{{"$exists", false}}}},
}},
}
opts := options.Find().SetLimit(50)
cursor, err := coll.Find(ctx, filter, opts)
if err != nil {
log.Fatalf("failed to retrieve documents: %v", err)
}
var listings []common.Listing
if err = cursor.All(ctx, &listings); err != nil {
log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err)
}
var summaries []string
for _, listing := range listings {
summaries = append(summaries, listing.Summary)
}
log.Println("Generating embeddings.")
embeddings := common.GetEmbeddings(summaries)
docsToUpdate := make([]mongo.WriteModel, len(listings))
for i := range listings {
docsToUpdate[i] = mongo.NewUpdateOneModel().
SetFilter(bson.D{{"_id", listings[i].ID}}).
SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}})
}
bulkWriteOptions := options.BulkWrite().SetOrdered(false)
result, err := coll.BulkWrite(context.Background(), docsToUpdate, bulkWriteOptions)
if err != nil {
log.Fatalf("failed to write embeddings to existing documents: %v", err)
}
log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount)
}
2

To simplify marshalling and unmarshalling Go objects to and from BSON, create a file that contains models for the documents in this collection.

  1. Move into the common directory.

    cd common
  2. Create a file named models.go, and paste the following code into it:

    models.go
    package common
    import (
    "time"
    "go.mongodb.org/mongo-driver/bson/primitive"
    )
    type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL string `bson:"medium_url"`
    PictureURL string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
    }
    type Host struct {
    ID string `bson:"host_id"`
    URL string `bson:"host_url"`
    Name string `bson:"host_name"`
    Location string `bson:"host_location"`
    About string `bson:"host_about"`
    ThumbnailURL string `bson:"host_thumbnail_url"`
    PictureURL string `bson:"host_picture_url"`
    Neighborhood string `bson:"host_neighborhood"`
    IsSuperhost bool `bson:"host_is_superhost"`
    HasProfilePic bool `bson:"host_has_profile_pic"`
    IdentityVerified bool `bson:"host_identity_verified"`
    ListingsCount int32 `bson:"host_listings_count"`
    TotalListingsCount int32 `bson:"host_total_listings_count"`
    Verifications []string `bson:"host_verifications"`
    }
    type Location struct {
    Type string `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
    IsLocationExact bool `bson:"is_location_exact"`
    }
    type Address struct {
    Street string `bson:"street"`
    Suburb string `bson:"suburb"`
    GovernmentArea string `bson:"government_area"`
    Market string `bson:"market"`
    Country string `bson:"country"`
    CountryCode string `bson:"country_code"`
    Location Location `bson:"location"`
    }
    type Availability struct {
    Thirty int32 `bson:"availability_30"`
    Sixty int32 `bson:"availability_60"`
    Ninety int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
    }
    type ReviewScores struct {
    Accuracy int32 `bson:"review_scores_accuracy"`
    Cleanliness int32 `bson:"review_scores_cleanliness"`
    CheckIn int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location int32 `bson:"review_scores_location"`
    Value int32 `bson:"review_scores_value"`
    Rating int32 `bson:"review_scores_rating"`
    }
    type Review struct {
    ID string `bson:"_id"`
    Date time.Time `bson:"date,omitempty"`
    ListingId string `bson:"listing_id"`
    ReviewerId string `bson:"reviewer_id"`
    ReviewerName string `bson:"reviewer_name"`
    Comments string `bson:"comments"`
    }
    type Listing struct {
    ID string `bson:"_id"`
    ListingURL string `bson:"listing_url"`
    Name string `bson:"name"`
    Summary string `bson:"summary"`
    Space string `bson:"space"`
    Description string `bson:"description"`
    NeighborhoodOverview string `bson:"neighborhood_overview"`
    Notes string `bson:"notes"`
    Transit string `bson:"transit"`
    Access string `bson:"access"`
    Interaction string `bson:"interaction"`
    HouseRules string `bson:"house_rules"`
    PropertyType string `bson:"property_type"`
    RoomType string `bson:"room_type"`
    BedType string `bson:"bed_type"`
    MinimumNights string `bson:"minimum_nights"`
    MaximumNights string `bson:"maximum_nights"`
    CancellationPolicy string `bson:"cancellation_policy"`
    LastScraped time.Time `bson:"last_scraped,omitempty"`
    CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
    FirstReview time.Time `bson:"first_review,omitempty"`
    LastReview time.Time `bson:"last_review,omitempty"`
    Accommodates int32 `bson:"accommodates"`
    Bedrooms int32 `bson:"bedrooms"`
    Beds int32 `bson:"beds"`
    NumberOfReviews int32 `bson:"number_of_reviews"`
    Bathrooms primitive.Decimal128 `bson:"bathrooms"`
    Amenities []string `bson:"amenities"`
    Price primitive.Decimal128 `bson:"price"`
    WeeklyPrice primitive.Decimal128 `bson:"weekly_price"`
    MonthlyPrice primitive.Decimal128 `bson:"monthly_price"`
    CleaningFee primitive.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople primitive.Decimal128 `bson:"extra_people"`
    GuestsIncluded primitive.Decimal128 `bson:"guests_included"`
    Image Image `bson:"images"`
    Host Host `bson:"host"`
    Address Address `bson:"address"`
    Availability Availability `bson:"availability"`
    ReviewScores ReviewScores `bson:"review_scores"`
    Reviews []Review `bson:"reviews"`
    Embeddings []float32 `bson:"embeddings,omitempty"`
    }
    models.go
    package common
    import (
    "time"
    "go.mongodb.org/mongo-driver/bson/primitive"
    )
    type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL string `bson:"medium_url"`
    PictureURL string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
    }
    type Host struct {
    ID string `bson:"host_id"`
    URL string `bson:"host_url"`
    Name string `bson:"host_name"`
    Location string `bson:"host_location"`
    About string `bson:"host_about"`
    ThumbnailURL string `bson:"host_thumbnail_url"`
    PictureURL string `bson:"host_picture_url"`
    Neighborhood string `bson:"host_neighborhood"`
    IsSuperhost bool `bson:"host_is_superhost"`
    HasProfilePic bool `bson:"host_has_profile_pic"`
    IdentityVerified bool `bson:"host_identity_verified"`
    ListingsCount int32 `bson:"host_listings_count"`
    TotalListingsCount int32 `bson:"host_total_listings_count"`
    Verifications []string `bson:"host_verifications"`
    }
    type Location struct {
    Type string `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
    IsLocationExact bool `bson:"is_location_exact"`
    }
    type Address struct {
    Street string `bson:"street"`
    Suburb string `bson:"suburb"`
    GovernmentArea string `bson:"government_area"`
    Market string `bson:"market"`
    Country string `bson:"country"`
    CountryCode string `bson:"country_code"`
    Location Location `bson:"location"`
    }
    type Availability struct {
    Thirty int32 `bson:"availability_30"`
    Sixty int32 `bson:"availability_60"`
    Ninety int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
    }
    type ReviewScores struct {
    Accuracy int32 `bson:"review_scores_accuracy"`
    Cleanliness int32 `bson:"review_scores_cleanliness"`
    CheckIn int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location int32 `bson:"review_scores_location"`
    Value int32 `bson:"review_scores_value"`
    Rating int32 `bson:"review_scores_rating"`
    }
    type Review struct {
    ID string `bson:"_id"`
    Date time.Time `bson:"date,omitempty"`
    ListingId string `bson:"listing_id"`
    ReviewerId string `bson:"reviewer_id"`
    ReviewerName string `bson:"reviewer_name"`
    Comments string `bson:"comments"`
    }
    type Listing struct {
    ID string `bson:"_id"`
    ListingURL string `bson:"listing_url"`
    Name string `bson:"name"`
    Summary string `bson:"summary"`
    Space string `bson:"space"`
    Description string `bson:"description"`
    NeighborhoodOverview string `bson:"neighborhood_overview"`
    Notes string `bson:"notes"`
    Transit string `bson:"transit"`
    Access string `bson:"access"`
    Interaction string `bson:"interaction"`
    HouseRules string `bson:"house_rules"`
    PropertyType string `bson:"property_type"`
    RoomType string `bson:"room_type"`
    BedType string `bson:"bed_type"`
    MinimumNights string `bson:"minimum_nights"`
    MaximumNights string `bson:"maximum_nights"`
    CancellationPolicy string `bson:"cancellation_policy"`
    LastScraped time.Time `bson:"last_scraped,omitempty"`
    CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
    FirstReview time.Time `bson:"first_review,omitempty"`
    LastReview time.Time `bson:"last_review,omitempty"`
    Accommodates int32 `bson:"accommodates"`
    Bedrooms int32 `bson:"bedrooms"`
    Beds int32 `bson:"beds"`
    NumberOfReviews int32 `bson:"number_of_reviews"`
    Bathrooms primitive.Decimal128 `bson:"bathrooms"`
    Amenities []string `bson:"amenities"`
    Price primitive.Decimal128 `bson:"price"`
    WeeklyPrice primitive.Decimal128 `bson:"weekly_price"`
    MonthlyPrice primitive.Decimal128 `bson:"monthly_price"`
    CleaningFee primitive.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople primitive.Decimal128 `bson:"extra_people"`
    GuestsIncluded primitive.Decimal128 `bson:"guests_included"`
    Image Image `bson:"images"`
    Host Host `bson:"host"`
    Address Address `bson:"address"`
    Availability Availability `bson:"availability"`
    ReviewScores ReviewScores `bson:"review_scores"`
    Reviews []Review `bson:"reviews"`
    Embeddings []float64 `bson:"embeddings,omitempty"`
    }
  3. Move back into the project root directory.

    cd ../
3
go run create-embeddings.go
2024/10/10 09:58:03 Generating embeddings.
2024/10/10 09:58:12 Successfully added embeddings to 50 documents

You can view your vector embeddings as they generate by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI.

1

Create a file named CreateEmbeddings.java and paste the following code.

This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:

  1. Connect to your Atlas cluster.

  2. Get the array of sample texts.

  3. Generate embeddings from each text using the getEmbeddings method that you defined previously.

  4. Ingest the embeddings into the sample_db.embeddings collection in Atlas.

CreateEmbeddings.java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.result.InsertManyResult;
import org.bson.BsonArray;
import org.bson.Document;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class CreateEmbeddings {
static List<String> data = Arrays.asList(
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
);
public static void main(String[] args){
String uri = System.getenv("ATLAS_CONNECTION_STRING");
if (uri == null || uri.isEmpty()) {
throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
}
// establish connection and set namespace
try (MongoClient mongoClient = MongoClients.create(uri)) {
MongoDatabase database = mongoClient.getDatabase("sample_db");
MongoCollection<Document> collection = database.getCollection("embeddings");
System.out.println("Creating embeddings for " + data.size() + " documents");
EmbeddingProvider embeddingProvider = new EmbeddingProvider();
// generate embeddings for new inputted data
List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data);
List<Document> documents = new ArrayList<>();
int i = 0;
for (String text : data) {
Document doc = new Document("text", text).append("embedding", embeddings.get(i));
documents.add(doc);
i++;
}
// insert the embeddings into the Atlas collection
List<String> insertedIds = new ArrayList<>();
try {
InsertManyResult result = collection.insertMany(documents);
result.getInsertedIds().values()
.forEach(doc -> insertedIds.add(doc.toString()));
System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + insertedIds);
} catch (MongoException me) {
throw new RuntimeException("Failed to insert documents", me);
}
} catch (MongoException me) {
throw new RuntimeException("Failed to connect to MongoDB ", me);
} catch (Exception e) {
throw new RuntimeException("Operation failed: ", e);
}
}
}
2

Save and run the file. The output resembles:

Creating embeddings for 3 documents
Inserted 3 documents with the following ids to sample_db.embeddings collection:
[BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from our sample data, but you can adapt the code to work with any collection in your cluster.

1

Create a file named CreateEmbeddings.java and paste the following code.

This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:

  1. Connect to your Atlas cluster.

  2. Get a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.

  3. Generate embeddings from each document's summary field using the getEmbeddings method that you defined previously.

  4. Update each document with a new embeddings field that contains the embedding value.

CreateEmbeddings.java
import com.mongodb.MongoException;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonArray;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.List;
public class CreateEmbeddings {
public static void main(String[] args){
String uri = System.getenv("ATLAS_CONNECTION_STRING");
if (uri == null || uri.isEmpty()) {
throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
}
// establish connection and set namespace
try (MongoClient mongoClient = MongoClients.create(uri)) {
MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
Bson filterCriteria = Filters.and(
Filters.and(Filters.exists("summary"),
Filters.ne("summary", null),
Filters.ne("summary", "")),
Filters.exists("embeddings", false));
try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) {
List<String> summaries = new ArrayList<>();
List<String> documentIds = new ArrayList<>();
int i = 0;
while (cursor.hasNext()) {
Document document = cursor.next();
String summary = document.getString("summary");
String id = document.get("_id").toString();
summaries.add(summary);
documentIds.add(id);
i++;
}
System.out.println("Generating embeddings for " + summaries.size() + " documents.");
System.out.println("This operation may take up to several minutes.");
EmbeddingProvider embeddingProvider = new EmbeddingProvider();
List<BsonArray> embeddings = embeddingProvider.getEmbeddings(summaries);
List<WriteModel<Document>> updateDocuments = new ArrayList<>();
for (int j = 0; j < summaries.size(); j++) {
UpdateOneModel<Document> updateDoc = new UpdateOneModel<>(
Filters.eq("_id", documentIds.get(j)),
Updates.set("embeddings", embeddings.get(j)));
updateDocuments.add(updateDoc);
}
int updatedDocsCount = 0;
try {
BulkWriteOptions options = new BulkWriteOptions().ordered(false);
BulkWriteResult result = collection.bulkWrite(updateDocuments, options);
updatedDocsCount = result.getModifiedCount();
} catch (MongoException me) {
throw new RuntimeException("Failed to update documents", me);
}
System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents.");
}
} catch (MongoException me) {
throw new RuntimeException("Failed to connect to MongoDB", me);
} catch (Exception e) {
throw new RuntimeException("Operation failed: ", e);
}
}
}
2

Save and run the file. The output resembles:

Generating embeddings for 50 documents.
This operation may take up to several minutes.
Added embeddings successfully to 50 documents.

You can also view your vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews collection in your cluster.

1

Use the following code to generate embeddings from new data and store them in Atlas.

Specifically, this code uses the getEmbedding function that you defined and the MongoDB Node.js Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.

create-embeddings.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// Data to embed
const data = [
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
async function run() {
// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
try {
await client.connect();
const db = client.db("sample_db");
const collection = db.collection("embeddings");
await Promise.all(data.map(async text => {
// Check if the document already exists
const existingDoc = await collection.findOne({ text: text });
// Skip texts that are already stored to avoid an unnecessary model call
if (!existingDoc) {
// Generate an embedding by using the function that you defined
const embedding = await getEmbedding(text);
// Ingest data and embedding into Atlas
await collection.insertOne({
text: text,
embedding: embedding
});
console.log(embedding);
}
}));
} catch (err) {
console.log(err.stack);
}
finally {
await client.close();
}
}
run().catch(console.dir);
2
node --env-file=.env create-embeddings.js
[ -0.04323853924870491, -0.008460805751383305, 0.012494648806750774, -0.013014335185289383, ... ]
[ -0.017400473356246948, 0.04922063276171684, -0.002836339408531785, -0.030395228415727615, ... ]
[ -0.016950927674770355, 0.013881809078156948, -0.022074559703469276, -0.02838018536567688, ... ]
node --env-file=.env create-embeddings.js
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ]
[ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ]
[ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]

Note

The number of dimensions in the output has been truncated for readability.

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from our sample data, but you can adapt the code to work with any collection in your cluster.

1

Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:

  • Connects to your Atlas cluster.

  • Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.

  • Generates embeddings from each document's summary field by using the getEmbedding function that you defined.

  • Updates each document with a new embedding field that contains the embedding value by using the MongoDB Node.js Driver.

create-embeddings.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';

// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
    try {
        await client.connect();
        const db = client.db("sample_airbnb");
        const collection = db.collection("listingsAndReviews");

        // Filter to exclude null or empty summary fields
        const filter = { "summary": { "$nin": [ null, "" ] } };

        // Get a subset of documents from the collection
        const documents = await collection.find(filter).limit(50).toArray();

        // Create embeddings from a field in the collection
        let updatedDocCount = 0;
        console.log("Generating embeddings for documents...");
        await Promise.all(documents.map(async doc => {
            // Generate an embedding by using the function that you defined
            const embedding = await getEmbedding(doc.summary);

            // Update the document with a new embedding field
            await collection.updateOne({ "_id": doc._id },
                {
                    "$set": {
                        "embedding": embedding
                    }
                }
            );
            updatedDocCount += 1;
        }));
        console.log("Count of documents updated: " + updatedDocCount);
    } catch (err) {
        console.log(err.stack);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
2
node --env-file=.env create-embeddings.js
Generating embeddings for documents...
Count of documents updated: 50

You can view your vector embeddings as they are generated by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.

1

If you haven't already defined the get_embedding, generate_bson_vector, and create_docs_with_bson_vector_embeddings functions in your notebook, see Define an Embedding Function to load these functions in your notebook.

2

Paste and run the following code in your notebook:

# Sample data
sentences = [
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
"Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.",
"The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.",
"Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.",
"Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.",
"The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.",
"Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.",
"The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.",
"Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.",
"The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.",
"Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.",
"The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.",
"E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.",
"Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.",
"Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.",
"Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.",
"Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.",
"Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory."
]
3

Use the following code to generate embeddings from new data.

Specifically, this code uses the get_embedding function that you defined to generate embeddings from the array of sample texts.

float32_embeddings = get_embedding(sentences, precision="float32")
int8_embeddings = get_embedding(sentences, precision="int8")
int1_embeddings = get_embedding(sentences, precision="ubinary")

# Print stored embeddings
print("Generated embeddings stored in different variables:")
for i, text in enumerate(sentences):
    print(f"\nText: {text}")
    print(f"Float32 Embedding: {float32_embeddings[i][:3]}... (truncated)")
    print(f"Int8 Embedding: {int8_embeddings[i][:3]}... (truncated)")
    print(f"Ubinary Embedding: {int1_embeddings[i][:3]}... (truncated)")
Generated embeddings stored in different variables:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Float32 Embedding: [-0.01089042 0.05926645 -0.00291325]... (truncated)
Int8 Embedding: [-15 127 56]... (truncated)
Ubinary Embedding: [ 77 30 209]... (truncated)
Text: The Lion King: Lion cub and future king Simba searches for his identity
Float32 Embedding: [-0.05607051 -0.01360618 0.00523855]... (truncated)
Int8 Embedding: [-128 -109 110]... (truncated)
Ubinary Embedding: [ 37 18 151]... (truncated)
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Float32 Embedding: [-0.0275258 0.01144342 -0.02360895]... (truncated)
Int8 Embedding: [-57 -28 -79]... (truncated)
Ubinary Embedding: [ 76 16 144]... (truncated)
Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.
Float32 Embedding: [-0.01759741 0.03254957 -0.02090798]... (truncated)
Int8 Embedding: [-32 40 -61]... (truncated)
Ubinary Embedding: [ 77 27 176]... (truncated)
Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.
Float32 Embedding: [ 0.00503172 0.04311579 -0.00074904]... (truncated)
Int8 Embedding: [23 74 70]... (truncated)
Ubinary Embedding: [215 26 145]... (truncated)
Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.
Float32 Embedding: [0.02349479 0.05669326 0.00458773]... (truncated)
Int8 Embedding: [ 69 118 105]... (truncated)
Ubinary Embedding: [237 154 159]... (truncated)
Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.
Float32 Embedding: [-0.03294644 0.02671233 -0.01864981]... (truncated)
Int8 Embedding: [-70 21 -47]... (truncated)
Ubinary Embedding: [ 77 90 146]... (truncated)
Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.
Float32 Embedding: [-0.02489671 0.02847196 -0.00290637]... (truncated)
Int8 Embedding: [-50 27 56]... (truncated)
Ubinary Embedding: [ 95 154 129]... (truncated)
Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.
Float32 Embedding: [-0.01235448 0.01524397 -0.01063425]... (truncated)
Int8 Embedding: [-19 -15 5]... (truncated)
Ubinary Embedding: [ 68 26 210]... (truncated)
Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.
Float32 Embedding: [ 0.04665203 0.01392298 -0.01743002]... (truncated)
Int8 Embedding: [127 -20 -39]... (truncated)
Ubinary Embedding: [207 88 208]... (truncated)
Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.
Float32 Embedding: [0.00929601 0.04206405 0.00701248]... (truncated)
Int8 Embedding: [ 34 71 121]... (truncated)
Ubinary Embedding: [228 90 130]... (truncated)
Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.
Float32 Embedding: [-0.01451324 -0.00897367 0.0077793 ]... (truncated)
Int8 Embedding: [-24 -94 127]... (truncated)
Ubinary Embedding: [ 57 150 32]... (truncated)
Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.
Float32 Embedding: [-0.01458643 0.03639758 -0.02587282]... (truncated)
Int8 Embedding: [-25 52 -94]... (truncated)
Ubinary Embedding: [ 78 218 216]... (truncated)
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.
Float32 Embedding: [-0.00205381 -0.00039482 -0.01630799]... (truncated)
Int8 Embedding: [ 6 -66 -31]... (truncated)
Ubinary Embedding: [ 9 82 154]... (truncated)
Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.
Float32 Embedding: [ 0.01105334 0.00776658 -0.03092942]... (truncated)
Int8 Embedding: [ 38 -40 -128]... (truncated)
Ubinary Embedding: [205 24 146]... (truncated)
Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.
Float32 Embedding: [ 0.00266668 -0.01926583 -0.00727963]... (truncated)
Int8 Embedding: [ 17 -128 27]... (truncated)
Ubinary Embedding: [148 82 194]... (truncated)
Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.
Float32 Embedding: [-0.00031873 -0.01352339 -0.02882693]... (truncated)
Int8 Embedding: [ 10 -109 -114]... (truncated)
Ubinary Embedding: [ 12 26 144]... (truncated)
Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.
Float32 Embedding: [ 0.00957429 0.01855557 -0.02353773]... (truncated)
Int8 Embedding: [ 34 -5 -79]... (truncated)
Ubinary Embedding: [212 18 144]... (truncated)
Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.
Float32 Embedding: [-0.01787405 0.03672816 -0.00972007]... (truncated)
Int8 Embedding: [-33 53 11]... (truncated)
Ubinary Embedding: [ 68 154 145]... (truncated)
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory.
Float32 Embedding: [-0.03515214 -0.00503571 0.00183181]... (truncated)
Int8 Embedding: [-76 -81 87]... (truncated)
Ubinary Embedding: [ 35 222 152]... (truncated)
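The ubinary values above are consistent with sign-based binary quantization: each dimension contributes one bit (1 if the float32 value is positive, 0 otherwise), and every eight bits are packed into one byte, most significant bit first. For example, the first three Titanic values are negative, positive, and negative, which matches the leading bits of 77 (01001101 in binary). The following is a minimal pure-Python sketch of that packing scheme. The pack_sign_bits function and the sample vector are hypothetical, and the sketch illustrates the general technique rather than the exact implementation behind get_embedding.

```python
def pack_sign_bits(vector):
    """Pack the sign of each value into bits, eight dimensions per byte (MSB first)."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift the bits collected so far and append 1 for positive values
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


# Hypothetical 8-dimensional vector; real embeddings have hundreds of dimensions
example = [-0.01, 0.06, -0.003, 0.02, 0.5, 0.7, -0.1, 0.3]
print(pack_sign_bits(example))  # prints [93], which is 0b01011101
```

Because eight dimensions fit in one byte, a packed vector needs 32 times less storage than the same vector stored as float32 values, at the cost of keeping only the sign of each dimension.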
4

Use the following code to convert the generated vector embeddings to BSON vectors.

Specifically, this code uses the generate_bson_vector function that you defined to convert the float32, int8, and bit-packed int1 embeddings to BSON vectors of the corresponding FLOAT32, INT8, and PACKED_BIT types.

from bson.binary import BinaryVectorDtype

bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []

# Convert each embedding to BSON
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))

# Print the embeddings
for idx, text in enumerate(sentences):
    print(f"\nText: {text}")
    print(f"Float32 BSON: {bson_float32_embeddings[idx]}")
    print(f"Int8 BSON: {bson_int8_embeddings[idx]}")
    print(f"Int1 BSON: {bson_int1_embeddings[idx]}")
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Float32 BSON: b'\'\x00\xbam2\xbc`\xc1r=7\xec>\xbb\xe6\xf3\x...'
Int8 BSON: b'\x03\x00\xf1\x7f8\xdf\xfeC\x1e>\xef\xd6\xf5\x9...'
Int1 BSON: b'\x10\x00M\x1e\xd1\xd2\x05\xaeq\xdf\x9a\x1d\xbc...'
Text: The Lion King: Lion cub and future king Simba searches for his identity
Float32 BSON: b'\'\x001\xaae\xbdr\xec^\xbc"\xa8\xab;\x91\xd...'
Int8 BSON: b'\x03\x00\x80\x93n\x06\x80\xca\xd3.\xa2\xe3\xd1...'
Int1 BSON: b'\x10\x00%\x12\x97\xa6\x8f\xdf\x89\x9d2\xcb\x99...'
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Float32 BSON: b'\'\x00\xcc}\xe1\xbc-};<\x8eg\xc1\xbc\xcb\xd...'
Int8 BSON: b'\x03\x00\xc7\xe4\xb1\xdf/\xe2\xd2\x90\xf7\x02|...'
Int1 BSON: b'\x10\x00L\x10\x90\xb6\x0f\x8a\x91\xaf\x92|\xf9...'
Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.
Float32 BSON: b'\'\x00o(\x90\xbc\xb3R\x05=8G\xab\xbc\xfb\xc...'
Int8 BSON: b'\x03\x00\xe0(\xc3\x10*\xda\xfe\x19\xbf&<\xd1\x...'
Int1 BSON: b'\x10\x00M\x1b\xb0\x86\rn\x93\xaf:w\x9f}\x92\xd...'
Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.
Float32 BSON: b'\'\x00\x1d\xe1\xa4;0\x9a0=C[D\xba\xb5\xf2\x...'
Int8 BSON: b'\x03\x00\x17JF2\xb9\xddZ8\xa1\x0c\xc6\x80\xd8$...'
Int1 BSON: b'\x10\x00\xd7\x1a\x91\x87\x0e\xc9\x91\x8b\xba\x...'
Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.
Float32 BSON: b'\'\x00#x\xc0<27h=\xb5T\x96;:\xc4\x9c\xbd\x1...'
Int8 BSON: b'\x03\x00Evi\x80\x13\xd6\x1cCW\x80\x01\x9e\xe58...'
Int1 BSON: b'\x10\x00\xed\x9a\x9f\x97\x1f.\x12\xf9\xba];\x7...'
Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.
Float32 BSON: b'\'\x00\xd9\xf2\x06\xbd\xd2\xd3\xda<\x7f\xc7...'
Int8 BSON: b'\x03\x00\xba\x15\xd1-\x0c\x03\xe6\xea\rQ\x1f\x...'
Int1 BSON: b'\x10\x00MZ\x92\xb7#\xaa\x99=\x9a\x99\x9c|<\xf8...'
Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.
Float32 BSON: b'\'\x00/\xf4\xcb\xbc\t>\xe9<\xc9x>\xbb\xcc\x...'
Int8 BSON: b'\x03\x00\xce\x1b815\xcf1\xc6s\xe5\n\xe4\x192G\...'
Int1 BSON: b'\x10\x00_\x9a\x81\xa6\x0f\x0f\x93o2\xd8\xfe|\x...'
Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.
Float32 BSON: b'\'\x00sjJ\xbc\xd6\xc1y<I;.\xbc\xb1\x80\t\xb...'
Int8 BSON: b'\x03\x00\xed\xf1\x05\xe2\xc7\xfa\xd4\xab5\xeb\...'
Int1 BSON: b'\x10\x00D\x1a\xd2\x86\x0ey\x92\x8f\xaa\x89\x1c...'
Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.
Float32 BSON: b'\'\x004\x16?=9\x1dd<g\xc9\x8e\xbc\xdf\x81\x...'
Int8 BSON: b'\x03\x00\x7f\xec\xd9\xdc)\xd6)\x05\x18\x7f\xa6...'
Int1 BSON: b"\x10\x00\xcfX\xd0\xb7\x0e\xcf\xd9\r\xf0U\xb4]6..."
Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.
Float32 BSON: b'\'\x00HN\x18<ZK,=\xf9\xc8\xe5;\x9e\xed\xa0\...'
Int8 BSON: b'\x03\x00"Gy\x01\xeb\xec\xfc\x80\xe4a\x7f\x88\x...'
Int1 BSON: b'\x10\x00\xe4Z\x82\xb6\xad\xec\x10-\x9a\x99;?j\...'
Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.
Float32 BSON: b'\'\x00\xef\xc8m\xbcJ\x06\x13\xbcv\xe9\xfe;...'
Int8 BSON: b'\x03\x00\xe8\xa2\x7fIE\xba\x9f\xfaT2\xf1\xc1\...'
Int1 BSON: b'\x10\x009\x96 \xb7\x8e\xc9\x81\xaf\xaa\x9f\xa...'
Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.
Float32 BSON: b'\'\x00\xee\xfbn\xbc\xa0\x15\x15=<\xf3\xd3x...'
Int8 BSON: b'\x03\x00\xe74\xa2\xe5\x15\x165\xb9dM8C\xd7E\x...'
Int1 BSON: b'\x10\x00N\xda\xd8\xb6\x03N\x98\xbd\xdaY\x1b| ...'
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.
Float32 BSON: b'\'\x002\x99\x06\xbb\x82\x00\xcf\xb9X\x98\x...'
Int8 BSON: b'\x03\x00\x06\xbe\xe1.\x7f\x80\x04C\xd7e\x80\x...'
Int1 BSON: b'\x10\x00\tR\x9a\xd6\x0c\xb1\x9a\xbc\x90\xf5\x...'
Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.
Float32 BSON: b'\'\x00\x14\x195<\xd4~\xfe;\xb3_\xfd\xbc \xe...'
Int8 BSON: b'\x03\x00&\xd8\x80\x92\x01\x7f\xbfF\xd4\x10\xf0...'
Int1 BSON: b'\x10\x00\xcd\x18\x92\x92\x8dJ\x92\xbd\x9a\xd3\...'
Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.
Float32 BSON: b'\'\x00\x8a\xc3.;_\xd3\x9d\xbc\xf2\x89\xee\x...'
Int8 BSON: b'\x03\x00\x11\x80\x1b5\xe9\x19\x80\x8f\xb1N\xda...'
Int1 BSON: b"\x10\x00\x94R\xc2\xd2\x0f\xfa\x90\xbc\xd8\xd6\..."
Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.
Float32 BSON: b'\'\x00\xe6\x1a\xa7\xb94\x91]\xbcs&\xec\xbc\...'
Int8 BSON: b'\x03\x00\n\x93\x8e,n\xce\xe8\x9b@\x00\xf9\x7f\...'
Int1 BSON: b'\x10\x00\x0c\x1a\x90\x97\x0f\x19\x80/\xba\x98\...'
Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.
Float32 BSON: b'\'\x00\x7f\xdd\x1c<\xd9\x01\x98<1\xd2\xc0\xb...'
Int8 BSON: b'\x03\x00"\xfb\xb1\x7f\xd3\xd6\x04\xbe\x80\xf9L\...'
Int1 BSON: b'\x10\x00\xd4\x12\x90\xa6\x8by\x99\x8d\xa2\xbd\x...'
Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.
Float32 BSON: b'\'\x00\x98l\x92\xbcDp\x16=\xf0@\x1f\xbc\xd0\...'
Int8 BSON: b'\x03\x00\xdf5\x0b\xe3\xbf\xe5\xa5\xad\x7f\x02\x...'
Int1 BSON: b'\x10\x00D\x9a\x91\x96\x07\xfa\x93\x8d\xb2D\x92]...'
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory.
Float32 BSON: b'\'\x00\xb0\xfb\x0f\xbd\x9b\x02\xa5\xbbZ\x19\...'
Int8 BSON: b'\x03\x00\xb4\xafW\xd9\xd7\xc3\x7f~QM\x86\x83\xf...'
Int1 BSON: b'\x10\x00#\xde\x98\x96\x0e\xcc\x12\xf6\xbb\xdd2}...'
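Each BSON vector printed above begins with a two-byte header: a dtype code followed by a padding count. In this output, \' (0x27) marks float32, \x03 marks int8, and \x10 marks packed-bit vectors, and the bytes after the header hold the values themselves. The following sketch decodes the first bytes of the Titanic vectors with Python's struct module. The describe_bson_vector helper is hypothetical, and the dtype codes and little-endian float layout reflect our reading of the BSON binary vector format, so treat this as an illustration rather than a replacement for the driver's own decoding.

```python
import struct

# Dtype codes as we understand them from the BSON binary vector format (assumption)
DTYPE_NAMES = {0x03: "int8", 0x10: "packed_bit", 0x27: "float32"}


def describe_bson_vector(payload):
    """Split a BSON vector payload into (dtype name, padding, decoded values)."""
    dtype_code, padding = payload[0], payload[1]
    body = payload[2:]
    dtype = DTYPE_NAMES.get(dtype_code, "unknown")
    if dtype == "float32":
        # Four bytes per value, little-endian; ignore any trailing partial value
        count = len(body) // 4
        values = list(struct.unpack(f"<{count}f", body[:count * 4]))
    elif dtype == "int8":
        # One signed byte per value
        values = list(struct.unpack(f"<{len(body)}b", body))
    else:
        # Packed bits (or unknown): report the raw unsigned bytes
        values = list(body)
    return dtype, padding, values


# Leading bytes copied from the Titanic output above
print(describe_bson_vector(b"\x03\x00\xf1\x7f8"))   # ('int8', 0, [-15, 127, 56])
print(describe_bson_vector(b"\x10\x00M\x1e\xd1"))   # ('packed_bit', 0, [77, 30, 209])
print(describe_bson_vector(b"'\x00\xbam2\xbc"))     # float32, first value near -0.0109
```

The decoded int8 and packed-bit values match the first three entries that the earlier step printed for the same text, which is a quick way to confirm that the conversion preserved your embeddings.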
5

Use the following code to create documents with the BSON vector embeddings. The code uses the create_docs_with_bson_vector_embeddings function to create the documents.

# Create BSON documents
docs = create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, sentences)
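As a rough picture of what this step produces, the sketch below pairs each text with its three vectors and assigns a position-based _id. The build_docs function and all field names are hypothetical and are not the definition of create_docs_with_bson_vector_embeddings, which you load from Define an Embedding Function. The sketch only shows the general shape of one document per input text.

```python
def build_docs(float32_vectors, int8_vectors, int1_vectors, texts):
    """Return one document per text; field names are illustrative only."""
    lengths = {len(float32_vectors), len(int8_vectors), len(int1_vectors), len(texts)}
    if len(lengths) != 1:
        raise ValueError("all inputs must contain the same number of items")
    docs = []
    for i, (f32, i8, i1, text) in enumerate(
        zip(float32_vectors, int8_vectors, int1_vectors, texts)
    ):
        docs.append({
            "_id": i,                  # position-based identifier (assumption)
            "text": text,
            "embedding_float32": f32,
            "embedding_int8": i8,
            "embedding_int1": i1,
        })
    return docs


# Placeholder byte strings stand in for real BSON vectors
sample_docs = build_docs([b"f0", b"f1"], [b"i0", b"i1"], [b"b0", b"b1"], ["first", "second"])
print(len(sample_docs), sample_docs[1]["_id"], sample_docs[1]["text"])  # 2 1 second
```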
6

Paste the following code in your notebook, replace <connection-string> with your Atlas cluster's SRV connection string, and run the code.

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]
# Ingest data into Atlas
collection.insert_many(docs)
InsertManyResult([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], acknowledged=True)

You can verify your vector embeddings by viewing them in the Atlas UI inside the sample_db.embeddings namespace in your cluster.

1

Use the following code to generate embeddings from new data.

Specifically, this code uses the get_embedding function that you defined and the PyMongo driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection.

import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]

# Sample data
data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

# Ingest data into Atlas
inserted_doc_count = 0
for text in data:
    embedding = get_embedding(text)
    collection.insert_one({ "text": text, "embedding": embedding })
    inserted_doc_count += 1

print(f"Inserted {inserted_doc_count} documents.")
Inserted 3 documents.
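Running the loop above a second time inserts the same three texts again, because nothing checks whether a text is already stored. One way to make ingestion repeatable is to skip texts that already exist and embed only the new ones. The following sketch shows that idea in plain Python. The plan_inserts and fake_embedding names are hypothetical, fake_embedding is a stand-in for get_embedding, and in practice the existing texts might come from a call such as collection.distinct("text").

```python
def plan_inserts(texts, existing_texts, embed):
    """Build documents only for texts that are not stored yet."""
    seen = set(existing_texts)
    docs = []
    for text in texts:
        if text in seen:
            continue  # already stored or repeated in the input, so skip the embedding call
        seen.add(text)
        docs.append({"text": text, "embedding": embed(text)})
    return docs


def fake_embedding(text):
    # Placeholder for get_embedding; returns a tiny deterministic vector
    return [float(len(text)), 0.0]


already_stored = ["The Lion King: Lion cub and future king Simba searches for his identity"]
incoming = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
]
new_docs = plan_inserts(incoming, already_stored, fake_embedding)
print(len(new_docs))  # prints 1, because only the Titanic text is new
```

The resulting list could then be passed to collection.insert_many when it is not empty, which also avoids paying for embeddings that you already have.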
2

Replace <connection-string> with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
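If your database username or password contains reserved characters such as @, :, or /, those characters must be percent-encoded before they go into the connection string. The following sketch builds a string in the format shown above by using quote_plus from the Python standard library. The build_srv_uri helper, the credentials, and the host name are hypothetical placeholders.

```python
from urllib.parse import quote_plus


def build_srv_uri(username, password, host):
    """Return an SRV connection string with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical credentials and host, for illustration only
uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)  # mongodb+srv://app_user:p%40ss%3Aword%2F1@cluster0.example.mongodb.net
```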
3

To verify your vector embeddings, navigate to the sample_db.embeddings collection in your cluster in the Atlas UI.

1

If you haven't already defined the get_embedding and generate_bson_vector functions in your notebook, see Define an Embedding Function to load these functions in your notebook.