How to Create Vector Embeddings

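You can store vector embeddings alongside your other data in Atlas. These embeddings capture meaningful relationships in your data and allow you to perform semantic search and implement RAG with Atlas Vector Search.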

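Get Started

Use the following tutorial to learn how to create vector embeddings and query them using Atlas Vector Search. Specifically, you perform the following actions:

Define a function that uses an embedding model to generate vector embeddings.
Create embeddings from your data and store them in Atlas.
Create embeddings from your search terms and run a vector search query.

For production applications, you typically write a script to generate vector embeddings. You can start with the sample code on this page and customize it for your use case.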

Tip

You can also work with a runnable version of this tutorial as a Python notebook.

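Prerequisites

To complete this tutorial, you must have the following:

An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your C# project.
.NET 8.0 or later installed.
A Hugging Face access token or OpenAI API key.

An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Go project.
Go installed.
A Hugging Face access token or OpenAI API key.

An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

Java Development Kit (JDK) version 8 or later.
An environment to set up and run a Java application. We recommend that you use an integrated development environment (IDE) such as IntelliJ IDEA or Eclipse IDE to configure Maven or Gradle to build and run your project.

One of the following:
- A Hugging Face access token with read access.
- An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.

An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Node.js project.
npm and Node.js installed.
If you're using OpenAI models, you must have an OpenAI API key.

An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An environment to run interactive Python notebooks, such as VS Code or Colab.
If you're using OpenAI models, you must have an OpenAI API key.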

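Define an Embedding Function

Initialize your .NET project.

In a terminal window, run the following commands to initialize your project:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings

Install and import dependencies.

In a terminal window, run the following command:

dotnet add package MongoDB.Driver

Set your environment variables.

Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and Hugging Face access token available to your project.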

export HUGGINGFACE_ACCESS_TOKEN="<access-token>"
export ATLAS_CONNECTION_STRING="<connection-string>"

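Replace the <access-token> placeholder value with your Hugging Face access token.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format: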

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
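If the database username or password contains reserved characters such as @, / or :, they generally need to be percent-encoded before they are placed in the connection string. The following is a minimal Python sketch of that step, not part of the tutorial code. The build_srv_uri helper, the sample credentials, and the host name are all hypothetical.

```python
from urllib.parse import quote_plus


def build_srv_uri(username: str, password: str, host: str) -> str:
    """Build an SRV connection string with percent-encoded credentials.

    Hypothetical helper for illustration; `host` is the part after the "@",
    for example "<clusterName>.<hostname>.mongodb.net".
    """
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Made-up credentials and host, used only to show the encoding.
print(build_srv_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net"))
```

Encoding the credentials separately from the host keeps the @ that divides the two parts unambiguous.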

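Define a function to generate vector embeddings.

Create a new class in a file of the same name, AIService.cs, and paste the following code. This code defines an async Task named GetEmbeddingsAsync that generates embeddings for an array of string inputs and returns each embedding paired with the text that produced it. This function uses the mxbai-embed-large-v1 embedding model.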

AIService.cs

namespace MyCompany.Embeddings;
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using System.Net.Http.Headers;
public class AIService
{
    private static readonly string? HuggingFaceAccessToken = Environment.GetEnvironmentVariable("HUGGINGFACE_ACCESS_TOKEN");
    private static readonly HttpClient Client = new HttpClient();
    public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
    {
        const string modelName = "mixedbread-ai/mxbai-embed-large-v1";
        const string url = $"https://api-inference.huggingface.co/models/{modelName}";
        Client.DefaultRequestHeaders.Authorization 
            = new AuthenticationHeaderValue("Bearer", HuggingFaceAccessToken);
        var data = new { inputs = texts };
        var dataJson = JsonSerializer.Serialize(data);
        var content = new StringContent(dataJson,null, "application/json");
        var response = await Client.PostAsync(url, content);
        response.EnsureSuccessStatusCode();
        var responseString = await response.Content.ReadAsStringAsync();
        var embeddings = JsonSerializer.Deserialize<float[][]>(responseString);
        if (embeddings is null)
        {
            throw new ApplicationException("Failed to deserialize embeddings response to an array of floats.");
        }
        Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
        var embeddingCount = embeddings.Length;
        foreach (var value in Enumerable.Range(0, embeddingCount))
        {
            // Pair each embedding with the text used to generate it.
            documentData[texts[value]] = embeddings[value];
        }
        return documentData;
    }
}

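Note

503 errors when calling Hugging Face models

You might occasionally get 503 errors when calling models from the Hugging Face model hub. To resolve this issue, retry after a short delay.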
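One way to automate the retry is a small wrapper that waits a little longer after each failed attempt. The sketch below is in Python and is only illustrative: the ServiceUnavailable exception and the with_retry helper are hypothetical names, and the attempt count and delays are arbitrary assumptions rather than recommended values.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised by the caller when the API returns HTTP 503."""


def with_retry(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying on ServiceUnavailable with exponential backoff.

    Waits base_delay, then 2x, then 4x, and so on between attempts.
    Re-raises the error if the final attempt also fails.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter is injectable so that the backoff schedule can be checked without actually waiting.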

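Initialize your .NET project.

In a terminal window, run the following commands to initialize your project:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings

Install and import dependencies.

In a terminal window, run the following commands:

dotnet add package MongoDB.Driver
dotnet add package OpenAI

Set your environment variables.

Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and OpenAI API key available to your project.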

export OPENAI_API_KEY="<api-key>"
export ATLAS_CONNECTION_STRING="<connection-string>"

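Replace the <api-key> placeholder value with your OpenAI API key.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

Define a function to generate vector embeddings.

Create a new class in a file of the same name, AIService.cs, and paste the following code. This code defines an async Task named GetEmbeddingsAsync that generates embeddings for an array of string inputs and returns each embedding paired with the text that produced it. This function uses OpenAI's text-embedding-3-small model.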

AIService.cs

namespace MyCompany.Embeddings;
using OpenAI.Embeddings;
using System;
using System.Threading.Tasks;
public class AIService
{
    private static readonly string? OpenAIApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
    private static readonly string EmbeddingModelName = "text-embedding-3-small";
    public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
    {
        EmbeddingClient embeddingClient = new(model: EmbeddingModelName, apiKey: OpenAIApiKey);
        Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
        try
        {
            var result = await embeddingClient.GenerateEmbeddingsAsync(texts);
            var embeddingCount = result.Value.Count;
            foreach (var index in Enumerable.Range(0, embeddingCount))
            {
                // Pair each embedding with the text used to generate it.
                documentData[texts[index]] = result.Value[index].ToFloats().ToArray();
            }
        }
        catch (Exception e)
        {
            throw new ApplicationException(e.Message);
        }
        return documentData;
    }
}

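In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as one from OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.

Initialize your Go project.

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project: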

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project

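Install and import dependencies.

In a terminal window, run the following commands:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/tmc/langchaingo/llms

Create a `.env` file to manage secrets.

In your project, create a .env file to store your Atlas connection string and Hugging Face access token.

HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

Define a function to generate vector embeddings.

Create a directory in your project called common to store common code that you use in later steps:
```
mkdir common && cd common
```

Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings to generate embeddings for the given inputs. This function specifies:

The feature-extraction task using the Go port of the LangChain library. To learn more, see the Tasks documentation in the LangChain JavaScript documentation.
The mxbai-embed-large-v1 embedding model.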

get-embeddings.go

package common
import (
	"context"
	"log"
	"github.com/tmc/langchaingo/embeddings/huggingface"
)
func GetEmbeddings(documents []string) [][]float32 {
	hf, err := huggingface.NewHuggingface(
		huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
		huggingface.WithTask("feature-extraction"))
	if err != nil {
		log.Fatalf("failed to connect to Hugging Face: %v", err)
	}
	embs, err := hf.EmbedDocuments(context.Background(), documents)
	if err != nil {
		log.Fatalf("failed to generate embeddings: %v", err)
	}
	return embs
}

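Note

503 errors when calling Hugging Face models

You might occasionally get 503 errors when calling models from the Hugging Face model hub. To resolve this issue, retry after a short delay.

Move back into the main project root directory.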
```
cd ../
```

Initialize your Go project.

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project

Install and import dependencies.

In a terminal window, run the following commands:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/milosgajdos/go-embeddings/openai

Create a `.env` file to manage secrets.

In your project, create a .env file to store your Atlas connection string and OpenAI API key.

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

Define a function to generate vector embeddings.

Create a directory in your project called common to store code that you use in multiple steps:
```
mkdir common && cd common
```

Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that uses OpenAI's text-embedding-3-small model to generate embeddings for the given inputs.

get-embeddings.go

package common
import (
	"context"
	"log"
	"github.com/milosgajdos/go-embeddings/openai"
)
func GetEmbeddings(docs []string) [][]float64 {
	c := openai.NewClient()
	embReq := &openai.EmbeddingRequest{
		Input:          docs,
		Model:          openai.TextSmallV3,
		EncodingFormat: openai.EncodingFloat,
	}
	embs, err := c.Embed(context.Background(), embReq)
	if err != nil {
		log.Fatalf("failed to connect to OpenAI: %v", err)
	}
	var vectors [][]float64
	for _, emb := range embs {
		vectors = append(vectors, emb.Vector)
	}
	return vectors
}

Move back into the main project root directory.
```
cd ../
```

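In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as one from OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.

Create your Java project and install dependencies.

From your IDE, create a Java project using Maven or Gradle.

Add the following dependencies, depending on your package manager:

If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file: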

pom.xml

<dependencies>
   <!-- MongoDB Java Sync Driver v5.2.0 or later -->
   <dependency>
      <groupId>org.mongodb</groupId>
      <artifactId>mongodb-driver-sync</artifactId>
      <version>[5.2.0,)</version>
   </dependency>
   <!-- Java library for working with Hugging Face models -->
   <dependency>
      <groupId>dev.langchain4j</groupId>
      <artifactId>langchain4j-hugging-face</artifactId>
      <version>0.35.0</version>
   </dependency>
</dependencies>

If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

build.gradle

dependencies {
   // MongoDB Java Sync Driver v5.2.0 or later
   implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
   // Java library for working with Hugging Face models
   implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0'
}

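Run your package manager to install the dependencies to your project.

Set your environment variables.

Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.

Environment variables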

HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>

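Update the placeholders with the following values:

Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format: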
```
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
```

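Define methods to generate vector embeddings.

Create a file named EmbeddingProvider.java and paste the following code.

This code defines two methods that generate embeddings for a given input by using the mxbai-embed-large-v1 open-source embedding model:

Multiple inputs: The getEmbeddings method accepts a list of text inputs (List<String>), which allows you to create multiple embeddings in a single API call. The method converts the arrays of floats returned by the API to BSON arrays of doubles for storage in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query that you want to run against your vector data. The method converts the array of floats returned by the API to a BSON array of doubles to use when querying your collection.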

EmbeddingProvider.java

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
    private static HuggingFaceEmbeddingModel embeddingModel;
    private static HuggingFaceEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
            if (accessToken == null || accessToken.isEmpty()) {
                throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
            }
            embeddingModel = HuggingFaceEmbeddingModel.builder()
                    .accessToken(accessToken)
                    .modelId("mixedbread-ai/mxbai-embed-large-v1")
                    .waitForModel(true)
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }
    /**
     * Takes an array of strings and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();
        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }
    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}

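Create your Java project and install dependencies.

From your IDE, create a Java project using Maven or Gradle.

Add the following dependencies, depending on your package manager:

If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file: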

pom.xml

<dependencies>
   <!-- MongoDB Java Sync Driver v5.2.0 or later -->
   <dependency>
      <groupId>org.mongodb</groupId>
      <artifactId>mongodb-driver-sync</artifactId>
      <version>[5.2.0,)</version>
   </dependency>
   <!-- Java library for working with OpenAI models -->
   <dependency>
      <groupId>dev.langchain4j</groupId>
      <artifactId>langchain4j-open-ai</artifactId>
      <version>0.35.0</version>
   </dependency>
</dependencies>

If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

build.gradle

dependencies {
   // MongoDB Java Sync Driver v5.2.0 or later
   implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
   // Java library for working with OpenAI models
   implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0'
}

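Run your package manager to install the dependencies to your project.

Set your environment variables.

Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.

Environment variables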

OPEN_AI_API_KEY=<api-key>
ATLAS_CONNECTION_STRING=<connection-string>

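Update the placeholders with the following values:

Replace the <api-key> placeholder value with your OpenAI API key.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format: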
```
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
```

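Define methods to generate vector embeddings.

Create a file named EmbeddingProvider.java and paste the following code.

This code defines two methods that generate embeddings for a given input by using OpenAI's text-embedding-3-small embedding model:

Multiple inputs: The getEmbeddings method accepts a list of text inputs (List<String>), which allows you to create multiple embeddings in a single API call. The method converts the arrays of floats returned by the API to BSON arrays of doubles for storage in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query that you want to run against your vector data. The method converts the array of floats returned by the API to a BSON array of doubles to use when querying your collection.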

EmbeddingProvider.java

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
    private static OpenAiEmbeddingModel embeddingModel;
    private static OpenAiEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String apiKey = System.getenv("OPEN_AI_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty.");
            }
            embeddingModel = OpenAiEmbeddingModel.builder()
                    .apiKey(apiKey)
                    .modelName("text-embedding-3-small")
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }
    /**
     * Takes an array of strings and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();
        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }
    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}

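In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as one from OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.

Initialize your Node.js project.

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project: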

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y

Update your `package.json` file.

Configure your project to use ES modules by adding "type": "module" to your package.json file, and then save it.

{
  "type": "module",
  // other fields...
}

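Install and import dependencies.

In a terminal window, run the following command:

npm install mongodb @xenova/transformers

Create a `.env` file.

In your project, create a .env file to store your Atlas connection string.

ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: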

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

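Note

Minimum Node.js version requirements

Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.

Define a function to generate vector embeddings.

Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding to generate an embedding for a given input. This function specifies:

The feature-extraction task from Hugging Face's transformers.js library. To learn more, see Tasks.
The nomic-embed-text-v1 embedding model.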

get-embeddings.js

import { pipeline } from '@xenova/transformers';
// Function to generate embeddings for a given data source
export async function getEmbedding(data) {
    const embedder = await pipeline(
        'feature-extraction', 
        'Xenova/nomic-embed-text-v1');
    const results = await embedder(data, { pooling: 'mean', normalize: true });
    return Array.from(results.data);
}
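The pooling: 'mean' and normalize: true options in the call above reduce the per-token vectors produced by the model to a single unit-length vector. The following plain-Python sketch illustrates that idea under the assumption that mean pooling is a simple average over tokens followed by L2 normalization. The mean_pool_and_normalize helper is hypothetical and ignores details such as attention masks that a real library may apply.

```python
import math


def mean_pool_and_normalize(token_vectors):
    """Average per-token vectors, then scale the result to unit length."""
    count = len(token_vectors)
    dims = len(token_vectors[0])
    # Mean pooling: average each dimension across all tokens.
    pooled = [sum(vec[d] for vec in token_vectors) / count for d in range(dims)]
    # L2 normalization: divide by the Euclidean length (skip for a zero vector).
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


# Two made-up 2-dimensional token vectors.
print(mean_pool_and_normalize([[1.0, 0.0], [0.0, 1.0]]))
```

Unit-length vectors make cosine similarity and dot product rank results the same way, which is one reason normalization is commonly applied before storing embeddings.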

Initialize your Node.js project.

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y

Update your `package.json` file.

Configure your project to use ES modules by adding "type": "module" to your package.json file, and then save it.

{
  "type": "module",
  // other fields...
}

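Install and import dependencies.

In a terminal window, run the following command:

npm install mongodb openai

Create a `.env` file.

In your project, create a .env file to store your Atlas connection string and OpenAI API key.

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format: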

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

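Note

Minimum Node.js version requirements

Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.

Define a function to generate vector embeddings.

Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.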

get-embeddings.js

import OpenAI from 'openai';
// Setup OpenAI configuration
const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY});
// Function to get the embeddings using the OpenAI API
export async function getEmbedding(text) {
    const results = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
        encoding_format: "float",
    });
    return results.data[0].embedding;
}

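In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model from Nomic or a proprietary model from OpenAI.

The open-source example also includes a function that converts your embeddings to BSON binData vectors for efficient processing. Only certain embedding models support byte vector outputs. For models that don't (such as OpenAI's models), enable automatic quantization when you create the Atlas Vector Search index.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.

Set up the environment.

Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install the dependencies:

pip install --quiet --upgrade sentence-transformers pymongo einops

Define a function to generate vector embeddings.

Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:

Loads the nomic-embed-text-v1 embedding model.
Creates a function named get_embedding that uses the model to generate float32 (the default precision), int8, or int1 embeddings for a given text input.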

from sentence_transformers import SentenceTransformer
# Load the embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
# Define a function to generate embeddings in multiple precisions
def get_embedding(data, precision="float32"):
    return model.encode(data, precision=precision)

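Define a function to convert vector embeddings.

Paste and run the following code in your notebook to create a function named generate_bson_vector that converts the full-fidelity embeddings to BSON float32, int8, and int1 vector subtypes for efficient processing of your vector data.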

from bson.binary import Binary
# Generate BSON vector using `BinaryVectorDtype`
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

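Define a function to create documents with the embeddings.

Paste and run the following code in your notebook to create a function named create_docs_with_bson_vector_embeddings that builds the documents with embeddings, which you then ingest into Atlas.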

# Function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32, bson_int8, bson_int1, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32, bson_int8, bson_int1, data)):
        doc = {
            "_id": i,
            "data": text,
            "BSON-Float32-Embedding": bson_f32_emb,
            "BSON-Int8-Embedding": bson_int8_emb,
            "BSON-Int1-Embedding": bson_int1_emb,
        }
        docs.append(doc)
    return docs

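Test the function to generate embeddings.

Paste and run the following code in your notebook to test the get_embedding function.

This code generates float32, int8, and int1 embeddings for the strings foo and bar.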

# Example generating embeddings for the strings "foo" and "bar"
data = ["foo", "bar"]
float32_embeddings = get_embedding(data, "float32")
int8_embeddings = get_embedding(data, "int8")
int1_embeddings = get_embedding(data, "ubinary")
print("Float32 Embedding:", float32_embeddings)
print("Int8 Embedding:", int8_embeddings)
print("Int1 Embedding (binary representation):", int1_embeddings)

Float32 Embedding: [
   [-0.02980827  0.03841474 -0.02561123 ... -0.0532876
   -0.0335409 -0.02591543]
   [-0.02748881  0.03717749 -0.03104552 ...  0.02413219 -0.02402252 0.02810651]
]
Int8 Embedding: [
   [-128  127  127 ... -128 -128 -128]
   [ 126 -128 -128 ...  127  126  127]
]
Int1 Embedding (binary representation): [
   [ 77  30   4 131  15 123 146 ... 159 142 205  23 119 120]
   [ 79  82 208 180  45  79 209 ... 158 100 141 189 166 173]
]
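The int1 output above holds values between 0 and 255 because each byte packs eight one-bit dimensions. The following plain-Python sketch shows one common way to produce such bytes, assuming that a dimension maps to 1 when its value is greater than zero and that bits are packed most significant bit first. The pack_binary helper is hypothetical, and the model library may differ in its details.

```python
def pack_binary(vector):
    """Quantize a float vector to bits (value > 0 becomes 1) and pack 8 bits per byte."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift in one bit per dimension, most significant bit first.
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


# Made-up 8-dimensional vector; its bits are 1,0,1,1,0,0,1,1.
print(pack_binary([0.1, -0.2, 0.3, 0.4, -0.5, -0.6, 0.7, 0.8]))
```

Under this scheme, a vector with N dimensions shrinks to N / 8 bytes, at the cost of keeping only the sign of each dimension.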

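Test the function to convert embeddings to BSON vectors.

Paste and run the following code in your notebook to test the generate_bson_vector function.

This code converts the float32, int8, and int1 embeddings that you generated for the strings foo and bar to BSON vectors.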

from bson.binary import BinaryVectorDtype
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
# Print the embeddings
print(f"The converted bson_float32_new_embedding is: {bson_float32_embeddings}")
print(f"The converted bson_int8_new_embedding is: {bson_int8_embeddings}")
print(f"The converted bson_int1_new_embedding is: {bson_int1_embeddings}")

The converted bson_float32_new_embedding is: [Binary(b'\'\x00x0\xf4\ ... x9bL\xd4\xbc', 9), Binary(b'\'\x007 ... \x9e?\xe6<', 9)]
The converted bson_int8_new_embedding is: [Binary(b'\x03\x00\x80\x7f\ ... x80\x80', 9), Binary(b'\x03\x00~\x80 ... \x7f', 9)]
The converted bson_int1_new_embedding is: [Binary(b'\x10\x00M\x1e\ ... 7wx', 9), Binary(b'\x10\x00OR\ ... \xa6\xad', 9)]
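Each Binary value in the output above begins with two bytes before the vector data: a byte that identifies the element type and a padding byte. The following Python sketch reconstructs that layout with the standard struct module. The dtype byte values are read off the output above, and the little-endian float32 encoding is an assumption. The vector_payload helper is hypothetical and only illustrates the format; in practice, use Binary.from_vector as shown earlier.

```python
import struct

# Leading dtype bytes as they appear in the output above (assumed mapping).
FLOAT32, INT8, PACKED_BIT = 0x27, 0x03, 0x10


def vector_payload(dtype, values, padding=0):
    """Build the raw bytes of a vector: dtype byte, padding byte, then the data."""
    header = bytes([dtype, padding])
    if dtype == FLOAT32:
        # Assumed: 4-byte little-endian floats.
        return header + struct.pack(f"<{len(values)}f", *values)
    if dtype == INT8:
        # One signed byte per dimension.
        return header + struct.pack(f"<{len(values)}b", *values)
    if dtype == PACKED_BIT:
        # One unsigned byte per group of 8 one-bit dimensions.
        return header + bytes(values)
    raise ValueError(f"unsupported dtype: {dtype:#x}")


# The first two int8 values from the earlier output.
print(vector_payload(INT8, [-128, 127]))
```

The first two int1 bytes printed earlier, 77 and 30, correspond to M and \x1e in the converted output, which is consistent with this layout.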

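Set up the environment.

Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install the dependencies:

pip install --quiet --upgrade openai pymongo

Define a function to generate vector embeddings.

Paste and run the following code in your notebook to create a function that generates vector embeddings by using a proprietary embedding model from OpenAI. Replace <api-key> with your OpenAI API key. This code does the following:

Specifies the text-embedding-3-small embedding model.
Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.
Tests the function by generating a single embedding for the string foo.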

import os
from openai import OpenAI
# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()
# Define a function to generate embeddings
def get_embedding(text):
   """Generates vector embeddings for the given text."""
   embedding = openai_client.embeddings.create(input = [text], model=model).data[0].embedding
   return embedding
# Generate an embedding
get_embedding("foo")

[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]

Tip

See also:

For API details and a list of available models, see the OpenAI documentation.

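Create Embeddings from Data

In this section, you create vector embeddings from your data by using the function that you defined, and then you store these embeddings in a collection in Atlas.

Select a tab based on whether you want to create embeddings from new data or from existing data that you already have in Atlas.

Define a `DataService` class.

Create a new class in a file of the same name, DataService.cs, and paste the following code. This code defines an async Task named AddDocumentsAsync that adds documents to Atlas. This function uses the Collection.InsertManyAsync() C# Driver method to insert a list of BsonDocument objects. Each document contains:

A text field that contains the movie summary.
An embedding field that contains the array of floats generated as the vector embedding.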

DataService.cs

namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
        var documents = new List<BsonDocument>();
        foreach( KeyValuePair<string, float[]> var in embeddings )
        {
            var document = new BsonDocument
            {
                {
                    "text", var.Key
                },
                {
                    "embedding", new BsonArray(var.Value)
                }
            };
            documents.Add(document);
        }
        await Collection.InsertManyAsync(documents);
        Console.WriteLine($"Successfully inserted {embeddings.Count} documents into Atlas");
        documents.Clear();
    }
}

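Update the `Program.cs` in your project.

Use the following code to generate embeddings from new data and store them in Atlas.

Specifically, this code uses the GetEmbeddingsAsync function that you defined to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.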

Program.cs

using MyCompany.Embeddings;
var aiService = new AIService();
var texts = new string[]
{
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
};
var embeddings = await aiService.GetEmbeddingsAsync(texts);
var dataService = new DataService();
await dataService.AddDocumentsAsync(embeddings);

Compile and run your project.

dotnet run MyCompany.Embeddings.csproj

Successfully inserted 3 documents into Atlas

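You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

Define a `DataService` class.

Create a new class in a file of the same name, DataService.cs, and paste the following code. This code does the following:

Connects to your Atlas cluster.
Defines a GetDocuments method that gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field and no embeddings field yet.
Defines an AddEmbeddings async Task that creates a new embeddings field on each document in the sample_airbnb.listingsAndReviews collection whose summary matches one of the summaries retrieved by the GetDocuments method.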

DataService.cs

namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.And(
                Builders<BsonDocument>.Filter.Exists("summary", true),
                Builders<BsonDocument>.Filter.Ne("summary", "")
            ),
            Builders<BsonDocument>.Filter.Exists("embeddings", false)
        );
        return Collection.Find(filter).Limit(50).ToList(); 
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
        var listWrites = new List<WriteModel<BsonDocument>>();
        foreach( var kvp in embeddings )
        {
            var filterForUpdate = Builders<BsonDocument>.Filter.Eq("summary", kvp.Key);
            var updateDefinition = Builders<BsonDocument>.Update.Set("embeddings", kvp.Value);
            listWrites.Add(new UpdateOneModel<BsonDocument>(filterForUpdate, updateDefinition));
        }
        var result = await Collection.BulkWriteAsync(listWrites);
        listWrites.Clear();
        return result.ModifiedCount;
    }
}
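The AddEmbeddings method above finds each document by its summary text. If two listings happen to share the same summary, keying the update on _id avoids that ambiguity. The following Python sketch builds such update specifications as plain dictionaries shaped like MongoDB filter and update documents. The build_update_specs helper is hypothetical, and it assumes the embeddings are returned in the same order as the documents.

```python
def build_update_specs(documents, embeddings):
    """Pair each document's _id with its embedding as a filter/update spec.

    Assumes embeddings[i] was generated from documents[i].
    """
    if len(documents) != len(embeddings):
        raise ValueError("documents and embeddings must have the same length")
    return [
        {
            "filter": {"_id": doc["_id"]},
            "update": {"$set": {"embeddings": list(emb)}},
        }
        for doc, emb in zip(documents, embeddings)
    ]


# Made-up documents that share a summary but have distinct _id values.
docs = [{"_id": "a1", "summary": "Cozy flat"}, {"_id": "b2", "summary": "Cozy flat"}]
specs = build_update_specs(docs, [[0.1, 0.2], [0.3, 0.4]])
print(specs)
```

Each spec could then be passed to a driver's bulk update call; that step is omitted here to keep the sketch free of external dependencies.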

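Update the `Program.cs` in your project.

Use the following code to generate embeddings from an existing collection in Atlas.

Specifically, this code uses the GetEmbeddingsAsync function that you defined to generate embeddings from the summary field of documents in the sample_airbnb.listingsAndReviews collection, and then adds each embedding to its document.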

Program.cs

using MyCompany.Embeddings;
var dataService = new DataService();
var documents = dataService.GetDocuments();
if (documents != null)
{
    Console.WriteLine("Generating embeddings.");
    var aiService = new AIService();
    var summaries = new List<string>();
    foreach (var document in documents)
    {
        var summary = document.GetValue("summary").ToString();
        if (summary != null)
        {
            summaries.Add(summary);
        }
    }
    
    try
    {
        if (summaries.Count > 0)
        {
            var embeddings = await aiService.GetEmbeddingsAsync(summaries.ToArray());
        
            try
            {
                var updatedCount = await dataService.AddEmbeddings(embeddings);
                Console.WriteLine($"{updatedCount} documents updated successfully.");
            } catch (Exception e)
            {
                Console.WriteLine($"Error adding embeddings to MongoDB: {e.Message}");
            }
        }
    }
    catch (Exception e)
    {
        Console.WriteLine($"Error creating embeddings for summaries: {e.Message}");
    }
}
else
{
    Console.WriteLine("No documents found");
}

Compile and run your project.

dotnet run MyCompany.Embeddings.csproj

Generating embeddings.
50 documents updated successfully.

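Create a file named `create-embeddings.go` and paste the following code.

Use the following code to generate embeddings from new data and store them in Atlas.

Specifically, this code uses the GetEmbeddings function that you defined and the MongoDB Go Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.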

create-embeddings.go

package main
import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
var data = []string{
	"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
	"The Lion King: Lion cub and future king Simba searches for his identity",
	"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
	Text      string
	Embedding []float32
}
func main() {
	ctx := context.Background()
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_db").Collection("embeddings")
	embeddings := common.GetEmbeddings(data)
	docsToInsert := make([]interface{}, len(embeddings))
	for i, text := range data {
		docsToInsert[i] = TextWithEmbedding{
			Text:      text,
			Embedding: embeddings[i],
		}
	}
	result, err := coll.InsertMany(ctx, docsToInsert)
	if err != nil {
		log.Fatalf("failed to insert documents: %v", err)
	}
	fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}

create-embeddings.go

package main
import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
var data = []string{
	"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
	"The Lion King: Lion cub and future king Simba searches for his identity",
	"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
	Text      string
	Embedding []float64
}
func main() {
	ctx := context.Background()
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_db").Collection("embeddings")
	embeddings := common.GetEmbeddings(data)
	docsToInsert := make([]interface{}, len(data))
	for i, movie := range data {
		docsToInsert[i] = TextWithEmbedding{
			Text:      movie,
			Embedding: embeddings[i],
		}
	}
	result, err := coll.InsertMany(ctx, docsToInsert)
	if err != nil {
		log.Fatalf("failed to insert documents: %v", err)
	}
	fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}

Save and run the file.

go run create-embeddings.go

Successfully inserted 3 documents into Atlas

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
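The insert step above pairs each text with the embedding at the same index. The following Python sketch illustrates that pairing with an explicit length check, so that a short or failed embedding response is caught before any write. The `fake_embed` function is a hypothetical stand-in for your embedding function and returns placeholder vectors, not real embeddings.

```python
from typing import Callable, Dict, List


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for an embedding function: returns one
    # placeholder 3-dimensional vector per input text.
    return [[float(len(t)), 0.0, 1.0] for t in texts]


def build_documents(
    texts: List[str],
    embed: Callable[[List[str]], List[List[float]]],
) -> List[Dict[str, object]]:
    embeddings = embed(texts)
    # Pairing by index is only safe when both lists have the same length.
    if len(embeddings) != len(texts):
        raise ValueError(
            f"expected {len(texts)} embeddings, got {len(embeddings)}"
        )
    return [
        {"text": text, "embedding": vector}
        for text, vector in zip(texts, embeddings)
    ]


docs = build_documents(["Titanic", "Avatar"], fake_embed)
print(len(docs))
```

Failing early on a length mismatch is a design choice for this sketch; it keeps partially embedded batches out of the collection.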

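Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

Create a file named `create-embeddings.go` and paste the following code.

Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:

Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates an embedding from each document's summary field by using the GetEmbeddings function that you defined.
Updates each document with a new embeddings field that contains the embedding value by using the MongoDB Go Driver.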

create-embeddings.go

package main
import (
	"context"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
func main() {
	ctx := context.Background()
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
	filter := bson.D{
		{"$and",
			bson.A{
				bson.D{
					{"$and",
						bson.A{
							bson.D{{"summary", bson.D{{"$exists", true}}}},
							bson.D{{"summary", bson.D{{"$ne", ""}}}},
						},
					}},
				bson.D{{"embeddings", bson.D{{"$exists", false}}}},
			}},
	}
	opts := options.Find().SetLimit(50)
	cursor, err := coll.Find(ctx, filter, opts)
	if err != nil {
		log.Fatalf("failed to retrieve documents: %v", err)
	}
	var listings []common.Listing
	if err = cursor.All(ctx, &listings); err != nil {
		log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err)
	}
	var summaries []string
	for _, listing := range listings {
		summaries = append(summaries, listing.Summary)
	}
	log.Println("Generating embeddings.")
	embeddings := common.GetEmbeddings(summaries)
	docsToUpdate := make([]mongo.WriteModel, len(listings))
	for i := range listings {
		docsToUpdate[i] = mongo.NewUpdateOneModel().
			SetFilter(bson.D{{"_id", listings[i].ID}}).
			SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}})
	}
	bulkWriteOptions := options.BulkWrite().SetOrdered(false)
	result, err := coll.BulkWrite(context.Background(), docsToUpdate, bulkWriteOptions)
	if err != nil {
		log.Fatalf("failed to write embeddings to existing documents: %v", err)
	}
	log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount)
}
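
The filter in the code above selects documents that have a summary field, whose summary is not an empty string, and that have no embeddings field yet. As a rough illustration, the following Python sketch expresses the same condition as a plain predicate over dictionaries. The `needs_embedding` helper is hypothetical and only approximates the server-side query semantics.

```python
from typing import Any, Dict, List


def needs_embedding(doc: Dict[str, Any]) -> bool:
    # Mirrors the three conditions of the filter: the summary field exists,
    # it is not the empty string, and no embeddings field is present yet.
    has_summary = "summary" in doc and doc["summary"] != ""
    return has_summary and "embeddings" not in doc


listings: List[Dict[str, Any]] = [
    {"_id": "1", "summary": "Cozy flat near the park"},
    {"_id": "2", "summary": ""},
    {"_id": "3"},
    {"_id": "4", "summary": "Loft downtown", "embeddings": [0.1, 0.2]},
]

selected = [doc["_id"] for doc in listings if needs_embedding(doc)]
print(selected)
```

Because documents that already carry an embeddings field are skipped, a second run with the same limit picks up the next batch instead of repeating the first one.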

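Create a file that contains Go models for the collection.

To simplify marshalling and unmarshalling Go objects to and from BSON, create a file that contains models for the documents in this collection.

Move into the common directory.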
```
cd common
```

Create a file named models.go and paste the following code into it:

models.go

package common
import (
	"time"
	"go.mongodb.org/mongo-driver/bson/primitive"
)
type Image struct {
	ThumbnailURL string `bson:"thumbnail_url"`
	MediumURL    string `bson:"medium_url"`
	PictureURL   string `bson:"picture_url"`
	XLPictureURL string `bson:"xl_picture_url"`
}
type Host struct {
	ID                 string   `bson:"host_id"`
	URL                string   `bson:"host_url"`
	Name               string   `bson:"host_name"`
	Location           string   `bson:"host_location"`
	About              string   `bson:"host_about"`
	ThumbnailURL       string   `bson:"host_thumbnail_url"`
	PictureURL         string   `bson:"host_picture_url"`
	Neighborhood       string   `bson:"host_neighborhood"`
	IsSuperhost        bool     `bson:"host_is_superhost"`
	HasProfilePic      bool     `bson:"host_has_profile_pic"`
	IdentityVerified   bool     `bson:"host_identity_verified"`
	ListingsCount      int32    `bson:"host_listings_count"`
	TotalListingsCount int32    `bson:"host_total_listings_count"`
	Verifications      []string `bson:"host_verifications"`
}
type Location struct {
	Type            string    `bson:"type"`
	Coordinates     []float64 `bson:"coordinates"`
	IsLocationExact bool      `bson:"is_location_exact"`
}
type Address struct {
	Street         string   `bson:"street"`
	Suburb         string   `bson:"suburb"`
	GovernmentArea string   `bson:"government_area"`
	Market         string   `bson:"market"`
	Country        string   `bson:"country"`
	CountryCode    string   `bson:"country_code"`
	Location       Location `bson:"location"`
}
type Availability struct {
	Thirty         int32 `bson:"availability_30"`
	Sixty          int32 `bson:"availability_60"`
	Ninety         int32 `bson:"availability_90"`
	ThreeSixtyFive int32 `bson:"availability_365"`
}
type ReviewScores struct {
	Accuracy      int32 `bson:"review_scores_accuracy"`
	Cleanliness   int32 `bson:"review_scores_cleanliness"`
	CheckIn       int32 `bson:"review_scores_checkin"`
	Communication int32 `bson:"review_scores_communication"`
	Location      int32 `bson:"review_scores_location"`
	Value         int32 `bson:"review_scores_value"`
	Rating        int32 `bson:"review_scores_rating"`
}
type Review struct {
	ID           string    `bson:"_id"`
	Date         time.Time `bson:"date,omitempty"`
	ListingId    string    `bson:"listing_id"`
	ReviewerId   string    `bson:"reviewer_id"`
	ReviewerName string    `bson:"reviewer_name"`
	Comments     string    `bson:"comments"`
}
type Listing struct {
	ID                   string               `bson:"_id"`
	ListingURL           string               `bson:"listing_url"`
	Name                 string               `bson:"name"`
	Summary              string               `bson:"summary"`
	Space                string               `bson:"space"`
	Description          string               `bson:"description"`
	NeighborhoodOverview string               `bson:"neighborhood_overview"`
	Notes                string               `bson:"notes"`
	Transit              string               `bson:"transit"`
	Access               string               `bson:"access"`
	Interaction          string               `bson:"interaction"`
	HouseRules           string               `bson:"house_rules"`
	PropertyType         string               `bson:"property_type"`
	RoomType             string               `bson:"room_type"`
	BedType              string               `bson:"bed_type"`
	MinimumNights        string               `bson:"minimum_nights"`
	MaximumNights        string               `bson:"maximum_nights"`
	CancellationPolicy   string               `bson:"cancellation_policy"`
	LastScraped          time.Time            `bson:"last_scraped,omitempty"`
	CalendarLastScraped  time.Time            `bson:"calendar_last_scraped,omitempty"`
	FirstReview          time.Time            `bson:"first_review,omitempty"`
	LastReview           time.Time            `bson:"last_review,omitempty"`
	Accommodates         int32                `bson:"accommodates"`
	Bedrooms             int32                `bson:"bedrooms"`
	Beds                 int32                `bson:"beds"`
	NumberOfReviews      int32                `bson:"number_of_reviews"`
	Bathrooms            primitive.Decimal128 `bson:"bathrooms"`
	Amenities            []string             `bson:"amenities"`
	Price                primitive.Decimal128 `bson:"price"`
	WeeklyPrice          primitive.Decimal128 `bson:"weekly_price"`
	MonthlyPrice         primitive.Decimal128 `bson:"monthly_price"`
	CleaningFee          primitive.Decimal128 `bson:"cleaning_fee"`
	ExtraPeople          primitive.Decimal128 `bson:"extra_people"`
	GuestsIncluded       primitive.Decimal128 `bson:"guests_included"`
	Image                Image                `bson:"images"`
	Host                 Host                 `bson:"host"`
	Address              Address              `bson:"address"`
	Availability         Availability         `bson:"availability"`
	ReviewScores         ReviewScores         `bson:"review_scores"`
	Reviews              []Review             `bson:"reviews"`
	Embeddings           []float32            `bson:"embeddings,omitempty"`
}

models.go

package common
import (
	"time"
	"go.mongodb.org/mongo-driver/bson/primitive"
)
type Image struct {
	ThumbnailURL string `bson:"thumbnail_url"`
	MediumURL    string `bson:"medium_url"`
	PictureURL   string `bson:"picture_url"`
	XLPictureURL string `bson:"xl_picture_url"`
}
type Host struct {
	ID                 string   `bson:"host_id"`
	URL                string   `bson:"host_url"`
	Name               string   `bson:"host_name"`
	Location           string   `bson:"host_location"`
	About              string   `bson:"host_about"`
	ThumbnailURL       string   `bson:"host_thumbnail_url"`
	PictureURL         string   `bson:"host_picture_url"`
	Neighborhood       string   `bson:"host_neighborhood"`
	IsSuperhost        bool     `bson:"host_is_superhost"`
	HasProfilePic      bool     `bson:"host_has_profile_pic"`
	IdentityVerified   bool     `bson:"host_identity_verified"`
	ListingsCount      int32    `bson:"host_listings_count"`
	TotalListingsCount int32    `bson:"host_total_listings_count"`
	Verifications      []string `bson:"host_verifications"`
}
type Location struct {
	Type            string    `bson:"type"`
	Coordinates     []float64 `bson:"coordinates"`
	IsLocationExact bool      `bson:"is_location_exact"`
}
type Address struct {
	Street         string   `bson:"street"`
	Suburb         string   `bson:"suburb"`
	GovernmentArea string   `bson:"government_area"`
	Market         string   `bson:"market"`
	Country        string   `bson:"country"`
	CountryCode    string   `bson:"country_code"`
	Location       Location `bson:"location"`
}
type Availability struct {
	Thirty         int32 `bson:"availability_30"`
	Sixty          int32 `bson:"availability_60"`
	Ninety         int32 `bson:"availability_90"`
	ThreeSixtyFive int32 `bson:"availability_365"`
}
type ReviewScores struct {
	Accuracy      int32 `bson:"review_scores_accuracy"`
	Cleanliness   int32 `bson:"review_scores_cleanliness"`
	CheckIn       int32 `bson:"review_scores_checkin"`
	Communication int32 `bson:"review_scores_communication"`
	Location      int32 `bson:"review_scores_location"`
	Value         int32 `bson:"review_scores_value"`
	Rating        int32 `bson:"review_scores_rating"`
}
type Review struct {
	ID           string    `bson:"_id"`
	Date         time.Time `bson:"date,omitempty"`
	ListingId    string    `bson:"listing_id"`
	ReviewerId   string    `bson:"reviewer_id"`
	ReviewerName string    `bson:"reviewer_name"`
	Comments     string    `bson:"comments"`
}
type Listing struct {
	ID                   string               `bson:"_id"`
	ListingURL           string               `bson:"listing_url"`
	Name                 string               `bson:"name"`
	Summary              string               `bson:"summary"`
	Space                string               `bson:"space"`
	Description          string               `bson:"description"`
	NeighborhoodOverview string               `bson:"neighborhood_overview"`
	Notes                string               `bson:"notes"`
	Transit              string               `bson:"transit"`
	Access               string               `bson:"access"`
	Interaction          string               `bson:"interaction"`
	HouseRules           string               `bson:"house_rules"`
	PropertyType         string               `bson:"property_type"`
	RoomType             string               `bson:"room_type"`
	BedType              string               `bson:"bed_type"`
	MinimumNights        string               `bson:"minimum_nights"`
	MaximumNights        string               `bson:"maximum_nights"`
	CancellationPolicy   string               `bson:"cancellation_policy"`
	LastScraped          time.Time            `bson:"last_scraped,omitempty"`
	CalendarLastScraped  time.Time            `bson:"calendar_last_scraped,omitempty"`
	FirstReview          time.Time            `bson:"first_review,omitempty"`
	LastReview           time.Time            `bson:"last_review,omitempty"`
	Accommodates         int32                `bson:"accommodates"`
	Bedrooms             int32                `bson:"bedrooms"`
	Beds                 int32                `bson:"beds"`
	NumberOfReviews      int32                `bson:"number_of_reviews"`
	Bathrooms            primitive.Decimal128 `bson:"bathrooms"`
	Amenities            []string             `bson:"amenities"`
	Price                primitive.Decimal128 `bson:"price"`
	WeeklyPrice          primitive.Decimal128 `bson:"weekly_price"`
	MonthlyPrice         primitive.Decimal128 `bson:"monthly_price"`
	CleaningFee          primitive.Decimal128 `bson:"cleaning_fee"`
	ExtraPeople          primitive.Decimal128 `bson:"extra_people"`
	GuestsIncluded       primitive.Decimal128 `bson:"guests_included"`
	Image                Image                `bson:"images"`
	Host                 Host                 `bson:"host"`
	Address              Address              `bson:"address"`
	Availability         Availability         `bson:"availability"`
	ReviewScores         ReviewScores         `bson:"review_scores"`
	Reviews              []Review             `bson:"reviews"`
	Embeddings           []float64            `bson:"embeddings,omitempty"`
}

Return to the project root directory.
```
cd ../
```

Generate the embeddings.

go run create-embeddings.go

2024/10/10 09:58:03 Generating embeddings.
2024/10/10 09:58:12 Successfully added embeddings to 50 documents

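You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI.

Define the code to generate embeddings from new data.

Create a file named CreateEmbeddings.java and paste the following code.

This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:

Connect to your Atlas cluster.
Get the array of sample texts.
Generate an embedding from each text by using the getEmbeddings method that you defined previously.
Ingest the embeddings into the sample_db.embeddings collection in Atlas.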

CreateEmbeddings.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.result.InsertManyResult;
import org.bson.BsonArray;
import org.bson.Document;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class CreateEmbeddings {
    static List<String> data = Arrays.asList(
            "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
            "The Lion King: Lion cub and future king Simba searches for his identity",
            "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
    );
    public static void main(String[] args){
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }
        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");
            System.out.println("Creating embeddings for " + data.size() + " documents");
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();
            // generate embeddings for new inputted data
            List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data);
            List<Document> documents = new ArrayList<>();
            int i = 0;
            for (String text : data) {
                Document doc = new Document("text", text).append("embedding", embeddings.get(i));
                documents.add(doc);
                i++;
            }
            // insert the embeddings into the Atlas collection
            List<String> insertedIds = new ArrayList<>();
            try {
                InsertManyResult result = collection.insertMany(documents);
                result.getInsertedIds().values()
                        .forEach(doc -> insertedIds.add(doc.toString()));
                System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + insertedIds);
            } catch (MongoException me) {
                throw new RuntimeException("Failed to insert documents", me);
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Generate the embeddings.

Save and run the file. The output resembles the following:

Creating embeddings for 3 documents
Inserted 3 documents with the following ids to sample_db.embeddings collection:
 [BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]

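You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

Define the code to generate embeddings from an existing collection in Atlas.

Create a file named CreateEmbeddings.java and paste the following code.

This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:

Connect to your Atlas cluster.
Get a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generate an embedding from each document's summary field by using the getEmbeddings method that you defined previously.
Update each document with a new embeddings field that contains the embedding value.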

CreateEmbeddings.java

import com.mongodb.MongoException;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonArray;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.List;
public class CreateEmbeddings {
    public static void main(String[] args){
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }
        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
            MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
            Bson filterCriteria = Filters.and(
                    Filters.and(Filters.exists("summary"),
                            Filters.ne("summary", null),
                            Filters.ne("summary", "")),
                    Filters.exists("embeddings", false));
            try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) {
                List<String> summaries = new ArrayList<>();
                List<String> documentIds = new ArrayList<>();
                while (cursor.hasNext()) {
                    Document document = cursor.next();
                    String summary = document.getString("summary");
                    String id = document.get("_id").toString();
                    summaries.add(summary);
                    documentIds.add(id);
                }
                System.out.println("Generating embeddings for " + summaries.size() + " documents.");
                System.out.println("This operation may take up to several minutes.");
                EmbeddingProvider embeddingProvider = new EmbeddingProvider();
                List<BsonArray> embeddings = embeddingProvider.getEmbeddings(summaries);
                List<WriteModel<Document>> updateDocuments = new ArrayList<>();
                for (int j = 0; j < summaries.size(); j++) {
                    UpdateOneModel<Document> updateDoc = new UpdateOneModel<>(
                            Filters.eq("_id", documentIds.get(j)),
                            Updates.set("embeddings", embeddings.get(j)));
                    updateDocuments.add(updateDoc);
                }
                int updatedDocsCount = 0;
                try {
                    BulkWriteOptions options = new BulkWriteOptions().ordered(false);
                    BulkWriteResult result = collection.bulkWrite(updateDocuments, options);
                    updatedDocsCount = result.getModifiedCount();
                } catch (MongoException me) {
                    throw new RuntimeException("Failed to update documents", me);
                }
                System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents.");
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Generate the embeddings.

Save and run the file. The output resembles the following:

Generating embeddings for 50 documents.
This operation may take up to several minutes.
Added embeddings successfully to 50 documents.

You can also view your vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews collection in your cluster.
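
The update filter in the code above matches on the string form of each `_id`. That works for this collection because its `_id` values are strings, but it would match nothing in a collection whose `_id` values are another type, such as ObjectId. The following Python sketch illustrates the pitfall with plain dictionaries; the `FakeObjectId` class and the `count_matches` helper are hypothetical stand-ins, not driver APIs.

```python
from typing import Any, Dict, List


class FakeObjectId:
    # Hypothetical stand-in for a non-string identifier type.
    def __init__(self, value: str) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, FakeObjectId) and other.value == self.value

    def __hash__(self) -> int:
        return hash(self.value)

    def __str__(self) -> str:
        return self.value


def count_matches(collection: List[Dict[str, Any]], ids: List[Any]) -> int:
    # Type-sensitive equality: a string never equals a FakeObjectId.
    return sum(1 for doc in collection for i in ids if doc["_id"] == i)


collection = [{"_id": FakeObjectId("a1")}, {"_id": FakeObjectId("b2")}]

as_strings = [str(doc["_id"]) for doc in collection]
as_original = [doc["_id"] for doc in collection]

print(count_matches(collection, as_strings))   # string ids match nothing
print(count_matches(collection, as_original))  # original values match
```

Keeping the original `_id` value rather than its string form avoids this mismatch when you adapt the example to other collections.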

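Create a file named `create-embeddings.js` and paste the following code.

Use the following code to generate embeddings from new data.

Specifically, this code uses the getEmbedding function that you defined and the MongoDB Node.js Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.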

create-embeddings.js

import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// Data to embed
const data = [ 
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
async function run() {
    // Connect to your Atlas cluster
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    
    try {
        await client.connect();
        const db = client.db("sample_db");
        const collection = db.collection("embeddings");
        console.log("Generating embeddings and inserting documents...");
        const insertDocuments = [];
        await Promise.all(data.map(async text => {
            // Skip texts that are already stored in the collection
            const existingDoc = await collection.findOne({ text: text });
            if (existingDoc) {
                return;
            }
            // Generate an embedding using the function that you defined
            const embedding = await getEmbedding(text);
            // Add the document with the embedding to array of documents for bulk insert
            insertDocuments.push({
                text: text,
                embedding: embedding
            });
            console.log(embedding);
        }));
        // Avoid calling insertMany with an empty batch
        if (insertDocuments.length === 0) {
            console.log("No new documents to insert.");
            return;
        }
        // Continue processing documents if an error occurs during an operation
        const options = { ordered: false };
        // Insert documents with embeddings into Atlas
        const result = await collection.insertMany(insertDocuments, options);
        console.log("Count of documents inserted: " + result.insertedCount);
    } catch (err) {
        console.log(err.stack);
    }
    finally {
        await client.close();
    }
}
run().catch(console.dir);

Save and run the file.

node --env-file=.env create-embeddings.js

Generating embeddings and inserting documents...
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ]
[ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ]
[ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]
Count of documents inserted: 3

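Note

The number of dimensions in the output has been truncated for readability.

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.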
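
The script above checks whether each text already exists before inserting it, so rerunning it does not create duplicates. The following Python sketch mirrors that idea against an in-memory list and makes the empty-batch case explicit, since an insert with no documents may be rejected by a driver. The `embed` function and the in-memory `store` are hypothetical stand-ins.

```python
from typing import Dict, List


def embed(text: str) -> List[float]:
    # Hypothetical stand-in: returns a placeholder vector, not a real embedding.
    return [float(len(text)), 0.0]


def insert_new_texts(store: List[Dict[str, object]], texts: List[str]) -> int:
    existing = {doc["text"] for doc in store}
    # Only embed texts that are not stored yet.
    new_docs = [
        {"text": t, "embedding": embed(t)} for t in texts if t not in existing
    ]
    if not new_docs:
        return 0  # nothing to insert, so skip the write entirely
    store.extend(new_docs)
    return len(new_docs)


store: List[Dict[str, object]] = []
first = insert_new_texts(store, ["Titanic", "Avatar"])
second = insert_new_texts(store, ["Titanic", "Avatar"])
print(first, second)
```

Checking for existing texts before embedding also avoids paying for embedding calls whose results would be discarded.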

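Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

Create a file named `create-embeddings.js` and paste the following code.

Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:

Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates an embedding from each document's summary field by using the getEmbedding function that you defined.
Updates each document with a new embedding field that contains the embedding value by using the MongoDB Node.js Driver.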

create-embeddings.js

import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
async function run() {
    // Connect to your Atlas cluster
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    try {
        await client.connect();
        const db = client.db("sample_airbnb");
        const collection = db.collection("listingsAndReviews");
        // Filter to exclude null or empty summary fields
        const filter = { "summary": { "$nin": [ null, "" ] } };
        // Get a subset of documents from the collection
        const documents = await collection.find(filter).limit(50).toArray();
        console.log("Generating embeddings and updating documents...");
        const updateDocuments = [];
        await Promise.all(documents.map(async doc => {
            // Generate an embedding using the function that you defined
            const embedding = await getEmbedding(doc.summary);
            // Add the embedding to an array of update operations
            updateDocuments.push(
                {
                    updateOne: { 
                        filter: { "_id": doc._id },
                        update: { $set: { "embedding": embedding } }
                    }
                }
            )
        }));
        // Continue processing documents if an error occurs during an operation
        const options = { ordered: false };
        // Update documents with the new embedding field
        const result = await collection.bulkWrite(updateDocuments, options);
        console.log("Count of documents updated: " + result.modifiedCount);
    } catch (err) {
        console.log(err.stack);
    }
    finally {
        await client.close();
    }
}
run().catch(console.dir);

Save and run the file.

node --env-file=.env create-embeddings.js

Generating embeddings and updating documents...
Count of documents updated: 50

You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.
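
The script above generates embeddings concurrently by awaiting all requests together. A comparable pattern in Python is `asyncio.gather`, which returns results in the same order as its inputs. The sketch below uses a hypothetical `embed` coroutine that simulates latency; it is not a real embedding call.

```python
import asyncio
from typing import Dict, List


async def embed(text: str) -> List[float]:
    # Hypothetical stand-in: simulate a network call with a short delay.
    await asyncio.sleep(0.01)
    return [float(len(text))]


async def build_updates(docs: List[Dict[str, str]]) -> List[Dict[str, object]]:
    # Run all embedding requests concurrently; results keep input order.
    vectors = await asyncio.gather(*(embed(d["summary"]) for d in docs))
    return [
        {
            "updateOne": {
                "filter": {"_id": d["_id"]},
                "update": {"$set": {"embedding": v}},
            }
        }
        for d, v in zip(docs, vectors)
    ]


docs = [{"_id": "1", "summary": "Cozy flat"}, {"_id": "2", "summary": "Loft"}]
updates = asyncio.run(build_updates(docs))
print(len(updates))
```

For large collections you would likely want to cap the number of concurrent requests to stay within your embedding provider's rate limits.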

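(Conditional) Define the functions to generate BSON embeddings.

If you haven't already defined the get_embedding, generate_bson_vector, and create_docs_with_bson_vector_embeddings functions in your notebook, see Define an Embedding Function to load these functions into your notebook.

Load the sample data.

Paste and run the following code in your notebook: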

# Sample data
sentences = [
  "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
  "The Lion King: Lion cub and future king Simba searches for his identity",
  "Avatar: A marine is dispatched to the moon Pandora on a unique mission",
  "Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.",
  "The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.",
  "Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.",
  "Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.",
  "The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.",
  "Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.",
  "The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.",
  "Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.",
  "The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.",
  "Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.",
  "The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.",
  "E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.",
  "Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.",
  "Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.",
  "Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.",
  "Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.",
  "Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory."
]

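Generate embeddings for your data.

Use the following code to generate embeddings from new data.

Specifically, this code uses the get_embedding function that you defined to generate embeddings from the array of sample texts.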
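
As background for the `precision` argument used in the code that follows: `float32` keeps full-precision values, `int8` stores each dimension as a signed 8-bit integer, and `ubinary` stores unsigned bytes that each hold several one-bit dimensions. The following pure-Python sketch illustrates one common way to produce such a binary form (sign bits packed eight per byte, most significant bit first). It is an assumption about how this kind of packing generally works, not the implementation behind `get_embedding`, and `pack_sign_bits` is a hypothetical helper.

```python
from typing import List


def pack_sign_bits(vector: List[float]) -> List[int]:
    # Hypothetical helper: one bit per dimension (1 if the value is positive),
    # packed eight bits per byte with the most significant bit first.
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


vector = [0.1, -0.2, 0.3, 0.4, -0.5, -0.6, 0.7, 0.8]
packed = pack_sign_bits(vector)
print(packed)
```

Under this scheme an 8-dimensional vector shrinks to a single byte, which is why binary vectors take far less storage than their float32 counterparts.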

float32_embeddings = get_embedding(sentences, precision="float32")
int8_embeddings = get_embedding(sentences, precision="int8")
int1_embeddings = get_embedding(sentences, precision="ubinary")
# Print stored embeddings
print("Generated embeddings stored in different variables:")
for i, text in enumerate(sentences):
    print(f"\nText: {text}")
    print(f"Float32 Embedding: {float32_embeddings[i][:3]}... (truncated)")
    print(f"Int8 Embedding: {int8_embeddings[i][:3]}... (truncated)")
    print(f"Ubinary Embedding: {int1_embeddings[i][:3]}... (truncated)")

Generated embeddings stored in different variables:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Float32 Embedding: [-0.01089042  0.05926645 -0.00291325]... (truncated)
Int8 Embedding: [-15 127  56]... (truncated)
Ubinary Embedding: [ 77  30 209]... (truncated)
Text: The Lion King: Lion cub and future king Simba searches for his identity
Float32 Embedding: [-0.05607051 -0.01360618  0.00523855]... (truncated)
Int8 Embedding: [-128 -109  110]... (truncated)
Ubinary Embedding: [ 37  18 151]... (truncated)
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Float32 Embedding: [-0.0275258   0.01144342 -0.02360895]... (truncated)
Int8 Embedding: [-57 -28 -79]... (truncated)
Ubinary Embedding: [ 76  16 144]... (truncated)
Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.
Float32 Embedding: [-0.01759741  0.03254957 -0.02090798]... (truncated)
Int8 Embedding: [-32  40 -61]... (truncated)
Ubinary Embedding: [ 77  27 176]... (truncated)
Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.
Float32 Embedding: [ 0.00503172  0.04311579 -0.00074904]... (truncated)
Int8 Embedding: [23 74 70]... (truncated)
Ubinary Embedding: [215  26 145]... (truncated)
Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.
Float32 Embedding: [0.02349479 0.05669326 0.00458773]... (truncated)
Int8 Embedding: [ 69 118 105]... (truncated)
Ubinary Embedding: [237 154 159]... (truncated)
Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.
Float32 Embedding: [-0.03294644  0.02671233 -0.01864981]... (truncated)
Int8 Embedding: [-70  21 -47]... (truncated)
Ubinary Embedding: [ 77  90 146]... (truncated)
Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.
Float32 Embedding: [-0.02489671  0.02847196 -0.00290637]... (truncated)
Int8 Embedding: [-50  27  56]... (truncated)
Ubinary Embedding: [ 95 154 129]... (truncated)
Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.
Float32 Embedding: [-0.01235448  0.01524397 -0.01063425]... (truncated)
Int8 Embedding: [-19 -15   5]... (truncated)
Ubinary Embedding: [ 68  26 210]... (truncated)
Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.
Float32 Embedding: [ 0.04665203  0.01392298 -0.01743002]... (truncated)
Int8 Embedding: [127 -20 -39]... (truncated)
Ubinary Embedding: [207  88 208]... (truncated)
Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.
Float32 Embedding: [0.00929601 0.04206405 0.00701248]... (truncated)
Int8 Embedding: [ 34  71 121]... (truncated)
Ubinary Embedding: [228  90 130]... (truncated)
Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.
Float32 Embedding: [-0.01451324 -0.00897367  0.0077793 ]... (truncated)
Int8 Embedding: [-24 -94 127]... (truncated)
Ubinary Embedding: [ 57 150  32]... (truncated)
Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.
Float32 Embedding: [-0.01458643  0.03639758 -0.02587282]... (truncated)
Int8 Embedding: [-25  52 -94]... (truncated)
Ubinary Embedding: [ 78 218 216]... (truncated)
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.
Float32 Embedding: [-0.00205381 -0.00039482 -0.01630799]... (truncated)
Int8 Embedding: [  6 -66 -31]... (truncated)
Ubinary Embedding: [  9  82 154]... (truncated)
Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.
Float32 Embedding: [ 0.01105334  0.00776658 -0.03092942]... (truncated)
Int8 Embedding: [  38  -40 -128]... (truncated)
Ubinary Embedding: [205  24 146]... (truncated)
Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.
Float32 Embedding: [ 0.00266668 -0.01926583 -0.00727963]... (truncated)
Int8 Embedding: [  17 -128   27]... (truncated)
Ubinary Embedding: [148  82 194]... (truncated)
Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.
Float32 Embedding: [-0.00031873 -0.01352339 -0.02882693]... (truncated)
Int8 Embedding: [  10 -109 -114]... (truncated)
Ubinary Embedding: [ 12  26 144]... (truncated)
Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.
Float32 Embedding: [ 0.00957429  0.01855557 -0.02353773]... (truncated)
Int8 Embedding: [ 34  -5 -79]... (truncated)
Ubinary Embedding: [212  18 144]... (truncated)
Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.
Float32 Embedding: [-0.01787405  0.03672816 -0.00972007]... (truncated)
Int8 Embedding: [-33  53  11]... (truncated)
Ubinary Embedding: [ 68 154 145]... (truncated)
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory.
Float32 Embedding: [-0.03515214 -0.00503571  0.00183181]... (truncated)
Int8 Embedding: [-76 -81  87]... (truncated)
Ubinary Embedding: [ 35 222 152]... (truncated)
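
The Ubinary values in this output are whole bytes, not individual bits. As a rough aid to reading them, the following sketch shows one way that eight float dimensions can be packed into a single unsigned byte by keeping only the sign of each value, most significant bit first. The pack_sign_bits helper and the sample values are hypothetical, and the embedding library may apply a different threshold or calibration, so treat this as an illustration and not as the library's implementation.

```python
def pack_sign_bits(vector):
    """Pack the sign of each value into bits, eight dimensions per byte (MSB first)."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift the bits collected so far and append 1 for positive values, 0 otherwise
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed

# Illustrative values only (not taken from a real model)
example = [-0.01, 0.06, -0.003, -0.02, 0.04, 0.01, -0.05, 0.02]
print(pack_sign_bits(example))  # [77]
```

Under this assumption, a vector with N dimensions packs into N / 8 bytes, which is why the binary representation is the most compact of the three.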

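Generate BSON vectors from your embeddings.

Use the following code to convert the generated vector embeddings to BSON vectors.

Specifically, this code converts the generated float32, int8, and bit-packed int1 embeddings to BSON vectors of the matching subtype (FLOAT32, INT8, and PACKED_BIT).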

from bson.binary import BinaryVectorDtype
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
# Convert each embedding to BSON
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
# Print the embeddings
for idx, text in enumerate(sentences):
    print(f"\nText: {text}")
    print(f"Float32 BSON: {bson_float32_embeddings[idx]}")
    print(f"Int8 BSON: {bson_int8_embeddings[idx]}")
    print(f"Int1 BSON: {bson_int1_embeddings[idx]}")

Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Float32 BSON: b'\'\x00\xbam2\xbc`\xc1r=7\xec>\xbb\xe6\xf3\x...'
Int8 BSON: b'\x03\x00\xf1\x7f8\xdf\xfeC\x1e>\xef\xd6\xf5\x9...'
Int1 BSON: b'\x10\x00M\x1e\xd1\xd2\x05\xaeq\xdf\x9a\x1d\xbc...'
Text: The Lion King: Lion cub and future king Simba searches for his identity
Float32 BSON: b'\'\x001\xaae\xbdr\xec^\xbc"\xa8\xab;\x91\xd...'
Int8 BSON: b'\x03\x00\x80\x93n\x06\x80\xca\xd3.\xa2\xe3\xd1...'
Int1 BSON: b'\x10\x00%\x12\x97\xa6\x8f\xdf\x89\x9d2\xcb\x99...'
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Float32 BSON: b'\'\x00\xcc}\xe1\xbc-};<\x8eg\xc1\xbc\xcb\xd...'
Int8 BSON: b'\x03\x00\xc7\xe4\xb1\xdf/\xe2\xd2\x90\xf7\x02|...'
Int1 BSON: b'\x10\x00L\x10\x90\xb6\x0f\x8a\x91\xaf\x92|\xf9...'
Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.
Float32 BSON: b'\'\x00o(\x90\xbc\xb3R\x05=8G\xab\xbc\xfb\xc...'
Int8 BSON: b'\x03\x00\xe0(\xc3\x10*\xda\xfe\x19\xbf&<\xd1\x...'
Int1 BSON: b'\x10\x00M\x1b\xb0\x86\rn\x93\xaf:w\x9f}\x92\xd...'
Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.
Float32 BSON: b'\'\x00\x1d\xe1\xa4;0\x9a0=C[D\xba\xb5\xf2\x...'
Int8 BSON: b'\x03\x00\x17JF2\xb9\xddZ8\xa1\x0c\xc6\x80\xd8$...'
Int1 BSON: b'\x10\x00\xd7\x1a\x91\x87\x0e\xc9\x91\x8b\xba\x...'
Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.
Float32 BSON: b'\'\x00#x\xc0<27h=\xb5T\x96;:\xc4\x9c\xbd\x1...'
Int8 BSON: b'\x03\x00Evi\x80\x13\xd6\x1cCW\x80\x01\x9e\xe58...'
Int1 BSON: b'\x10\x00\xed\x9a\x9f\x97\x1f.\x12\xf9\xba];\x7...'
Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.
Float32 BSON: b'\'\x00\xd9\xf2\x06\xbd\xd2\xd3\xda<\x7f\xc7...'
Int8 BSON: b'\x03\x00\xba\x15\xd1-\x0c\x03\xe6\xea\rQ\x1f\x...'
Int1 BSON: b'\x10\x00MZ\x92\xb7#\xaa\x99=\x9a\x99\x9c|<\xf8...'
Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.
Float32 BSON: b'\'\x00/\xf4\xcb\xbc\t>\xe9<\xc9x>\xbb\xcc\x...'
Int8 BSON: b'\x03\x00\xce\x1b815\xcf1\xc6s\xe5\n\xe4\x192G\...'
Int1 BSON: b'\x10\x00_\x9a\x81\xa6\x0f\x0f\x93o2\xd8\xfe|\x...'
Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.
Float32 BSON: b'\'\x00sjJ\xbc\xd6\xc1y<I;.\xbc\xb1\x80\t\xb...'
Int8 BSON: b'\x03\x00\xed\xf1\x05\xe2\xc7\xfa\xd4\xab5\xeb\...'
Int1 BSON: b'\x10\x00D\x1a\xd2\x86\x0ey\x92\x8f\xaa\x89\x1c...'
Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.
Float32 BSON: b'\'\x004\x16?=9\x1dd<g\xc9\x8e\xbc\xdf\x81\x...'
Int8 BSON: b'\x03\x00\x7f\xec\xd9\xdc)\xd6)\x05\x18\x7f\xa6...'
Int1 BSON: b"\x10\x00\xcfX\xd0\xb7\x0e\xcf\xd9\r\xf0U\xb4]6..."
Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.
Float32 BSON: b'\'\x00HN\x18<ZK,=\xf9\xc8\xe5;\x9e\xed\xa0\...'
Int8 BSON: b'\x03\x00"Gy\x01\xeb\xec\xfc\x80\xe4a\x7f\x88\x...'
Int1 BSON: b'\x10\x00\xe4Z\x82\xb6\xad\xec\x10-\x9a\x99;?j\...'
Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.
Float32 BSON: b'\'\x00\xef\xc8m\xbcJ\x06\x13\xbcv\xe9\xfe;...'
Int8 BSON: b'\x03\x00\xe8\xa2\x7fIE\xba\x9f\xfaT2\xf1\xc1\...'
Int1 BSON: b'\x10\x009\x96 \xb7\x8e\xc9\x81\xaf\xaa\x9f\xa...'
Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.
Float32 BSON: b'\'\x00\xee\xfbn\xbc\xa0\x15\x15=<\xf3\xd3x...'
Int8 BSON: b'\x03\x00\xe74\xa2\xe5\x15\x165\xb9dM8C\xd7E\x...'
Int1 BSON: b'\x10\x00N\xda\xd8\xb6\x03N\x98\xbd\xdaY\x1b| ...'
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.
Float32 BSON: b'\'\x002\x99\x06\xbb\x82\x00\xcf\xb9X\x98\x...'
Int8 BSON: b'\x03\x00\x06\xbe\xe1.\x7f\x80\x04C\xd7e\x80\x...'
Int1 BSON: b'\x10\x00\tR\x9a\xd6\x0c\xb1\x9a\xbc\x90\xf5\x...'
Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.
Float32 BSON: b'\'\x00\x14\x195<\xd4~\xfe;\xb3_\xfd\xbc \xe...'
Int8 BSON: b'\x03\x00&\xd8\x80\x92\x01\x7f\xbfF\xd4\x10\xf0...'
Int1 BSON: b'\x10\x00\xcd\x18\x92\x92\x8dJ\x92\xbd\x9a\xd3\...'
Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.
Float32 BSON: b'\'\x00\x8a\xc3.;_\xd3\x9d\xbc\xf2\x89\xee\x...'
Int8 BSON: b'\x03\x00\x11\x80\x1b5\xe9\x19\x80\x8f\xb1N\xda...'
Int1 BSON: b"\x10\x00\x94R\xc2\xd2\x0f\xfa\x90\xbc\xd8\xd6\...'
Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.
Float32 BSON: b'\'\x00\xe6\x1a\xa7\xb94\x91]\xbcs&\xec\xbc\...'
Int8 BSON: b'\x03\x00\n\x93\x8e,n\xce\xe8\x9b@\x00\xf9\x7f\...'
Int1 BSON: b'\x10\x00\x0c\x1a\x90\x97\x0f\x19\x80/\xba\x98\...'
Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.
Float32 BSON: b'\'\x00\x7f\xdd\x1c<\xd9\x01\x98<1\xd2\xc0\xb...'
Int8 BSON: b'\x03\x00"\xfb\xb1\x7f\xd3\xd6\x04\xbe\x80\xf9L\...'
Int1 BSON: b'\x10\x00\xd4\x12\x90\xa6\x8by\x99\x8d\xa2\xbd\x...'
Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.
Float32 BSON: b'\'\x00\x98l\x92\xbcDp\x16=\xf0@\x1f\xbc\xd0\...'
Int8 BSON: b'\x03\x00\xdf5\x0b\xe3\xbf\xe5\xa5\xad\x7f\x02\x...'
Int1 BSON: b'\x10\x00D\x9a\x91\x96\x07\xfa\x93\x8d\xb2D\x92]...'
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory.
Float32 BSON: b'\'\x00\xb0\xfb\x0f\xbd\x9b\x02\xa5\xbbZ\x19\...'
Int8 BSON: b'\x03\x00\xb4\xafW\xd9\xd7\xc3\x7f~QM\x86\x83\xf...'
Int1 BSON: b'\x10\x00#\xde\x98\x96\x0e\xcc\x12\xf6\xbb\xdd2}...'
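
Each BSON value in this output starts with two bytes that differ by vector type. The following sketch decodes that prefix with the standard library, assuming the layout is one type byte, one padding byte, and then the little-endian values. The decode_vector_prefix helper and the DTYPE_NAMES table are hypothetical names, and the type codes are read off the sample output above. The byte strings are copied from the first few bytes of the Titanic rows.

```python
import struct

# Assumed type codes, taken from the first byte of each sample value above
DTYPE_NAMES = {0x27: "float32", 0x03: "int8", 0x10: "packed_bit"}

def decode_vector_prefix(data, count=3):
    """Return (type name, padding, first `count` values) from raw vector bytes."""
    dtype, padding = data[0], data[1]
    payload = data[2:]
    name = DTYPE_NAMES[dtype]
    if name == "float32":
        values = list(struct.unpack(f"<{count}f", payload[:4 * count]))
    elif name == "int8":
        values = list(struct.unpack(f"<{count}b", payload[:count]))
    else:
        # Packed bits are stored as unsigned bytes
        values = list(payload[:count])
    return name, padding, values

print(decode_vector_prefix(b"\x03\x00\xf1\x7f8"))           # int8 prefix
print(decode_vector_prefix(b"\x10\x00M\x1e\xd1"))           # packed-bit prefix
print(decode_vector_prefix(b"'\x00\xbam2\xbc", count=1))    # float32 prefix
```

The decoded int8 and packed-bit values line up with the first three numbers printed for the same text in the earlier embedding output, which is a quick way to confirm that the conversion preserved your data.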

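Create documents with the BSON vector embeddings.

Use the following code to create documents that contain the BSON vector embeddings. This code uses the create_docs_with_bson_vector_embeddings function to create the documents.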

# Create BSON documents
docs = create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, sentences)

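Ingest the documents into Atlas.

Paste the following code in your notebook, replace <connection-string> with your Atlas cluster's SRV connection string, and then run the code.

Note

Your connection string should use the following format: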

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]
# Ingest data into Atlas
collection.insert_many(docs)

InsertManyResult([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], acknowledged=True)

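You can verify your vector embeddings by viewing them in the Atlas UI, in the sample_db.embeddings namespace of your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

Paste the following code in your notebook.

Use the following code to generate embeddings from new data.

Specifically, this code uses the get_embedding function that you defined and the MongoDB PyMongo driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection.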

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]
# Sample data
data = [
   "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
   "The Lion King: Lion cub and future king Simba searches for his identity",
   "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
# Ingest data into Atlas
inserted_doc_count = 0
for text in data:
   embedding = get_embedding(text)
   collection.insert_one({ "text": text, "embedding": embedding })
   inserted_doc_count += 1
print(f"Inserted {inserted_doc_count} documents.")

Inserted 3 documents.
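
The preceding loop sends one insert per text. If you adapt it to a larger data set, one option is to build all of the documents first and send them in a single insert_many call, as the BSON example on this page does. The following sketch shows only the document-building step. The build_documents helper and the fake_embedding stand-in are hypothetical and exist so that the snippet runs without a model or a cluster.

```python
def build_documents(texts, embed):
    """Pair each text with its embedding so the whole batch can be sent in one call."""
    return [{"text": text, "embedding": embed(text)} for text in texts]

# Stand-in for get_embedding so that this sketch runs without a model
def fake_embedding(text):
    return [float(len(text)), 0.0]

docs = build_documents(["Titanic", "Avatar"], fake_embedding)
print(len(docs))  # 2

# With a live connection, you could then run: collection.insert_many(docs)
```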

Specify the connection string.

Replace <connection-string> with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
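
If your database username or password contains characters such as @, :, or /, those characters generally need to be percent-encoded before they go into the connection string. The following sketch shows one way to do that with the standard library. The build_srv_uri helper, the credentials, and the host name are placeholders for illustration only.

```python
from urllib.parse import quote_plus

def build_srv_uri(username, password, host):
    """Assemble an SRV connection string with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"

# Placeholder values, not real credentials
uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)  # mongodb+srv://app_user:p%40ss%3Aword%2F1@cluster0.example.mongodb.net
```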

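Run the code.

You can verify your vector embeddings by navigating to the sample_db.embeddings collection of your cluster in the Atlas UI.

(Conditional) Define the functions to generate BSON embeddings.

If you haven't already defined the get_embedding and generate_bson_vector functions in your notebook, see Define an Embedding Function to load them into your notebook.

Load your existing data.

Load data from your Atlas cluster into your notebook. The following code retrieves a subset of 50 documents from the sample_airbnb.listingsAndReviews collection.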

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]
# Define a filter to exclude documents with missing, null, or empty 'summary' fields
summary_filter = { 'summary': { '$exists': True, '$nin': [ None, "" ] } }
# Get a subset of documents in the collection
documents = collection.find(summary_filter, {'_id': 1, 'summary': 1}).limit(50)

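Generate, convert, and ingest the embeddings into your Atlas cluster.

This code does the following:

Generates float32, int8, and int1 embeddings from your data.
Converts the embeddings to BSON float32, int8, and int1 subtypes.
Updates each document in the collection with the new embedding fields.

Note

This operation might take a few minutes to complete.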

from pymongo import UpdateOne
from bson.binary import BinaryVectorDtype
# Generate the list of bulk write operations
operations = []
for doc in documents:
    summary = doc["summary"]
    # Generate embeddings for this document
    float32_embeddings = get_embedding(summary, precision="float32")
    int8_embeddings = get_embedding(summary, precision="int8")
    int1_embeddings = get_embedding(summary, precision="ubinary")
    # Convert embeddings to BSON vectors
    bson_float32_embeddings = generate_bson_vector(float32_embeddings, BinaryVectorDtype.FLOAT32)
    bson_int8_embeddings = generate_bson_vector(int8_embeddings, BinaryVectorDtype.INT8)
    bson_int1_embeddings = generate_bson_vector(int1_embeddings, BinaryVectorDtype.PACKED_BIT)
    # Add the update operation to the list
    operations.append(UpdateOne(
        {"_id": doc["_id"]},
        {"$set": {
            "BSON-Float32-Embedding": bson_float32_embeddings,
            "BSON-Int8-Embedding": bson_int8_embeddings,
            "BSON-Int1-Embedding": bson_int1_embeddings
        }}
    ))
# Execute the bulk write operation
updated_doc_count = 0  # stays 0 if no documents matched the filter
if operations:
    result = collection.bulk_write(operations)
    updated_doc_count = result.modified_count
print(f"Updated {updated_doc_count} documents.")

...
Updated 50 documents.
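
The preceding code sends all 50 updates in one bulk_write call. If you adapt it to a much larger collection, you might prefer to send the operations in smaller batches so that each request stays a manageable size. The following sketch shows a hypothetical chunked helper for that. The batch size of 500 in the closing comment is an arbitrary illustration, not a recommended value.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` that contain at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Integers stand in for UpdateOne operations so that the sketch runs offline
batches = list(chunked(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

# With a live connection, you could then run:
# for batch in chunked(operations, 500):
#     collection.bulk_write(batch)
```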

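Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

Paste the following code in your notebook.

Use the following code to generate embeddings from a field in an existing collection. Specifically, this code does the following:

Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates embeddings from each document's summary field by using the get_embedding function that you defined.
Updates each document with a new embedding field that contains the embedding value by using the MongoDB PyMongo driver.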

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]
# Filter to exclude null or empty summary fields
filter = { "summary": {"$nin": [ None, "" ]} }
# Get a subset of documents in the collection
documents = collection.find(filter).limit(50)
# Update each document with a new embedding field
updated_doc_count = 0
for doc in documents:
    embedding = get_embedding(doc["summary"])
    collection.update_one( { "_id": doc["_id"] }, { "$set": { "embedding": embedding } } )
    updated_doc_count += 1
print(f"Updated {updated_doc_count} documents.")

Updated 50 documents.

Specify the connection string.

Replace <connection-string> with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

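Run the code.

You can view your vector embeddings as they generate by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.

Create Embeddings for Queries

In this section, you index the vector embeddings in your collection and create an embedding that you use to run a sample vector search query.

When you run the query, Atlas Vector Search returns the documents whose embeddings are closest in distance to the embedding from your vector search query. This closeness indicates that they are similar in meaning.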
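
To make the idea of closeness concrete, the following sketch ranks a few stored vectors against a query vector by dot product, which is the similarity measure that the index examples on this page specify. The rank_by_dot_product helper and the three-dimensional vectors are invented for illustration. Real embeddings have hundreds or thousands of dimensions, and Atlas computes its own score on the server.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_dot_product(query, stored, limit=5):
    """Return up to `limit` (name, similarity) pairs, most similar first."""
    scored = [(name, dot(query, vector)) for name, vector in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Toy vectors, not real embeddings
stored = {
    "Titanic": [0.9, 0.1, 0.0],
    "Avatar": [0.2, 0.7, 0.1],
    "The Lion King": [0.1, 0.2, 0.9],
}
ranking = rank_by_dot_product([1.0, 0.2, 0.0], stored)
print([name for name, _ in ranking])  # ['Titanic', 'Avatar', 'The Lion King']
```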

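Create the Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

Paste the following code to add a CreateVectorIndex function to your DataService class in DataService.cs.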

DataService.cs

namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }
    
    public void CreateVectorIndex()
    {
        try
        {
            var searchIndexView = Collection.SearchIndexes;
            var name = "vector_index";
            var type = SearchIndexType.VectorSearch;
            var definition = new BsonDocument
            {
                { "fields", new BsonArray
                    {
                        new BsonDocument
                        {
                            { "type", "vector" },
                            { "path", "embedding" },
                            { "numDimensions", <dimensions> },
                            { "similarity", "dotProduct" }
                        }
                    }
                }
            };
            var model = new CreateSearchIndexModel(name, type, definition);
            searchIndexView.CreateOne(model);
            Console.WriteLine($"New search index named {name} is building.");
            // Polling for index status
            Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
            bool queryable = false;
            while (!queryable)
            {
                var indexes = searchIndexView.List();
                foreach (var index in indexes.ToEnumerable())
                {
                    if (index["name"] == name)
                    {
                        queryable = index["queryable"].AsBoolean;
                    }
                }
                if (!queryable)
                {
                    Thread.Sleep(5000);
                }
            }
            Console.WriteLine($"{name} is ready for querying.");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Exception: {e.Message}");
        }
    }
}

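Replace the <dimensions> placeholder value with 1024 if you used the open-source model, or with 1536 if you used the OpenAI model.

Update the code in your Program.cs.

Remove the code that populated the initial documents, and replace it with the following code to create the index:

Program.cs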
```
using MyCompany.Embeddings;
var dataService = new DataService();
dataService.CreateVectorIndex();
```
Save the file, and then compile and run your project to create the index:

dotnet run MyCompany.Embeddings

New search index named vector_index is building.
Polling to check if the index is ready. This may take up to a minute.
vector_index is ready for querying.

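Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.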
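
The numDimensions value in the index definition needs to match the length of the vectors that you stored. Before you build the index, a quick check such as the following sketch can catch documents that were embedded with a different model. The find_dimension_mismatches helper and the sample documents are hypothetical.

```python
def find_dimension_mismatches(docs, field, expected):
    """Return the _id of every document whose vector length differs from `expected`."""
    return [doc.get("_id") for doc in docs if len(doc.get(field, [])) != expected]

# Sample documents: one 1024-dimension vector and one 1536-dimension vector
sample_docs = [
    {"_id": 1, "embedding": [0.0] * 1024},
    {"_id": 2, "embedding": [0.0] * 1536},
]
print(find_dimension_mismatches(sample_docs, "embedding", 1024))  # [2]
```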

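Create embeddings for vector search queries and run a query.

Paste the following code to add a PerformVectorQuery function to your DataService class in DataService.cs.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.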

DataService.cs

namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }
    
    public void CreateVectorIndex()
    {
        // Method details...
    }
    
    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
        var vectorSearchStage = new BsonDocument
        {
            {
                "$vectorSearch",
                new BsonDocument
                {
                    { "index", "vector_index" },
                    { "path", "embedding" },
                    { "queryVector", new BsonArray(vector) },
                    { "exact", true },
                    { "limit", 5 }
                }
            }
        };
        var projectStage = new BsonDocument
        {
            {
                "$project",
                new BsonDocument
                {
                    { "_id", 0 },
                    { "text", 1 },
                    { "score", 
                        new BsonDocument
                        {
                            { "$meta", "vectorSearchScore"}
                        }
                    }
                }
            }
        };
        var pipeline = new[] { vectorSearchStage, projectStage };
        return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
}

Update the code in your Program.cs.

Remove the code that created the vector index, and add code to perform a query:

Program.cs

using MongoDB.Bson;
using MyCompany.Embeddings;
var aiService = new AIService();
var queryString = "ocean tragedy";
var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);
if (!queryEmbedding.Any())
{
    Console.WriteLine("No embeddings found.");
}
else
{
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
        Console.WriteLine("No documents matched the query.");
    }
    else
    {
        foreach (var document in matchingDocuments)
        {
            Console.WriteLine(document.ToJson());
        }
    }
}

Save the file, and then compile and run your project to perform the query:

dotnet run MyCompany.Embeddings.csproj

{ "text" : "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "score" : 100.17414855957031 }
{ "text" : "Avatar: A marine is dispatched to the moon Pandora on a unique mission", "score" : 65.705635070800781 }
{ "text" : "The Lion King: Lion cub and future king Simba searches for his identity", "score" : 52.486415863037109 }
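
For comparison, the same two-stage pipeline can be written as plain Python dictionaries. The following sketch only builds the pipeline and mirrors the field names used in PerformVectorQuery. The build_vector_search_pipeline helper and the three-element query vector are hypothetical, and running the pipeline requires a live collection and a query embedding from your model.

```python
def build_vector_search_pipeline(query_vector, path="embedding",
                                 index="vector_index", limit=5):
    """Build an exact nearest neighbor (ENN) pipeline that returns text and score."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "exact": True,   # ENN search, as in the C# example
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3])
print(list(pipeline[0]["$vectorSearch"].keys()))

# With a live connection, you could then run: collection.aggregate(pipeline)
```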

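Create the Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.

Paste the following code to add a CreateVectorIndex function to your DataService class in DataService.cs.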

DataService.cs

namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
        // Method details...
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }
    public void CreateVectorIndex()
    {
        try
        {
            var searchIndexView = Collection.SearchIndexes;
            var name = "vector_index";
            var type = SearchIndexType.VectorSearch;
            var definition = new BsonDocument
            {
                { "fields", new BsonArray
                    {
                        new BsonDocument
                        {
                            { "type", "vector" },
                            { "path", "embeddings" },
                            { "numDimensions", <dimensions> },
                            { "similarity", "dotProduct" }
                        }
                    }
                }
            };
            var model = new CreateSearchIndexModel(name, type, definition);
            searchIndexView.CreateOne(model);
            Console.WriteLine($"New search index named {name} is building.");
            // Polling for index status
            Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
            bool queryable = false;
            while (!queryable)
            {
                var indexes = searchIndexView.List();
                foreach (var index in indexes.ToEnumerable())
                {
                    if (index["name"] == name)
                    {
                        queryable = index["queryable"].AsBoolean;
                    }
                }
                if (!queryable)
                {
                    Thread.Sleep(5000);
                }
            }
            Console.WriteLine($"{name} is ready for querying.");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Exception: {e.Message}");
        }
    }
}

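Replace the <dimensions> placeholder value with 1024 if you used the open-source model, or with 1536 if you used the OpenAI model.

Update the code in your Program.cs.

Remove the code that added embeddings to the existing documents, and replace it with the following code to create the index:

Program.cs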
```
using MyCompany.Embeddings;
var dataService = new DataService();
dataService.CreateVectorIndex();
```
Save the file, and then compile and run your project to create the index:

dotnet run MyCompany.Embeddings

New search index named vector_index is building.
Polling to check if the index is ready. This may take up to a minute.
vector_index is ready for querying.

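Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.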
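
The polling loop in CreateVectorIndex waits until the index reports that it is queryable and has no upper bound on the wait. If you adapt the pattern for a script, you might add a timeout so that the script cannot hang indefinitely. The following sketch shows the idea with a hypothetical wait_until helper. The check argument stands in for whatever call reports the index status in your driver.

```python
import time

def wait_until(check, timeout_s=60.0, interval_s=5.0):
    """Call `check` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Simulated status checks: not ready, not ready, ready
states = iter([False, False, True])
print(wait_until(lambda: next(states), timeout_s=1.0, interval_s=0.0))  # True
```

A return value of False tells the caller that the index was still building when the timeout expired, so the caller can decide whether to keep waiting or report an error.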

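Create embeddings for vector search queries and run a query.

Paste the following code to add a PerformVectorQuery function to your DataService class in DataService.cs.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.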

DataService.cs

namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
        // Method details...
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }
    public void CreateVectorIndex()
    {
        // Method details...
    }
    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
        var vectorSearchStage = new BsonDocument
        {
            {
                "$vectorSearch",
                new BsonDocument
                {
                    { "index", "vector_index" },
                    { "path", "embeddings" },
                    { "queryVector", new BsonArray(vector) },
                    { "exact", true },
                    { "limit", 5 }
                }
            }
        };
        var projectStage = new BsonDocument
        {
            {
                "$project",
                new BsonDocument
                {
                    { "_id", 0 },
                    { "summary", 1 },
                    { "score", 
                        new BsonDocument
                        {
                            { "$meta", "vectorSearchScore"}
                        }
                    }
                }
            }
        };
        var pipeline = new[] { vectorSearchStage, projectStage };
        return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
}

Update the code in your Program.cs.

Remove the code that created the vector index, and add code to perform a query:

Program.cs

using MongoDB.Bson;
using MyCompany.Embeddings;
var aiService = new AIService();
var queryString = "beach house";
var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);
if (!queryEmbedding.Any())
{
    Console.WriteLine("No embeddings found.");
}
else
{
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
        Console.WriteLine("No documents matched the query.");
    }
    else
    {
        foreach (var document in matchingDocuments)
        {
            Console.WriteLine(document.ToJson());
        }
    }
}

Save the file, and then compile and run your project to perform the query:

dotnet run MyCompany.Embeddings.csproj

{ "summary" : "Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.", "score" : 88.884147644042969 }
{ "summary" : "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score" : 86.136398315429688 }
{ "summary" : "Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.", "score" : 86.087783813476562 }
{ "summary" : "Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score" : 85.689559936523438 }
{ "summary" : "Fully furnished 3+1 flat decorated with vintage style.  Located at the heart of Moda/Kadıköy, close to seaside and also to the public transportation (tram, metro, ferry, bus stations) 10 minutes walk.", "score" : 85.614166259765625 }

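Create an Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

Create a file named create-index.go and paste the following code.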

create-index.go

package main
import (
	"context"
	"fmt"
	"log"
	"os"
	"time"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
func main() {
	ctx := context.Background()
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_db").Collection("embeddings")
	indexName := "vector_index"
	opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")
	type vectorDefinitionField struct {
		Type          string `bson:"type"`
		Path          string `bson:"path"`
		NumDimensions int    `bson:"numDimensions"`
		Similarity    string `bson:"similarity"`
	}
	type vectorDefinition struct {
		Fields []vectorDefinitionField `bson:"fields"`
	}
	indexModel := mongo.SearchIndexModel{
		Definition: vectorDefinition{
			Fields: []vectorDefinitionField{{
				Type:          "vector",
				Path:          "embedding",
				NumDimensions: <dimensions>,
				Similarity:    "dotProduct"}},
		},
		Options: opts,
	}
	log.Println("Creating the index.")
	searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
	if err != nil {
		log.Fatalf("failed to create the search index: %v", err)
	}
	// Await the creation of the index.
	log.Println("Polling to confirm successful index creation.")
	searchIndexes := coll.SearchIndexes()
	var doc bson.Raw
	for doc == nil {
		cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
		if err != nil {
			log.Fatal(fmt.Errorf("failed to list search indexes: %w", err))
		}
		if !cursor.Next(ctx) {
			break
		}
		name := cursor.Current.Lookup("name").StringValue()
		queryable := cursor.Current.Lookup("queryable").Boolean()
		if name == searchIndexName && queryable {
			doc = cursor.Current
		} else {
			time.Sleep(5 * time.Second)
		}
	}
	log.Println("Name of Index Created: " + searchIndexName)
}

Replace the <dimensions> placeholder value with 1024 if you use the open-source model, or with 1536 if you use the OpenAI model.
Save the file, then run the following command:
go run create-index.go
2024/10/09 17:38:51 Creating the index.
2024/10/09 17:38:52 Polling to confirm successful index creation.
2024/10/09 17:39:22 Name of Index Created: vector_index

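Note

Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.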

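Create embeddings for vector search queries and run a query.

Create a file named vector-query.go and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.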
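To make the exact nearest neighbor (ENN) idea concrete, the following sketch scores every stored vector against the query and keeps the top results, which is conceptually what an exact search does. It runs entirely in memory on toy data and does not call Atlas. The doc and scoredDoc types and the exactSearch function are hypothetical, and the (1 + dotProduct) / 2 score normalization is an assumption about how dotProduct scores are scaled, so treat the numbers as illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// doc is a hypothetical in-memory stand-in for a stored document.
type doc struct {
	Text      string
	Embedding []float64
}

// scoredDoc pairs a document's text with its similarity score.
type scoredDoc struct {
	Text  string
	Score float64
}

// dot returns the dot product of two vectors of equal length.
func dot(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// exactSearch compares the query against every document (no approximation)
// and returns at most limit results, highest score first.
func exactSearch(query []float64, docs []doc, limit int) []scoredDoc {
	results := make([]scoredDoc, 0, len(docs))
	for _, d := range docs {
		if len(d.Embedding) != len(query) {
			continue // skip vectors with a different number of dimensions
		}
		// Assumed normalization: maps a dot product in [-1, 1] to [0, 1].
		score := (1 + dot(query, d.Embedding)) / 2
		results = append(results, scoredDoc{Text: d.Text, Score: score})
	}
	sort.SliceStable(results, func(i, j int) bool {
		return results[i].Score > results[j].Score
	})
	if len(results) > limit {
		results = results[:limit]
	}
	return results
}

func main() {
	// Toy unit-length vectors in two dimensions.
	docs := []doc{
		{Text: "doc A", Embedding: []float64{1, 0}},
		{Text: "doc B", Embedding: []float64{0, 1}},
		{Text: "doc C", Embedding: []float64{-1, 0}},
		{Text: "doc D", Embedding: []float64{0.6, 0.8}},
	}
	for _, r := range exactSearch([]float64{1, 0}, docs, 2) {
		fmt.Printf("Text: %s\nScore: %.2f\n", r.Text, r.Score)
	}
}
```

Because every vector is compared, the cost grows linearly with the number of documents. That is the trade-off exact search makes in exchange for not missing any neighbor.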

vector-query.go

package main
import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
type TextAndScore struct {
	Text  string  `bson:"text"`
	Score float32 `bson:"score"`
}
func main() {
	ctx := context.Background()
	// Load environment variables from a .env file, if one exists
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_db").Collection("embeddings")
	query := "ocean tragedy"
	queryEmbedding := common.GetEmbeddings([]string{query})
	pipeline := mongo.Pipeline{
		bson.D{
			{"$vectorSearch", bson.D{
				{"queryVector", queryEmbedding[0]},
				{"index", "vector_index"},
				{"path", "embedding"},
				{"exact", true},
				{"limit", 5},
			}},
		},
		bson.D{
			{"$project", bson.D{
				{"_id", 0},
				{"text", 1},
				{"score", bson.D{
					{"$meta", "vectorSearchScore"},
				}},
			}},
		},
	}
	// Run the pipeline
	cursor, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Fatalf("failed to run aggregation: %v", err)
	}
	defer func() { _ = cursor.Close(ctx) }()
	var matchingDocs []TextAndScore
	if err = cursor.All(ctx, &matchingDocs); err != nil {
		log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
	}
	for _, doc := range matchingDocs {
		fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
	}
}

vector-query.go

package main
import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
type TextAndScore struct {
	Text  string  `bson:"text"`
	Score float64 `bson:"score"`
}
func main() {
	ctx := context.Background()
	// Load environment variables from a .env file, if one exists
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_db").Collection("embeddings")
	query := "ocean tragedy"
	queryEmbedding := common.GetEmbeddings([]string{query})
	pipeline := mongo.Pipeline{
		bson.D{
			{"$vectorSearch", bson.D{
				{"queryVector", queryEmbedding[0]},
				{"index", "vector_index"},
				{"path", "embedding"},
				{"exact", true},
				{"limit", 5},
			}},
		},
		bson.D{
			{"$project", bson.D{
				{"_id", 0},
				{"text", 1},
				{"score", bson.D{
					{"$meta", "vectorSearchScore"},
				}},
			}},
		},
	}
	// Run the pipeline
	cursor, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Fatalf("failed to run aggregation: %v", err)
	}
	defer func() { _ = cursor.Close(ctx) }()
	var matchingDocs []TextAndScore
	if err = cursor.All(ctx, &matchingDocs); err != nil {
		log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
	}
	for _, doc := range matchingDocs {
		fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
	}
}

Save the file, then run the following command:

go run vector-query.go

Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.0042472864
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.0031167597
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.0024476869

go run vector-query.go

Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.4552372694015503
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.4050072133541107
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.35942140221595764

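Create an Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.

Create a file named create-index.go and paste the following code.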

create-index.go

package main
import (
	"context"
	"fmt"
	"log"
	"os"
	"time"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
func main() {
	ctx := context.Background()
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
	indexName := "vector_index"
	opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")
	type vectorDefinitionField struct {
		Type          string `bson:"type"`
		Path          string `bson:"path"`
		NumDimensions int    `bson:"numDimensions"`
		Similarity    string `bson:"similarity"`
		Quantization  string `bson:"quantization"`
	}
	type vectorDefinition struct {
		Fields []vectorDefinitionField `bson:"fields"`
	}
	indexModel := mongo.SearchIndexModel{
		Definition: vectorDefinition{
			Fields: []vectorDefinitionField{{
				Type:          "vector",
				Path:          "embeddings",
				NumDimensions: <dimensions>,
				Similarity:    "dotProduct",
				Quantization:  "scalar"}},
		},
		Options: opts,
	}
	log.Println("Creating the index.")
	searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
	if err != nil {
		log.Fatalf("failed to create the search index: %v", err)
	}
	// Await the creation of the index.
	log.Println("Polling to confirm successful index creation.")
	searchIndexes := coll.SearchIndexes()
	var doc bson.Raw
	for doc == nil {
		cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
		if err != nil {
			log.Fatal(fmt.Errorf("failed to list search indexes: %w", err))
		}
		if !cursor.Next(ctx) {
			break
		}
		name := cursor.Current.Lookup("name").StringValue()
		queryable := cursor.Current.Lookup("queryable").Boolean()
		if name == searchIndexName && queryable {
			doc = cursor.Current
		} else {
			time.Sleep(5 * time.Second)
		}
	}
	log.Println("Name of Index Created: " + searchIndexName)
}

Replace the <dimensions> placeholder value with 1024 if you use the open-source model, or with 1536 if you use the OpenAI model.
Save the file, then run the following command:
go run create-index.go
2024/10/10 10:03:12 Creating the index.
2024/10/10 10:03:13 Polling to confirm successful index creation.
2024/10/10 10:03:44 Name of Index Created: vector_index
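Because the numDimensions value in the index must match the length of the vectors your model produces, it can help to check an embedding's length before you store or query it. The following is a minimal sketch. The checkDimensions function is a hypothetical helper, and the []float32 element type is an assumption about how your embedding function returns vectors.

```go
package main

import "fmt"

// checkDimensions reports an error when an embedding's length differs from
// the numDimensions value configured in the index.
// Hypothetical helper for illustration only.
func checkDimensions(embedding []float32, numDimensions int) error {
	if len(embedding) != numDimensions {
		return fmt.Errorf("embedding has %d dimensions, index expects %d",
			len(embedding), numDimensions)
	}
	return nil
}

func main() {
	const numDimensions = 1024 // value used in the index for the open-source model

	// Simulate a vector produced by a 1536-dimension model.
	embedding := make([]float32, 1536)
	if err := checkDimensions(embedding, numDimensions); err != nil {
		fmt.Println("dimension mismatch:", err)
		return
	}
	fmt.Println("dimensions match")
}
```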

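Note

Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.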

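Create embeddings for vector search queries and run a query.

Create a file named vector-query.go and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.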

vector-query.go

package main
import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
type SummaryAndScore struct {
	Summary string  `bson:"summary"`
	Score   float32 `bson:"score"`
}
func main() {
	ctx := context.Background()
	// Load environment variables from a .env file, if one exists
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
	query := "beach house"
	queryEmbedding := common.GetEmbeddings([]string{query})
	pipeline := mongo.Pipeline{
		bson.D{
			{"$vectorSearch", bson.D{
				{"queryVector", queryEmbedding[0]},
				{"index", "vector_index"},
				{"path", "embeddings"},
				{"exact", true},
				{"limit", 5},
			}},
		},
		bson.D{
			{"$project", bson.D{
				{"_id", 0},
				{"summary", 1},
				{"score", bson.D{
					{"$meta", "vectorSearchScore"},
				}},
			}},
		},
	}
	// Run the pipeline
	cursor, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Fatalf("failed to run aggregation: %v", err)
	}
	defer func() { _ = cursor.Close(ctx) }()
	var matchingDocs []SummaryAndScore
	if err = cursor.All(ctx, &matchingDocs); err != nil {
		log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
	}
	for _, doc := range matchingDocs {
		fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
	}
}

vector-query.go

package main
import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"
	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)
type SummaryAndScore struct {
	Summary string  `bson:"summary"`
	Score   float64 `bson:"score"`
}
func main() {
	ctx := context.Background()
	// Load environment variables from a .env file, if one exists
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}
	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()
	// Set the namespace
	coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
	query := "beach house"
	queryEmbedding := common.GetEmbeddings([]string{query})
	pipeline := mongo.Pipeline{
		bson.D{
			{"$vectorSearch", bson.D{
				{"queryVector", queryEmbedding[0]},
				{"index", "vector_index"},
				{"path", "embeddings"},
				{"exact", true},
				{"limit", 5},
			}},
		},
		bson.D{
			{"$project", bson.D{
				{"_id", 0},
				{"summary", 1},
				{"score", bson.D{
					{"$meta", "vectorSearchScore"},
				}},
			}},
		},
	}
	// Run the pipeline
	cursor, err := coll.Aggregate(ctx, pipeline)
	if err != nil {
		log.Fatalf("failed to run aggregation: %v", err)
	}
	defer func() { _ = cursor.Close(ctx) }()
	var matchingDocs []SummaryAndScore
	if err = cursor.All(ctx, &matchingDocs); err != nil {
		log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
	}
	for _, doc := range matchingDocs {
		fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
	}
}

Save the file, then run the following command:

go run vector-query.go

Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
Score: 0.0045180833
Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.
Score: 0.004480799
Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
Score: 0.0042421296
Summary: Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
Score: 0.004227752
Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
Score: 0.0042201905

go run vector-query.go

Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
Score: 0.4832950830459595
Summary: Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
Score: 0.48093676567077637
Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
Score: 0.4629695415496826
Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.
Score: 0.45800843834877014
Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.
Score: 0.45398443937301636

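Create an Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

Create a file named CreateIndex.java and paste the following code: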

CreateIndex.java

import com.mongodb.MongoException;
import com.mongodb.client.ListSearchIndexesIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.Collections;
import java.util.List;
public class CreateIndex {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }
        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");
            // define the index details
            String indexName = "vector_index";
            int dimensionsHuggingFaceModel = 1024;
            int dimensionsOpenAiModel = 1536;
            Bson definition = new Document(
                    "fields",
                    Collections.singletonList(
                            new Document("type", "vector")
                                    .append("path", "embedding")
                                    .append("numDimensions", <dimensions>) // replace with var for the model used
                                    .append("similarity", "dotProduct")));
            // define the index model using the specified details
            SearchIndexModel indexModel = new SearchIndexModel(
                    indexName,
                    definition,
                    SearchIndexType.vectorSearch());
            // Create the index using the model
            try {
                List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                System.out.println("Successfully created a vector index named: " + result);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            // Wait for Atlas to build the index and make it queryable
            System.out.println("Polling to confirm the index has completed building.");
            System.out.println("It may take up to a minute for the index to build before you can query using it.");
            ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
            Document doc = null;
            while (doc == null) {
                try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                    if (!cursor.hasNext()) {
                        break;
                    }
                    Document current = cursor.next();
                    String name = current.getString("name");
                    boolean queryable = current.getBoolean("queryable");
                    if (name.equals(indexName) && queryable) {
                        doc = current;
                    } else {
                        Thread.sleep(500);
                    }
                }  catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            System.out.println(indexName + " index is ready to query");
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

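Replace the <dimensions> placeholder value with the appropriate variable for the model you use:
- dimensionsHuggingFaceModel: 1024 dimensions ("mixedbread-ai/mxbai-embed-large-v1" model)
- dimensionsOpenAiModel: 1536 dimensions ("text-embedding-3-small" model)
Note
The number of dimensions is determined by the model used to generate the embeddings. If you adapt this code to use a different model, ensure that you pass the correct value to numDimensions. See also the "Choose an Embedding Model" section.

Save and run the file. The output resembles: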

Successfully created a vector index named: [vector_index]
Polling to confirm the index has completed building.
It may take up to a minute for the index to build before you can query using it.
vector_index index is ready to query

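Note

Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.

Create embeddings for vector search queries and run a query.

Create a file named VectorQuery.java and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.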

VectorQuery.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.search.FieldSearchPath;
import org.bson.BsonArray;
import org.bson.BsonValue;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.List;
import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
import static java.util.Arrays.asList;
public class VectorQuery {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }
        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");
            // define $vectorSearch query options
            String query = "ocean tragedy";
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();
            BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
            List<Double> embedding = new ArrayList<>();
            for (BsonValue value : embeddingBsonArray) { // BsonArray is a List<BsonValue>, so this also compiles on JDK 8
                embedding.add(value.asDouble().getValue());
            }
            // define $vectorSearch pipeline
            String indexName = "vector_index";
            FieldSearchPath fieldSearchPath = fieldPath("embedding");
            int limit = 5;
            List<Bson> pipeline = asList(
                    vectorSearch(
                            fieldSearchPath,
                            embedding,
                            indexName,
                            limit,
                            exactVectorSearchOptions()
                    ),
                    project(
                            fields(exclude("_id"), include("text"),
                                    metaVectorSearchScore("score"))));
            // run query and print results
            List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
            if (results.isEmpty()) {
                System.out.println("No results found.");
            } else {
                results.forEach(doc -> {
                    System.out.println("Text: " + doc.getString("text"));
                    System.out.println("Score: " + doc.getDouble("score"));
                });
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Save and run the file. The output resembles one of the following, depending on the model you use:

Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.004247286356985569
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.003116759704425931
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.002447686856612563

Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.45522359013557434
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.4049977660179138
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.35942474007606506

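Create an Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.

Create a file named CreateIndex.java and paste the following code: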

CreateIndex.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.ListSearchIndexesIterable;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.Collections;
import java.util.List;
public class CreateIndex {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }
        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
            MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
            // define the index details
            String indexName = "vector_index";
            int dimensionsHuggingFaceModel = 1024;
            int dimensionsOpenAiModel = 1536;
            Bson definition = new Document(
                    "fields",
                    Collections.singletonList(
                            new Document("type", "vector")
                                    .append("path", "embeddings")
                                    .append("numDimensions", <dimensions>) // replace with var for the model used
                                    .append("similarity", "dotProduct")
                                    .append("quantization", "scalar")));
            // define the index model using the specified details
            SearchIndexModel indexModel = new SearchIndexModel(
                    indexName,
                    definition,
                    SearchIndexType.vectorSearch());
            // create the index using the model
            try {
                List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                System.out.println("Successfully created a vector index named: " + result);
                System.out.println("It may take up to a minute for the index to build before you can query using it.");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            // wait for Atlas to build the index and make it queryable
            System.out.println("Polling to confirm the index has completed building.");
            ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
            Document doc = null;
            while (doc == null) {
                try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                    if (!cursor.hasNext()) {
                        break;
                    }
                    Document current = cursor.next();
                    String name = current.getString("name");
                    boolean queryable = current.getBoolean("queryable");
                    if (name.equals(indexName) && queryable) {
                        doc = current;
                    } else {
                        Thread.sleep(500);
                    }
                }  catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            System.out.println(indexName + " index is ready to query");
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

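Replace the <dimensions> placeholder value with the appropriate variable for the model you use:
- dimensionsHuggingFaceModel: 1024 dimensions (open-source model)
- dimensionsOpenAiModel: 1536 dimensions (OpenAI model)
Note
The number of dimensions is determined by the model used to generate the embeddings. If you use a different model, ensure that you pass the correct value to numDimensions. See also the "Choose an Embedding Model" section.

Save and run the file. The output resembles: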

Successfully created a vector index named: [vector_index]
It may take up to a minute for the index to build before you can query using it.
Polling to confirm the index has completed building.
vector_index index is ready to query

注意

构建索引大约需要一分钟时间。在建立索引时，索引处于初始同步状态。构建完成后，您可以开始查询集合中的数据。

有关更多信息，请参阅创建 Atlas Vector Search 索引。

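Create embeddings for vector search queries and run a query.

Create a file named VectorQuery.java and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.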
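
The builder calls in this example correspond to a plain aggregation pipeline. As a language-neutral illustration, the following Python sketch assembles the equivalent stage documents. build_enn_pipeline is a hypothetical helper name, and the option names (index, queryVector, path, exact, limit) mirror the $vectorSearch options used in the other examples on this page.

```python
from typing import Any, Dict, List, Sequence


def build_enn_pipeline(
    query_vector: Sequence[float],
    path: str,
    index: str = "vector_index",
    limit: int = 5,
    project_field: str = "summary",
) -> List[Dict[str, Any]]:
    """Return a two-stage pipeline: exact vector search, then a projection."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "queryVector": list(query_vector),
                "path": path,
                "exact": True,  # ENN search instead of approximate search
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                project_field: 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


if __name__ == "__main__":
    pipeline = build_enn_pipeline([0.1, 0.2, 0.3], path="embedding")
    print(pipeline[0]["$vectorSearch"]["path"])
```

Keeping the pipeline in one small function makes it easier to reuse the same stage shape for different queries or collections.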

VectorQuery.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.search.FieldSearchPath;
import org.bson.BsonArray;
import org.bson.BsonValue;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.List;
import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
import static java.util.Arrays.asList;
public class VectorQuery {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }
        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
            MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
            // define the query and get the embedding
            String query = "beach house";
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();
            BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
            List<Double> embedding = new ArrayList<>();
            // BsonArray implements List<BsonValue>, so it can be iterated directly
            for (BsonValue value : embeddingBsonArray) {
                embedding.add(value.asDouble().getValue());
            }
            // define $vectorSearch pipeline
            String indexName = "vector_index";
            FieldSearchPath fieldSearchPath = fieldPath("embeddings");
            int limit = 5;
            List<Bson> pipeline = asList(
                    vectorSearch(
                            fieldSearchPath,
                            embedding,
                            indexName,
                            limit,
                            exactVectorSearchOptions()),
                    project(
                            fields(exclude("_id"), include("summary"),
                                    metaVectorSearchScore("score"))));
            // run query and print results
            List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
            if (results.isEmpty()) {
                System.out.println("No results found.");
            } else {
                results.forEach(doc -> {
                    System.out.println("Summary: " + doc.getString("summary"));
                    System.out.println("Score: " + doc.getDouble("score"));
                });
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Save and run the file. The output resembles one of the following, depending on the model you use:

Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
Score: 0.004518083296716213
Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.
Score: 0.0044807991944253445
Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
Score: 0.004242129623889923
Summary: Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
Score: 0.004227751865983009
Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
Score: 0.004220190457999706

Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
Score: 0.4832950830459595
Summary: Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
Score: 0.48092085123062134
Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
Score: 0.4629460275173187
Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.
Score: 0.4581468403339386
Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.
Score: 0.45398443937301636

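Create the Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

Create a file named create-index.js and paste the following code.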

create-index.js

import { MongoClient } from 'mongodb';
// connect to your Atlas deployment
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
async function run() {
   try {
     const database = client.db("sample_db");
     const collection = database.collection("embeddings");
    
     // define your Atlas Vector Search index
     const index = {
         name: "vector_index",
         type: "vectorSearch",
         definition: {
           "fields": [
             {
               "type": "vector",
               "path": "embedding",
               "similarity": "dotProduct",
               "numDimensions": <dimensions>
             }
           ]
         }
     }
     // run the helper method
     const result = await collection.createSearchIndex(index);
     console.log(result);
   } finally {
     await client.close();
   }
}
run().catch(console.dir);

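If you use the open-source model, replace the <dimensions> placeholder value with 768. If you use the OpenAI model, replace it with 1536.
Save the file, then run the following command: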
```
node --env-file=.env create-index.js
```

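Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

For more information, see Create an Atlas Vector Search Index.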

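Create embeddings for vector search queries and run a query.

Create a file named vector-query.js and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.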

vector-query.js

import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// MongoDB connection URI and options
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
async function run() {
    try {
        // Connect to the MongoDB client
        await client.connect();
        // Specify the database and collection
        const database = client.db("sample_db"); 
        const collection = database.collection("embeddings"); 
        // Generate embedding for the search query
        const queryEmbedding = await getEmbedding("ocean tragedy");
        // Define the sample vector search pipeline
        const pipeline = [
            {
                $vectorSearch: {
                    index: "vector_index",
                    queryVector: queryEmbedding,
                    path: "embedding",
                    exact: true,
                    limit: 5
                }
            },
            {
                $project: {
                    _id: 0,
                    text: 1,
                    score: {
                        $meta: "vectorSearchScore"
                    }
                }
            }
        ];
        // run pipeline
        const result = collection.aggregate(pipeline);
        // print results
        for await (const doc of result) {
            console.dir(JSON.stringify(doc));
        }
    } finally {
        await client.close();
    }
}
run().catch(console.dir);

Save the file, then run the following command:

node --env-file=.env vector-query.js

'{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}'
'{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}'
'{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}'

node --env-file=.env vector-query.js

{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}

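Create the Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

Create a file named create-index.js and paste the following code.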

create-index.js

import { MongoClient } from 'mongodb';
// connect to your Atlas deployment
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
async function run() {
  try {
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");
   
    // Define your Atlas Vector Search index
    const index = {
        name: "vector_index",
        type: "vectorSearch",
        definition: {
          "fields": [
            {
              "type": "vector",
              "path": "embedding",
              "similarity": "dotProduct",
              "numDimensions": <dimensions>,
              "quantization": "scalar"
            }
          ]
        }
    }
    // Call the method to create the index
    const result = await collection.createSearchIndex(index);
    console.log(result);
  } finally {
    await client.close();
  }
}
run().catch(console.dir);

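If you use the open-source model, replace the <dimensions> placeholder value with 768. If you use the OpenAI model, replace it with 1536.
Save the file, then run the following command: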
```
node --env-file=.env create-index.js
```

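Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

For more information, see Create an Atlas Vector Search Index.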

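Create embeddings for vector search queries and run a query.

Create a file named vector-query.js and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.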

vector-query.js

import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// MongoDB connection URI and options
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
async function run() {
    try {
        // Connect to the MongoDB client
        await client.connect();
        // Specify the database and collection
        const database = client.db("sample_airbnb"); 
        const collection = database.collection("listingsAndReviews"); 
        // Generate embedding for the search query
        const queryEmbedding = await getEmbedding("beach house");
        // Define the sample vector search pipeline
        const pipeline = [
            {
                $vectorSearch: {
                    index: "vector_index",
                    queryVector: queryEmbedding,
                    path: "embedding",
                    exact: true,
                    limit: 5
                }
            },
            {
                $project: {
                    _id: 0,
                    summary: 1,
                    score: {
                        $meta: "vectorSearchScore"
                    }
                }
            }
        ];
        // run pipeline
        const result = collection.aggregate(pipeline);
        // print results
        for await (const doc of result) {
            console.dir(JSON.stringify(doc));
        }
    } finally {
        await client.close();
    }
}
run().catch(console.dir);

Save the file, then run the following command:

node --env-file=.env vector-query.js

'{"summary":"Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.","score":0.5334879159927368}'
'{"summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.","score":0.5240535736083984}'
'{"summary":"The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.","score":0.5232879519462585}'
'{"summary":"Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.","score":0.5186381340026855}'
'{"summary":"A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.","score":0.5078228116035461}'

node --env-file=.env vector-query.js

{"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
{"summary": "Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
{"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
{"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
{"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.", "score": 0.45400717854499817}

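Create the Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Paste the following code into your notebook.

This code creates an index on your collection that specifies the following:

The BSON-Float32-Embedding, BSON-Int8-Embedding, and BSON-Int1-Embedding fields as vector type fields.
euclidean as the similarity function for the int1 embeddings, and dotProduct as the similarity function for the float32 and int8 embeddings.
768 as the number of dimensions in these embeddings.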

from pymongo.operations import SearchIndexModel
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
  definition = {
    "fields": [
      {
        "type": "vector",
        "path": "BSON-Float32-Embedding",
        "similarity": "dotProduct",
        "numDimensions": 768
      },
      {
        "type": "vector",
        "path": "BSON-Int8-Embedding",
        "similarity": "dotProduct",
        "numDimensions": 768
      },
      {
        "type": "vector",
        "path": "BSON-Int1-Embedding",
        "similarity": "euclidean",
        "numDimensions": 768
      }
    ]
  },
  name="vector_index",
  type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

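Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

For more information, see Create an Atlas Vector Search Index.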
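
Because the index is not usable until the build finishes, it can help to poll for readiness instead of waiting a fixed time. The following is a minimal sketch, not an official helper: wait_until_queryable is a hypothetical function that accepts any callable returning index documents with name and queryable fields, which are the fields the Java example on this page reads.

```python
import time
from typing import Any, Callable, Iterable, Mapping


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Mapping[str, Any]]],
    index_name: str,
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll until the named index reports queryable, or until the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        for index in list_indexes():
            if index.get("name") == index_name and index.get("queryable") is True:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With PyMongo, one possible call is wait_until_queryable(lambda: collection.list_search_indexes(), "vector_index"), assuming collection is the handle used elsewhere in this notebook. The timeout and interval defaults are arbitrary choices for illustration.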

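To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Paste the following code into your notebook.

This code creates an index on your collection that specifies the embedding field as the vector type, the similarity function as dotProduct, and the number of dimensions as 1536. It also enables automatic scalar quantization for the vectors in the embedding field so that your vector data is processed efficiently.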
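
To build intuition for what scalar quantization does, the sketch below maps float values to 8-bit integer codes using the vector's own minimum and maximum. This is a generic illustration of the technique under simple assumptions, not Atlas's internal implementation, and both function names are hypothetical.

```python
from typing import List, Tuple


def scalar_quantize(vector: List[float]) -> Tuple[List[int], float, float]:
    """Map each float to an integer code in [-128, 127] over the vector's own range."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # Degenerate range: every value is identical, so one code suffices.
        return [0] * len(vector), lo, hi
    scale = 255.0 / (hi - lo)
    codes = [round((x - lo) * scale) - 128 for x in vector]
    return codes, lo, hi


def dequantize(codes: List[int], lo: float, hi: float) -> List[float]:
    """Approximately recover the original floats from the integer codes."""
    if hi == lo:
        return [lo] * len(codes)
    step = (hi - lo) / 255.0
    return [(c + 128) * step + lo for c in codes]


if __name__ == "__main__":
    original = [-0.42, 0.0, 0.13, 0.87]
    codes, lo, hi = scalar_quantize(original)
    print(codes)
    print([round(x, 3) for x in dequantize(codes, lo, hi)])
```

Each value shrinks from a 32-bit float to one byte, at the cost of a small rounding error per dimension.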

from pymongo.operations import SearchIndexModel
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
  definition = {
    "fields": [
      {
        "type": "vector",
        "path": "embedding",
        "similarity": "dotProduct",
        "numDimensions": 1536,
        "quantization": "scalar"
      }
    ]
  },
  name="vector_index",
  type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

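Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

For more information, see Create an Atlas Vector Search Index.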

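Create embeddings for vector search queries, then run a query.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Converts the query embedding to the BSON float32, int8, and int1 vector subtypes.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.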
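
The scores returned with each document are normalized values, not raw distances. The sketch below shows the normalization as we understand it from the Atlas Vector Search scoring documentation for float vectors: (1 + dotProduct) / 2 for dotProduct and 1 / (1 + distance) for euclidean. Treat these formulas as an assumption to verify against the current documentation. The function names are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dot_product_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed normalization for dotProduct similarity: (1 + a.b) / 2."""
    return (1.0 + dot(a, b)) / 2.0


def euclidean_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed normalization for euclidean similarity: 1 / (1 + distance)."""
    return 1.0 / (1.0 + math.dist(a, b))


if __name__ == "__main__":
    query = [1.0, 0.0]
    print(dot_product_score(query, [1.0, 0.0]))
    print(euclidean_score(query, [0.0, 1.0]))
```

Under these formulas, scores from indexes that use different similarity functions are not directly comparable, which is one reason the same query can show different score ranges across the outputs on this page.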

Note

The query might take some time to complete.

# Prepare your query
query_text = "ocean tragedy"
# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")
# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)
# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "data": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]
    pipelines.append(pipeline)
# Execute the search for each precision
for pipeline in pipelines:
    print(f"\nResults for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)
    # Print results
    for i in results:
        print(i)

Results for BSON-Float32-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.7661113739013672}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.7050272822380066}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.7024770379066467}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.7011005282402039}
{'data': 'E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.', 'score': 0.6877288818359375}
Results for BSON-Int8-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.5}
{'data': 'The Lion King: Lion cub and future king Simba searches for his identity', 'score': 0.5}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.5}
{'data': "Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.", 'score': 0.5}
{'data': 'The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.', 'score': 0.5}
Results for BSON-Int1-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.671875}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.6484375}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.640625}
{'data': 'Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.', 'score': 0.6145833134651184}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.61328125}

# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")
# Sample vector search pipeline
pipeline = [
   {
      "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
      }
   }, 
   {
      "$project": {
         "_id": 0, 
         "text": 1,
         "score": {
            "$meta": "vectorSearchScore"
         }
      }
   }
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
   print(i)

{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}

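Create the Atlas Vector Search index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Paste the following code into your notebook.

This code creates an index on your collection that specifies the following:

The BSON-Float32-Embedding, BSON-Int8-Embedding, and BSON-Int1-Embedding fields as vector type fields.
euclidean as the similarity function for the int1 embeddings, and dotProduct as the similarity function for the float32 and int8 embeddings.
768 as the number of dimensions in these embeddings.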

from pymongo.operations import SearchIndexModel
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
  definition = {
    "fields": [
      {
        "type": "vector",
        "path": "BSON-Float32-Embedding",
        "similarity": "dotProduct",
        "numDimensions": 768
      },
      {
        "type": "vector",
        "path": "BSON-Int8-Embedding",
        "similarity": "dotProduct",
        "numDimensions": 768
      },
      {
        "type": "vector",
        "path": "BSON-Int1-Embedding",
        "similarity": "euclidean",
        "numDimensions": 768
      }
    ]
  },
  name="vector_index",
  type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

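Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

For more information, see Create an Atlas Vector Search Index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Paste the following code into your notebook.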

from pymongo.operations import SearchIndexModel
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
  definition = {
    "fields": [
      {
        "type": "vector",
        "path": "embedding",
        "similarity": "dotProduct",
        "numDimensions": 1536,
        "quantization": "scalar"
      }
    ]
  },
  name="vector_index",
  type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

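Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

For more information, see Create an Atlas Vector Search Index.

Create embeddings for vector search queries and run a query.

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Converts the query embedding to the BSON float32, int8, and int1 vector subtypes.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.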
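
The int1 subtype stores one bit per dimension, so a 768-dimension embedding occupies 96 bytes. The sketch below illustrates only the packing idea, assuming values are binarized at zero and packed most significant bit first. In the tutorial itself, get_embedding with precision="ubinary" and generate_bson_vector perform the real conversion; pack_bits is a hypothetical name.

```python
from typing import List


def pack_bits(vector: List[float]) -> bytes:
    """Binarize (value > 0 becomes 1) and pack 8 dimensions into each byte."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift earlier bits left so the first dimension lands in the high bit.
            byte = (byte << 1) | (1 if value > 0 else 0)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    packed = pack_bits([0.5] * 768)  # stand-in for a real 768-dimension embedding
    print(len(packed))
```

The query vector and the stored vectors must use the same packing, which is why the query embedding is converted with the same helper functions as the stored data.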

# Prepare your query
query_text = "beach house"
# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")
# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)
# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "summary": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]
    pipelines.append(pipeline)
# Execute the search for each precision
for pipeline in pipelines:
    print(f"\nResults for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)
    
    # Print results
    for i in results:
        print(i)

Results for BSON-Float32-Embedding:
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.7847104072570801}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.', 'score': 0.7780507802963257}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.", 'score': 0.7723637223243713}
{'summary': 'Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.', 'score': 0.7665778398513794}
{'summary': 'A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.', 'score': 0.7593404650688171}
Results for BSON-Int8-Embedding:
{'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.', 'score': 0.5}
{'summary': 'One bedroom + sofa-bed in quiet and bucolic neighbourhood right next to the Botanical Garden. Small garden, outside shower, well equipped kitchen and bathroom with shower and tub. Easy for transport with many restaurants and basic facilities in the area.', 'score': 0.5}
{'summary': "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", 'score': 0.5}
{'summary': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway.  The neighborhood is diverse and appeals to a variety of people.', 'score': 0.5}
{'summary': 'Quarto com vista para a Lagoa Rodrigo de Freitas, cartão postal do Rio de Janeiro. Linda Vista.  1 Quarto e 1 banheiro  Amplo, arejado, vaga na garagem. Prédio com piscina, sauna e playground.  Fácil acesso, próximo da praia e shoppings.', 'score': 0.5}
Results for BSON-Int1-Embedding:
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.6901041865348816}
{'summary': 'Cozy and comfortable apartment. Ideal for families and vacations.  3 bedrooms, 2 of them suites.  Located 20-min walk to the beach and close to the Rio 2016 Olympics Venues. Situated in a modern and secure condominium, with many entertainment available options around.', 'score': 0.6731770634651184}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.", 'score': 0.6731770634651184}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.', 'score': 0.671875}
{'summary': 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!', 'score': 0.6705729365348816}

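To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, this code does the following:

Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding to the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an exact nearest neighbor (ENN) search on your embeddings.
Runs the sample vector search query on the embeddings stored in your collection.
Returns semantically similar documents in order of relevance, along with their vector search scores.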

# Generate embedding for the search query
query_embedding = get_embedding("beach house")

# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "summary": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
    print(i)

{"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
{"summary": "Room 2  Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen  and a single bed. Not suitable for group booking All rooms have  TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
{"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
{"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom.  Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe.  The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling  paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
{"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use.  The property boasts 45 m2.", "score": 0.45400717854499817}

Considerations

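Consider the following factors when you create vector embeddings:

Choose a Method to Create Embeddings

To create vector embeddings, you must use an embedding model. Embedding models are algorithms that convert data into embeddings. You can choose one of the following methods to connect to an embedding model and create vector embeddings:

Method	Description
Load an open-source model	If you don't have an API key for a proprietary embedding model, load an open-source embedding model locally from your application.
Use a proprietary model	Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings.
Leverage an integration	You can integrate Atlas Vector Search with open-source frameworks and AI services to quickly connect to both open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search. To learn more, see Integrate Vector Search with AI Technologies.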

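Choose an Embedding Model

The embedding model that you choose affects your query results and determines the number of dimensions that you specify in your Atlas Vector Search index. Each model offers different advantages depending on your data and use case.

For a list of popular embedding models, see the Massive Text Embedding Benchmark (MTEB). This list provides insights into various open-source and proprietary text embedding models and allows you to filter models by use case, model type, and specific model metrics.

When you choose an embedding model for Atlas Vector Search, consider the following metrics:

Embedding Dimensions: The length of the vector embedding.
Smaller embeddings are more storage efficient, while larger embeddings can capture more nuanced relationships in your data. The model that you choose should strike a balance between efficiency and complexity.
Max Tokens: The number of tokens that can be compressed into a single embedding.
Model Size: The size of the model in gigabytes.
Larger models generally perform better, but they require more computational resources as you scale Atlas Vector Search to production.
Retrieval Average: A score that measures the performance of retrieval systems.
A higher score indicates that the model is better at ranking relevant documents higher in the list of retrieved results. This score is important when you choose a model for RAG applications.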
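
To make the storage side of the dimension trade-off concrete, the following sketch estimates the raw size of the vectors alone for a given dimension count and element type. It is a back-of-the-envelope calculation rather than an Atlas sizing tool: the function name estimate_vector_bytes is hypothetical, the bits-per-element figures assume 32-bit floats, 8-bit integers, and 1-bit packed binary values, and BSON and index overhead are ignored.

```python
# Assumed storage cost per vector element, in bits (an illustration,
# not a statement about how Atlas lays out data on disk).
BITS_PER_ELEMENT = {"float32": 32, "int8": 8, "int1": 1}


def estimate_vector_bytes(num_dimensions: int, num_documents: int,
                          dtype: str = "float32") -> int:
    """Return the approximate raw size in bytes of the vectors alone."""
    if dtype not in BITS_PER_ELEMENT:
        raise ValueError(f"unknown dtype: {dtype!r}")
    if num_dimensions < 1 or num_documents < 0:
        raise ValueError("dimensions must be >= 1 and documents >= 0")
    # Round each vector up to a whole number of bytes.
    bytes_per_vector = (num_dimensions * BITS_PER_ELEMENT[dtype] + 7) // 8
    return bytes_per_vector * num_documents


if __name__ == "__main__":
    # Compare a smaller and a larger embedding size for one million documents.
    for dims in (384, 1536):
        for dtype in BITS_PER_ELEMENT:
            size_mb = estimate_vector_bytes(dims, 1_000_000, dtype) / 1_000_000
            print(f"{dims} dims, {dtype}: ~{size_mb:.0f} MB per million documents")
```

Under these assumptions, quadrupling the number of dimensions quadruples the raw vector storage, which is one reason to weigh a larger model's retrieval quality against its footprint.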

Tip

See also:

How to Choose the Right Embedding Model for Your Application

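Validate Your Embeddings

Consider the following strategies to ensure that your embeddings are correct and optimal:

Test your functions and scripts.
Generating embeddings takes time and computational resources. Before you create embeddings from large datasets or collections, test that your embedding functions or scripts work as expected on a small subset of your data.
Create embeddings in batches.
If you want to generate embeddings from a large dataset or a collection with a lot of documents, create them in batches to avoid memory issues and optimize performance.
Evaluate performance.
Run test queries to check whether your search results are relevant and accurately ranked.
To learn more about how to evaluate your results and fine-tune the performance of your indexes and queries, see How to Measure the Accuracy of Your Query Results and Improve Vector Search Performance.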
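
The batching strategy above can be sketched as follows. This is a minimal illustration, not the tutorial's own code: batched and embed_in_batches are hypothetical helper names, and embed_batch stands in for whatever function you use to embed a list of texts in one call (for example, a wrapper around your embedding model).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_in_batches(
    texts: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts one batch at a time so only one batch of input is held per call."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


if __name__ == "__main__":
    # Placeholder embedder (an assumption for this sketch): one number per text.
    def fake_embed(batch: List[str]) -> List[List[float]]:
        return [[float(len(text))] for text in batch]

    print(embed_in_batches(["beach house", "city loft", "cabin"], fake_embed, batch_size=2))
```

Running the sketch on a handful of texts with a small batch size is also a cheap way to test a script before pointing it at a full collection.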

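Troubleshooting

If you experience issues with your embeddings, consider the following strategies:

Verify your environment.
Check that the necessary dependencies are installed and up to date. Conflicting library versions can cause unexpected behavior. To rule out conflicts, create a new environment and install only the required packages.
Note
If you use Colab, ensure that your notebook session's IP address is included in your Atlas project's access list.
Monitor memory usage.
If you experience performance issues, check your RAM, CPU, and disk usage to identify any potential bottlenecks. For hosted environments such as Colab or Jupyter Notebooks, ensure that your instance is provisioned with sufficient resources, and upgrade the instance if necessary.
Ensure consistent dimensions.
Verify that the Atlas Vector Search index definition matches the dimensions of the embeddings stored in Atlas, and that your query embeddings match the dimensions of the indexed embeddings. Otherwise, you might encounter errors when you run vector search queries.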
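
One way to catch a dimension mismatch before you run a query is to compare vector lengths against the index definition in code. The sketch below assumes the index definition is available as a Python dictionary with a fields list whose vector entries carry path and numDimensions; the helper names index_dimensions and check_dimensions are hypothetical, and the four-dimensional sample values are made up for illustration.

```python
from typing import Any, Dict, Sequence


def index_dimensions(index_definition: Dict[str, Any], path: str) -> int:
    """Return numDimensions for the vector field indexed at the given path."""
    for field in index_definition.get("fields", []):
        if field.get("type") == "vector" and field.get("path") == path:
            return field["numDimensions"]
    raise KeyError(f"no vector field indexed at path {path!r}")


def check_dimensions(embedding: Sequence[float], expected: int, label: str) -> None:
    """Raise ValueError if the embedding length differs from the expected size."""
    if len(embedding) != expected:
        raise ValueError(
            f"{label} has {len(embedding)} dimensions, expected {expected}"
        )


if __name__ == "__main__":
    # Made-up definition and vectors, kept tiny for readability.
    definition = {
        "fields": [
            {"type": "vector", "path": "embedding",
             "numDimensions": 4, "similarity": "cosine"}
        ]
    }
    expected = index_dimensions(definition, "embedding")
    check_dimensions([0.1, 0.2, 0.3, 0.4], expected, "stored embedding")
    try:
        check_dimensions([0.1, 0.2, 0.3], expected, "query embedding")
    except ValueError as error:
        print(error)
```

Applying the same check to both stored embeddings and query embeddings covers the two mismatches described above.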

To troubleshoot specific issues, see Troubleshooting.

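Next Steps

Now that you've learned how to create embeddings and query them with Atlas Vector Search, you can start building generative AI applications by implementing retrieval-augmented generation (RAG).

You can also convert your embeddings to BSON vectors for efficient storage and ingestion of vectors in Atlas. To learn more, see How to Ingest Pre-Quantized Vectors.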

