Docs 菜单
Docs 主页
/
Atlas
/

如何创建向量嵌入

在此页面上

  • 开始体验
  • 先决条件
  • 使用嵌入模型
  • 从数据创建嵌入
  • 为查询创建嵌入
  • Considerations
  • 选择创建嵌入的方法
  • 选择内嵌模型
  • 向量压缩
  • 验证您的嵌入
  • 后续步骤

您可以将向量嵌入与其他数据一起存储在 Atlas 中。这些嵌入可以捕获数据中的有意义的关系,使您能够执行语义搜索并使用 Atlas Vector Search 实现 RAG

通过以下教程,学习;了解如何创建向量嵌入并使用Atlas Vector Search进行查询。 具体来说,您执行以下操作:

  1. 定义一个使用 嵌入模型生成向量嵌入的函数。

  2. 从您的数据中创建嵌入并将其存储在 Atlas 中。

  3. 根据搜索词创建嵌入,并运行向量搜索查询。

对于生产应用程序,您通常会写入脚本来生成向量嵌入。 您可以从此页面上的示例代码开始,并根据您的使用案例进行自定义。


➤ 使用 Select your language(选择您的语言)下拉菜单设置此页面上示例的语言。

使用可运行版本的本教程以作为 Python 笔记本

如要完成本教程,您必须具备以下条件:

  • 一个 Atlas 帐户,而其集群运行着 MongoDB 版本 6.0.11、7.0.2 或更高版本(包括 RC)。确保您的 IP 地址包含在 Atlas 项目的访问列表中。如需了解详情,请参阅创建集群

  • 用于运行 C# 项目的终端和代码编辑器。

  • 已安装 .NET 8.0 或更高版本

  • Hugging Face 访问令牌或 OpenAI API 密钥。

  • 一个 Atlas 帐户,而其集群运行着 MongoDB 版本 6.0.11、7.0.2 或更高版本(包括 RC)。确保您的 IP 地址包含在 Atlas 项目的访问列表中。如需了解详情,请参阅创建集群

  • 用于运行 Go 项目的终端和代码编辑器。

  • Go 已安装。

  • Hugging Face 访问令牌或 OpenAI API 密钥。

  • 一个 Atlas 帐户,而其集群运行着 MongoDB 版本 6.0.11、7.0.2 或更高版本(包括 RC)。确保您的 IP 地址包含在 Atlas 项目的访问列表中。如需了解详情,请参阅创建集群

  • 以下之一:

    • 具有读取访问权限的“拥抱”脸部访问令牌

    • 一个 OpenAI API 密钥。您必须拥有一个 OpenAI 账号,该账号具有可用于 API 请求的信用额度。要了解有关注册 OpenAI 账号的更多信息,请参阅 OpenAI API 网站

  • 一个 Atlas 帐户,而其集群运行着 MongoDB 版本 6.0.11、7.0.2 或更高版本(包括 RC)。确保您的 IP 地址包含在 Atlas 项目的访问列表中。如需了解详情,请参阅创建集群

  • 用于运行 Node.js 项目的终端和代码编辑器。

  • npm 和 Node.js 已安装。

  • 如果您使用 OpenAI 模型,则必须拥有 OpenAI API 密钥。

  • 一个 Atlas 帐户,而其集群运行着 MongoDB 版本 6.0.11、7.0.2 或更高版本(包括 RC)。确保您的 IP 地址包含在 Atlas 项目的访问列表中。如需了解详情,请参阅创建集群

  • 运行交互式 Python 笔记本(如 VS Code Colab) 的环境。

  • 如果您使用 OpenAI 模型,则必须拥有 OpenAI API 密钥。

1

在终端窗口中,运行以下命令以初始化您的项目:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings
2

在终端窗口中,运行以下命令:

dotnet add package MongoDB.Driver
3

导出您的环境变量,在 PoweShell 中对其进行 set,或是使用 IDE 的环境变量管理器以使该连接字符串和 HuggingFace 访问令牌可用于您的项目。

export HUGGINGFACE_ACCESS_TOKEN="<access-token>"
export ATLAS_CONNECTION_STRING="<connection-string>"

<access-token> 占位符值替换为您的 Huging Face访问权限令牌。

用 Atlas 集群的 SRV 连接字符串替换 <connection-string> 占位符值。

连接字符串应使用以下格式:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4

在名为 AIService.cs 的同名文件中创建一个新类并粘贴以下代码。此代码定义了一个名为 GetEmbeddingsAsync 的异步任务,用于为给定字符串输入的数组生成嵌入数组。此函数使用了 mxbai-embed-large-v1 嵌入模型。

AIService.cs
namespace MyCompany.Embeddings;
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using System.Net.Http.Headers;
public class AIService
{
private static readonly string? HuggingFaceAccessToken = Environment.GetEnvironmentVariable("HUGGINGFACE_ACCESS_TOKEN");
private static readonly HttpClient Client = new HttpClient();
public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
{
const string modelName = "mixedbread-ai/mxbai-embed-large-v1";
const string url = $"https://api-inference.huggingface.co/models/{modelName}";
Client.DefaultRequestHeaders.Authorization
= new AuthenticationHeaderValue("Bearer", HuggingFaceAccessToken);
var data = new { inputs = texts };
var dataJson = JsonSerializer.Serialize(data);
var content = new StringContent(dataJson,null, "application/json");
var response = await Client.PostAsync(url, content);
response.EnsureSuccessStatusCode();
var responseString = await response.Content.ReadAsStringAsync();
var embeddings = JsonSerializer.Deserialize<float[][]>(responseString);
if (embeddings is null)
{
throw new ApplicationException("Failed to deserialize embeddings response to an array of floats.");
}
Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
var embeddingCount = embeddings.Length;
foreach (var value in Enumerable.Range(0, embeddingCount))
{
// Pair each embedding with the text used to generate it.
documentData[texts[value]] = embeddings[value];
}
return documentData;
}
}

注意

503 调用 Hushing Face 模型时

在调用 Hugging Face 模型中心模型时,您偶尔可能会遇到 503 错误。要解决此问题,请在短暂等待后重试。

1

在终端窗口中,运行以下命令以初始化您的项目:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings
2

在终端窗口中,运行以下命令:

dotnet add package MongoDB.Driver
dotnet add package OpenAI
3

导出您的环境变量,在 PowerShell 中对其进行 set,或是使用 IDE 的环境变量管理器以使该连接字符串和 HuggingFace 访问令牌可用于您的项目。

export OPENAI_API_KEY="<api-key>"
export ATLAS_CONNECTION_STRING="<connection-string>"

<api-key> 占位符值替换为您的 OpenAI API密钥。

用 Atlas 集群的 SRV 连接字符串替换 <connection-string> 占位符值。

连接字符串应使用以下格式:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4

在名为 AIService.cs 的同名文件中创建一个新类并粘贴以下代码。此代码定义了一个名为 GetEmbeddingsAsync 的异步任务,用于为给定字符串输入的数组生成嵌入数组。此函数使用 OpenAI 的 text-embedding-3-small 模型为给定输入生成嵌入。

AIService.cs
namespace MyCompany.Embeddings;
using OpenAI.Embeddings;
using System;
using System.Threading.Tasks;
public class AIService
{
private static readonly string? OpenAIApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
private static readonly string EmbeddingModelName = "text-embedding-3-small";
public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
{
EmbeddingClient embeddingClient = new(model: EmbeddingModelName, apiKey: OpenAIApiKey);
Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
try
{
var result = await embeddingClient.GenerateEmbeddingsAsync(texts);
var embeddingCount = result.Value.Count;
foreach (var index in Enumerable.Range(0, embeddingCount))
{
// Pair each embedding with the text used to generate it.
documentData[texts[index]] = result.Value[index].ToFloats().ToArray();
}
}
catch (Exception e)
{
throw new ApplicationException(e.Message);
}
return documentData;
}
}

在本节中,您将定义一个函数以使用嵌入模型生成向量嵌入。根据您想使用开源嵌入模型还是类似 OpenAI 的专有模型,选择一个标签页。

注意

开源嵌入模型可以免费使用,并且可以从您的应用程序本地加载。专有模型需要 API 密钥才能访问模型。

1

在终端窗口中,运行以下命令,创建一个名为 my-embeddings-project的新目录并初始化项目:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project
2

在终端窗口中,运行以下命令:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/v2/mongo
go get github.com/tmc/langchaingo/llms
3

在项目中,创建 .env 文件来存储 Atlas 连接字符串和 Hugging Face 访问令牌。

HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"

<access-token> 占位符值替换为您的 Huging Face访问权限令牌。

用 Atlas 集群的 SRV 连接字符串替换 <connection-string> 占位符值。

连接字符串应使用以下格式:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4
  1. 在项目中创建一个名为 common 的目录,用于存储您将在后续步骤中使用的常用代码:

    mkdir common && cd common
  2. 创建一个名为 get-embeddings.go 的文件并粘贴以下代码。此代码定义了一个名为 GetEmbeddings 的函数,用于为给定输入生成嵌入。此函数指定:

    get-embeddings.go
    package common
    import (
    "context"
    "log"
    "github.com/tmc/langchaingo/embeddings/huggingface"
    )
    func GetEmbeddings(documents []string) [][]float32 {
    hf, err := huggingface.NewHuggingface(
    huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
    huggingface.WithTask("feature-extraction"))
    if err != nil {
    log.Fatalf("failed to connect to Hugging Face: %v", err)
    }
    embs, err := hf.EmbedDocuments(context.Background(), documents)
    if err != nil {
    log.Fatalf("failed to generate embeddings: %v", err)
    }
    return embs
    }

    注意

    503 调用 Hushing Face 模型时

    在调用 Hugging Face 模型中心模型时,您偶尔可能会遇到 503 错误。要解决此问题,请在短暂等待后重试。

  3. 返回到主项目根目录。

    cd ../
1

在终端窗口中,运行以下命令,创建一个名为 my-embeddings-project的新目录并初始化项目:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project
2

在终端窗口中,运行以下命令:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/v2/mongo
go get github.com/milosgajdos/go-embeddings/openai
3

在您的项目中,创建一个.env string文件来存储连接字符串和 OpenAIAPI API令牌。

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

用 OpenAI API 密钥和 Atlas 集群的 SRV 连接字符串替换 <api-key><connection-string> 占位符值。连接字符串应使用以下格式:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

注意

连接字符串应使用以下格式:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4
  1. 在项目中创建一个名为 common 的目录,用于存储将在多个步骤中使用的代码:

    mkdir common && cd common
  2. 创建一个名为 get-embeddings.go 的文件并粘贴以下代码。此代码定义了一个名为 GetEmbeddings 的函数,该函数使用 OpenAI 的 text-embedding-3-small 模型为给定输入生成嵌入。

    get-embeddings.go
    package common
    import (
    "context"
    "log"
    "github.com/milosgajdos/go-embeddings/openai"
    )
    func GetEmbeddings(docs []string) [][]float64 {
    c := openai.NewClient()
    embReq := &openai.EmbeddingRequest{
    Input: docs,
    Model: openai.TextSmallV3,
    EncodingFormat: openai.EncodingFloat,
    }
    embs, err := c.Embed(context.Background(), embReq)
    if err != nil {
    log.Fatalf("failed to connect to OpenAI: %v", err)
    }
    var vectors [][]float64
    for _, emb := range embs {
    vectors = append(vectors, emb.Vector)
    }
    return vectors
    }
  3. 返回到主项目根目录。

    cd ../

在本节中,您将定义一个函数以使用嵌入模型生成向量嵌入。根据您想使用开源嵌入模型还是类似 OpenAI 的专有模型,选择一个标签页。

注意

开源嵌入模型可以免费使用,并且可以从您的应用程序本地加载。专有模型需要 API 密钥才能访问模型。

1
  1. 在 IDE 中,使用 Maven 或 Gradle 创建Java项目。

  2. 根据您的包管理器,添加以下依赖项:

    如果使用 Maven,请将以下依赖项添加到项目的 pom.xml文件的 dependencies大量中:

    pom.xml
    <dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with Hugging Face models -->
    <dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-hugging-face</artifactId>
    <version>0.35.0</version>
    </dependency>
    </dependencies>

    如果您使用 Gradle,请将以下内容添加到项目 build.gradle文件的 dependencies大量中:

    build.gradle
    dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with Hugging Face models
    implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0'
    }
  3. 运行包管理器以安装项目的依赖项。

2

注意

此示例在 IDE 中设置项目的变量。 生产应用程序可以通过部署配置、CI/CD管道或密钥管理器管理环境变量,但您可以调整提供的代码以适合您的使用案例。

在 IDE 中,创建新的配置模板并将以下变量添加到项目中:

  • 如果您使用的是 IntelliJ IDEA,则请创建一个新的 Application 运行配置模板,然后在 Environment variables 字段中将变量添加为用分号分隔的值(例如,FOO=123;BAR=456)。应用这些更改,然后单击 OK

    要学习;了解更多信息,请参阅 IntelliJ IDEA 文档的从模板创建运行/调试配置部分。

  • 如果您使用的是 Eclipse,请创建新的 Java Application 启动配置,然后将每个变量作为新的键值对添加到 Environment标签页中。 应用更改并单击 OK

    要学习;了解更多信息,请参阅 Eclipse IDE 文档的创建Java应用程序启动配置部分。

环境变量
HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>

用以下值更新占位符:

  • 将 ``<access-token>`` 占位符值替换为您的 Hugging Face 访问令牌。

  • 用 Atlas 集群的 SRV 连接字符串替换 <connection-string> 占位符值。

    连接字符串应使用以下格式:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

创建一个名为EmbeddingProvider.java的文件并粘贴以下代码。

此代码定义了两种使用 mxbai-embed-large-v1 开源嵌入模型为给定输入生成嵌入的方法:

  • getEmbeddings多个输入:List<String> 方法接受文本输入大量(),允许您在单个API调用中创建多个嵌入。该方法将API提供的浮点数数组转换为BSON双精度数组,以便存储在Atlas 集群中。

  • 单个输入:getEmbedding 方法接受单个String ,它表示要对向量数据进行的查询。该方法将API提供的浮点数大量转换为BSON双精度大量,以便在查询集合时使用。

EmbeddingProvider.java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
private static HuggingFaceEmbeddingModel embeddingModel;
private static HuggingFaceEmbeddingModel getEmbeddingModel() {
if (embeddingModel == null) {
String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
if (accessToken == null || accessToken.isEmpty()) {
throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
}
embeddingModel = HuggingFaceEmbeddingModel.builder()
.accessToken(accessToken)
.modelId("mixedbread-ai/mxbai-embed-large-v1")
.waitForModel(true)
.timeout(ofSeconds(60))
.build();
}
return embeddingModel;
}
/**
* Takes an array of strings and returns a BSON array of embeddings to
* store in the database.
*/
public List<BsonArray> getEmbeddings(List<String> texts) {
List<TextSegment> textSegments = texts.stream()
.map(TextSegment::from)
.toList();
Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
return response.content().stream()
.map(e -> new BsonArray(
e.vectorAsList().stream()
.map(BsonDouble::new)
.toList()))
.toList();
}
/**
* Takes a single string and returns a BSON array embedding to
* use in a vector query.
*/
public BsonArray getEmbedding(String text) {
Response<Embedding> response = getEmbeddingModel().embed(text);
return new BsonArray(
response.content().vectorAsList().stream()
.map(BsonDouble::new)
.toList());
}
}
1
  1. 在 IDE 中,使用 Maven 或 Gradle 创建Java项目。

  2. 根据您的包管理器,添加以下依赖项:

    如果使用 Maven,请将以下依赖项添加到项目的 pom.xml文件的 dependencies大量中:

    pom.xml
    <dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with OpenAI models -->
    <dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-open-ai</artifactId>
    <version>0.35.0</version>
    </dependency>
    </dependencies>

    如果您使用 Gradle,请将以下内容添加到项目 build.gradle文件的 dependencies大量中:

    build.gradle
    dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with OpenAI models
    implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0'
    }
  3. 运行包管理器以安装项目的依赖项。

2

注意

此示例在 IDE 中设置项目的变量。 生产应用程序可以通过部署配置、CI/CD管道或密钥管理器管理环境变量,但您可以调整提供的代码以适合您的使用案例。

在 IDE 中,创建新的配置模板并将以下变量添加到项目中:

  • 如果您使用的是 IntelliJ IDEA,则应创建一个新的 Application 运行配置模板,然后在 Environment variables 字段中以分号分隔的值的形式来添加变量(例如,FOO=123;BAR=456)。应用这些更改并单击 OK

    要学习;了解更多信息,请参阅 IntelliJ IDEA 文档的从模板创建运行/调试配置部分。

  • 如果您使用的是 Eclipse,请创建新的 Java Application 启动配置,然后将每个变量作为新的键值对添加到 Environment标签页中。 应用更改并单击 OK

    要学习;了解更多信息,请参阅 Eclipse IDE 文档的创建Java应用程序启动配置部分。

环境变量
OPEN_AI_API_KEY=<api-key>
ATLAS_CONNECTION_STRING=<connection-string>

用以下值更新占位符:

  • 将``<api-key>``占位符值替换为您的 OpenAI API 密钥。

  • 用 Atlas 集群的 SRV 连接字符串替换 <connection-string> 占位符值。

    连接字符串应使用以下格式:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

创建一个名为EmbeddingProvider.java的文件并粘贴以下代码。

此代码定义了两种使用 text-embedding-3 -small OpenAI 嵌入模型为给定输入生成嵌入的方法:

  • getEmbeddings多个输入:List<String> 方法接受文本输入大量(),允许您在单个API调用中创建多个嵌入。该方法将API提供的浮点数数组转换为BSON双精度数组,以便存储在Atlas 集群中。

  • 单个输入:getEmbedding 方法接受单个String ,它表示要对向量数据进行的查询。该方法将API提供的浮点数大量转换为BSON双精度大量,以便在查询集合时使用。

EmbeddingProvider.java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
private static OpenAiEmbeddingModel embeddingModel;
private static OpenAiEmbeddingModel getEmbeddingModel() {
if (embeddingModel == null) {
String apiKey = System.getenv("OPEN_AI_API_KEY");
if (apiKey == null || apiKey.isEmpty()) {
throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty.");
}
return OpenAiEmbeddingModel.builder()
.apiKey(apiKey)
.modelName("text-embedding-3-small")
.timeout(ofSeconds(60))
.build();
}
return embeddingModel;
}
/**
* Takes an array of strings and returns a BSON array of embeddings to
* store in the database.
*/
public List<BsonArray> getEmbeddings(List<String> texts) {
List<TextSegment> textSegments = texts.stream()
.map(TextSegment::from)
.toList();
Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
return response.content().stream()
.map(e -> new BsonArray(
e.vectorAsList().stream()
.map(BsonDouble::new)
.toList()))
.toList();
}
/**
* Takes a single string and returns a BSON array embedding to
* use in a vector query.
*/
public BsonArray getEmbedding(String text) {
Response<Embedding> response = getEmbeddingModel().embed(text);
return new BsonArray(
response.content().vectorAsList().stream()
.map(BsonDouble::new)
.toList());
}
}

在本部分中,您将定义一个函数以使用嵌入模型生成向量嵌入。根据您想要使用开源嵌入模型还是 OpenAI 的专有模型选择一个标签页。

本部分还包括一个可选函数,您可以使用该函数来压缩嵌入,从而实现高效存储并提高查询性能。要学习;了解更多信息,请参阅向量压缩。

注意

开源嵌入模型可以免费使用,并且可以从您的应用程序本地加载。专有模型需要 API 密钥才能访问模型。

1

在终端窗口中,运行以下命令,创建一个名为 my-embeddings-project的新目录并初始化项目:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y
2

将您的项目配置为使用 ES 模块 ,方法是将 "type": "module" 添加到 package.json 文件中,然后将其保存。

{
"type": "module",
// other fields...
}
3

在终端窗口中,运行以下命令:

npm install mongodb @xenova/transformers
4
  1. 在您的项目中,创建一个 .env 文件来存储您的 Atlas 连接字符串。

    ATLAS_CONNECTION_STRING="<connection-string>"
  2. 用 Atlas 集群的 SRV 连接字符串替换 <connection-string> 占位符值。连接字符串应使用以下格式:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

    注意

    最低 Node.js 版本要求

    Node.js v 20 .x 引入了 --env-file 选项。如果您使用的是旧版本的 Node.js,请将 dotenv 包添加到项目中,或使用其他方法来管理环境变量。

  3. 保存文件。

5
  1. 创建一个名为 get-embeddings.js 的文件。

    touch get-embeddings.js
  2. 将以下代码粘贴到文件中。

    此代码定义了一个名为 getEmbedding 的函数,用于为给定输入生成嵌入。该函数指定:

    get-embeddings.js
    import { pipeline } from '@xenova/transformers';
    // Function to generate embeddings for a given data source
    export async function getEmbedding(data) {
    const embedder = await pipeline(
    'feature-extraction',
    'Xenova/nomic-embed-text-v1');
    const results = await embedder(data, { pooling: 'mean', normalize: true });
    return Array.from(results.data);
    }
  3. 保存文件。

1

在终端窗口中,运行以下命令,创建一个名为 my-embeddings-project的新目录并初始化项目:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y
2

将您的项目配置为使用 ES 模块 ,方法是将 "type": "module" 添加到 package.json 文件中,然后将其保存。

{
"type": "module",
// other fields...
}
3

在终端窗口中,运行以下命令:

npm install mongodb openai
4
  1. 在项目中,创建一个 .env 文件来存储 Atlas 连接字符串和 OpenAI API 密钥。

    OPENAI_API_KEY = "<api-key>"
    ATLAS_CONNECTION_STRING = "<connection-string>"
  2. 用 OpenAI API 密钥和 Atlas 集群的 SRV 连接字符串替换 <api-key><connection-string> 占位符值。连接字符串应使用以下格式:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

    注意

    最低 Node.js 版本要求

    Node.js v 20 .x 引入了 --env-file 选项。如果您使用的是旧版本的 Node.js,请将 dotenv 包添加到项目中,或使用其他方法来管理环境变量。

  3. 保存文件。

5
  1. 创建一个名为 get-embeddings.js 的文件。

  2. 粘贴以下代码。此代码定义了一个名为 getEmbedding 的函数,该函数使用 OpenAI 的 text-embedding-3-small 模型为给定输入生成嵌入。

    get-embeddings.js
    import OpenAI from 'openai';
    // Setup OpenAI configuration
    const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY});
    // Function to get the embeddings using the OpenAI API
    export async function getEmbedding(text) {
    const results = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
    encoding_format: "float",
    });
    return results.data[0].embedding;
    }
  3. 保存文件。

在本部分中,您将定义一个函数以使用嵌入模型生成向量嵌入。根据您想要使用开源嵌入模型还是 OpenAI 的专有模型选择一个标签页。

本部分还包括一个可选函数,您可以使用该函数来压缩嵌入,从而实现高效存储并提高查询性能。要学习;了解更多信息,请参阅向量压缩。

注意

开源嵌入模型可以免费使用,并且可以从您的应用程序本地加载。专有模型需要 API 密钥才能访问模型。

1

通过保存扩展名为 .ipynb 的文件来创建交互式 Python 笔记本,然后在笔记本中运行以下命令以安装依赖项:

pip install --quiet --upgrade sentence-transformers pymongo einops
2

在笔记本中粘贴并运行以下代码,以创建一个函数,该函数使用 Nomic AI 的开源嵌入模型生成向量嵌入。此代码执行以下操作:

  • 加载 nomic-embed-text-v1 嵌入模型。

  • 创建一个名为 get_embedding 的函数,该函数使用该模型为给定的文本输入生成嵌入。默认精度为 float32

  • 为字符串 foo 生成嵌入。

from sentence_transformers import SentenceTransformer
# Load the embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
# Define a function to generate embeddings
def get_embedding(data, precision="float32"):
return model.encode(data, precision=precision).tolist()
# Generate an embedding
embedding = get_embedding("foo")
print(embedding)
[-0.02980827 0.03841474 -0.02561123 ... -0.0532876 -0.0335409 -0.02591543]
1

通过保存扩展名为 .ipynb 的文件来创建交互式 Python 笔记本,然后在笔记本中运行以下命令以安装依赖项:

pip install --quiet --upgrade openai pymongo
2

在笔记本中粘贴并运行以下代码,创建一个函数,该函数利用 OpenAI 的专有嵌入模型生成向量嵌入。用您的 OpenAI API 密钥替换 <api-key>。此代码执行以下操作:

  • 指定 text-embedding-3-small 内嵌模型。

  • 创建一个名为 get_embedding 的函数,该函数调用模型的 API 来为给定的文本输入生成嵌入。

  • 通过为字符串 foo 生成单个嵌入来测试函数。

import os
from openai import OpenAI
# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()
# Define a function to generate embeddings
def get_embedding(text):
"""Generates vector embeddings for the given text."""
embedding = openai_client.embeddings.create(input = [text], model=model).data[0].embedding
return embedding
# Generate an embedding
embedding = get_embedding("foo")
print(embedding)
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]

另请参阅:

有关 API 详细信息和可用模型列表,请参阅 OpenAI 文档。

在本部分中,您将使用您定义的函数从数据创建向量嵌入,然后将这些嵌入存储在Atlas的集合中。

根据要从新数据还是从 Atlas 中已有的现有数据创建嵌入,选择一个标签页。

1

在名为 DataService.cs 的同名文件中创建一个新类并粘贴以下代码。此代码会定义一个名为 AddDocumentsAsync 的异步任务,以将文档添加到 Atlas。此函数使用 Collection.InsertManyAsync() C# 驱动程序方法来插入 BsonDocument 类型的列表。每个文档包含:

  • text 字段包含电影摘要。

  • 一个 embedding 字段,包含生成向量嵌入的浮点数组。

DataService.cs
namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
private static readonly MongoClient Client = new MongoClient(ConnectionString);
private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
{
var documents = new List<BsonDocument>();
foreach( KeyValuePair<string, float[]> var in embeddings )
{
var document = new BsonDocument
{
{
"text", var.Key
},
{
"embedding", new BsonArray(var.Value)
}
};
documents.Add(document);
}
await Collection.InsertManyAsync(documents);
Console.WriteLine($"Successfully inserted {embeddings.Count} documents into Atlas");
documents.Clear();
}
}
2

使用以下代码从 Atlas 中的现有集合生成嵌入。

具体来说,这段代码使用您定义的 GetEmbeddingsAsync 函数,从示例文本数组生成嵌入,并将其摄取到 Atlas 中的 sample_db.embeddings 集合中。

Program.cs
using MyCompany.Embeddings;
var aiService = new AIService();
var texts = new string[]
{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
};
var embeddings = await aiService.GetEmbeddingsAsync(texts);
var dataService = new DataService();
await dataService.AddDocumentsAsync(embeddings);
3
dotnet run MyCompany.Embeddings.csproj
Successfully inserted 3 documents into Atlas

您还可以导航到集群中的 sample_db.embeddings 集合,在 Atlas UI 中查看向量嵌入。

注意

此示例使用示例数据中的 sample_airbnb.listingsAndReviews 集合,但您可以调整代码以适用于集群中的任何集合。

1

在名为 DataService.cs 的同名文件中创建一个新类并粘贴以下代码。此代码创建了两个函数来执行以下操作:

  • 连接到您的 Atlas 集群。

  • GetDocuments 方法从 sample_airbnb.listingsAndReviews 集合中获取具有非空 summary 字段的文档子集。

  • AddEmbeddings 异步任务在 sample_airbnb.listingsAndReviews 集合中的文档上创建一个新的 embeddings 字段,其 _idGetDocuments 方法中检索到的文档之一匹配。

DataService.cs
namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
private static readonly MongoClient Client = new MongoClient(ConnectionString);
private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
public List<BsonDocument>? GetDocuments()
{
var filter = Builders<BsonDocument>.Filter.And(
Builders<BsonDocument>.Filter.And(
Builders<BsonDocument>.Filter.Exists("summary", true),
Builders<BsonDocument>.Filter.Ne("summary", "")
),
Builders<BsonDocument>.Filter.Exists("embeddings", false)
);
return Collection.Find(filter).Limit(50).ToList();
}
public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
{
var listWrites = new List<WriteModel<BsonDocument>>();
foreach( var kvp in embeddings )
{
var filterForUpdate = Builders<BsonDocument>.Filter.Eq("summary", kvp.Key);
var updateDefinition = Builders<BsonDocument>.Update.Set("embeddings", kvp.Value);
listWrites.Add(new UpdateOneModel<BsonDocument>(filterForUpdate, updateDefinition));
}
var result = await Collection.BulkWriteAsync(listWrites);
listWrites.Clear();
return result.ModifiedCount;
}
}
2

使用以下代码从 Atlas 中的现有集合生成嵌入。

具体来说,这段代码使用您定义的 GetEmbeddingsAsync 函数,从示例文本数组生成嵌入,并将其摄取到 Atlas 中的 sample_db.embeddings 集合中。

Program.cs
using MyCompany.Embeddings;
var dataService = new DataService();
var documents = dataService.GetDocuments();
if (documents != null)
{
Console.WriteLine("Generating embeddings.");
var aiService = new AIService();
var summaries = new List<string>();
foreach (var document in documents)
{
var summary = document.GetValue("summary").ToString();
if (summary != null)
{
summaries.Add(summary);
}
}
try
{
if (summaries.Count > 0)
{
var embeddings = await aiService.GetEmbeddingsAsync(summaries.ToArray());
try
{
var updatedCount = await dataService.AddEmbeddings(embeddings);
Console.WriteLine($"{updatedCount} documents updated successfully.");
} catch (Exception e)
{
Console.WriteLine($"Error adding embeddings to MongoDB: {e.Message}");
}
}
}
catch (Exception e)
{
Console.WriteLine($"Error creating embeddings for summaries: {e.Message}");
}
}
else
{
Console.WriteLine("No documents found");
}
3
dotnet run MyCompany.Embeddings.csproj
Generating embeddings.
50 documents updated successfully.
1

使用以下代码从 Atlas 中的现有集合生成嵌入。

具体来说,这段代码使用您定义的 GetEmbeddings 函数和 MongoDB Go 驱动程序,从示例文本数组生成嵌入,并将其摄取到 Atlas 中的 sample_db.embeddings 集合中。

create-embeddings.go
package main
import (
"context"
"fmt"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/v2/mongo"
"go.mongodb.org/mongo-driver/v2/mongo/options"
)
var data = []string{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
Text string
Embedding []float32
}
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_db").Collection("embeddings")
embeddings := common.GetEmbeddings(data)
docsToInsert := make([]interface{}, len(embeddings))
for i, string := range data {
docsToInsert[i] = TextWithEmbedding{
Text: string,
Embedding: embeddings[i],
}
}
result, err := coll.InsertMany(ctx, docsToInsert)
if err != nil {
log.Fatalf("failed to insert documents: %v", err)
}
fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
create-embeddings.go
package main
import (
"context"
"fmt"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/v2/mongo"
"go.mongodb.org/mongo-driver/v2/mongo/options"
)
var data = []string{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
Text string
Embedding []float64
}
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_db").Collection("embeddings")
embeddings := common.GetEmbeddings(data)
docsToInsert := make([]interface{}, len(data))
for i, movie := range data {
docsToInsert[i] = TextWithEmbedding{
Text: movie,
Embedding: embeddings[i],
}
}
result, err := coll.InsertMany(ctx, docsToInsert)
if err != nil {
log.Fatalf("failed to insert documents: %v", err)
}
fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
2
go run create-embeddings.go
Successfully inserted 3 documents into Atlas
go run create-embeddings.go
Successfully inserted 3 documents into Atlas

您还可以导航到集群中的 sample_db.embeddings 集合,在 Atlas UI 中查看向量嵌入。

注意

此示例使用示例数据中的 sample_airbnb.listingsAndReviews 集合,但您可以调整代码以适用于集群中的任何集合。

1

使用以下代码从 Atlas 中的现有集合生成嵌入。具体而言,此代码执行以下操作:

  • 连接到您的 Atlas 集群。

  • sample_airbnb.listingsAndReviews 集合中获取具有非空 summary 字段的文档子集。

  • 使用您定义的 GetEmbeddings 函数,从每个文档的 summary 字段生成嵌入。

  • 使用 MongoDB Go 驱动程序,用包含嵌入值的新 embeddings 字段更新每个文档。

create-embeddings.go
package main
import (
"context"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/v2/bson"
"go.mongodb.org/mongo-driver/v2/mongo"
"go.mongodb.org/mongo-driver/v2/mongo/options"
)
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
filter := bson.D{
{"$and",
bson.A{
bson.D{
{"$and",
bson.A{
bson.D{{"summary", bson.D{{"$exists", true}}}},
bson.D{{"summary", bson.D{{"$ne", ""}}}},
},
}},
bson.D{{"embeddings", bson.D{{"$exists", false}}}},
}},
}
opts := options.Find().SetLimit(50)
cursor, err := coll.Find(ctx, filter, opts)
if err != nil {
log.Fatalf("failed to retrieve documents: %v", err)
}
var listings []common.Listing
if err = cursor.All(ctx, &listings); err != nil {
log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err)
}
var summaries []string
for _, listing := range listings {
summaries = append(summaries, listing.Summary)
}
log.Println("Generating embeddings.")
embeddings := common.GetEmbeddings(summaries)
docsToUpdate := make([]mongo.WriteModel, len(listings))
for i := range listings {
docsToUpdate[i] = mongo.NewUpdateOneModel().
SetFilter(bson.D{{"_id", listings[i].ID}}).
SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}})
}
bulkWriteOptions := options.BulkWrite().SetOrdered(false)
result, err := coll.BulkWrite(context.Background(), docsToUpdate, bulkWriteOptions)
if err != nil {
log.Fatalf("failed to write embeddings to existing documents: %v", err)
}
log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount)
}
2

为了简化与BSON之间的编组和解组Go对象,请创建一个包含此集合中文档模型的文件。

  1. 进入 common目录。

    cd common
  2. 创建一个名为 models.go 的文件,并将以下代码粘贴到其中:

    models.go
    package common
    import (
    "time"
    "go.mongodb.org/mongo-driver/v2/bson"
    )
    type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL string `bson:"medium_url"`
    PictureURL string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
    }
    type Host struct {
    ID string `bson:"host_id"`
    URL string `bson:"host_url"`
    Name string `bson:"host_name"`
    Location string `bson:"host_location"`
    About string `bson:"host_about"`
    ThumbnailURL string `bson:"host_thumbnail_url"`
    PictureURL string `bson:"host_picture_url"`
    Neighborhood string `bson:"host_neighborhood"`
    IsSuperhost bool `bson:"host_is_superhost"`
    HasProfilePic bool `bson:"host_has_profile_pic"`
    IdentityVerified bool `bson:"host_identity_verified"`
    ListingsCount int32 `bson:"host_listings_count"`
    TotalListingsCount int32 `bson:"host_total_listings_count"`
    Verifications []string `bson:"host_verifications"`
    }
    type Location struct {
    Type string `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
    IsLocationExact bool `bson:"is_location_exact"`
    }
    type Address struct {
    Street string `bson:"street"`
    Suburb string `bson:"suburb"`
    GovernmentArea string `bson:"government_area"`
    Market string `bson:"market"`
    Country string `bson:"Country"`
    CountryCode string `bson:"country_code"`
    Location Location `bson:"location"`
    }
    type Availability struct {
    Thirty int32 `bson:"availability_30"`
    Sixty int32 `bson:"availability_60"`
    Ninety int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
    }
    type ReviewScores struct {
    Accuracy int32 `bson:"review_scores_accuracy"`
    Cleanliness int32 `bson:"review_scores_cleanliness"`
    CheckIn int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location int32 `bson:"review_scores_location"`
    Value int32 `bson:"review_scores_value"`
    Rating int32 `bson:"review_scores_rating"`
    }
    type Review struct {
    ID string `bson:"_id"`
    Date time.Time `bson:"date,omitempty"`
    ListingId string `bson:"listing_id"`
    ReviewerId string `bson:"reviewer_id"`
    ReviewerName string `bson:"reviewer_name"`
    Comments string `bson:"comments"`
    }
    type Listing struct {
    ID string `bson:"_id"`
    ListingURL string `bson:"listing_url"`
    Name string `bson:"name"`
    Summary string `bson:"summary"`
    Space string `bson:"space"`
    Description string `bson:"description"`
    NeighborhoodOverview string `bson:"neighborhood_overview"`
    Notes string `bson:"notes"`
    Transit string `bson:"transit"`
    Access string `bson:"access"`
    Interaction string `bson:"interaction"`
    HouseRules string `bson:"house_rules"`
    PropertyType string `bson:"property_type"`
    RoomType string `bson:"room_type"`
    BedType string `bson:"bed_type"`
    MinimumNights string `bson:"minimum_nights"`
    MaximumNights string `bson:"maximum_nights"`
    CancellationPolicy string `bson:"cancellation_policy"`
    LastScraped time.Time `bson:"last_scraped,omitempty"`
    CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
    FirstReview time.Time `bson:"first_review,omitempty"`
    LastReview time.Time `bson:"last_review,omitempty"`
    Accommodates int32 `bson:"accommodates"`
    Bedrooms int32 `bson:"bedrooms"`
    Beds int32 `bson:"beds"`
    NumberOfReviews int32 `bson:"number_of_reviews"`
    Bathrooms bson.Decimal128 `bson:"bathrooms"`
    Amenities []string `bson:"amenities"`
    Price bson.Decimal128 `bson:"price"`
    WeeklyPrice bson.Decimal128 `bson:"weekly_price"`
    MonthlyPrice bson.Decimal128 `bson:"monthly_price"`
    CleaningFee bson.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople bson.Decimal128 `bson:"extra_people"`
    GuestsIncluded bson.Decimal128 `bson:"guests_included"`
    Image Image `bson:"images"`
    Host Host `bson:"host"`
    Address Address `bson:"address"`
    Availability Availability `bson:"availability"`
    ReviewScores ReviewScores `bson:"review_scores"`
    Reviews []Review `bson:"reviews"`
    Embeddings []float32 `bson:"embeddings,omitempty"`
    }
    models.go
    package common
    import (
    "time"
    "go.mongodb.org/mongo-driver/v2/bson"
    )
    type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL string `bson:"medium_url"`
    PictureURL string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
    }
    type Host struct {
    ID string `bson:"host_id"`
    URL string `bson:"host_url"`
    Name string `bson:"host_name"`
    Location string `bson:"host_location"`
    About string `bson:"host_about"`
    ThumbnailURL string `bson:"host_thumbnail_url"`
    PictureURL string `bson:"host_picture_url"`
    Neighborhood string `bson:"host_neighborhood"`
    IsSuperhost bool `bson:"host_is_superhost"`
    HasProfilePic bool `bson:"host_has_profile_pic"`
    IdentityVerified bool `bson:"host_identity_verified"`
    ListingsCount int32 `bson:"host_listings_count"`
    TotalListingsCount int32 `bson:"host_total_listings_count"`
    Verifications []string `bson:"host_verifications"`
    }
    type Location struct {
    Type string `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
    IsLocationExact bool `bson:"is_location_exact"`
    }
    type Address struct {
    Street string `bson:"street"`
    Suburb string `bson:"suburb"`
    GovernmentArea string `bson:"government_area"`
    Market string `bson:"market"`
    Country string `bson:"Country"`
    CountryCode string `bson:"country_code"`
    Location Location `bson:"location"`
    }
    type Availability struct {
    Thirty int32 `bson:"availability_30"`
    Sixty int32 `bson:"availability_60"`
    Ninety int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
    }
    type ReviewScores struct {
    Accuracy int32 `bson:"review_scores_accuracy"`
    Cleanliness int32 `bson:"review_scores_cleanliness"`
    CheckIn int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location int32 `bson:"review_scores_location"`
    Value int32 `bson:"review_scores_value"`
    Rating int32 `bson:"review_scores_rating"`
    }
    type Review struct {
    ID string `bson:"_id"`
    Date time.Time `bson:"date,omitempty"`
    ListingId string `bson:"listing_id"`
    ReviewerId string `bson:"reviewer_id"`
    ReviewerName string `bson:"reviewer_name"`
    Comments string `bson:"comments"`
    }
    type Listing struct {
    ID string `bson:"_id"`
    ListingURL string `bson:"listing_url"`
    Name string `bson:"name"`
    Summary string `bson:"summary"`
    Space string `bson:"space"`
    Description string `bson:"description"`
    NeighborhoodOverview string `bson:"neighborhood_overview"`
    Notes string `bson:"notes"`
    Transit string `bson:"transit"`
    Access string `bson:"access"`
    Interaction string `bson:"interaction"`
    HouseRules string `bson:"house_rules"`
    PropertyType string `bson:"property_type"`
    RoomType string `bson:"room_type"`
    BedType string `bson:"bed_type"`
    MinimumNights string `bson:"minimum_nights"`
    MaximumNights string `bson:"maximum_nights"`
    CancellationPolicy string `bson:"cancellation_policy"`
    LastScraped time.Time `bson:"last_scraped,omitempty"`
    CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
    FirstReview time.Time `bson:"first_review,omitempty"`
    LastReview time.Time `bson:"last_review,omitempty"`
    Accommodates int32 `bson:"accommodates"`
    Bedrooms int32 `bson:"bedrooms"`
    Beds int32 `bson:"beds"`
    NumberOfReviews int32 `bson:"number_of_reviews"`
    Bathrooms bson.Decimal128 `bson:"bathrooms"`
    Amenities []string `bson:"amenities"`
    Price bson.Decimal128 `bson:"price"`
    WeeklyPrice bson.Decimal128 `bson:"weekly_price"`
    MonthlyPrice bson.Decimal128 `bson:"monthly_price"`
    CleaningFee bson.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople bson.Decimal128 `bson:"extra_people"`
    GuestsIncluded bson.Decimal128 `bson:"guests_included"`
    Image Image `bson:"images"`
    Host Host `bson:"host"`
    Address Address `bson:"address"`
    Availability Availability `bson:"availability"`
    ReviewScores ReviewScores `bson:"review_scores"`
    Reviews []Review `bson:"reviews"`
    Embeddings []float64 `bson:"embeddings,omitempty"`
    }
  3. 返回到项目根目录。

    cd ../
3
go run create-embeddings.go
2024/10/10 09:58:03 Generating embeddings.
2024/10/10 09:58:12 Successfully added embeddings to 50 documents

您可以通过导航到集群中的 sample_airbnb.listingsAndReviews 集合并展开文档中的字段,在 Atlas UI 中查看向量嵌入。

1

创建一个名为CreateEmbeddings.java的文件并粘贴以下代码。

此代码使用getEmbeddings 方法和MongoDB Java同步驱动驱动程序来执行以下操作:

  1. 连接到您的 Atlas 集群。

  2. 获取示例文本大量。

  3. 使用您之前定义的 getEmbeddings 方法从每个文本生成嵌入。

  4. 将嵌入引入Atlas中的 sample_db.embeddings集合。

CreateEmbeddings.java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.result.InsertManyResult;
import org.bson.BsonArray;
import org.bson.Document;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class CreateEmbeddings {
static List<String> data = Arrays.asList(
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
);
public static void main(String[] args){
String uri = System.getenv("ATLAS_CONNECTION_STRING");
if (uri == null || uri.isEmpty()) {
throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
}
// establish connection and set namespace
try (MongoClient mongoClient = MongoClients.create(uri)) {
MongoDatabase database = mongoClient.getDatabase("sample_db");
MongoCollection<Document> collection = database.getCollection("embeddings");
System.out.println("Creating embeddings for " + data.size() + " documents");
EmbeddingProvider embeddingProvider = new EmbeddingProvider();
// generate embeddings for new inputted data
List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data);
List<Document> documents = new ArrayList<>();
int i = 0;
for (String text : data) {
Document doc = new Document("text", text).append("embedding", embeddings.get(i));
documents.add(doc);
i++;
}
// insert the embeddings into the Atlas collection
List<String> insertedIds = new ArrayList<>();
try {
InsertManyResult result = collection.insertMany(documents);
result.getInsertedIds().values()
.forEach(doc -> insertedIds.add(doc.toString()));
System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + insertedIds);
} catch (MongoException me) {
throw new RuntimeException("Failed to insert documents", me);
}
} catch (MongoException me) {
throw new RuntimeException("Failed to connect to MongoDB ", me);
} catch (Exception e) {
throw new RuntimeException("Operation failed: ", e);
}
}
}
2

保存并运行文件。 输出类似于:

Creating embeddings for 3 documents
Inserted 3 documents with the following ids to sample_db.embeddings collection:
[BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]

您还可以导航到集群中的 sample_db.embeddings 集合,在 Atlas UI 中查看向量嵌入。

注意

此示例使用示例数据中的 sample_airbnb.listingsAndReviews 集合,但您可以调整代码以适用于集群中的任何集合。

1

创建一个名为CreateEmbeddings.java的文件并粘贴以下代码。

此代码使用getEmbeddings 方法和MongoDB Java同步驱动驱动程序来执行以下操作:

  1. 连接到您的 Atlas 集群。

  2. sample_airbnb.listingsAndReviews集合中获取具有非空 summary字段的文档子集。

  3. 使用您之前定义的 getEmbeddings 方法,从每个文档的 summary字段生成嵌入。

  4. 使用包含嵌入值的新 embeddings字段更新每个文档。

CreateEmbeddings.java
import com.mongodb.MongoException;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonArray;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.List;
public class CreateEmbeddings {
public static void main(String[] args){
String uri = System.getenv("ATLAS_CONNECTION_STRING");
if (uri == null || uri.isEmpty()) {
throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
}
// establish connection and set namespace
try (MongoClient mongoClient = MongoClients.create(uri)) {
MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
Bson filterCriteria = Filters.and(
Filters.and(Filters.exists("summary"),
Filters.ne("summary", null),
Filters.ne("summary", "")),
Filters.exists("embeddings", false));
try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) {
List<String> summaries = new ArrayList<>();
List<String> documentIds = new ArrayList<>();
int i = 0;
while (cursor.hasNext()) {
Document document = cursor.next();
String summary = document.getString("summary");
String id = document.get("_id").toString();
summaries.add(summary);
documentIds.add(id);
i++;
}
System.out.println("Generating embeddings for " + summaries.size() + " documents.");
System.out.println("This operation may take up to several minutes.");
EmbeddingProvider embeddingProvider = new EmbeddingProvider();
List<BsonArray> embeddings = embeddingProvider.getEmbeddings(summaries);
List<WriteModel<Document>> updateDocuments = new ArrayList<>();
for (int j = 0; j < summaries.size(); j++) {
UpdateOneModel<Document> updateDoc = new UpdateOneModel<>(
Filters.eq("_id", documentIds.get(j)),
Updates.set("embeddings", embeddings.get(j)));
updateDocuments.add(updateDoc);
}
int updatedDocsCount = 0;
try {
BulkWriteOptions options = new BulkWriteOptions().ordered(false);
BulkWriteResult result = collection.bulkWrite(updateDocuments, options);
updatedDocsCount = result.getModifiedCount();
} catch (MongoException me) {
throw new RuntimeException("Failed to insert documents", me);
}
System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents.");
}
} catch (MongoException me) {
throw new RuntimeException("Failed to connect to MongoDB", me);
} catch (Exception e) {
throw new RuntimeException("Operation failed: ", e);
}
}
}
2

保存并运行文件。 输出类似于:

Generating embeddings for 50 documents.
This operation may take up to several minutes.
Added embeddings successfully to 50 documents.

您还可以导航到集群中的 sample_airbnb.listingsAndReviews 集合,在 Atlas UI 中查看向量嵌入。

1

使用以下代码从Atlas中的现有集合生成嵌入。此代码使用您定义的 getEmbedding 函数和 Node.js驱动程序 从示例文本大量生成嵌入,并将它们引入Atlas中的 sample_db.embeddings集合。

如果您定义了 convertEmbeddingsToBSON 函数,请取消注释 332-33 行,将嵌入转换为BSON binData 向量。

create-embeddings.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// import { convertEmbeddingsToBSON } from './convert-embeddings.js';
// Data to embed
const texts = [
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
async function run() {
// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
try {
await client.connect();
const db = client.db("sample_db");
const collection = db.collection("embeddings");
console.log("Generating embeddings and inserting documents...");
const insertDocuments = [];
await Promise.all(texts.map(async text => {
// Check if the document already exists
const existingDoc = await collection.findOne({ text: text });
// Generate an embedding using the function that you defined
var embedding = await getEmbedding(text);
// Uncomment the following lines to convert the generated embedding into BSON format
// const bsonEmbedding = await convertEmbeddingsToBSON([embedding]); // Since convertEmbeddingsToBSON is designed to handle arrays
// embedding = bsonEmbedding; // Use BSON embedding instead of the original float32 embedding
// Add the document with the embedding to array of documents for bulk insert
if (!existingDoc) {
insertDocuments.push({
text: text,
embedding: embedding
})
}
}));
// Continue processing documents if an error occurs during an operation
const options = { ordered: false };
// Insert documents with embeddings into Atlas
const result = await collection.insertMany(insertDocuments, options);
console.log("Count of documents inserted: " + result.insertedCount);
} catch (err) {
console.log(err.stack);
}
finally {
await client.close();
}
}
run().catch(console.dir);
2
node --env-file=.env create-embeddings.js
Generating embeddings and inserting documents...
Count of documents inserted: 3

您可以导航到集群中的 sample_db.embeddings集合,在Atlas用户界面中查看向量嵌入。

注意

此示例使用示例数据中的 sample_airbnb.listingsAndReviews 集合,但您可以调整代码以适用于集群中的任何集合。

1

使用以下代码从 Atlas 中的现有集合生成嵌入。具体而言,此代码执行以下操作:

  • 连接到您的 Atlas 集群。

  • sample_airbnb.listingsAndReviews 集合中获取具有非空 summary 字段的文档子集。

  • 使用您定义的 getEmbedding 函数,从每个文档的 summary 字段生成嵌入。

  • 使用 MongoDB Node.js 驱动程序,用包含嵌入值的新 embedding 字段更新每个文档。

如果您定义了 convertEmbeddingsToBSON 函数,请取消注释 329-30 行,将嵌入转换为BSON binData 向量。

create-embeddings.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// import { convertEmbeddingsToBSON } from './convert-embeddings.js';
async function run() {
// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
try {
await client.connect();
const db = client.db("sample_airbnb");
const collection = db.collection("listingsAndReviews");
// Filter to exclude null or empty summary fields
const filter = { "summary": { "$nin": [ null, "" ] } };
// Get a subset of documents from the collection
const documents = await collection.find(filter).limit(50).toArray();
console.log("Generating embeddings and updating documents...");
const updateDocuments = [];
await Promise.all(documents.map(async doc => {
// Generate an embedding using the function that you defined
var embedding = await getEmbedding(doc.summary);
// Uncomment the following lines to convert the generated embedding into BSON format
// const bsonEmbedding = await convertEmbeddingsToBSON([embedding]); // Since convertEmbeddingsToBSON is designed to handle arrays
// embedding = bsonEmbedding; // Use BSON embedding instead of the original float32 embedding
// Add the embedding to an array of update operations
updateDocuments.push(
{
updateOne: {
filter: { "_id": doc._id },
update: { $set: { "embedding": embedding } }
}
}
)
}));
// Continue processing documents if an error occurs during an operation
const options = { ordered: false };
// Update documents with the new embedding field
const result = await collection.bulkWrite(updateDocuments, options);
console.log("Count of documents updated: " + result.modifiedCount);
} catch (err) {
console.log(err.stack);
}
finally {
await client.close();
}
}
run().catch(console.dir);
2
node --env-file=.env create-embeddings.js
Generating embeddings and updating documents...
Count of documents updated: 50

您可以通过导航到集群中的 sample_airbnb.listingsAndReviews 集合并展开文档中的字段,在 Atlas UI 中查看向量嵌入。

1

以下代码定义了一个示例文本大量。

# Sample data
texts = [
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
2

使用 get_embedding 函数根据数据生成嵌入。使用以下代码从示例文本生成嵌入。

如果您定义了 generate_bson_vector 函数,请取消注释调用此函数以将嵌入压缩为 binData 向量的行。您的嵌入将以二进制格式显示。

# Generate embeddings from the sample data
embeddings = []
for text in texts:
embedding = get_embedding(text)
# Uncomment the following line to convert to BSON
# embedding = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)
embeddings.append(embedding)
# Print the embeddings
print(f"\nText: {text}")
print(f"Embedding: {embedding[:3]}... (truncated)")
Generated embeddings:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Embedding: [-0.01089042 0.05926645 -0.00291325]... (truncated)
Text: The Lion King: Lion cub and future king Simba searches for his identity
Embedding: [-0.05607051 -0.01360618 0.00523855]... (truncated)
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Embedding: [-0.0275258 0.01144342 -0.02360895]... (truncated)
3

执行以下步骤以创建具有嵌入的文档并将其引入Atlas 集群:

  1. 定义用于创建文档的函数。

    def create_docs_with_embeddings(embeddings, data):
    docs = []
    for i, (embedding, text) in enumerate(zip(embeddings, data)):
    doc = {
    "_id": i,
    "text": text,
    "embedding": embedding,
    }
    docs.append(doc)
    return docs
  2. 使用嵌入创建文档。

    # Create documents with embeddings and sample data
    docs = create_docs_with_embeddings(embeddings, texts)
  3. 将文档导入到 Atlas 中。

    在笔记本中粘贴并运行以下代码,将 <connection-string> 替换为Atlas集群的 SRV 连接字符串。

    注意

    连接字符串应使用以下格式:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

    此代码执行以下操作:

    • 连接到您的 Atlas 集群。

    • 将文档插入指定的数据库和集合。

    import pymongo
    # Connect to your Atlas cluster
    mongo_client = pymongo.MongoClient("<connection-string>")
    db = mongo_client["sample_db"]
    collection = db["embeddings"]
    # Ingest data into Atlas
    collection.insert_many(docs)
    InsertManyResult([0, 1, 2], acknowledged=True)

    您可以在Atlas用户界面中查看集群中 sample_db.embeddings命名空间的向量嵌入,以验证向量嵌入。

注意

此示例使用示例数据中的 sample_airbnb.listingsAndReviews 集合,但您可以调整代码以适用于集群中的任何集合。

1

从Atlas 集群加载数据。以下代码从 sample_airbnb.listingAndReviews集合中获取 50 文档的子集。

<connection-string> 替换为 Atlas 集群的 SRV 连接字符串。

注意

连接字符串应使用以下格式:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]
# Define a filter to exclude documents with null or empty 'summary' fields
filter = { 'summary': { '$exists': True, "$nin": [ None, "" ] } }
# Get a subset of documents in the collection
documents = collection.find(filter, {'_id': 1, 'summary': 1}).limit(50)
2

从上一步中加载的文档生成嵌入。此代码执行以下操作:

  • 使用您定义的 get_embedding 函数,从每个文档的 summary 字段生成嵌入。

  • 使用包含嵌入值的新 embedding字段更新每个文档。

如果您定义了 generate_bson_vector 函数将向量嵌入转换为BSON binData 向量,请在运行代码之前取消注释调用此函数的行。

注意

此操作可能需要几分钟才能完成。

from pymongo import UpdateOne
# Generate the list of bulk write operations
operations = []
for doc in documents:
summary = doc["summary"]
# Generate embeddings for this document
embedding = get_embedding(summary)
# Uncomment the following line to convert to BSON vectors
# embedding = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)
# Add the update operation to the list
operations.append(UpdateOne(
{"_id": doc["_id"]},
{"$set": {
"embedding": embedding
}}
))
# Execute the bulk write operation
if operations:
result = collection.bulk_write(operations)
updated_doc_count = result.modified_count
print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.

在本节中,您将对集合中的向量嵌入进行索引,并创建一个嵌入,用于运行示例向量搜索查询。

运行查询时,Atlas Vector Search 会返回嵌入距离与向量搜索查询中的嵌入距离最接近的文档。这表明它们的含义相似。

1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_db.embeddings 集合上创建索引,将 embedding 字段指定为向量类型,将相似度测量指定为 dotProduct

  1. 粘贴以下代码以将 CreateVectorIndex 函数添加到 DataService.cs 中的 DataService 类。

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    try
    {
    var searchIndexView = Collection.SearchIndexes;
    var name = "vector_index";
    var type = SearchIndexType.VectorSearch;
    var definition = new BsonDocument
    {
    { "fields", new BsonArray
    {
    new BsonDocument
    {
    { "type", "vector" },
    { "path", "embedding" },
    { "numDimensions", <dimensions> },
    { "similarity", "dotProduct" }
    }
    }
    }
    };
    var model = new CreateSearchIndexModel(name, type, definition);
    searchIndexView.CreateOne(model);
    Console.WriteLine($"New search index named {name} is building.");
    // Polling for index status
    Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
    bool queryable = false;
    while (!queryable)
    {
    var indexes = searchIndexView.List();
    foreach (var index in indexes.ToEnumerable())
    {
    if (index["name"] == name)
    {
    queryable = index["queryable"].AsBoolean;
    }
    }
    if (!queryable)
    {
    Thread.Sleep(5000);
    }
    }
    Console.WriteLine($"{name} is ready for querying.");
    }
    catch (Exception e)
    {
    Console.WriteLine($"Exception: {e.Message}");
    }
    }
    }
  2. 如果使用开源模型,则将 <dimensions> 占位符值替换为 1024;如果使用 OpenAI 模型,则将占位符值替换为 1536

  3. 更新您 Program.cs 中的代码。

    删除填充初始文档的代码,并将其替换为以下代码以创建索引:

    Program.cs
    using MyCompany.Embeddings;
    var dataService = new DataService();
    dataService.CreateVectorIndex();
  4. 保存文件,然后编译并运行您的项目以创建索引:

    dotnet run MyCompany.Embeddings
    New search index named vector_index is building.
    Polling to check if the index is ready. This may take up to a minute.
    vector_index is ready for querying.

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 粘贴以下代码以将 PerformVectorQuery 函数添加到 DataService.cs 中的 DataService 类。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    // Method details...
    }
    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
    var vectorSearchStage = new BsonDocument
    {
    {
    "$vectorSearch",
    new BsonDocument
    {
    { "index", "vector_index" },
    { "path", "embedding" },
    { "queryVector", new BsonArray(vector) },
    { "exact", true },
    { "limit", 5 }
    }
    }
    };
    var projectStage = new BsonDocument
    {
    {
    "$project",
    new BsonDocument
    {
    { "_id", 0 },
    { "text", 1 },
    { "score",
    new BsonDocument
    {
    { "$meta", "vectorSearchScore"}
    }
    }
    }
    }
    };
    var pipeline = new[] { vectorSearchStage, projectStage };
    return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
    }
  2. 更新您 Program.cs 中的代码。

    删除创建向量索引的代码,并添加代码以执行查询:

    Program.cs
    using MongoDB.Bson;
    using MyCompany.Embeddings;
    var aiService = new AIService();
    var queryString = "ocean tragedy";
    var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);
    if (!queryEmbedding.Any())
    {
    Console.WriteLine("No embeddings found.");
    }
    else
    {
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
    Console.WriteLine("No documents matched the query.");
    }
    else
    {
    foreach (var document in matchingDocuments)
    {
    Console.WriteLine(document.ToJson());
    }
    }
    }
  3. 保存文件,然后编译并运行您的项目以执行查询:

    dotnet run MyCompany.Embeddings.csproj
    { "text" : "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "score" : 100.17414855957031 }
    { "text" : "Avatar: A marine is dispatched to the moon Pandora on a unique mission", "score" : 65.705635070800781 }
    { "text" : "The Lion King: Lion cub and future king Simba searches for his identity", "score" : 52.486415863037109 }
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_airbnb.listingsAndReviews 集合上创建索引,将 embeddings 字段指定为向量类型,将相似度测量指定为 euclidean

  1. 粘贴以下代码以将 CreateVectorIndex 函数添加到 DataService.cs 中的 DataService 类。

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
    // Method details...
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    try
    {
    var searchIndexView = Collection.SearchIndexes;
    var name = "vector_index";
    var type = SearchIndexType.VectorSearch;
    var definition = new BsonDocument
    {
    { "fields", new BsonArray
    {
    new BsonDocument
    {
    { "type", "vector" },
    { "path", "embeddings" },
    { "numDimensions", <dimensions> },
    { "similarity", "dotProduct" }
    }
    }
    }
    };
    var model = new CreateSearchIndexModel(name, type, definition);
    searchIndexView.CreateOne(model);
    Console.WriteLine($"New search index named {name} is building.");
    // Polling for index status
    Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
    bool queryable = false;
    while (!queryable)
    {
    var indexes = searchIndexView.List();
    foreach (var index in indexes.ToEnumerable())
    {
    if (index["name"] == name)
    {
    queryable = index["queryable"].AsBoolean;
    }
    }
    if (!queryable)
    {
    Thread.Sleep(5000);
    }
    }
    Console.WriteLine($"{name} is ready for querying.");
    }
    catch (Exception e)
    {
    Console.WriteLine($"Exception: {e.Message}");
    }
    }
    }
  2. 如果使用开源模型,则将 <dimensions> 占位符值替换为 1024;如果使用 OpenAI 模型,则将占位符值替换为 1536

  3. 更新您 Program.cs 中的代码。

    删除将嵌入添加到现有文档的代码,并将其替换为以下代码以创建索引:

    Program.cs
    using MyCompany.Embeddings;
    var dataService = new DataService();
    dataService.CreateVectorIndex();
  4. 保存文件,然后编译并运行您的项目以创建索引:

    dotnet run MyCompany.Embeddings
    New search index named vector_index is building.
    Polling to check if the index is ready. This may take up to a minute.
    vector_index is ready for querying.

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 粘贴以下代码以将 PerformVectorQuery 函数添加到 DataService.cs 中的 DataService 类。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
    // Method details...
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    // Method details...
    }
    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
    var vectorSearchStage = new BsonDocument
    {
    {
    "$vectorSearch",
    new BsonDocument
    {
    { "index", "vector_index" },
    { "path", "embeddings" },
    { "queryVector", new BsonArray(vector) },
    { "exact", true },
    { "limit", 5 }
    }
    }
    };
    var projectStage = new BsonDocument
    {
    {
    "$project",
    new BsonDocument
    {
    { "_id", 0 },
    { "summary", 1 },
    { "score",
    new BsonDocument
    {
    { "$meta", "vectorSearchScore"}
    }
    }
    }
    }
    };
    var pipeline = new[] { vectorSearchStage, projectStage };
    return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
    }
  2. 更新您 Program.cs 中的代码。

    删除创建向量索引的代码,并添加代码以执行查询:

    Program.cs
    using MongoDB.Bson;
    using MyCompany.Embeddings;
    var aiService = new AIService();
    var queryString = "beach house";
    var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);
    if (!queryEmbedding.Any())
    {
    Console.WriteLine("No embeddings found.");
    }
    else
    {
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
    Console.WriteLine("No documents matched the query.");
    }
    else
    {
    foreach (var document in matchingDocuments)
    {
    Console.WriteLine(document.ToJson());
    }
    }
    }
  3. 保存文件,然后编译并运行您的项目以执行查询:

    dotnet run MyCompany.Embeddings.csproj
    { "summary" : "Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.", "score" : 88.884147644042969 }
    { "summary" : "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score" : 86.136398315429688 }
    { "summary" : "Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.", "score" : 86.087783813476562 }
    { "summary" : "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score" : 85.689559936523438 }
    { "summary" : "Fully furnished 3+1 flat decorated with vintage style. Located at the heart of Moda/Kadıköy, close to seaside and also to the public transportation (tram, metro, ferry, bus stations) 10 minutes walk.", "score" : 85.614166259765625 }
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_db.embeddings 集合上创建索引,将 embedding 字段指定为向量类型,将相似度测量指定为 dotProduct

  1. 创建一个名为 create-index.go 的文件并粘贴以下代码。

    create-index.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "os"
    "time"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/v2/bson"
    "go.mongodb.org/mongo-driver/v2/mongo"
    "go.mongodb.org/mongo-driver/v2/mongo/options"
    )
    func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")
    type vectorDefinitionField struct {
    Type string `bson:"type"`
    Path string `bson:"path"`
    NumDimensions int `bson:"numDimensions"`
    Similarity string `bson:"similarity"`
    }
    type vectorDefinition struct {
    Fields []vectorDefinitionField `bson:"fields"`
    }
    indexModel := mongo.SearchIndexModel{
    Definition: vectorDefinition{
    Fields: []vectorDefinitionField{{
    Type: "vector",
    Path: "embedding",
    NumDimensions: <dimensions>,
    Similarity: "dotProduct"}},
    },
    Options: opts,
    }
    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
    log.Fatalf("failed to create the search index: %v", err)
    }
    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
    cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
    if err != nil {
    fmt.Errorf("failed to list search indexes: %w", err)
    }
    if !cursor.Next(ctx) {
    break
    }
    name := cursor.Current.Lookup("name").StringValue()
    queryable := cursor.Current.Lookup("queryable").Boolean()
    if name == searchIndexName && queryable {
    doc = cursor.Current
    } else {
    time.Sleep(5 * time.Second)
    }
    }
    log.Println("Name of Index Created: " + searchIndexName)
    }
  2. 如果使用开源模型,则将 <dimensions> 占位符值替换为 1024;如果使用 OpenAI 模型,则将占位符值替换为 1536

  3. 保存文件,然后运行以下命令:

    go run create-index.go
    2024/10/09 17:38:51 Creating the index.
    2024/10/09 17:38:52 Polling to confirm successful index creation.
    2024/10/09 17:39:22 Name of Index Created: vector_index

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 创建一个名为 vector-query.go 的文件并粘贴以下代码。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    vector-query.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/v2/bson"
    "go.mongodb.org/mongo-driver/v2/mongo"
    "go.mongodb.org/mongo-driver/v2/mongo/options"
    )
    type TextAndScore struct {
    Text string `bson:"text"`
    Score float32 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Connect to your Atlas cluster
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    query := "ocean tragedy"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embedding"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"text", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []TextAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
    }
    }
    vector-query.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/v2/bson"
    "go.mongodb.org/mongo-driver/v2/mongo"
    "go.mongodb.org/mongo-driver/v2/mongo/options"
    )
    type TextAndScore struct {
    Text string `bson:"text"`
    Score float64 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Connect to your Atlas cluster
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    query := "ocean tragedy"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embedding"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"text", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []TextAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
    }
    }
  2. 保存文件,然后运行以下命令:

    go run vector-query.go
    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.0042472864
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.0031167597
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.0024476869
    go run vector-query.go
    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.4552372694015503
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.4050072133541107
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.35942140221595764
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_airbnb.listingsAndReviews 集合上创建索引,将 embeddings 字段指定为向量类型,将相似度测量指定为 euclidean

  1. 创建一个名为 create-index.go 的文件并粘贴以下代码。

    create-index.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "os"
    "time"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")
    type vectorDefinitionField struct {
    Type string `bson:"type"`
    Path string `bson:"path"`
    NumDimensions int `bson:"numDimensions"`
    Similarity string `bson:"similarity"`
    }
    type vectorDefinition struct {
    Fields []vectorDefinitionField `bson:"fields"`
    }
    indexModel := mongo.SearchIndexModel{
    Definition: vectorDefinition{
    Fields: []vectorDefinitionField{{
    Type: "vector",
    Path: "embeddings",
    NumDimensions: <dimensions>,
    Similarity: "dotProduct"}},
    },
    Options: opts,
    }
    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
    log.Fatalf("failed to create the search index: %v", err)
    }
    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
    cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
    if err != nil {
    fmt.Errorf("failed to list search indexes: %w", err)
    }
    if !cursor.Next(ctx) {
    break
    }
    name := cursor.Current.Lookup("name").StringValue()
    queryable := cursor.Current.Lookup("queryable").Boolean()
    if name == searchIndexName && queryable {
    doc = cursor.Current
    } else {
    time.Sleep(5 * time.Second)
    }
    }
    log.Println("Name of Index Created: " + searchIndexName)
    }
  2. 如果使用开源模型,则将 <dimensions> 占位符值替换为 1024;如果使用 OpenAI 模型,则将占位符值替换为 1536

  3. 保存文件,然后运行以下命令:

    go run create-index.go
    2024/10/10 10:03:12 Creating the index.
    2024/10/10 10:03:13 Polling to confirm successful index creation.
    2024/10/10 10:03:44 Name of Index Created: vector_index

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 创建一个名为 vector-query.go 的文件并粘贴以下代码。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    vector-query.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    type SummaryAndScore struct {
    Summary string `bson:"summary"`
    Score float32 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Connect to your Atlas cluster
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    query := "beach house"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embeddings"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"summary", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []SummaryAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
    }
    }
    vector-query.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    type SummaryAndScore struct {
    Summary string `bson:"summary"`
    Score float64 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Connect to your Atlas cluster
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    query := "beach house"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embeddings"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"summary", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []SummaryAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
    }
    }
  2. 保存文件,然后运行以下命令:

    go run vector-query.go
    Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
    Score: 0.0045180833
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.004480799
    Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
    Score: 0.0042421296
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.004227752
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.0042201905
    go run vector-query.go
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.4832950830459595
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.48093676567077637
    Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
    Score: 0.4629695415496826
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.45800843834877014
    Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.
    Score: 0.45398443937301636
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_db.embeddings 集合上创建索引,将 embedding 字段指定为向量类型,将相似度测量指定为 dotProduct

  1. 创建一个名为 CreateIndex.java 的文件并粘贴以下代码:

    CreateIndex.java
    import com.mongodb.MongoException;
    import com.mongodb.client.ListSearchIndexesIterable;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.Collections;
    import java.util.List;
    public class CreateIndex {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_db");
    MongoCollection<Document> collection = database.getCollection("embeddings");
    // define the index details
    String indexName = "vector_index";
    int dimensionsHuggingFaceModel = 1024;
    int dimensionsOpenAiModel = 1536;
    Bson definition = new Document(
    "fields",
    Collections.singletonList(
    new Document("type", "vector")
    .append("path", "embedding")
    .append("numDimensions", <dimensions>) // replace with var for the model used
    .append("similarity", "dotProduct")));
    // define the index model using the specified details
    SearchIndexModel indexModel = new SearchIndexModel(
    indexName,
    definition,
    SearchIndexType.vectorSearch());
    // Create the index using the model
    try {
    List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
    System.out.println("Successfully created a vector index named: " + result);
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    // Wait for Atlas to build the index and make it queryable
    System.out.println("Polling to confirm the index has completed building.");
    System.out.println("It may take up to a minute for the index to build before you can query using it.");
    ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
    Document doc = null;
    while (doc == null) {
    try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
    if (!cursor.hasNext()) {
    break;
    }
    Document current = cursor.next();
    String name = current.getString("name");
    boolean queryable = current.getBoolean("queryable");
    if (name.equals(indexName) && queryable) {
    doc = current;
    } else {
    Thread.sleep(500);
    }
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    }
    System.out.println(indexName + " index is ready to query");
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
  2. <dimensions> 占位符值替换为所用模型的相应变量:

    • dimensionsHuggingFaceModel1024 维度("mixedbread-ai/mxbai-embed-large-v1" 模型)

    • dimensionsOpenAiModel1536 个维度("text-embedding-3-small" 模型)

    注意

    维数由用于生成嵌入的模型确定。 如果您调整此代码以使用不同的模型,请确保将正确的值传递给 numDimensions另请参阅“选择嵌入模型”部分。

  3. 保存并运行文件。 输出类似于:

    Successfully created a vector index named: [vector_index]
    Polling to confirm the index has completed building.
    It may take up to a minute for the index to build before you can query using it.
    vector_index index is ready to query

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 创建一个名为 VectorQuery.java 的文件并粘贴以下代码。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    VectorQuery.java
    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.search.FieldSearchPath;
    import org.bson.BsonArray;
    import org.bson.BsonValue;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.ArrayList;
    import java.util.List;
    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
    import static java.util.Arrays.asList;
    public class VectorQuery {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_db");
    MongoCollection<Document> collection = database.getCollection("embeddings");
    // define $vectorSearch query options
    String query = "ocean tragedy";
    EmbeddingProvider embeddingProvider = new EmbeddingProvider();
    BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
    List<Double> embedding = new ArrayList<>();
    for (BsonValue value : embeddingBsonArray.stream().toList()) {
    embedding.add(value.asDouble().getValue());
    }
    // define $vectorSearch pipeline
    String indexName = "vector_index";
    FieldSearchPath fieldSearchPath = fieldPath("embedding");
    int limit = 5;
    List<Bson> pipeline = asList(
    vectorSearch(
    fieldSearchPath,
    embedding,
    indexName,
    limit,
    exactVectorSearchOptions()
    ),
    project(
    fields(exclude("_id"), include("text"),
    metaVectorSearchScore("score"))));
    // run query and print results
    List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
    if (results.isEmpty()) {
    System.out.println("No results found.");
    } else {
    results.forEach(doc -> {
    System.out.println("Text: " + doc.getString("text"));
    System.out.println("Score: " + doc.getDouble("score"));
    });
    }
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
  2. 保存并运行文件。 输出类似于以下内容之一,具体取决于您使用的模型:

    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.004247286356985569
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.003116759704425931
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.002447686856612563
    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.45522359013557434
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.4049977660179138
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.35942474007606506
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_airbnb.listingsAndReviews 集合上创建索引,将 embeddings 字段指定为向量类型,将相似度测量指定为 euclidean

  1. 创建一个名为 CreateIndex.java 的文件并粘贴以下代码:

    CreateIndex.java
    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.ListSearchIndexesIterable;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.Collections;
    import java.util.List;
    public class CreateIndex {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
    MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
    // define the index details
    String indexName = "vector_index";
    int dimensionsHuggingFaceModel = 1024;
    int dimensionsOpenAiModel = 1536;
    Bson definition = new Document(
    "fields",
    Collections.singletonList(
    new Document("type", "vector")
    .append("path", "embeddings")
    .append("numDimensions", <dimensions>) // replace with var for the model used
    .append("similarity", "dotProduct")));
    // define the index model using the specified details
    SearchIndexModel indexModel = new SearchIndexModel(
    indexName,
    definition,
    SearchIndexType.vectorSearch());
    // create the index using the model
    try {
    List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
    System.out.println("Successfully created a vector index named: " + result);
    System.out.println("It may take up to a minute for the index to build before you can query using it.");
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    // wait for Atlas to build the index and make it queryable
    System.out.println("Polling to confirm the index has completed building.");
    ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
    Document doc = null;
    while (doc == null) {
    try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
    if (!cursor.hasNext()) {
    break;
    }
    Document current = cursor.next();
    String name = current.getString("name");
    boolean queryable = current.getBoolean("queryable");
    if (name.equals(indexName) && queryable) {
    doc = current;
    } else {
    Thread.sleep(500);
    }
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    }
    System.out.println(indexName + " index is ready to query");
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
  2. <dimensions> 占位符值替换为所用模型的相应变量:

    • dimensionsHuggingFaceModel1024 维度(开源)

    • dimensionsOpenAiModel1536 维度

    注意

    维数由用于生成嵌入的模型确定。 如果您使用其他模型,请确保将正确的值传递给 numDimensions另请参阅“选择嵌入模型”部分。

  3. 保存并运行文件。 输出类似于:

    Successfully created a vector index named: [vector_index]
    Polling to confirm the index has completed building.
    It may take up to a minute for the index to build before you can query using it.
    vector_index index is ready to query

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 创建一个名为 VectorQuery.java 的文件并粘贴以下代码。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    VectorQuery.java
    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.search.FieldSearchPath;
    import org.bson.BsonArray;
    import org.bson.BsonValue;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.ArrayList;
    import java.util.List;
    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
    import static java.util.Arrays.asList;
    public class VectorQuery {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
    MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
    // define the query and get the embedding
    String query = "beach house";
    EmbeddingProvider embeddingProvider = new EmbeddingProvider();
    BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
    List<Double> embedding = new ArrayList<>();
    for (BsonValue value : embeddingBsonArray.stream().toList()) {
    embedding.add(value.asDouble().getValue());
    }
    // define $vectorSearch pipeline
    String indexName = "vector_index";
    FieldSearchPath fieldSearchPath = fieldPath("embeddings");
    int limit = 5;
    List<Bson> pipeline = asList(
    vectorSearch(
    fieldSearchPath,
    embedding,
    indexName,
    limit,
    exactVectorSearchOptions()),
    project(
    fields(exclude("_id"), include("summary"),
    metaVectorSearchScore("score"))));
    // run query and print results
    List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
    if (results.isEmpty()) {
    System.out.println("No results found.");
    } else {
    results.forEach(doc -> {
    System.out.println("Summary: " + doc.getString("summary"));
    System.out.println("Score: " + doc.getDouble("score"));
    });
    }
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
  2. 保存并运行文件。 输出类似于以下内容之一,具体取决于您使用的模型:

    Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
    Score: 0.004518083296716213
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.0044807991944253445
    Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
    Score: 0.004242129623889923
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.004227751865983009
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.004220190457999706
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.4832950830459595
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.48092085123062134
    Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
    Score: 0.4629460275173187
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.4581468403339386
    Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.
    Score: 0.45398443937301636
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_db.embeddings 集合上创建索引,将 embedding 字段指定为向量类型,将相似度测量指定为 dotProduct

  1. 创建一个名为 create-index.js 的文件并粘贴以下代码。

    create-index.js
    import { MongoClient } from 'mongodb';
    // connect to your Atlas deployment
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    const database = client.db("sample_db");
    const collection = database.collection("embeddings");
    // define your Atlas Vector Search index
    const index = {
    name: "vector_index",
    type: "vectorSearch",
    definition: {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": <dimensions>
    }
    ]
    }
    }
    // run the helper method
    const result = await collection.createSearchIndex(index);
    console.log(result);
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. 如果使用开源模型,则将 <dimensions> 占位符值替换为 768;如果使用 OpenAI 模型,则将占位符值替换为 1536

  3. 保存文件,然后运行以下命令:

    node --env-file=.env create-index.js

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 创建一个名为 vector-query.js 的文件并粘贴以下代码。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    vector-query.js
    import { MongoClient } from 'mongodb';
    import { getEmbedding } from './get-embeddings.js';
    // MongoDB connection URI and options
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    // Connect to the MongoDB client
    await client.connect();
    // Specify the database and collection
    const database = client.db("sample_db");
    const collection = database.collection("embeddings");
    // Generate embedding for the search query
    const queryEmbedding = await getEmbedding("ocean tragedy");
    // Define the sample vector search pipeline
    const pipeline = [
    {
    $vectorSearch: {
    index: "vector_index",
    queryVector: queryEmbedding,
    path: "embedding",
    exact: true,
    limit: 5
    }
    },
    {
    $project: {
    _id: 0,
    text: 1,
    score: {
    $meta: "vectorSearchScore"
    }
    }
    }
    ];
    // run pipeline
    const result = collection.aggregate(pipeline);
    // print results
    for await (const doc of result) {
    console.dir(JSON.stringify(doc));
    }
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. 保存文件,然后运行以下命令:

    node --env-file=.env vector-query.js
    '{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}'
    '{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}'
    '{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}'
    node --env-file=.env vector-query.js
    {"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.7007871866226196}
    {"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.6327334046363831}
    {"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.5544710159301758}
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

完成以下步骤,在 sample_airbnb.listingsAndReviews 集合上创建索引,将 embedding 字段指定为向量类型,将相似度测量指定为 euclidean

  1. 创建一个名为 create-index.js 的文件并粘贴以下代码。

    create-index.js
    import { MongoClient } from 'mongodb';
    // connect to your Atlas deployment
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");
    // Define your Atlas Vector Search index
    const index = {
    name: "vector_index",
    type: "vectorSearch",
    definition: {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": <dimensions>
    }
    ]
    }
    }
    // Call the method to create the index
    const result = await collection.createSearchIndex(index);
    console.log(result);
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. 如果使用开源模型,则将 <dimensions> 占位符值替换为 768;如果使用 OpenAI 模型,则将占位符值替换为 1536

  3. 保存文件,然后运行以下命令:

    node --env-file=.env create-index.js

注意

构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2
  1. 创建一个名为 vector-query.js 的文件并粘贴以下代码。

    要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

    例如,此代码执行以下操作:

    • 使用您定义的嵌入函数创建查询嵌入。

    • queryVector字段中使用此嵌入并指定要查询的路径。

    • 使用 $vectorSearch 执行 ENN 搜索。

    • 返回语义相似的文档,按其搜索分数的相关性进行排序。

    vector-query.js
    import { MongoClient } from 'mongodb';
    import { getEmbedding } from './get-embeddings.js';
    // MongoDB connection URI and options
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    // Connect to the MongoDB client
    await client.connect();
    // Specify the database and collection
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");
    // Generate embedding for the search query
    const queryEmbedding = await getEmbedding("beach house");
    // Define the sample vector search pipeline
    const pipeline = [
    {
    $vectorSearch: {
    index: "vector_index",
    queryVector: queryEmbedding,
    path: "embedding",
    exact: true,
    limit: 5
    }
    },
    {
    $project: {
    _id: 0,
    summary: 1,
    score: {
    $meta: "vectorSearchScore"
    }
    }
    }
    ];
    // run pipeline
    const result = collection.aggregate(pipeline);
    // print results
    for await (const doc of result) {
    console.dir(JSON.stringify(doc));
    }
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. 保存文件,然后运行以下命令:

    node --env-file=.env vector-query.js
    '{"summary":"Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.","score":0.5334879159927368}'
    '{"summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":0.5240535736083984}'
    '{"summary":"The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.","score":0.5232879519462585}'
    '{"summary":"Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.","score":0.5186381340026855}'
    '{"summary":"A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.","score":0.5078228116035461}'
    node --env-file=.env vector-query.js
    {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
    {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
    {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
    {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

  1. 请将以下代码粘贴到您的笔记本中。

    此代码会对您的集合创建一个索引,而该索引指定了以下内容:

    • embedding 字段作为向量类型字段。

    • dotProduct 作为 float32 嵌入的相似性类型。

    • 768 作为这些嵌入中的维度数量。

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": 768
    }
    ]
    },
    name="vector_index",
    type="vectorSearch"
    )
    collection.create_search_index(model=search_index_model)
  2. 运行代码。

    构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

  1. 请将以下代码粘贴到您的笔记本中。

    此代码在集合上创建索引,并将 embedding字段指定为向量类型,将相似度函数指定为 dotProduct,将维数指定为 1536

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": 1536
    }
    ]
    },
    name="vector_index",
    type="vectorSearch"
    )
    collection.create_search_index(model=search_index_model)
  2. 运行代码。

    构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2

要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

例如,此代码执行以下操作:

  • 使用您定义的嵌入函数创建查询嵌入。

  • queryVector字段中使用此嵌入并指定要查询的路径。

  • 使用 $vectorSearch 执行 ENN 搜索。

  • 返回语义相似的文档,按其搜索分数的相关性进行排序。

注意

查询可能需要一些时间才能完成。

# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")
# Sample vector search pipeline
pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"queryVector": query_embedding,
"path": "embedding",
"exact": True,
"limit": 5
}
},
{
"$project": {
"_id": 0,
"text": 1,
"score": {
"$meta": "vectorSearchScore"
}
}
}
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
print(i)
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.7661112546920776}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.7050272822380066}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.7024770379066467}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.7011005282402039}
{'data': 'E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.', 'score': 0.6877288222312927}
# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")
# Sample vector search pipeline
pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"queryVector": query_embedding,
"path": "embedding",
"exact": True,
"limit": 5
}
},
{
"$project": {
"_id": 0,
"text": 1,
"score": {
"$meta": "vectorSearchScore"
}
}
}
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
print(i)
{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.7007871866226196}
{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.6327334046363831}
{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.5544710159301758}
1

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

  1. 请将以下代码粘贴到您的笔记本中。

    此代码会对您的集合创建一个索引,而该索引指定了以下内容:

    • embedding 字段作为向量类型字段。

    • dotProduct 作为 float32 嵌入的相似性类型。

    • 768 作为这些嵌入中的维度数量。

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": 768
    }
    ]
    },
    name="vector_index",
    type="vectorSearch"
    )
    collection.create_search_index(model=search_index_model)
  2. 运行代码。

    构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

要对数据启用向量搜索查询,您必须在集合上创建 Atlas Vector Search 索引。

  1. 请将以下代码粘贴到您的笔记本中。

    此代码在集合上创建索引,并将 embedding字段指定为向量类型,将相似度函数指定为 dotProduct,将维数指定为 1536

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": 1536
    }
    ]
    },
    name="vector_index",
    type="vectorSearch"
    )
    collection.create_search_index(model=search_index_model)
  2. 运行代码。

    构建索引大约需要一分钟时间。在建立索引时,索引处于初始同步状态。 构建完成后,您可以开始查询集合中的数据。

有关更多信息,请参阅创建 Atlas Vector Search 索引。

2

要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

例如,此代码执行以下操作:

  • 使用您定义的嵌入函数创建查询嵌入。

  • queryVector字段中使用此嵌入并指定要查询的路径。

  • 使用 $vectorSearch 执行 ENN 搜索。

  • 返回语义相似的文档,按其搜索分数的相关性进行排序。

# Generate embedding for the search query
query_embedding = get_embedding("beach house")
# Sample vector search pipeline
pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"queryVector": query_embedding,
"path": "embedding",
"exact": True,
"limit": 5
}
},
{
"$project": {
"_id": 0,
"summary": 1,
"score": {
"$meta": "vectorSearchScore"
}
}
}
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
print(i)
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.7847104072570801}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.', 'score': 0.7780507802963257}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", 'score': 0.7723637223243713}
{'summary': 'Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.', 'score': 0.7665778398513794}
{'summary': 'A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.', 'score': 0.7593404650688171}

要运行向量搜索查询,请生成一个查询向量以传递到聚合管道中。

例如,此代码执行以下操作:

  • 使用您定义的嵌入函数创建查询嵌入。

  • queryVector字段中使用此嵌入并指定要查询的路径。

  • 使用 $vectorSearch 执行 ENN 搜索。

  • 返回语义相似的文档,按其搜索分数的相关性进行排序。

# Generate embedding for the search query
query_embedding = get_embedding("beach house")
# Sample vector search pipeline
pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"queryVector": query_embedding,
"path": "embedding",
"exact": True,
"limit": 5
}
},
{
"$project": {
"_id": 0,
"summary": 1,
"score": {
"$meta": "vectorSearchScore"
}
}
}
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
print(i)
{"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
{"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
{"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
{"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
{"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}

在创建向量嵌入时请考虑以下因素:

要创建向量嵌入,您必须使用嵌入模型。嵌入模型是用于将数据转换为嵌入的算法。您可以选择以下方法之一连接到嵌入模型并创建向量嵌入:

方法
说明

加载开源模型

如果您没有专有嵌入模型的 API 密钥,请从您的应用程序本地加载开源嵌入模型。

使用专有模型

大多数 AI 提供商都为其专有的嵌入模型提供 API,您可以使用这些模型创建向量嵌入。

利用一个集成

您可以将 Atlas Vector Search 与开源框架和 AI 服务集成,以快速连接到开源和专有嵌入模型,并为 Atlas Vector Search 生成矢量嵌入。

要了解更多信息,请参阅将 Vector Search 与 AI 技术集成

您选择的嵌入模型会影响您的查询结果,并决定您在 Atlas Vector Search 索引中指定的维度数。每种模型都提供不同的优势,具体取决于您的数据和使用案例。

有关常用嵌入模型的列表,请参阅海量文本嵌入基准 (MTEB) 。此列表提供了对各种开源和专有文本嵌入模型的深入见解,并允许您按使用案例、模型类型和特定模型指标来过滤模型。

为 Atlas Vector Search 选择嵌入模型时,请考虑以下指标:

  • 嵌入维度:向量嵌入的长度。

    较小的嵌入可以提高存储效率,而较大的嵌入可以捕获数据中更细微的关系。您选择的模型应在效率和复杂性之间取得平衡。

  • Max Tokens(最大标记数):可在单个嵌入中压缩的令牌 数。

  • 模型大小:模型的大小,以千兆字节为单位。

    虽然较大的模型性能更好,但当您将 Atlas Vector Search 扩展到生产环境时,它们需要更多的计算资源。

  • 检索平均值:衡量检索系统性能的分数。

    分数越高,表示模型更擅长在检索结果列表中将相关文档排在较高的位置。在为 RAG 应用程序选择模型时,此分数很重要。

提示

如果您有大量浮点向量,并且希望减少 中的存储和WiredTiger占用空间(例如磁盘和内存使用量),请将嵌入转换为 mongodbinData向量以对其进行压缩。

BinData 是一种存储二进制数据的 BSON 数据类型。向量嵌入的默认类型是 32 位浮点数 (float32) 的大量。二进制数据比默认大量格式的存储效率更高,因此所需的磁盘空间减少三倍。

存储 binData 向量可提高查询性能,因为将文档加载到工作集所需的资源更少。这可以显着提高返回超过 20 个文档的向量查询的查询速度。如果压缩 float32 嵌入,则可以使用 float32binData 向量对其查询。

本页上的教程包括一个示例函数,您可以使用该函数将 float32 向量转换为 binData 向量。

BSON BinData 向量由以下驱动程序支持:

浮点向量通常难以压缩,因为大量中的每个元素都有自己的类型(尽管大多数向量具有统一类型)。因此,将嵌入模型的浮点向量输出转换为子类型为 float32binData 向量是一种更有效的序列化方案。 binData 向量存储整个向量的单个类型描述符,从而减少存储开销。

请考虑以下策略,以确保您的嵌入是正确且最佳的:

学会了如何创建嵌入和使用 Atlas Vector Search 查询嵌入之后,就可以通过实施检索增强生成 (RAG) 开始构建生成式人工智能应用:

您还可以将 32 位浮点向量嵌入量化为更少的位数,以进一步减少资源消耗并提高查询速度。要学习;了解更多信息,请参阅向量量化。

后退

Atlas Vector Search 快速入门