
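Vector Quantization

Atlas Vector Search supports automatic quantization of your float vector embeddings (both 32-bit and 64-bit). It also supports ingesting and indexing pre-quantized scalar and binary vectors from certain embedding models.

Quantization is the process of shrinking full-fidelity vectors into fewer bits. It reduces the amount of main memory required to store each vector in an Atlas Vector Search index by indexing the reduced-representation vectors instead. This allows you to store more vectors, or vectors with more dimensions. Quantization therefore reduces resource consumption and improves speed. We recommend quantization for applications with a large number of vectors, such as more than 10 million.

Scalar quantization first identifies the minimum and maximum values for each dimension of the indexed vectors to establish that dimension's range of values. It then divides the range into equally sized intervals, or bins. Finally, it maps each float value to a bin, which converts the continuous float values into discrete integers. In Atlas Vector Search, this quantization reduces the RAM cost of the vector embeddings to about one fourth (1/3.75) of the pre-quantization cost.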
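The three steps described above can be sketched in plain Java. This is an illustrative sketch only, not the implementation that Atlas Vector Search uses: the class and method names are hypothetical, and the choice of 256 bins mapped onto the signed byte range -128 to 127 is an assumption made so that each value fits in one int8.

```java
import java.util.Arrays;

public class ScalarQuantizationSketch {

    /** Step 1: find the minimum and maximum of every dimension. Returns {min[], max[]}. */
    static float[][] dimensionRanges(float[][] vectors) {
        int dims = vectors[0].length;
        float[] min = new float[dims];
        float[] max = new float[dims];
        Arrays.fill(min, Float.POSITIVE_INFINITY);
        Arrays.fill(max, Float.NEGATIVE_INFINITY);
        for (float[] v : vectors) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], v[d]);
                max[d] = Math.max(max[d], v[d]);
            }
        }
        return new float[][] {min, max};
    }

    /** Steps 2 and 3: split each range into 256 equal bins and map every value to its bin. */
    static byte[] quantize(float[] vector, float[] min, float[] max) {
        byte[] out = new byte[vector.length];
        for (int d = 0; d < vector.length; d++) {
            float range = max[d] - min[d];
            // A constant dimension has range 0; place it in the lowest bin.
            float scaled = range == 0f ? 0f : (vector[d] - min[d]) / range;
            // Clamp so that the maximum value (and any out-of-range value) stays in 0..255.
            int bin = Math.max(0, Math.min(255, (int) (scaled * 256)));
            out[d] = (byte) (bin - 128); // shift the bin index into the int8 range
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] vectors = {{-1.0f, 0.0f}, {0.0f, 0.5f}, {1.0f, 1.0f}};
        float[][] ranges = dimensionRanges(vectors);
        for (float[] v : vectors) {
            System.out.println(Arrays.toString(quantize(v, ranges[0], ranges[1])));
        }
    }
}
```

Each 4-byte float becomes a single byte, which is where the 4x reduction in the vector values themselves comes from.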

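Binary quantization assumes a midpoint of 0 for each dimension, which is typically appropriate for embeddings normalized to length 1, such as OpenAI's text-embedding-3-large. Each value in the vector is then compared to the midpoint and assigned a binary value of 1 if it is greater than the midpoint, or a binary value of 0 if it is less than or equal to the midpoint. In Atlas Vector Search, this quantization reduces the RAM cost of the vector embeddings to one twenty-fourth (1/24) of the pre-quantization cost. The reduction is not 1/32 because the data structure that holds the Hierarchical Navigable Small Worlds (HNSW) graph itself, which is separate from the vector values, isn't compressed.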
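As a rough illustration of the comparison against the midpoint, the following Java sketch turns a float vector into a packed bit vector. The class name is hypothetical, and the bit order (most significant bit first within each byte) is an assumption chosen for the example; it is not taken from this page.

```java
public class BinaryQuantizationSketch {

    /** Emits one bit per dimension: 1 if the value is greater than 0, otherwise 0. */
    static byte[] quantize(float[] vector) {
        if (vector.length % 8 != 0) {
            throw new IllegalArgumentException("dimension count must be a multiple of 8");
        }
        byte[] packed = new byte[vector.length / 8];
        for (int d = 0; d < vector.length; d++) {
            if (vector[d] > 0f) { // the midpoint is assumed to be 0
                // Set bit (7 - d % 8) of byte d / 8, so dimension 0 is the highest bit.
                packed[d / 8] |= (byte) (1 << (7 - d % 8));
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        float[] v = {0.3f, -0.1f, 0.0f, 0.9f, -0.5f, 0.2f, 0.7f, -0.8f};
        // The eight values become the bit pattern 10010110, stored in one byte.
        System.out.println(Integer.toBinaryString(quantize(v)[0] & 0xFF));
    }
}
```

Eight 32-bit floats collapse into a single byte here, which matches the 32x reduction of the vector values described above.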

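When you run a query, Atlas Vector Search converts the float values in the query vector into a binary vector using the same midpoint, so that it can compare the query vector with the indexed binary vectors efficiently. Rescoring then re-evaluates the candidates identified by the binary comparison, using the original float values associated with those results, to further refine the results. The full-fidelity vectors are stored in their own data structure on disk and are referenced only during rescoring when you configure binary quantization, or when you perform an exact search against either binary or scalar quantized vectors.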
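The two-stage flow described above (a cheap comparison on binary vectors, followed by rescoring with the original floats) can be sketched as follows. This is a brute-force illustration under stated assumptions, not how Atlas Vector Search is implemented: the class and method names are hypothetical, Hamming distance stands in for the binary comparison, and the dot product stands in for the full-fidelity similarity function.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RescoringSketch {

    /** One bit per dimension: 1 if the value is greater than the midpoint 0. */
    static byte[] binarize(float[] v) {
        byte[] bits = new byte[(v.length + 7) / 8];
        for (int d = 0; d < v.length; d++) {
            if (v[d] > 0f) {
                bits[d / 8] |= (byte) (1 << (7 - d % 8));
            }
        }
        return bits;
    }

    /** Number of differing bits between two binary vectors. */
    static int hamming(byte[] a, byte[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return distance;
    }

    static double dot(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns the positions of the best `limit` vectors after rescoring `numCandidates` candidates. */
    static List<Integer> search(float[] query, float[][] full, byte[][] bits,
                                int numCandidates, int limit) {
        byte[] queryBits = binarize(query);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < bits.length; i++) {
            ids.add(i);
        }
        // Stage 1: rank every vector by the cheap binary comparison and keep the closest ones.
        ids.sort(Comparator.comparingInt(i -> hamming(queryBits, bits[i])));
        List<Integer> candidates = new ArrayList<>(ids.subList(0, Math.min(numCandidates, ids.size())));
        // Stage 2: rescore only those candidates with the original float values.
        candidates.sort(Comparator.comparingDouble((Integer i) -> dot(query, full[i])).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }
}
```

The sketch shows why the number of candidates matters: a vector that ranks second on the binary comparison can still rank first after rescoring, but only if it was kept as a candidate.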

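See also:

  • What is Vector Quantization?

The following table shows the requirements for automatic quantization and for ingesting quantized vectors.

Note

Atlas stores all floating-point values as the double data type internally. Therefore, both 32-bit and 64-bit embeddings are compatible with automatic quantization without conversion.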

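Requirement | For int1 Ingestion | For int8 Ingestion | For Automatic Scalar Quantization | For Automatic Binary Quantization
--- | --- | --- | --- | ---
Requires index definition settings | No | No | Yes | Yes
Requires BSON binData format | Yes | Yes | No | No
Storage on mongod | binData(int1) | binData(int8) | binData(float32) or array(double) | binData(float32) or array(double)
Supported similarity methods | euclidean | cosine, euclidean, dotProduct | cosine, euclidean, dotProduct | cosine, euclidean, dotProduct
Supported number of dimensions | Multiple of 8 | 1 to 8192 | 1 to 8192 | Multiple of 8
Supports ANN and ENN search | Yes | Yes | Yes | Yes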

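You can configure Atlas Vector Search to automatically quantize the float vector embeddings in your collection to reduced representation types, such as int8 (scalar) and binary, in your vector indexes.

To set or change the quantization type, specify a quantization field value of either scalar or binary in your index definition. As with any other index definition change, this triggers an index rebuild. The quantization type that you specify applies to all indexed vectors and query vectors at query time.

For most embedding models, we recommend binary quantization with rescoring. If you want to use lower-dimension models that are not trained to be quantization aware (QAT), use scalar quantization, because it loses less of the representation and therefore less representational capacity.

Atlas Vector Search provides native capabilities for scalar quantization as well as binary quantization with rescoring. Automatic quantization reduces the computational resources needed to process your vectors efficiently, which increases scalability and cost savings for your applications. Automatic quantization reduces the RAM for mongot by 3.75x for scalar and by 24x for binary. The vector values shrink by 4x and 32x respectively, but the Hierarchical Navigable Small Worlds graph itself does not shrink. This improves performance even at the highest volume and scale.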
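To make the 4x and 32x figures for the vector values concrete, the following Java sketch computes the raw size of the values alone for a collection of vectors. The class name and the workload (10 million vectors with 1024 dimensions) are illustrative assumptions. The sketch deliberately ignores the HNSW graph, which the text above says does not shrink and which is why the overall RAM reduction is 3.75x and 24x rather than 4x and 32x.

```java
public class VectorSizeSketch {

    /** float32 values: 4 bytes per dimension. */
    static long float32Bytes(long vectors, int dims) {
        return vectors * dims * 4L;
    }

    /** Scalar-quantized (int8) values: 1 byte per dimension. */
    static long int8Bytes(long vectors, int dims) {
        return vectors * dims;
    }

    /** Binary-quantized (int1) values: 1 bit per dimension, 8 dimensions per byte. */
    static long int1Bytes(long vectors, int dims) {
        return vectors * dims / 8;
    }

    public static void main(String[] args) {
        long vectors = 10_000_000L; // hypothetical collection size
        int dims = 1024;            // hypothetical embedding dimension
        System.out.println("float32: " + float32Bytes(vectors, dims) + " bytes");
        System.out.println("int8:    " + int8Bytes(vectors, dims) + " bytes");
        System.out.println("int1:    " + int1Bytes(vectors, dims) + " bytes");
    }
}
```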

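We recommend automatic quantization if you have a large number of full-fidelity vectors, typically more than 10M vectors. After quantization, you index reduced-representation vectors without compromising accuracy when retrieving vectors.

To enable automatic quantization:

1

In a new or existing Atlas Vector Search index, specify one of the following quantization types in the fields.quantization field of your index definition:

  • scalar: produces byte vectors from float input vectors.

  • binary: produces bit vectors from float input vectors.

If you specify automatic quantization on data that is not an array of float values, Atlas Vector Search silently skips that vector instead of indexing it. Because Atlas stores float values (both 32-bit and 64-bit) as the double type internally, embeddings from models that output either precision work with automatic quantization.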
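As an illustration of where the quantization field sits, the following Java sketch builds an index definition as a JSON string by using only the standard library. The class name, the helper method, and the field path `embedding` are hypothetical; the keys (`type`, `path`, `numDimensions`, `similarity`, `quantization`) follow the index definitions used elsewhere on this page.

```java
import java.util.Locale;

public class QuantizedIndexDefinitionSketch {

    /** Builds a vector index definition with one field that uses automatic quantization. */
    static String definition(String path, int numDimensions, String similarity, String quantization) {
        if (!quantization.equals("scalar") && !quantization.equals("binary")) {
            throw new IllegalArgumentException("quantization must be scalar or binary");
        }
        return String.format(Locale.ROOT,
            "{\"fields\": [{\"type\": \"vector\", \"path\": \"%s\", \"numDimensions\": %d, "
                + "\"similarity\": \"%s\", \"quantization\": \"%s\"}]}",
            path, numDimensions, similarity, quantization);
    }

    public static void main(String[] args) {
        // "embedding" is a placeholder field path for this example.
        System.out.println(definition("embedding", 1024, "dotProduct", "scalar"));
    }
}
```

With the MongoDB Java driver on the classpath, a string like this could be turned into a definition with Document.parse and passed to a SearchIndexModel, in the same way that the index is created later in this procedure.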

2

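The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

The quantization type that you specify applies to all indexed vectors and query vectors at query time.

Atlas Vector Search also supports ingesting and indexing scalar and binary quantized vectors from certain embedding models. If you don't already have quantized vectors, you can convert your embeddings to BSON BinData vectors with the float32, int1, or int8 subtype.

We recommend ingesting quantized BSON binData vectors for the following use cases:

  • You need to index quantized vector output from embedding models.

  • You have a large number of float vectors and want to reduce the storage and WiredTiger footprint (such as disk and memory usage) in mongod.

BinData is a BSON data type that stores binary data. It compresses your vector embeddings and requires about three times less disk space in your cluster than embeddings that use a standard float32 array. To learn more, see Vector Compression.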
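The "about three times" figure can be approximated with a back-of-the-envelope calculation of the encoded BSON sizes. The following Java sketch is an estimate under stated assumptions, not a measurement: it assumes that the array elements are stored as 8-byte doubles with their decimal index as the key, that the binData vector carries a 2-byte header in front of the packed float32 values, and it ignores the field name and any storage-engine compression. The class and method names are hypothetical.

```java
public class BsonVectorSizeSketch {

    /** Approximate encoded size of a BSON array that holds `dims` doubles. */
    static long doubleArrayBytes(int dims) {
        long size = 4 + 1; // int32 length prefix plus the terminating 0x00 byte
        for (int i = 0; i < dims; i++) {
            int keyBytes = Integer.toString(i).length() + 1; // index as a null-terminated string
            size += 1 + keyBytes + 8; // type byte + key + 8-byte double
        }
        return size;
    }

    /** Approximate encoded size of a binData vector that holds `dims` float32 values. */
    static long float32BinDataBytes(int dims) {
        // int32 length + subtype byte + 2-byte vector header + 4 bytes per value
        return 4 + 1 + 2 + 4L * dims;
    }

    public static void main(String[] args) {
        int dims = 1024;
        long array = doubleArrayBytes(dims);
        long binData = float32BinDataBytes(dims);
        System.out.printf("array: %d bytes, binData: %d bytes, ratio: %.2f%n",
            array, binData, (double) array / binData);
    }
}
```

For 1024 dimensions the estimate comes out at a ratio slightly above 3, which is consistent with the figure quoted above.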

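This subtype also allows you to index your vectors with alternate types, such as int1 or int8 vectors, which reduces the memory needed to build the Atlas Vector Search index for your collection. It reduces the RAM for mongot by 3.75x for scalar and by 24x for binary. The vector values shrink by 4x and 32x respectively, but the Hierarchical Navigable Small Worlds graph itself does not shrink.

If you don't already have binData vectors, you can convert your embeddings to this format by using any supported driver before you write your data to a collection. The following procedure walks you through the steps for converting your embeddings to BinData vectors with the float32, int8, and int1 subtypes.

The following drivers support BSON BinData vectors with the float32, int1, and int8 subtypes: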




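To quantize your BSON binData vectors, you must have the following:

  • An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later.

    Ensure that your IP address is included in your Atlas project's access list.

  • Access to an embedding model that supports byte vector output.

    You can use the output of the following embedding models to generate BSON binData vectors with a supported MongoDB driver:

    Embedding Model Provider | Embedding Model
    --- | ---
    Voyage AI | voyage-3-large
    Cohere | embed-english-v3.0
    Nomic | nomic-embed-text-v1.5
    Jina AI | jina-embeddings-v2-base-en
    Mixedbread | mxbai-embed-large-v1

    Scalar quantization preserves recall for these models because they are all trained to be quantization aware. Therefore, recall degradation for scalar quantized embeddings produced by these models is minimal even at lower dimensions such as 384.

  • A terminal and an IDE or code editor to run your Java project.

  • A JDK (version 11 or later, which the sample code requires) and Maven or Gradle installed.

The examples in this procedure use either new data or existing data, together with embeddings generated by Cohere's embed-english-v3.0 model. The example for new data uses sample text strings, which you can replace with your own data. The example for existing data uses a subset of documents without any embeddings from the listingsAndReviews collection in the sample_airbnb database, which you can replace with your own database and collection (with or without any embeddings).

Follow the procedure that matches your case: quantizing binData vectors for new data (shown first) or for data that you already have in your Atlas cluster (shown second).

Create a Java project in your IDE with the dependencies configured for the MongoDB Java Driver, and then perform the following steps in the project. To try the example, replace the placeholders with valid values.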

1
  1. From your IDE, create a Java project by using Maven or Gradle.

  2. Add the following dependencies, depending on your package manager:

    If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file:

    pom.xml
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver-sync</artifactId>
            <version>5.3.1</version>
        </dependency>
        <dependency>
            <groupId>com.cohere</groupId>
            <artifactId>cohere-java</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.16</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>2.0.16</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

    build.gradle
    dependencies {
        // MongoDB Java Sync Driver v5.3.1 or later
        implementation 'org.mongodb:mongodb-driver-sync:[5.3.1,)'
        // Java library for working with Cohere models
        // (same com.cohere coordinates as in the Maven example above)
        implementation 'com.cohere:cohere-java:1.6.0'
        // SLF4J (The Simple Logging Facade for Java)
        testImplementation("org.slf4j:slf4j-simple:2.0.16")
        implementation("org.slf4j:slf4j-api:2.0.16")
    }
  3. Run your package manager to install the dependencies to your project.

2

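Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, a CI/CD pipeline, or a secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

  • If you are using IntelliJ IDEA, create a new Application run configuration template, and then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.

    To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.

  • If you are using Eclipse, create a new Java Application launch configuration, and then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.

    To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.

Environment variables
COHERE_API_KEY=<api-key>
MONGODB_URI=<connection-string>

Update the placeholders with the following values:

  • Replace the <api-key> placeholder value with your Cohere API key.

  • Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

    Your connection string should use the following format:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net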
3

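You can use an embedding model provider to generate float, int8, and int1 embeddings for your data, and then use the MongoDB Java driver to convert your native vector embeddings to BSON vectors. The following sample code uses Cohere's embed API to generate full-precision vectors together with their int8 and ubinary counterparts.

  1. Create a new file named GenerateAndConvertEmbeddings.java in your Java project.

    touch GenerateAndConvertEmbeddings.java
  2. Copy and paste the following code into the GenerateAndConvertEmbeddings.java file.

    This code does the following:

    • Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.

    • Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.

    • Creates a file named embeddings.json and saves the data with embeddings in the file to upload to Atlas.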

    GenerateAndConvertEmbeddings.java
    import com.cohere.api.Cohere;
    import com.cohere.api.requests.EmbedRequest;
    import com.cohere.api.types.EmbedByTypeResponse;
    import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
    import com.cohere.api.types.EmbedInputType;
    import com.cohere.api.types.EmbedResponse;
    import com.cohere.api.types.EmbeddingType;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Objects;
    import java.util.Optional;
    import org.bson.BinaryVector;
    import org.bson.Document;

    public class GenerateAndConvertEmbeddings {

        // List of text data to embed
        private static final List<String> DATA = List.of(
            "The Great Wall of China is visible from space.",
            "The Eiffel Tower was completed in Paris in 1889.",
            "Mount Everest is the highest peak on Earth at 8,848m.",
            "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
            "The Mona Lisa was painted by Leonardo da Vinci."
        );

        public static void main(String[] args) {
            // Cohere API key for authentication
            String apiKey = System.getenv("COHERE_API_KEY");

            // Fetch embeddings from the Cohere API
            EmbedByTypeResponseEmbeddings embeddings = fetchEmbeddingsFromCohere(apiKey);
            Document bsonEmbeddings = convertEmbeddingsToBson(embeddings);

            writeEmbeddingsToFile(bsonEmbeddings, "embeddings.json");
        }

        // Fetches embeddings based on input data from the Cohere API
        private static EmbedByTypeResponseEmbeddings fetchEmbeddingsFromCohere(String apiKey) {
            if (Objects.isNull(apiKey) || apiKey.isEmpty()) {
                throw new RuntimeException("API key not found. Please set COHERE_API_KEY in your environment.");
            }

            Cohere cohere = Cohere.builder().token(apiKey).clientName("embed-example").build();

            try {
                // Request all three representations in a single call
                EmbedRequest request = EmbedRequest.builder()
                    .model("embed-english-v3.0")
                    .inputType(EmbedInputType.SEARCH_DOCUMENT)
                    .texts(DATA)
                    .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                    .build();

                EmbedResponse response = cohere.embed(request);
                Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();

                return optionalEmbeddingsWrapper.orElseThrow().getEmbeddings();
            } catch (Exception e) {
                System.err.println("Error fetching embeddings: " + e.getMessage());
                throw e;
            }
        }

        // Converts embeddings to BSON binary vectors using MongoDB Java Driver
        private static Document convertEmbeddingsToBson(EmbedByTypeResponseEmbeddings embeddings) {
            List<List<Double>> floatEmbeddings = embeddings.getFloat().orElseThrow();
            List<List<Integer>> int8Embeddings = embeddings.getInt8().orElseThrow();
            List<List<Integer>> ubinaryEmbeddings = embeddings.getUbinary().orElseThrow();

            List<Document> bsonEmbeddings = new ArrayList<>();
            for (int i = 0; i < floatEmbeddings.size(); i++) {
                Document bsonEmbedding = new Document()
                    .append("text", DATA.get(i))
                    .append("embeddings_float32", BinaryVector.floatVector(listToFloatArray(floatEmbeddings.get(i))))
                    .append("embeddings_int8", BinaryVector.int8Vector(listToByteArray(int8Embeddings.get(i))))
                    // The second argument is the padding: the number of unused bits in the last byte
                    .append("embeddings_int1", BinaryVector.packedBitVector(listToByteArray(ubinaryEmbeddings.get(i)), (byte) 0));

                bsonEmbeddings.add(bsonEmbedding);
            }

            return new Document("data", bsonEmbeddings);
        }

        // Writes embeddings to JSON file
        private static void writeEmbeddingsToFile(Document bsonEmbeddings, String fileName) {
            try (FileOutputStream fos = new FileOutputStream(fileName)) {
                fos.write(bsonEmbeddings.toJson().getBytes(StandardCharsets.UTF_8));
                System.out.println("Embeddings saved to " + fileName);
            } catch (IOException e) {
                System.err.println("Error writing embeddings to file: " + e.getMessage());
            }
        }

        // Convert List of Doubles to an array of floats
        private static float[] listToFloatArray(List<Double> list) {
            float[] array = new float[list.size()];
            for (int i = 0; i < list.size(); i++) {
                array[i] = list.get(i).floatValue();
            }
            return array;
        }

        // Convert List of Integers to an array of bytes
        private static byte[] listToByteArray(List<Integer> list) {
            byte[] array = new byte[list.size()];
            for (int i = 0; i < list.size(); i++) {
                array[i] = list.get(i).byteValue();
            }
            return array;
        }
    }
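  3. If you didn't set the COHERE_API_KEY environment variable, replace the System.getenv("COHERE_API_KEY") call in the code with your API key, and then save the file.

  4. Compile and run the file by using your application run configuration.

    If you are using a terminal, run the following commands to compile and run your program (with the project dependencies on the classpath).

    javac GenerateAndConvertEmbeddings.java
    java GenerateAndConvertEmbeddings
    Embeddings saved to embeddings.json
  5. Verify the embeddings in the embeddings.json file.

To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.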
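The conversion step above passes the ubinary values through byteValue() before packing them. The following Java sketch shows what that conversion does and how the packed length relates to the number of dimensions. It is illustrative only: the class and method names are hypothetical, and it assumes that each ubinary value is an unsigned integer from 0 to 255 that carries eight dimensions.

```java
public class PackedBitsSketch {

    /** Converts unsigned values (0..255) to Java's signed bytes without changing the bit pattern. */
    static byte[] toBytes(int[] unsignedValues) {
        byte[] out = new byte[unsignedValues.length];
        for (int i = 0; i < unsignedValues.length; i++) {
            int v = unsignedValues[i];
            if (v < 0 || v > 255) {
                throw new IllegalArgumentException("value out of unsigned byte range: " + v);
            }
            out[i] = (byte) v; // 128..255 become -128..-1; the eight bits are unchanged
        }
        return out;
    }

    /** Number of dimensions held by a packed bit vector with the given padding. */
    static int dimensions(byte[] packed, int padding) {
        return packed.length * 8 - padding;
    }

    public static void main(String[] args) {
        byte[] packed = toBytes(new int[] {0, 127, 128, 255});
        for (byte b : packed) {
            // b & 0xFF recovers the original unsigned value.
            System.out.println(b + " -> " + (b & 0xFF));
        }
        System.out.println(dimensions(new byte[128], 0) + " dimensions in 128 bytes");
    }
}
```

Under that assumption, a 1024-dimension embedding packs into 128 bytes with a padding of 0, which is why the int1 field is indexed with 1024 dimensions and why the dimension count must be a multiple of 8.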

4

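You must upload your data and embeddings to a collection in your Atlas cluster and create an Atlas Vector Search index on the data to run $vectorSearch queries against it.

  1. Create a new file named UploadDataAndCreateIndex.java in your Java project.

    touch UploadDataAndCreateIndex.java
  2. Copy and paste the following code into the UploadDataAndCreateIndex.java file.

    This code does the following:

    • Uploads the data in the embeddings.json file to your Atlas cluster.

    • Creates an Atlas Vector Search index on the embeddings_float32, embeddings_int8, and embeddings_int1 fields.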

    UploadDataAndCreateIndex.java
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;
    import org.bson.Document;
    import org.bson.conversions.Bson;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.StreamSupport;

    public class UploadDataAndCreateIndex {

        private static final String MONGODB_URI = System.getenv("MONGODB_URI");
        private static final String DB_NAME = "<DATABASE-NAME>";
        private static final String COLLECTION_NAME = "<COLLECTION-NAME>";
        private static final String INDEX_NAME = "<INDEX-NAME>";

        public static void main(String[] args) {
            try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
                storeEmbeddings(mongoClient);
                setupVectorSearchIndex(mongoClient);
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        }

        // Reads embeddings.json and inserts its documents into the collection
        public static void storeEmbeddings(MongoClient client) throws IOException {
            MongoDatabase database = client.getDatabase(DB_NAME);
            MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

            String fileContent = Files.readString(Path.of("embeddings.json"));
            List<Document> documents = parseDocuments(fileContent);

            collection.insertMany(documents);
            System.out.println("Inserted documents into MongoDB");
        }

        // Extracts the list of documents stored under the "data" key
        private static List<Document> parseDocuments(String jsonContent) throws IOException {
            Document rootDoc = Document.parse(jsonContent);
            return rootDoc.getList("data", Document.class);
        }

        // Creates one vector index that covers the float32, int8, and int1 fields
        public static void setupVectorSearchIndex(MongoClient client) throws InterruptedException {
            MongoDatabase database = client.getDatabase(DB_NAME);
            MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

            Bson definition = new Document(
                "fields",
                List.of(
                    new Document("type", "vector")
                        .append("path", "embeddings_float32")
                        .append("numDimensions", 1024)
                        .append("similarity", "dotProduct"),
                    new Document("type", "vector")
                        .append("path", "embeddings_int8")
                        .append("numDimensions", 1024)
                        .append("similarity", "dotProduct"),
                    // int1 vectors support only the euclidean similarity method
                    new Document("type", "vector")
                        .append("path", "embeddings_int1")
                        .append("numDimensions", 1024)
                        .append("similarity", "euclidean")
                )
            );

            SearchIndexModel indexModel = new SearchIndexModel(
                INDEX_NAME,
                definition,
                SearchIndexType.vectorSearch()
            );

            List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
            System.out.println("Successfully created vector index named: " + result.get(0));
            System.out.println("It may take up to a minute for the index to leave the BUILDING status and become queryable.");

            System.out.println("Polling to confirm the index has changed from the BUILDING status.");
            if (!waitForIndex(collection, INDEX_NAME)) {
                System.out.println("Timed out waiting for the index to become queryable.");
            }
        }

        // Polls for up to 60 seconds; returns true once the index is queryable
        public static <T> boolean waitForIndex(final MongoCollection<T> collection, final String indexName) {
            long startTime = System.nanoTime();
            long timeoutNanos = TimeUnit.SECONDS.toNanos(60);
            while (System.nanoTime() - startTime < timeoutNanos) {
                Document indexRecord = StreamSupport.stream(collection.listSearchIndexes().spliterator(), false)
                    .filter(index -> indexName.equals(index.getString("name")))
                    .findAny().orElse(null);
                if (indexRecord != null) {
                    if ("FAILED".equals(indexRecord.getString("status"))) {
                        throw new RuntimeException("Search index has FAILED status.");
                    }
                    // Boolean.TRUE.equals avoids a NullPointerException if the field is missing
                    if (Boolean.TRUE.equals(indexRecord.getBoolean("queryable"))) {
                        System.out.println(indexName + " index is ready to query");
                        return true;
                    }
                }
                try {
                    Thread.sleep(100); // tight polling loop; use a longer interval or backoff in production
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            }
            return false;
        }
    }
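  3. Replace the following placeholder values in the code and save the file.

    MONGODB_URI

    Your Atlas cluster connection string, if you didn't set the environment variable.

    <DATABASE-NAME>

    Name of the database in your Atlas cluster.

    <COLLECTION-NAME>

    Name of the collection where you want to upload the data.

    <INDEX-NAME>

    Name of the Atlas Vector Search index for the collection.

  4. Compile and run the file by using your application run configuration.

    If you are using a terminal, run the following commands to compile and run your program (with the project dependencies on the classpath).

    javac UploadDataAndCreateIndex.java
    java UploadDataAndCreateIndex
    Inserted documents into MongoDB
    Successfully created vector index named: <INDEX-NAME>
    It may take up to a minute for the index to leave the BUILDING status and become queryable.
    Polling to confirm the index has changed from the BUILDING status.
    <INDEX-NAME> index is ready to query
  5. Log in to your Atlas cluster and verify the following:

    • The data in the namespace.

    • The Atlas Vector Search index for the collection.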

5

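To test your embeddings, you can run a query against your collection. Use an embedding model provider to generate float, int8, and int1 embeddings for your query text. The following sample code uses Cohere's embed API to generate full-precision vectors together with their int8 and ubinary counterparts. After generating the embeddings, use the MongoDB Java driver to convert your native vector embeddings to BSON vectors and run a $vectorSearch query against the collection.

  1. Create a new file named CreateEmbeddingsAndRunQuery.java in your Java project.

    touch CreateEmbeddingsAndRunQuery.java
  2. Copy and paste the following code into the CreateEmbeddingsAndRunQuery.java file.

    This code does the following:

    • Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.

    • Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.

    • Runs the query against your collection.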

    CreateEmbeddingsAndRunQuery.java
    import com.cohere.api.Cohere;
    import com.cohere.api.requests.EmbedRequest;
    import com.cohere.api.types.EmbedResponse;
    import com.cohere.api.types.EmbedByTypeResponse;
    import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
    import com.cohere.api.types.EmbeddingType;
    import com.cohere.api.types.EmbedInputType;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import org.bson.BinaryVector;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.approximateVectorSearchOptions;
    import static java.util.Arrays.asList;

    public class CreateEmbeddingsAndRunQuery {
        private static final String COHERE_API_KEY = System.getenv("COHERE_API_KEY");
        private static final String MONGODB_URI = System.getenv("MONGODB_URI");
        private static final String DB_NAME = "<DATABASE-NAME>";
        private static final String COLLECTION_NAME = "<COLLECTION-NAME>";
        private static final String VECTOR_INDEX_NAME = "<INDEX-NAME>";
        private static final String DATA_FIELD_NAME = "<DATA-FIELD-NAME>";

        public static void main(String[] args) {
            String queryText = "<QUERY-TEXT>";

            try {
                CreateEmbeddingsAndRunQuery processor = new CreateEmbeddingsAndRunQuery();
                Map<String, BinaryVector> embeddingsData = processor.generateAndConvertEmbeddings(queryText);
                processor.runVectorSearchQuery(embeddingsData);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Generate embeddings using Cohere's embed API from the query text
        public Map<String, BinaryVector> generateAndConvertEmbeddings(String text) throws Exception {
            if (COHERE_API_KEY == null || COHERE_API_KEY.isEmpty()) {
                throw new RuntimeException("API key not found. Set COHERE_API_KEY in your environment.");
            }

            Cohere cohere = Cohere.builder().token(COHERE_API_KEY).build();

            EmbedRequest request = EmbedRequest.builder()
                .model("embed-english-v3.0")
                .inputType(EmbedInputType.SEARCH_QUERY)
                .texts(List.of(text))
                .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                .build();

            EmbedResponse response = cohere.embed(request);
            Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();
            if (optionalEmbeddingsWrapper.isEmpty()) {
                throw new RuntimeException("No embeddings found in the API response.");
            }

            EmbedByTypeResponseEmbeddings embeddings = optionalEmbeddingsWrapper.get().getEmbeddings();
            return createBinaryVectorEmbeddings(embeddings);
        }

        // Convert embeddings to BSON binary vectors using MongoDB Java Driver
        private static Map<String, BinaryVector> createBinaryVectorEmbeddings(EmbedByTypeResponseEmbeddings embeddings) {
            Map<String, BinaryVector> binaryVectorEmbeddings = new HashMap<>();

            // Convert float embeddings
            List<Double> floatList = embeddings.getFloat().orElseThrow().get(0);
            if (floatList != null) {
                float[] floatData = listToFloatArray(floatList);
                BinaryVector floatVector = BinaryVector.floatVector(floatData);
                binaryVectorEmbeddings.put("float32", floatVector);
            }

            // Convert int8 embeddings
            List<Integer> int8List = embeddings.getInt8().orElseThrow().get(0);
            if (int8List != null) {
                byte[] int8Data = listToByteArray(int8List);
                BinaryVector int8Vector = BinaryVector.int8Vector(int8Data);
                binaryVectorEmbeddings.put("int8", int8Vector);
            }

            // Convert ubinary embeddings
            List<Integer> ubinaryList = embeddings.getUbinary().orElseThrow().get(0);
            if (ubinaryList != null) {
                byte[] int1Data = listToByteArray(ubinaryList);
                BinaryVector packedBitsVector = BinaryVector.packedBitVector(int1Data, (byte) 0);
                binaryVectorEmbeddings.put("int1", packedBitsVector);
            }

            return binaryVectorEmbeddings;
        }

        // Define and run $vectorSearch query using the embeddings
        public void runVectorSearchQuery(Map<String, BinaryVector> embeddingsData) {
            if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
                throw new RuntimeException("MongoDB URI not found. Set MONGODB_URI in your environment.");
            }

            try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
                MongoDatabase database = mongoClient.getDatabase(DB_NAME);
                MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

                // Run one query per vector type against the matching embeddings_<type> field
                for (String path : embeddingsData.keySet()) {
                    BinaryVector queryVector = embeddingsData.get(path);

                    List<Bson> pipeline = asList(
                        vectorSearch(
                            fieldPath("embeddings_" + path),
                            queryVector,
                            VECTOR_INDEX_NAME,
                            2,                                  // limit: number of documents to return
                            approximateVectorSearchOptions(5)   // numCandidates for the ANN search
                        ),
                        project(
                            fields(
                                exclude("_id"),
                                include(DATA_FIELD_NAME),
                                // Named "score" to match the sample output shown for this step
                                metaVectorSearchScore("score")
                            )
                        )
                    );

                    List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());

                    System.out.println("Results from " + path + " embeddings:");
                    for (Document result : results) {
                        System.out.println(result.toJson());
                    }
                }
            }
        }

        // Convert List of Doubles to an array of floats
        private static float[] listToFloatArray(List<Double> list) {
            float[] array = new float[list.size()];
            for (int i = 0; i < list.size(); i++) {
                array[i] = list.get(i).floatValue();
            }
            return array;
        }

        // Convert List of Integers to an array of bytes
        private static byte[] listToByteArray(List<Integer> list) {
            byte[] array = new byte[list.size()];
            for (int i = 0; i < list.size(); i++) {
                array[i] = list.get(i).byteValue();
            }
            return array;
        }
    }
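  3. Replace the following placeholder values in the code and save the file.

    MONGODB_URI

    Your Atlas cluster connection string, if you didn't set the environment variable.

    COHERE_API_KEY

    Your Cohere API key, if you didn't set the environment variable.

    <DATABASE-NAME>

    Name of the database in your Atlas cluster.

    <COLLECTION-NAME>

    Name of the collection where you ingested the data.

    <INDEX-NAME>

    Name of the Atlas Vector Search index for the collection.

    <DATA-FIELD-NAME>

    Name of the field that contains the text from which you generated embeddings. For this example, use text.

    <QUERY-TEXT>

    Text for the query. For this example, use science fact.

  4. Compile and run the file by using your application run configuration.

    If you are using a terminal, run the following commands to compile and run your program (with the project dependencies on the classpath).

    javac CreateEmbeddingsAndRunQuery.java
    java CreateEmbeddingsAndRunQuery
    Results from int1 embeddings:
    {"text": "Mount Everest is the highest peak on Earth at 8,848m.", "score": 0.642578125}
    {"text": "The Great Wall of China is visible from space.", "score": 0.61328125}
    Results from int8 embeddings:
    {"text": "Mount Everest is the highest peak on Earth at 8,848m.", "score": 0.5149773359298706}
    {"text": "The Great Wall of China is visible from space.", "score": 0.5146723985671997}
    Results from float32 embeddings:
    {"text": "Mount Everest is the highest peak on Earth at 8,848m.", "score": 0.6583383083343506}
    {"text": "The Great Wall of China is visible from space.", "score": 0.6536108255386353}

To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.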

For existing data, steps 1 and 2 are identical to steps 1 and 2 of the procedure for new data above: create the Java project with the same Maven or Gradle dependencies, and then set the COHERE_API_KEY and MONGODB_URI environment variables in your IDE. Continue with step 3.

3

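You can use an embedding model provider to generate float, int8, and int1 embeddings for your data, and then use the MongoDB Java driver to convert your native vector embeddings to BSON vectors. The following sample code uses Cohere's embed API to generate full-precision vectors, together with their int8 and ubinary counterparts, from the data in the sample_airbnb.listingsAndReviews namespace.

  1. Create a new file named GenerateAndConvertEmbeddings.java in your Java project.

    touch GenerateAndConvertEmbeddings.java
  2. Copy and paste the following code into the GenerateAndConvertEmbeddings.java file.

    This code does the following:

    • Gets the summary field from 50 documents in the sample_airbnb.listingsAndReviews namespace.

    • Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.

    • Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.

    • Creates a file named embeddings.json and saves the data with embeddings in the file.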

    GenerateAndConvertEmbeddings.java
    import com.cohere.api.Cohere;
    import com.cohere.api.requests.EmbedRequest;
    import com.cohere.api.types.EmbedByTypeResponse;
    import com.cohere.api.types.EmbedResponse;
    import com.cohere.api.types.EmbeddingType;
    import com.cohere.api.types.EmbedInputType;
    import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.FindIterable;
    import org.bson.BsonArray;
    import org.bson.Document;
    import org.bson.BinaryVector;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Objects;
    import java.util.Optional;

    public class GenerateAndConvertEmbeddings {
        private static final Logger logger = LoggerFactory.getLogger(GenerateAndConvertEmbeddings.class);
        private static final String COHERE_API_KEY = System.getenv("COHERE_API_KEY");
        private static final String MONGODB_URI = System.getenv("MONGODB_URI");

        public static void main(String[] args) {
            try {
                List<String> summaries = fetchSummariesFromMongoDB();
                if (summaries.isEmpty()) {
                    throw new RuntimeException("No summaries retrieved from MongoDB.");
                }
                EmbedByTypeResponseEmbeddings embeddingsData = fetchEmbeddingsFromCohere(COHERE_API_KEY, summaries);
                if (embeddingsData == null) {
                    throw new RuntimeException("Failed to fetch embeddings.");
                }
                convertAndSaveEmbeddings(summaries, embeddingsData);
            } catch (Exception e) {
                logger.error("Unexpected error: {}", e.getMessage(), e);
            }
        }

        // Reads up to 50 non-empty summaries from sample_airbnb.listingsAndReviews
        private static List<String> fetchSummariesFromMongoDB() {
            List<String> summaries = new ArrayList<>();
            if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
                throw new RuntimeException("MongoDB URI is not set.");
            }
            // Don't log the URI itself: the connection string contains credentials
            logger.info("Connecting to MongoDB.");
            try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
                String dbName = "sample_airbnb";
                String collName = "listingsAndReviews";
                MongoDatabase database = mongoClient.getDatabase(dbName);
                MongoCollection<Document> collection = database.getCollection(collName);
                // Skip documents whose summary is missing, null, or empty
                Document filter = new Document("summary", new Document("$nin", Arrays.asList(null, "")));
                FindIterable<Document> documentsCursor = collection.find(filter).limit(50);
                for (Document doc : documentsCursor) {
                    String summary = doc.getString("summary");
                    if (summary != null && !summary.isEmpty()) {
                        summaries.add(summary);
                    }
                }
                logger.info("Retrieved {} summaries from MongoDB.", summaries.size());
            } catch (Exception e) {
                logger.error("Error fetching from MongoDB: {}", e.getMessage(), e);
                throw new RuntimeException("Failed to fetch data from MongoDB", e);
            }
            return summaries;
        }

        // Requests float, int8, and ubinary embeddings; returns null if none are returned
        private static EmbedByTypeResponseEmbeddings fetchEmbeddingsFromCohere(String apiKey, List<String> data) {
            if (Objects.isNull(apiKey) || apiKey.isEmpty()) {
                throw new RuntimeException("API key is not set.");
            }
            Cohere cohere = Cohere.builder().token(apiKey).clientName("embed-example").build();
            try {
                EmbedRequest request = EmbedRequest.builder()
                    .model("embed-english-v3.0")
                    .inputType(EmbedInputType.SEARCH_DOCUMENT)
                    .texts(data)
                    .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                    .build();
                EmbedResponse response = cohere.embed(request);
                Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();
                if (optionalEmbeddingsWrapper.isPresent()) {
                    return optionalEmbeddingsWrapper.get().getEmbeddings();
                } else {
                    logger.warn("No embeddings were returned.");
                }
            } catch (Exception e) {
                logger.error("Error fetching embeddings: {}", e.getMessage(), e);
            }
            return null;
        }

        private static void convertAndSaveEmbeddings(List<String> summaries, EmbedByTypeResponseEmbeddings embeddings) {
            try {
                Document doc = new Document();
                BsonArray array = new BsonArray();
                for (int i = 0; i < summaries.size(); i++) {
                    String summary = summaries.get(i);

                    // Retrieve the embeddings for the current index
                    List<Double> floatList = embeddings.getFloat().orElseThrow().get(i);
                    List<Integer> int8List = embeddings.getInt8().orElseThrow().get(i);
                    List<Integer> ubinaryList = embeddings.getUbinary().orElseThrow().get(i);

                    // Convert lists to arrays
                    float[] floatData = listToFloatArray(floatList);
                    byte[] int8Data = listToByteArray(int8List);
                    byte[] int1Data = listToByteArray(ubinaryList);

                    // Create BinaryVector objects
                    BinaryVector floatVector = BinaryVector.floatVector(floatData);
                    BinaryVector int8Vector = BinaryVector.int8Vector(int8Data);
                    BinaryVector packedBitsVector = BinaryVector.packedBitVector(int1Data, (byte) 0);

                    Document document = new Document()
                        .append("text", summary)
                        .append("embeddings_float32", floatVector)
    124 .append("embeddings_int8", int8Vector)
    125 .append("embeddings_int1", packedBitsVector);
    126 array.add(document.toBsonDocument());
    127 }
    128 doc.append("data", array);
    129 try (FileOutputStream fos = new FileOutputStream("embeddings.json")) {
    130 fos.write(doc.toJson().getBytes());
    131 }
    132 logger.info("Embeddings with BSON vectors have been saved to embeddings.json");
    133 } catch (IOException e) {
    134 logger.error("Error writing embeddings to file: {}", e.getMessage(), e);
    135 }
    136 }
    137
    138 private static float[] listToFloatArray(List<Double> list) {
    139 float[] array = new float[list.size()];
    140 for (int i = 0; i < list.size(); i++) {
    141 array[i] = list.get(i).floatValue();
    142 }
    143 return array;
    144 }
    145
    146 private static byte[] listToByteArray(List<Integer> list) {
    147 byte[] array = new byte[list.size()];
    148 for (int i = 0; i < list.size(); i++) {
    149 array[i] = list.get(i).byteValue();
    150 }
    151 return array;
    152 }
    153}
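  3. If you didn't set the environment variables, replace the following placeholder values in the code and save the file.

    MONGODB_URI

    Your Atlas cluster connection string, if you didn't set the environment variable.

    COHERE_API_KEY

    Your Cohere API key, if you didn't set the environment variable.

  4. Compile and run the file by using your application run configuration.

    If you are using a terminal, run the following commands to compile and execute your program.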

    javac GenerateAndConvertEmbeddings.java
    java GenerateAndConvertEmbeddings
    [main] INFO GenerateAndConvertEmbeddings - Connecting to MongoDB at URI: <CONNECTION-STRING>
    ...
    [main] INFO GenerateAndConvertEmbeddings - Retrieved 50 summaries from MongoDB.
    [main] INFO GenerateAndConvertEmbeddings - Embeddings with BSON vectors have been saved to embeddings.json
  5. Verify the embeddings in the embeddings.json file.

To learn more about generating embeddings and converting them to binData vectors, see How to Create Vector Embeddings.

4

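You must upload your data and embeddings to a collection in your Atlas cluster and create an Atlas Vector Search index on the data to run $vectorSearch queries against it.

  1. Create a new file named UploadDataAndCreateIndex.java in your Java project.

    touch UploadDataAndCreateIndex.java
  2. Copy and paste the following code into the UploadDataAndCreateIndex.java file.

    This code does the following:

    • Uploads the float32, int8, and int1 embeddings in the embeddings.json file to your Atlas cluster.

    • Creates an Atlas Vector Search index on the embeddings_float32, embeddings_int8, and embeddings_int1 fields.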

    UploadDataAndCreateIndex.java
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;

    import org.bson.Document;
    import org.bson.conversions.Bson;
    import org.bson.BinaryVector; // Import the BinaryVector

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.StreamSupport;

    public class UploadDataAndCreateIndex {

        private static final String MONGODB_URI = System.getenv("MONGODB_URI");
        private static final String DB_NAME = "sample_airbnb";
        private static final String COLLECTION_NAME = "listingsAndReviews";
        private static final String INDEX_NAME = "<INDEX-NAME>";

        public static void main(String[] args) {
            try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
                uploadEmbeddingsData(mongoClient);
                setupVectorSearchIndex(mongoClient);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void uploadEmbeddingsData(MongoClient mongoClient) throws IOException {
            MongoDatabase database = mongoClient.getDatabase(DB_NAME);
            MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);
            String filePath = "embeddings.json";
            String fileContent = Files.readString(Path.of(filePath));

            Document rootDoc = Document.parse(fileContent);
            List<Document> embeddingsDocs = rootDoc.getList("data", Document.class);

            for (Document doc : embeddingsDocs) {
                // Retrieve the string value from the document
                String summary = doc.getString("text");

                // Get the BinaryVector objects from the document
                BinaryVector embeddingsFloat32 = doc.get("embeddings_float32", BinaryVector.class);
                BinaryVector embeddingsInt8 = doc.get("embeddings_int8", BinaryVector.class);
                BinaryVector embeddingsInt1 = doc.get("embeddings_int1", BinaryVector.class);

                // Create filter and update documents
                Document filter = new Document("summary", summary);
                Document update = new Document("$set", new Document("summary", summary)
                    .append("embeddings_float32", embeddingsFloat32)
                    .append("embeddings_int8", embeddingsInt8)
                    .append("embeddings_int1", embeddingsInt1));

                // Perform update operation with upsert option
                collection.updateOne(filter, update, new com.mongodb.client.model.UpdateOptions().upsert(true));
                System.out.println("Processed document with summary: " + summary);
            }
        }

        public static void setupVectorSearchIndex(MongoClient client) throws InterruptedException {
            MongoDatabase database = client.getDatabase(DB_NAME);
            MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);
            // Define the index details
            Bson definition = new Document(
                "fields",
                List.of(
                    new Document("type", "vector")
                        .append("path", "embeddings_float32")
                        .append("numDimensions", 1024)
                        .append("similarity", "dotProduct"),
                    new Document("type", "vector")
                        .append("path", "embeddings_int8")
                        .append("numDimensions", 1024)
                        .append("similarity", "dotProduct"),
                    new Document("type", "vector")
                        .append("path", "embeddings_int1")
                        .append("numDimensions", 1024)
                        .append("similarity", "euclidean")
                )
            );
            // Define the index model
            SearchIndexModel indexModel = new SearchIndexModel(
                INDEX_NAME,
                definition,
                SearchIndexType.vectorSearch()
            );
            // Create the index using the defined model
            List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
            System.out.println("Successfully created vector index named: " + result.get(0));
            System.out.println("It may take up to a minute for the index to leave the BUILDING status and become queryable.");
            // Wait for Atlas to build the index
            System.out.println("Polling to confirm the index has changed from the BUILDING status.");
            waitForIndex(collection, INDEX_NAME);
        }

        public static <T> boolean waitForIndex(final MongoCollection<T> collection, final String indexName) {
            long startTime = System.nanoTime();
            long timeoutNanos = TimeUnit.SECONDS.toNanos(60);
            while (System.nanoTime() - startTime < timeoutNanos) {
                Document indexRecord = StreamSupport.stream(collection.listSearchIndexes().spliterator(), false)
                    .filter(index -> indexName.equals(index.getString("name")))
                    .findAny().orElse(null);
                if (indexRecord != null) {
                    if ("FAILED".equals(indexRecord.getString("status"))) {
                        throw new RuntimeException("Search index has FAILED status.");
                    }
                    if (indexRecord.getBoolean("queryable")) {
                        System.out.println(indexName + " index is ready to query");
                        return true;
                    }
                }
                try {
                    Thread.sleep(100); // busy-wait, avoid in production
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            }
            return false;
        }
    }
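  3. Replace the following placeholder values in the code and save the file.

    MONGODB_URI

    Your Atlas cluster connection string, if you didn't set the environment variable.

    <INDEX-NAME>

    Name of the Atlas Vector Search index for the collection.

  4. Compile and run the file by using your application run configuration.

    If you are using a terminal, run the following commands to compile and execute your program.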

    javac UploadDataAndCreateIndex.java
    java UploadDataAndCreateIndex
    Successfully created vector index named: <INDEX-NAME>
    It may take up to a minute for the index to leave the BUILDING status and become queryable.
    Polling to confirm the index has changed from the BUILDING status.
    <INDEX-NAME> index is ready to query
  5. Log in to your Atlas cluster and verify the following:

    • Data in the namespace.

    • Atlas Vector Search index for the collection.

5

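To test your embeddings, you can run a query against your collection. Use an embedding model provider to generate float, int8, and int1 embeddings for your query text. The following sample code uses Cohere's embed API to generate these embeddings. After generating the embeddings, use the MongoDB Java driver to convert your native vector embeddings to BSON vectors and run a $vectorSearch query against the collection.

  1. Create a new file named CreateEmbeddingsAndRunQuery.java in your Java project.

    touch CreateEmbeddingsAndRunQuery.java
  2. Copy and paste the following code into the CreateEmbeddingsAndRunQuery.java file.

    This code does the following:

    • Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.

    • Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.

    • Runs the query against your collection and returns the results.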

    CreateEmbeddingsAndRunQuery.java
    import com.cohere.api.Cohere;
    import com.cohere.api.requests.EmbedRequest;
    import com.cohere.api.types.EmbedResponse;
    import com.cohere.api.types.EmbedByTypeResponse;
    import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
    import com.cohere.api.types.EmbeddingType;
    import com.cohere.api.types.EmbedInputType;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import org.bson.BinaryVector;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.approximateVectorSearchOptions;
    import static java.util.Arrays.asList;

    public class CreateEmbeddingsAndRunQuery {
        private static final String COHERE_API_KEY = System.getenv("COHERE_API_KEY");
        private static final String MONGODB_URI = System.getenv("MONGODB_URI");
        private static final String DB_NAME = "<DATABASE-NAME>";
        private static final String COLLECTION_NAME = "<COLLECTION-NAME>";
        private static final String VECTOR_INDEX_NAME = "<INDEX-NAME>";
        private static final String DATA_FIELD_NAME = "<DATA-FIELD>";

        public static void main(String[] args) {
            String queryText = "<QUERY-TEXT>";

            try {
                // Instantiate this class (the type name must match the enclosing class)
                CreateEmbeddingsAndRunQuery processor = new CreateEmbeddingsAndRunQuery();
                Map<String, BinaryVector> embeddingsData = processor.generateAndConvertEmbeddings(queryText);
                processor.runVectorSearchQuery(embeddingsData);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Generate embeddings using Cohere's embed API from the query text
        public Map<String, BinaryVector> generateAndConvertEmbeddings(String text) throws Exception {
            if (COHERE_API_KEY == null || COHERE_API_KEY.isEmpty()) {
                throw new RuntimeException("API key not found. Set COHERE_API_KEY in your environment.");
            }

            Cohere cohere = Cohere.builder().token(COHERE_API_KEY).build();

            EmbedRequest request = EmbedRequest.builder()
                .model("embed-english-v3.0")
                .inputType(EmbedInputType.SEARCH_QUERY)
                .texts(List.of(text))
                .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                .build();

            EmbedResponse response = cohere.embed(request);
            Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();
            if (optionalEmbeddingsWrapper.isEmpty()) {
                throw new RuntimeException("No embeddings found in the API response.");
            }

            EmbedByTypeResponseEmbeddings embeddings = optionalEmbeddingsWrapper.get().getEmbeddings();
            return createBinaryVectorEmbeddings(embeddings);
        }

        // Convert embeddings to BSON binary vectors using MongoDB Java Driver
        private static Map<String, BinaryVector> createBinaryVectorEmbeddings(EmbedByTypeResponseEmbeddings embeddings) {
            Map<String, BinaryVector> binaryVectorEmbeddings = new HashMap<>();

            // Convert float embeddings
            List<Double> floatList = embeddings.getFloat().orElseThrow().get(0);
            if (floatList != null) {
                float[] floatData = listToFloatArray(floatList);
                BinaryVector floatVector = BinaryVector.floatVector(floatData);
                binaryVectorEmbeddings.put("float32", floatVector);
            }

            // Convert int8 embeddings
            List<Integer> int8List = embeddings.getInt8().orElseThrow().get(0);
            if (int8List != null) {
                byte[] int8Data = listToByteArray(int8List);
                BinaryVector int8Vector = BinaryVector.int8Vector(int8Data);
                binaryVectorEmbeddings.put("int8", int8Vector);
            }

            // Convert ubinary embeddings
            List<Integer> ubinaryList = embeddings.getUbinary().orElseThrow().get(0);
            if (ubinaryList != null) {
                byte[] int1Data = listToByteArray(ubinaryList);
                BinaryVector packedBitsVector = BinaryVector.packedBitVector(int1Data, (byte) 0);
                binaryVectorEmbeddings.put("int1", packedBitsVector);
            }

            return binaryVectorEmbeddings;
        }

        // Define and run $vectorSearch query using the embeddings
        public void runVectorSearchQuery(Map<String, BinaryVector> embeddingsData) {
            if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
                throw new RuntimeException("MongoDB URI not found. Set MONGODB_URI in your environment.");
            }

            try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
                MongoDatabase database = mongoClient.getDatabase(DB_NAME);
                MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

                for (String path : embeddingsData.keySet()) {
                    BinaryVector queryVector = embeddingsData.get(path);

                    List<Bson> pipeline = asList(
                        vectorSearch(
                            fieldPath("embeddings_" + path),
                            queryVector,
                            VECTOR_INDEX_NAME,
                            2,
                            approximateVectorSearchOptions(5)
                        ),
                        project(
                            fields(
                                exclude("_id"),
                                include(DATA_FIELD_NAME),
                                metaVectorSearchScore("vectorSearchScore")
                            )
                        )
                    );

                    List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());

                    System.out.println("Results from " + path + " embeddings:");
                    for (Document result : results) {
                        System.out.println(result.toJson());
                    }
                }
            }
        }

        private static float[] listToFloatArray(List<Double> list) {
            float[] array = new float[list.size()];
            for (int i = 0; i < list.size(); i++) {
                array[i] = list.get(i).floatValue();
            }
            return array;
        }

        private static byte[] listToByteArray(List<Integer> list) {
            byte[] array = new byte[list.size()];
            for (int i = 0; i < list.size(); i++) {
                array[i] = list.get(i).byteValue();
            }
            return array;
        }
    }
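  3. Replace the following placeholder values in the code and save the file.

    MONGODB_URI

    Your Atlas cluster connection string, if you didn't set the environment variable.

    COHERE_API_KEY

    Your Cohere API key, if you didn't set the environment variable.

    <DATABASE-NAME>

    Name of the database in your Atlas cluster. For this example, use sample_airbnb.

    <COLLECTION-NAME>

    Name of the collection where you ingested the data. For this example, use listingsAndReviews.

    <INDEX-NAME>

    Name of the Atlas Vector Search index for the collection.

    <DATA-FIELD>

    Name of the field that contains the text from which you generated the embeddings. For this example, use summary.

    <QUERY-TEXT>

    Text for the query. For this example, use ocean view.

  4. Compile and run the file by using your application run configuration.

    If you are using a terminal, run the following commands to compile and execute your program.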

    javac CreateEmbeddingsAndRunQuery.java
    java CreateEmbeddingsAndRunQuery
    Results from int1 embeddings:
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "vectorSearchScore": 0.6591796875}
    {"summary": "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", "vectorSearchScore": 0.6337890625}
    Results from int8 embeddings:
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "vectorSearchScore": 0.5215557217597961}
    {"summary": "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", "vectorSearchScore": 0.5179016590118408}
    Results from float32 embeddings:
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "vectorSearchScore": 0.7278661131858826}
    {"summary": "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", "vectorSearchScore": 0.688639760017395}

To learn more about generating embeddings and converting them to binData vectors, see How to Create Vector Embeddings.

1

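Run the following command to install the MongoDB Node.js driver. This operation might take a few minutes to complete.

npm install mongodb

You must install v6.11 or later of the Node.js driver. If necessary, you can also install libraries from your embedding model provider. For example, to generate float32, int8, and int1 embeddings by using Cohere as demonstrated on this page, install Cohere: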

npm install cohere-ai dotenv
npm show cohere-ai version
2
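  1. To access the embedding model provider for generating and converting embeddings, set the environment variable for the provider's API key, if necessary.

    To use embeddings from Cohere, set the COHERE_API_KEY environment variable.

    export COHERE_API_KEY="<COHERE-API-KEY>"

    If you don't set the environment variable, replace <COHERE-API-KEY> in the sample code with your API key before you run the code.

  2. To access your Atlas cluster, set the MONGODB_URI environment variable.

    export MONGODB_URI="<CONNECTION-STRING>"

    Your connection string should use the following format:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

    If you don't set the environment variable, replace <CONNECTION-STRING> in the sample code with your connection string before you run the code.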
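
The scripts on this page fall back to a placeholder string when an environment variable is missing, so a forgotten variable tends to surface later as an authentication or connection error. As a minimal sketch, you could fail fast instead. The helper name `requireEnv` is hypothetical and not part of any library, and the sample connection string is made up for illustration.

```javascript
// Hypothetical helper: read a required setting from an environment-like object
// and reject unset values or unreplaced <PLACEHOLDER> strings.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`${name} is not set.`);
  }
  if (/^<.*>$/.test(value.trim())) {
    throw new Error(`${name} still contains a placeholder value.`);
  }
  return value;
}

// Example usage with an explicit object so that the sketch runs anywhere.
// The URI below is an illustrative value, not a real cluster.
const sampleEnv = { MONGODB_URI: 'mongodb+srv://user:pass@cluster0.example.mongodb.net' };
console.log(requireEnv('MONGODB_URI', sampleEnv));
```

In a script, you would call `requireEnv('MONGODB_URI')` once at startup and pass the result to the client constructor.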

3
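  1. Create a file named get-embeddings.js to generate float32, int8, and int1 vector embeddings by using Cohere's embed API.

    touch get-embeddings.js
  2. Copy and paste the following code into the get-embeddings.js file.

    This code does the following:

    • Generates float32, int8, and int1 embeddings for the given data by using Cohere's embed-english-v3.0 embedding model.

    • Stores the float, int8, and int1 embeddings in fields named float, int8, and ubinary, respectively.

    • Creates a file named embeddings.json and saves the embeddings in the file.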

    get-embeddings.js
    // Use 'require' for modules in a Node.js environment
    const { CohereClient } = require('cohere-ai');
    const { writeFile } = require('fs/promises');

    // Retrieve API key from environment variables or default placeholder
    const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

    if (!apiKey) {
      throw new Error('API key not found. Please set COHERE_API_KEY in your environment.');
    }

    // Instantiate the CohereClient with the API key
    const cohere = new CohereClient({ token: apiKey });

    async function main() {
      try {
        // Data to embed
        const data = [
          "The Great Wall of China is visible from space.",
          "The Eiffel Tower was completed in Paris in 1889.",
          "Mount Everest is the highest peak on Earth at 8,848m.",
          "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
          "The Mona Lisa was painted by Leonardo da Vinci.",
        ];

        // Fetch embeddings for the data using the cohere API
        const response = await cohere.v2.embed({
          model: 'embed-english-v3.0',
          inputType: 'search_document',
          texts: data,
          embeddingTypes: ['float', 'int8', 'ubinary'],
        });

        // Extract embeddings from the API response
        const { float, int8, ubinary } = response.embeddings;

        // Map the embeddings to the text data
        const embeddingsData = data.map((text, index) => ({
          text,
          embeddings: {
            float: float[index],
            int8: int8[index],
            ubinary: ubinary[index],
          },
        }));

        // Write the embeddings data to a JSON file
        await writeFile('embeddings.json', JSON.stringify(embeddingsData, null, 2));
        console.log('Embeddings saved to embeddings.json');
      } catch (error) {
        console.error('Error fetching embeddings:', error);
      }
    }

    // Execute the main function
    main();
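  3. If you didn't set your Cohere API key as an environment variable, replace the <COHERE-API-KEY> placeholder, and then save the file.

  4. Run the code to generate the embeddings.

    node get-embeddings.js
    Embeddings saved to embeddings.json
  5. Verify the embeddings in the generated embeddings.json file.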
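
When you inspect embeddings.json, the ubinary arrays are shorter than the float and int8 arrays because each unsigned byte packs eight one-bit dimensions, so a 1024-dimension embedding appears as 128 integers. The following sketch illustrates the idea under two assumptions: a midpoint of 0, and most-significant-bit-first packing. The function name `packBits` is hypothetical, and this is not Cohere's actual quantizer.

```javascript
// Hypothetical helper: binarize a float vector around 0 and pack 8 bits per byte.
// Assumes the vector length is a multiple of 8 and MSB-first bit order.
function packBits(vector) {
  if (vector.length % 8 !== 0) {
    throw new Error('Vector length must be a multiple of 8.');
  }
  const packed = new Uint8Array(vector.length / 8);
  vector.forEach((value, i) => {
    if (value > 0) {
      // Set bit (7 - i % 8) of byte floor(i / 8).
      packed[i >> 3] |= 1 << (7 - (i & 7));
    }
  });
  return packed;
}

// Bits 1,0,1,0,0,0,0,1 pack into the single byte 0b10100001.
const packed = packBits([0.5, -0.1, 0.2, 0, -0.3, -0.7, 0, 0.9]);
console.log(Array.from(packed)); // [ 161 ]
```

A quick sanity check on the generated file is therefore that each ubinary array has one eighth as many entries as the matching float array.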

4
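  1. Create a file named convert-embeddings.js to convert the float32, int8, and int1 vector embeddings from Cohere to BSON binData vectors by using the MongoDB Node.js driver.

    touch convert-embeddings.js
  2. Copy and paste the following code into the convert-embeddings.js file.

    This code does the following:

    • Generates BSON binData vectors for the float32, int8, and int1 embeddings.

    • Appends the float32, int8, and ubinary BSON binData vectors to the embeddings.json file.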

    convert-embeddings.js
    const fs = require('fs/promises');
    const { BSON } = require('mongodb');
    const { Binary } = BSON;

    async function main() {
      try {
        // Read and parse the contents of 'embeddings.json' file
        const fileContent = await fs.readFile('embeddings.json', 'utf8');
        const embeddingsData = JSON.parse(fileContent);

        // Map the embeddings data to add BSON binary representations with subtype 9
        const convertEmbeddingsData = embeddingsData.map(({ text, embeddings }) => {
          // Create Binary for Float32Array with manual subtype 9
          const bsonFloat32 = Binary.fromFloat32Array(new Float32Array(embeddings.float));

          // Create Binary for Int8Array with subtype 9
          const bsonInt8 = Binary.fromInt8Array(new Int8Array(embeddings.int8));

          // Create Binary for PackedBits (Uint8Array) with subtype 9
          const bsonPackedBits = Binary.fromPackedBits(new Uint8Array(embeddings.ubinary));

          return {
            text,
            embeddings: {
              float: embeddings.float, // Original float data
              int8: embeddings.int8, // Original int8 data
              ubinary: embeddings.ubinary, // Original packed bits data
            },
            bsonEmbeddings: {
              float32: bsonFloat32,
              int8: bsonInt8,
              packedBits: bsonPackedBits,
            },
          };
        });

        // Serialize the updated data to EJSON for BSON compatibility
        const ejsonSerializedData = BSON.EJSON.stringify(convertEmbeddingsData, null, null, { relaxed: false });

        // Write the serialized data to 'embeddings.json'
        await fs.writeFile('embeddings.json', ejsonSerializedData);
        console.log('Embeddings with BSON vectors have been saved to embeddings.json');
      } catch (error) {
        console.error('Error processing embeddings:', error);
      }
    }

    main();
  3. Run the program to generate the BSON binData vectors.

    node convert-embeddings.js
    Embeddings with BSON vectors have been saved to embeddings.json
  4. Verify the generated BSON embeddings in the embeddings.json file.
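
In the serialized file, each BSON vector should appear as a base64 payload with binary subtype 9. As I understand the BSON binary vector layout, the payload starts with a data-type byte (0x27 for float32, 0x03 for int8, 0x10 for packed bits) and a padding byte, followed by the vector data. The sketch below decodes that two-byte header so you can spot-check a payload. The helper name `describeVector` is hypothetical, and the layout details are stated as an assumption rather than taken from this page.

```javascript
// Data-type codes from the BSON binary vector (subtype 9) layout, as assumed above.
const DTYPES = { 0x03: 'int8', 0x27: 'float32', 0x10: 'packed_bit' };

// Hypothetical helper: describe a subtype 9 payload given its base64 string.
function describeVector(base64) {
  const bytes = Buffer.from(base64, 'base64');
  const dtype = DTYPES[bytes[0]];
  if (!dtype) throw new Error(`Unknown dtype byte: ${bytes[0]}`);
  const padding = bytes[1];
  const dataLength = bytes.length - 2;
  let dimensions;
  if (dtype === 'float32') dimensions = dataLength / 4; // 4 bytes per value
  else if (dtype === 'int8') dimensions = dataLength; // 1 byte per value
  else dimensions = dataLength * 8 - padding; // 8 values per byte
  return { dtype, padding, dimensions };
}

// Build a two-dimension float32 payload by hand to exercise the helper.
const payload = Buffer.alloc(2 + 8);
payload[0] = 0x27; // float32
payload[1] = 0; // no padding
payload.writeFloatLE(0.5, 2);
payload.writeFloatLE(-1.25, 6);
console.log(describeVector(payload.toString('base64')));
```

For the embeddings on this page, you would expect 1024 dimensions for all three vector types.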

5
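  1. Create a file named upload-data.js to connect to your Atlas cluster and create a collection in a database for the data in the embeddings.json file.

    touch upload-data.js
  2. Copy and paste the following code into the upload-data.js file.

    This code does the following:

    • Connects to your Atlas cluster and creates a namespace with the database and collection names that you specify.

    • Uploads the data in the embeddings.json file, including the embeddings, to the specified namespace.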

    upload-data.js
    const fs = require('fs/promises'); // Use fs/promises for asynchronous operations
    const { MongoClient, BSON } = require('mongodb'); // Import from the 'mongodb' package

    const { Binary } = BSON; // Ensure the Binary class is imported correctly

    async function main() {
      const MONGODB_URI = process.env.MONGODB_URI || "<CONNECTION-STRING>";
      const DB_NAME = "<DB-NAME>";
      const COLLECTION_NAME = "<COLLECTION-NAME>";

      let client;
      try {
        client = new MongoClient(MONGODB_URI);
        await client.connect();
        console.log("Connected to MongoDB");

        const db = client.db(DB_NAME);
        const collection = db.collection(COLLECTION_NAME);

        // Read and parse the contents of 'embeddings.json' file using EJSON
        const fileContent = await fs.readFile('embeddings.json', 'utf8');
        const embeddingsData = BSON.EJSON.parse(fileContent);

        // Map embeddings data to recreate BSON binary representations with the correct subtype
        const documents = embeddingsData.map(({ text, bsonEmbeddings }) => {
          return {
            text,
            bsonEmbeddings: {
              float32: bsonEmbeddings.float32,
              int8: bsonEmbeddings.int8,
              int1: bsonEmbeddings.packedBits
            }
          };
        });

        const result = await collection.insertMany(documents);
        console.log(`Inserted ${result.insertedCount} documents into MongoDB`);

      } catch (error) {
        console.error('Error storing embeddings in MongoDB:', error);
      } finally {
        if (client) {
          await client.close();
        }
      }
    }

    // Run the store function
    main();
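  3. Replace the following settings and save the file.

    <CONNECTION-STRING>

    Connection string for the Atlas cluster where you want to create the database and collection.

    Replace this value only if you didn't set the MONGODB_URI environment variable.

    <DB-NAME>

    Name of the database where you want to create the collection.

    <COLLECTION-NAME>

    Name of the collection where you want to store the generated embeddings.

  4. Run the following command to upload the data.

    node upload-data.js
  5. Verify that the documents exist in the collection on your Atlas cluster.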

6
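  1. Create a file named create-index.js to define an Atlas Vector Search index on the collection.

    touch create-index.js
  2. Copy and paste the following code into the create-index.js file to create the index.

    This code does the following:

    • Connects to your Atlas cluster and creates an index with the specified name for the specified namespace.

    • Indexes the bsonEmbeddings.float32 and bsonEmbeddings.int8 fields as the vector type with the dotProduct similarity function, and the bsonEmbeddings.int1 field as the vector type with the euclidean function.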

    create-index.js
    const { MongoClient } = require("mongodb");
    const { setTimeout } = require("timers/promises"); // Import from timers/promises

    // Connect to your Atlas deployment
    const uri = process.env.MONGODB_URI || "<CONNECTION-STRING>";

    const client = new MongoClient(uri);

    async function main() {
      try {
        const database = client.db("<DB-NAME>");
        const collection = database.collection("<COLLECTION-NAME>");

        // Define your Atlas Vector Search index
        const index = {
          name: "<INDEX-NAME>",
          type: "vectorSearch",
          definition: {
            fields: [
              {
                type: "vector",
                numDimensions: 1024,
                path: "bsonEmbeddings.float32",
                similarity: "dotProduct",
              },
              {
                type: "vector",
                numDimensions: 1024,
                path: "bsonEmbeddings.int8",
                similarity: "dotProduct",
              },
              {
                type: "vector",
                numDimensions: 1024,
                path: "bsonEmbeddings.int1",
                similarity: "euclidean",
              },
            ],
          },
        };

        // Run the helper method
        const result = await collection.createSearchIndex(index);
        console.log(`New search index named ${result} is building.`);

        // Wait for the index to be ready to query
        console.log("Polling to check if the index is ready. This may take up to a minute.");
        let isQueryable = false;

        // Use filtered search for index readiness
        while (!isQueryable) {
          const [indexData] = await collection.listSearchIndexes(index.name).toArray();

          if (indexData) {
            isQueryable = indexData.queryable;
            if (!isQueryable) {
              await setTimeout(5000); // Wait for 5 seconds before checking again
            }
          } else {
            // Handle the case where the index might not be found
            console.log(`Index ${index.name} not found.`);
            await setTimeout(5000); // Wait for 5 seconds before checking again
          }
        }

        console.log(`${result} is ready for querying.`);
      } catch (error) {
        console.error("Error:", error);
      } finally {
        await client.close();
      }
    }

    main().catch((err) => {
      console.error("Unhandled error:", err);
    });
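  3. Replace the following settings and save the file.

    <CONNECTION-STRING>

    Connection string for the Atlas cluster where you want to create the index.

    Replace this value only if you didn't set the MONGODB_URI environment variable.

    <DB-NAME>

    Name of the database where you want to create the collection.

    <COLLECTION-NAME>

    Name of the collection where you want to store the generated embeddings.

    <INDEX-NAME>

    Name of the index for the collection.

  4. Create the index.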

    node create-index.js
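
The polling loop in create-index.js repeats until the index reports that it is queryable, so the script keeps running if the index never reaches that state. A minimal sketch of a bounded alternative follows. The name `pollUntil` and its options are hypothetical, and the `check` callback stands in for a call such as `collection.listSearchIndexes(name).toArray()` followed by a test of the `queryable` field.

```javascript
const { setTimeout: sleep } = require('timers/promises');

// Hypothetical helper: call check() every intervalMs until it returns a truthy
// value or timeoutMs elapses. Resolves to that value, or to null on timeout.
async function pollUntil(check, { intervalMs = 5000, timeoutMs = 60000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const value = await check();
    if (value) return value;
    await sleep(intervalMs);
  }
  return null;
}

// Example with a stand-in check that succeeds on the third attempt.
let attempts = 0;
const ready = pollUntil(async () => (++attempts >= 3 ? { queryable: true } : null), {
  intervalMs: 10,
  timeoutMs: 1000,
});
ready.then((result) => console.log(result, attempts));
```

Returning null on timeout lets the caller decide whether to retry, log a warning, or exit with an error.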
7
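  1. Create a file named get-query-embeddings.js.

    touch get-query-embeddings.js
  2. Copy and paste the code into the get-query-embeddings.js file.

    The sample code does the following:

    • Generates float32, int8, and int1 embeddings for the query text by using Cohere.

    • Converts the generated embeddings to BSON binData vectors by using the MongoDB Node.js driver.

    • Saves the generated embeddings to a file named query-embeddings.json.

    get-query-embeddings.js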
    const { CohereClient } = require('cohere-ai');
    const { BSON } = require('mongodb');
    const { writeFile } = require('fs/promises');
    const dotenv = require('dotenv');
    const process = require('process');

    // Load environment variables
    dotenv.config();

    const { Binary } = BSON;

    // Get the API key from environment variables or set the key here
    const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

    // Fail early if the key is missing or the placeholder was left in place
    if (!apiKey || apiKey === '<COHERE-API-KEY>') {
      throw new Error('API key not found. Provide the COHERE_API_KEY.');
    }

    // Initialize CohereClient
    const cohere = new CohereClient({ token: apiKey });

    async function main(queryText) {
      try {
        if (typeof queryText !== 'string' || queryText.trim() === '') {
          throw new Error('Invalid query text. It must be a non-empty string.');
        }

        const data = [queryText];

        // Request embeddings from the Cohere API
        const response = await cohere.v2.embed({
          model: 'embed-english-v3.0',
          inputType: 'search_query',
          texts: data,
          embeddingTypes: ['float', 'int8', 'ubinary'], // Request all required embedding types
        });

        if (!response.embeddings) {
          throw new Error('Embeddings not found in the API response.');
        }

        const { float, int8, ubinary } = response.embeddings;

        const updatedEmbeddingsData = data.map((text, index) => {
          // Create the BSON Binary vectors for all embedding types
          const float32Binary = Binary.fromFloat32Array(new Float32Array(float[index])); // VECTOR_TYPE.FLOAT32
          const int8Binary = Binary.fromInt8Array(new Int8Array(int8[index])); // VECTOR_TYPE.INT8
          const packedBitsBinary = Binary.fromPackedBits(new Uint8Array(ubinary[index])); // VECTOR_TYPE.PACKED_BIT

          return {
            text,
            embeddings: {
              float: float[index],
              int8: int8[index],
              ubinary: ubinary[index],
            },
            bsonEmbeddings: {
              float32: float32Binary,
              int8: int8Binary,
              int1: packedBitsBinary,
            },
          };
        });

        // Serialize the embeddings using BSON EJSON for BSON compatibility
        const outputFileName = 'query-embeddings.json';
        const ejsonSerializedData = BSON.EJSON.stringify(updatedEmbeddingsData, null, null, { relaxed: false });
        await writeFile(outputFileName, ejsonSerializedData);
        console.log(`Embeddings with BSON data have been saved to ${outputFileName}`);
      } catch (error) {
        console.error('Error processing query text:', error);
      }
    }

    // Main function that takes a query string
    (async () => {
      const queryText = "<QUERY-TEXT>"; // Replace with your actual query text
      await main(queryText);
    })();
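  3. Replace the following settings and save the file.

    <COHERE-API-KEY>

    Your Cohere API key. Replace this value only if you didn't set the COHERE_API_KEY environment variable.

    <QUERY-TEXT>

    Your query text. For this tutorial, use science fact.

  4. Run the code to generate the embeddings for the query text.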

    node get-query-embeddings.js
    Embeddings with BSON data have been saved to query-embeddings.json
8
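  1. Create a file named run-query.js.

    touch run-query.js
  2. Copy and paste the following sample $vectorSearch query into the run-query.js file.

    The sample query does the following:

    • Connects to your Atlas cluster and uses the embeddings in the query-embeddings.json file to run $vectorSearch queries against the bsonEmbeddings.float32, bsonEmbeddings.int8, and bsonEmbeddings.int1 fields in the specified collection.

    • Prints the results for the Float32, Int8, and Packed Binary (Int1) embeddings to the console.

    run-query.js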
    // Import BSON from the driver so that EJSON parsing uses the driver's own bson version
    const { MongoClient, BSON } = require('mongodb');
    const fs = require('fs/promises');
    const dotenv = require('dotenv');

    dotenv.config();

    // MongoDB connection details
    const mongoUri = process.env.MONGODB_URI || '<CONNECTION-STRING>';
    const dbName = '<DB-NAME>'; // Update with your actual database name
    const collectionName = '<COLLECTION-NAME>'; // Update with your actual collection name

    // Indices and paths should match your MongoDB vector search configuration
    const VECTOR_INDEX_NAME = '<INDEX-NAME>'; // Replace with your actual index name
    const NUM_CANDIDATES = 5; // Number of candidate documents for the search
    const LIMIT = 2; // Limit for the number of documents to return

    // Fields in the collection that contain the BSON query vectors.
    // subtype 9 is the BSON binary subtype for vectors; it is listed for reference only.
    const FIELDS = [
      { path: 'float32', subtype: 9 },
      { path: 'int8', subtype: 9 },
      { path: 'int1', subtype: 9 }
    ];

    // Function to read BSON vectors from JSON and run vector search
    async function main() {
      // Initialize MongoDB client
      const client = new MongoClient(mongoUri);

      try {
        await client.connect();
        console.log("Connected to MongoDB");

        const db = client.db(dbName);
        const collection = db.collection(collectionName);

        // Load query embeddings from JSON file using EJSON parsing
        const fileContent = await fs.readFile('query-embeddings.json', 'utf8');
        const embeddingsData = BSON.EJSON.parse(fileContent);

        // Define and run the query for each embedding type
        const results = {};

        for (const fieldInfo of FIELDS) {
          const { path } = fieldInfo;
          const bsonBinary = embeddingsData[0]?.bsonEmbeddings?.[path];

          if (!bsonBinary) {
            console.warn(`BSON embedding for ${path} not found in the JSON.`);
            continue;
          }

          const bsonQueryVector = bsonBinary; // Directly use BSON Binary object

          const pipeline = [
            {
              $vectorSearch: {
                index: VECTOR_INDEX_NAME,
                path: `bsonEmbeddings.${path}`,
                queryVector: bsonQueryVector,
                numCandidates: NUM_CANDIDATES,
                limit: LIMIT,
              }
            },
            {
              $project: {
                _id: 0,
                text: 1, // Adjust projection fields as necessary to match your document structure
                score: { $meta: 'vectorSearchScore' }
              }
            }
          ];

          results[path] = await collection.aggregate(pipeline).toArray();
        }

        return results;
      } catch (error) {
        console.error('Error during vector search:', error);
      } finally {
        await client.close();
      }
    }

    // Main execution block
    (async () => {
      try {
        const results = await main();

        if (results) {
          console.log("Results from Float32 embeddings:");
          console.table(results.float32 || []);
          console.log("--------------------------------------------------------------------------");

          console.log("Results from Int8 embeddings:");
          console.table(results.int8 || []);
          console.log("--------------------------------------------------------------------------");

          console.log("Results from Packed Binary (PackedBits) embeddings:");
          console.table(results.int1 || []);
        }
      } catch (error) {
        console.error('Error executing main function:', error);
      }
    })();
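  3. Replace the following settings and save the run-query.js file.

    <CONNECTION-STRING>

    Connection string for the Atlas cluster where you want to run the query.

    Replace this value only if you didn't set the MONGODB_URI environment variable.

    <DB-NAME>

    Name of the database that contains the collection.

    <COLLECTION-NAME>

    Name of the collection that you want to query.

    <INDEX-NAME>

    Name of the index for the collection.

  4. Run the following command to execute the query.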

    node run-query.js
    Connected to MongoDB
    Results from Float32 embeddings:
    ┌─────────┬─────────────────────────────────────────────────────────┬────────────────────┐
    │ (index) │ text │ score │
    ├─────────┼─────────────────────────────────────────────────────────┼────────────────────┤
    │ 0 │ 'Mount Everest is the highest peak on Earth at 8,848m.' │ 0.6583383083343506 │
    │ 1 │ 'The Great Wall of China is visible from space.' │ 0.6536108255386353 │
    └─────────┴─────────────────────────────────────────────────────────┴────────────────────┘
    --------------------------------------------------------------------------
    Results from Int8 embeddings:
    ┌─────────┬─────────────────────────────────────────────────────────┬────────────────────┐
    │ (index) │ text │ score │
    ├─────────┼─────────────────────────────────────────────────────────┼────────────────────┤
    │ 0 │ 'Mount Everest is the highest peak on Earth at 8,848m.' │ 0.5149773359298706 │
    │ 1 │ 'The Great Wall of China is visible from space.' │ 0.5146723985671997 │
    └─────────┴─────────────────────────────────────────────────────────┴────────────────────┘
    --------------------------------------------------------------------------
    Results from Packed Binary (PackedBits) embeddings:
    ┌─────────┬─────────────────────────────────────────────────────────┬─────────────┐
    │ (index) │ text │ score │
    ├─────────┼─────────────────────────────────────────────────────────┼─────────────┤
    │ 0 │ 'Mount Everest is the highest peak on Earth at 8,848m.' │ 0.642578125 │
    │ 1 │ 'The Great Wall of China is visible from space.' │ 0.61328125 │
    └─────────┴─────────────────────────────────────────────────────────┴─────────────┘
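
The three result sets above return the same documents with different scores. One rough way to quantify how well a quantized field tracks the float32 baseline is to measure the overlap between their top results. The following sketch is illustrative only: `overlapAtK` is a hypothetical helper that is not part of the tutorial scripts, and it assumes that the `text` field identifies a document.

```javascript
// Hypothetical helper, not part of the tutorial scripts.
// Returns the fraction of baseline documents that also appear in the candidate list.
function overlapAtK(baseline, candidate, key = 'text') {
  const baselineKeys = new Set(baseline.map((doc) => doc[key]));
  if (baselineKeys.size === 0) return 0;
  const candidateKeys = new Set(candidate.map((doc) => doc[key]));
  let hits = 0;
  for (const k of baselineKeys) {
    if (candidateKeys.has(k)) hits += 1;
  }
  return hits / baselineKeys.size;
}

// Result lists copied from the sample output above.
const float32Results = [
  { text: 'Mount Everest is the highest peak on Earth at 8,848m.', score: 0.6583383083343506 },
  { text: 'The Great Wall of China is visible from space.', score: 0.6536108255386353 },
];
const int1Results = [
  { text: 'Mount Everest is the highest peak on Earth at 8,848m.', score: 0.642578125 },
  { text: 'The Great Wall of China is visible from space.', score: 0.61328125 },
];

console.log(overlapAtK(float32Results, int1Results)); // prints 1
```

In run-query.js, you could pass `results.float32` and `results.int1` to the helper after `main()` returns. With only two results per query, the number is a sanity check rather than a recall measurement.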
1

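Run the following command to install the MongoDB Node.js driver. This operation might take a few minutes to complete.

npm install mongodb

You must install version 6.11 or later of the Node.js driver. If necessary, you can also install libraries from your embedding model provider. For example, to generate float32, int8, and int1 embeddings by using Cohere as demonstrated on this page, install Cohere: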

npm install cohere-ai dotenv
npm show cohere-ai version
2
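  1. To access the embedding model provider for generating and converting embeddings, set the environment variable for the provider's API key, if necessary.

    To use embeddings from Cohere, set the COHERE_API_KEY environment variable.

    export COHERE_API_KEY="<COHERE-API-KEY>"

    If you don't set the environment variable, replace <COHERE-API-KEY> in the sample code with your API key before you run the code.

  2. To access the Atlas cluster, set the MONGODB_URI environment variable.

    export MONGODB_URI="<CONNECTION-STRING>"

    Your connection string should use the following format:

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

    If you don't set the environment variable, replace <CONNECTION-STRING> in the sample code with your connection string before you run the code.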
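
If your database username or password contains characters such as `@`, `:`, or `/`, those characters must be percent-encoded in the connection string. The following sketch shows one way to assemble the string in the format above. The `buildAtlasUri` helper and the host name are hypothetical and are used only for illustration.

```javascript
// Hypothetical helper for assembling a connection string in the format shown above.
// Percent-encodes the credentials so that reserved characters don't break the URI.
function buildAtlasUri(username, password, host) {
  return `mongodb+srv://${encodeURIComponent(username)}:${encodeURIComponent(password)}@${host}`;
}

// The host below is a placeholder, not a real cluster.
console.log(buildAtlasUri('app user', 'p@ss:word/1', 'cluster0.example.mongodb.net'));
// prints mongodb+srv://app%20user:p%40ss%3Aword%2F1@cluster0.example.mongodb.net
```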

3
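  1. Create a file named get-data.js.

    touch get-data.js
  2. Copy and paste the following sample code to fetch data from the sample_airbnb.listingsAndReviews namespace in your Atlas cluster.

    The sample code does the following:

    • Connects to your Atlas cluster and finds up to 50 documents that have a non-empty summary field.

    • Creates a file named subset.json and writes the data from the collection to it.

    get-data.js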
    const { MongoClient } = require('mongodb');
    const fs = require('fs'); // Import the fs module for file system operations

    async function main() {
      // Replace with your Atlas connection string
      const uri = process.env.MONGODB_URI || '<CONNECTION-STRING>';

      // Create a new MongoClient instance
      const client = new MongoClient(uri);

      try {
        // Connect to your Atlas cluster
        await client.connect();

        // Specify the database and collection
        const db = client.db('sample_airbnb');
        const collection = db.collection('listingsAndReviews');

        // Filter to exclude null or empty summary fields
        const filter = { summary: { $nin: [null, ''] } };

        // Get a subset of documents in the collection
        const documentsCursor = collection.find(filter).limit(50);

        // Convert the cursor to an array to get the documents
        const documents = await documentsCursor.toArray();

        // Log the documents to verify their content
        console.log('Documents retrieved:', documents);

        // Write the documents to a local file called "subset.json"
        const outputFilePath = './subset.json';
        fs.writeFileSync(outputFilePath, JSON.stringify(documents, null, 2), 'utf-8');

        console.log(`Subset of documents written to: ${outputFilePath}`);
      } catch (error) {
        console.error('An error occurred:', error);
      } finally {
        // Ensure the client is closed when finished
        await client.close();
      }
    }

    main().catch(console.error);
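  3. If you didn't set the environment variable for your Atlas connection string, replace the <CONNECTION-STRING> placeholder, and then save the file.

  4. Run the following command to fetch the data: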

    node get-data.js
    Subset of documents written to: ./subset.json
4

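Skip this step if your collection already contains float32, int8, and int1 vector embeddings.

  1. Create a file named get-embeddings.js to generate float32, int8, and int1 vector embeddings by using Cohere's embed API.

    touch get-embeddings.js
  2. Copy and paste the following code into the get-embeddings.js file.

    This code does the following:

    • Generates float32, int8, and int1 embeddings for the given data by using Cohere's embed-english-v3.0 embedding model.

    • Stores the float32, int8, and int1 embeddings in fields named float, int8, and ubinary respectively.

    • Creates a file named embeddings.json and saves the embeddings in the file.

    get-embeddings.js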
    // Import necessary modules using the CommonJS syntax
    const { CohereClient } = require('cohere-ai');
    const { readFile, writeFile } = require('fs/promises');

    // Retrieve the API key from environment variables or provide a placeholder
    const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

    if (!apiKey || apiKey === '<COHERE-API-KEY>') {
      throw new Error('API key not found. Please set COHERE_API_KEY in your environment.');
    }

    // Initialize the Cohere client with the API key
    const cohere = new CohereClient({ token: apiKey });

    async function main() {
      try {
        // Read and parse the contents of 'subset.json'
        const subsetData = await readFile('subset.json', 'utf-8');
        const documents = JSON.parse(subsetData);

        // Extract the 'summary' fields that are non-empty strings
        const data = documents
          .map(doc => doc.summary)
          .filter(summary => typeof summary === 'string' && summary.length > 0);

        if (data.length === 0) {
          throw new Error('No valid summary texts available in the data.');
        }

        // Request embeddings from the Cohere API
        const response = await cohere.v2.embed({
          model: 'embed-english-v3.0',
          inputType: 'search_document',
          texts: data,
          embeddingTypes: ['float', 'int8', 'ubinary'],
        });

        // Extract embeddings from the API response
        const { float, int8, ubinary } = response.embeddings;

        // Structure the embeddings data
        const embeddingsData = data.map((text, index) => ({
          text,
          embeddings: {
            float: float[index],
            int8: int8[index],
            ubinary: ubinary[index],
          },
        }));

        // Write the embeddings data to 'embeddings.json'
        await writeFile('embeddings.json', JSON.stringify(embeddingsData, null, 2));
        console.log('Embeddings saved to embeddings.json');
      } catch (error) {
        console.error('Error fetching embeddings:', error);
      }
    }

    // Execute the main function
    main();
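  3. If you didn't set the environment variable for your Cohere API key, replace the <COHERE-API-KEY> placeholder and save the file.

  4. Run the code to generate the embeddings.

    node get-embeddings.js
    Embeddings saved to embeddings.json
  5. Verify the generated embeddings by opening the generated embeddings.json file.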
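
When you inspect embeddings.json, expect each ubinary array to be one eighth the length of the float and int8 arrays, because a ubinary value packs eight 1-bit dimensions into one byte. For the 1024-dimension model used here, that is 128 bytes. The following sketch illustrates the packing and a length check. Both helpers are hypothetical, and the most-significant-bit-first order is an assumption about how the packed bytes are laid out.

```javascript
// Hypothetical helpers for inspecting embeddings.json; not part of the tutorial scripts.

// Packs an array of 0/1 values into bytes.
// Assumption: the first dimension maps to the most significant bit of each byte.
function packBits(bits) {
  if (bits.length % 8 !== 0) {
    throw new RangeError('Bit count must be a multiple of 8.');
  }
  const packed = new Uint8Array(bits.length / 8);
  bits.forEach((bit, i) => {
    if (bit) packed[i >> 3] |= 0x80 >> (i & 7);
  });
  return packed;
}

// Checks the array lengths of one entry against the numDimensions used in the index.
function hasExpectedLengths(embeddings, numDimensions) {
  return (
    embeddings.float.length === numDimensions &&
    embeddings.int8.length === numDimensions &&
    embeddings.ubinary.length === numDimensions / 8
  );
}

console.log(Array.from(packBits([1, 0, 0, 0, 0, 0, 0, 1]))); // prints [ 129 ]
console.log(packBits(new Array(1024).fill(0)).length); // prints 128
```

To check the generated file, you could parse embeddings.json with `JSON.parse` and call `hasExpectedLengths(entry.embeddings, 1024)` for each entry.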

5
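  1. Create a file named convert-embeddings.js to convert the float32, int8, and int1 vector embeddings from Cohere to BSON binData vectors.

    touch convert-embeddings.js
  2. Copy and paste the following code into the convert-embeddings.js file.

    This code does the following:

    • Generates BSON binData vectors for the float32, int8, and int1 embeddings.

    • Appends the float32, int8, and ubinary BSON binData vectors to the embeddings.json file.

    convert-embeddings.js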
    const fs = require('fs/promises');
    const { BSON } = require('mongodb');
    const { Binary } = BSON;

    async function main() {
      try {
        // Read and parse the contents of 'embeddings.json' file
        const fileContent = await fs.readFile('embeddings.json', 'utf8');
        const embeddingsData = JSON.parse(fileContent);

        // Map the embeddings data to add BSON binary representations with subtype 9
        const convertEmbeddingsData = embeddingsData.map(({ text, embeddings }) => {
          // Create Binary for Float32Array with subtype 9
          const bsonFloat32 = Binary.fromFloat32Array(new Float32Array(embeddings.float));

          // Create Binary for Int8Array with subtype 9
          const bsonInt8 = Binary.fromInt8Array(new Int8Array(embeddings.int8));

          // Create Binary for PackedBits (Uint8Array) with subtype 9
          const bsonPackedBits = Binary.fromPackedBits(new Uint8Array(embeddings.ubinary));

          return {
            text,
            embeddings: {
              float: embeddings.float, // Original float data
              int8: embeddings.int8, // Original int8 data
              ubinary: embeddings.ubinary, // Original packed bits data
            },
            bsonEmbeddings: {
              float32: bsonFloat32,
              int8: bsonInt8,
              packedBits: bsonPackedBits,
            },
          };
        });

        // Serialize the updated data to EJSON for BSON compatibility
        const ejsonSerializedData = BSON.EJSON.stringify(convertEmbeddingsData, null, null, { relaxed: false });

        // Write the serialized data to 'embeddings.json'
        await fs.writeFile('embeddings.json', ejsonSerializedData);
        console.log('Embeddings with BSON vectors have been saved to embeddings.json');
      } catch (error) {
        console.error('Error processing embeddings:', error);
      }
    }

    main();
  3. Run the program to generate the BSON binData vectors.

    node convert-embeddings.js
    Embeddings with BSON vectors have been saved to embeddings.json
  4. Verify the generated BSON embeddings in the embeddings.json file.

6
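  1. Create a file named upload-data.js to connect to the Atlas cluster and upload the data to the sample_airbnb.listingsAndReviews namespace.

    touch upload-data.js
  2. Copy and paste the following code into the upload-data.js file.

    This code does the following:

    • Connects to your Atlas cluster and opens the namespace defined by the database and collection names in the code.

    • Uploads the data, including the embeddings, to the sample_airbnb.listingsAndReviews namespace.

    upload-data.js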
    const fs = require('fs/promises'); // Use fs/promises for asynchronous operations
    const { MongoClient, BSON } = require('mongodb'); // Import from the 'mongodb' package
    const { EJSON } = BSON; // Use the driver's own bson version for EJSON parsing

    async function main() {
      const MONGODB_URI = process.env.MONGODB_URI || "<CONNECTION-STRING>";
      const DB_NAME = "sample_airbnb";
      const COLLECTION_NAME = "listingsAndReviews";

      let client;
      try {
        // Connect to MongoDB
        client = new MongoClient(MONGODB_URI);
        await client.connect();
        console.log("Connected to MongoDB");

        // Access database and collection
        const db = client.db(DB_NAME);
        const collection = db.collection(COLLECTION_NAME);

        // Load embeddings from JSON using EJSON.parse
        const fileContent = await fs.readFile('embeddings.json', 'utf8');
        const embeddingsData = EJSON.parse(fileContent); // Use EJSON.parse

        // Map embeddings data to recreate BSON binary representations
        const documents = embeddingsData.map(({ text, bsonEmbeddings }) => {
          return {
            summary: text,
            bsonEmbeddings: {
              float32: bsonEmbeddings.float32,
              int8: bsonEmbeddings.int8,
              int1: bsonEmbeddings.packedBits
            }
          };
        });

        // Iterate over documents and upsert each into the MongoDB collection
        for (const doc of documents) {
          const filter = { summary: doc.summary };
          const update = { $set: doc };

          // Update the document with the BSON binary data
          const result = await collection.updateOne(filter, update, { upsert: true });
          if (result.matchedCount > 0) {
            console.log(`Updated document with summary: ${doc.summary}`);
          } else {
            console.log(`Inserted new document with summary: ${doc.summary}`);
          }
        }

        console.log("Embeddings stored in MongoDB successfully.");
      } catch (error) {
        console.error('Error storing embeddings in MongoDB:', error);
      } finally {
        if (client) {
          await client.close();
        }
      }
    }

    // Run the main function to load the data
    main();
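  3. If you didn't set the environment variable for your Atlas connection string, replace the <CONNECTION-STRING> placeholder, and then save the file.

  4. Run the following command to upload the data.

    node upload-data.js
    Connected to MongoDB
    Updated document with summary: ...
    ...
    Embeddings stored in MongoDB successfully.
  5. Verify by logging in to your Atlas cluster and checking the namespace in the Data Explorer.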
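
The upload script sends one `updateOne` request per document. For larger data sets, a possible alternative is to batch the upserts into a single `collection.bulkWrite` call. The following sketch only builds the operations array, so it runs without a database connection. The `toUpsertOperations` helper is hypothetical, and the sample documents use placeholder values instead of real BSON vectors.

```javascript
// Hypothetical helper; builds operations in the shape that collection.bulkWrite accepts.
// Each operation mirrors the filter, update, and upsert option used in upload-data.js.
function toUpsertOperations(documents) {
  return documents.map((doc) => ({
    updateOne: {
      filter: { summary: doc.summary },
      update: { $set: doc },
      upsert: true,
    },
  }));
}

// Placeholder values stand in for the BSON Binary vectors here.
const operations = toUpsertOperations([
  { summary: 'Sample listing A', bsonEmbeddings: { float32: null, int8: null, int1: null } },
  { summary: 'Sample listing B', bsonEmbeddings: { float32: null, int8: null, int1: null } },
]);

console.log(operations.length); // prints 2
console.log(operations[0].updateOne.filter); // prints { summary: 'Sample listing A' }
```

In upload-data.js, the loop could then be replaced with `await collection.bulkWrite(operations, { ordered: false })`. This is a design choice, not a requirement: the per-document loop is simpler and logs whether each document was updated or inserted.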

7
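  1. Create a file named create-index.js.

    touch create-index.js
  2. Copy and paste the following code into the create-index.js file to create the index.

    This code does the following:

    • Connects to the Atlas cluster and creates an index with the specified name for the specified namespace.

    • Indexes the bsonEmbeddings.float32 and bsonEmbeddings.int8 fields as the vector type by using the dotProduct similarity function, and the bsonEmbeddings.int1 field as the vector type by using the euclidean function.

    create-index.js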
    const { MongoClient } = require("mongodb");
    const { setTimeout } = require("timers/promises"); // Import from timers/promises

    // Connect to your Atlas deployment
    const uri = process.env.MONGODB_URI || "<CONNECTION-STRING>";

    const client = new MongoClient(uri);

    async function main() {
      try {
        const database = client.db("<DB-NAME>");
        const collection = database.collection("<COLLECTION-NAME>");

        // Define your Atlas Vector Search index
        const index = {
          name: "<INDEX-NAME>",
          type: "vectorSearch",
          definition: {
            fields: [
              {
                type: "vector",
                numDimensions: 1024,
                path: "bsonEmbeddings.float32",
                similarity: "dotProduct",
              },
              {
                type: "vector",
                numDimensions: 1024,
                path: "bsonEmbeddings.int8",
                similarity: "dotProduct",
              },
              {
                type: "vector",
                numDimensions: 1024,
                path: "bsonEmbeddings.int1",
                similarity: "euclidean",
              },
            ],
          },
        };

        // Run the helper method
        const result = await collection.createSearchIndex(index);
        console.log(`New search index named ${result} is building.`);

        // Wait for the index to be ready to query
        console.log("Polling to check if the index is ready. This may take up to a minute.");
        let isQueryable = false;

        // Use filtered search for index readiness
        while (!isQueryable) {
          const [indexData] = await collection.listSearchIndexes(index.name).toArray();

          if (indexData) {
            isQueryable = indexData.queryable;
            if (!isQueryable) {
              await setTimeout(5000); // Wait for 5 seconds before checking again
            }
          } else {
            // Handle the case where the index might not be found
            console.log(`Index ${index.name} not found.`);
            await setTimeout(5000); // Wait for 5 seconds before checking again
          }
        }

        console.log(`${result} is ready for querying.`);
      } catch (error) {
        console.error("Error:", error);
      } finally {
        await client.close();
      }
    }

    main().catch((err) => {
      console.error("Unhandled error:", err);
    });
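  3. Replace the following settings and save the file.

    <CONNECTION-STRING>

    Connection string for the Atlas cluster where you want to create the index.

    Replace this value only if you didn't set the MONGODB_URI environment variable.

    <DB-NAME>

    Name of the database, which is sample_airbnb.

    <COLLECTION-NAME>

    Name of the collection, which is listingsAndReviews.

    <INDEX-NAME>

    Name of the index for the collection.

  4. Create the index.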

    node create-index.js
    New search index named vector_index is building.
    Polling to check if the index is ready. This may take up to a minute.
    <INDEX-NAME> is ready for querying.
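
The polling loop in create-index.js runs until the index reports that it is queryable, so it never stops if the index build fails. The following is a minimal sketch of the same polling pattern with an upper time limit. The `waitUntil` helper and its option names are hypothetical, and the readiness check is simulated so that the sketch runs without a cluster.

```javascript
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical helper: calls check() every intervalMs until it returns true
// or until timeoutMs has elapsed. Resolves to true on success, false on timeout.
async function waitUntil(check, { intervalMs = 5000, timeoutMs = 300000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return true;
    if (Date.now() + intervalMs > deadline) return false;
    await sleep(intervalMs);
  }
}

// Simulated readiness check that succeeds on the third call.
let calls = 0;
const simulatedCheck = async () => {
  calls += 1;
  return calls >= 3;
};

waitUntil(simulatedCheck, { intervalMs: 10, timeoutMs: 1000 }).then((ready) => {
  console.log(ready ? 'Index is ready for querying.' : 'Timed out waiting for the index.');
});
```

In create-index.js, the check function could wrap the existing `listSearchIndexes(index.name)` call and return the `queryable` value of the first result.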
8
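  1. Create a file named get-query-embeddings.js.

    touch get-query-embeddings.js
  2. Copy and paste the following code into the get-query-embeddings.js file.

    The sample code does the following:

    • Generates float32, int8, and int1 embeddings for the query text by using Cohere.

    • Converts the generated embeddings to BSON binData vectors by using the MongoDB Node.js driver.

    • Saves the generated embeddings to a file named query-embeddings.json.

    get-query-embeddings.js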
    const { CohereClient } = require('cohere-ai');
    const { BSON } = require('mongodb');
    const { writeFile } = require('fs/promises');
    const dotenv = require('dotenv');
    const process = require('process');

    // Load environment variables
    dotenv.config();

    const { Binary } = BSON;

    // Get the API key from environment variables or set the key here
    const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

    // Fail early if the key is missing or the placeholder was left in place
    if (!apiKey || apiKey === '<COHERE-API-KEY>') {
      throw new Error('API key not found. Provide the COHERE_API_KEY.');
    }

    // Initialize CohereClient
    const cohere = new CohereClient({ token: apiKey });

    async function main(queryText) {
      try {
        if (typeof queryText !== 'string' || queryText.trim() === '') {
          throw new Error('Invalid query text. It must be a non-empty string.');
        }

        const data = [queryText];

        // Request embeddings from the Cohere API
        const response = await cohere.v2.embed({
          model: 'embed-english-v3.0',
          inputType: 'search_query',
          texts: data,
          embeddingTypes: ['float', 'int8', 'ubinary'], // Request all required embedding types
        });

        if (!response.embeddings) {
          throw new Error('Embeddings not found in the API response.');
        }

        const { float, int8, ubinary } = response.embeddings;

        const updatedEmbeddingsData = data.map((text, index) => {
          // Create the BSON Binary vectors for all embedding types
          const float32Binary = Binary.fromFloat32Array(new Float32Array(float[index])); // VECTOR_TYPE.FLOAT32
          const int8Binary = Binary.fromInt8Array(new Int8Array(int8[index])); // VECTOR_TYPE.INT8
          const packedBitsBinary = Binary.fromPackedBits(new Uint8Array(ubinary[index])); // VECTOR_TYPE.PACKED_BIT

          return {
            text,
            embeddings: {
              float: float[index],
              int8: int8[index],
              ubinary: ubinary[index],
            },
            bsonEmbeddings: {
              float32: float32Binary,
              int8: int8Binary,
              int1: packedBitsBinary,
            },
          };
        });

        // Serialize the embeddings using BSON EJSON for BSON compatibility
        const outputFileName = 'query-embeddings.json';
        const ejsonSerializedData = BSON.EJSON.stringify(updatedEmbeddingsData, null, null, { relaxed: false });
        await writeFile(outputFileName, ejsonSerializedData);
        console.log(`Embeddings with BSON data have been saved to ${outputFileName}`);
      } catch (error) {
        console.error('Error processing query text:', error);
      }
    }

    // Main function that takes a query string
    (async () => {
      const queryText = "<QUERY-TEXT>"; // Replace with your actual query text
      await main(queryText);
    })();
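  3. Replace the following settings and save the file.

    <COHERE-API-KEY>

    Your Cohere API key. Replace this value only if you didn't set the key as an environment variable.

    <QUERY-TEXT>

    Your query text. For this example, use ocean view.

  4. Run the code to generate the embeddings for the query text.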

    node get-query-embeddings.js
    Embeddings with BSON data have been saved to query-embeddings.json
9
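  1. Create a file named run-query.js.

    touch run-query.js
  2. Copy and paste the following sample $vectorSearch query into the run-query.js file.

    The sample query does the following:

    • Connects to your Atlas cluster and uses the embeddings in the query-embeddings.json file to run $vectorSearch queries against the bsonEmbeddings.float32, bsonEmbeddings.int8, and bsonEmbeddings.int1 fields in the sample_airbnb.listingsAndReviews namespace.

    • Prints the results for the Float32, Int8, and Packed Binary (Int1) embeddings to the console.

    run-query.js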
    // Import BSON from the driver so that EJSON parsing uses the driver's own bson version
    const { MongoClient, BSON } = require('mongodb');
    const fs = require('fs/promises');
    const dotenv = require('dotenv');

    dotenv.config();

    // MongoDB connection details
    const mongoUri = process.env.MONGODB_URI || '<CONNECTION-STRING>';
    const dbName = 'sample_airbnb'; // Update with your actual database name
    const collectionName = 'listingsAndReviews'; // Update with your actual collection name

    // Indices and paths should match your MongoDB vector search configuration
    const VECTOR_INDEX_NAME = '<INDEX-NAME>'; // Replace with your actual index name
    const NUM_CANDIDATES = 20; // Number of candidate documents for the search
    const LIMIT = 5; // Limit for the number of documents to return

    // Fields in the collection that contain the BSON query vectors.
    // subtype 9 is the BSON binary subtype for vectors; it is listed for reference only.
    const FIELDS = [
      { path: 'float32', subtype: 9 },
      { path: 'int8', subtype: 9 },
      { path: 'int1', subtype: 9 }
    ];

    // Function to read BSON vectors from JSON and run vector search
    async function main() {
      // Initialize MongoDB client
      const client = new MongoClient(mongoUri);

      try {
        await client.connect();
        console.log("Connected to MongoDB");

        const db = client.db(dbName);
        const collection = db.collection(collectionName);

        // Load query embeddings from JSON file using EJSON parsing
        const fileContent = await fs.readFile('query-embeddings.json', 'utf8');
        const embeddingsData = BSON.EJSON.parse(fileContent);

        // Define and run the query for each embedding type
        const results = {};

        for (const fieldInfo of FIELDS) {
          const { path } = fieldInfo;
          const bsonBinary = embeddingsData[0]?.bsonEmbeddings?.[path];

          if (!bsonBinary) {
            console.warn(`BSON embedding for ${path} not found in the JSON.`);
            continue;
          }

          const bsonQueryVector = bsonBinary; // Directly use BSON Binary object

          const pipeline = [
            {
              $vectorSearch: {
                index: VECTOR_INDEX_NAME,
                path: `bsonEmbeddings.${path}`,
                queryVector: bsonQueryVector,
                numCandidates: NUM_CANDIDATES,
                limit: LIMIT,
              }
            },
            {
              $project: {
                _id: 0,
                name: 1,
                summary: 1, // Adjust projection fields as necessary to match your document structure
                score: { $meta: 'vectorSearchScore' }
              }
            }
          ];

          results[path] = await collection.aggregate(pipeline).toArray();
        }

        return results;
      } catch (error) {
        console.error('Error during vector search:', error);
      } finally {
        await client.close();
      }
    }

    // Main execution block
    (async () => {
      try {
        const results = await main();

        if (results) {
          console.log("Results from Float32 embeddings:");
          (results.float32 || []).forEach((result, index) => {
            console.log(`Result ${index + 1}:`, result);
          });

          console.log("Results from Int8 embeddings:");
          (results.int8 || []).forEach((result, index) => {
            console.log(`Result ${index + 1}:`, result);
          });

          console.log("Results from Packed Binary (PackedBits) embeddings:");
          (results.int1 || []).forEach((result, index) => {
            console.log(`Result ${index + 1}:`, result);
          });
        }
      } catch (error) {
        console.error('Error executing main function:', error);
      }
    })();
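  3. Replace the following settings and save the run-query.js file.

    <CONNECTION-STRING>

    Connection string for the Atlas cluster where you want to run the query.

    Replace this value only if you didn't set the MONGODB_URI environment variable.

    <INDEX-NAME>

    Name of the index for the collection.

  4. Run the query.

    To execute the query, run the following command: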

    node run-query.js
    Connected to MongoDB
    Results from Float32 embeddings:
    Result 1: {
    name: 'Makaha Valley Paradise with OceanView',
    summary: "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.",
    score: 0.7278661131858826
    }
    Result 2: {
    name: 'Ocean View Waikiki Marina w/prkg',
    summary: "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.",
    score: 0.688639760017395
    }
    Result 3: {
    name: 'A Casa Alegre é um apartamento T1.',
    summary: 'Para 2 pessoas. Vista de mar a 150 mts. Prédio com 2 elevadores. Tem: - quarto com roupeiro e cama de casal (colchão magnetizado); - cozinha: placa de discos, exaustor, frigorifico, micro-ondas e torradeira; casa de banho completa; - sala e varanda.',
    score: 0.6831139326095581
    }
    Result 4: {
    name: 'Your spot in Copacabana',
    summary: 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.',
    score: 0.6802051663398743
    }
    Result 5: {
    name: 'LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4!',
    summary: 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!',
    score: 0.6779564619064331
    }
    Results from Int8 embeddings:
    Result 1: {
    name: 'Makaha Valley Paradise with OceanView',
    summary: "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.",
    score: 0.5215557217597961
    }
    Result 2: {
    name: 'Ocean View Waikiki Marina w/prkg',
    summary: "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.",
    score: 0.5179016590118408
    }
    Result 3: {
    name: 'A Casa Alegre é um apartamento T1.',
    summary: 'Para 2 pessoas. Vista de mar a 150 mts. Prédio com 2 elevadores. Tem: - quarto com roupeiro e cama de casal (colchão magnetizado); - cozinha: placa de discos, exaustor, frigorifico, micro-ondas e torradeira; casa de banho completa; - sala e varanda.',
    score: 0.5173280239105225
    }
    Result 4: {
    name: 'Your spot in Copacabana',
    summary: 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.',
    score: 0.5170232057571411
    }
    Result 5: {
    name: 'LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4!',
    summary: 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!',
    score: 0.5168724060058594
    }
    Results from Packed Binary (PackedBits) embeddings:
    Result 1: {
    name: 'Makaha Valley Paradise with OceanView',
    summary: "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.",
    score: 0.6591796875
    }
    Result 2: {
    name: 'Ocean View Waikiki Marina w/prkg',
    summary: "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.",
    score: 0.6337890625
    }
    Result 3: {
    name: 'A Casa Alegre é um apartamento T1.',
    summary: 'Para 2 pessoas. Vista de mar a 150 mts. Prédio com 2 elevadores. Tem: - quarto com roupeiro e cama de casal (colchão magnetizado); - cozinha: placa de discos, exaustor, frigorifico, micro-ondas e torradeira; casa de banho completa; - sala e varanda.',
    score: 0.62890625
    }
    Result 4: {
    name: 'LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4!',
    summary: 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!',
    score: 0.6279296875
    }
    Result 5: {
    name: 'Be Happy in Porto',
    summary: 'Be Happy Apartment is an amazing space. Renovated and comfortable apartment, located in a building dating from the nineteenth century in one of the most emblematic streets of the Porto city "Rua do Almada". Be Happy Apartment is located in the city center, able you to visit the historic center only by foot, being very close of majority points of interesting of the Porto City. Be Happy Apartment is located close of central Station MetroTrindade.',
    score: 0.619140625
    }

    Your results might vary because the generated embeddings can differ depending on your environment.

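Create an interactive Python notebook by saving a file with the .ipynb extension, and then perform the following steps in the notebook. To try the example, replace the placeholders with valid values.

Work with a runnable version of this tutorial as a Python notebook.

1

Run the following command to install the PyMongo driver. If necessary, you can also install libraries from your embedding model provider. This operation might take a few minutes to complete.

pip install pymongo

You must install PyMongo driver v4.10 or later.

Example

Install PyMongo and Cohere

pip install --quiet --upgrade pymongo cohere
2

Example

Sample data to import

data = [
    "The Great Wall of China is visible from space.",
    "The Eiffel Tower was completed in Paris in 1889.",
    "Mount Everest is the highest peak on Earth at 8,848m.",
    "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
    "The Mona Lisa was painted by Leonardo da Vinci.",
]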
3

Perform this step if you haven't yet generated embeddings from your data. If you've already generated embeddings, skip this step. To learn more about generating embeddings from your data, see How to Create Vector Embeddings.

Example

Generate embeddings from the sample data using Cohere

Placeholder
Valid Value

<COHERE-API-KEY>

API key for Cohere.

import os
import cohere

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Generate embeddings using the embed-english-v3.0 model
generated_embeddings = cohere_client.embed(
    texts=data,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float", "int8", "ubinary"]
).embeddings

float32_embeddings = generated_embeddings.float
int8_embeddings = generated_embeddings.int8
int1_embeddings = generated_embeddings.ubinary
4

You can use the PyMongo driver to convert your native vector embeddings to BSON vectors.

Example

Define and run a function to generate BSON vectors

from bson.binary import Binary, BinaryVectorDtype

# Convert a list of numbers to a BSON binary vector of the given type
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

# For all vectors in your collection, generate BSON vectors of float32, int8, and int1 embeddings
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
for f32_emb, int8_emb, int1_emb in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
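
To make the int1 (packed-bit) representation more concrete, the following minimal sketch packs a list of 0/1 values into bytes, eight dimensions per byte, which is why int1 vectors need a dimension count that is a multiple of 8. The helper names `pack_bits` and `unpack_bits` are hypothetical, and the most-significant-bit-first ordering is an assumption chosen for illustration. In the tutorial itself, Cohere already returns packed `ubinary` values, so you don't need to pack them yourself.

```python
def pack_bits(bits):
    """Pack a list of 0/1 ints into a list of byte values (0-255).

    Assumes most-significant-bit-first ordering within each byte.
    """
    if len(bits) % 8 != 0:
        raise ValueError("number of dimensions must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed


def unpack_bits(packed):
    """Expand byte values back into a flat list of 0/1 ints."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


bits = [1, 0, 0, 0, 0, 0, 0, 1] + [0] * 8
packed = pack_bits(bits)
print(packed)                        # 16 dimensions stored in 2 bytes: [129, 0]
print(len(pack_bits([0] * 1024)))    # a 1024-dimension vector needs 128 bytes
```

The round trip through `unpack_bits` returns the original bit list, which can be a convenient sanity check when you pack vectors yourself.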
5

If the documents in your collection already contain BSON vector embeddings, skip this step.

Example

Create documents from the sample data

Placeholder
Valid Value

<FIELD-NAME-FOR-FLOAT32-TYPE>

Name of the field with float32 values.

<FIELD-NAME-FOR-INT8-TYPE>

Name of the field with int8 values.

<FIELD-NAME-FOR-INT1-TYPE>

Name of the field with int1 values.

# Specify the field names for the float32, int8, and int1 embeddings
float32_field = "<FIELD-NAME-FOR-FLOAT32-TYPE>"
int8_field = "<FIELD-NAME-FOR-INT8-TYPE>"
int1_field = "<FIELD-NAME-FOR-INT1-TYPE>"

# Define function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, data)):
        doc = {
            "_id": i,
            "data": text,
            float32_field: bson_f32_emb,
            int8_field: bson_int8_emb,
            int1_field: bson_int1_emb
        }
        docs.append(doc)
    return docs

# Create the documents
documents = create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, data)
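
Because `zip` stops at the shortest input without raising an error, a mismatch between the number of texts and the number of embeddings would silently drop documents. The following sketch, which uses a hypothetical helper named `check_aligned`, shows one way you might guard against that before building documents. It isn't part of the original tutorial, and the lists below are stand-in values.

```python
def check_aligned(**sequences):
    """Raise ValueError unless every named sequence has the same length.

    Returns the shared length so callers can log or reuse it.
    """
    lengths = {name: len(seq) for name, seq in sequences.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"length mismatch: {lengths}")
    return next(iter(lengths.values()), 0)


# Stand-in values; in the tutorial these would be the text list and the BSON vector lists.
texts = ["a", "b", "c"]
float32_vectors = [[0.1], [0.2], [0.3]]
int8_vectors = [[1], [2], [3]]

count = check_aligned(data=texts, float32=float32_vectors, int8=int8_vectors)
print(f"{count} aligned items")
```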
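6

You can load your data from the Atlas UI or programmatically. To learn how to load data from the Atlas UI, see Insert Your Data. The following steps and associated examples demonstrate how to load data programmatically by using the PyMongo driver.

  1. Connect to your Atlas cluster.

    Placeholder
    Valid Value

    <ATLAS-CONNECTION-STRING>

    Your Atlas connection string. To learn more, see Connect via Drivers.

    Example

    import pymongo

    # Connect to your Atlas cluster
    MONGO_URI = "<ATLAS-CONNECTION-STRING>"
    if not MONGO_URI:
        print("MONGO_URI not set in environment variables")
    mongo_client = pymongo.MongoClient(MONGO_URI)
  2. Load the data into your Atlas cluster.

    Placeholder
    Valid Value

    <DB-NAME>

    Name of the database.

    <COLLECTION-NAME>

    Name of the collection in the specified database.

    Example

    # Insert documents into a new database and collection
    db = mongo_client["<DB-NAME>"]
    collection_name = "<COLLECTION-NAME>"
    db.create_collection(collection_name)
    collection = db[collection_name]
    collection.insert_many(documents)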
7

You can create Atlas Vector Search indexes by using the Atlas UI, Atlas CLI, Atlas Administration API, and MongoDB drivers. To learn more, see How to Index Fields for Vector Search.

Example

Create an index for the sample collection

Placeholder
Valid Value

<INDEX-NAME>

Name of the vector type index.

from pymongo.operations import SearchIndexModel
import time

# Define and create the vector search index
index_name = "<INDEX-NAME>"
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": float32_field,
                "similarity": "dotProduct",
                "numDimensions": 1024
            },
            {
                "type": "vector",
                "path": int8_field,
                "similarity": "dotProduct",
                "numDimensions": 1024
            },
            {
                "type": "vector",
                "path": int1_field,
                "similarity": "euclidean",
                "numDimensions": 1024
            }
        ]
    },
    name=index_name,
    type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
predicate = None
if predicate is None:
    predicate = lambda index: index.get("queryable") is True
while True:
    indices = list(collection.list_search_indexes(index_name))
    if len(indices) and predicate(indices[0]):
        break
    time.sleep(5)
print(result + " is ready for querying.")
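
The polling loop in this step waits indefinitely if the index never becomes queryable. As a hedged alternative, the sketch below wraps the same idea in a hypothetical `wait_until` helper that gives up after a timeout. The `is_ready` callable is a placeholder: with PyMongo you might pass a function that checks the `queryable` field returned by `list_search_indexes`, as the original loop does. The default timeout and interval are arbitrary illustrative values.

```python
import time


def wait_until(is_ready, timeout_s=120.0, interval_s=5.0, sleep=time.sleep):
    """Call is_ready() repeatedly until it returns True or the timeout elapses.

    Returns True if the condition was met and False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Simulated readiness check that succeeds on the third call.
calls = {"count": 0}

def fake_index_ready():
    calls["count"] += 1
    return calls["count"] >= 3

ready = wait_until(fake_index_ready, timeout_s=1.0, interval_s=0.01)
print("ready:", ready)
```

Returning a boolean instead of raising leaves it to the caller to decide whether a slow index build is an error or just a reason to keep waiting.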
8

The function to run Atlas Vector Search queries must perform the following actions:

  • Convert the query text to a BSON vector.

  • Define the pipeline for the Atlas Vector Search query.

Example

Placeholder
Valid Value

<NUMBER-OF-CANDIDATES-TO-CONSIDER>

Number of nearest neighbors to use during the search.

<NUMBER-OF-DOCUMENTS-TO-RETURN>

Number of documents to return in the results.

# Define a function to run a vector search query
def run_vector_search(query_text, collection, path):
    query_text_embeddings = cohere_client.embed(
        texts=[query_text],
        model="embed-english-v3.0",
        input_type="search_query",
        embedding_types=["float", "int8", "ubinary"]
    ).embeddings

    # Select the query vector and BSON type that match the indexed field
    if path == float32_field:
        query_vector = query_text_embeddings.float[0]
        vector_dtype = BinaryVectorDtype.FLOAT32
    elif path == int8_field:
        query_vector = query_text_embeddings.int8[0]
        vector_dtype = BinaryVectorDtype.INT8
    elif path == int1_field:
        query_vector = query_text_embeddings.ubinary[0]
        vector_dtype = BinaryVectorDtype.PACKED_BIT
    else:
        raise ValueError(f"Unsupported path: {path}")

    bson_query_vector = generate_bson_vector(query_vector, vector_dtype)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>, # for example, 5
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN> # for example, 2
            }
        },
        {
            '$project': {
                '_id': 0,
                'data': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]
    return collection.aggregate(pipeline)
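
Because the `$vectorSearch` stage is an ordinary dictionary, you can assemble and check it before sending it to the server. The sketch below uses a hypothetical `build_pipeline` helper. The check that `limit` doesn't exceed `numCandidates` reflects our understanding of the stage's constraints and should be confirmed against the `$vectorSearch` reference. A plain list stands in for the BSON query vector, and the index and field names are placeholders, so that the example runs without a database.

```python
def build_pipeline(index_name, path, query_vector, num_candidates, limit):
    """Return an aggregation pipeline with $vectorSearch and $project stages."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if limit > num_candidates:
        raise ValueError("limit must not exceed numCandidates")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "data": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Placeholder names and a toy vector, for illustration only.
pipeline = build_pipeline("my_index", "embedding_int8", [1, -2, 3], num_candidates=5, limit=2)
print(pipeline[0]["$vectorSearch"]["numCandidates"])
```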
9

You can run Atlas Vector Search queries programmatically. To learn more, see Run Vector Search Queries.

Example

from pprint import pprint
# Run the vector search query on the float32, int8, and int1 embeddings
query_text = "tell me a science fact"
float32_results = run_vector_search(query_text, collection, float32_field)
int8_results = run_vector_search(query_text, collection, int8_field)
int1_results = run_vector_search(query_text, collection, int1_field)
print("results from float32 embeddings")
pprint(list(float32_results))
print("--------------------------------------------------------------------------")
print("results from int8 embeddings")
pprint(list(int8_results))
print("--------------------------------------------------------------------------")
print("results from int1 embeddings")
pprint(list(int1_results))
results from float32 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
'score': 0.6578356027603149},
{'data': 'The Great Wall of China is visible from space.',
'score': 0.6420407891273499}]
--------------------------------------------------------------------------
results from int8 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
'score': 0.5149182081222534},
{'data': 'The Great Wall of China is visible from space.',
'score': 0.5136760473251343}]
--------------------------------------------------------------------------
results from int1 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
'score': 0.62109375},
{'data': 'The Great Wall of China is visible from space.',
'score': 0.61328125}]

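Work with a runnable version of this tutorial as a Python notebook.

1

Run the following command to install the PyMongo driver. If necessary, you can also install libraries from your embedding model provider. This operation might take a few minutes to complete.

pip install pymongo

You must install PyMongo driver v4.10 or later.

Example

Install PyMongo and Cohere

pip install --quiet --upgrade pymongo cohere
2

You must define functions that perform the following by using an embedding model:

  • Generate embeddings from your existing data if your existing data doesn't have any embeddings.

  • Convert the embeddings to BSON vectors.

Example

Functions to generate and convert embeddings

Placeholder
Valid Value

<COHERE-API-KEY>

API key for Cohere.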
For float32 embeddings:

import os
import pymongo
import cohere
from bson.binary import Binary, BinaryVectorDtype

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Define function to generate embeddings using the embed-english-v3.0 model
def get_embedding(text):
    response = cohere_client.embed(
        texts=[text],
        model='embed-english-v3.0',
        input_type='search_document',
        embedding_types=["float"]
    )
    embedding = response.embeddings.float[0]
    return embedding

# Define function to convert embeddings to BSON-compatible format
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

For int8 embeddings:

import os
import pymongo
import cohere
from bson.binary import Binary, BinaryVectorDtype

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Define function to generate embeddings using the embed-english-v3.0 model
def get_embedding(text):
    response = cohere_client.embed(
        texts=[text],
        model='embed-english-v3.0',
        input_type='search_document',
        embedding_types=["int8"]
    )
    embedding = response.embeddings.int8[0]
    return embedding

# Define function to convert embeddings to BSON-compatible format
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

For int1 embeddings:

import os
import pymongo
import cohere
from bson.binary import Binary, BinaryVectorDtype

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Define function to generate embeddings using the embed-english-v3.0 model
def get_embedding(text):
    response = cohere_client.embed(
        texts=[text],
        model='embed-english-v3.0',
        input_type='search_document',
        embedding_types=["ubinary"]
    )
    embedding = response.embeddings.ubinary[0]
    return embedding

# Define function to convert embeddings to BSON-compatible format
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)
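
The three variants in this step differ only in the Cohere embedding type they request and, in later steps, in the BSON vector type they convert to. As a sketch of that correspondence, the hypothetical `VECTOR_SETTINGS` table and `settings_for` helper below record the pairing used in this tutorial as plain strings. The string names mirror the `embedding_types` values and `BinaryVectorDtype` members that appear in the examples; nothing here calls Cohere or PyMongo.

```python
# Pairing used in this tutorial:
# vector kind -> (Cohere embedding type, BinaryVectorDtype member name)
VECTOR_SETTINGS = {
    "float32": ("float", "FLOAT32"),
    "int8": ("int8", "INT8"),
    "int1": ("ubinary", "PACKED_BIT"),
}


def settings_for(kind):
    """Look up the embedding type and BSON dtype name for a vector kind."""
    try:
        return VECTOR_SETTINGS[kind]
    except KeyError:
        raise ValueError(f"unsupported vector kind: {kind!r}") from None


embedding_type, dtype_name = settings_for("int1")
print(embedding_type, dtype_name)
```

With PyMongo installed, `getattr(BinaryVectorDtype, dtype_name)` would resolve the corresponding enum member, which could let a single pair of functions serve all three vector types.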
3

You must provide the following:

  • Connection string for the Atlas cluster that contains the database and collection for which you want to generate embeddings.

  • Name of the database that contains the collection for which you want to generate embeddings.

  • Name of the collection for which you want to generate embeddings.

Example

Connect to the Atlas cluster to access the data

Placeholder
Valid Value

<ATLAS-CONNECTION-STRING>

Your Atlas connection string. To learn more, see Connect via Drivers.

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<ATLAS-CONNECTION-STRING>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Filter to exclude null or empty summary fields
filter = { "summary": {"$nin": [None, ""]} }

# Get a subset of documents in the collection
documents = collection.find(filter).limit(50)

# Initialize the count of updated documents
updated_doc_count = 0
4

  1. If your data doesn't already have embeddings, generate embeddings from your data by using any embedding model. To learn more about generating embeddings from your data, see How to Create Vector Embeddings.

  2. Convert the embeddings to BSON vectors (as shown by the generate_bson_vector call in the following example).

  3. Upload the embeddings to your collection on the Atlas cluster.

These operations might take a few minutes to complete.

Example

Generate, convert, and load embeddings to the collection

For float32 embeddings:

for doc in documents:
    # Generate embeddings based on the summary
    summary = doc["summary"]
    embedding = get_embedding(summary)  # Get float32 embedding

    # Convert the float32 embedding to BSON format
    bson_float32 = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)

    # Update the document with the BSON embedding
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": bson_float32}}
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents with BSON embeddings.")

For int8 embeddings:

for doc in documents:
    # Generate embeddings based on the summary
    summary = doc["summary"]
    embedding = get_embedding(summary)  # Get int8 embedding

    # Convert the int8 embedding to BSON format
    bson_int8 = generate_bson_vector(embedding, BinaryVectorDtype.INT8)

    # Update the document with the BSON embedding
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": bson_int8}}
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents with BSON embeddings.")

For int1 embeddings:

for doc in documents:
    # Generate embeddings based on the summary
    summary = doc["summary"]
    embedding = get_embedding(summary)  # Get int1 embedding

    # Convert the int1 embedding to BSON format
    bson_int1 = generate_bson_vector(embedding, BinaryVectorDtype.PACKED_BIT)

    # Update the document with the BSON embedding
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": bson_int1}}
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents with BSON embeddings.")
5

You can create Atlas Vector Search indexes by using the Atlas UI, Atlas CLI, Atlas Administration API, and MongoDB drivers in your preferred language. To learn more, see How to Index Fields for Vector Search.

Example

Create an index for the collection

Placeholder
Valid Value

<INDEX-NAME>

Name of the vector type index.

from pymongo.operations import SearchIndexModel
import time

# Define and create the vector search index
index_name = "<INDEX-NAME>"
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "similarity": "euclidean",
                "numDimensions": 1024
            }
        ]
    },
    name=index_name,
    type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
predicate = None
if predicate is None:
    predicate = lambda index: index.get("queryable") is True
while True:
    indices = list(collection.list_search_indexes(index_name))
    if len(indices) and predicate(indices[0]):
        break
    time.sleep(5)
print(result + " is ready for querying.")

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

6

The function to run Atlas Vector Search queries must perform the following actions:

  • Generate embeddings for the query text.

  • Convert the query embedding to a BSON vector.

  • Define the pipeline for the Atlas Vector Search query.

Example

Function to run an Atlas Vector Search query

Placeholder
Valid Value

<NUMBER-OF-CANDIDATES-TO-CONSIDER>

Number of nearest neighbors to use during the search.

<NUMBER-OF-DOCUMENTS-TO-RETURN>

Number of documents to return in the results.

For float32 embeddings:

def run_vector_search(query_text, collection, path):
    query_embedding = get_embedding(query_text)
    bson_query_vector = generate_bson_vector(query_embedding, BinaryVectorDtype.FLOAT32)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>, # for example, 20
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN> # for example, 5
            }
        },
        {
            '$project': {
                '_id': 0,
                'name': 1,
                'summary': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]
    return collection.aggregate(pipeline)

For int8 embeddings:

def run_vector_search(query_text, collection, path):
    query_embedding = get_embedding(query_text)
    bson_query_vector = generate_bson_vector(query_embedding, BinaryVectorDtype.INT8)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>, # for example, 20
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN> # for example, 5
            }
        },
        {
            '$project': {
                '_id': 0,
                'name': 1,
                'summary': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]
    return collection.aggregate(pipeline)

For int1 embeddings:

def run_vector_search(query_text, collection, path):
    query_embedding = get_embedding(query_text)
    bson_query_vector = generate_bson_vector(query_embedding, BinaryVectorDtype.PACKED_BIT)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>, # for example, 20
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN> # for example, 5
            }
        },
        {
            '$project': {
                '_id': 0,
                'name': 1,
                'summary': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]
    return collection.aggregate(pipeline)
7

You can run Atlas Vector Search queries programmatically. To learn more, see Run Vector Search Queries.

Example

Run a sample Atlas Vector Search query

from pprint import pprint
query_text = "ocean view"
query_results = run_vector_search(query_text, collection, "embedding")
print("query results:")
pprint(list(query_results))
query results:
[{'name': 'Your spot in Copacabana',
'score': 0.5468248128890991,
'summary': 'Having a large airy living room. The apartment is well divided. '
'Fully furnished and cozy. The building has a 24h doorman and '
'camera services in the corridors. It is very well located, close '
'to the beach, restaurants, pubs and several shops and '
'supermarkets. And it offers a good mobility being close to the '
'subway.'},
{'name': 'Twin Bed room+MTR Mongkok shopping&My',
'score': 0.527062714099884,
'summary': 'Dining shopping conveniently located Mongkok subway E1, airport '
'shuttle bus stops A21. Three live two beds, separate WC, 24-hour '
'hot water. Free WIFI.'},
{'name': 'Quarto inteiro na Tijuca',
'score': 0.5222363471984863,
'summary': 'O quarto disponível tem uma cama de solteiro, sofá e computador '
'tipo desktop para acomodação.'},
{'name': 'Makaha Valley Paradise with OceanView',
'score': 0.5175154805183411,
'summary': 'A beautiful and comfortable 1 Bedroom Air Conditioned Condo in '
'Makaha Valley - stunning Ocean & Mountain views All the '
'amenities of home, suited for longer stays. Full kitchen & large '
"bathroom. Several gas BBQ's for all guests to use & a large "
'heated pool surrounded by reclining chairs to sunbathe. The '
'Ocean you see in the pictures is not even a mile away, known as '
'the famous Makaha Surfing Beach. Golfing, hiking,snorkeling '
'paddle boarding, surfing are all just minutes from the front '
'door.'},
{'name': 'Cozy double bed room 東涌鄉村雅緻雙人房',
'score': 0.5149975419044495,
'summary': 'A comfortable double bed room at G/F. Independent entrance. High '
'privacy. The room size is around 100 sq.ft. with a 48"x72" '
'double bed. The village house is close to the Hong Kong Airport, '
'AsiaWorld-Expo, HongKong-Zhuhai-Macau Bridge, Disneyland, '
'Citygate outlets, 360 Cable car, shopping centre, main tourist '
'attractions......'}]

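Your results might vary depending on the vector data type that you specified in the previous steps.

For an advanced demonstration of this procedure on sample data using Cohere's embed-english-v3.0 embedding model, see this notebook.

You can measure the accuracy of your Atlas Vector Search query by evaluating how closely the results of an ANN search match the results of an ENN search against your quantized vectors. That is, you can compare the ANN search results with the ENN search results for the same query criteria and measure how often the ANN search results include the nearest neighbors that the ENN search returns.

For a demonstration of evaluating your query results, see How to Measure the Accuracy of Your Query Results.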
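
As a rough illustration of that comparison, the following sketch computes the fraction of the top ENN (exact) results that also appear in the top ANN results for the same query. The `recall_at_k` helper and the document IDs are hypothetical. In practice, you would collect the `_id` values from an ANN query and an ENN query that use the same criteria.

```python
def recall_at_k(ann_ids, enn_ids, k):
    """Return the share of the top-k ENN results that the top-k ANN results contain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    expected = set(enn_ids[:k])
    if not expected:
        return 0.0
    found = expected.intersection(ann_ids[:k])
    return len(found) / len(expected)


# Made-up IDs for illustration only.
enn_ids = [101, 102, 103, 104, 105]
ann_ids = [101, 103, 102, 107, 105]
print(recall_at_k(ann_ids, enn_ids, k=5))
```

This measure ignores the order of results within the top k; averaging it over many queries gives a steadier estimate than any single query.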
