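Vector Quantization
Atlas Vector Search supports automatic quantization of your float vector embeddings (both 32-bit and 64-bit). It also supports ingesting and indexing your pre-quantized scalar and binary vectors from certain embedding models.
About Quantization
Quantization is the process of shrinking full-fidelity vectors into fewer bits. It reduces the amount of main memory required to store each vector in an Atlas Vector Search index by indexing the reduced representation vectors instead. This allows you to store more vectors or vectors with higher dimensions. Quantization therefore reduces resource consumption and improves speed. We recommend quantization for applications with a large number of vectors, such as over 100,000.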
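Scalar Quantization
Scalar quantization involves first identifying the minimum and maximum values for each dimension of the indexed vectors to establish a range of values for that dimension. Then, the range is divided into equally sized intervals or bins. Finally, each float value is mapped to a bin to convert the continuous float values into discrete integers. In Atlas Vector Search, this quantization reduces the RAM cost of the vector embeddings to about one fourth (1/3.75) of the pre-quantization cost.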
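The three steps above (find the range, divide it, map each value) can be sketched for a single dimension as follows. This is a minimal illustration that assumes 256 evenly spaced levels stored in one byte; the function name `quantizeDimension` is hypothetical, and the code is not the implementation that Atlas Vector Search uses.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeDimension maps the float values observed for one dimension onto
// 256 evenly spaced levels between the observed minimum and maximum.
// Illustrative sketch only; not the Atlas Vector Search implementation.
func quantizeDimension(values []float64) []uint8 {
	if len(values) == 0 {
		return nil
	}
	lo, hi := values[0], values[0]
	for _, v := range values {
		lo = math.Min(lo, v)
		hi = math.Max(hi, v)
	}
	out := make([]uint8, len(values))
	if hi == lo {
		return out // a constant dimension maps every value to level 0
	}
	scale := 255.0 / (hi - lo)
	for i, v := range values {
		out[i] = uint8(math.Round((v - lo) * scale))
	}
	return out
}

func main() {
	fmt.Println(quantizeDimension([]float64{-1.0, -0.5, 0.0, 1.0}))
}
```

Each quantized value occupies one byte instead of four, which is where the 4x reduction in vector values comes from.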
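Binary Quantization
Binary quantization assumes a midpoint of 0 for each dimension, which is typically appropriate for embeddings normalized to length 1, such as OpenAI's text-embedding-3-large. Each value in the vector is then compared to the midpoint and assigned a binary value of 1 if it is greater than the midpoint, or a binary value of 0 if it is less than or equal to the midpoint. In Atlas Vector Search, this quantization reduces the RAM cost of the vector embeddings to one twenty-fourth (1/24) of the pre-quantization cost. The reduction is not 1/32 because the data structure containing the Hierarchical Navigable Small Worlds graph itself, which is separate from the vector values, isn't compressed.
When you run a query, Atlas Vector Search converts the float values in the query vector into a binary vector using the same midpoint, which allows efficient comparison between the query vector and the indexed binary vectors. It then rescores: it reevaluates the candidates identified in the binary comparison by using the original float values associated with those results, which further refines the results. The full-fidelity vectors are stored in their own data structure on disk and are referenced only during rescoring when you configure binary quantization, or when you perform an exact search against binary or scalar quantized vectors.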
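The two-phase flow described above, a cheap binary comparison followed by rescoring with the original floats, can be sketched as follows. This is a simplified illustration under stated assumptions: it scans every vector linearly rather than traversing a graph, it uses Hamming distance for the binary phase and dot product for rescoring, and the names `binarize`, `hamming`, and `search` are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// binarize assigns 1 to values greater than the midpoint 0 and 0 otherwise,
// packing eight dimensions into each byte, most significant bit first.
func binarize(v []float32) []byte {
	out := make([]byte, (len(v)+7)/8)
	for i, x := range v {
		if x > 0 {
			out[i/8] |= 1 << (7 - uint(i%8))
		}
	}
	return out
}

// hamming counts the bit positions in which two packed vectors differ.
func hamming(a, b []byte) int {
	d := 0
	for i := range a {
		d += bits.OnesCount8(a[i] ^ b[i])
	}
	return d
}

// dot computes the full-fidelity dot product used for rescoring.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// search shortlists numCandidates vectors by Hamming distance on the binary
// form, rescores the shortlist with the original floats, and returns the
// positions of the top `limit` vectors.
func search(query []float32, docs [][]float32, numCandidates, limit int) []int {
	qb := binarize(query)
	idx := make([]int, len(docs))
	packed := make([][]byte, len(docs))
	for i, d := range docs {
		idx[i] = i
		packed[i] = binarize(d)
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return hamming(qb, packed[idx[a]]) < hamming(qb, packed[idx[b]])
	})
	if numCandidates < len(idx) {
		idx = idx[:numCandidates]
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return dot(query, docs[idx[a]]) > dot(query, docs[idx[b]])
	})
	if limit < len(idx) {
		idx = idx[:limit]
	}
	return idx
}

func main() {
	docs := [][]float32{
		{0.8, 0.2, -0.1, -0.5},
		{0.1, 0.9, -0.3, -0.2},
		{-0.7, -0.1, 0.5, 0.5},
		{0.5, -0.5, 0.5, -0.5},
	}
	fmt.Println(search([]float32{0.9, 0.1, -0.2, -0.4}, docs, 3, 2))
}
```

In this example the first two documents are indistinguishable in binary form; only the rescoring pass with the float values separates them.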
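Requirements
The following table shows the requirements for automatically quantizing vectors and for ingesting quantized vectors.
Note
Atlas stores all floating-point values as the double data type internally. Therefore, both 32-bit and 64-bit embeddings are compatible with automatic quantization without conversion.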
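Requirement | For int1 Ingestion | For int8 Ingestion | For Automatic Scalar Quantization | For Automatic Binary Quantization |
---|---|---|---|---|
Requires index definition settings | No | No | Yes | Yes |
Requires BSON binData format | Yes | Yes | No | No |
Storage on mongod |  |  | binData(float32), array(double) | binData(float32), array(double) |
Supported similarity method |  | cosine, euclidean, dotProduct | cosine, euclidean, dotProduct | cosine, euclidean, dotProduct |
Supported number of dimensions | Multiple of 8 | 1 to 8192 | 1 to 8192 | Multiple of 8 |
Supports ANN and ENN search | Yes | Yes | Yes | Yes |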
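As a small illustration of the dimension limits in the table, the following sketch validates a dimension count before ingestion. The function name `checkDimensions` and the type strings are hypothetical; the check covers only the int1 and int8 ingestion columns and is not part of any MongoDB driver.

```go
package main

import "fmt"

// checkDimensions validates a dimension count against the limits that the
// requirements table lists for pre-quantized vectors.
// Hypothetical helper for illustration; not part of any MongoDB driver.
func checkDimensions(vectorType string, dims int) error {
	switch vectorType {
	case "int1":
		if dims <= 0 || dims%8 != 0 {
			return fmt.Errorf("int1 vectors need a positive multiple of 8 dimensions, got %d", dims)
		}
	case "int8":
		if dims < 1 || dims > 8192 {
			return fmt.Errorf("int8 vectors need 1 to 8192 dimensions, got %d", dims)
		}
	default:
		return fmt.Errorf("unknown vector type %q", vectorType)
	}
	return nil
}

func main() {
	for _, dims := range []int{1024, 1001} {
		fmt.Printf("int1 with %d dimensions: %v\n", dims, checkDimensions("int1", dims))
	}
}
```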
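How to Enable Automatic Quantization of Vectors
You can configure Atlas Vector Search to automatically quantize float vector embeddings in your collection to reduced representation types, such as int8 (scalar) and binary, in your vector indexes.
To set or change the quantization type, specify a quantization field value of either scalar or binary in your index definition. As with any other index definition change, this triggers an index rebuild. The specified quantization type applies to all indexed vectors and query vectors at query time.
For most embedding models, we recommend binary quantization with rescoring. If you want to use lower-dimension models that are not quantization-aware-trained (QAT), use scalar quantization, because it incurs less representational loss.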
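Benefits
Atlas Vector Search provides native capabilities for scalar quantization and for binary quantization with rescoring. Automatic quantization reduces the computational resources needed to process your vectors efficiently, which increases scalability and cost savings for your applications. Automatic quantization reduces the RAM for mongot by 3.75x for scalar and by 24x for binary; the vector values shrink by 4x and 32x respectively, but the Hierarchical Navigable Small Worlds graph itself does not shrink. This improves performance, even at the highest volume and scale.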
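The 4x and 32x figures for the vector values follow directly from the number of bits stored per dimension. The sketch below works this out for an assumed workload of one million 1024-dimension vectors (both numbers are illustrative). It counts raw vector values only; the graph is excluded, which is why the overall RAM reductions quoted above are smaller than 4x and 32x.

```go
package main

import "fmt"

// vectorBytes returns the bytes needed for the raw vector values only.
// The Hierarchical Navigable Small Worlds graph is not included.
func vectorBytes(numVectors, dims, bitsPerValue int64) int64 {
	return numVectors * dims * bitsPerValue / 8
}

func main() {
	const n, dims = 1_000_000, 1024 // illustrative workload, not a recommendation
	full := vectorBytes(n, dims, 32)
	scalar := vectorBytes(n, dims, 8)
	binary := vectorBytes(n, dims, 1)
	fmt.Printf("float32: %d bytes\n", full)
	fmt.Printf("scalar:  %d bytes (%dx smaller)\n", scalar, full/scalar)
	fmt.Printf("binary:  %d bytes (%dx smaller)\n", binary, full/binary)
}
```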
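Use Cases
We recommend automatic quantization if you have a large number of full-fidelity vectors, typically over 100,000 vectors. After quantization, you index reduced representation vectors without compromising accuracy when retrieving vectors.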
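Procedure
To enable automatic quantization:
Specify the type of quantization you want in your Atlas Vector Search index.
In a new or existing Atlas Vector Search index, specify one of the following quantization types in the fields.quantization field of your index definition:
scalar: produces byte vectors from float input vectors.
binary: produces bit vectors from float input vectors.
If you specify automatic quantization on data that is not an array of float values, Atlas Vector Search silently skips that vector instead of indexing it. Because Atlas stores float values (both 32-bit and 64-bit) as the double type internally, embeddings from models that output either precision work with automatic quantization.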
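As an illustration of where the quantization setting sits, the following sketch builds an index definition as JSON. The field names type, path, numDimensions, similarity, and quantization follow the index definition described on this page; the path `embedding`, the dimension count, and the helper name `buildDefinition` are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vectorField mirrors one entry of the "fields" array in an index definition.
type vectorField struct {
	Type          string `json:"type"`
	Path          string `json:"path"`
	NumDimensions int    `json:"numDimensions"`
	Similarity    string `json:"similarity"`
	Quantization  string `json:"quantization,omitempty"`
}

type indexDefinition struct {
	Fields []vectorField `json:"fields"`
}

// buildDefinition returns the JSON index definition for a single vector
// field. Hypothetical helper; it accepts only the two quantization types
// named on this page, or an empty string to omit the setting.
func buildDefinition(path string, dims int, similarity, quantization string) (string, error) {
	switch quantization {
	case "", "scalar", "binary":
	default:
		return "", fmt.Errorf("unsupported quantization type %q", quantization)
	}
	def := indexDefinition{Fields: []vectorField{{
		Type:          "vector",
		Path:          path,
		NumDimensions: dims,
		Similarity:    similarity,
		Quantization:  quantization,
	}}}
	b, err := json.Marshal(def)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	def, err := buildDefinition("embedding", 1024, "dotProduct", "binary")
	if err != nil {
		panic(err)
	}
	fmt.Println(def)
}
```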
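Create or update the index.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
The specified quantization type applies to all indexed vectors and query vectors at query time.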
Considerations
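When you view your quantized index in the Atlas UI, the index size might appear larger than that of an index without quantization. This is because the Size metric represents the total data stored, which includes the Hierarchical Navigable Small Worlds graph (in memory), the quantized vectors (in memory), and the full-fidelity vectors (on disk). To estimate the amount of memory the index uses at query time, refer to the Required Memory metric.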
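How to Ingest Pre-Quantized Vectors
Atlas Vector Search also supports ingesting and indexing scalar and binary quantized vectors from certain embedding models. If you don't already have quantized vectors, you can convert your embeddings to BSON BinData vectors with the float32, int1, or int8 subtype.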
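For int1 vectors, eight dimensions share one byte. The following sketch shows the packing idea, including the padding count needed when the number of bits is not a multiple of 8. It assumes most-significant-bit-first order, and `packBits` is a hypothetical name; in practice the driver performs the conversion for you.

```go
package main

import "fmt"

// packBits packs 0/1 values into bytes, most significant bit first, and
// reports how many unused padding bits remain in the final byte.
// Illustrative sketch; a supported driver performs the real conversion.
func packBits(bitValues []uint8) (data []byte, padding uint8) {
	data = make([]byte, (len(bitValues)+7)/8)
	for i, b := range bitValues {
		if b != 0 {
			data[i/8] |= 1 << (7 - uint(i%8))
		}
	}
	padding = uint8(len(data)*8 - len(bitValues))
	return data, padding
}

func main() {
	data, padding := packBits([]uint8{1, 0, 1, 1, 0, 0, 0, 1, 1, 1})
	fmt.Printf("%08b padding=%d\n", data, padding)
}
```

A 1024-dimension binary embedding fills exactly 128 bytes, so its padding is 0.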
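Use Cases
We recommend ingesting quantized BSON binData vectors for the following use cases:
You need to index quantized vector output from embedding models.
You have a large number of float vectors and want to reduce the storage and WiredTiger footprint (such as disk and memory usage) in mongod.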
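Benefits
BinData is a BSON data type that stores binary data. It compresses your vector embeddings and requires about three times less disk space in your cluster than embeddings stored as a standard float32 array. To learn more, see Vector Compression.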
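A rough way to see where a factor of about three comes from is to compare the uncompressed BSON encodings. An array stores each value as an 8-byte double plus a type byte and an index key, while a binData vector stores 4 bytes per float32 value after a short header. The sketch below computes both sizes under those assumptions; it ignores block compression in WiredTiger, so actual disk usage will differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// arrayOfDoublesSize returns the uncompressed BSON size of an array value
// holding dims doubles: int32 length, the elements, and a terminating null.
func arrayOfDoublesSize(dims int) int {
	size := 4 + 1
	for i := 0; i < dims; i++ {
		// type byte + index key as a C string + 8-byte double
		size += 1 + len(strconv.Itoa(i)) + 1 + 8
	}
	return size
}

// binDataFloat32Size returns the uncompressed BSON size of a binData vector
// value: int32 length, subtype byte, 2-byte vector header, 4 bytes per value.
func binDataFloat32Size(dims int) int {
	return 4 + 1 + 2 + 4*dims
}

func main() {
	const dims = 1024
	arr, bin := arrayOfDoublesSize(dims), binDataFloat32Size(dims)
	fmt.Printf("array of doubles: %d bytes\n", arr)
	fmt.Printf("binData float32:  %d bytes\n", bin)
	fmt.Printf("ratio: %.2f\n", float64(arr)/float64(bin))
}
```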
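This subtype also allows you to index your vectors with alternate types such as int1 or int8 vectors, which reduces the memory needed to build the Atlas Vector Search index for your collection. It reduces the RAM for mongot by 3.75x for scalar and by 24x for binary; the vector values shrink by 4x and 32x respectively, but the Hierarchical Navigable Small Worlds graph itself doesn't shrink.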
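If you don't already have binData vectors, you can convert your embeddings to this format by using any supported driver before writing your data to a collection. The following procedure walks you through the steps for converting your embeddings to BinData vectors with the float32, int8, and int1 subtypes.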
Supported Drivers
The following drivers support BSON BinData vectors with the float32, int1, and int8 subtypes:
Go Driver v2.1.0 or later
PyMongo Driver v4.10 or later
Node.js Driver v6.11 or later
Java Driver v5.3.1 or later
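Prerequisites
To quantize your BSON binData vectors, you must have the following:
An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later.
Ensure that your IP address is included in your Atlas project's access list.
Access to an embedding model that supports byte vector output.
The outputs from the following embedding models can be used to generate BSON binData vectors with a supported MongoDB driver:
voyage-3-large
embed-english-v3.0
nomic-embed-text-v1.5
jina-embeddings-v2-base-en
mxbai-embed-large-v1
Scalar quantization preserves recall for these models because they are all trained to be quantization aware. Therefore, recall degradation for scalar quantized embeddings produced by these models is minimal, even at lower dimensions such as 384.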
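A terminal and code editor to run your Go project.
Go installed.
Java Development Kit (JDK) version 8 or later.
An environment to set up and run a Java application. We recommend that you use an integrated development environment (IDE) such as IntelliJ IDEA or Eclipse IDE to configure Maven or Gradle to build and run your project.
A terminal and code editor to run your Node.js project.
npm and Node.js installed.
An environment to run interactive Python notebooks, such as VS Code or Colab.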
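Procedure
The examples in this procedure use either new data or existing data, together with embeddings generated by using Cohere's embed-english-v3.0 model. The example for new data uses sample text strings, which you can replace with your own data. The example for existing data uses a subset of documents without any embeddings from the listingsAndReviews collection in the sample_airbnb database, which you can replace with your own database and collection (with or without any embeddings).
Select the tab based on whether you want to quantize binData vectors for new data or for data you already have in your Atlas cluster.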
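Install the required libraries.
Run the following command to install the MongoDB Go Driver. This operation might take a few minutes to complete.
go get go.mongodb.org/mongo-driver/v2/mongo
You must install Go Driver v2.1 or later. If necessary, you can also install libraries from your embedding model provider. For example, to generate float32, int8, and int1 embeddings by using Cohere as demonstrated on this page, install Cohere:
go get github.com/cohere-ai/cohere-go/v2/client@v2.13.0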
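Set the environment variables in your terminal.
To access the embedding model provider for generating and converting embeddings, set the environment variable for the embedding model provider's API key, if necessary.
To use embeddings from Cohere, set the COHERE_API_KEY environment variable.
export COHERE_API_KEY="<COHERE-API-KEY>"
To access your Atlas cluster, set the MONGODB_URI environment variable.
export MONGODB_URI="<CONNECTION-STRING>"
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net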
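(Conditional) Generate embeddings from your data.
You can use an embedding model provider to generate float32, int8, and int1 embeddings for your data and then use the MongoDB Go driver to convert your native vector embeddings to BSON vectors. The following sample code uses Cohere's embed API to generate full-precision vectors.
Create a new file named GenerateAndConvertEmbeddings.go in your Go project.
touch GenerateAndConvertEmbeddings.go
Copy and paste the following code into the GenerateAndConvertEmbeddings.go file. This code does the following:
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Go driver.
Creates a file named embeddings.json and saves the data with embeddings in the file.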
GenerateAndConvertEmbeddings.go

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"

	cohere "github.com/cohere-ai/cohere-go/v2"
	cohereclient "github.com/cohere-ai/cohere-go/v2/client"
	"go.mongodb.org/mongo-driver/v2/bson"
)

// Sample data for embedding
var data = []string{
	"The Great Wall of China is visible from space.",
	"The Eiffel Tower was completed in Paris in 1889.",
	"Mount Everest is the highest peak on Earth at 8,848m.",
	"Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
	"The Mona Lisa was painted by Leonardo da Vinci.",
}

func main() {
	apiKey := os.Getenv("COHERE_API_KEY")
	if apiKey == "" {
		log.Fatal("Ensure COHERE_API_KEY is set.")
	}

	client := cohereclient.NewClient(cohereclient.WithToken(apiKey))

	embeddings, err := fetchEmbeddingsFromCohere(client)
	if err != nil {
		log.Fatalf("Error fetching embeddings: %v", err)
	}

	documents := convertEmbeddingsToBSON(data, embeddings)

	err = writeJSONToFile("embeddings.json", documents)
	if err != nil {
		log.Fatalf("Error writing embeddings to file: %v", err)
	}

	fmt.Println("Embeddings successfully stored in embeddings.json")
}

// Fetch embeddings using Cohere API
func fetchEmbeddingsFromCohere(client *cohereclient.Client) (*cohere.EmbedByTypeResponse, error) {
	model := "embed-english-v3.0"
	response, err := client.V2.Embed(context.TODO(), &cohere.V2EmbedRequest{
		Texts:     data,
		Model:     model,
		InputType: cohere.EmbedInputTypeSearchDocument,
		EmbeddingTypes: []cohere.EmbeddingType{
			cohere.EmbeddingTypeFloat,
			cohere.EmbeddingTypeInt8,
			cohere.EmbeddingTypeUbinary,
		},
	})
	if err != nil {
		return nil, fmt.Errorf("failed to fetch embeddings: %w", err)
	}
	return response, nil
}

// Convert embeddings to BSON binary vectors
func convertEmbeddingsToBSON(sentences []string, embeddings *cohere.EmbedByTypeResponse) []bson.M {
	var documents []bson.M

	for i, sentence := range sentences {
		float32Emb := convertFloat64ToFloat32(embeddings.Embeddings.Float[i])
		int8Emb := convertIntToInt8(embeddings.Embeddings.Int8[i])
		ubinaryEmb := convertIntToBytes(embeddings.Embeddings.Ubinary[i])

		floatVector := bson.NewVector(float32Emb)
		int8Vector := bson.NewVector(int8Emb)
		ubinaryVector, err := bson.NewPackedBitVector(ubinaryEmb, 0)
		if err != nil {
			log.Fatalf("Error creating PackedBitVector: %v", err)
		}

		document := bson.M{
			"text":               sentence,
			"embeddings_float32": floatVector.Binary(),
			"embeddings_int8":    int8Vector.Binary(),
			"embeddings_int1":    ubinaryVector.Binary(),
		}
		documents = append(documents, document)
	}

	return documents
}

// Write JSON file from in-memory BSON documents
func writeJSONToFile(filename string, documents []bson.M) error {
	file, err := os.Create(filename)
	if err != nil {
		return fmt.Errorf("failed to create file: %w", err)
	}
	defer file.Close()

	var jsonData []json.RawMessage
	for _, document := range documents {
		jsonBytes, err := bson.MarshalExtJSON(document, false, false)
		if err != nil {
			return fmt.Errorf("error marshaling BSON to JSON: %w", err)
		}
		jsonData = append(jsonData, jsonBytes)
	}

	marshaledData, err := json.MarshalIndent(jsonData, "", " ")
	if err != nil {
		return fmt.Errorf("failed to marshal JSON: %w", err)
	}

	_, err = file.Write(marshaledData)
	if err != nil {
		return fmt.Errorf("failed to write JSON to file: %w", err)
	}

	return nil
}

// Convert a slice of float64 to a slice of float32
func convertFloat64ToFloat32(f64s []float64) []float32 {
	f32s := make([]float32, len(f64s))
	for i, v := range f64s {
		f32s[i] = float32(v)
	}
	return f32s
}

// Convert a slice of int to a slice of int8
func convertIntToInt8(ints []int) []int8 {
	ints8 := make([]int8, len(ints))
	for i, val := range ints {
		ints8[i] = int8(val)
	}
	return ints8
}

// Convert a slice of int to a slice of bytes
func convertIntToBytes(ints []int) []byte {
	bytes := make([]byte, len(ints))
	for i, val := range ints {
		bytes[i] = byte(val)
	}
	return bytes
}
```

Replace the following placeholder value in the code and save the file.
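COHERE_API_KEY
Your Cohere API key, only if you didn't set the environment variable.
Run the program using the following command.
If you are using a terminal, run the following command to compile and execute your program.
go run GenerateAndConvertEmbeddings.go
Embeddings successfully stored in embeddings.json
Verify the embeddings in the embeddings.json file.
To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.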
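The conversion helpers in the program above cast each int to int8 or byte without checking its range, so an out-of-range value would wrap around silently. If you adapt the code to a provider whose output range you have not verified, a checked conversion along the following lines may be safer. The function names `toInt8` and `toBytes` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// toInt8 converts ints to int8 and fails on values outside -128..127
// instead of silently wrapping around.
func toInt8(values []int) ([]int8, error) {
	out := make([]int8, len(values))
	for i, v := range values {
		if v < math.MinInt8 || v > math.MaxInt8 {
			return nil, fmt.Errorf("value %d at position %d does not fit in int8", v, i)
		}
		out[i] = int8(v)
	}
	return out, nil
}

// toBytes converts ints to bytes and fails on values outside 0..255.
func toBytes(values []int) ([]byte, error) {
	out := make([]byte, len(values))
	for i, v := range values {
		if v < 0 || v > math.MaxUint8 {
			return nil, fmt.Errorf("value %d at position %d does not fit in a byte", v, i)
		}
		out[i] = byte(v)
	}
	return out, nil
}

func main() {
	if _, err := toInt8([]int{12, -128, 200}); err != nil {
		fmt.Println("conversion rejected:", err)
	}
}
```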
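Ingest the data and create an Atlas Vector Search index.
You must upload your data and embeddings to a collection in your Atlas cluster and create an Atlas Vector Search index on the data to run $vectorSearch queries against the data.
Create a new file named UploadDataAndCreateIndex.go in your Go project.
touch UploadDataAndCreateIndex.go
Copy and paste the following code into the UploadDataAndCreateIndex.go file. This code does the following:
Uploads the float32, int8, and int1 embeddings in the embeddings.json file to your Atlas cluster.
Creates an Atlas Vector Search index on the embeddings_float32, embeddings_int8, and embeddings_int1 fields.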
UploadDataAndCreateIndex.go

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"go.mongodb.org/mongo-driver/v2/bson"
	"go.mongodb.org/mongo-driver/v2/mongo"
	"go.mongodb.org/mongo-driver/v2/mongo/options"
)

var (
	mongodbURI          = os.Getenv("MONGODB_URI")
	dbName              = "<DATABASE-NAME>"
	collectionName      = "<COLLECTION-NAME>"
	indexName           = "<INDEX-NAME>"
	numberOfDimensions  = 1024
	embeddingFields     = []string{"embeddings_float32", "embeddings_int8", "embeddings_int1"}
	embeddingSimilarity = []string{"dotProduct", "dotProduct", "euclidean"}
)

func main() {
	clientOpts := options.Client().ApplyURI(mongodbURI)
	client, err := mongo.Connect(clientOpts)
	if err != nil {
		log.Fatalf("Failed to connect to MongoDB: %v", err)
	}

	defer func() {
		if err := client.Disconnect(context.TODO()); err != nil {
			log.Fatalf("Failed to disconnect MongoDB client: %v", err)
		}
	}()

	storeEmbeddings(client)
	setupVectorSearchIndex(client)
}

// Reads JSON data, stores it in MongoDB
func storeEmbeddings(client *mongo.Client) {
	database := client.Database(dbName)
	collection := database.Collection(collectionName)

	// os.ReadFile replaces the deprecated ioutil.ReadFile
	data, err := os.ReadFile("embeddings.json")
	if err != nil {
		log.Fatalf("Failed to read file: %v", err)
	}

	var documents []bson.M
	if err := bson.UnmarshalExtJSON(data, false, &documents); err != nil {
		log.Fatalf("Failed to unmarshal JSON data: %v", err)
	}

	if _, err := collection.InsertMany(context.TODO(), documents); err != nil {
		log.Fatalf("Failed to insert documents: %v", err)
	}

	fmt.Println("Inserted documents into MongoDB")
}

// Sets up vector search index in MongoDB
func setupVectorSearchIndex(client *mongo.Client) {
	database := client.Database(dbName)
	collection := database.Collection(collectionName)

	ctx := context.TODO()

	type vectorDefinitionField struct {
		Type          string `bson:"type"`
		Path          string `bson:"path"`
		NumDimensions int    `bson:"numDimensions"`
		Similarity    string `bson:"similarity"`
	}

	type vectorDefinition struct {
		Fields []vectorDefinitionField `bson:"fields"`
	}

	fields := make([]vectorDefinitionField, len(embeddingFields))
	for i, field := range embeddingFields {
		fields[i] = vectorDefinitionField{
			Type:          "vector",
			Path:          field,
			NumDimensions: numberOfDimensions,
			Similarity:    embeddingSimilarity[i],
		}
	}
	fmt.Println(fields)

	opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

	indexModel := mongo.SearchIndexModel{
		Definition: vectorDefinition{
			Fields: fields,
		},
		Options: opts,
	}

	// Create the index
	log.Println("Creating the index.")
	searchIndexName, err := collection.SearchIndexes().CreateOne(ctx, indexModel)
	if err != nil {
		log.Fatalf("Failed to create the search index: %v", err)
	}

	// Polling to confirm successful index creation
	log.Println("Polling to confirm successful index creation.")
	log.Println("NOTE: This may take up to a minute.")
	searchIndexes := collection.SearchIndexes()
	var doc bson.Raw

	for doc == nil {
		cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
		if err != nil {
			log.Fatalf("failed to list search indexes: %v", err)
		}

		if !cursor.Next(ctx) {
			break
		}

		name := cursor.Current.Lookup("name").StringValue()
		queryable := cursor.Current.Lookup("queryable").Boolean()
		if name == searchIndexName && queryable {
			doc = cursor.Current
		} else {
			time.Sleep(5 * time.Second)
		}
	}

	log.Println("Name of Index Created: " + searchIndexName)
}
```

Replace the following with valid values in the code and save the file.
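MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
<DATABASE-NAME>
Name of the database in your Atlas cluster.
<COLLECTION-NAME>
Name of the collection where you want to ingest the data.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.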
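Run the program using the following command.
If you are using a terminal, run the following command to compile and execute your program.
go run UploadDataAndCreateIndex.go
Inserted documents into MongoDB
Creating the index.
Polling to confirm successful index creation.
NOTE: This may take up to a minute.
Name of Index Created: <INDEX-NAME>
Log in to your Atlas cluster and verify the following:
Data in the namespace.
Atlas Vector Search index for the collection.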
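Create and run a query against the collection.
To test your embeddings, you can run a query against your collection. Use an embedding model provider to generate float32, int8, and int1 embeddings for your query text. The following sample code uses Cohere's embed API to generate full-precision vectors. After generating the embeddings, use the MongoDB Go driver to convert your native vector embeddings to BSON binary vectors and run a $vectorSearch query against the collection.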
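For orientation, the $vectorSearch stage that the query program sends has the shape sketched below. The field names (index, path, queryVector, numCandidates, limit) are the ones used in the program on this page. The index name and the two-value query vector are placeholders, and the vector is shown as a plain float array for readability, whereas the program passes a BSON binary vector.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vectorSearch lists the $vectorSearch options used on this page.
type vectorSearch struct {
	Index         string    `json:"index"`
	Path          string    `json:"path"`
	QueryVector   []float32 `json:"queryVector"`
	NumCandidates int       `json:"numCandidates"`
	Limit         int       `json:"limit"`
}

type stage struct {
	VectorSearch vectorSearch `json:"$vectorSearch"`
}

// buildStage renders one $vectorSearch stage as JSON for inspection.
// Hypothetical helper; the driver builds the stage from bson.D values.
func buildStage(index, path string, query []float32, numCandidates, limit int) (string, error) {
	b, err := json.Marshal(stage{VectorSearch: vectorSearch{
		Index:         index,
		Path:          path,
		QueryVector:   query,
		NumCandidates: numCandidates,
		Limit:         limit,
	}})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := buildStage("vector_index", "embeddings_float32", []float32{0.5, -0.25}, 5, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```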
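Create a new file named CreateEmbeddingsAndRunQuery.go in your Go project.
touch CreateEmbeddingsAndRunQuery.go
Copy and paste the following code into the CreateEmbeddingsAndRunQuery.go file. This code does the following:
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Go driver.
Runs the query against your collection and returns the results.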
CreateEmbeddingsAndRunQuery.go

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	cohere "github.com/cohere-ai/cohere-go/v2"
	cohereclient "github.com/cohere-ai/cohere-go/v2/client"
	"go.mongodb.org/mongo-driver/v2/bson"
	"go.mongodb.org/mongo-driver/v2/mongo"
	"go.mongodb.org/mongo-driver/v2/mongo/options"
)

const (
	dbName          = "<DATABASE-NAME>"
	collectionName  = "<COLLECTION-NAME>"
	vectorIndexName = "<INDEX-NAME>"
	dataFieldName   = "<TEXT-FIELD-NAME>"
	queryText       = "<QUERY-TEXT>"
)

func main() {
	apiKey := os.Getenv("COHERE_API_KEY")
	mongodbURI := os.Getenv("MONGODB_URI")

	if apiKey == "" {
		log.Fatal("API key not found. Set COHERE_API_KEY in your environment.")
	}
	if mongodbURI == "" {
		log.Fatal("MongoDB URI not found. Set MONGODB_URI in your environment.")
	}

	embeddingsData, err := generateAndConvertEmbeddings(apiKey, queryText)
	if err != nil {
		log.Fatalf("Error generating embeddings: %v", err)
	}

	err = runVectorSearchQuery(mongodbURI, embeddingsData)
	if err != nil {
		log.Fatalf("Error running vector search query: %v", err)
	}
}

// Generate embeddings using Cohere's embed API from the query text
func generateAndConvertEmbeddings(apiKey, text string) (map[string]bson.Binary, error) {
	client := cohereclient.NewClient(cohereclient.WithToken(apiKey))

	model := "embed-english-v3.0"
	response, err := client.V2.Embed(context.TODO(), &cohere.V2EmbedRequest{
		Texts:     []string{text},
		Model:     model,
		InputType: cohere.EmbedInputTypeSearchQuery,
		EmbeddingTypes: []cohere.EmbeddingType{
			cohere.EmbeddingTypeFloat,
			cohere.EmbeddingTypeInt8,
			cohere.EmbeddingTypeUbinary,
		},
	})
	if err != nil {
		return nil, fmt.Errorf("failed to fetch embeddings: %w", err)
	}

	if response.Embeddings == nil || len(response.Embeddings.Float) == 0 {
		return nil, fmt.Errorf("no embeddings found in the API response")
	}

	return createBSONVectorEmbeddings(response.Embeddings), nil
}

// Convert embeddings to BSON vectors using MongoDB Go Driver
func createBSONVectorEmbeddings(embeddings *cohere.EmbedByTypeResponseEmbeddings) map[string]bson.Binary {
	bsonVectorEmbeddings := make(map[string]bson.Binary)

	// Convert float embeddings
	if len(embeddings.Float) > 0 {
		floatData := convertFloat64ToFloat32(embeddings.Float[0])
		floatVector := bson.NewVector(floatData) // bson.NewVector creates a BSON vector
		bsonVectorEmbeddings["float32"] = floatVector.Binary()
	}

	// Convert int8 embeddings
	if len(embeddings.Int8) > 0 {
		int8Data := convertIntToInt8(embeddings.Int8[0])
		int8Vector := bson.NewVector(int8Data) // same constructor, int8 data
		bsonVectorEmbeddings["int8"] = int8Vector.Binary()
	}

	// Convert ubinary embeddings to a packed bit vector
	if len(embeddings.Ubinary) > 0 {
		int1Data := convertIntToBytes(embeddings.Ubinary[0])
		// Check the error instead of discarding it
		ubinaryVector, err := bson.NewPackedBitVector(int1Data, 0)
		if err != nil {
			log.Fatalf("Error creating PackedBitVector: %v", err)
		}
		bsonVectorEmbeddings["int1"] = ubinaryVector.Binary()
	}

	return bsonVectorEmbeddings
}

// Run $vectorSearch query using the embeddings
func runVectorSearchQuery(mongodbURI string, embeddingsData map[string]bson.Binary) error {
	ctx := context.Background()
	clientOptions := options.Client().ApplyURI(mongodbURI)
	client, err := mongo.Connect(clientOptions)
	if err != nil {
		return fmt.Errorf("failed to connect to MongoDB: %w", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()

	db := client.Database(dbName)
	collection := db.Collection(collectionName)

	// Map the correct paths for embeddings
	pathMap := map[string]string{
		"float32": "embeddings_float32",
		"int8":    "embeddings_int8",
		"int1":    "embeddings_int1",
	}

	for pathKey, queryVector := range embeddingsData {
		path, ok := pathMap[pathKey]
		if !ok {
			return fmt.Errorf("invalid path key: %s", pathKey)
		}

		pipeline := mongo.Pipeline{
			{
				{"$vectorSearch", bson.D{
					{"queryVector", queryVector},
					{"index", vectorIndexName},
					{"path", path},
					{"numCandidates", 5},
					{"limit", 2},
				}},
			},
			{
				{"$project", bson.D{
					{"_id", 1},
					{dataFieldName, 1},
					{"score", bson.D{
						{"$meta", "vectorSearchScore"},
					}},
				}},
			},
		}

		cursor, err := collection.Aggregate(context.Background(), pipeline)
		if err != nil {
			return fmt.Errorf("failed to run vector search aggregation query: %w", err)
		}
		defer cursor.Close(ctx)

		var results []bson.M
		if err = cursor.All(context.Background(), &results); err != nil {
			return fmt.Errorf("failed to parse aggregation query results: %w", err)
		}

		fmt.Printf("Results from %v embeddings:\n", path)
		for _, result := range results {
			fmt.Println(result)
		}
	}

	return nil
}

// Utility functions to handle list conversion
func convertFloat64ToFloat32(f64s []float64) []float32 {
	f32s := make([]float32, len(f64s))
	for i, v := range f64s {
		f32s[i] = float32(v)
	}
	return f32s
}

func convertIntToInt8(ints []int) []int8 {
	ints8 := make([]int8, len(ints))
	for i, val := range ints {
		ints8[i] = int8(val)
	}
	return ints8
}

func convertIntToBytes(ints []int) []byte {
	bytes := make([]byte, len(ints))
	for i, val := range ints {
		bytes[i] = byte(val)
	}
	return bytes
}
```

Replace the following placeholder values in the code and save the file.
MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
COHERE_API_KEY
Your Cohere API key, only if you didn't set the environment variable.
<DATABASE-NAME>
Name of the database in your Atlas cluster.
<COLLECTION-NAME>
Name of the collection where you ingested the data.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
<TEXT-FIELD-NAME>
Name of the field that contains the text from which you generated the embeddings.
<QUERY-TEXT>
Text for the query. For this example, use science fact.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following command to compile and execute your program.
go run CreateEmbeddingsAndRunQuery.go
Results from embeddings_int1 embeddings:
{"_id":{"$oid":"68129070e7f516cc19658bc9"},"text":"Mount Everest is the highest peak on Earth at 8,848m.","score":{"$numberDouble":"0.642578125"}}
{"text":"The Great Wall of China is visible from space.","score":{"$numberDouble":"0.61328125"},"_id":{"$oid":"68129070e7f516cc19658bc7"}}
Results from embeddings_float32 embeddings:
{"_id":{"$oid":"68129070e7f516cc19658bc9"},"text":"Mount Everest is the highest peak on Earth at 8,848m.","score":{"$numberDouble":"0.6583383083343506"}}
{"_id":{"$oid":"68129070e7f516cc19658bc7"},"text":"The Great Wall of China is visible from space.","score":{"$numberDouble":"0.6536108255386353"}}
Results from embeddings_int8 embeddings:
{"_id":{"$oid":"68129070e7f516cc19658bc9"},"text":"Mount Everest is the highest peak on Earth at 8,848m.","score":{"$numberDouble":"0.5149773359298706"}}
{"_id":{"$oid":"68129070e7f516cc19658bc7"},"text":"The Great Wall of China is visible from space.","score":{"$numberDouble":"0.5146723985671997"}}
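Install the required libraries.
Run the following command to install the MongoDB Go Driver. This operation might take a few minutes to complete.
go get go.mongodb.org/mongo-driver/v2/mongo
You must install Go Driver v2.1 or later. If necessary, you can also install libraries from your embedding model provider. For example, to generate float32, int8, and int1 embeddings by using Cohere as demonstrated on this page, install Cohere:
go get github.com/cohere-ai/cohere-go/v2/client@v2.13.0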
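Set the environment variables in your terminal.
To access the embedding model provider for generating and converting embeddings, set the environment variable for the embedding model provider's API key, if necessary.
To use embeddings from Cohere, set the COHERE_API_KEY environment variable.
export COHERE_API_KEY="<COHERE-API-KEY>"
To access your Atlas cluster, set the MONGODB_URI environment variable.
export MONGODB_URI="<CONNECTION-STRING>"
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net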
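(Conditional) Generate embeddings from your data.
You can use an embedding model provider to generate float, int8, and int1 embeddings for your data and then use the MongoDB Go driver to convert your native vector embeddings to BSON vectors. The following sample code uses Cohere's embed API to generate full-precision vectors from the data in the sample_airbnb.listingsAndReviews namespace.
Create a new file named GenerateAndConvertEmbeddings.go in your Go project.
touch GenerateAndConvertEmbeddings.go
Copy and paste the following code into the GenerateAndConvertEmbeddings.go file. This code does the following:
Gets the summary field from 50 documents in the sample_airbnb.listingsAndReviews namespace.
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Go driver.
Creates a file named embeddings.json and saves the data with embeddings in the file.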
GenerateAndConvertEmbeddings.go

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"

	cohere "github.com/cohere-ai/cohere-go/v2"
	cohereclient "github.com/cohere-ai/cohere-go/v2/client"
	"go.mongodb.org/mongo-driver/v2/bson"
	"go.mongodb.org/mongo-driver/v2/mongo"
	"go.mongodb.org/mongo-driver/v2/mongo/options"
)

const (
	batchSize  = 96
	dbName     = "sample_airbnb"
	collName   = "listingsAndReviews"
	embedModel = "embed-english-v3.0"
)

func main() {
	apiKey := os.Getenv("COHERE_API_KEY")
	mongodbURI := os.Getenv("MONGODB_URI")

	if apiKey == "" || mongodbURI == "" {
		log.Fatal("Ensure COHERE_API_KEY and MONGODB_URI are set.")
	}

	summaries, err := fetchSummariesFromMongoDB(mongodbURI)
	if err != nil {
		log.Fatalf("Error fetching summaries: %v", err)
	}

	client := cohereclient.NewClient(cohereclient.WithToken(apiKey))

	var documents []bson.M
	for start := 0; start < len(summaries); start += batchSize {
		end := start + batchSize
		if end > len(summaries) {
			end = len(summaries)
		}

		embeddingsData, err := fetchEmbeddingsFromCohere(client, summaries[start:end])
		if err != nil {
			log.Fatalf("Error fetching embeddings: %v", err)
		}

		if embeddingsData.Embeddings == nil {
			continue
		}

		documents = append(documents, convertCohereResponseToStructs(summaries[start:end], embeddingsData)...)
	}

	// Write once after all batches so that a later batch cannot overwrite an earlier one
	if err := writeJSONToFile("embeddings.json", documents); err != nil {
		log.Fatalf("Error writing embeddings to JSON: %v", err)
	}

	fmt.Println("Embeddings successfully saved to embeddings.json")
}

// Fetch non-empty summary fields from up to 50 documents
func fetchSummariesFromMongoDB(uri string) ([]string, error) {
	ctx := context.TODO()
	clientOpts := options.Client().ApplyURI(uri)

	client, err := mongo.Connect(clientOpts)
	if err != nil {
		return nil, fmt.Errorf("failed to connect to MongoDB: %w", err)
	}
	defer func() {
		if err := client.Disconnect(ctx); err != nil {
			log.Fatalf("Failed to disconnect MongoDB client: %v", err)
		}
	}()

	collection := client.Database(dbName).Collection(collName)
	filter := bson.M{"summary": bson.M{"$nin": []interface{}{nil, ""}}}

	cursor, err := collection.Find(ctx, filter, options.Find().SetLimit(50))
	if err != nil {
		return nil, fmt.Errorf("error finding documents: %w", err)
	}
	defer cursor.Close(ctx)

	var summaries []string
	for cursor.Next(ctx) {
		var result struct {
			Summary string
		}
		if err := cursor.Decode(&result); err != nil {
			return nil, fmt.Errorf("error decoding document: %w", err)
		}
		if summary := result.Summary; summary != "" {
			summaries = append(summaries, summary)
		}
	}

	if err := cursor.Err(); err != nil {
		return nil, fmt.Errorf("cursor error: %w", err)
	}

	return summaries, nil
}

// Fetch embeddings for one batch of texts using Cohere API
func fetchEmbeddingsFromCohere(client *cohereclient.Client, batchData []string) (*cohere.EmbedByTypeResponse, error) {
	response, err := client.V2.Embed(context.TODO(), &cohere.V2EmbedRequest{
		Texts:     batchData,
		Model:     embedModel,
		InputType: cohere.EmbedInputTypeSearchDocument,
		EmbeddingTypes: []cohere.EmbeddingType{
			cohere.EmbeddingTypeFloat,
			cohere.EmbeddingTypeInt8,
			cohere.EmbeddingTypeUbinary,
		},
	})
	if err != nil {
		return nil, fmt.Errorf("failed to fetch embeddings: %w", err)
	}

	return response, nil
}

// Convert embeddings to BSON binary vectors
func convertCohereResponseToStructs(summaries []string, embeddings *cohere.EmbedByTypeResponse) []bson.M {
	var documents []bson.M

	for i, summary := range summaries {
		float32Emb := convertFloat64ToFloat32(embeddings.Embeddings.Float[i])
		int8Emb := convertIntToInt8(embeddings.Embeddings.Int8[i])
		ubinaryEmb := convertIntToBytes(embeddings.Embeddings.Ubinary[i])

		floatVector := bson.NewVector(float32Emb)
		int8Vector := bson.NewVector(int8Emb)
		ubinaryVector, err := bson.NewPackedBitVector(ubinaryEmb, 0)
		if err != nil {
			log.Fatalf("Error creating PackedBitVector: %v", err)
		}

		document := bson.M{
			"text":               summary,
			"embeddings_float32": floatVector.Binary(),
			"embeddings_int8":    int8Vector.Binary(),
			"embeddings_int1":    ubinaryVector.Binary(),
		}

		documents = append(documents, document)
	}

	return documents
}

// Write JSON file from in-memory BSON documents
func writeJSONToFile(filename string, docs []bson.M) error {
	file, err := os.Create(filename)
	if err != nil {
		return fmt.Errorf("failed to create file: %w", err)
	}
	defer file.Close()

	var jsonDocuments []json.RawMessage
	for _, document := range docs {
		jsonBytes, err := bson.MarshalExtJSON(document, false, false)
		if err != nil {
			return fmt.Errorf("error marshaling BSON to JSON: %w", err)
		}
		jsonDocuments = append(jsonDocuments, jsonBytes)
	}

	jsonData, err := json.MarshalIndent(jsonDocuments, "", " ")
	if err != nil {
		return fmt.Errorf("failed to marshal JSON: %w", err)
	}

	_, err = file.Write(jsonData)
	if err != nil {
		return fmt.Errorf("failed to write JSON to file: %w", err)
	}

	return nil
}

func convertFloat64ToFloat32(f64s []float64) []float32 {
	f32s := make([]float32, len(f64s))
	for i, v := range f64s {
		f32s[i] = float32(v)
	}
	return f32s
}

func convertIntToInt8(ints []int) []int8 {
	ints8 := make([]int8, len(ints))
	for i, val := range ints {
		ints8[i] = int8(val)
	}
	return ints8
}

func convertIntToBytes(ints []int) []byte {
	bytes := make([]byte, len(ints))
	for i, val := range ints {
		bytes[i] = byte(val)
	}
	return bytes
}
```

If you didn't set the environment variables, replace the following placeholder values in the code and save the file.
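MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
COHERE_API_KEY
Your Cohere API key, if you didn't set the environment variable.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following command to compile and execute your program.
go run GenerateAndConvertEmbeddings.go
Embeddings successfully saved to embeddings.json
Verify the embeddings in the embeddings.json file.
To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.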
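The program above sends the summaries to the embedding API in batches of 96. The slicing logic can be isolated into a small generic helper, sketched below. The name `batches` is hypothetical, and the batch size of 96 simply mirrors the constant used in the program.

```go
package main

import "fmt"

// batches splits items into consecutive slices of at most size elements.
// The returned slices share memory with items; they are views, not copies.
func batches[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	texts := make([]string, 200)
	for i, b := range batches(texts, 96) {
		fmt.Printf("batch %d: %d texts\n", i, len(b))
	}
}
```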
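Ingest the data and create an Atlas Vector Search index.
You must upload your data and embeddings to a collection in your Atlas cluster and create an Atlas Vector Search index on the data to run $vectorSearch queries against the data.
Create a new file named UploadDataAndCreateIndex.go in your Go project.
touch UploadDataAndCreateIndex.go
Copy and paste the following code into the UploadDataAndCreateIndex.go file. This code does the following:
Uploads the float32, int8, and int1 embeddings in the embeddings.json file to your Atlas cluster.
Creates an Atlas Vector Search index on the embeddings_float32, embeddings_int8, and embeddings_int1 fields.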
UploadDataAndCreateIndex.go

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"go.mongodb.org/mongo-driver/v2/bson"
	"go.mongodb.org/mongo-driver/v2/mongo"
	"go.mongodb.org/mongo-driver/v2/mongo/options"
)

var (
	mongodbURI          = os.Getenv("MONGODB_URI")
	dbName              = "sample_airbnb"
	collectionName      = "listingsAndReviews"
	indexName           = "<INDEX-NAME>"
	numberOfDimensions  = 1024
	embeddingFields     = []string{"embeddings_float32", "embeddings_int8", "embeddings_int1"}
	embeddingSimilarity = []string{"dotProduct", "dotProduct", "euclidean"}
)

func main() {
	if mongodbURI == "" {
		log.Fatal("MONGODB_URI environment variable not set")
	}

	clientOptions := options.Client().ApplyURI(mongodbURI)
	client, err := mongo.Connect(clientOptions)
	if err != nil {
		log.Fatalf("Error connecting to MongoDB: %v", err)
	}
	defer func() {
		if err = client.Disconnect(context.TODO()); err != nil {
			log.Fatal(err)
		}
	}()

	if err := uploadEmbeddingsData(client); err != nil {
		log.Fatalf("Error uploading embeddings data: %v", err)
	}

	setupVectorSearchIndex(client)
}

func uploadEmbeddingsData(client *mongo.Client) error {
	collection := client.Database(dbName).Collection(collectionName)

	// Load embeddings.json file (os.ReadFile replaces the deprecated ioutil.ReadFile)
	fileContent, err := os.ReadFile("embeddings.json")
	if err != nil {
		return fmt.Errorf("error reading file: %w", err)
	}

	// Convert JSON file content to BSON compatible format using UnmarshalExtJSON
	var documents []bson.M
	if err := bson.UnmarshalExtJSON(fileContent, false, &documents); err != nil {
		return fmt.Errorf("failed to unmarshal JSON data: %w", err)
	}

	// Update documents in MongoDB
	for _, doc := range documents {
		summary, exists := doc["text"].(string)
		if !exists {
			return fmt.Errorf("missing 'text' field in document")
		}

		// Using bson.Binary ensures binary data is correctly interpreted
		if float32Bin, ok := doc["embeddings_float32"].(bson.Binary); ok {
			doc["embeddings_float32"] = float32Bin
		}
		if int8Bin, ok := doc["embeddings_int8"].(bson.Binary); ok {
			doc["embeddings_int8"] = int8Bin
		}
		if int1Bin, ok := doc["embeddings_int1"].(bson.Binary); ok {
			doc["embeddings_int1"] = int1Bin
		}

		filter := bson.M{"summary": summary}
		update := bson.M{
			"$set": doc,
		}

		// Set the upsert option
		opts := options.UpdateMany().SetUpsert(true)

		_, err = collection.UpdateMany(context.TODO(), filter, update, opts)
		if err != nil {
			return fmt.Errorf("failed to update documents: %w", err)
		}
	}

	return nil
}

// Sets up vector search index in MongoDB
func setupVectorSearchIndex(client *mongo.Client) {
	database := client.Database(dbName)
	collection := database.Collection(collectionName)

	ctx := context.TODO()

	type vectorDefinitionField struct {
		Type          string `bson:"type"`
		Path          string `bson:"path"`
		NumDimensions int    `bson:"numDimensions"`
		Similarity    string `bson:"similarity"`
	}

	type vectorDefinition struct {
		Fields []vectorDefinitionField `bson:"fields"`
	}

	fields := make([]vectorDefinitionField, len(embeddingFields))
	for i, field := range embeddingFields {
		fields[i] = vectorDefinitionField{
			Type:          "vector",
			Path:          field,
			NumDimensions: numberOfDimensions,
			Similarity:    embeddingSimilarity[i],
		}
	}

	opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

	indexModel := mongo.SearchIndexModel{
		Definition: vectorDefinition{
			Fields: fields,
		},
		Options: opts,
	}

	// Create the index
	log.Println("Creating the index.")
	searchIndexName, err := collection.SearchIndexes().CreateOne(ctx, indexModel)
	if err != nil {
		log.Fatalf("Failed to create the search index: %v", err)
	}

	// Polling to confirm successful index creation
	log.Println("Polling to confirm successful index creation.")
	log.Println("NOTE: This may take up to a minute.")
	searchIndexes := collection.SearchIndexes()
	var doc bson.Raw

	for doc == nil {
		cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
		if err != nil {
			log.Fatalf("failed to list search indexes: %v", err)
		}

		if !cursor.Next(ctx) {
			break
		}

		name := cursor.Current.Lookup("name").StringValue()
		queryable := cursor.Current.Lookup("queryable").Boolean()
		if name == searchIndexName && queryable {
			doc = cursor.Current
		} else {
			time.Sleep(5 * time.Second)
		}
	}

	log.Println("Name of Index Created: " + searchIndexName)
}
```

Replace the following placeholder values in the code and save the file.
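MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following command to compile and execute your program.
go run UploadDataAndCreateIndex.go
Creating the index.
Polling to confirm successful index creation.
NOTE: This may take up to a minute.
Name of Index Created: <INDEX-NAME>
Log in to your Atlas cluster and verify the following:
Data in the namespace.
Atlas Vector Search index for the collection.
Create and run a query against the collection.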
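To test your embeddings, you can run a query against your collection. Use an embedding model provider to generate float, int8, and int1 embeddings for your query text. The following sample code uses Cohere's embed API to generate full-precision vectors. After generating the embeddings, use the MongoDB Go driver to convert your native vector embedding to BSON vectors and run a $vectorSearch query against the collection.
Create a new file named CreateEmbeddingsAndRunQuery.go in your Go project.
touch CreateEmbeddingsAndRunQuery.go
Copy and paste the following code into the CreateEmbeddingsAndRunQuery.go file. This code does the following:
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Go driver.
Runs the query against your collection and returns the results.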
CreateEmbeddingsAndRunQuery.go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	cohere "github.com/cohere-ai/cohere-go/v2"
	cohereclient "github.com/cohere-ai/cohere-go/v2/client"
	"go.mongodb.org/mongo-driver/v2/bson"
	"go.mongodb.org/mongo-driver/v2/mongo"
	"go.mongodb.org/mongo-driver/v2/mongo/options"
)

const (
	dbName          = "<DATABASE-NAME>"
	collectionName  = "<COLLECTION-NAME>"
	vectorIndexName = "<INDEX-NAME>"
	dataFieldName   = "<TEXT-FIELD-NAME>"
	queryText       = "<QUERY-TEXT>"
)

func main() {
	apiKey := os.Getenv("COHERE_API_KEY")
	mongodbURI := os.Getenv("MONGODB_URI")

	if apiKey == "" {
		log.Fatal("API key not found. Set COHERE_API_KEY in your environment.")
	}
	if mongodbURI == "" {
		log.Fatal("MongoDB URI not found. Set MONGODB_URI in your environment.")
	}

	embeddingsData, err := generateAndConvertEmbeddings(apiKey, queryText)
	if err != nil {
		log.Fatalf("Error generating embeddings: %v", err)
	}

	err = runVectorSearchQuery(mongodbURI, embeddingsData)
	if err != nil {
		log.Fatalf("Error running vector search query: %v", err)
	}
}

// Generate embeddings using Cohere's embed API from the query text
func generateAndConvertEmbeddings(apiKey, text string) (map[string]bson.Binary, error) {
	client := cohereclient.NewClient(cohereclient.WithToken(apiKey))

	model := "embed-english-v3.0"
	response, err := client.V2.Embed(context.TODO(), &cohere.V2EmbedRequest{
		Texts:     []string{text},
		Model:     model,
		InputType: cohere.EmbedInputTypeSearchQuery,
		EmbeddingTypes: []cohere.EmbeddingType{
			cohere.EmbeddingTypeFloat,
			cohere.EmbeddingTypeInt8,
			cohere.EmbeddingTypeUbinary,
		},
	})
	if err != nil {
		return nil, fmt.Errorf("failed to fetch embeddings: %w", err)
	}

	if response.Embeddings == nil || len(response.Embeddings.Float) == 0 {
		return nil, fmt.Errorf("no embeddings found in the API response")
	}

	return createBSONVectorEmbeddings(response.Embeddings), nil
}

// Convert embeddings to BSON vectors using MongoDB Go Driver
func createBSONVectorEmbeddings(embeddings *cohere.EmbedByTypeResponseEmbeddings) map[string]bson.Binary {
	bsonVectorEmbeddings := make(map[string]bson.Binary)

	// Convert float embeddings
	if len(embeddings.Float) > 0 {
		floatData := convertFloat64ToFloat32(embeddings.Float[0])
		floatVector := bson.NewVector(floatData) // Create a BSON vector from the float32 data
		bsonVectorEmbeddings["float32"] = floatVector.Binary()
	}

	// Convert int8 embeddings
	if len(embeddings.Int8) > 0 {
		int8Data := convertIntToInt8(embeddings.Int8[0])
		int8Vector := bson.NewVector(int8Data) // Create a BSON vector from the int8 data
		bsonVectorEmbeddings["int8"] = int8Vector.Binary()
	}

	// Convert ubinary embeddings to a packed bit vector
	if len(embeddings.Ubinary) > 0 {
		int1Data := convertIntToBytes(embeddings.Ubinary[0])
		// Padding is 0 because the number of dimensions is a multiple of 8
		ubinaryVector, _ := bson.NewPackedBitVector(int1Data, 0)
		bsonVectorEmbeddings["int1"] = ubinaryVector.Binary()
	}

	return bsonVectorEmbeddings
}

// Run $vectorSearch query using the embeddings
func runVectorSearchQuery(mongodbURI string, embeddingsData map[string]bson.Binary) error {
	ctx := context.Background()
	clientOptions := options.Client().ApplyURI(mongodbURI)
	client, err := mongo.Connect(clientOptions)
	if err != nil {
		return fmt.Errorf("failed to connect to MongoDB: %w", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()

	db := client.Database(dbName)
	collection := db.Collection(collectionName)

	// Map the correct paths for embeddings
	pathMap := map[string]string{
		"float32": "embeddings_float32",
		"int8":    "embeddings_int8",
		"int1":    "embeddings_int1",
	}

	// Run one query per embedding type
	for pathKey, queryVector := range embeddingsData {
		path, ok := pathMap[pathKey]
		if !ok {
			return fmt.Errorf("invalid path key: %s", pathKey)
		}

		pipeline := mongo.Pipeline{
			{
				{"$vectorSearch", bson.D{
					{"queryVector", queryVector},
					{"index", vectorIndexName},
					{"path", path},
					{"numCandidates", 5},
					{"limit", 2},
				}},
			},
			{
				{"$project", bson.D{
					{"_id", 1},
					{dataFieldName, 1},
					{"score", bson.D{
						{"$meta", "vectorSearchScore"},
					}},
				}},
			},
		}

		cursor, err := collection.Aggregate(context.Background(), pipeline)
		if err != nil {
			return fmt.Errorf("failed to run vector search aggregation query: %w", err)
		}
		defer cursor.Close(ctx)

		var results []bson.M
		if err = cursor.All(context.Background(), &results); err != nil {
			return fmt.Errorf("failed to parse aggregation query results: %w", err)
		}

		fmt.Printf("Results from %v embeddings:\n", path)
		for _, result := range results {
			fmt.Println(result)
		}
	}

	return nil
}

// Utility functions to handle list conversion
func convertFloat64ToFloat32(f64s []float64) []float32 {
	f32s := make([]float32, len(f64s))
	for i, v := range f64s {
		f32s[i] = float32(v)
	}
	return f32s
}

func convertIntToInt8(ints []int) []int8 {
	ints8 := make([]int8, len(ints))
	for i, val := range ints {
		ints8[i] = int8(val)
	}
	return ints8
}

func convertIntToBytes(ints []int) []byte {
	bytes := make([]byte, len(ints))
	for i, val := range ints {
		bytes[i] = byte(val)
	}
	return bytes
}
Replace the following placeholder values in the code and save the file.
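MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
COHERE_API_KEY
Your Cohere API key, if you didn't set the environment variable.
<DATABASE-NAME>
Name of the database in your Atlas cluster. For this example, use sample_airbnb.
<COLLECTION-NAME>
Name of the collection where you ingested the data. For this example, use listingsAndReviews.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
<TEXT-FIELD-NAME>
Name of the field that contains the text from which you generated the embeddings. For this example, use summary.
<QUERY-TEXT>
Text for the query. For this example, use ocean view.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following command to compile and execute your program.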
go run CreateEmbeddingsAndRunQuery.go
Results from embeddings_float32 embeddings:
{"_id":"10266175","summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":{"$numberDouble":"0.7278661131858826"}}
{"summary":"A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.","score":{"$numberDouble":"0.688639760017395"},"_id":"1001265"}
Results from embeddings_int8 embeddings:
{"_id":"10266175","summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":{"$numberDouble":"0.5215557217597961"}}
{"_id":"1001265","summary":"A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.","score":{"$numberDouble":"0.5179016590118408"}}
Results from embeddings_int1 embeddings:
{"_id":"10266175","summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":{"$numberDouble":"0.6591796875"}}
{"_id":"1001265","summary":"A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.","score":{"$numberDouble":"0.6337890625"}}
To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.
Create a Java project in your IDE with the dependencies configured for the MongoDB Java Driver, and then perform the following steps in the project. To try the example, replace the placeholders with valid values.
Create your Java project and install dependencies.
From your IDE, create a Java project using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the dependencies element in your project's pom.xml file:
pom.xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13.2</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongodb-driver-sync</artifactId>
        <version>5.3.1</version>
    </dependency>
    <dependency>
        <groupId>com.cohere</groupId>
        <artifactId>cohere-java</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>2.0.16</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>2.0.16</version>
        <scope>test</scope>
    </dependency>
</dependencies>
If you are using Gradle, add the following to the dependencies block in your project's build.gradle file:
build.gradle
dependencies {
    // MongoDB Java Sync Driver v5.3.1 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.3.1,)'
    // Java library for working with Cohere models (same coordinates as the Maven dependency above)
    implementation 'com.cohere:cohere-java:1.6.0'
    // SLF4J (The Simple Logging Facade for Java)
    testImplementation("org.slf4j:slf4j-simple:2.0.16")
    implementation("org.slf4j:slf4j-api:2.0.16")
}
Run your package manager to install the dependencies to your project.
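Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
COHERE_API_KEY=<api-key>
MONGODB_URI=<connection-string>
Update the placeholders with the following values:
Replace the <api-key> placeholder value with your Cohere API key.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net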
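A username or password that contains reserved characters such as @, : or / must be percent-encoded before it is placed in the connection string, otherwise the URI is parsed incorrectly. The following is a minimal sketch of assembling the string in the format above; the SrvUriExample class, its method names, and the host cluster0.example.mongodb.net are hypothetical and used only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SrvUriExample {
    // Percent-encode a credential so reserved characters such as '@', ':' and '/'
    // cannot be mistaken for URI delimiters.
    static String encodeCredential(String value) {
        // URLEncoder targets HTML forms and writes a space as '+', so convert it to %20.
        // A literal '+' in the input is already encoded as %2B, so this replace is safe.
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Assemble mongodb+srv://<db_username>:<db_password>@<host>
    static String buildSrvUri(String username, String password, String host) {
        return "mongodb+srv://" + encodeCredential(username) + ":"
                + encodeCredential(password) + "@" + host;
    }

    public static void main(String[] args) {
        // Hypothetical credentials and host, for illustration only.
        System.out.println(buildSrvUri("app user", "p@ss:w/rd", "cluster0.example.mongodb.net"));
    }
}
```

In practice you would set the resulting string as the MONGODB_URI value instead of printing it.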
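Generate embeddings from your data.
You can use an embedding model provider to generate float, int8, and int1 embeddings for your data and then use the MongoDB Java driver to convert your native vector embedding to BSON vectors. The following sample code uses Cohere's embed API to generate full-precision vectors.
Create a new file named GenerateAndConvertEmbeddings.java in your Java project.
touch GenerateAndConvertEmbeddings.java
Copy and paste the following code into the GenerateAndConvertEmbeddings.java file. This code does the following:
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.
Creates a file named embeddings.json and saves the data with embeddings in the file to upload to Atlas.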
GenerateAndConvertEmbeddings.java
import com.cohere.api.Cohere;
import com.cohere.api.requests.EmbedRequest;
import com.cohere.api.types.EmbedByTypeResponse;
import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
import com.cohere.api.types.EmbedInputType;
import com.cohere.api.types.EmbedResponse;
import com.cohere.api.types.EmbeddingType;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import org.bson.BinaryVector;
import org.bson.Document;

public class GenerateAndConvertEmbeddings {

    // List of text data to embed
    private static final List<String> DATA = List.of(
        "The Great Wall of China is visible from space.",
        "The Eiffel Tower was completed in Paris in 1889.",
        "Mount Everest is the highest peak on Earth at 8,848m.",
        "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
        "The Mona Lisa was painted by Leonardo da Vinci."
    );

    public static void main(String[] args) {
        // Cohere API key for authentication
        String apiKey = System.getenv("COHERE_API_KEY");

        // Fetch embeddings from the Cohere API
        EmbedByTypeResponseEmbeddings embeddings = fetchEmbeddingsFromCohere(apiKey);
        Document bsonEmbeddings = convertEmbeddingsToBson(embeddings);

        writeEmbeddingsToFile(bsonEmbeddings, "embeddings.json");
    }

    // Fetches embeddings based on input data from the Cohere API
    private static EmbedByTypeResponseEmbeddings fetchEmbeddingsFromCohere(String apiKey) {
        if (Objects.isNull(apiKey) || apiKey.isEmpty()) {
            throw new RuntimeException("API key not found. Please set COHERE_API_KEY in your environment.");
        }

        Cohere cohere = Cohere.builder().token(apiKey).clientName("embed-example").build();

        try {
            EmbedRequest request = EmbedRequest.builder()
                .model("embed-english-v3.0")
                .inputType(EmbedInputType.SEARCH_DOCUMENT)
                .texts(DATA)
                .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                .build();

            EmbedResponse response = cohere.embed(request);
            Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();

            return optionalEmbeddingsWrapper.orElseThrow().getEmbeddings();
        } catch (Exception e) {
            System.err.println("Error fetching embeddings: " + e.getMessage());
            throw e;
        }
    }

    // Converts embeddings to BSON binary vectors using MongoDB Java Driver
    private static Document convertEmbeddingsToBson(EmbedByTypeResponseEmbeddings embeddings) {
        List<List<Double>> floatEmbeddings = embeddings.getFloat().orElseThrow();
        List<List<Integer>> int8Embeddings = embeddings.getInt8().orElseThrow();
        List<List<Integer>> ubinaryEmbeddings = embeddings.getUbinary().orElseThrow();

        // Build one document per input text, with one field per embedding type
        List<Document> bsonEmbeddings = new ArrayList<>();
        for (int i = 0; i < floatEmbeddings.size(); i++) {
            Document bsonEmbedding = new Document()
                .append("text", DATA.get(i))
                .append("embeddings_float32", BinaryVector.floatVector(listToFloatArray(floatEmbeddings.get(i))))
                .append("embeddings_int8", BinaryVector.int8Vector(listToByteArray(int8Embeddings.get(i))))
                .append("embeddings_int1", BinaryVector.packedBitVector(listToByteArray(ubinaryEmbeddings.get(i)), (byte) 0));

            bsonEmbeddings.add(bsonEmbedding);
        }

        return new Document("data", bsonEmbeddings);
    }

    // Writes embeddings to JSON file
    private static void writeEmbeddingsToFile(Document bsonEmbeddings, String fileName) {
        try (FileOutputStream fos = new FileOutputStream(fileName)) {
            fos.write(bsonEmbeddings.toJson().getBytes());
            System.out.println("Embeddings saved to " + fileName);
        } catch (IOException e) {
            System.out.println("Error writing embeddings to file: " + e.getMessage());
        }
    }

    // Convert List of Doubles to an array of floats
    private static float[] listToFloatArray(List<Double> list) {
        float[] array = new float[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).floatValue();
        }
        return array;
    }

    // Convert List of Integers to an array of bytes
    private static byte[] listToByteArray(List<Integer> list) {
        byte[] array = new byte[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).byteValue();
        }
        return array;
    }
}
Replace the COHERE_API_KEY placeholder value in the code if you didn't set the environment variable, and then save the file.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following commands to compile and execute your program.
javac GenerateAndConvertEmbeddings.java
java GenerateAndConvertEmbeddings
Embeddings saved to embeddings.json
Verify the embeddings in the embeddings.json file.
To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.
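The ubinary embeddings carry one bit per dimension, packed eight to a byte, which is why the code passes them to BinaryVector.packedBitVector with a padding of 0. As a rough illustration of what such packing involves, the following sketch quantizes floats against a midpoint of 0 and packs the bits. It assumes most-significant-bit-first ordering within each byte; the PackBitsExample class is hypothetical, and in the tutorial itself Cohere returns the packed values, so you do not need to do this yourself.

```java
final class PackBitsExample {
    // Quantize each value to one bit (1 if greater than 0, otherwise 0) and pack
    // 8 bits per byte, most significant bit first (an assumption of this sketch).
    static byte[] packBits(float[] values) {
        if (values.length % 8 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 8");
        }
        byte[] packed = new byte[values.length / 8];
        for (int i = 0; i < values.length; i++) {
            if (values[i] > 0f) {
                packed[i / 8] |= (byte) (0x80 >>> (i % 8));
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        // Bits: 1 0 1 1 0 0 0 1 (0.0 is not greater than the midpoint, so it maps to 0)
        byte[] packed = packBits(new float[] {0.3f, -0.1f, 0.7f, 0.2f, -0.5f, -0.9f, 0.0f, 0.4f});
        // Java bytes are signed; Byte.toUnsignedInt shows the 0 to 255 value.
        System.out.println(Byte.toUnsignedInt(packed[0])); // prints 177
    }
}
```

The same signedness applies to listToByteArray in the code above: an API value such as 177 is stored as the Java byte -79, which has the identical bit pattern.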
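Ingest the data and create an Atlas Vector Search index.
You must upload your data and embeddings to a collection in your Atlas cluster and create an Atlas Vector Search index on the data to run $vectorSearch queries against the data.
Create a new file named UploadDataAndCreateIndex.java in your Java project.
touch UploadDataAndCreateIndex.java
Copy and paste the following code into the UploadDataAndCreateIndex.java file. This code does the following:
Uploads the data in the embeddings.json file to your Atlas cluster.
Creates an Atlas Vector Search index on the embeddings_float32, embeddings_int8, and embeddings_int1 fields.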
UploadDataAndCreateIndex.java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.StreamSupport;

public class UploadDataAndCreateIndex {

    private static final String MONGODB_URI = System.getenv("MONGODB_URI");
    private static final String DB_NAME = "<DATABASE-NAME>";
    private static final String COLLECTION_NAME = "<COLLECTION-NAME>";
    private static final String INDEX_NAME = "<INDEX-NAME>";

    public static void main(String[] args) {
        try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
            storeEmbeddings(mongoClient);
            setupVectorSearchIndex(mongoClient);
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }

    // Reads embeddings.json and inserts its documents into the collection
    public static void storeEmbeddings(MongoClient client) throws IOException {
        MongoDatabase database = client.getDatabase(DB_NAME);
        MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

        String fileContent = Files.readString(Path.of("embeddings.json"));
        List<Document> documents = parseDocuments(fileContent);

        collection.insertMany(documents);
        System.out.println("Inserted documents into MongoDB");
    }

    // Extracts the list of documents stored under the "data" key
    private static List<Document> parseDocuments(String jsonContent) throws IOException {
        Document rootDoc = Document.parse(jsonContent);
        return rootDoc.getList("data", Document.class);
    }

    // Creates one vector index that covers all three embedding fields
    public static void setupVectorSearchIndex(MongoClient client) throws InterruptedException {
        MongoDatabase database = client.getDatabase(DB_NAME);
        MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

        Bson definition = new Document(
            "fields",
            List.of(
                new Document("type", "vector")
                    .append("path", "embeddings_float32")
                    .append("numDimensions", 1024)
                    .append("similarity", "dotProduct"),
                new Document("type", "vector")
                    .append("path", "embeddings_int8")
                    .append("numDimensions", 1024)
                    .append("similarity", "dotProduct"),
                new Document("type", "vector")
                    .append("path", "embeddings_int1")
                    .append("numDimensions", 1024)
                    .append("similarity", "euclidean")
            )
        );

        SearchIndexModel indexModel = new SearchIndexModel(
            INDEX_NAME,
            definition,
            SearchIndexType.vectorSearch()
        );

        List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
        System.out.println("Successfully created vector index named: " + result.get(0));
        System.out.println("It may take up to a minute for the index to leave the BUILDING status and become queryable.");

        System.out.println("Polling to confirm the index has changed from the BUILDING status.");
        waitForIndex(collection, INDEX_NAME);
    }

    // Polls for up to 60 seconds; returns true once the index is queryable, false on timeout
    public static <T> boolean waitForIndex(final MongoCollection<T> collection, final String indexName) {
        long startTime = System.nanoTime();
        long timeoutNanos = TimeUnit.SECONDS.toNanos(60);
        while (System.nanoTime() - startTime < timeoutNanos) {
            Document indexRecord = StreamSupport.stream(collection.listSearchIndexes().spliterator(), false)
                .filter(index -> indexName.equals(index.getString("name")))
                .findAny().orElse(null);
            if (indexRecord != null) {
                if ("FAILED".equals(indexRecord.getString("status"))) {
                    throw new RuntimeException("Search index has FAILED status.");
                }
                if (indexRecord.getBoolean("queryable")) {
                    System.out.println(indexName + " index is ready to query");
                    return true;
                }
            }
            try {
                Thread.sleep(100); // busy-wait, avoid in production
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return false;
    }
}
Replace the following placeholder values in the code and save the file.
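MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
<DATABASE-NAME>
Name of the database in your Atlas cluster.
<COLLECTION-NAME>
Name of the collection where you want to upload the data.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following commands to compile and execute your program.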
javac UploadDataAndCreateIndex.java
java UploadDataAndCreateIndex
Inserted documents into MongoDB
Successfully created vector index named: <INDEX-NAME>
It may take up to a minute for the index to leave the BUILDING status and become queryable.
Polling to confirm the index has changed from the BUILDING status.
<INDEX-NAME> index is ready to query
Log in to your Atlas cluster and verify the following:
Data in the namespace.
Atlas Vector Search index for the collection.
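Because the index builds asynchronously, any code that queries it right after creation has to wait until the index reports that it is queryable. The following is a minimal, driver-independent sketch of a poll-with-timeout helper; the PollUntilExample class and the pollUntil method are hypothetical, and the lambda in main is a stand-in for a real "is the index queryable?" check.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class PollUntilExample {
    // Calls check repeatedly until it returns true or the timeout elapses.
    // Returns true on success, false on timeout or if the thread is interrupted.
    static boolean pollUntil(BooleanSupplier check, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in check that succeeds on the third call.
        boolean ready = pollUntil(() -> ++calls[0] >= 3, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(ready + " after " + calls[0] + " checks"); // prints "true after 3 checks"
    }
}
```

Returning a boolean leaves the timeout decision to the caller, who can then fail fast instead of running a query against an index that is still building.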
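Create and run a query against the collection.
To test your embeddings, you can run a query against your collection. Use an embedding model provider to generate float, int8, and int1 embeddings for your query text. The following sample code uses Cohere's embed API to generate full-precision vectors. After generating the embeddings, use the MongoDB Java driver to convert your native vector embedding to BSON vectors and run a $vectorSearch query against the collection.
Create a new file named CreateEmbeddingsAndRunQuery.java in your Java project.
touch CreateEmbeddingsAndRunQuery.java
Copy and paste the following code into the CreateEmbeddingsAndRunQuery.java file. This code does the following:
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.
Runs the query against your collection.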
CreateEmbeddingsAndRunQuery.java
import com.cohere.api.Cohere;
import com.cohere.api.requests.EmbedRequest;
import com.cohere.api.types.EmbedResponse;
import com.cohere.api.types.EmbedByTypeResponse;
import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
import com.cohere.api.types.EmbeddingType;
import com.cohere.api.types.EmbedInputType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.BinaryVector;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.approximateVectorSearchOptions;
import static java.util.Arrays.asList;

public class CreateEmbeddingsAndRunQuery {
    private static final String COHERE_API_KEY = System.getenv("COHERE_API_KEY");
    private static final String MONGODB_URI = System.getenv("MONGODB_URI");
    private static final String DB_NAME = "<DATABASE-NAME>";
    private static final String COLLECTION_NAME = "<COLLECTION-NAME>";
    private static final String VECTOR_INDEX_NAME = "<INDEX-NAME>";
    private static final String DATA_FIELD_NAME = "<DATA-FIELD-NAME>";

    public static void main(String[] args) {
        String queryText = "<QUERY-TEXT>";

        try {
            // The class instantiated here must match the enclosing class name
            CreateEmbeddingsAndRunQuery processor = new CreateEmbeddingsAndRunQuery();
            Map<String, BinaryVector> embeddingsData = processor.generateAndConvertEmbeddings(queryText);
            processor.runVectorSearchQuery(embeddingsData);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Generate embeddings using Cohere's embed API from the query text
    public Map<String, BinaryVector> generateAndConvertEmbeddings(String text) throws Exception {
        if (COHERE_API_KEY == null || COHERE_API_KEY.isEmpty()) {
            throw new RuntimeException("API key not found. Set COHERE_API_KEY in your environment.");
        }

        Cohere cohere = Cohere.builder().token(COHERE_API_KEY).build();

        EmbedRequest request = EmbedRequest.builder()
            .model("embed-english-v3.0")
            .inputType(EmbedInputType.SEARCH_QUERY)
            .texts(List.of(text))
            .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
            .build();

        EmbedResponse response = cohere.embed(request);
        Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();
        if (optionalEmbeddingsWrapper.isEmpty()) {
            throw new RuntimeException("No embeddings found in the API response.");
        }

        EmbedByTypeResponseEmbeddings embeddings = optionalEmbeddingsWrapper.get().getEmbeddings();
        return createBinaryVectorEmbeddings(embeddings);
    }

    // Convert embeddings to BSON binary vectors using MongoDB Java Driver
    private static Map<String, BinaryVector> createBinaryVectorEmbeddings(EmbedByTypeResponseEmbeddings embeddings) {
        Map<String, BinaryVector> binaryVectorEmbeddings = new HashMap<>();

        // Convert float embeddings
        List<Double> floatList = embeddings.getFloat().orElseThrow().get(0);
        if (floatList != null) {
            float[] floatData = listToFloatArray(floatList);
            BinaryVector floatVector = BinaryVector.floatVector(floatData);
            binaryVectorEmbeddings.put("float32", floatVector);
        }

        // Convert int8 embeddings
        List<Integer> int8List = embeddings.getInt8().orElseThrow().get(0);
        if (int8List != null) {
            byte[] int8Data = listToByteArray(int8List);
            BinaryVector int8Vector = BinaryVector.int8Vector(int8Data);
            binaryVectorEmbeddings.put("int8", int8Vector);
        }

        // Convert ubinary embeddings
        List<Integer> ubinaryList = embeddings.getUbinary().orElseThrow().get(0);
        if (ubinaryList != null) {
            byte[] int1Data = listToByteArray(ubinaryList);
            BinaryVector packedBitsVector = BinaryVector.packedBitVector(int1Data, (byte) 0);
            binaryVectorEmbeddings.put("int1", packedBitsVector);
        }

        return binaryVectorEmbeddings;
    }

    // Define and run $vectorSearch query using the embeddings
    public void runVectorSearchQuery(Map<String, BinaryVector> embeddingsData) {
        if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
            throw new RuntimeException("MongoDB URI not found. Set MONGODB_URI in your environment.");
        }

        try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
            MongoDatabase database = mongoClient.getDatabase(DB_NAME);
            MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

            // Run one query per embedding type against the matching embeddings_<type> field
            for (String path : embeddingsData.keySet()) {
                BinaryVector queryVector = embeddingsData.get(path);

                List<Bson> pipeline = asList(
                    vectorSearch(
                        fieldPath("embeddings_" + path),
                        queryVector,
                        VECTOR_INDEX_NAME,
                        2,
                        approximateVectorSearchOptions(5)
                    ),
                    project(
                        fields(
                            exclude("_id"),
                            include(DATA_FIELD_NAME),
                            // Project the score as "score" to match the sample output
                            metaVectorSearchScore("score")
                        )
                    )
                );

                List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());

                System.out.println("Results from " + path + " embeddings:");
                for (Document result : results) {
                    System.out.println(result.toJson());
                }
            }
        }
    }

    // Convert List of Doubles to an array of floats
    private static float[] listToFloatArray(List<Double> list) {
        float[] array = new float[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).floatValue();
        }
        return array;
    }

    // Convert List of Integers to an array of bytes
    private static byte[] listToByteArray(List<Integer> list) {
        byte[] array = new byte[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).byteValue();
        }
        return array;
    }
}
Replace the following placeholder values in the code and save the file.
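MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
COHERE_API_KEY
Your Cohere API key, if you didn't set the environment variable.
<DATABASE-NAME>
Name of the database in your Atlas cluster.
<COLLECTION-NAME>
Name of the collection where you ingested the data.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
<DATA-FIELD-NAME>
Name of the field that contains the text from which you generated the embeddings. For this example, use text.
<QUERY-TEXT>
Text for the query. For this example, use science fact.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following commands to compile and execute your program.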
javac CreateEmbeddingsAndRunQuery.java
java CreateEmbeddingsAndRunQuery
Results from int1 embeddings:
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "score": 0.642578125}
{"text": "The Great Wall of China is visible from space.", "score": 0.61328125}
Results from int8 embeddings:
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "score": 0.5149773359298706}
{"text": "The Great Wall of China is visible from space.", "score": 0.5146723985671997}
Results from float32 embeddings:
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "score": 0.6583383083343506}
{"text": "The Great Wall of China is visible from space.", "score": 0.6536108255386353}
To learn more about generating embeddings and converting the embeddings to binData vectors, see How to Create Vector Embeddings.
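The index compares the int1 field with euclidean similarity. For vectors whose components are only 0 or 1, the squared Euclidean distance equals the number of bit positions that differ (the Hamming distance), so packed-bit vectors can be compared with XOR and a bit count. The following sketch illustrates that comparison only; it is not a description of how Atlas computes the scores shown above, and the HammingExample class is hypothetical.

```java
final class HammingExample {
    // Counts the bit positions at which two equal-length packed-bit vectors differ.
    static int hammingDistance(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            // Mask to 8 bits because Java sign-extends bytes when widening to int.
            distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return distance;
    }

    public static void main(String[] args) {
        byte[] query = {(byte) 0xB1, (byte) 0x0F};
        byte[] stored = {(byte) 0xB0, (byte) 0xF0};
        // First byte differs in 1 bit, second byte in all 8 bits.
        System.out.println(hammingDistance(query, stored)); // prints 9
    }
}
```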
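(Conditional) Generate embeddings from your data.
You can use an embedding model provider to generate float, int8, and int1 embeddings for your data, and then use the MongoDB Java driver to convert your native vector embeddings to BSON vectors. The following sample code uses Cohere's embed API to generate full-precision vectors from the data in the sample_airbnb.listingsAndReviews namespace.
Create a new file named GenerateAndConvertEmbeddings.java in your Java project.
touch GenerateAndConvertEmbeddings.java
Copy and paste the following code into the GenerateAndConvertEmbeddings.java file. This code does the following:
Gets the summary field from 50 documents in the sample_airbnb.listingsAndReviews namespace.
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.
Creates a file named embeddings.json and saves the data with embeddings in the file.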
GenerateAndConvertEmbeddings.java
import com.cohere.api.Cohere;
import com.cohere.api.requests.EmbedRequest;
import com.cohere.api.types.EmbedByTypeResponse;
import com.cohere.api.types.EmbedResponse;
import com.cohere.api.types.EmbeddingType;
import com.cohere.api.types.EmbedInputType;
import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.FindIterable;
import org.bson.BsonArray;
import org.bson.Document;
import org.bson.BinaryVector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

public class GenerateAndConvertEmbeddings {
    private static final Logger logger = LoggerFactory.getLogger(GenerateAndConvertEmbeddings.class);
    private static final String COHERE_API_KEY = System.getenv("COHERE_API_KEY");
    private static final String MONGODB_URI = System.getenv("MONGODB_URI");

    public static void main(String[] args) {
        try {
            List<String> summaries = fetchSummariesFromMongoDB();
            if (summaries.isEmpty()) {
                throw new RuntimeException("No summaries retrieved from MongoDB.");
            }
            EmbedByTypeResponseEmbeddings embeddingsData = fetchEmbeddingsFromCohere(COHERE_API_KEY, summaries);
            if (embeddingsData == null) {
                throw new RuntimeException("Failed to fetch embeddings.");
            }
            convertAndSaveEmbeddings(summaries, embeddingsData);
        } catch (Exception e) {
            logger.error("Unexpected error: {}", e.getMessage(), e);
        }
    }

    private static List<String> fetchSummariesFromMongoDB() {
        List<String> summaries = new ArrayList<>();
        if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
            throw new RuntimeException("MongoDB URI is not set.");
        }
        logger.info("Connecting to MongoDB at URI: {}", MONGODB_URI);
        try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
            String dbName = "sample_airbnb";
            String collName = "listingsAndReviews";
            MongoDatabase database = mongoClient.getDatabase(dbName);
            MongoCollection<Document> collection = database.getCollection(collName);
            Document filter = new Document("summary", new Document("$nin", Arrays.asList(null, "")));
            FindIterable<Document> documentsCursor = collection.find(filter).limit(50);
            for (Document doc : documentsCursor) {
                String summary = doc.getString("summary");
                if (summary != null && !summary.isEmpty()) {
                    summaries.add(summary);
                }
            }
            logger.info("Retrieved {} summaries from MongoDB.", summaries.size());
        } catch (Exception e) {
            logger.error("Error fetching from MongoDB: {}", e.getMessage(), e);
            throw new RuntimeException("Failed to fetch data from MongoDB", e);
        }
        return summaries;
    }

    private static EmbedByTypeResponseEmbeddings fetchEmbeddingsFromCohere(String apiKey, List<String> data) {
        if (Objects.isNull(apiKey) || apiKey.isEmpty()) {
            throw new RuntimeException("API key is not set.");
        }
        Cohere cohere = Cohere.builder().token(apiKey).clientName("embed-example").build();
        try {
            EmbedRequest request = EmbedRequest.builder()
                    .model("embed-english-v3.0")
                    .inputType(EmbedInputType.SEARCH_DOCUMENT)
                    .texts(data)
                    .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                    .build();
            EmbedResponse response = cohere.embed(request);
            Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();
            if (optionalEmbeddingsWrapper.isPresent()) {
                return optionalEmbeddingsWrapper.get().getEmbeddings();
            } else {
                logger.warn("No embeddings were returned.");
            }
        } catch (Exception e) {
            logger.error("Error fetching embeddings: {}", e.getMessage(), e);
        }
        return null;
    }

    private static void convertAndSaveEmbeddings(List<String> summaries, EmbedByTypeResponseEmbeddings embeddings) {
        try {
            Document doc = new Document();
            BsonArray array = new BsonArray();
            for (int i = 0; i < summaries.size(); i++) {
                String summary = summaries.get(i);

                // Retrieve the embeddings for the current index
                List<Double> floatList = embeddings.getFloat().orElseThrow().get(i);
                List<Integer> int8List = embeddings.getInt8().orElseThrow().get(i);
                List<Integer> ubinaryList = embeddings.getUbinary().orElseThrow().get(i);

                // Convert lists to arrays
                float[] floatData = listToFloatArray(floatList);
                byte[] int8Data = listToByteArray(int8List);
                byte[] int1Data = listToByteArray(ubinaryList);

                // Create BinaryVector objects
                BinaryVector floatVector = BinaryVector.floatVector(floatData);
                BinaryVector int8Vector = BinaryVector.int8Vector(int8Data);
                BinaryVector packedBitsVector = BinaryVector.packedBitVector(int1Data, (byte) 0);

                Document document = new Document()
                        .append("text", summary)
                        .append("embeddings_float32", floatVector)
                        .append("embeddings_int8", int8Vector)
                        .append("embeddings_int1", packedBitsVector);
                array.add(document.toBsonDocument());
            }
            doc.append("data", array);
            try (FileOutputStream fos = new FileOutputStream("embeddings.json")) {
                fos.write(doc.toJson().getBytes());
            }
            logger.info("Embeddings with BSON vectors have been saved to embeddings.json");
        } catch (IOException e) {
            logger.error("Error writing embeddings to file: {}", e.getMessage(), e);
        }
    }

    private static float[] listToFloatArray(List<Double> list) {
        float[] array = new float[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).floatValue();
        }
        return array;
    }

    private static byte[] listToByteArray(List<Integer> list) {
        byte[] array = new byte[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).byteValue();
        }
        return array;
    }
}
If you didn't set the environment variables, replace the following placeholder values in the code and save the file.
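MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
COHERE_API_KEY
Your Cohere API key, if you didn't set the environment variable.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following commands to compile and execute your program.
javac GenerateAndConvertEmbeddings.java
java GenerateAndConvertEmbeddings
[main] INFO GenerateAndConvertEmbeddings - Connecting to MongoDB at URI: <CONNECTION-STRING>
...
[main] INFO GenerateAndConvertEmbeddings - Retrieved 50 summaries from MongoDB.
[main] INFO GenerateAndConvertEmbeddings - Embeddings with BSON vectors have been saved to embeddings.json
Verify the embeddings in the embeddings.json file.
To learn more about generating embeddings and converting them to binData vectors, see How to Create Vector Embeddings.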
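Ingest the data and create an Atlas Vector Search index.
You must upload your data and embeddings to a collection in your Atlas cluster and create an Atlas Vector Search index on the data to run $vectorSearch queries against the data.
Create a new file named UploadDataAndCreateIndex.java in your Java project.
touch UploadDataAndCreateIndex.java
Copy and paste the following code into the UploadDataAndCreateIndex.java file. This code does the following:
Uploads the float32, int8, and int1 embeddings in the embeddings.json file to your Atlas cluster.
Creates an Atlas Vector Search index on the embeddings_float32, embeddings_int8, and embeddings_int1 fields.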
UploadDataAndCreateIndex.java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;

import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.BinaryVector; // Import the BinaryVector

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.StreamSupport;

public class UploadDataAndCreateIndex {

    private static final String MONGODB_URI = System.getenv("MONGODB_URI");
    private static final String DB_NAME = "sample_airbnb";
    private static final String COLLECTION_NAME = "listingsAndReviews";
    private static final String INDEX_NAME = "<INDEX-NAME>";

    public static void main(String[] args) {
        try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
            uploadEmbeddingsData(mongoClient);
            setupVectorSearchIndex(mongoClient);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void uploadEmbeddingsData(MongoClient mongoClient) throws IOException {
        MongoDatabase database = mongoClient.getDatabase(DB_NAME);
        MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);
        String filePath = "embeddings.json";
        String fileContent = Files.readString(Path.of(filePath));

        Document rootDoc = Document.parse(fileContent);
        List<Document> embeddingsDocs = rootDoc.getList("data", Document.class);

        for (Document doc : embeddingsDocs) {
            // Retrieve the string value from the document
            String summary = doc.getString("text");

            // Get the BinaryVector objects from the document
            BinaryVector embeddingsFloat32 = doc.get("embeddings_float32", BinaryVector.class);
            BinaryVector embeddingsInt8 = doc.get("embeddings_int8", BinaryVector.class);
            BinaryVector embeddingsInt1 = doc.get("embeddings_int1", BinaryVector.class);

            // Create filter and update documents
            Document filter = new Document("summary", summary);
            Document update = new Document("$set", new Document("summary", summary)
                    .append("embeddings_float32", embeddingsFloat32)
                    .append("embeddings_int8", embeddingsInt8)
                    .append("embeddings_int1", embeddingsInt1));

            // Perform update operation with upsert option
            collection.updateOne(filter, update, new com.mongodb.client.model.UpdateOptions().upsert(true));
            System.out.println("Processed document with summary: " + summary);
        }
    }

    public static void setupVectorSearchIndex(MongoClient client) throws InterruptedException {
        MongoDatabase database = client.getDatabase(DB_NAME);
        MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);
        // Define the index details
        Bson definition = new Document(
                "fields",
                List.of(
                        new Document("type", "vector")
                                .append("path", "embeddings_float32")
                                .append("numDimensions", 1024)
                                .append("similarity", "dotProduct"),
                        new Document("type", "vector")
                                .append("path", "embeddings_int8")
                                .append("numDimensions", 1024)
                                .append("similarity", "dotProduct"),
                        new Document("type", "vector")
                                .append("path", "embeddings_int1")
                                .append("numDimensions", 1024)
                                .append("similarity", "euclidean")
                )
        );
        // Define the index model
        SearchIndexModel indexModel = new SearchIndexModel(
                INDEX_NAME,
                definition,
                SearchIndexType.vectorSearch()
        );
        // Create the index using the defined model
        List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
        System.out.println("Successfully created vector index named: " + result.get(0));
        System.out.println("It may take up to a minute for the index to leave the BUILDING status and become queryable.");
        // Wait for Atlas to build the index
        System.out.println("Polling to confirm the index has changed from the BUILDING status.");
        waitForIndex(collection, INDEX_NAME);
    }

    public static <T> boolean waitForIndex(final MongoCollection<T> collection, final String indexName) {
        long startTime = System.nanoTime();
        long timeoutNanos = TimeUnit.SECONDS.toNanos(60);
        while (System.nanoTime() - startTime < timeoutNanos) {
            Document indexRecord = StreamSupport.stream(collection.listSearchIndexes().spliterator(), false)
                    .filter(index -> indexName.equals(index.getString("name")))
                    .findAny().orElse(null);
            if (indexRecord != null) {
                if ("FAILED".equals(indexRecord.getString("status"))) {
                    throw new RuntimeException("Search index has FAILED status.");
                }
                if (indexRecord.getBoolean("queryable")) {
                    System.out.println(indexName + " index is ready to query");
                    return true;
                }
            }
            try {
                Thread.sleep(100); // busy-wait, avoid in production
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return false;
    }
}
Replace the following placeholder values in the code and save the file.
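MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following commands to compile and execute your program.
javac UploadDataAndCreateIndex.java
java UploadDataAndCreateIndex
Successfully created vector index named: <INDEX-NAME>
It may take up to a minute for the index to leave the BUILDING status and become queryable.
Polling to confirm the index has changed from the BUILDING status.
<INDEX-NAME> index is ready to query
Log in to your Atlas cluster and verify the following:
Data in the namespace.
Atlas Vector Search index for the collection.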
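Create and run queries against the collection.
To test your embeddings, you can run a query against your collection. Use an embedding model provider to generate float, int8, and int1 embeddings for your query text. The following sample code uses Cohere's embed API to generate full-precision vectors. After generating the embeddings, use the MongoDB Java driver to convert your native vector embeddings to BSON vectors and run $vectorSearch queries against the collection.
Create a new file named CreateEmbeddingsAndRunQuery.java in your Java project.
touch CreateEmbeddingsAndRunQuery.java
Copy and paste the following code into the CreateEmbeddingsAndRunQuery.java file. This code does the following:
Generates the float32, int8, and ubinary vector embeddings by using Cohere's embed API.
Converts the embeddings to BSON binData vectors by using the MongoDB Java driver.
Runs the queries against your collection and returns the results.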
CreateEmbeddingsAndRunQuery.java
import com.cohere.api.Cohere;
import com.cohere.api.requests.EmbedRequest;
import com.cohere.api.types.EmbedResponse;
import com.cohere.api.types.EmbedByTypeResponse;
import com.cohere.api.types.EmbedByTypeResponseEmbeddings;
import com.cohere.api.types.EmbeddingType;
import com.cohere.api.types.EmbedInputType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.BinaryVector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.approximateVectorSearchOptions;
import static java.util.Arrays.asList;

public class CreateEmbeddingsAndRunQuery {
    private static final String COHERE_API_KEY = System.getenv("COHERE_API_KEY");
    private static final String MONGODB_URI = System.getenv("MONGODB_URI");
    private static final String DB_NAME = "<DATABASE-NAME>";
    private static final String COLLECTION_NAME = "<COLLECTION-NAME>";
    private static final String VECTOR_INDEX_NAME = "<INDEX-NAME>";
    private static final String DATA_FIELD_NAME = "<DATA-FIELD>";

    public static void main(String[] args) {
        String queryText = "<QUERY-TEXT>";

        try {
            // The processor type must match this class's name
            CreateEmbeddingsAndRunQuery processor = new CreateEmbeddingsAndRunQuery();
            Map<String, BinaryVector> embeddingsData = processor.generateAndConvertEmbeddings(queryText);
            processor.runVectorSearchQuery(embeddingsData);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Generate embeddings using Cohere's embed API from the query text
    public Map<String, BinaryVector> generateAndConvertEmbeddings(String text) throws Exception {
        if (COHERE_API_KEY == null || COHERE_API_KEY.isEmpty()) {
            throw new RuntimeException("API key not found. Set COHERE_API_KEY in your environment.");
        }

        Cohere cohere = Cohere.builder().token(COHERE_API_KEY).build();

        EmbedRequest request = EmbedRequest.builder()
                .model("embed-english-v3.0")
                .inputType(EmbedInputType.SEARCH_QUERY)
                .texts(List.of(text))
                .embeddingTypes(List.of(EmbeddingType.FLOAT, EmbeddingType.INT_8, EmbeddingType.UBINARY))
                .build();

        EmbedResponse response = cohere.embed(request);
        Optional<EmbedByTypeResponse> optionalEmbeddingsWrapper = response.getEmbeddingsByType();
        if (optionalEmbeddingsWrapper.isEmpty()) {
            throw new RuntimeException("No embeddings found in the API response.");
        }

        EmbedByTypeResponseEmbeddings embeddings = optionalEmbeddingsWrapper.get().getEmbeddings();
        return createBinaryVectorEmbeddings(embeddings);
    }

    // Convert embeddings to BSON binary vectors using MongoDB Java Driver
    private static Map<String, BinaryVector> createBinaryVectorEmbeddings(EmbedByTypeResponseEmbeddings embeddings) {
        Map<String, BinaryVector> binaryVectorEmbeddings = new HashMap<>();

        // Convert float embeddings
        List<Double> floatList = embeddings.getFloat().orElseThrow().get(0);
        if (floatList != null) {
            float[] floatData = listToFloatArray(floatList);
            BinaryVector floatVector = BinaryVector.floatVector(floatData);
            binaryVectorEmbeddings.put("float32", floatVector);
        }

        // Convert int8 embeddings
        List<Integer> int8List = embeddings.getInt8().orElseThrow().get(0);
        if (int8List != null) {
            byte[] int8Data = listToByteArray(int8List);
            BinaryVector int8Vector = BinaryVector.int8Vector(int8Data);
            binaryVectorEmbeddings.put("int8", int8Vector);
        }

        // Convert ubinary embeddings
        List<Integer> ubinaryList = embeddings.getUbinary().orElseThrow().get(0);
        if (ubinaryList != null) {
            byte[] int1Data = listToByteArray(ubinaryList);
            BinaryVector packedBitsVector = BinaryVector.packedBitVector(int1Data, (byte) 0);
            binaryVectorEmbeddings.put("int1", packedBitsVector);
        }

        return binaryVectorEmbeddings;
    }

    // Define and run $vectorSearch query using the embeddings
    public void runVectorSearchQuery(Map<String, BinaryVector> embeddingsData) {
        if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
            throw new RuntimeException("MongoDB URI not found. Set MONGODB_URI in your environment.");
        }

        try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
            MongoDatabase database = mongoClient.getDatabase(DB_NAME);
            MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);

            for (String path : embeddingsData.keySet()) {
                BinaryVector queryVector = embeddingsData.get(path);

                List<Bson> pipeline = asList(
                        vectorSearch(
                                fieldPath("embeddings_" + path),
                                queryVector,
                                VECTOR_INDEX_NAME,
                                2,
                                approximateVectorSearchOptions(5)
                        ),
                        project(
                                fields(
                                        exclude("_id"),
                                        include(DATA_FIELD_NAME),
                                        metaVectorSearchScore("vectorSearchScore")
                                )
                        )
                );

                List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());

                System.out.println("Results from " + path + " embeddings:");
                for (Document result : results) {
                    System.out.println(result.toJson());
                }
            }
        }
    }

    private static float[] listToFloatArray(List<Double> list) {
        float[] array = new float[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).floatValue();
        }
        return array;
    }

    private static byte[] listToByteArray(List<Integer> list) {
        byte[] array = new byte[list.size()];
        for (int i = 0; i < list.size(); i++) {
            array[i] = list.get(i).byteValue();
        }
        return array;
    }
}
Replace the following placeholder values in the code and save the file.
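MONGODB_URI
Your Atlas cluster connection string, if you didn't set the environment variable.
COHERE_API_KEY
Your Cohere API key, if you didn't set the environment variable.
<DATABASE-NAME>
Name of the database in your Atlas cluster. For this example, use sample_airbnb.
<COLLECTION-NAME>
Name of the collection where you ingested the data. For this example, use listingsAndReviews.
<INDEX-NAME>
Name of the Atlas Vector Search index for the collection.
<DATA-FIELD>
Name of the field that contains the text from which you generated the embeddings. For this example, use summary.
<QUERY-TEXT>
Text for the query. For this example, use ocean view.
Compile and run the file using your application run configuration.
If you are using a terminal, run the following commands to compile and execute your program.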
javac CreateEmbeddingsAndRunQuery.java java CreateEmbeddingsAndRunQuery Results from int1 embeddings: {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "vectorSearchScore": 0.6591796875} {"summary": "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", "vectorSearchScore": 0.6337890625} Results from int8 embeddings: {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "vectorSearchScore": 0.5215557217597961} {"summary": "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. 
Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", "vectorSearchScore": 0.5179016590118408} Results from float32 embeddings: {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "vectorSearchScore": 0.7278661131858826} {"summary": "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", "vectorSearchScore": 0.688639760017395}
To learn more about generating embeddings and converting them to binData vectors, see How to Create Vector Embeddings.
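Install the required libraries.
Run the following command to install the MongoDB Node.js driver. This operation might take a few minutes to complete.
npm install mongodb
You must install v6.11 or later of the Node.js driver. If necessary, you can also install libraries from your embedding model provider. For example, to generate float32, int8, and int1 embeddings by using Cohere as demonstrated on this page, install Cohere:
npm install cohere-ai dotenv
npm show cohere-ai version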
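Set the environment variables in your terminal.
To access the embedding model provider to generate and convert embeddings, set the environment variable for the embedding model provider's API key, if necessary.
To use embeddings from Cohere, set the COHERE_API_KEY environment variable.
export COHERE_API_KEY="<COHERE-API-KEY>"
If you don't set the environment variable, replace the <COHERE-API-KEY> in the sample code with the API key before running the code.
To access the Atlas cluster, set the MONGODB_URI environment variable.
export MONGODB_URI="<CONNECTION-STRING>"
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
If you don't set the environment variable, replace the <CONNECTION-STRING> in the sample code with your connection string before running the code.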
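The sample scripts on this page fall back to a placeholder string when a variable is unset, so a placeholder that was never replaced is only noticed when a request fails. The following is a minimal sketch of an early check, assuming you want the script to stop before it calls Cohere or Atlas. The helpers resolveSetting and isSrvConnectionString are hypothetical names for illustration and are not part of the MongoDB or Cohere libraries.

```javascript
// Hypothetical helper: returns the environment value, or the placeholder if unset,
// and throws when the result is empty or still looks like "<SOME-PLACEHOLDER>".
function resolveSetting(name, placeholder, env = process.env) {
  const value = env[name] || placeholder;
  if (!value || /^<.*>$/.test(value)) {
    throw new Error(`${name} is not set. Export it or replace ${placeholder} in the code.`);
  }
  return value;
}

// Hypothetical helper: checks only the SRV scheme shown in the format above.
// It does not contact the cluster or validate credentials.
function isSrvConnectionString(uri) {
  return uri.startsWith('mongodb+srv://');
}

// Example with a synthetic environment object instead of process.env
const sampleEnv = { MONGODB_URI: 'mongodb+srv://user:pass@cluster0.example.mongodb.net' };
const uri = resolveSetting('MONGODB_URI', '<CONNECTION-STRING>', sampleEnv);
console.log(isSrvConnectionString(uri)); // true
```

In a real script you would omit the third argument so that the helper reads from process.env.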
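Generate the vector embeddings for your data.
Create a file named get-embeddings.js to generate float32, int8, and int1 vector embeddings by using Cohere's embed API.
touch get-embeddings.js
Copy and paste the following code into the get-embeddings.js file. This code does the following:
Generates float32, int8, and int1 embeddings for the given data by using Cohere's embed-english-v3.0 embedding model.
Stores the float, int8, and int1 embeddings in fields named float, int8, and ubinary respectively.
Creates a file named embeddings.json and saves the embeddings in the file.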
get-embeddings.js
// Use 'require' for modules in a Node.js environment
const { CohereClient } = require('cohere-ai');
const { writeFile } = require('fs/promises');

// Retrieve API key from environment variables or default placeholder
const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

if (!apiKey) {
  throw new Error('API key not found. Please set COHERE_API_KEY in your environment.');
}

// Instantiate the CohereClient with the API key
const cohere = new CohereClient({ token: apiKey });

async function main() {
  try {
    // Data to embed
    const data = [
      "The Great Wall of China is visible from space.",
      "The Eiffel Tower was completed in Paris in 1889.",
      "Mount Everest is the highest peak on Earth at 8,848m.",
      "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
      "The Mona Lisa was painted by Leonardo da Vinci.",
    ];

    // Fetch embeddings for the data using the cohere API
    const response = await cohere.v2.embed({
      model: 'embed-english-v3.0',
      inputType: 'search_document',
      texts: data,
      embeddingTypes: ['float', 'int8', 'ubinary'],
    });

    // Extract embeddings from the API response
    const { float, int8, ubinary } = response.embeddings;

    // Map the embeddings to the text data
    const embeddingsData = data.map((text, index) => ({
      text,
      embeddings: {
        float: float[index],
        int8: int8[index],
        ubinary: ubinary[index],
      },
    }));

    // Write the embeddings data to a JSON file
    await writeFile('embeddings.json', JSON.stringify(embeddingsData, null, 2));
    console.log('Embeddings saved to embeddings.json');
  } catch (error) {
    console.error('Error fetching embeddings:', error);
  }
}

// Execute the main function
main();
If you didn't set your Cohere API key as an environment variable, replace the <COHERE-API-KEY> placeholder, and then save the file.
Run the code to generate the embeddings.
node get-embeddings.js
Embeddings saved to embeddings.json
Verify the embeddings in the generated embeddings.json file.
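One way to verify the file is to check the length and value range of each embedding. The following is a minimal sketch, assuming 1024-dimension embeddings (the numDimensions value used in the index definition on this page) and assuming that the ubinary embedding packs 8 dimensions into each byte, so that it holds one eighth as many values. The checkEntry helper is a hypothetical name for illustration.

```javascript
// Assumed dimension count, taken from the index definition used on this page
const EXPECTED_DIMENSIONS = 1024;

// Hypothetical validator: returns a list of problems found in one entry
// shaped like the objects that get-embeddings.js writes to embeddings.json.
function checkEntry(entry, dims = EXPECTED_DIMENSIONS) {
  const { float, int8, ubinary } = entry.embeddings;
  const problems = [];
  if (float.length !== dims) problems.push(`float has ${float.length} values`);
  if (int8.length !== dims) problems.push(`int8 has ${int8.length} values`);
  // Assumption: ubinary stores 8 dimensions per byte
  if (ubinary.length !== dims / 8) problems.push(`ubinary has ${ubinary.length} bytes`);
  if (int8.some((v) => !Number.isInteger(v) || v < -128 || v > 127)) {
    problems.push('int8 value out of range');
  }
  if (ubinary.some((v) => !Number.isInteger(v) || v < 0 || v > 255)) {
    problems.push('ubinary value out of range');
  }
  return problems;
}

// Synthetic 16-dimension entry for illustration only
const sample = {
  text: 'example',
  embeddings: {
    float: new Array(16).fill(0.1),
    int8: new Array(16).fill(-3),
    ubinary: [170, 15],
  },
};
console.log(checkEntry(sample, 16)); // []
```

To check the real file, you could parse embeddings.json with JSON.parse and call checkEntry on each element with the default dimension count.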
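Convert the vector embeddings to binData vectors.
Create a file named convert-embeddings.js to convert the float32, int8, and int1 vector embeddings from Cohere to BSON binData vectors by using the MongoDB Node.js driver.
touch convert-embeddings.js
Copy and paste the following code into the convert-embeddings.js file. This code does the following:
Generates BSON binData vectors for the float32, int8, and int1 embeddings.
Appends the float32, int8, and ubinary BSON binData vectors to the embeddings.json file.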
convert-embeddings.js
const fs = require('fs/promises');
const { BSON } = require('mongodb');
const { Binary } = BSON;

async function main() {
  try {
    // Read and parse the contents of 'embeddings.json' file
    const fileContent = await fs.readFile('embeddings.json', 'utf8');
    const embeddingsData = JSON.parse(fileContent);

    // Map the embeddings data to add BSON binary representations with subtype 9
    const convertEmbeddingsData = embeddingsData.map(({ text, embeddings }) => {
      // Create Binary for Float32Array with manual subtype 9
      const bsonFloat32 = Binary.fromFloat32Array(new Float32Array(embeddings.float));

      // Create Binary for Int8Array with subtype 9
      const bsonInt8 = Binary.fromInt8Array(new Int8Array(embeddings.int8));

      // Create Binary for PackedBits (Uint8Array) with subtype 9
      const bsonPackedBits = Binary.fromPackedBits(new Uint8Array(embeddings.ubinary));

      return {
        text,
        embeddings: {
          float: embeddings.float, // Original float data
          int8: embeddings.int8, // Original int8 data
          ubinary: embeddings.ubinary, // Original packed bits data
        },
        bsonEmbeddings: {
          float32: bsonFloat32,
          int8: bsonInt8,
          packedBits: bsonPackedBits,
        },
      };
    });

    // Serialize the updated data to EJSON for BSON compatibility
    const ejsonSerializedData = BSON.EJSON.stringify(convertEmbeddingsData, null, null, { relaxed: false });

    // Write the serialized data to 'embeddings.json'
    await fs.writeFile('embeddings.json', ejsonSerializedData);
    console.log('Embeddings with BSON vectors have been saved to embeddings.json');
  } catch (error) {
    console.error('Error processing embeddings:', error);
  }
}

main();
Run the program to generate the BSON binData vectors.
node convert-embeddings.js
Embeddings with BSON vectors have been saved to embeddings.json
Verify the generated BSON embeddings in the embeddings.json file.
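When you inspect the file, each BSON vector should appear as an Extended JSON $binary object with subtype 09. The following is a minimal sketch that decodes the header of such a payload. It assumes the BSON binary vector layout in which the first byte identifies the element type (0x27 for float32, 0x03 for int8, 0x10 for packed bits) and the second byte holds the number of padding bits. The describeVector helper is a hypothetical name for illustration, and the sample payload is built by hand rather than taken from a real embedding.

```javascript
// Assumed element type codes from the BSON binary vector layout
const DTYPES = { 0x27: 'float32', 0x03: 'int8', 0x10: 'packed_bit' };

// Hypothetical helper: summarizes one { $binary: { base64, subType } } value
function describeVector(ejsonBinary) {
  const { base64, subType } = ejsonBinary.$binary;
  if (subType !== '09') {
    throw new Error(`Not a vector: subType ${subType}`);
  }
  const bytes = Buffer.from(base64, 'base64');
  return {
    dtype: DTYPES[bytes[0]] ?? 'unknown', // byte 0: element type
    padding: bytes[1],                    // byte 1: unused bits in the last byte
    dataBytes: bytes.length - 2,          // remaining bytes hold the vector data
  };
}

// Hand-built int8 vector [1, -1, 2]: two header bytes, then one byte per value
const payload = Buffer.from([0x03, 0x00, 0x01, 0xff, 0x02]).toString('base64');
console.log(describeVector({ $binary: { base64: payload, subType: '09' } }));
// { dtype: 'int8', padding: 0, dataBytes: 3 }
```

For a 1024-dimension embedding, you would expect dataBytes to be 4096 for float32, 1024 for int8, and 128 for packed bits under these assumptions.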
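Connect to the Atlas cluster and upload the data to a collection.
Create a file named upload-data.js to connect to the Atlas cluster and create a collection in a database for the data in the embeddings.json file.
touch upload-data.js
Copy and paste the following code into the upload-data.js file. This code does the following:
Connects to your Atlas cluster and creates a namespace with the database and collection name that you specify.
Uploads the data, including the embeddings, in the embeddings.json file to the specified namespace.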
upload-data.js
const fs = require('fs/promises'); // Use fs/promises for asynchronous operations
const { MongoClient, BSON } = require('mongodb'); // Import from the 'mongodb' package

const { Binary } = BSON; // Ensure the Binary class is imported correctly

async function main() {
  const MONGODB_URI = process.env.MONGODB_URI || "<CONNECTION-STRING>";
  const DB_NAME = "<DB-NAME>";
  const COLLECTION_NAME = "<COLLECTION-NAME>";

  let client;
  try {
    client = new MongoClient(MONGODB_URI);
    await client.connect();
    console.log("Connected to MongoDB");

    const db = client.db(DB_NAME);
    const collection = db.collection(COLLECTION_NAME);

    // Read and parse the contents of 'embeddings.json' file using EJSON
    const fileContent = await fs.readFile('embeddings.json', 'utf8');
    const embeddingsData = BSON.EJSON.parse(fileContent);

    // Map embeddings data to recreate BSON binary representations with the correct subtype
    const documents = embeddingsData.map(({ text, bsonEmbeddings }) => {
      return {
        text,
        bsonEmbeddings: {
          float32: bsonEmbeddings.float32,
          int8: bsonEmbeddings.int8,
          int1: bsonEmbeddings.packedBits
        }
      };
    });

    const result = await collection.insertMany(documents);
    console.log(`Inserted ${result.insertedCount} documents into MongoDB`);

  } catch (error) {
    console.error('Error storing embeddings in MongoDB:', error);
  } finally {
    if (client) {
      await client.close();
    }
  }
}

// Run the store function
main();
Replace the following settings and save the file.
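<CONNECTION-STRING>
Connection string for the Atlas cluster where you want to create the database and collection.
Replace this value only if you didn't set the MONGODB_URI environment variable.
<DB-NAME>
Name of the database where you want to create the collection.
<COLLECTION-NAME>
Name of the collection where you want to store the generated embeddings.
Run the following command to upload the data.
node upload-data.js
Verify that the documents exist in the collection on your Atlas cluster.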
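Create the Atlas Vector Search index on the collection.
Create a file named create-index.js to define an Atlas Vector Search index on the collection.
touch create-index.js
Copy and paste the following code into the create-index.js file to create the index. This code does the following:
Connects to the Atlas cluster and creates an index with the specified name for the specified namespace.
Indexes the bsonEmbeddings.float32 and bsonEmbeddings.int8 fields as the vector type by using the dotProduct similarity function, and the bsonEmbeddings.int1 field as the vector type by using the euclidean function.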
create-index.js
const { MongoClient } = require("mongodb");
const { setTimeout } = require("timers/promises"); // Import from timers/promises

// Connect to your Atlas deployment
const uri = process.env.MONGODB_URI || "<CONNECTION-STRING>";

const client = new MongoClient(uri);

async function main() {
  try {
    const database = client.db("<DB-NAME>");
    const collection = database.collection("<COLLECTION-NAME>");

    // Define your Atlas Vector Search index
    const index = {
      name: "<INDEX-NAME>",
      type: "vectorSearch",
      definition: {
        fields: [
          {
            type: "vector",
            numDimensions: 1024,
            path: "bsonEmbeddings.float32",
            similarity: "dotProduct",
          },
          {
            type: "vector",
            numDimensions: 1024,
            path: "bsonEmbeddings.int8",
            similarity: "dotProduct",
          },
          {
            type: "vector",
            numDimensions: 1024,
            path: "bsonEmbeddings.int1",
            similarity: "euclidean",
          },
        ],
      },
    };

    // Run the helper method
    const result = await collection.createSearchIndex(index);
    console.log(`New search index named ${result} is building.`);

    // Wait for the index to be ready to query
    console.log("Polling to check if the index is ready. This may take up to a minute.");
    let isQueryable = false;

    // Use filtered search for index readiness
    while (!isQueryable) {
      const [indexData] = await collection.listSearchIndexes(index.name).toArray();

      if (indexData) {
        isQueryable = indexData.queryable;
        if (!isQueryable) {
          await setTimeout(5000); // Wait for 5 seconds before checking again
        }
      } else {
        // Handle the case where the index might not be found
        console.log(`Index ${index.name} not found.`);
        await setTimeout(5000); // Wait for 5 seconds before checking again
      }
    }

    console.log(`${result} is ready for querying.`);
  } catch (error) {
    console.error("Error:", error);
  } finally {
    await client.close();
  }
}

main().catch((err) => {
  console.error("Unhandled error:", err);
});
Replace the following settings and save the file.
<CONNECTION-STRING>
用于连接到您要在其中创建该索引的 Atlas 集群的连接字符串。
仅当您没有设置
MONGODB_URI
环境变量时,才需替换该值。<DB-NAME>
您要在其中创建集合的数据库的名称。
<COLLECTION-NAME>
您要存储生成的嵌入的集合的名称。
<INDEX-NAME>
集合的索引名称。
创建索引。
node create-index.js
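The polling loop in create-index.js runs until the index reports that it is queryable, with no upper bound on the wait. The following is a minimal sketch of the same polling pattern with a timeout added. The waitUntil helper and the fakeCheck function are hypothetical names introduced for illustration; in the real script the check would read the queryable flag returned by listSearchIndexes().

```javascript
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical helper: polls `check` until it resolves to true or the timeout elapses.
// Returns true if the check succeeded, false if the deadline passed first.
async function waitUntil(check, { intervalMs = 5000, timeoutMs = 120000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) {
      return true;
    }
    await sleep(intervalMs);
  }
  return false;
}

// Stand-in for a readiness check; it reports "ready" on the third call.
let calls = 0;
const fakeCheck = async () => ++calls >= 3;

const demo = waitUntil(fakeCheck, { intervalMs: 10, timeoutMs: 1000 }).then((ready) => {
  console.log(ready ? `Ready after ${calls} checks.` : 'Timed out.');
  return ready;
});
```

Returning false on timeout, rather than throwing, leaves it to the caller to decide whether a slow index build is an error.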
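Generate embeddings for the query text.
Create a file named get-query-embeddings.js.
touch get-query-embeddings.js
Copy and paste the code into the get-query-embeddings.js file. The sample code does the following:
Uses Cohere to generate float32, int8, and int1 embeddings for the query text.
Uses the BSON library exported by the MongoDB Node.js driver to convert the generated embeddings to BSON binData vectors.
Saves the generated embeddings to a file named query-embeddings.json.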
get-query-embeddings.js
const { CohereClient } = require('cohere-ai');
const { BSON } = require('mongodb');
const { writeFile } = require('fs/promises');
const dotenv = require('dotenv');
const process = require('process');

// Load environment variables
dotenv.config();

const { Binary } = BSON;

// Get the API key from environment variables or set the key here
const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

// Fail early if the key is missing or the placeholder was never replaced
if (!apiKey || apiKey === '<COHERE-API-KEY>') {
  throw new Error('API key not found. Provide the COHERE_API_KEY.');
}

// Initialize CohereClient
const cohere = new CohereClient({ token: apiKey });

async function main(queryText) {
  try {
    if (typeof queryText !== 'string' || queryText.trim() === '') {
      throw new Error('Invalid query text. It must be a non-empty string.');
    }

    const data = [queryText];

    // Request embeddings from the Cohere API
    const response = await cohere.v2.embed({
      model: 'embed-english-v3.0',
      inputType: 'search_query',
      texts: data,
      embeddingTypes: ['float', 'int8', 'ubinary'], // Request all required embedding types
    });

    if (!response.embeddings) {
      throw new Error('Embeddings not found in the API response.');
    }

    const { float, int8, ubinary } = response.embeddings;

    const updatedEmbeddingsData = data.map((text, index) => {
      // Create a BSON Binary vector for each embedding type
      const float32Binary = Binary.fromFloat32Array(new Float32Array(float[index])); // VECTOR_TYPE.FLOAT32
      const int8Binary = Binary.fromInt8Array(new Int8Array(int8[index])); // VECTOR_TYPE.INT8
      const packedBitsBinary = Binary.fromPackedBits(new Uint8Array(ubinary[index])); // VECTOR_TYPE.PACKED_BIT

      return {
        text,
        embeddings: {
          float: float[index],
          int8: int8[index],
          ubinary: ubinary[index],
        },
        bsonEmbeddings: {
          float32: float32Binary,
          int8: int8Binary,
          int1: packedBitsBinary,
        },
      };
    });

    // Serialize the embeddings as canonical EJSON so the BSON types survive the round trip
    const outputFileName = 'query-embeddings.json';
    const ejsonSerializedData = BSON.EJSON.stringify(updatedEmbeddingsData, null, null, { relaxed: false });
    await writeFile(outputFileName, ejsonSerializedData);
    console.log(`Embeddings with BSON vectors have been saved to ${outputFileName}`);
  } catch (error) {
    console.error('Error processing query text:', error);
  }
}

// Entry point: generate embeddings for the query string
(async () => {
  const queryText = "<QUERY-TEXT>"; // Replace with your actual query text
  await main(queryText);
})();

Replace the following settings and save the file.
<COHERE-API-KEY>
Your Cohere API key. Replace this value only if you didn't set the COHERE_API_KEY environment variable.
<QUERY-TEXT>
Your query text. For this tutorial, use science fact.
Run the code to generate embeddings for the query text.
node get-query-embeddings.js
Embeddings with BSON vectors have been saved to query-embeddings.json
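The scripts on this page read each setting from an environment variable and fall back to a literal placeholder such as <COHERE-API-KEY>. Because the placeholder is itself a non-empty string, it can reach the API unnoticed. One way to catch that early is a small guard like the sketch below. The readSetting helper is a hypothetical name introduced for illustration and isn't part of any library.

```javascript
// Hypothetical helper: returns the environment value if it is set, otherwise the fallback.
// Throws when the result is empty or still looks like an unreplaced <PLACEHOLDER>.
function readSetting(name, fallback, env = process.env) {
  const value = env[name] || fallback;
  if (!value || /^<[A-Z-]+>$/.test(value)) {
    throw new Error(`${name} is not set. Export it or replace the placeholder.`);
  }
  return value;
}

// An explicit environment object keeps this example deterministic.
const apiKey = readSetting('COHERE_API_KEY', '<COHERE-API-KEY>', { COHERE_API_KEY: 'example-key' });
console.log(apiKey); // prints "example-key"
```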
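Run an Atlas Vector Search query.
Create a file named run-query.js.
touch run-query.js
Copy and paste the following sample $vectorSearch query into the run-query.js file. The sample query does the following:
Connects to your Atlas cluster and uses the embeddings in the query-embeddings.json file to run $vectorSearch queries against the bsonEmbeddings.float32, bsonEmbeddings.int8, and bsonEmbeddings.int1 fields in the specified collection.
Prints the results for the Float32, Int8, and Packed Binary (Int1) embeddings to the console.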
run-query.js
const { MongoClient, BSON } = require('mongodb'); // BSON provides EJSON parsing
const fs = require('fs/promises');
const dotenv = require('dotenv');

dotenv.config();

// MongoDB connection details
const mongoUri = process.env.MONGODB_URI || '<CONNECTION-STRING>';
const dbName = '<DB-NAME>'; // Update with your actual database name
const collectionName = '<COLLECTION-NAME>'; // Update with your actual collection name

// The index name must match your Atlas Vector Search index
const VECTOR_INDEX_NAME = '<INDEX-NAME>'; // Replace with your actual index name
const NUM_CANDIDATES = 5; // Number of candidate documents for the search
const LIMIT = 2; // Number of documents to return

// Subfields of bsonEmbeddings that hold the BSON vectors (binData subtype 9)
const FIELDS = ['float32', 'int8', 'int1'];

// Read the BSON query vectors from JSON and run a vector search for each one
async function main() {
  // Initialize MongoDB client
  const client = new MongoClient(mongoUri);

  try {
    await client.connect();
    console.log("Connected to MongoDB");

    const db = client.db(dbName);
    const collection = db.collection(collectionName);

    // Load query embeddings from the JSON file using EJSON parsing
    const fileContent = await fs.readFile('query-embeddings.json', 'utf8');
    const embeddingsData = BSON.EJSON.parse(fileContent);

    // Define and run the query for each embedding type
    const results = {};

    for (const path of FIELDS) {
      // Use the BSON Binary object directly as the query vector
      const bsonQueryVector = embeddingsData[0]?.bsonEmbeddings?.[path];

      if (!bsonQueryVector) {
        console.warn(`BSON embedding for ${path} not found in the JSON.`);
        continue;
      }

      const pipeline = [
        {
          $vectorSearch: {
            index: VECTOR_INDEX_NAME,
            path: `bsonEmbeddings.${path}`,
            queryVector: bsonQueryVector,
            numCandidates: NUM_CANDIDATES,
            limit: LIMIT,
          }
        },
        {
          $project: {
            _id: 0,
            text: 1, // Adjust projection fields as necessary to match your document structure
            score: { $meta: 'vectorSearchScore' }
          }
        }
      ];

      results[path] = await collection.aggregate(pipeline).toArray();
    }

    return results;
  } catch (error) {
    console.error('Error during vector search:', error);
  } finally {
    await client.close();
  }
}

// Main execution block
(async () => {
  try {
    const results = await main();

    if (results) {
      console.log("Results from Float32 embeddings:");
      console.table(results.float32 || []);
      console.log("--------------------------------------------------------------------------");

      console.log("Results from Int8 embeddings:");
      console.table(results.int8 || []);
      console.log("--------------------------------------------------------------------------");

      console.log("Results from Packed Binary (PackedBits) embeddings:");
      console.table(results.int1 || []);
    }
  } catch (error) {
    console.error('Error executing main function:', error);
  }
})();

Replace the following settings and save the run-query.js file.
<CONNECTION-STRING>
Connection string for the Atlas cluster where you want to run the query. Replace this value only if you didn't set the MONGODB_URI environment variable.
<DB-NAME>
Name of the database that contains the collection.
<COLLECTION-NAME>
Name of the collection that you want to query.
<INDEX-NAME>
Name of the index for the collection.
Run the following command to execute the query.
node run-query.js
Connected to MongoDB
Results from Float32 embeddings:
┌─────────┬─────────────────────────────────────────────────────────┬────────────────────┐
│ (index) │                          text                           │       score        │
├─────────┼─────────────────────────────────────────────────────────┼────────────────────┤
│    0    │ 'Mount Everest is the highest peak on Earth at 8,848m.' │ 0.6583383083343506 │
│    1    │    'The Great Wall of China is visible from space.'     │ 0.6536108255386353 │
└─────────┴─────────────────────────────────────────────────────────┴────────────────────┘
--------------------------------------------------------------------------
Results from Int8 embeddings:
┌─────────┬─────────────────────────────────────────────────────────┬────────────────────┐
│ (index) │                          text                           │       score        │
├─────────┼─────────────────────────────────────────────────────────┼────────────────────┤
│    0    │ 'Mount Everest is the highest peak on Earth at 8,848m.' │ 0.5149773359298706 │
│    1    │    'The Great Wall of China is visible from space.'     │ 0.5146723985671997 │
└─────────┴─────────────────────────────────────────────────────────┴────────────────────┘
--------------------------------------------------------------------------
Results from Packed Binary (PackedBits) embeddings:
┌─────────┬─────────────────────────────────────────────────────────┬─────────────┐
│ (index) │                          text                           │    score    │
├─────────┼─────────────────────────────────────────────────────────┼─────────────┤
│    0    │ 'Mount Everest is the highest peak on Earth at 8,848m.' │ 0.642578125 │
│    1    │    'The Great Wall of China is visible from space.'     │ 0.61328125  │
└─────────┴─────────────────────────────────────────────────────────┴─────────────┘
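Each of the three searches uses the same two-stage pipeline and differs only in the vector field and query vector. The sketch below factors that pipeline into a function and checks one constraint that is easy to miss, namely that numCandidates should not be smaller than limit. The buildVectorSearchPipeline helper, the example_index name, and the plain-array query vector are assumptions made for illustration; the tutorial passes a BSON binData vector instead.

```javascript
// Hypothetical helper that assembles the aggregation pipeline used for each vector field.
function buildVectorSearchPipeline({ index, path, queryVector, numCandidates, limit }) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer.');
  }
  if (!Number.isInteger(numCandidates) || numCandidates < limit) {
    throw new RangeError('numCandidates must be an integer greater than or equal to limit.');
  }
  return [
    { $vectorSearch: { index, path, queryVector, numCandidates, limit } },
    { $project: { _id: 0, text: 1, score: { $meta: 'vectorSearchScore' } } },
  ];
}

const pipeline = buildVectorSearchPipeline({
  index: 'example_index', // placeholder index name
  path: 'bsonEmbeddings.int8',
  queryVector: [1, -2, 3], // stand-in for a BSON binData vector
  numCandidates: 5,
  limit: 2,
});
console.log(JSON.stringify(pipeline[0]));
```

Raising numCandidates relative to limit generally trades latency for recall, which is why the two values are kept as separate parameters.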
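Install the required libraries.
Run the following command to install the MongoDB Node.js driver. This operation might take a few minutes to complete.
npm install mongodb
You must install Node.js driver v6.11 or later. If necessary, also install the library from your embedding model provider. For example, to use Cohere to generate the float32, int8, and int1 embeddings shown on this page, install Cohere:
npm install cohere-ai dotenv
npm show cohere-ai version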
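Set the environment variables in your terminal.
To access the embedding model provider to generate and convert embeddings, set the environment variable for the provider's API key as needed.
To use Cohere's embeddings, set the COHERE_API_KEY environment variable.
export COHERE_API_KEY="<COHERE-API-KEY>"
If you don't set the environment variable, replace <COHERE-API-KEY> in the sample code with your API key before you run the code.
To access your Atlas cluster, set the MONGODB_URI environment variable.
export MONGODB_URI="<CONNECTION-STRING>"
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
If you don't set the environment variable, replace <CONNECTION-STRING> in the sample code with your connection string before you run the code.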
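When a database username or password contains characters such as @, : or /, they need to be percent-encoded before they go into the connection string, or the URI is parsed incorrectly. The following sketch assembles a string in the format shown above. The buildAtlasUri helper and all of the sample values are hypothetical.

```javascript
// Hypothetical helper: assembles an SRV connection string in the format shown above.
// Percent-encoding keeps characters such as '@', ':' or '/' in the credentials
// from being read as URI delimiters.
function buildAtlasUri({ username, password, clusterName, hostname }) {
  const user = encodeURIComponent(username);
  const pass = encodeURIComponent(password);
  return `mongodb+srv://${user}:${pass}@${clusterName}.${hostname}.mongodb.net`;
}

const uri = buildAtlasUri({
  username: 'appUser',
  password: 'p@ss:word/1', // made-up password with reserved characters
  clusterName: 'cluster0',
  hostname: 'abcde',
});
console.log(uri);
// prints "mongodb+srv://appUser:p%40ss%3Aword%2F1@cluster0.abcde.mongodb.net"
```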
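Fetch the data from your Atlas cluster.
Create a file named get-data.js.
touch get-data.js
Copy and paste the following sample code to fetch data from the sample_airbnb.listingsAndReviews namespace in your Atlas cluster. The sample code does the following:
Connects to your Atlas cluster and finds documents that have a summary field.
Creates a file named subset.json and writes the data from the collection to that file.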
get-data.js
const { MongoClient } = require('mongodb');
const fs = require('fs'); // File system module for writing the output file

async function main() {
  // Replace with your Atlas connection string
  const uri = process.env.MONGODB_URI || '<CONNECTION-STRING>';

  // Create a new MongoClient instance
  const client = new MongoClient(uri);

  try {
    // Connect to your Atlas cluster
    await client.connect();

    // Specify the database and collection
    const db = client.db('sample_airbnb');
    const collection = db.collection('listingsAndReviews');

    // Filter to exclude null or empty summary fields
    const filter = { summary: { $nin: [null, ''] } };

    // Get a subset of documents in the collection
    const documentsCursor = collection.find(filter).limit(50);

    // Convert the cursor to an array to get the documents
    const documents = await documentsCursor.toArray();

    // Log the documents to verify their content
    console.log('Documents retrieved:', documents);

    // Write the documents to a local file called "subset.json"
    const outputFilePath = './subset.json';
    fs.writeFileSync(outputFilePath, JSON.stringify(documents, null, 2), 'utf-8');

    console.log(`Subset of documents written to: ${outputFilePath}`);
  } catch (error) {
    console.error('An error occurred:', error);
  } finally {
    // Ensure the client is closed when finished
    await client.close();
  }
}

main().catch(console.error);

If you didn't set an environment variable for your Atlas connection string, replace the <CONNECTION-STRING> placeholder and save the file. Run the following command to fetch the data:
node get-data.js
Subset of documents written to: ./subset.json
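The script fetches 50 documents, which is a small enough set to send to an embedding API in a single request. If you raise the limit, you might need to split the texts into batches, because embedding providers typically cap the number of texts per call. The sketch below shows one way to batch. The chunk helper is a hypothetical name, and the value 96 is an assumed per-request limit that you should confirm against your provider's documentation.

```javascript
// Hypothetical helper: splits an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer.');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const MAX_TEXTS_PER_CALL = 96; // assumed per-request limit; confirm with your provider
const texts = Array.from({ length: 200 }, (_, i) => `summary ${i}`);
const batches = chunk(texts, MAX_TEXTS_PER_CALL);
console.log(batches.map((batch) => batch.length)); // prints [ 96, 96, 8 ]
```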
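Generate vector embeddings for your data.
If your collection already has float32, int8, or int1 vector embeddings, skip this step.
Create a file named get-embeddings.js to generate float32, int8, and int1 vector embeddings by using Cohere's embed API.
touch get-embeddings.js
Copy and paste the following code into the get-embeddings.js file. The code does the following:
Generates float32, int8, and int1 embeddings for the given data by using Cohere's embed-english-v3.0 embedding model.
Stores the float32, int8, and int1 embeddings in fields named float, int8, and ubinary respectively.
Creates a file named embeddings.json and saves the embeddings to that file.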
get-embeddings.js
// Import the required modules using CommonJS syntax
const { CohereClient } = require('cohere-ai');
const { readFile, writeFile } = require('fs/promises');

// Retrieve the API key from environment variables or provide a placeholder
const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

if (!apiKey || apiKey === '<COHERE-API-KEY>') {
  throw new Error('API key not found. Please set COHERE_API_KEY in your environment.');
}

// Initialize the Cohere client with the API key
const cohere = new CohereClient({ token: apiKey });

async function main() {
  try {
    // Read and parse the contents of 'subset.json'
    const subsetData = await readFile('subset.json', 'utf-8');
    const documents = JSON.parse(subsetData);

    // Extract the 'summary' fields that are non-empty strings
    const data = documents
      .map(doc => doc.summary)
      .filter(summary => typeof summary === 'string' && summary.length > 0);

    if (data.length === 0) {
      throw new Error('No valid summary texts available in the data.');
    }

    // Request embeddings from the Cohere API
    const response = await cohere.v2.embed({
      model: 'embed-english-v3.0',
      inputType: 'search_document',
      texts: data,
      embeddingTypes: ['float', 'int8', 'ubinary'],
    });

    // Extract embeddings from the API response
    const { float, int8, ubinary } = response.embeddings;

    // Pair each text with its three embeddings
    const embeddingsData = data.map((text, index) => ({
      text,
      embeddings: {
        float: float[index],
        int8: int8[index],
        ubinary: ubinary[index],
      },
    }));

    // Write the embeddings data to 'embeddings.json'
    await writeFile('embeddings.json', JSON.stringify(embeddingsData, null, 2));
    console.log('Embeddings saved to embeddings.json');
  } catch (error) {
    console.error('Error fetching embeddings:', error);
  }
}

// Execute the main function
main();

If you didn't set an environment variable for your Cohere API key, replace the <COHERE-API-KEY> placeholder and save the file. Run the code to generate the embeddings.
node get-embeddings.js
Embeddings saved to embeddings.json
Verify the generated embeddings by opening the generated embeddings.json file.
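Opening the file shows whether embeddings were written, but a programmatic check can also confirm their shape. The sketch below assumes that the float and int8 arrays each hold one value per dimension and that the ubinary array packs eight dimensions into each byte, so it is one eighth as long. The validateEmbeddings helper is hypothetical, and the sample uses 8 dimensions for brevity, whereas the index definitions on this page use 1024.

```javascript
// Hypothetical validator for the `embeddings` object of one entry in embeddings.json.
// Returns a list of problems; an empty list means the shapes look consistent.
function validateEmbeddings(embeddings, numDimensions) {
  const problems = [];
  const { float, int8, ubinary } = embeddings;
  const isIntInRange = (min, max) => (v) => Number.isInteger(v) && v >= min && v <= max;

  if (!Array.isArray(float) || float.length !== numDimensions) {
    problems.push(`float must have ${numDimensions} values`);
  }
  if (!Array.isArray(int8) || int8.length !== numDimensions || !int8.every(isIntInRange(-128, 127))) {
    problems.push(`int8 must have ${numDimensions} integers in [-128, 127]`);
  }
  if (!Array.isArray(ubinary) || ubinary.length !== numDimensions / 8 || !ubinary.every(isIntInRange(0, 255))) {
    problems.push(`ubinary must have ${numDimensions / 8} integers in [0, 255]`);
  }
  return problems;
}

// Tiny 8-dimensional sample with made-up values.
const sample = {
  float: [0.5, -0.1, 0.3, 0.9, -0.7, -0.2, 0.4, -0.6],
  int8: [64, -13, 38, 115, -90, -26, 51, -77],
  ubinary: [178],
};
console.log(validateEmbeddings(sample, 8)); // prints []
```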
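Convert the vector embeddings to binData vectors.
Create a file named convert-embeddings.js to convert the float32, int8, and int1 vector embeddings from Cohere to BSON binData vectors.
touch convert-embeddings.js
Copy and paste the following code into the convert-embeddings.js file. The code does the following:
Generates BSON binData vectors for the float32, int8, and int1 embeddings.
Appends the float32, int8, and ubinary BSON binData vectors to the embeddings.json file.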
convert-embeddings.js
const fs = require('fs/promises');
const { BSON } = require('mongodb');
const { Binary } = BSON;

async function main() {
  try {
    // Read and parse the contents of 'embeddings.json'
    const fileContent = await fs.readFile('embeddings.json', 'utf8');
    const embeddingsData = JSON.parse(fileContent);

    // Add BSON binary representations (binData subtype 9) to each entry
    const convertEmbeddingsData = embeddingsData.map(({ text, embeddings }) => {
      // Create a float32 vector Binary
      const bsonFloat32 = Binary.fromFloat32Array(new Float32Array(embeddings.float));

      // Create an int8 vector Binary
      const bsonInt8 = Binary.fromInt8Array(new Int8Array(embeddings.int8));

      // Create a packed-bit vector Binary from the Uint8Array of packed bits
      const bsonPackedBits = Binary.fromPackedBits(new Uint8Array(embeddings.ubinary));

      return {
        text,
        embeddings: {
          float: embeddings.float, // Original float data
          int8: embeddings.int8, // Original int8 data
          ubinary: embeddings.ubinary, // Original packed bits data
        },
        bsonEmbeddings: {
          float32: bsonFloat32,
          int8: bsonInt8,
          packedBits: bsonPackedBits,
        },
      };
    });

    // Serialize the updated data to canonical EJSON for BSON compatibility
    const ejsonSerializedData = BSON.EJSON.stringify(convertEmbeddingsData, null, null, { relaxed: false });

    // Write the serialized data to 'embeddings.json'
    await fs.writeFile('embeddings.json', ejsonSerializedData);
    console.log('Embeddings with BSON vectors have been saved to embeddings.json');
  } catch (error) {
    console.error('Error processing embeddings:', error);
  }
}

main();

Run the program to generate the BSON binData vectors.
node convert-embeddings.js
Embeddings with BSON vectors have been saved to embeddings.json
Verify the generated BSON embeddings in the embeddings.json file.
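The int1 vectors are stored as packed bits, with eight dimensions per byte. To make that layout concrete, the sketch below binarizes a float vector and packs the resulting bits into bytes. It is an illustration under stated assumptions and not the embedding provider's or the driver's implementation: the packBits helper is hypothetical, the threshold of 0 and the most-significant-bit-first order are assumptions, and the input values are made up.

```javascript
// Hypothetical helper: maps each value to a bit (1 if value > 0, else 0) and packs
// the bits into bytes, most significant bit first (assumed bit order).
function packBits(values) {
  if (values.length % 8 !== 0) {
    throw new RangeError('The number of dimensions must be a multiple of 8.');
  }
  const packed = new Uint8Array(values.length / 8);
  values.forEach((value, i) => {
    if (value > 0) {
      packed[i >> 3] |= 0x80 >> (i & 7);
    }
  });
  return packed;
}

// Bits for this vector are 1,0,1,1,0,0,1,0, which is 0b10110010.
const packed = packBits([0.5, -0.1, 0.3, 0.9, -0.7, -0.2, 0.4, -0.6]);
console.log(packed[0]); // prints 178
```

This also shows why a packed-bit vector needs a dimension count that is a multiple of 8: a 1024-dimension embedding packs into exactly 128 bytes.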
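Connect to the Atlas cluster and upload the data to the namespace.
Create a file named upload-data.js to connect to the Atlas cluster and upload the data to the sample_airbnb.listingsAndReviews namespace.
touch upload-data.js
Copy and paste the following code into the upload-data.js file. The code does the following:
Connects to your Atlas cluster and accesses the namespace with the database and collection names that you specify.
Uploads the data, including the embeddings, to the sample_airbnb.listingsAndReviews namespace.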
upload-data.js
const fs = require('fs/promises'); // Use fs/promises for asynchronous file operations
const { MongoClient, BSON } = require('mongodb'); // BSON is re-exported by the driver
const { EJSON } = BSON; // Use the driver's EJSON so BSON types match the driver version

async function main() {
  const MONGODB_URI = process.env.MONGODB_URI || "<CONNECTION-STRING>";
  const DB_NAME = "sample_airbnb";
  const COLLECTION_NAME = "listingsAndReviews";

  let client;
  try {
    // Connect to MongoDB
    client = new MongoClient(MONGODB_URI);
    await client.connect();
    console.log("Connected to MongoDB");

    // Access database and collection
    const db = client.db(DB_NAME);
    const collection = db.collection(COLLECTION_NAME);

    // Load embeddings from JSON using EJSON.parse
    const fileContent = await fs.readFile('embeddings.json', 'utf8');
    const embeddingsData = EJSON.parse(fileContent);

    // Keep the summary text and the BSON binData vectors
    const documents = embeddingsData.map(({ text, bsonEmbeddings }) => {
      return {
        summary: text,
        bsonEmbeddings: {
          float32: bsonEmbeddings.float32,
          int8: bsonEmbeddings.int8,
          int1: bsonEmbeddings.packedBits
        }
      };
    });

    // Upsert each document into the collection, matching on the summary text
    for (const doc of documents) {
      const filter = { summary: doc.summary };
      const update = { $set: doc };

      // Update the document with the BSON binary data
      const result = await collection.updateOne(filter, update, { upsert: true });
      if (result.matchedCount > 0) {
        console.log(`Updated document with summary: ${doc.summary}`);
      } else {
        console.log(`Inserted new document with summary: ${doc.summary}`);
      }
    }

    console.log("Embeddings stored in MongoDB successfully.");
  } catch (error) {
    console.error('Error storing embeddings in MongoDB:', error);
  } finally {
    if (client) {
      await client.close();
    }
  }
}

// Run the main function to load the data
main();

If you didn't set an environment variable for your Atlas connection string, replace the <CONNECTION-STRING> placeholder and save the file. Run the following command to upload the data.
node upload-data.js
Connected to MongoDB
Updated document with summary: ...
...
Embeddings stored in MongoDB successfully.
Verify by logging in to your Atlas cluster and checking the namespace in the Data Explorer.
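The upload script issues one updateOne call per document, which is simple but costs a round trip for each one. For larger data sets, the same upserts can be grouped into a single batch. The sketch below only builds the list of operations and does not connect to a cluster. The buildUpsertOps helper and the sample documents are hypothetical; with a connected driver collection, the resulting array could be passed to collection.bulkWrite().

```javascript
// Hypothetical helper: turns documents into updateOne upsert operations,
// matching on the summary text as the upload script does.
function buildUpsertOps(documents) {
  return documents.map((doc) => ({
    updateOne: {
      filter: { summary: doc.summary },
      update: { $set: doc },
      upsert: true,
    },
  }));
}

// Made-up documents; real ones would carry BSON binData vectors in bsonEmbeddings.
const ops = buildUpsertOps([
  { summary: 'Ocean view condo', bsonEmbeddings: {} },
  { summary: 'Quiet city loft', bsonEmbeddings: {} },
]);
console.log(ops.length); // prints 2
console.log(ops[0].updateOne.filter); // prints { summary: 'Ocean view condo' }
```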
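Create the Atlas Vector Search index on the collection.
Create a file named create-index.js.
touch create-index.js
Copy and paste the following code into the create-index.js file to create the index. The code does the following:
Connects to the Atlas cluster and creates an index with the specified name for the specified namespace.
Indexes the bsonEmbeddings.float32 and bsonEmbeddings.int8 fields as the vector type with the dotProduct similarity function, and indexes the bsonEmbeddings.int1 field as the vector type with the euclidean function.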
create-index.js
const { MongoClient } = require("mongodb");
const { setTimeout } = require("timers/promises"); // Promise-based setTimeout

// Connect to your Atlas deployment
const uri = process.env.MONGODB_URI || "<CONNECTION-STRING>";

const client = new MongoClient(uri);

async function main() {
  try {
    const database = client.db("<DB-NAME>");
    const collection = database.collection("<COLLECTION-NAME>");

    // Define your Atlas Vector Search index
    const index = {
      name: "<INDEX-NAME>",
      type: "vectorSearch",
      definition: {
        fields: [
          {
            type: "vector",
            numDimensions: 1024,
            path: "bsonEmbeddings.float32",
            similarity: "dotProduct",
          },
          {
            type: "vector",
            numDimensions: 1024,
            path: "bsonEmbeddings.int8",
            similarity: "dotProduct",
          },
          {
            type: "vector",
            numDimensions: 1024,
            path: "bsonEmbeddings.int1",
            similarity: "euclidean",
          },
        ],
      },
    };

    // Run the helper method
    const result = await collection.createSearchIndex(index);
    console.log(`New search index named ${result} is building.`);

    // Wait for the index to be ready to query
    console.log("Polling to check if the index is ready. This may take up to a minute.");
    let isQueryable = false;

    // Look up the index by name and poll until it reports queryable
    while (!isQueryable) {
      const [indexData] = await collection.listSearchIndexes(index.name).toArray();

      if (indexData) {
        isQueryable = indexData.queryable;
        if (!isQueryable) {
          await setTimeout(5000); // Wait for 5 seconds before checking again
        }
      } else {
        // Handle the case where the index might not be found
        console.log(`Index ${index.name} not found.`);
        await setTimeout(5000); // Wait for 5 seconds before checking again
      }
    }

    console.log(`${result} is ready for querying.`);
  } catch (error) {
    console.error("Error:", error);
  } finally {
    await client.close();
  }
}

main().catch((err) => {
  console.error("Unhandled error:", err);
});

Replace the following settings and save the file.
<CONNECTION-STRING>
Connection string for the Atlas cluster where you want to create the index. Replace this value only if you didn't set the MONGODB_URI environment variable.
<DB-NAME>
Name of the database, which is sample_airbnb.
<COLLECTION-NAME>
Name of the collection, which is listingsAndReviews.
<INDEX-NAME>
Name of the index for the collection.
Create the index.
node create-index.js
New search index named <INDEX-NAME> is building.
Polling to check if the index is ready. This may take up to a minute.
<INDEX-NAME> is ready for querying.
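The index uses the euclidean function for the int1 field. For vectors whose components are only 0 or 1, the squared Euclidean distance equals the Hamming distance, which is the number of positions where the bits differ, so the distance can be computed directly on the packed bytes. The sketch below illustrates that arithmetic only; it is not a description of how Atlas computes scores internally. The popcount8 and hammingDistance helpers are hypothetical names.

```javascript
// Hypothetical helpers for illustration; not part of any MongoDB API.

// Count the set bits in one byte (0-255).
function popcount8(byte) {
  let count = 0;
  while (byte !== 0) {
    count += byte & 1;
    byte >>>= 1;
  }
  return count;
}

// Hamming distance between two packed-bit vectors of equal byte length.
// For 0/1 vectors this equals the squared Euclidean distance.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error('Packed vectors must have the same length.');
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    distance += popcount8(a[i] ^ b[i]); // XOR leaves a 1 wherever the bits differ
  }
  return distance;
}

const v1 = Uint8Array.from([0b10110000, 0b00001111]);
const v2 = Uint8Array.from([0b10010000, 0b00000000]);
console.log(hammingDistance(v1, v2)); // prints 5
```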
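Generate embeddings for the query text.
Create a file named get-query-embeddings.js.
touch get-query-embeddings.js
Copy and paste the code into the get-query-embeddings.js file. The sample code does the following:
Uses Cohere to generate float32, int8, and int1 embeddings for the query text.
Uses the BSON library exported by the MongoDB Node.js driver to convert the generated embeddings to BSON binData vectors.
Saves the generated embeddings to a file named query-embeddings.json.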
get-query-embeddings.js
const { CohereClient } = require('cohere-ai');
const { BSON } = require('mongodb');
const { writeFile } = require('fs/promises');
const dotenv = require('dotenv');
const process = require('process');

// Load environment variables
dotenv.config();

const { Binary } = BSON;

// Get the API key from environment variables or set the key here
const apiKey = process.env.COHERE_API_KEY || '<COHERE-API-KEY>';

// Fail early if the key is missing or the placeholder was never replaced
if (!apiKey || apiKey === '<COHERE-API-KEY>') {
  throw new Error('API key not found. Provide the COHERE_API_KEY.');
}

// Initialize CohereClient
const cohere = new CohereClient({ token: apiKey });

async function main(queryText) {
  try {
    if (typeof queryText !== 'string' || queryText.trim() === '') {
      throw new Error('Invalid query text. It must be a non-empty string.');
    }

    const data = [queryText];

    // Request embeddings from the Cohere API
    const response = await cohere.v2.embed({
      model: 'embed-english-v3.0',
      inputType: 'search_query',
      texts: data,
      embeddingTypes: ['float', 'int8', 'ubinary'], // Request all required embedding types
    });

    if (!response.embeddings) {
      throw new Error('Embeddings not found in the API response.');
    }

    const { float, int8, ubinary } = response.embeddings;

    const updatedEmbeddingsData = data.map((text, index) => {
      // Create a BSON Binary vector for each embedding type
      const float32Binary = Binary.fromFloat32Array(new Float32Array(float[index])); // VECTOR_TYPE.FLOAT32
      const int8Binary = Binary.fromInt8Array(new Int8Array(int8[index])); // VECTOR_TYPE.INT8
      const packedBitsBinary = Binary.fromPackedBits(new Uint8Array(ubinary[index])); // VECTOR_TYPE.PACKED_BIT

      return {
        text,
        embeddings: {
          float: float[index],
          int8: int8[index],
          ubinary: ubinary[index],
        },
        bsonEmbeddings: {
          float32: float32Binary,
          int8: int8Binary,
          int1: packedBitsBinary,
        },
      };
    });

    // Serialize the embeddings as canonical EJSON so the BSON types survive the round trip
    const outputFileName = 'query-embeddings.json';
    const ejsonSerializedData = BSON.EJSON.stringify(updatedEmbeddingsData, null, null, { relaxed: false });
    await writeFile(outputFileName, ejsonSerializedData);
    console.log(`Embeddings with BSON vectors have been saved to ${outputFileName}`);
  } catch (error) {
    console.error('Error processing query text:', error);
  }
}

// Entry point: generate embeddings for the query string
(async () => {
  const queryText = "<QUERY-TEXT>"; // Replace with your actual query text
  await main(queryText);
})();

Replace the following settings and save the file.
<COHERE-API-KEY>
Your Cohere API key. Replace this value only if you didn't set the key as an environment variable.
<QUERY-TEXT>
Your query text. For this example, use ocean view.
Run the code to generate embeddings for the query text.
node get-query-embeddings.js
Embeddings with BSON vectors have been saved to query-embeddings.json
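Run an Atlas Vector Search query.
Create a file named run-query.js.
touch run-query.js
Copy and paste the following sample $vectorSearch query into the run-query.js file. The sample query does the following:
Connects to your Atlas cluster and uses the embeddings in the query-embeddings.json file to run $vectorSearch queries against the bsonEmbeddings.float32, bsonEmbeddings.int8, and bsonEmbeddings.int1 fields in the sample_airbnb.listingsAndReviews namespace.
Prints the results for the Float32, Int8, and Packed Binary (Int1) embeddings to the console.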
run-query.js
const { MongoClient, BSON } = require('mongodb'); // BSON provides EJSON parsing
const fs = require('fs/promises');
const dotenv = require('dotenv');

dotenv.config();

// MongoDB connection details
const mongoUri = process.env.MONGODB_URI || '<CONNECTION-STRING>';
const dbName = 'sample_airbnb'; // Update with your actual database name
const collectionName = 'listingsAndReviews'; // Update with your actual collection name

// The index name must match your Atlas Vector Search index
const VECTOR_INDEX_NAME = '<INDEX-NAME>'; // Replace with your actual index name
const NUM_CANDIDATES = 20; // Number of candidate documents for the search
const LIMIT = 5; // Number of documents to return

// Subfields of bsonEmbeddings that hold the BSON vectors (binData subtype 9)
const FIELDS = ['float32', 'int8', 'int1'];

// Read the BSON query vectors from JSON and run a vector search for each one
async function main() {
  // Initialize MongoDB client
  const client = new MongoClient(mongoUri);

  try {
    await client.connect();
    console.log("Connected to MongoDB");

    const db = client.db(dbName);
    const collection = db.collection(collectionName);

    // Load query embeddings from the JSON file using EJSON parsing
    const fileContent = await fs.readFile('query-embeddings.json', 'utf8');
    const embeddingsData = BSON.EJSON.parse(fileContent);

    // Define and run the query for each embedding type
    const results = {};

    for (const path of FIELDS) {
      // Use the BSON Binary object directly as the query vector
      const bsonQueryVector = embeddingsData[0]?.bsonEmbeddings?.[path];

      if (!bsonQueryVector) {
        console.warn(`BSON embedding for ${path} not found in the JSON.`);
        continue;
      }

      const pipeline = [
        {
          $vectorSearch: {
            index: VECTOR_INDEX_NAME,
            path: `bsonEmbeddings.${path}`,
            queryVector: bsonQueryVector,
            numCandidates: NUM_CANDIDATES,
            limit: LIMIT,
          }
        },
        {
          $project: {
            _id: 0,
            name: 1,
            summary: 1, // Adjust projection fields as necessary to match your document structure
            score: { $meta: 'vectorSearchScore' }
          }
        }
      ];

      results[path] = await collection.aggregate(pipeline).toArray();
    }

    return results;
  } catch (error) {
    console.error('Error during vector search:', error);
  } finally {
    await client.close();
  }
}

// Main execution block
(async () => {
  try {
    const results = await main();

    if (results) {
      console.log("Results from Float32 embeddings:");
      (results.float32 || []).forEach((result, index) => {
        console.log(`Result ${index + 1}:`, result);
      });

      console.log("Results from Int8 embeddings:");
      (results.int8 || []).forEach((result, index) => {
        console.log(`Result ${index + 1}:`, result);
      });

      console.log("Results from Packed Binary (PackedBits) embeddings:");
      (results.int1 || []).forEach((result, index) => {
        console.log(`Result ${index + 1}:`, result);
      });
    }
  } catch (error) {
    console.error('Error executing main function:', error);
  }
})();

Replace the following settings and save the run-query.js file.
<CONNECTION-STRING>
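Connection string for the Atlas cluster that you want to query. Replace this value only if you didn't set the MONGODB_URI environment variable.
<INDEX-NAME>
Name of the index for the collection.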
Run the query.
To execute the query, run the following command:
node run-query.js
Connected to MongoDB
Results from Float32 embeddings:
Result 1: {
  name: 'Makaha Valley Paradise with OceanView',
  summary: "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.",
  score: 0.7278661131858826
}
Result 2: {
  name: 'Ocean View Waikiki Marina w/prkg',
  summary: "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.",
  score: 0.688639760017395
}
Result 3: {
  name: 'A Casa Alegre é um apartamento T1.',
  summary: 'Para 2 pessoas. Vista de mar a 150 mts. Prédio com 2 elevadores. Tem: - quarto com roupeiro e cama de casal (colchão magnetizado); - cozinha: placa de discos, exaustor, frigorifico, micro-ondas e torradeira; casa de banho completa; - sala e varanda.',
  score: 0.6831139326095581
}
Result 4: {
  name: 'Your spot in Copacabana',
  summary: 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.',
  score: 0.6802051663398743
}
Result 5: {
  name: 'LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4!',
  summary: 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!',
  score: 0.6779564619064331
}
Results from Int8 embeddings:
Result 1: {
  name: 'Makaha Valley Paradise with OceanView',
  summary: "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.",
  score: 0.5215557217597961
}
Result 2: {
  name: 'Ocean View Waikiki Marina w/prkg',
  summary: "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.",
  score: 0.5179016590118408
}
Result 3: {
  name: 'A Casa Alegre é um apartamento T1.',
  summary: 'Para 2 pessoas. Vista de mar a 150 mts. Prédio com 2 elevadores. Tem: - quarto com roupeiro e cama de casal (colchão magnetizado); - cozinha: placa de discos, exaustor, frigorifico, micro-ondas e torradeira; casa de banho completa; - sala e varanda.',
  score: 0.5173280239105225
}
Result 4: {
  name: 'Your spot in Copacabana',
  summary: 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.',
  score: 0.5170232057571411
}
Result 5: {
  name: 'LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4!',
  summary: 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!',
  score: 0.5168724060058594
}
Results from Packed Binary (PackedBits) embeddings:
Result 1: {
  name: 'Makaha Valley Paradise with OceanView',
  summary: "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.",
  score: 0.6591796875
}
Result 2: {
  name: 'Ocean View Waikiki Marina w/prkg',
  summary: "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.",
  score: 0.6337890625
}
Result 3: {
  name: 'A Casa Alegre é um apartamento T1.',
  summary: 'Para 2 pessoas. Vista de mar a 150 mts. Prédio com 2 elevadores. Tem: - quarto com roupeiro e cama de casal (colchão magnetizado); - cozinha: placa de discos, exaustor, frigorifico, micro-ondas e torradeira; casa de banho completa; - sala e varanda.',
  score: 0.62890625
}
Result 4: {
  name: 'LAHAINA, MAUI! RESORT/CONDO BEACHFRONT!! SLEEPS 4!',
  summary: 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!',
  score: 0.6279296875
}
Result 5: {
  name: 'Be Happy in Porto',
  summary: 'Be Happy Apartment is an amazing space. Renovated and comfortable apartment, located in a building dating from the nineteenth century in one of the most emblematic streets of the Porto city "Rua do Almada". Be Happy Apartment is located in the city center, able you to visit the historic center only by foot, being very close of majority points of interesting of the Porto City. Be Happy Apartment is located close of central Station MetroTrindade.',
  score: 0.619140625
}
Your results might vary because the generated embeddings can differ depending on your environment.
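Create an interactive Python notebook by saving a file with the .ipynb extension, and then perform the following steps in the notebook. To try the example, replace the placeholders with valid values.
Work with a runnable version of this tutorial as a Python notebook.
Install the required libraries.
Run the following command to install the PyMongo driver. If necessary, you can also install libraries from your embedding model provider. This operation might take a few minutes to complete.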
pip install pymongo
You must install PyMongo v4.10 or later.
Example
Install PyMongo and Cohere
pip install --quiet --upgrade pymongo cohere
Load the data for which you want to generate BSON vectors in your notebook.
Example
Sample data to import
data = [
    "The Great Wall of China is visible from space.",
    "The Eiffel Tower was completed in Paris in 1889.",
    "Mount Everest is the highest peak on Earth at 8,848m.",
    "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
    "The Mona Lisa was painted by Leonardo da Vinci.",
]
(Conditional) Generate embeddings from your data.
This step is required if you haven't yet generated embeddings from your data. If you've already generated embeddings, skip this step. To learn more about generating embeddings from your data, see How to Create Vector Embeddings.
Example
Generate embeddings from sample data using Cohere
Placeholder | Valid Value |
---|---|
<COHERE-API-KEY> | API key for Cohere. |
import os
import cohere

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Generate embeddings using the embed-english-v3.0 model
generated_embeddings = cohere_client.embed(
    texts=data,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float", "int8", "ubinary"]
).embeddings

# Keep one list per embedding type
float32_embeddings = generated_embeddings.float
int8_embeddings = generated_embeddings.int8
int1_embeddings = generated_embeddings.ubinary
Generate the BSON vectors from your embeddings.
You can use the PyMongo driver to convert your native vector embeddings to BSON vectors.
Example
Define and run a function to generate BSON vectors
from bson.binary import Binary, BinaryVectorDtype

def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

# For all vectors in your collection, generate BSON vectors of float32, int8, and int1 embeddings
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
for f32_emb, int8_emb, int1_emb in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
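To make the int1 (packed-bit) representation more concrete, the following is a minimal, standard-library-only sketch of how one bit per dimension can be packed into bytes. It assumes most-significant-bit-first ordering within each byte and a threshold of 0. The helper names pack_bits and unpack_bits are hypothetical and are not part of PyMongo or Cohere. In this procedure the embedding provider already returns packed values, so you would not normally do this yourself.

```python
def pack_bits(values, threshold=0.0):
    """Pack one bit per dimension into a list of byte values (0-255)."""
    if len(values) % 8 != 0:
        raise ValueError("number of dimensions must be a multiple of 8")
    packed = []
    for start in range(0, len(values), 8):
        byte = 0
        for value in values[start:start + 8]:
            # Shift left, then append 1 if the value exceeds the threshold
            byte = (byte << 1) | (1 if value > threshold else 0)
        packed.append(byte)
    return packed


def unpack_bits(packed):
    """Expand packed byte values back into a flat list of 0/1 bits."""
    bits = []
    for byte in packed:
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits


# A made-up 16-dimension float vector becomes 2 bytes
vector = [0.3, -0.1, 0.0, 0.9, -0.5, 0.2, 0.7, -0.8,
          -0.2, -0.4, 0.6, 0.1, 0.0, -0.9, 0.5, 0.4]
packed = pack_bits(vector)
print(packed)
print(len(vector) * 4, "bytes as float32 vs", len(packed), "bytes packed")
```

Packed this way, a 1024-dimension embedding occupies 128 bytes, which is consistent with the requirement that int1 vectors have a number of dimensions that is a multiple of 8.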
Create documents with the BSON vector embeddings.
If you already have the BSON vector embeddings inside of documents in your collection, skip this step.
Example
Create documents from the sample data
Placeholder | Valid Value |
---|---|
<FIELD-NAME-FOR-FLOAT32-TYPE> | Name of the field with float32 values. |
<FIELD-NAME-FOR-INT8-TYPE> | Name of the field with int8 values. |
<FIELD-NAME-FOR-INT1-TYPE> | Name of the field with int1 values. |
# Specify the field names for the float32, int8, and int1 embeddings
float32_field = "<FIELD-NAME-FOR-FLOAT32-TYPE>"
int8_field = "<FIELD-NAME-FOR-INT8-TYPE>"
int1_field = "<FIELD-NAME-FOR-INT1-TYPE>"

# Define function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, data)):
        doc = {
            "_id": i,
            "data": text,
            float32_field: bson_f32_emb,
            int8_field: bson_int8_emb,
            int1_field: bson_int1_emb
        }
        docs.append(doc)
    return docs

# Create the documents
documents = create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, data)
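Load your data into your Atlas cluster.
You can load your data from the Atlas UI or programmatically. To learn how to load your data from the Atlas UI, see Insert Your Data. The following steps and associated examples demonstrate how to load your data programmatically by using the PyMongo driver.
Connect to your Atlas cluster.
Placeholder | Valid Value |
---|---|
<ATLAS-CONNECTION-STRING> | Atlas connection string. To learn more, see Connect via Drivers. |
Example
import pymongo

# Replace the placeholder with your Atlas connection string
MONGO_URI = "<ATLAS-CONNECTION-STRING>"
if not MONGO_URI:
    raise ValueError("MONGO_URI is not set")

mongo_client = pymongo.MongoClient(MONGO_URI)
Load the data into your Atlas cluster.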
Placeholder | Valid Value |
---|---|
<DB-NAME> | Name of the database. |
<COLLECTION-NAME> | Name of the collection in the specified database. |
Example
# Insert documents into a new database and collection
db = mongo_client["<DB-NAME>"]
collection_name = "<COLLECTION-NAME>"
db.create_collection(collection_name)
collection = db[collection_name]

collection.insert_many(documents)
Create the Atlas Vector Search index on the collection.
You can create Atlas Vector Search indexes by using the Atlas UI, Atlas CLI, Atlas Administration API, and MongoDB drivers. To learn more, see How to Index Fields for Vector Search.
Example
Create an index for the sample collection
Placeholder | Valid Value |
---|---|
<INDEX-NAME> | Name of the vector search index. |
from pymongo.operations import SearchIndexModel
import time

# Define and create the vector search index
index_name = "<INDEX-NAME>"
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": float32_field,
                "similarity": "dotProduct",
                "numDimensions": 1024
            },
            {
                "type": "vector",
                "path": int8_field,
                "similarity": "dotProduct",
                "numDimensions": 1024
            },
            {
                "type": "vector",
                "path": int1_field,
                "similarity": "euclidean",
                "numDimensions": 1024
            }
        ]
    },
    name=index_name,
    type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
predicate = lambda index: index.get("queryable") is True
while True:
    indices = list(collection.list_search_indexes(index_name))
    if len(indices) and predicate(indices[0]):
        break
    time.sleep(5)
print(result + " is ready for querying.")
Define a function to run the Atlas Vector Search queries.
The function to run Atlas Vector Search queries must perform the following actions:
Convert the query text to a BSON vector.
Define the pipeline for the Atlas Vector Search query.
Example
Placeholder | Valid Value |
---|---|
<NUMBER-OF-CANDIDATES-TO-CONSIDER> | Number of nearest neighbors to use during the search. |
<NUMBER-OF-DOCUMENTS-TO-RETURN> | Number of documents to return in the results. |
# Define a function to run a vector search query
def run_vector_search(query_text, collection, path):
    query_text_embeddings = cohere_client.embed(
        texts=[query_text],
        model="embed-english-v3.0",
        input_type="search_query",
        embedding_types=["float", "int8", "ubinary"]
    ).embeddings

    # Pick the query embedding and BSON dtype that match the indexed field
    if path == float32_field:
        query_vector = query_text_embeddings.float[0]
        vector_dtype = BinaryVectorDtype.FLOAT32
    elif path == int8_field:
        query_vector = query_text_embeddings.int8[0]
        vector_dtype = BinaryVectorDtype.INT8
    elif path == int1_field:
        query_vector = query_text_embeddings.ubinary[0]
        vector_dtype = BinaryVectorDtype.PACKED_BIT
    else:
        raise ValueError(f"Unknown embedding field: {path}")
    bson_query_vector = generate_bson_vector(query_vector, vector_dtype)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>,  # for example, 5
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN>  # for example, 2
            }
        },
        {
            '$project': {
                '_id': 0,
                'data': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]

    return collection.aggregate(pipeline)
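The pipeline-building step listed above can also be separated from the embedding call so that it is easy to inspect and validate before you run it. The following sketch is illustrative only: build_vector_search_pipeline is a hypothetical helper, the index and field names are made up, and the check assumes the $vectorSearch constraint that numCandidates must be greater than or equal to limit. It uses plain dictionaries, so it runs without a database connection.

```python
def build_vector_search_pipeline(index_name, path, query_vector,
                                 num_candidates, limit):
    """Return an aggregation pipeline for an ANN $vectorSearch query."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if num_candidates < limit:
        raise ValueError("numCandidates must be >= limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "data": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Placeholder values for illustration only
pipeline = build_vector_search_pipeline(
    index_name="my_index",        # hypothetical index name
    path="my_embedding_field",    # hypothetical field name
    query_vector=[0.1, 0.2, 0.3],
    num_candidates=5,
    limit=2,
)
print(pipeline[0]["$vectorSearch"]["numCandidates"])
```

In the tutorial's function, the query_vector argument would be the BSON vector produced by generate_bson_vector, and the returned list would be passed to collection.aggregate.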
Run the Atlas Vector Search query.
You can run Atlas Vector Search queries programmatically. To learn more, see Run Vector Search Queries.
Example
from pprint import pprint

# Run the vector search query on the float32, int8, and int1 embeddings
query_text = "tell me a science fact"
float32_results = run_vector_search(query_text, collection, float32_field)
int8_results = run_vector_search(query_text, collection, int8_field)
int1_results = run_vector_search(query_text, collection, int1_field)

print("results from float32 embeddings")
pprint(list(float32_results))
print("--------------------------------------------------------------------------")
print("results from int8 embeddings")
pprint(list(int8_results))
print("--------------------------------------------------------------------------")
print("results from int1 embeddings")
pprint(list(int1_results))

results from float32 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
  'score': 0.6578356027603149},
 {'data': 'The Great Wall of China is visible from space.',
  'score': 0.6420407891273499}]
--------------------------------------------------------------------------
results from int8 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
  'score': 0.5149182081222534},
 {'data': 'The Great Wall of China is visible from space.',
  'score': 0.5136760473251343}]
--------------------------------------------------------------------------
results from int1 embeddings
[{'data': 'Mount Everest is the highest peak on Earth at 8,848m.',
  'score': 0.62109375},
 {'data': 'The Great Wall of China is visible from space.',
  'score': 0.61328125}]
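Work with a runnable version of this tutorial as a Python notebook.
Install the required libraries.
Run the following command to install the PyMongo driver. If necessary, you can also install libraries from your embedding model provider. This operation might take a few minutes to complete.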
pip install pymongo
You must install PyMongo v4.10 or later.
Example
Install PyMongo and Cohere
pip install --quiet --upgrade pymongo cohere
Define the functions to generate vector embeddings and convert embeddings to BSON-compatible format.
You must define functions that perform the following by using an embedding model:
Generate embeddings from your existing data if your existing data doesn't have any embeddings.
Convert the embeddings to BSON vectors.
Example
Function to generate and convert embeddings
Placeholder | Valid Value |
---|---|
<COHERE-API-KEY> | API key for Cohere. |
For float32 embeddings:
import os
import pymongo
import cohere
from bson.binary import Binary, BinaryVectorDtype

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Define function to generate embeddings using the embed-english-v3.0 model
def get_embedding(text):
    response = cohere_client.embed(
        texts=[text],
        model='embed-english-v3.0',
        input_type='search_document',
        embedding_types=["float"]
    )
    embedding = response.embeddings.float[0]
    return embedding

# Define function to convert embeddings to BSON-compatible format
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

For int8 embeddings:
import os
import pymongo
import cohere
from bson.binary import Binary, BinaryVectorDtype

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Define function to generate embeddings using the embed-english-v3.0 model
def get_embedding(text):
    response = cohere_client.embed(
        texts=[text],
        model='embed-english-v3.0',
        input_type='search_document',
        embedding_types=["int8"]
    )
    embedding = response.embeddings.int8[0]
    return embedding

# Define function to convert embeddings to BSON-compatible format
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

For int1 embeddings:
import os
import pymongo
import cohere
from bson.binary import Binary, BinaryVectorDtype

# Specify your Cohere API key
os.environ["COHERE_API_KEY"] = "<COHERE-API-KEY>"
cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])

# Define function to generate embeddings using the embed-english-v3.0 model
def get_embedding(text):
    response = cohere_client.embed(
        texts=[text],
        model='embed-english-v3.0',
        input_type='search_document',
        embedding_types=["ubinary"]
    )
    embedding = response.embeddings.ubinary[0]
    return embedding

# Define function to convert embeddings to BSON-compatible format
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)
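Connect to the Atlas cluster and retrieve existing data.
You must provide the following:
Connection string to connect to your Atlas cluster that contains the database and collection for which you want to generate embeddings.
Name of the database that contains the collection for which you want to generate embeddings.
Name of the collection for which you want to generate embeddings.
Example
Connect to the Atlas cluster to access data
Placeholder | Valid Value |
---|---|
<ATLAS-CONNECTION-STRING> | Atlas connection string. To learn more, see Connect via Drivers. |
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<ATLAS-CONNECTION-STRING>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Filter to exclude null or empty summary fields
query_filter = { "summary": {"$nin": [None, ""]} }

# Get a subset of documents in the collection
documents = collection.find(query_filter).limit(50)

# Initialize the count of updated documents
updated_doc_count = 0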
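For readers unfamiliar with $nin, the filter above is meant to keep only documents whose summary is present and is neither null nor an empty string. The following standard-library sketch mimics that selection on in-memory dictionaries. It rests on the assumption that a $nin list containing null also excludes documents where the field is missing. The name has_usable_summary is hypothetical and the sample documents are made up for illustration.

```python
def has_usable_summary(doc):
    """Mimic {"summary": {"$nin": [None, ""]}} on a plain dict."""
    # dict.get returns None for a missing key, so missing fields are
    # rejected here in the same way as explicit None values
    return doc.get("summary") not in (None, "")


sample_docs = [
    {"_id": 1, "summary": "Ocean view condo"},
    {"_id": 2, "summary": ""},
    {"_id": 3, "summary": None},
    {"_id": 4},
]
kept = [doc["_id"] for doc in sample_docs if has_usable_summary(doc)]
print(kept)
```

Filtering these documents out before generating embeddings avoids sending empty text to the embedding model.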
Generate, convert, and load embeddings into your collection.
Generate embeddings from your data using any embedding model if your data doesn't already have embeddings. To learn more about generating embeddings from your data, see How to Create Vector Embeddings.
Convert the embeddings to BSON vectors (the generate_bson_vector call, on line 7 of the following example).
Upload the embeddings to your collection on the Atlas cluster.
These operations might take a few minutes to complete.
Example
Generate, convert, and load embeddings to the collection
For float32 embeddings:
for doc in documents:
    # Generate embeddings based on the summary
    summary = doc["summary"]
    embedding = get_embedding(summary)  # Get float32 embedding

    # Convert the float32 embedding to BSON format
    bson_float32 = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)

    # Update the document with the BSON embedding
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": bson_float32}}
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents with BSON embeddings.")

For int8 embeddings:
for doc in documents:
    # Generate embeddings based on the summary
    summary = doc["summary"]
    embedding = get_embedding(summary)  # Get int8 embedding

    # Convert the int8 embedding to BSON format
    bson_int8 = generate_bson_vector(embedding, BinaryVectorDtype.INT8)

    # Update the document with the BSON embedding
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": bson_int8}}
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents with BSON embeddings.")

For int1 embeddings:
for doc in documents:
    # Generate embeddings based on the summary
    summary = doc["summary"]
    embedding = get_embedding(summary)  # Get int1 embedding

    # Convert the int1 embedding to BSON format
    bson_int1 = generate_bson_vector(embedding, BinaryVectorDtype.PACKED_BIT)

    # Update the document with the BSON embedding
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": bson_int1}}
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents with BSON embeddings.")
Create the Atlas Vector Search index on the collection.
You can create Atlas Vector Search indexes by using the Atlas UI, Atlas CLI, Atlas Administration API, and MongoDB drivers in your preferred language. To learn more, see How to Index Fields for Vector Search.
Example
Create an index for the collection
Placeholder | Valid Value |
---|---|
<INDEX-NAME> | Name of the vector search index. |
from pymongo.operations import SearchIndexModel
import time

# Define and create the vector search index
index_name = "<INDEX-NAME>"
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "similarity": "euclidean",
                "numDimensions": 1024
            }
        ]
    },
    name=index_name,
    type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
predicate = lambda index: index.get("queryable") is True
while True:
    indices = list(collection.list_search_indexes(index_name))
    if len(indices) and predicate(indices[0]):
        break
    time.sleep(5)
print(result + " is ready for querying.")
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
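The polling loop in the example above runs until the index reports that it is queryable, with no upper bound on the waiting time. As a hedged alternative, the following sketch adds a timeout. The wait_until helper is hypothetical and is not a PyMongo function. Its check argument stands in for whatever readiness test you use, and the fake_index_ready function below exists only so that the sketch runs without a cluster.

```python
import time


def wait_until(check, timeout_seconds=300.0, interval_seconds=5.0):
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True if check() succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)


# Stand-in readiness check that succeeds on the third call
calls = {"count": 0}

def fake_index_ready():
    calls["count"] += 1
    return calls["count"] >= 3

ready = wait_until(fake_index_ready, timeout_seconds=2.0, interval_seconds=0.01)
print(ready, calls["count"])
```

With the tutorial's collection, the check could be a small function that lists the search indexes named index_name and returns whether the first one has queryable set to True.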
Define a function to run the Atlas Vector Search queries.
The function to run Atlas Vector Search queries must perform the following actions:
Generate embeddings for the query text.
Convert the query text to a BSON vector.
Define the pipeline for the Atlas Vector Search query.
Example
Function to run an Atlas Vector Search query
Placeholder | Valid Value |
---|---|
<NUMBER-OF-CANDIDATES-TO-CONSIDER> | Number of nearest neighbors to use during the search. |
<NUMBER-OF-DOCUMENTS-TO-RETURN> | Number of documents to return in the results. |
For float32 embeddings:
def run_vector_search(query_text, collection, path):
    query_embedding = get_embedding(query_text)
    bson_query_vector = generate_bson_vector(query_embedding, BinaryVectorDtype.FLOAT32)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>,  # for example, 20
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN>  # for example, 5
            }
        },
        {
            '$project': {
                '_id': 0,
                'name': 1,
                'summary': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]

    return collection.aggregate(pipeline)

For int8 embeddings:
def run_vector_search(query_text, collection, path):
    query_embedding = get_embedding(query_text)
    bson_query_vector = generate_bson_vector(query_embedding, BinaryVectorDtype.INT8)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>,  # for example, 20
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN>  # for example, 5
            }
        },
        {
            '$project': {
                '_id': 0,
                'name': 1,
                'summary': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]

    return collection.aggregate(pipeline)

For int1 embeddings:
def run_vector_search(query_text, collection, path):
    query_embedding = get_embedding(query_text)
    bson_query_vector = generate_bson_vector(query_embedding, BinaryVectorDtype.PACKED_BIT)

    pipeline = [
        {
            '$vectorSearch': {
                'index': index_name,
                'path': path,
                'queryVector': bson_query_vector,
                'numCandidates': <NUMBER-OF-CANDIDATES-TO-CONSIDER>,  # for example, 20
                'limit': <NUMBER-OF-DOCUMENTS-TO-RETURN>  # for example, 5
            }
        },
        {
            '$project': {
                '_id': 0,
                'name': 1,
                'summary': 1,
                'score': { '$meta': 'vectorSearchScore' }
            }
        }
    ]

    return collection.aggregate(pipeline)
Run the Atlas Vector Search query.
You can run Atlas Vector Search queries programmatically. To learn more, see Run Vector Search Queries.
Example
Run a sample Atlas Vector Search query
from pprint import pprint

query_text = "ocean view"
query_results = run_vector_search(query_text, collection, "embedding")

print("query results:")
pprint(list(query_results))

query results:
[{'name': 'Your spot in Copacabana',
  'score': 0.5468248128890991,
  'summary': 'Having a large airy living room. The apartment is well divided. '
             'Fully furnished and cozy. The building has a 24h doorman and '
             'camera services in the corridors. It is very well located, close '
             'to the beach, restaurants, pubs and several shops and '
             'supermarkets. And it offers a good mobility being close to the '
             'subway.'},
 {'name': 'Twin Bed room+MTR Mongkok shopping&My',
  'score': 0.527062714099884,
  'summary': 'Dining shopping conveniently located Mongkok subway E1, airport '
             'shuttle bus stops A21. Three live two beds, separate WC, 24-hour '
             'hot water. Free WIFI.'},
 {'name': 'Quarto inteiro na Tijuca',
  'score': 0.5222363471984863,
  'summary': 'O quarto disponível tem uma cama de solteiro, sofá e computador '
             'tipo desktop para acomodação.'},
 {'name': 'Makaha Valley Paradise with OceanView',
  'score': 0.5175154805183411,
  'summary': 'A beautiful and comfortable 1 Bedroom Air Conditioned Condo in '
             'Makaha Valley - stunning Ocean & Mountain views All the '
             'amenities of home, suited for longer stays. Full kitchen & large '
             "bathroom. Several gas BBQ's for all guests to use & a large "
             'heated pool surrounded by reclining chairs to sunbathe. The '
             'Ocean you see in the pictures is not even a mile away, known as '
             'the famous Makaha Surfing Beach. Golfing, hiking,snorkeling '
             'paddle boarding, surfing are all just minutes from the front '
             'door.'},
 {'name': 'Cozy double bed room 東涌鄉村雅緻雙人房',
  'score': 0.5149975419044495,
  'summary': 'A comfortable double bed room at G/F. Independent entrance. High '
             'privacy. The room size is around 100 sq.ft. with a 48"x72" '
             'double bed. The village house is close to the Hong Kong Airport, '
             'AsiaWorld-Expo, HongKong-Zhuhai-Macau Bridge, Disneyland, '
             'Citygate outlets, 360 Cable car, shopping centre, main tourist '
             'attractions......'}]
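Your results might vary depending on the vector data type that you specified in the previous steps.
For an advanced demonstration of this procedure on sample data using Cohere's embed-english-v3.0 embedding model, see this notebook.
Evaluate Your Query Results
You can measure the accuracy of your Atlas Vector Search query by evaluating how closely the results of an ANN search match the results of an ENN search against your quantized vectors. That is, you can compare the ANN search results with the ENN search results for the same query criteria and measure how often the ANN search results include the nearest neighbors returned by the ENN search.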
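The comparison described above is commonly summarized as recall: the fraction of the exact (ENN) nearest neighbors that also appear in the approximate (ANN) results. The following is a minimal sketch of that calculation, not an official Atlas utility. The function name recall_at_k is hypothetical, and the ID lists are made-up stand-ins for the _id values that you would collect from running the same query once as an ANN search and once as an ENN search.

```python
def recall_at_k(ann_ids, enn_ids, k=None):
    """Fraction of the exact (ENN) top-k results that the ANN results contain."""
    if k is None:
        k = len(enn_ids)
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top_k = set(enn_ids[:k])
    if not exact_top_k:
        raise ValueError("enn_ids must not be empty")
    approximate_top_k = set(ann_ids[:k])
    return len(exact_top_k & approximate_top_k) / len(exact_top_k)


# Made-up document IDs returned by the two searches for the same query
ann_ids = [10, 12, 15, 18, 21]
enn_ids = [10, 12, 14, 18, 21]
print(recall_at_k(ann_ids, enn_ids))
```

Averaging this value over a representative set of queries gives a more stable estimate than a single query. If recall is lower than you need, one option is to raise numCandidates and measure again.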