
How to Create Vector Embeddings


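Vector embeddings represent your data as points in multi-dimensional space. These embeddings capture meaningful relationships in your data and enable tasks such as semantic search and retrieval. You can store vector embeddings alongside your other data in Atlas and use Atlas Vector Search to query your vectorized data.

To run an Atlas Vector Search query, you must:

  1. Choose a method to create vector embeddings.

  2. Create vector embeddings from your data and store them in Atlas.

  3. Create a vector embedding that represents your search query and run the query.

Atlas Vector Search returns the documents whose vector embeddings are closest in distance to the embedding that represents your query. A small distance indicates that the documents are similar in meaning to the query.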
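The idea of "closest in distance" can be illustrated with a minimal sketch. The three-dimensional vectors and document names below are hypothetical stand-ins for real embeddings, and the brute-force ranking is only an illustration of the concept, not how Atlas Vector Search is implemented.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional embeddings; real models produce hundreds of dimensions.
documents = {
    "fix a leaky faucet": [0.9, 0.1, 0.0],
    "repaint a bedroom": [0.8, 0.3, 0.1],
    "bake sourdough bread": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedding of a search query

# Rank documents from nearest to farthest from the query vector.
ranked = sorted(
    documents,
    key=lambda name: euclidean_distance(documents[name], query_vector),
)
print(ranked)
```

The documents whose vectors lie nearest to the query vector come first in the ranking, which is the behavior the paragraph above describes.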

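To create vector embeddings, you must use an embedding model. To connect to an embedding model and create embeddings for Atlas Vector Search, use one of the following methods:

  • Load an open-source model. If you don't have an API key or credits for an embedding service, you can use an open-source embedding model by loading it locally from your application.

    For example implementations, see Create and Store Embeddings from Data and Build a Local RAG Implementation with Atlas Vector Search.

  • Call an embedding service. Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings.

    For an example implementation with OpenAI, see Create and Store Embeddings from Data.

  • Use an integration. You can integrate Atlas Vector Search with open-source frameworks such as LangChain and LlamaIndex, services such as Amazon Bedrock, and more. These integrations include built-in libraries and tools that help you quickly connect to open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search.

    To get started, see Integrate Vector Search with AI Technologies.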

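The following procedure demonstrates how to create vector embeddings and store them in Atlas by using an open-source or proprietary embedding model and the MongoDB PyMongo driver.

To run these examples, you must have the following:

  • An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list.

  • An environment to run interactive Python notebooks, such as Colab.

Complete the following steps to create vector embeddings from a sample dataset and then store them in a collection in Atlas.

Note

This example shows how to create vector embeddings from a new dataset. To create embeddings for an existing collection, you must add a new field that contains the embedding and update each document in the collection.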
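For an existing collection, the update described in the note can be sketched as follows. This is a minimal sketch under stated assumptions: build_embedding_updates is a hypothetical helper, embed stands for any embedding function (such as the get_embedding function defined in step 3 of this procedure), and the field names mirror the ones used in this tutorial.

```python
def build_embedding_updates(documents, embed,
                            source_field="text",
                            target_field="text_embedding"):
    """Return (filter, update) pairs that add an embedding field to each document."""
    updates = []
    for doc in documents:
        text = doc.get(source_field)
        if not text:
            continue  # skip documents that have no text to embed
        updates.append(
            ({"_id": doc["_id"]}, {"$set": {target_field: embed(text)}})
        )
    return updates

# Usage with PyMongo (assumes `collection` and `get_embedding` are already defined):
# for filter_doc, update_doc in build_embedding_updates(collection.find({}), get_embedding):
#     collection.update_one(filter_doc, update_doc)
```

Building the filter and update documents separately from the database calls keeps the sketch easy to inspect before you modify any data.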

1

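To use the open-source embedding model, create an interactive Python notebook by saving a file with the .ipynb extension, then run the following command in your notebook to install the dependencies: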

pip install --quiet datasets pandas nomic sentence-transformers einops pymongo

Note

If you encounter warnings about version compatibility, you can ignore them. They do not prevent you from completing this tutorial.

2

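The sample dataset used in this tutorial contains text from a variety of how-to articles. The dataset is available in the Hugging Face datasets library, which makes it easy to access the data from your application.

Paste and run the following code in your notebook. This code does the following:

  • Loads the dataset from the Hugging Face datasets library.

  • Keeps only the first 100 entries of the dataset.

  • Converts the dataset to a pandas DataFrame so that you can easily work with the data.

  • Filters the data to keep only non-null entries.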

from datasets import load_dataset
import pandas as pd
# Load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
data_head = data.take(100)
# Create the DataFrame
df = pd.DataFrame(data_head)
# Only keep entries where the text field is not null
df = df[df["text"].notna()]
# Preview contents of the data
df.head()
3

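Paste and run the following code in your notebook to create vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:

  • Loads the nomic-embed-text-v1 embedding model.

  • Creates a function named get_embedding that uses the model to generate an embedding for a given text input.

  • Calls the function to generate embeddings from the text field in your DataFrame, and stores these embeddings in a new text_embedding field.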

from nomic import embed
from sentence_transformers import SentenceTransformer
# Load the embedding model (https://huggingface.co/nomic-ai/nomic-embed-text-v1)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
def get_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = model.encode(text)
    return embedding.tolist()
# Creates embeddings and stores them as a new field
df["text_embedding"] = df["text"].apply(get_embedding)
df.head()
4

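Paste and run the following code in your notebook to connect to your Atlas cluster and store the data in the sample_db.articles collection. Replace the placeholder value with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format: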

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
# Ingest data into Atlas
db = mongo_client["sample_db"]
collection = db["articles"]
documents = df.to_dict("records")
collection.insert_many(documents)

After you run the sample code, you can view the vector embeddings in the Atlas UI by navigating to the sample_db.articles collection in your cluster.
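If your database username or password contains reserved characters such as @, :, or /, the connection string must carry them in percent-encoded form. The following is a minimal sketch that uses the standard library's quote_plus function; build_srv_uri is a hypothetical helper, and the user name, password, and host are made-up values for illustration only.

```python
from urllib.parse import quote_plus

def build_srv_uri(username, password, host):
    """Build an SRV connection string with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"

# Hypothetical values for illustration only.
uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)
# mongodb+srv://app_user:p%40ss%3Aword%2F1@cluster0.example.mongodb.net
```

You can pass the resulting string to pymongo.MongoClient in place of the placeholder.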

1

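To use the proprietary embedding model from OpenAI instead, create an interactive Python notebook by saving a file with the .ipynb extension, then run the following command in your notebook to install the dependencies: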

pip install --quiet datasets pandas openai pymongo

Note

If you encounter warnings about version compatibility, you can ignore them. They do not prevent you from completing this tutorial.

2

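The sample dataset used in this tutorial contains text from a variety of how-to articles. The dataset is available in the Hugging Face datasets library, which makes it easy to access the data from your application.

Paste and run the following code in your notebook. This code does the following:

  • Loads the dataset from the Hugging Face datasets library.

  • Keeps only the first 100 entries of the dataset.

  • Converts the dataset to a pandas DataFrame so that you can easily work with the data.

  • Filters the data to keep only non-null entries.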

from datasets import load_dataset
import pandas as pd
# Load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
data_head = data.take(100)
# Create the DataFrame
df = pd.DataFrame(data_head)
# Only keep entries where the text field is not null
df = df[df["text"].notna()]
# Preview contents of the data
df.head()
3

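Paste and run the following code in your notebook to create vector embeddings by using a proprietary embedding model from OpenAI. Replace the placeholder value with your OpenAI API key. This code does the following:

  • Specifies the text-embedding-3-small embedding model.

  • Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.

  • Calls the function to generate embeddings from the text field in your DataFrame, and stores these embeddings in a new text_embedding field.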

import os
from openai import OpenAI
# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()
def get_embedding(text):
    """Generates vector embeddings for the given text."""
    embeddings = openai_client.embeddings.create(input=[text], model=model).data[0].embedding
    return embeddings
# Creates embeddings and stores them as a new field
df["text_embedding"] = df["text"].apply(get_embedding)
df.head()

Tip

See also:

For API details and a list of available models, see the OpenAI documentation.
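The get_embedding function above sends one request per text. Because the embeddings endpoint accepts a list of inputs, you could also send several texts per request. The following is a minimal sketch, not part of the original procedure: embed_in_batches is a hypothetical helper, the batch size of 50 is an arbitrary assumption, and client is expected to be an openai.OpenAI instance.

```python
def embed_in_batches(client, texts, model, batch_size=50):
    """Embed texts in batches; `client` is expected to be an openai.OpenAI instance."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index so that each embedding lines up with its input text.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings

# Usage (assumes `openai_client`, `model`, and `df` from the steps above):
# vectors = embed_in_batches(openai_client, df["text"].tolist(), model)
# df["text_embedding"] = vectors
```

Fewer requests usually means less network overhead, but check your provider's input and rate limits before you pick a batch size.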

4

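Paste and run the following code in your notebook to connect to your Atlas cluster and store the data in the sample_db.articles collection. Replace the placeholder value with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format: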

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
# Ingest data into Atlas
db = mongo_client["sample_db"]
collection = db["articles"]
documents = df.to_dict("records")
collection.insert_many(documents)

After you run the sample code, you can view the vector embeddings in the Atlas UI by navigating to the sample_db.articles collection in your cluster.

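The following procedure demonstrates how to create an embedding for an Atlas Vector Search query by using an open-source or proprietary embedding model and the MongoDB PyMongo driver.

After you create embeddings from your sample data, complete the following steps to create an Atlas Vector Search index on your data and to create an embedding that you can use in a vector search query.

1

To enable vector search queries on your data, create an Atlas Vector Search index on the sample_db.articles collection.

The following index definition specifies the text_embedding field as the vector type, 768 as the number of vector dimensions, and euclidean as the similarity measure. The method that you use to create the index depends on your cluster tier:

  • For free and shared clusters, follow the steps to create the index through the Atlas UI. Name the index vector_index and use the following index definition: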

    {
      "fields": [
        {
          "type": "vector",
          "path": "text_embedding",
          "numDimensions": 768,
          "similarity": "euclidean"
        }
      ]
    }
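  • For dedicated clusters, you can also create the index by using a supported MongoDB driver. Paste and run the following code in your notebook to create the index by using the PyMongo driver helper method: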

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "text_embedding",
                    "numDimensions": 768,
                    "similarity": "euclidean"
                }
            ]
        },
        name="vector_index",
        type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

To learn more, see Create an Atlas Vector Search Index.
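The numDimensions value in the index definition has to match the length of the vectors that your embedding model produces. The following is a minimal sketch of a sanity check that you could run before you create the index; check_dimensions is a hypothetical helper, not part of PyMongo.

```python
def check_dimensions(embeddings, expected_dimensions):
    """Raise ValueError if any embedding length differs from the index's numDimensions."""
    for position, vector in enumerate(embeddings):
        if len(vector) != expected_dimensions:
            raise ValueError(
                f"embedding {position} has {len(vector)} dimensions, "
                f"expected {expected_dimensions}"
            )
    return True

# Usage (assumes `df` from the earlier steps and the 768-dimension index above):
# check_dimensions(df["text_embedding"], 768)
```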

2

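To generate the query vector for a vector search query, use the same method that you used to create embeddings from your data.

For example, paste and run the following code to do the following:

  • Create an embedding for the string home improvement by calling the embedding function that you defined in the previous example.

  • Pass the embedding into the queryVector field of an aggregation pipeline.

  • Run a sample vector search query and return the output.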

# Generate embedding for the search query
query_embedding = get_embedding("home improvement")
# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "text_embedding",
            "numCandidates": 100,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
    print(i)
{'text': "**Step 9: Analyze Findings**\nReview collected information meticulously. Identify maximum deviations, average variances, patterns, etc. Decide whether remedial actions are needed based on severity and implications of revealed disparities. Common solutions include shimming low spots, grinding high ones, repairing damaged sections, or even consulting experts about potential structural concerns.\n\nBy diligently adhering to this procedure, you'll successfully check your floor's level condition, thereby facilitating informed decisions concerning maintenance, renovation, or construction projects!", 'score': 0.4972769618034363}
{'text': '**Step 5: Deep Clean Surfaces**\nNow that the room is free of excess clutter, focus on deep cleaning surfaces. Start high and work your way down to avoid recontaminating cleaned areas. Dust light fixtures, ceiling fans, windowsills, shelves, and furniture. Vacuum or sweep floors thoroughly. Mop hard floor surfaces using a suitable cleanser. Pay attention to often neglected spots like baseboards and door frames.\n\nKey Tips:\n- Always start with the highest points to prevent falling dust from settling on already cleaned surfaces.\n- Move large pieces of furniture away from walls to ensure thorough cleaning beneath them.\n- Allow ample drying time before replacing stored items to prevent moisture damage.', 'score': 0.48243528604507446}
{'text': "Remember to include support columns if needed, especially if designing multi-story structures.\n\n**Step 5: Designing Interiors**\nNow comes the fun part - decorating! Add lighting with torches, lanterns, or glowstone. Install staircases leading upstairs or downstairs. Create cozy seating areas with chairs and tables. Adorn walls with paintings, banners, or vines. And don't forget about adding bathroom facilities!\n\nBe creative but consistent with your theme. If going for a luxury feel, opt for gold accents and fine furniture pieces. Alternatively, go minimalist with clean lines and neutral colors.\n\n**Step 6: Creating Upper Levels & Roofs**\nRepeat steps four and five for additional floors, ensuring structural integrity throughout. When reaching the topmost level, cap off the building with a roof. Common roof shapes include gable, hip, mansard, and skillion. Whichever style you choose, ensure symmetry and proper alignment.", 'score': 0.4739491045475006}
{'text': '**Step 7: Landscaping Exteriors**\nFinally, beautify your surroundings. Plant trees, flowers, and grass. Dig ponds or rivers nearby. Pathway bricks or gravel paths towards entrances. Build outdoor sitting areas, gardens, or even swimming pools!\n\nAnd there you have it - a grand hotel standing tall amidst the virtual landscape! With careful planning, patient collection of materials, thoughtful interior design, meticulous upper levels, and picturesque landscaping, you now possess both a functional space and impressive architectural feat. Happy building!', 'score': 0.4724790155887604}
{'text': 'Title: How to Create and Maintain a Compost Pile\n\nIntroduction:\nComposting is an easy and environmentally friendly way to recycle organic materials and create nutrient-rich soil for your garden or plants. By following these steps, you can learn how to build and maintain a successful compost pile that will help reduce waste and improve the health of your plants.\n\n**Step 1: Choose a Location **\nSelect a well-draining spot in your backyard, away from your house or other structures, as compost piles can produce odors. Ideally, locate the pile in partial shade or a location with morning sun only. This allows the pile to retain moisture while avoiding overheating during peak sunlight hours.\n\n_Key tip:_ Aim for a minimum area of 3 x 3 feet (0.9m x 0.9m) for proper decomposition; smaller piles may not generate enough heat for optimal breakdown of materials.', 'score': 0.471458375453949}
1

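To enable vector search queries on your data, create an Atlas Vector Search index on the sample_db.articles collection.

The following index definition specifies the text_embedding field as the vector type, 1536 as the number of vector dimensions, and euclidean as the similarity measure. The method that you use to create the index depends on your cluster tier:

  • For free and shared clusters, follow the steps to create the index through the Atlas UI. Name the index vector_index and use the following index definition: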

    {
      "fields": [
        {
          "type": "vector",
          "path": "text_embedding",
          "numDimensions": 1536,
          "similarity": "euclidean"
        }
      ]
    }
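  • For dedicated clusters, you can also create the index by using a supported MongoDB driver. Paste and run the following code in your notebook to create the index by using the PyMongo driver helper method: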

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "text_embedding",
                    # Matches the 1536-dimension index definition for this model
                    "numDimensions": 1536,
                    "similarity": "euclidean"
                }
            ]
        },
        name="vector_index",
        type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

To learn more, see Create an Atlas Vector Search Index.
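A newly created index may take a short time to build before it can serve queries. The following is a minimal sketch of a polling loop, assuming that the documents returned by PyMongo's list_search_indexes method include a queryable field; wait_until_queryable is a hypothetical helper, and the polling interval and timeout are arbitrary choices.

```python
import time

def wait_until_queryable(collection, index_name, poll_seconds=5, timeout_seconds=300):
    """Poll list_search_indexes until the named index reports queryable=True."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        time.sleep(poll_seconds)
    return False  # the index did not become queryable before the timeout

# Usage (assumes `collection` from the earlier steps):
# wait_until_queryable(collection, "vector_index")
```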

2

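To generate the query vector for a vector search query, use the same method that you used to create embeddings from your data.

For example, paste and run the following code to do the following:

  • Create an embedding for the string home improvement by calling the embedding function that you defined in the previous example.

  • Pass the embedding into the queryVector field of an aggregation pipeline.

  • Run a sample vector search query and return the output.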

# Generate embedding for the search query
query_embedding = get_embedding("home improvement")
# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "text_embedding",
            "numCandidates": 100,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
    print(i)
{'text': '**Step 6: Regular Maintenance**\nAfter investing effort into cleaning and organizing a crowded room, maintaining its orderliness is crucial. Establish habits that promote ongoing tidiness, such as regularly putting things back where they belong, scheduling weekly cleanup sessions, and addressing new clutter promptly rather than letting it accumulate over time.\n\nBy consistently applying these steps, you can successfully clean and maintain a very crowded room, creating a peaceful and enjoyable living space.', 'score': 0.42446020245552063}
{'text': "**Step 9: Analyze Findings**\nReview collected information meticulously. Identify maximum deviations, average variances, patterns, etc. Decide whether remedial actions are needed based on severity and implications of revealed disparities. Common solutions include shimming low spots, grinding high ones, repairing damaged sections, or even consulting experts about potential structural concerns.\n\nBy diligently adhering to this procedure, you'll successfully check your floor's level condition, thereby facilitating informed decisions concerning maintenance, renovation, or construction projects!", 'score': 0.421939879655838}
{'text': 'Check If a Floor Is Level: A Comprehensive Step-by-Step Guide\n==========================================================\n\nA level floor is crucial for various reasons such as safety, aesthetics, and proper functioning of appliances or furniture that require stability. This tutorial will guide you through checking whether your floor is level with accuracy and precision using tools available at most hardware stores. By following these steps, you can identify any irregularities, enabling necessary corrections before installing new floors, fixtures, or equipment.\n\n**Duration:** Approximately 30 minutes (excluding correction time)', 'score': 0.4213894307613373}
{'text': '**Step 7: Landscaping Exteriors**\nFinally, beautify your surroundings. Plant trees, flowers, and grass. Dig ponds or rivers nearby. Pathway bricks or gravel paths towards entrances. Build outdoor sitting areas, gardens, or even swimming pools!\n\nAnd there you have it - a grand hotel standing tall amidst the virtual landscape! With careful planning, patient collection of materials, thoughtful interior design, meticulous upper levels, and picturesque landscaping, you now possess both a functional space and impressive architectural feat. Happy building!', 'score': 0.41135403513908386}
{'text': "**Step 2: Gather Necessary Materials**\nTo efficiently clean a crowded room, gather all necessary materials beforehand. Some essential items include:\n\n* Trash bags\n* Recycling bins or bags\n* Boxes or storage containers\n* Cleaning supplies (e.g., broom, vacuum cleaner, dustpan, mop, all-purpose cleaner)\n* Gloves\n* Label maker or markers\n\nHaving everything at hand ensures smooth progress without wasting time searching for tools during the cleaning process.\n\n**Step 3: Declutter Systematically**\nStart by removing unnecessary items from the room. Divide objects into categories such as trash, recyclables, donations, and items to keep. Be ruthless when deciding which belongings are truly valuable or needed. If you haven't used something within the past year, consider whether it's worth keeping. Donating unused items not only frees up space but also benefits those in need.", 'score': 0.407828688621521}
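The score values in the sample output are normalized similarity scores rather than raw distances. Assuming the normalization that the Atlas Vector Search documentation describes for the euclidean similarity measure, score = 1 / (1 + distance), the following sketch converts between the two. Both helper names are hypothetical, and you should confirm the formula against the current documentation before you rely on it.

```python
def euclidean_score(distance):
    """Normalize a Euclidean distance into a score, assuming score = 1 / (1 + distance)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)

def distance_from_score(score):
    """Invert the assumed normalization to recover the Euclidean distance."""
    if not 0 < score <= 1:
        raise ValueError("score must be in the interval (0, 1]")
    return 1.0 / score - 1.0

# The first score from the sample output above.
print(distance_from_score(0.42446020245552063))
```

Under this assumption, a score of 1 corresponds to identical vectors, and scores approach 0 as the distance grows.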

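Tip

See also:

You can also create embeddings by calling the API endpoint directly. To learn more, see the OpenAI API reference.

To learn more about running vector search queries, see Run Vector Search Queries.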
