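Get Started with the Haystack Integration

You can integrate MongoDB Vector Search with Haystack to build custom applications with LLMs and implement retrieval-augmented generation (RAG). This tutorial demonstrates how to start using MongoDB Vector Search with Haystack to perform semantic search on your data and build a RAG implementation. Specifically, you perform the following actions:

Set up the environment.
Create a MongoDB Vector Search index.
Store custom data in MongoDB.
Implement RAG by using MongoDB Vector Search to answer questions on your data.

Work with a runnable version of this tutorial as a Python notebook.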

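Background

Haystack is a framework for building custom applications with LLMs, embedding models, and vector search. By integrating MongoDB Vector Search with Haystack, you can use MongoDB as a vector database and use MongoDB Vector Search to implement RAG by retrieving semantically similar documents from your data. To learn more about RAG, see Retrieval-Augmented Generation (RAG) with MongoDB.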
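
To make "semantically similar" concrete, the following sketch ranks documents by cosine similarity between a query vector and document vectors. It is only an illustration: the three-dimensional vectors and the helper names cosine_similarity and top_k are made up for this example, and MongoDB Vector Search performs the equivalent ranking server-side over real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, documents, k=2):
    """Rank documents by similarity to the query vector, highest first."""
    scored = [
        (cosine_similarity(query_vector, doc["embedding"]), doc["content"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings; real models return far longer vectors.
toy_documents = [
    {"content": "My name is Jean and I live in Paris.", "embedding": [0.9, 0.1, 0.0]},
    {"content": "My name is Mark and I live in Berlin.", "embedding": [0.1, 0.9, 0.1]},
    {"content": "My name is Giorgio and I live in Rome.", "embedding": [0.0, 0.2, 0.9]},
]
# Stands in for the embedding of a question about Mark.
toy_query = [0.2, 0.8, 0.1]

best = top_k(toy_query, toy_documents, k=1)
print(best[0][1])
```

In a RAG application, the top-ranked documents are then passed to the LLM as context for the answer.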

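Procedure

Prerequisites

To complete this version of the tutorial, which uses Voyage AI embedding models, you must have the following:

One of the following MongoDB cluster types:
- An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
- A local Atlas deployment created using the Atlas CLI. To learn more, see Create a Local Atlas Deployment.
- A MongoDB Community or Enterprise cluster with Search and Vector Search installed.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
A Voyage AI API key. To create an API key, see Model API Keys.
An interactive notebook for running your Python project, such as Colab.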

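Set Up the Environment

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.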
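
If you prefer to create the notebook file from a script rather than from a notebook UI, the following sketch writes a minimal notebook with one code cell. The helper name write_minimal_notebook and the file name haystack_tutorial.ipynb are hypothetical, and the sketch assumes the nbformat version 4 JSON layout.

```python
import json
import tempfile
from pathlib import Path

def write_minimal_notebook(path, first_cell_source):
    """Write a notebook file containing a single code cell."""
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": first_cell_source,
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }
    path = Path(path)
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return path

# Written to a temporary directory so the sketch leaves your project folder untouched.
notebook_path = write_minimal_notebook(
    Path(tempfile.mkdtemp()) / "haystack_tutorial.ipynb",
    "import os\n",
)
print(notebook_path.name)
```

You can then open the resulting file in Jupyter, Colab, or another notebook environment.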

To set up your notebook environment:

Install and import dependencies.

Run the following command:
```
pip install --quiet --upgrade mongodb-atlas-haystack voyage-embedders-haystack pymongo
```

Run the following code to import the required packages:

import os
from haystack import Pipeline, Document
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

Define environment variables.

Run the following code, replacing the placeholders with the following values:

Your Voyage AI API key.
Your OpenAI API key.
Your MongoDB cluster's connection string.

os.environ["VOYAGE_API_KEY"] = "<voyage-api-key>"
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"
os.environ["MONGO_CONNECTION_STRING"] = "<connection-string>"

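Note

Replace <connection-string> with the connection string for your Atlas cluster or local deployment.

For an Atlas cluster, your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

To learn more, see Connect to a Cluster via Client Libraries.

For a local deployment, your connection string should use the following format:

mongodb://localhost:<port-number>/?directConnection=true

To learn more, see Connection Strings.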
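
The two formats differ in their URI scheme, so a quick check before connecting can catch an unreplaced placeholder or a mistyped scheme. The following is a minimal sketch using only the standard library. The helper name describe_connection_string and its return labels are hypothetical and not part of PyMongo.

```python
from urllib.parse import urlparse

def describe_connection_string(uri):
    """Classify a MongoDB connection string by scheme, rejecting placeholders."""
    if "<" in uri or ">" in uri:
        raise ValueError("Connection string still contains a <placeholder>.")
    scheme = urlparse(uri).scheme
    if scheme == "mongodb+srv":
        return "srv"       # the format shown for Atlas clusters
    if scheme == "mongodb":
        return "standard"  # the format shown for local deployments
    raise ValueError(f"Unexpected scheme: {scheme!r}")

print(describe_connection_string("mongodb://localhost:27017/?directConnection=true"))
```

In the notebook, you might call describe_connection_string(os.environ["MONGO_CONNECTION_STRING"]) before you create the client.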

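Create the MongoDB Vector Search Index

In this section, you create the haystack_db database and test collection to store your custom data. Then, to enable vector search queries on your data, you create a MongoDB Vector Search index.

Run the following code to connect to your MongoDB cluster.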

client = MongoClient(os.environ.get("MONGO_CONNECTION_STRING"))

Create the `haystack_db.test` collection.

Run the following code to create the haystack_db database and test collection.

# Create your database and collection
db_name = "haystack_db"
collection_name = "test"
database = client[db_name]
database.create_collection(collection_name)
# Define collection
collection = client[db_name][collection_name]

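Define the MongoDB Vector Search index.

Run the following code to create an index of the vectorSearch type. The embedding field contains the embeddings that you'll create using Voyage AI's voyage-3-large embedding model. The index definition specifies 1024 vector dimensions and measures similarity using cosine.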

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
   definition={
      "fields": [
         {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1024,
            "similarity": "cosine"
         }
      ]
   },
   name="vector_index",
   type="vectorSearch"
)
collection.create_search_index(model=search_index_model)

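The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.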
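
Rather than waiting a fixed amount of time, you can poll until the index reports that it is queryable. The following is a minimal sketch, assuming the collection object exposes PyMongo's list_search_indexes method and that each returned index document carries a queryable field. The helper name wait_until_queryable and its timeout defaults are hypothetical.

```python
import time

def wait_until_queryable(collection, index_name, poll_seconds=5, timeout_seconds=300):
    """Poll the search index until it is queryable; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        # list_search_indexes(name) yields the index documents matching that name.
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        time.sleep(poll_seconds)
    return False
```

With the names used in this tutorial, the call would be wait_until_queryable(collection, "vector_index").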

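Store Custom Data in MongoDB

In this section, you instantiate MongoDB as a vector database, also called a document store. Then, you create vector embeddings from custom data and store these documents in a collection in MongoDB. Paste and run the following code snippets in your notebook.

Instantiate Atlas as a document store.

Run the following code to instantiate Atlas as a document store. This code establishes a connection to your Atlas cluster and specifies the following:

haystack_db and test as the Atlas database and collection used to store the documents.
vector_index as the index used to run semantic search queries.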

document_store = MongoDBAtlasDocumentStore(
   database_name="haystack_db",
   collection_name="test",
   vector_search_index="vector_index",
   full_text_search_index="search_index" # Declared but not used in this example
)

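Load sample data on your Atlas cluster.

This code defines a few sample documents and runs a pipeline with the following components:

An embedder from Voyage AI to convert your documents into vector embeddings.
A document writer to populate your document store with the sample documents and their embeddings.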

# Create some example documents
documents = [
   Document(content="My name is Jean and I live in Paris."),
   Document(content="My name is Mark and I live in Berlin."),
   Document(content="My name is Giorgio and I live in Rome."),
]
# Initializing a document embedder to convert text content into vectorized form.
doc_embedder = VoyageDocumentEmbedder()
# Setting up a document writer to handle the insertion of documents into the MongoDB collection.
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
# Creating a pipeline for indexing documents. The pipeline includes embedding and writing documents.
indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")
# Connecting the components of the pipeline for document flow.
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")
# Running the pipeline with the list of documents to index them in MongoDB.
indexing_pipe.run({"doc_embedder": {"documents": documents}})

Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.42it/s]
{'doc_embedder': {'meta': {'total_tokens': 32}},
 'doc_writer': {'documents_written': 3}}

Tip

After running the sample code, if you're using Atlas, you can verify your vector embeddings by navigating to the haystack_db.test namespace in the Atlas UI.
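
You can also check the stored embeddings from the notebook. The following sketch inspects documents as plain dictionaries, which is the shape PyMongo's find returns, and reports any whose embedding field is missing or has the wrong length. The helper name find_bad_embeddings and the sample data are hypothetical.

```python
def find_bad_embeddings(stored_documents, expected_dimensions):
    """Return (_id, problem) pairs for documents with unusable embeddings."""
    problems = []
    for doc in stored_documents:
        embedding = doc.get("embedding")
        if not isinstance(embedding, list):
            problems.append((doc.get("_id"), "missing embedding"))
        elif len(embedding) != expected_dimensions:
            problems.append((doc.get("_id"), f"length {len(embedding)}"))
    return problems

# Hypothetical stand-ins for documents read back from the collection.
sample = [
    {"_id": 1, "embedding": [0.0] * 1024},
    {"_id": 2, "embedding": [0.0] * 1536},
    {"_id": 3},
]
print(find_bad_embeddings(sample, expected_dimensions=1024))
```

Against the real collection, you might pass collection.find({}, {"embedding": 1}) as the first argument and the numDimensions value from your index definition as the second.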

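Answer Questions on Your Data

This section demonstrates how to implement RAG in your application with MongoDB Vector Search and Haystack.

The following code defines and runs a pipeline with the following components:

The VoyageTextEmbedder embedder to create embeddings from your query.
The MongoDBAtlasEmbeddingRetriever retriever to retrieve embeddings from your document store that are similar to the query embedding.
A PromptBuilder that passes a prompt template to instruct the LLM to use the retrieved documents as context for your prompt.
The OpenAIGenerator generator to generate a context-aware response using an LLM from OpenAI.

In this example, you prompt the LLM with the sample query Where does Mark live?. The LLM generates an accurate, context-aware response from the custom data you stored in Atlas.

# Template for generating prompts that answer a query from the retrieved context documents.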
prompt_template = """
    You are an assistant allowed to use the following context documents.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    \nQuery: {{query}}
    \nAnswer:
"""
# Setting up a retrieval-augmented generation (RAG) pipeline for generating responses.
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", VoyageTextEmbedder())
# Adding a component for retrieving related documents from MongoDB based on the query embedding.
rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=15), name="retriever")
# Building prompts based on retrieved documents to be used for generating responses.
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template, required_variables=["query", "documents"]))
# Adding a language model generator to produce the final text output.
rag_pipeline.add_component("llm", OpenAIGenerator())
# Connecting the components of the RAG pipeline to ensure proper data flow.
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
# Run the pipeline
query = "Where does Mark live?"
result = rag_pipeline.run(
  {
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query},
  })
print(result['llm']['replies'][0])

Mark lives in Berlin.
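
To see roughly what the LLM receives, the following plain-Python sketch approximates the prompt that the template produces once the retrieved documents and the query are filled in. It is not how PromptBuilder works internally, because PromptBuilder renders the template with Jinja2 and the exact whitespace differs. The helper name render_prompt is hypothetical.

```python
def render_prompt(documents, query):
    """Approximate the filled-in prompt template as a single string."""
    lines = [
        "You are an assistant allowed to use the following context documents.",
        "Documents:",
    ]
    lines.extend(documents)          # one retrieved document per line
    lines.append(f"Query: {query}")
    lines.append("Answer:")
    return "\n".join(lines)

retrieved = [
    "My name is Mark and I live in Berlin.",
    "My name is Jean and I live in Paris.",
    "My name is Giorgio and I live in Rome.",
]
print(render_prompt(retrieved, "Where does Mark live?"))
```

Because the document about Mark is part of the prompt, the LLM can answer from your data rather than from its training data alone.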

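Prerequisites

To complete this version of the tutorial, which uses OpenAI embedding models, you must have the following:

One of the following MongoDB cluster types:
- An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
- A local Atlas deployment created using the Atlas CLI. To learn more, see Create a Local Atlas Deployment.
- A MongoDB Community or Enterprise cluster with Search and Vector Search installed.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An interactive notebook for running your Python project, such as Colab.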

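Set Up the Environment

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.

To set up your notebook environment:

Install and import dependencies.

Run the following command: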
```
pip install --quiet --upgrade mongodb-atlas-haystack pymongo
```

Run the following code to import the required packages:

import os
from haystack import Pipeline, Document
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

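Define environment variables.

Run the following code, replacing the placeholders with the following values:

Your OpenAI API key.
Your MongoDB cluster's connection string.

os.environ["OPENAI_API_KEY"] = "<api-key>"
os.environ["MONGO_CONNECTION_STRING"] = "<connection-string>"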
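
If you plan to share the notebook, you may prefer not to store secrets in a cell. As one possible alternative, the following sketch prompts for a value only when the environment variable is not already set. The helper name ensure_env is hypothetical, and the prompt argument exists only so the function can be exercised without a terminal.

```python
import getpass
import os

def ensure_env(name, prompt=getpass.getpass):
    """Set the environment variable from a hidden prompt if it is not set yet."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]
```

In the notebook, you might call ensure_env("OPENAI_API_KEY") and ensure_env("MONGO_CONNECTION_STRING") in place of the assignments above.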

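Note

Replace <connection-string> with the connection string for your Atlas cluster or local deployment.

For an Atlas cluster, your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

To learn more, see Connect to a Cluster via Client Libraries.

For a local deployment, your connection string should use the following format:

mongodb://localhost:<port-number>/?directConnection=true

To learn more, see Connection Strings.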

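Create the MongoDB Vector Search Index

In this section, you create the haystack_db database and test collection to store your custom data. Then, to enable vector search queries on your data, you create a MongoDB Vector Search index.

Run the following code to connect to your MongoDB cluster.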

client = MongoClient(os.environ.get("MONGO_CONNECTION_STRING"))

Create the `haystack_db.test` collection.

Run the following code to create the haystack_db database and test collection.

# Create your database and collection
db_name = "haystack_db"
collection_name = "test"
database = client[db_name]
database.create_collection(collection_name)
# Define collection
collection = client[db_name][collection_name]

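Define the MongoDB Vector Search index.

Run the following code to create an index of the vectorSearch type. The embedding field contains the embeddings that you'll create using OpenAI's text-embedding-ada-002 embedding model. The index definition specifies 1536 vector dimensions and measures similarity using cosine.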

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
   definition={
      "fields": [
         {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1536,
            "similarity": "cosine"
         }
      ]
   },
   name="vector_index",
   type="vectorSearch"
)
collection.create_search_index(model=search_index_model)

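The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.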
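
The numDimensions value in the index definition has to match the length of the vectors that your embedding model returns, which is why the two versions of this tutorial use different values. The following sketch builds the definition dictionary from a model name so that the two stay in step. The lookup table and the helper name vector_index_definition are hypothetical, and the table holds only the two model and dimension pairs stated in this tutorial.

```python
# Dimensions as stated in this tutorial for each embedding model.
EMBEDDING_DIMENSIONS = {
    "voyage-3-large": 1024,
    "text-embedding-ada-002": 1536,
}

def vector_index_definition(model_name, path="embedding", similarity="cosine"):
    """Build a vectorSearch index definition sized for the given model."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": EMBEDDING_DIMENSIONS[model_name],
                "similarity": similarity,
            }
        ]
    }

print(vector_index_definition("text-embedding-ada-002"))
```

The returned dictionary has the same shape as the definition argument passed to SearchIndexModel in the code above.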

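Store Custom Data in MongoDB

Instantiate Atlas as a document store.

Run the following code to instantiate Atlas as a document store. This code establishes a connection to your Atlas cluster and specifies the following:

haystack_db and test as the Atlas database and collection used to store the documents.
vector_index as the index used to run semantic search queries.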

document_store = MongoDBAtlasDocumentStore(
   database_name="haystack_db",
   collection_name="test",
   vector_search_index="vector_index",
   full_text_search_index="search_index" # Declared but not used in this example
)

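Load sample data on your Atlas cluster.

This code defines a few sample documents and runs a pipeline with the following components:

An embedder from OpenAI to convert your documents into vector embeddings.
A document writer to populate your document store with the sample documents and their embeddings.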

# Create some example documents
documents = [
   Document(content="My name is Jean and I live in Paris."),
   Document(content="My name is Mark and I live in Berlin."),
   Document(content="My name is Giorgio and I live in Rome."),
]
# Initializing a document embedder to convert text content into vectorized form.
doc_embedder = OpenAIDocumentEmbedder()
# Setting up a document writer to handle the insertion of documents into the MongoDB collection.
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
# Creating a pipeline for indexing documents. The pipeline includes embedding and writing documents.
indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")
# Connecting the components of the pipeline for document flow.
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")
# Running the pipeline with the list of documents to index them in MongoDB.
indexing_pipe.run({"doc_embedder": {"documents": documents}})

Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]
{'doc_embedder': {'meta': {'model': 'text-embedding-ada-002',
   'usage': {'prompt_tokens': 32, 'total_tokens': 32}}},
 'doc_writer': {'documents_written': 3}}
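
The writer in this pipeline uses DuplicatePolicy.SKIP, which leaves a document untouched when one with the same id is already in the store. The following in-memory sketch mimics that behavior. It assumes ids are derived from document content, which is my understanding of Haystack's default when you don't pass an id. The dictionary store, the hashing scheme, and the helper name write_documents_skip are hypothetical simplifications.

```python
import hashlib

def write_documents_skip(store, contents):
    """Insert each text under a content-derived id, skipping ids already present."""
    written = 0
    for content in contents:
        doc_id = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if doc_id in store:
            continue  # SKIP-style handling: keep the existing document
        store[doc_id] = content
        written += 1
    return written

store = {}
texts = [
    "My name is Jean and I live in Paris.",
    "My name is Mark and I live in Berlin.",
    "My name is Giorgio and I live in Rome.",
]
print(write_documents_skip(store, texts))  # first run
print(write_documents_skip(store, texts))  # second run with the same texts
```

Under this policy, running the indexing cell a second time with the same documents should not create duplicates in the collection.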

Tip

After running the sample code, if you're using Atlas, you can verify your vector embeddings by navigating to the haystack_db.test namespace in the Atlas UI.

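Answer Questions on Your Data

This section demonstrates how to implement RAG in your application with MongoDB Vector Search and Haystack.

The following code defines and runs a pipeline with the following components:

The OpenAITextEmbedder embedder to create embeddings from your query.
The MongoDBAtlasEmbeddingRetriever retriever to retrieve embeddings from your document store that are similar to the query embedding.
A PromptBuilder that passes a prompt template to instruct the LLM to use the retrieved documents as context for your prompt.
The OpenAIGenerator generator to generate a context-aware response using an LLM from OpenAI.

In this example, you prompt the LLM with the sample query Where does Mark live?. The LLM generates an accurate, context-aware response from the custom data you stored in Atlas.

# Template for generating prompts that answer a query from the retrieved context documents.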
prompt_template = """
    You are an assistant allowed to use the following context documents.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    \nQuery: {{query}}
    \nAnswer:
"""
# Setting up a retrieval-augmented generation (RAG) pipeline for generating responses.
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
# Adding a component for retrieving related documents from MongoDB based on the query embedding.
rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=15), name="retriever")
# Building prompts based on retrieved documents to be used for generating responses.
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template, required_variables=["query", "documents"]))
# Adding a language model generator to produce the final text output.
rag_pipeline.add_component("llm", OpenAIGenerator())
# Connecting the components of the RAG pipeline to ensure proper data flow.
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
# Run the pipeline
query = "Where does Mark live?"
result = rag_pipeline.run(
  {
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query},
  })
print(result['llm']['replies'][0])

Mark lives in Berlin.

Next Steps

MongoDB also provides the following developer resources:
