
Get Started with the MongoDB LangChain Integration

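Note

This tutorial uses LangChain's Python library. For a tutorial that uses the JavaScript library, see Get Started with the LangChain JavaScript/TypeScript Integration.

You can integrate MongoDB Vector Search with LangChain to build LLM applications and implement retrieval-augmented generation (RAG). This tutorial demonstrates how to start using MongoDB Vector Search with LangChain to perform semantic search on your data and build a RAG implementation. Specifically, you perform the following actions: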

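  1. Set up the environment.

  2. Store custom data in MongoDB.

  3. Create a MongoDB Vector Search index on your data.

  4. Run the following vector search queries:

    • Semantic search.

    • Semantic search with score.

    • Semantic search with metadata pre-filtering.

  5. Implement RAG by using MongoDB Vector Search to answer questions on your data.

Work with a runnable version of this tutorial as a Python notebook.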

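LangChain is an open-source framework that simplifies the creation of LLM applications through the use of "chains." Chains are LangChain-specific components that can be combined for a variety of AI use cases, including RAG.

By integrating MongoDB Vector Search with LangChain, you can use MongoDB as a vector database and use MongoDB Vector Search to implement RAG by retrieving semantically similar documents from your data. To learn more about RAG, see Retrieval-Augmented Generation (RAG) with MongoDB.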

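To complete this tutorial, you must have the following:

  • A MongoDB cluster, either an Atlas cluster or a local deployment.

  • A Voyage AI API key. To create an account and API key, see the Voyage AI website.

  • An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.

  • An environment to run interactive Python notebooks, such as Colab.

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.

To set up your notebook environment, do the following: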

1. Run the following command:

pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-voyageai langchain-openai langchain-text-splitters pymongo pypdf

Then, run the following code to import the required packages:

import os, pymongo, pprint
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

2. Run the following code, replacing the placeholders with the following values:

  • Your Voyage AI and OpenAI API keys.

  • The connection string for your MongoDB cluster.

os.environ["VOYAGE_API_KEY"] = "<voyage-api-key>"
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"
MONGODB_URI = "<connection-string>"

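Note

Replace <connection-string> with the connection string for your Atlas cluster or local deployment.

For an Atlas cluster, the connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

To learn more, see Connect to a Cluster via Drivers.

For a local deployment, the connection string should use the following format:

mongodb://localhost:<port-number>/?directConnection=true

To learn more, see Connection Strings.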
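
As a quick sanity check before connecting, the two formats above can be told apart by their scheme prefix. The following is a minimal sketch; the helper name connection_format is hypothetical and not part of PyMongo or LangChain, and it only inspects the prefix rather than validating the whole string.

```python
def connection_format(uri: str) -> str:
    """Classify a MongoDB connection string by its scheme prefix.

    Hypothetical helper for illustration only. It returns "srv" for the
    mongodb+srv:// format and "standard" for the mongodb:// format.
    """
    if uri.startswith("mongodb+srv://"):
        return "srv"
    if uri.startswith("mongodb://"):
        return "standard"
    # An unreplaced placeholder such as "<connection-string>" ends up here.
    raise ValueError(f"Not a MongoDB connection string: {uri!r}")


# Made-up host names and credentials, for illustration only.
print(connection_format("mongodb+srv://user:secret@cluster0.example.mongodb.net"))
print(connection_format("mongodb://localhost:27017/?directConnection=true"))
```

Calling the helper on MONGODB_URI before creating a client makes a forgotten placeholder fail early with a readable message.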

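Then, load custom data into MongoDB and instantiate your MongoDB cluster as a vector database, also called a vector store. Copy and paste the following code snippets into your notebook.

1. For this tutorial, you use a publicly accessible PDF document about a recent MongoDB earnings report as the data source for your vector store.

To load the sample data, run the following code snippet. It does the following:

  • Retrieves the PDF from the specified URL and loads the raw text data.

  • Uses a text splitter to split the data into smaller documents.

  • Specifies chunk parameters, which determine the number of characters in each document and the number of characters that should overlap between two consecutive documents.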

# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/13176/pdf")
data = loader.load()
# Split PDF into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents(data)
# Print the first document
docs[0]
Document(metadata={'producer': 'West Corporation using ABCpdf', 'creator': 'PyPDF', 'creationdate': '2025-03-05T21:06:26+00:00', 'title': 'MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results', 'source': 'https://investors.mongodb.com/node/13176/pdf', 'total_pages': 9, 'page': 0, 'page_label': '1'}, page_content='MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results\nMarch 5, 2025\nFourth Quarter Fiscal 2025 Total Revenue of $548.4 million, up 20% Year-over-Year')
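
To make the effect of the two chunk parameters concrete, here is a simplified sketch of fixed-window chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as newlines); the function split_with_overlap is a hypothetical stand-in that only shows how chunk_size and chunk_overlap interact.

```python
def split_with_overlap(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 500-character sample string made of repeating digits.
sample = "".join(str(i % 10) for i in range(500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])           # [200, 200, 140]
print(chunks[0][-20:] == chunks[1][:20])  # True: 20 characters are shared
```

A larger overlap keeps more context across chunk boundaries at the cost of storing and embedding more duplicated text.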
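
2. Run the following code to create a vector store instance named vector_store from the sample documents. This snippet specifies the following:

  • The connection string for your MongoDB cluster.

  • langchain_db.test as the MongoDB namespace in which to store the documents.

  • The voyage-3-large embedding model from Voyage AI, which converts the text into vector embeddings for the embedding field.

  • vector_index as the index to use for querying the vector store.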

# Instantiate the vector store using your MongoDB connection string
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string = MONGODB_URI,
    namespace = "langchain_db.test",
    embedding = VoyageAIEmbeddings(model="voyage-3-large"),
    index_name = "vector_index"
)
# Add documents to the vector store
vector_store.add_documents(documents=docs)

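After you run the sample code, if you're using Atlas, you can verify your vector embeddings by navigating to the langchain_db.test namespace in the Atlas UI.

Tip

To enable vector search queries on your vector store, create a MongoDB Vector Search index on the langchain_db.test collection by using either the LangChain helper method or the PyMongo driver method.

In your notebook, run the following code for your preferred method. The index definition specifies indexing the following fields:

  • The embedding field as the vector type. The embedding field contains the embeddings created using Voyage AI's voyage-3-large embedding model. The index definition specifies 1024 vector dimensions and measures similarity using cosine.

  • The page_label field as the filter type, for pre-filtering data by the page number in the PDF.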

# Option 1: Use the LangChain helper method to create the vector search index
vector_store.create_vector_search_index(
    dimensions = 1024, # The number of vector dimensions to index
    filters = [ "page_label" ]
)

# Option 2: Use the PyMongo driver method instead.
# Run only one of the two options: both create an index named vector_index.
# Connect to your cluster
client = MongoClient(MONGODB_URI)
collection = client["langchain_db"]["test"]
# Create your vector search index model, then create the index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1024,
                "similarity": "cosine"
            },
            {
                "type": "filter",
                "path": "page_label"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=search_index_model)

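The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.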
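
Rather than waiting a fixed time, a notebook cell can poll until the index reports that it is ready. The sketch below is one possible approach under two assumptions that this tutorial does not state: that PyMongo's collection.list_search_indexes() returns index documents, and that each document carries a boolean queryable field. The helper wait_until_queryable is hypothetical, and it takes the listing function as an argument so that it can be tried without a cluster.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Mapping]],
    index_name: str,
    timeout_s: float = 120.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll list_indexes() until the named index reports queryable=True.

    Returns True once the index is queryable, or False if timeout_s
    elapses first. The "name" and "queryable" keys are assumptions about
    the shape of the index documents.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in list_indexes():
            if index.get("name") == index_name and index.get("queryable"):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Intended use against a real collection (not executed here):
#   ready = wait_until_queryable(lambda: collection.list_search_indexes(), "vector_index")
```

Passing the listing function in, instead of the collection itself, keeps the helper independent of the driver and easy to exercise with a stub.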

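After MongoDB builds your index, you can run vector search queries on your data. The following examples demonstrate various queries that you can run on your vectorized data.

The following query uses the similarity_search method to perform a basic semantic search for the string MongoDB acquisition. It returns a list of documents ranked by relevance.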

query = "MongoDB acquisition"
results = vector_store.similarity_search(query)
pprint.pprint(results)
[Document(id='67f0259b8bb2babc06924409', metadata={ ... }, page_content='SOURCE MongoDB, Inc.'),
Document(id='67f0259b8bb2babc0692432f', metadata={ ... }, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'),
Document(id='67f0259b8bb2babc06924355', metadata={ ... }, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'),
Document(id='67f0259b8bb2babc069243a6', metadata={ ... }, page_content="MongoDB's unified, intelligent data platform was built to power the next generation of applications, and MongoDB is the most widely available, globally")]

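The following query uses the similarity_search_with_score method to perform a semantic search for the string MongoDB acquisition and specifies the k parameter to limit the number of documents to return to 3.

Note

The k parameter in this example refers to the similarity_search_with_score method option, not the knnBeta operator option of the same name.

It returns the three most relevant documents and a relevance score between 0 and 1.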

query = "MongoDB acquisition"
results = vector_store.similarity_search_with_score(
    query = query, k = 3
)
pprint.pprint(results)
[(Document(id='67f0259b8bb2babc06924409', metadata={ ... }, page_content='SOURCE MongoDB, Inc.'),
0.8193451166152954),
(Document(id='67f0259b8bb2babc0692432f', metadata={ ... }, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'),
0.7815237045288086),
(Document(id='67f0259b8bb2babc06924355', metadata={ ... }, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'),
0.7788857221603394)]
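
To illustrate where a score between 0 and 1 can come from, the following sketch computes cosine similarity for two small vectors and rescales it from the range [-1, 1] to [0, 1]. The rescaling (1 + cosine) / 2 is taken from MongoDB's Vector Search scoring documentation as I understand it, not from this tutorial, so treat it as an assumption. The function names and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalized_score(a: list[float], b: list[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed normalization)."""
    return (1 + cosine_similarity(a, b)) / 2


print(normalized_score([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(normalized_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.5
print(normalized_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: 0.0
```

Under this normalization, a score near 0.8, like the ones in the output above, corresponds to a raw cosine similarity near 0.6.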

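You can pre-filter your data by using an MQL match expression that compares the indexed field with another value in your collection. You must index any metadata fields that you want to filter by as the filter type. To learn more, see How to Index Fields for Vector Search.

Note

You specified the page_label field as a filter when you created the index for this tutorial.

The following query uses the similarity_search_with_score method to perform a semantic search for the string MongoDB acquisition. It also specifies the following:

  • The k parameter to limit the number of documents to return to 3.

  • A pre-filter on the page_label field that uses the $eq operator to match only documents that appear on page 2.

It returns the three most relevant documents from page 2 and a relevance score between 0 and 1.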

query = "MongoDB acquisition"
results = vector_store.similarity_search_with_score(
    query = query,
    k = 3,
    pre_filter = { "page_label": { "$eq": 2 } }
)
pprint.pprint(results)
[(Document(id='67f0259b8bb2babc06924355', metadata={ ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'),
0.7788857221603394),
(Document(id='67f0259b8bb2babc06924351', metadata={ ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'),
0.7606035470962524),
(Document(id='67f0259b8bb2babc06924354', metadata={ ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'),
0.7583936452865601)]
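
Conceptually, a pre-filter narrows the candidate set before ranking, so a highly similar document on another page never appears in the results. The sketch below imitates that order of operations in plain Python over a toy in-memory list. It is not how MongoDB executes the query. The documents, the two-dimensional vectors, and the function prefiltered_search are made up for illustration, and page labels are compared as strings because the sample output above shows page_label stored as '2'.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product, used here as a simple similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def prefiltered_search(query_vec: list[float], docs: list[dict], page_label: str, k: int = 3) -> list[dict]:
    """Keep only documents on the given page, then return the k most similar."""
    candidates = [d for d in docs if d["page_label"] == page_label]
    ranked = sorted(candidates, key=lambda d: dot(query_vec, d["embedding"]), reverse=True)
    return ranked[:k]


# Toy data: document "a" is the closest match overall, but it is on page 1.
toy_docs = [
    {"id": "a", "page_label": "1", "embedding": [1.0, 0.0]},
    {"id": "b", "page_label": "2", "embedding": [0.8, 0.6]},
    {"id": "c", "page_label": "2", "embedding": [0.0, 1.0]},
    {"id": "d", "page_label": "2", "embedding": [0.6, 0.8]},
]

hits = prefiltered_search([1.0, 0.0], toy_docs, page_label="2", k=2)
print([d["id"] for d in hits])  # ['b', 'd']
```

Because filtering happens first, k limits the number of results drawn from the filtered set rather than from the whole collection.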

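Tip

For a full list of semantic search methods, see the API reference.

This section demonstrates how to implement RAG in your application with MongoDB Vector Search and LangChain. Now that you've used MongoDB Vector Search to retrieve semantically similar documents, run the following code examples to prompt the LLM to answer questions based on those documents.

This example performs basic RAG. It does the following:

  • Instantiates MongoDB Vector Search as a retriever to query for similar documents, including the optional k parameter to search for only the 10 most relevant documents.

  • Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {question} variable.

  • Constructs a chain that specifies the following:

    • MongoDB Vector Search as the retriever to search for documents to use as context.

    • The prompt template that you defined.

    • OpenAI's gpt-4o chat model to generate a context-aware response.

  • Invokes the chain with a sample query.

  • Returns the LLM's response and the documents used as context. The generated response might vary.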

# Instantiate MongoDB Vector Search as a retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = { "k": 10 }
)
# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o")
# Construct a chain to answer questions on your data
chain = (
    { "context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
# Prompt the chain
question = "What was MongoDB's latest acquisition?"
answer = chain.invoke(question)
print("Question: " + question)
print("Answer: " + answer)
# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
Question: What was MongoDB's latest acquisition?
Answer: MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models.
Source documents:
[Document(id='67f0259b8bb2babc06924409', metadata={'_id': '67f0259b8bb2babc06924409', ... 'page_label': '9'}, page_content='SOURCE MongoDB, Inc.'),
Document(id='67f0259b8bb2babc06924351', metadata={'_id': '67f0259b8bb2babc06924351', ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'),
Document(id='67f0259b8bb2babc0692432f', metadata={'_id': '67f0259b8bb2babc0692432f', ... 'page_label': '1'}, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'),
Document(id='67f0259b8bb2babc06924355', metadata={'_id': '67f0259b8bb2babc06924355', ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'),
Document(id='67f0259b8bb2babc069243a6', metadata={'_id': '67f0259b8bb2babc069243a6', ... 'page_label': '4'}, page_content="MongoDB's unified, intelligent data platform was built to power the next generation of applications, and MongoDB is the most widely available, globally"),
Document(id='67f0259b8bb2babc06924329', metadata={'_id': '67f0259b8bb2babc06924329', ... 'page_label': '1'}, page_content='MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results\nMarch 5, 2025\nFourth Quarter Fiscal 2025 Total Revenue of $548.4 million, up 20% Year-over-Year'),
Document(id='67f0259b8bb2babc069243a7', metadata={'_id': '67f0259b8bb2babc069243a7', ... 'page_label': '4'}, page_content='distributed database on the market. With integrated capabilities for operational data, search, real-time analytics, and AI-powered retrieval, MongoDB'),
Document(id='67f0259b8bb2babc069243a5', metadata={'_id': '67f0259b8bb2babc069243a5', ... 'page_label': '4'}, page_content="Headquartered in New York, MongoDB's mission is to empower innovators to create, transform, and disrupt industries with software and data."),
Document(id='67f0259b8bb2babc06924354', metadata={'_id': '67f0259b8bb2babc06924354', ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'),
Document(id='67f0259b8bb2babc069243a9', metadata={'_id': '67f0259b8bb2babc069243a9', ... 'page_label': '4'}, page_content='50,000 customers across almost every industry—including 70% of the Fortune 100—rely on MongoDB for their most important applications. To learn\nmore, visit mongodb.com .\nInvestor Relations')]
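
To show how data flows through the chain above, the following sketch reproduces the same four stages (retrieve, fill the prompt, call the model, parse the output) with plain Python functions and no external services. Everything here is a stand-in: stub_retriever returns fixed strings instead of querying MongoDB, stub_model returns a canned message instead of calling OpenAI, and parse_output plays the role of StrOutputParser. The real chain also passes Document objects rather than plain strings into {context}.

```python
TEMPLATE = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
"""


def stub_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns fixed context strings."""
    return [
        "MongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models",
        "SOURCE MongoDB, Inc.",
    ]


def build_prompt(context_docs: list[str], question: str) -> str:
    """Fill the {context} and {question} variables of the template."""
    return TEMPLATE.format(context="\n".join(context_docs), question=question)


def stub_model(prompt: str) -> dict:
    """Stand-in for the chat model: wraps a canned reply in a message dict."""
    return {"content": f"(stub reply to a prompt of {len(prompt)} characters)"}


def parse_output(message: dict) -> str:
    """Stand-in for the output parser: extracts the text of the message."""
    return message["content"]


def run_chain(question: str, retriever=stub_retriever, model=stub_model) -> str:
    """Run the stages in the same order as the chain: retrieve, prompt, model, parse."""
    docs = retriever(question)
    prompt = build_prompt(docs, question)
    return parse_output(model(prompt))


print(build_prompt(stub_retriever("unused"), "What was MongoDB's latest acquisition?"))
print(run_chain("What was MongoDB's latest acquisition?"))
```

Swapping a stub for a real component changes one function without touching the others, which is the same composability that the pipe-based chain provides.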

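This example performs RAG with filtering. It does the following:

  • Instantiates MongoDB Vector Search as a retriever to query for similar documents, including the following optional parameters:

    • k to search for only the 10 most relevant documents.

    • score_threshold to use only documents with a relevance score above 0.75.

      Note

      This parameter refers to a relevance score that LangChain uses to normalize your results, not the relevance score used in MongoDB Search queries. To use MongoDB Search scores in your RAG implementation, define a custom retriever that uses the similarity_search_with_score method and filters by the MongoDB Search score.

    • pre_filter to filter on the page_label field for documents that appear on page 2 only.

  • Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {question} variable.

  • Constructs a chain that specifies the following:

    • MongoDB Vector Search as the retriever to search for documents to use as context.

    • The prompt template that you defined.

    • OpenAI's gpt-4o chat model to generate a context-aware response.

  • Invokes the chain with a sample query.

  • Returns the LLM's response and the documents used as context. The generated response might vary.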

# Instantiate MongoDB Vector Search as a retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = {
        "k": 10,
        "score_threshold": 0.75,
        "pre_filter": { "page_label": { "$eq": 2 } }
    }
)
# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o")
# Construct a chain to answer questions on your data
chain = (
    { "context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
# Prompt the chain (the chain defined above is named chain)
question = "What was MongoDB's latest acquisition?"
answer = chain.invoke(question)
print("Question: " + question)
print("Answer: " + answer)
# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
Question: What was MongoDB's latest acquisition?
Answer: MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models.
Source documents:
[Document(id='67f0259b8bb2babc06924351', metadata={'_id': '67f0259b8bb2babc06924351', ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'),
Document(id='67f0259b8bb2babc06924355', metadata={'_id': '67f0259b8bb2babc06924355', ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'),
Document(id='67f0259b8bb2babc06924354', metadata={'_id': '67f0259b8bb2babc06924354', ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'),
Document(id='67f0259b8bb2babc06924358', metadata={'_id': '67f0259b8bb2babc06924358', ... 'page_label': '2'}, page_content='Lombard Odier, a Swiss private bank, partnered with MongoDB to migrate and modernize its legacy banking technology'),
Document(id='67f0259b8bb2babc06924352', metadata={'_id': '67f0259b8bb2babc06924352', ... 'page_label': '2'}, page_content="AI applications. Integrating Voyage AI's technology with MongoDB will enable organizations to easily build trustworthy,"),
Document(id='67f0259b8bb2babc0692435a', metadata={'_id': '67f0259b8bb2babc0692435a', ... 'page_label': '2'}, page_content='applications from a legacy relational database to MongoDB 20 times faster than previous migrations.\nFirst Quarter and Full Year Fiscal 2026 Guidance'),
Document(id='67f0259b8bb2babc06924356', metadata={'_id': '67f0259b8bb2babc06924356', ... 'page_label': '2'}, page_content='For the third consecutive year, MongoDB was named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud'),
Document(id='67f0259b8bb2babc0692434d', metadata={'_id': '67f0259b8bb2babc0692434d', ... 'page_label': '2'}, page_content='compared to $121.5 million of cash from operations in the year-ago period. MongoDB used $29.6 million of cash in capital'),
Document(id='67f0259b8bb2babc0692434c', metadata={'_id': '67f0259b8bb2babc0692434c', ... 'page_label': '2'}, page_content='Cash Flow: During the year ended January 31, 2025, MongoDB generated $150.2 million of cash from operations,'),
Document(id='67f0259b8bb2babc06924364', metadata={'_id': '67f0259b8bb2babc06924364', ... 'page_label': '2'}, page_content='MongoDB will host a conference call today, March 5, 2025, at 5:00 p.m. (Eastern Time) to discuss its financial results and business outlook. A live')]
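
The note on score_threshold suggests filtering by the MongoDB Search score through similarity_search_with_score. One rough way to sketch that idea is a small function that runs the scored search and keeps only documents above a cutoff. The function search_above_score is hypothetical and is a plain function, not a LangChain retriever class. It assumes only what the earlier examples show: that the search callable accepts query and k and returns (document, score) pairs.

```python
from typing import Any, Callable


def search_above_score(
    search: Callable[..., list[tuple[Any, float]]],
    query: str,
    k: int = 10,
    min_score: float = 0.75,
) -> list[Any]:
    """Run a scored search and keep documents whose score exceeds min_score.

    search is expected to behave like similarity_search_with_score:
    it takes query and k and returns (document, score) tuples.
    """
    results = search(query=query, k=k)
    return [doc for doc, score in results if score > min_score]


# Intended use with the vector store from this tutorial (not executed here):
#   docs = search_above_score(vector_store.similarity_search_with_score,
#                             "MongoDB acquisition", k=10, min_score=0.75)
```

To plug this into a chain, the function could be wrapped in a runnable or a custom retriever. That wiring is left out here because it depends on the LangChain version in use.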

Watch this video tutorial to learn more about semantic search and retrieval-augmented generation (RAG) with LangChain and MongoDB.

Duration: 8 minutes
