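Note
This tutorial uses LangChain's Python library. For a tutorial that uses the JavaScript library, see Get Started with the LangChain JavaScript/TypeScript Integration.
You can integrate MongoDB Vector Search with LangChain to build LLM applications and implement retrieval-augmented generation (RAG). This tutorial demonstrates how to start using MongoDB Vector Search with LangChain to perform semantic search on your data and build a RAG implementation. Specifically, you perform the following actions:
Set up the environment.
Store custom data in MongoDB.
Create a MongoDB Vector Search index on your data.
Run the following vector search queries:
Semantic search.
Semantic search with score.
Semantic search with metadata pre-filtering.
Implement RAG by using MongoDB Vector Search to answer questions on your data.
Work with a runnable version of this tutorial as a Python notebook.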
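Background
LangChain is an open-source framework that simplifies the creation of LLM applications through the use of "chains." Chains are LangChain-specific components that can be combined for a variety of AI use cases, including RAG.
By integrating MongoDB Vector Search with LangChain, you can use MongoDB as a vector database and use MongoDB Vector Search to implement RAG by retrieving semantically similar documents from your data. To learn more about RAG, see Retrieval-Augmented Generation (RAG) with MongoDB.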
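Prerequisites
To complete this tutorial, you must have the following:
One of the following MongoDB cluster types:
An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
A local Atlas deployment created using the Atlas CLI. To learn more, see Create a Local Atlas Deployment.
A MongoDB Community or Enterprise cluster with Search and Vector Search installed.
A Voyage AI API key. To create an account and API key, see the Voyage AI website.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An environment to run interactive Python notebooks, such as Colab.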
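Set Up the Environment
Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you use it to run the code in this tutorial.
To set up your notebook environment:
Install and import dependencies.
Run the following command: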
pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-voyageai langchain-openai langchain-text-splitters pymongo pypdf
Then, run the following code to import the required packages:
import os, pymongo, pprint
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel
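Define environment variables.
Run the following code, replacing the placeholders with the following values:
Your Voyage AI and OpenAI API keys.
The connection string for your MongoDB cluster.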
os.environ["VOYAGE_API_KEY"] = "<voyage-api-key>"
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"
MONGODB_URI = "<connection-string>"
Note
Replace <connection-string> with the connection string for your Atlas cluster or local deployment.
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
To learn more, see Connect to a Cluster via Drivers.
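As a rough illustration of the connection string format, the sketch below assembles a mongodb+srv URI and percent-encodes the credentials with the standard library. The helper name build_srv_uri and the sample user, password, and host are hypothetical and are not part of this tutorial; the encoding step matters only if your username or password contains reserved characters such as @, :, or /.

```python
from urllib.parse import quote_plus


def build_srv_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv:// URI, percent-encoding the credentials."""
    # Reserved characters such as '@', ':' and '/' in credentials must be
    # escaped so the driver can tell where the password ends.
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical values for illustration only.
example_uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(example_uri)
```

You could then assign the result to MONGODB_URI instead of pasting a hand-edited string.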
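Use MongoDB as a Vector Store
Next, load custom data into MongoDB and instantiate your MongoDB cluster as a vector database, also called a vector store. Copy and paste the following code snippets into your notebook.
Load the sample data.
For this tutorial, you use a publicly accessible PDF document about a recent MongoDB earnings report as the data source for your vector store.
To load the sample data, run the following code snippet. It does the following:
Retrieves the PDF from the specified URL and loads the raw text data.
Uses a text splitter to split the data into smaller documents.
Specifies chunk parameters, which determine the number of characters in each document and the number of characters that should overlap between two consecutive documents.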
# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/13176/pdf")
data = loader.load()

# Split PDF into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents(data)

# Print the first document
docs[0]
Document(metadata={'producer': 'West Corporation using ABCpdf', 'creator': 'PyPDF', 'creationdate': '2025-03-05T21:06:26+00:00', 'title': 'MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results', 'source': 'https://investors.mongodb.com/node/13176/pdf', 'total_pages': 9, 'page': 0, 'page_label': '1'}, page_content='MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results\nMarch 5, 2025\nFourth Quarter Fiscal 2025 Total Revenue of $548.4 million, up 20% Year-over-Year')
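To make the two chunk parameters concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also looks for separators such as paragraph and line breaks); the function chunk_text is a hypothetical stand-in that only shows how chunk_size and chunk_overlap interact.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 500-character sample: digits 0-9 repeated.
sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [200, 200, 140]
```

With a chunk size of 200 and an overlap of 20, each window starts 180 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next.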
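Instantiate the vector store.
Run the following code to create a vector store instance named vector_store from the sample documents. This snippet specifies the following:
The connection string to your MongoDB cluster.
langchain_db.test as the MongoDB namespace to store the documents.
The voyage-3-large embedding model from Voyage AI to convert text into vector embeddings for the embedding field.
vector_index as the index to use for querying the vector store.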
# Instantiate the vector store using your MongoDB connection string
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string = MONGODB_URI,
    namespace = "langchain_db.test",
    embedding = VoyageAIEmbeddings(model="voyage-3-large"),
    index_name = "vector_index"
)

# Add documents to the vector store
vector_store.add_documents(documents=docs)
After running the sample code, if you're using Atlas, you can verify your vector embeddings by navigating to the langchain_db.test namespace in the Atlas UI.
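If you prefer to check from the notebook rather than the UI, a minimal sketch along these lines may help. It assumes you pass in a PyMongo collection object for langchain_db.test and that embeddings are stored in a field named embedding, as in this tutorial; the helper name count_embedded_documents is hypothetical.

```python
def count_embedded_documents(collection, field: str = "embedding") -> int:
    """Count documents in the collection that carry the given embedding field."""
    # count_documents is a standard PyMongo Collection method; $exists matches
    # documents where the field is present.
    return collection.count_documents({field: {"$exists": True}})


# Possible usage in the notebook, assuming MONGODB_URI is defined as above:
# collection = MongoClient(MONGODB_URI)["langchain_db"]["test"]
# print(count_embedded_documents(collection))
```

A count equal to the number of chunks in docs suggests that every document was stored with an embedding.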
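Create the MongoDB Vector Search Index
To enable vector search queries on your vector store, create a MongoDB Vector Search index on the langchain_db.test collection by using either the LangChain helper method or the PyMongo driver method.
In your notebook, run the following code for your preferred method. The index definition specifies indexing the following fields:
embedding field as the vector type. The embedding field contains the embeddings created using Voyage AI's voyage-3-large embedding model. The index definition specifies 1024 vector dimensions and measures similarity using cosine.
page_label field as the filter type for pre-filtering data by the page number in the PDF.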
# Use helper method to create the vector search index
vector_store.create_vector_search_index(
    dimensions = 1024, # The number of vector dimensions to index
    filters = [ "page_label" ]
)
# Connect to your cluster
client = MongoClient(MONGODB_URI)
collection = client["langchain_db"]["test"]

# Create your vector search index model, then create the index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1024,
                "similarity": "cosine"
            },
            {
                "type": "filter",
                "path": "page_label"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=search_index_model)
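The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.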
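Rather than waiting a fixed amount of time, you could poll until the index is ready. The sketch below is one possible approach, assuming a PyMongo collection object and assuming that each index description returned by list_search_indexes includes a boolean queryable field; the helper name wait_until_queryable and its timeout values are hypothetical.

```python
import time


def wait_until_queryable(collection, index_name: str = "vector_index",
                         timeout_s: float = 120.0, poll_s: float = 5.0) -> bool:
    """Poll the search index until it reports queryable, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        # list_search_indexes returns a cursor of index descriptions.
        indexes = list(collection.list_search_indexes(index_name))
        # Assumption: the description carries a boolean "queryable" field.
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Possible usage in the notebook:
# collection = MongoClient(MONGODB_URI)["langchain_db"]["test"]
# print(wait_until_queryable(collection))
```

The function returns False instead of raising when the timeout expires, so the caller decides whether to retry or stop.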
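Run Vector Search Queries
Once MongoDB builds your index, you can run vector search queries on your data. The following examples demonstrate various queries that you can run on your vectorized data.
The following query uses the similarity_search method to perform a basic semantic search for the string MongoDB acquisition. It returns a list of documents ranked by relevance.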
query = "MongoDB acquisition"
results = vector_store.similarity_search(query)

pprint.pprint(results)
[Document(id='67f0259b8bb2babc06924409', metadata={ ... }, page_content='SOURCE MongoDB, Inc.'), Document(id='67f0259b8bb2babc0692432f', metadata={ ... }, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'), Document(id='67f0259b8bb2babc06924355', metadata={ ... }, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), Document(id='67f0259b8bb2babc069243a6', metadata={ ... }, page_content="MongoDB's unified, intelligent data platform was built to power the next generation of applications, and MongoDB is the most widely available, globally")]
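The following query uses the similarity_search_with_score method to perform a semantic search for the string MongoDB acquisition and specifies the k parameter to limit the number of documents to return to 3.
Note
The k parameter in this example refers to the similarity_search_with_score method option, not the knnBeta operator option of the same name.
It returns the three most relevant documents and a relevance score between 0 and 1.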
query = "MongoDB acquisition"
results = vector_store.similarity_search_with_score(
    query = query, k = 3
)

pprint.pprint(results)
[(Document(id='67f0259b8bb2babc06924409', metadata={ ... }, page_content='SOURCE MongoDB, Inc.'), 0.8193451166152954), (Document(id='67f0259b8bb2babc0692432f', metadata={ ... }, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'), 0.7815237045288086), (Document(id='67f0259b8bb2babc06924355', metadata={ ... }, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), 0.7788857221603394)]
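To see why the scores fall between 0 and 1 even though raw cosine similarity ranges from -1 to 1, consider the sketch below. It assumes the normalization score = (1 + cosine) / 2 for cosine indexes, as described in the MongoDB Vector Search scoring documentation; the function names are hypothetical and the vectors are toy two-dimensional values rather than real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Raw cosine similarity between two non-zero vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a: list[float], b: list[float]) -> float:
    """Map cosine similarity into [0, 1]; assumed normalization, see lead-in."""
    return (1 + cosine_similarity(a, b)) / 2


print(normalized_score([1.0, 0.0], [1.0, 0.0]))   # identical direction: 1.0
print(normalized_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.5
print(normalized_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: 0.0
```

Under this assumption, a score near 0.8, like those in the output above, corresponds to a raw cosine similarity of roughly 0.6.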
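You can pre-filter your data by using an MQL match expression that compares the indexed field with another value in your collection. You must index any metadata fields that you want to filter by as the filter type. To learn more, see How to Index Fields for Vector Search.
Note
You specified the page_label field as a filter when you created the index for this tutorial.
The following query uses the similarity_search_with_score method to perform a semantic search for the string MongoDB acquisition. It also specifies the following:
The k parameter to limit the number of documents to return to 3.
A pre-filter on the page_label field that uses the $eq operator to match only documents appearing on page 2.
It returns the three most relevant documents from page 2 and a relevance score between 0 and 1.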
query = "MongoDB acquisition"
results = vector_store.similarity_search_with_score(
    query = query,
    k = 3,
    pre_filter = { "page_label": { "$eq": 2 } }
)

pprint.pprint(results)
[(Document(id='67f0259b8bb2babc06924355', metadata={ ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), 0.7788857221603394), (Document(id='67f0259b8bb2babc06924351', metadata={ ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'), 0.7606035470962524), (Document(id='67f0259b8bb2babc06924354', metadata={ ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'), 0.7583936452865601)]
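Conceptually, a pre-filter narrows the candidate set before ranking, so the k results all satisfy the filter. The in-memory sketch below illustrates that ordering only; it is not how MongoDB executes the query, and the function prefilter_then_rank, the record layout, and the scores are all hypothetical.

```python
def prefilter_then_rank(records: list[dict], field: str, value, k: int) -> list[dict]:
    """Keep records whose metadata field equals value, then return the k best by score."""
    # Step 1: the equality filter, standing in for an $eq pre-filter.
    candidates = [r for r in records if r["metadata"].get(field) == value]
    # Step 2: rank only the surviving candidates, highest score first.
    ranked = sorted(candidates, key=lambda r: r["score"], reverse=True)
    return ranked[:k]


# Hypothetical records with precomputed relevance scores.
records = [
    {"text": "a", "metadata": {"page_label": 1}, "score": 0.82},
    {"text": "b", "metadata": {"page_label": 2}, "score": 0.78},
    {"text": "c", "metadata": {"page_label": 2}, "score": 0.76},
    {"text": "d", "metadata": {"page_label": 2}, "score": 0.70},
]
top = prefilter_then_rank(records, "page_label", 2, k=2)
print([r["text"] for r in top])  # ['b', 'c']
```

Record "a" has the highest score but is excluded because it fails the filter, which mirrors how the filtered query above returns only page 2 documents.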
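Tip
For a full list of semantic search methods, see the API reference.
Answer Questions on Your Data
This section demonstrates how to implement RAG in your application with MongoDB Vector Search and LangChain. Now that you've used MongoDB Vector Search to retrieve semantically similar documents, run the following code examples to prompt the LLM to answer questions based on those documents.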
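This example does the following:
Instantiates MongoDB Vector Search as a retriever to query for similar documents, including the optional k parameter to search for only the 10 most relevant documents.
Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {question} variable.
Constructs a chain that specifies the following:
MongoDB Vector Search as the retriever to search for documents to use as context.
The prompt template that you defined.
OpenAI's gpt-4o chat model to generate a context-aware response.
Invokes the chain with a sample query.
Returns the LLM's response and the documents used as context. The generated response might vary.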
# Instantiate MongoDB Vector Search as a retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = { "k": 10 }
)

# Define a prompt template
template = """
    Use the following pieces of context to answer the question at the end.
    {context}
    Question: {question}
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o")

# Construct a chain to answer questions on your data
chain = (
    { "context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Prompt the chain
question = "What was MongoDB's latest acquisition?"
answer = chain.invoke(question)
print("Question: " + question)
print("Answer: " + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
Question: What was MongoDB's latest acquisition? Answer: MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models. Source documents: [Document(id='67f0259b8bb2babc06924409', metadata={'_id': '67f0259b8bb2babc06924409', ... 'page_label': '9'}, page_content='SOURCE MongoDB, Inc.'), Document(id='67f0259b8bb2babc06924351', metadata={'_id': '67f0259b8bb2babc06924351', ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'), Document(id='67f0259b8bb2babc0692432f', metadata={'_id': '67f0259b8bb2babc0692432f', ... 'page_label': '1'}, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'), Document(id='67f0259b8bb2babc06924355', metadata={'_id': '67f0259b8bb2babc06924355', ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), Document(id='67f0259b8bb2babc069243a6', metadata={'_id': '67f0259b8bb2babc069243a6', ... 'page_label': '4'}, page_content="MongoDB's unified, intelligent data platform was built to power the next generation of applications, and MongoDB is the most widely available, globally"), Document(id='67f0259b8bb2babc06924329', metadata={'_id': '67f0259b8bb2babc06924329', ... 'page_label': '1'}, page_content='MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results\nMarch 5, 2025\nFourth Quarter Fiscal 2025 Total Revenue of $548.4 million, up 20% Year-over-Year'), Document(id='67f0259b8bb2babc069243a7', metadata={'_id': '67f0259b8bb2babc069243a7', ... 'page_label': '4'}, page_content='distributed database on the market. 
With integrated capabilities for operational data, search, real-time analytics, and AI-powered retrieval, MongoDB'), Document(id='67f0259b8bb2babc069243a5', metadata={'_id': '67f0259b8bb2babc069243a5', ... 'page_label': '4'}, page_content="Headquartered in New York, MongoDB's mission is to empower innovators to create, transform, and disrupt industries with software and data."), Document(id='67f0259b8bb2babc06924354', metadata={'_id': '67f0259b8bb2babc06924354', ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'), Document(id='67f0259b8bb2babc069243a9', metadata={'_id': '67f0259b8bb2babc069243a9', ... 'page_label': '4'}, page_content='50,000 customers across almost every industry—including 70% of the Fortune 100—rely on MongoDB for their most important applications. To learn\nmore, visit mongodb.com .\nInvestor Relations')]
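In the chain above, the retriever's output (a list of Document objects) is passed to the {context} variable as-is, so the prompt receives the documents' string representation, metadata included. If you would rather pass only the text, one common option is a small formatting helper like the sketch below. The Doc dataclass is a hypothetical stand-in for LangChain's Document so that the example runs without LangChain installed, and format_docs is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


sample_docs = [
    Doc("MongoDB acquired Voyage AI.", {"page_label": "2"}),
    Doc("SOURCE MongoDB, Inc.", {"page_label": "9"}),
]
print(format_docs(sample_docs))
```

In a LangChain chain, such a function can typically be piped after the retriever, for example "context": retriever | format_docs, which keeps IDs and metadata out of the prompt.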
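This example does the following:
Instantiates MongoDB Vector Search as a retriever to query for similar documents, including the following optional parameters:
k to search for only the 10 most relevant documents.
score_threshold to use only documents with a relevance score above 0.75.
Note
This parameter refers to a relevance score that LangChain uses to normalize your results, and not the relevance score used in MongoDB Search queries. To use MongoDB Search scores in your RAG implementation, define a custom retriever that uses the similarity_search_with_score method and filters by the MongoDB Search score.
pre_filter to filter on the page_label field for documents that appear on page 2 only.
Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {question} variable.
Constructs a chain that specifies the following:
MongoDB Vector Search as the retriever to search for documents to use as context.
The prompt template that you defined.
OpenAI's gpt-4o chat model to generate a context-aware response.
Invokes the chain with a sample query.
Returns the LLM's response and the documents used as context. The generated response might vary.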
# Instantiate MongoDB Vector Search as a retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = {
        "k": 10,
        "score_threshold": 0.75,
        "pre_filter": { "page_label": { "$eq": 2 } }
    }
)

# Define a prompt template
template = """
    Use the following pieces of context to answer the question at the end.
    {context}
    Question: {question}
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o")

# Construct a chain to answer questions on your data
chain = (
    { "context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Prompt the chain (invoke the chain defined above)
question = "What was MongoDB's latest acquisition?"
answer = chain.invoke(question)
print("Question: " + question)
print("Answer: " + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
Question: What was MongoDB's latest acquisition? Answer: MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models. Source documents: [Document(id='67f0259b8bb2babc06924351', metadata={'_id': '67f0259b8bb2babc06924351', ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'), Document(id='67f0259b8bb2babc06924355', metadata={'_id': '67f0259b8bb2babc06924355', ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), Document(id='67f0259b8bb2babc06924354', metadata={'_id': '67f0259b8bb2babc06924354', ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'), Document(id='67f0259b8bb2babc06924358', metadata={'_id': '67f0259b8bb2babc06924358', ... 'page_label': '2'}, page_content='Lombard Odier, a Swiss private bank, partnered with MongoDB to migrate and modernize its legacy banking technology'), Document(id='67f0259b8bb2babc06924352', metadata={'_id': '67f0259b8bb2babc06924352', ... 'page_label': '2'}, page_content="AI applications. Integrating Voyage AI's technology with MongoDB will enable organizations to easily build trustworthy,"), Document(id='67f0259b8bb2babc0692435a', metadata={'_id': '67f0259b8bb2babc0692435a', ... 'page_label': '2'}, page_content='applications from a legacy relational database to MongoDB 20 times faster than previous migrations.\nFirst Quarter and Full Year Fiscal 2026 Guidance'), Document(id='67f0259b8bb2babc06924356', metadata={'_id': '67f0259b8bb2babc06924356', ... 
'page_label': '2'}, page_content='For the third consecutive year, MongoDB was named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud'), Document(id='67f0259b8bb2babc0692434d', metadata={'_id': '67f0259b8bb2babc0692434d', ... 'page_label': '2'}, page_content='compared to $121.5 million of cash from operations in the year-ago period. MongoDB used $29.6 million of cash in capital'), Document(id='67f0259b8bb2babc0692434c', metadata={'_id': '67f0259b8bb2babc0692434c', ... 'page_label': '2'}, page_content='Cash Flow: During the year ended January 31, 2025, MongoDB generated $150.2 million of cash from operations,'), Document(id='67f0259b8bb2babc06924364', metadata={'_id': '67f0259b8bb2babc06924364', ... 'page_label': '2'}, page_content='MongoDB will host a conference call today, March 5, 2025, at 5:00 p.m. (Eastern Time) to discuss its financial results and business outlook. A live')]
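The note on score_threshold suggests a custom retriever when you want to filter by the MongoDB search score itself. A minimal sketch of that idea follows. It assumes a vector store object exposing similarity_search_with_score(query=..., k=...) that returns (document, score) pairs, as used earlier in this tutorial; the function name search_above_score and the default threshold are hypothetical.

```python
def search_above_score(vector_store, query: str, k: int = 10, min_score: float = 0.75) -> list:
    """Return up to k documents whose search score is above min_score."""
    # Each result is a (document, score) pair.
    results = vector_store.similarity_search_with_score(query=query, k=k)
    return [doc for doc, score in results if score > min_score]


# Possible usage in the notebook, with vector_store defined as above:
# docs = search_above_score(vector_store, "MongoDB acquisition", k=10, min_score=0.75)
```

To plug this into a chain, you could wrap it in a function that takes only the question and pipe that function in place of the retriever.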
Learn by Watching
Watch this video tutorial to learn more about semantic search and retrieval-augmented generation (RAG) with LangChain and MongoDB.
Duration: 8 minutes