
Perform Self-Querying Retrieval with MongoDB and LangChain

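You can integrate MongoDB Vector Search with LangChain to perform self-querying retrieval. This tutorial demonstrates how to use a self-querying retriever to run natural language MongoDB Vector Search queries with metadata filtering.

Self-querying retrieval uses an LLM to process your search query and identify possible metadata filters, forms a structured vector search query with those filters, and then runs the query to retrieve the most relevant documents.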

Example

With a query like "What are thriller movies from after 2010 with ratings above 8?", the retriever can identify filters on the genre, year, and rating fields, and use those filters to retrieve documents that match the query.
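
To make the idea concrete, the sketch below shows one way to picture the structured query that a self-querying retriever might derive from a question about thriller movies released after 2010 and rated above 8. The `Comparison` and `StructuredQuery` dataclasses and the `to_mql_filter` helper are hypothetical names used only for illustration. They are not the classes LangChain uses internally, and the exact filter an LLM produces can vary.

```python
from dataclasses import dataclass, field


@dataclass
class Comparison:
    """One metadata condition, for example year > 2010."""
    field_name: str
    operator: str  # an MQL comparison operator such as "$eq" or "$gt"
    value: object


@dataclass
class StructuredQuery:
    """Semantic search text plus the metadata conditions to apply."""
    query: str
    comparisons: list = field(default_factory=list)


def to_mql_filter(structured: StructuredQuery) -> dict:
    """Combine all conditions into a single MQL-style $and filter."""
    clauses = [
        {c.field_name: {c.operator: c.value}} for c in structured.comparisons
    ]
    return {"$and": clauses} if clauses else {}


# Hypothetical decomposition of the example question
example = StructuredQuery(
    query="thriller movies",
    comparisons=[
        Comparison("genre", "$eq", "thriller"),
        Comparison("year", "$gt", 2010),
        Comparison("rating", "$gt", 8),
    ],
)
mql_filter = to_mql_filter(example)
print(mql_filter)
```

The semantic part of the question stays in `query`, while everything that can be expressed as a condition on a metadata field moves into the filter.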

Work with a runnable version of this tutorial as a Python notebook.

Prerequisites

To complete this tutorial, you must have the following:

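One of the following MongoDB cluster types:
- An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
- A local Atlas deployment created using the Atlas CLI. To learn more, see Create a Local Atlas Deployment.
- A MongoDB Community or Enterprise cluster with Search and Vector Search installed.
A Voyage AI API key. To learn more, see the Voyage AI documentation.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.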

Use MongoDB as a Vector Store

In this section, you create a vector store instance that uses your MongoDB cluster as the vector database.

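Set up the environment.

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.

To set up your notebook environment:

Run the following command in your notebook: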
```
pip install --quiet --upgrade langchain-mongodb langchain-voyageai langchain-openai langchain langchain-core lark
```
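Set environment variables.
Run the following code to set the environment variables for this tutorial. Provide your Voyage AI API key, your OpenAI API key, and the SRV connection string for your MongoDB cluster.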
```
import os
os.environ["OPENAI_API_KEY"] = "<openai-key>"
os.environ["VOYAGE_API_KEY"] = "<voyage-key>"
MONGODB_URI = "<connection-string>"
```
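Note
Replace <connection-string> with the connection string for your Atlas cluster or local deployment.
For an Atlas cluster, the connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
To learn more, see Connect to a Cluster via Drivers.
For a local deployment, the connection string should use the following format:
mongodb://localhost:<port-number>/?directConnection=true
To learn more, see Connection Strings.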
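
A quick sanity check on the connection string can catch an unreplaced placeholder or a mistyped scheme before any network call is made. The following is a minimal sketch that uses only the Python standard library. The `connection_kind` helper is a hypothetical name, and the host names, credentials, and port in the usage lines are made-up example values.

```python
from urllib.parse import urlsplit


def connection_kind(uri: str) -> str:
    """Classify a MongoDB connection string by its scheme."""
    # Angle brackets usually mean a template placeholder was left in place
    if "<" in uri or ">" in uri:
        raise ValueError("Replace the <...> placeholders with real values")
    scheme = urlsplit(uri).scheme
    if scheme == "mongodb+srv":
        return "srv"  # SRV format, as used for Atlas clusters
    if scheme == "mongodb":
        return "standard"  # standard format, as used for local deployments
    raise ValueError(f"Unexpected connection string scheme: {scheme!r}")


# Example values only; substitute your own connection string
print(connection_kind("mongodb+srv://user:secret@cluster0.abcde.mongodb.net"))
print(connection_kind("mongodb://localhost:27017/?directConnection=true"))
```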

Instantiate the vector store.

Run the following code in your notebook to create a vector store instance named vector_store that uses the langchain_db.self_query namespace in MongoDB:

from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings
# Use the voyage-3-large embedding model
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")
# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
   connection_string = MONGODB_URI,
   embedding = embedding_model,
   namespace = "langchain_db.self_query",
   text_key = "page_content"
)

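Add data to the vector store.

Paste and run the following code in your notebook to ingest some sample documents with metadata into your collection in MongoDB.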

from langchain_core.documents import Document
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="A fight club that is not a fight club, but is a fight club",
        metadata={"year": 1994, "rating": 8.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "rating": 8.3, "genre": "drama"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "rating": 9.9, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "genre": "thriller", "rating": 9.0},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 9.3},
    ),
    Document(
        page_content="The toys come together to save their friend from a kid who doesn't know how to play with them",
        metadata={"year": 1997, "genre": "animated", "rating": 9.1},
    ),
]
# Add data to the vector store, which automatically embeds the documents
vector_store.add_documents(docs)

If you're using Atlas, you can verify your vector embeddings by navigating to the langchain_db.self_query namespace in the Atlas UI.
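
Because the index and retriever steps in this tutorial rely on the genre, rating, and year metadata fields, it can help to confirm that every document carries them with consistent types before you load data. The following is a minimal sketch that uses plain dictionaries in place of Document objects. The `EXPECTED_TYPES` mapping and the `find_metadata_problems` helper are hypothetical names, not part of LangChain or MongoDB, and the third sample entry is deliberately malformed to show what the check reports.

```python
# Expected Python type for each metadata field used for filtering
EXPECTED_TYPES = {"genre": str, "year": int, "rating": float}


def find_metadata_problems(metadata_list):
    """Return readable messages for missing or mistyped metadata fields."""
    problems = []
    for index, metadata in enumerate(metadata_list):
        for name, expected in EXPECTED_TYPES.items():
            if name not in metadata:
                problems.append(f"document {index}: missing '{name}'")
            elif not isinstance(metadata[name], expected):
                problems.append(
                    f"document {index}: '{name}' should be {expected.__name__}"
                )
    return problems


sample_metadata = [
    {"year": 1993, "rating": 7.7, "genre": "action"},
    {"year": 2010, "genre": "thriller", "rating": 8.2},
    {"year": "1995", "genre": "animated"},  # deliberately malformed
]
problems = find_metadata_problems(sample_metadata)
print(problems)
```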

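Create the MongoDB Vector Search index with filters.

Run the following code to create the MongoDB Vector Search index with filters for the vector store, which enables vector search and metadata filtering on your data: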

# Use LangChain helper method to create the vector search index
vector_store.create_vector_search_index(
   dimensions = 1024, # The dimensions of the vector embeddings to be indexed
   filters = [ "genre", "rating", "year" ], # The metadata fields to be indexed for filtering
   wait_until_complete = 60 # Number of seconds to wait for the index to build (can take around a minute)
)

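Tip

create_vector_search_index API reference

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.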
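
For reference, a vector search index with filters corresponds to an index definition that lists one vector field plus one filter field per metadata path. The sketch below builds such a definition as a plain dictionary. It assumes the embeddings are stored in a field named embedding and compared with cosine similarity; both are assumptions about the defaults in use, so verify them against your own configuration. The `build_index_definition` helper is a hypothetical name and is not part of langchain-mongodb.

```python
import json


def build_index_definition(dimensions, filters, path="embedding",
                           similarity="cosine"):
    """Build a vector search index definition with filter fields."""
    # The single vector field that holds the embeddings (assumed path)
    fields = [{
        "type": "vector",
        "path": path,
        "numDimensions": dimensions,
        "similarity": similarity,
    }]
    # One filter field per metadata path used for pre-filtering
    fields.extend({"type": "filter", "path": name} for name in filters)
    return {"fields": fields}


definition = build_index_definition(1024, ["genre", "rating", "year"])
print(json.dumps(definition, indent=2))
```

Any metadata field that the retriever may filter on needs its own filter entry, which is why genre, rating, and year all appear in the filters list.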

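Create the Self-Querying Retriever

In this section, you initialize the self-querying retriever to query data from your vector store.

Describe the documents and metadata fields.

To use the self-querying retriever, you must describe the documents in your collection and the metadata fields that you want to filter on. This information helps the LLM understand the structure of your data and how to filter results based on user queries.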

from langchain.chains.query_constructor.schema import AttributeInfo
# Define the document content description 
document_content_description = "Brief summary of a movie"
# Define the metadata fields to filter on
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="rating", 
        description="A 1-10 rating for the movie", 
        type="float"
    ),
]
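
These descriptions matter because the LLM builds filters from them rather than from the documents themselves. As a rough illustration, the sketch below renders field descriptions into a JSON summary of the kind that could be placed in a prompt. The `FieldInfo` dataclass is a hypothetical stand-in for AttributeInfo, and the output format is an assumption, not the exact prompt that LangChain constructs.

```python
import json
from dataclasses import dataclass


@dataclass
class FieldInfo:
    """Hypothetical stand-in for a metadata field description."""
    name: str
    description: str
    type: str


def render_schema(content_description, fields):
    """Render the content and field descriptions as a JSON string."""
    attributes = {
        f.name: {"description": f.description, "type": f.type} for f in fields
    }
    return json.dumps(
        {"content": content_description, "attributes": attributes}, indent=2
    )


schema_text = render_schema(
    "Brief summary of a movie",
    [
        FieldInfo("genre", "The genre of the movie", "string"),
        FieldInfo("year", "The year the movie was released", "integer"),
        FieldInfo("rating", "A 1-10 rating for the movie", "float"),
    ],
)
print(schema_text)
```

Precise descriptions and types give the model a better chance of choosing the right field and comparison for a given question.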

Initialize the self-querying retriever.

Run the following code to create a self-querying retriever by using the MongoDBAtlasSelfQueryRetriever.from_llm method.

from langchain_mongodb.retrievers import MongoDBAtlasSelfQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
retriever = MongoDBAtlasSelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    metadata_field_info=metadata_field_info,
    document_contents=document_content_description
)

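Tip

MongoDBAtlasSelfQueryRetriever API reference

Run Queries with the Self-Querying Retriever

Run the following queries to see how the self-querying retriever handles different types of queries: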

# This example specifies a filter (rating > 9)
retriever.invoke("What are some highly rated movies (above 9)?")

[Document(id='686e84de13668e4048bf9ff3', metadata={'_id': '686e84de13668e4048bf9ff3', 'year': 1979, 'rating': 9.9, 'genre': 'science fiction'}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
 Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them")]

# This example specifies a semantic search and a filter (rating > 9)
retriever.invoke("I want to watch a movie about toys rated higher than 9")

[Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them"),
 Document(id='686e84de13668e4048bf9ff3', metadata={'_id': '686e84de13668e4048bf9ff3', 'year': 1979, 'rating': 9.9, 'genre': 'science fiction'}, page_content='Three men walk into the Zone, three men walk out of the Zone')]

# This example specifies a composite filter (rating >= 9 and genre = thriller)
retriever.invoke("What's a highly rated (above or equal 9) thriller film?")

[Document(id='686e84de13668e4048bf9ff4', metadata={'_id': '686e84de13668e4048bf9ff4', 'year': 2006, 'genre': 'thriller', 'rating': 9.0}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')]

# This example specifies a query and composite filter (year > 1990 and year < 2005 and genre = action)
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about dinosaurs, " +
    "and preferably has the action genre"
)

[Document(id='686e84de13668e4048bf9fef', metadata={'_id': '686e84de13668e4048bf9fef', 'year': 1993, 'rating': 7.7, 'genre': 'action'}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='686e84de13668e4048bf9ff0', metadata={'_id': '686e84de13668e4048bf9ff0', 'year': 1994, 'rating': 8.7, 'genre': 'action'}, page_content='A fight club that is not a fight club, but is a fight club')]

# This example only specifies a semantic search query
retriever.invoke("What are some movies about dinosaurs")

[Document(id='686e84de13668e4048bf9fef', metadata={'_id': '686e84de13668e4048bf9fef', 'year': 1993, 'rating': 7.7, 'genre': 'action'}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff1', metadata={'_id': '686e84de13668e4048bf9ff1', 'year': 2010, 'genre': 'thriller', 'rating': 8.2}, page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them")]
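
To see why each filtered query returns the documents it does, you can apply the same conditions to the sample metadata by hand. The sketch below is a small pure-Python evaluator for a subset of comparison operators. It only mimics the filtering step and performs no vector search or ranking, and the `matches` helper is a hypothetical name. Documents are identified by release year because the sample data has no titles.

```python
import operator

# Subset of MQL-style comparison operators supported by this sketch
OPERATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, mql_filter):
    """Return True if the metadata satisfies an MQL-style filter."""
    for key, condition in mql_filter.items():
        if key == "$and":
            # Every nested clause must match
            if not all(matches(metadata, clause) for clause in condition):
                return False
        else:
            for op_name, expected in condition.items():
                if not OPERATORS[op_name](metadata[key], expected):
                    return False
    return True


# Metadata of the eight sample documents from this tutorial
MOVIES = [
    {"year": 1993, "rating": 7.7, "genre": "action"},
    {"year": 1994, "rating": 8.7, "genre": "action"},
    {"year": 2010, "rating": 8.2, "genre": "thriller"},
    {"year": 2019, "rating": 8.3, "genre": "drama"},
    {"year": 1979, "rating": 9.9, "genre": "science fiction"},
    {"year": 2006, "rating": 9.0, "genre": "thriller"},
    {"year": 1995, "rating": 9.3, "genre": "animated"},
    {"year": 1997, "rating": 9.1, "genre": "animated"},
]


def years_matching(mql_filter):
    """Return the release years of all sample movies that pass the filter."""
    return [m["year"] for m in MOVIES if matches(m, mql_filter)]


highly_rated = years_matching({"rating": {"$gt": 9}})
top_thrillers = years_matching(
    {"$and": [{"rating": {"$gte": 9}}, {"genre": {"$eq": "thriller"}}]}
)
nineties_action = years_matching(
    {"$and": [
        {"year": {"$gt": 1990}},
        {"year": {"$lt": 2005}},
        {"genre": {"$eq": "action"}},
    ]}
)
print(highly_rated)     # [1979, 1995, 1997]
print(top_thrillers)    # [2006]
print(nineties_action)  # [1993, 1994]
```

The filter decides which documents are eligible; among those, the order of results comes from vector similarity to the semantic part of the question, which this sketch does not model.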

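Use the Retriever in Your RAG Pipeline

You can use the self-querying retriever in your RAG pipeline. Paste and run the following code in your notebook to implement a sample RAG pipeline that performs self-querying retrieval.

This code also configures the retriever with the enable_limit parameter, which allows the LLM to limit the number of documents that the retriever returns when necessary. The generated response might vary.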

import pprint
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
# Configure self-query retriever with a document limit
retriever = MongoDBAtlasSelfQueryRetriever.from_llm(
    llm,
    vector_store,
    document_content_description,
    metadata_field_info,
    enable_limit=True
)
# Define a prompt template
template = """
   Use the following pieces of context to answer the question at the end.
   {context}
   Question: {question}
"""
prompt = PromptTemplate.from_template(template)
# Construct a chain to answer questions on your data
chain = (
   { "context": retriever, "question": RunnablePassthrough()}
   | prompt   
   | llm
   | StrOutputParser()
)
# Prompt the chain
question = "What are two movies about toys after 1990?" # year > 1990 and document limit of 2
answer = chain.invoke(question)
print("Question: " + question)
print("Answer: " + answer)
# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)

Question: What are two movies about toys after 1990?
Answer: The two movies about toys after 1990 are:
1. The 1995 animated movie (rated 9.3) where toys come alive and have fun.
2. The 1997 animated movie (rated 9.1) where toys work together to save their friend from a kid who doesn’t know how to play with them.
Source documents:
[Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them")]
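
The chain passes the same question to two branches: the retriever, which produces the context, and RunnablePassthrough, which forwards the question unchanged. The following sketch imitates that data flow with plain Python functions so you can trace each step without calling an LLM or a database. The `fake_retriever`, `build_prompt`, `fake_llm`, and `run_chain` functions are hypothetical stand-ins, and the keyword match and limit of two documents are hard-coded here rather than inferred from the question.

```python
def fake_retriever(question, limit=2):
    """Stand-in retriever: keyword match on 'toys', capped at `limit`.

    A real self-querying retriever would derive the search terms,
    filters, and limit from the question itself.
    """
    corpus = [
        "Toys come alive and have a blast doing so",
        "The toys come together to save their friend from a kid "
        "who doesn't know how to play with them",
        "Three men walk into the Zone, three men walk out of the Zone",
    ]
    hits = [text for text in corpus if "toys" in text.lower()]
    return hits[:limit]


def build_prompt(inputs):
    """Fill the prompt template with the context and the question."""
    context = "\n".join(inputs["context"])
    return (
        "Use the following pieces of context to answer the question at the end.\n"
        f"{context}\n"
        f"Question: {inputs['question']}"
    )


def fake_llm(prompt):
    """Stand-in model: reports how many context lines it received."""
    context_lines = prompt.splitlines()[1:-1]
    return f"Found {len(context_lines)} relevant documents."


def run_chain(question):
    # Step 1: fan the question out to both branches of the input mapping
    inputs = {"context": fake_retriever(question), "question": question}
    # Steps 2 and 3: format the prompt, then call the model
    return fake_llm(build_prompt(inputs))


answer = run_chain("What are two movies about toys after 1990?")
print(answer)
```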
