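Perform Hybrid Search with the LangChain Integration
You can integrate Atlas Vector Search with LangChain to perform hybrid search. In this tutorial, you complete the following steps:
Set up the environment.
Use Atlas as a vector store.
Create an Atlas Vector Search index and an Atlas Search index on your data.
Run hybrid search queries.
Pass the query results into your RAG pipeline.
Work with a runnable version of this tutorial as a Python notebook.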
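Prerequisites
To complete this tutorial, you must have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An environment to run interactive Python notebooks, such as Colab.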
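Set Up the Environment
Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook lets you run Python code snippets individually, and you use it to run the code in this tutorial.
To set up your notebook environment:
Set environment variables.
Run the following code to set the environment variables for this tutorial. Replace the <api-key> and <connection-string> placeholders with your OpenAI API key and your Atlas cluster's SRV connection string.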
import os

os.environ["OPENAI_API_KEY"] = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
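If your database username or password contains reserved characters such as @, :, or /, they generally need to be percent-encoded before they go into the connection string. The following is a minimal sketch that uses only the Python standard library. The helper name build_srv_uri and the example credentials and host are hypothetical and are not part of this tutorial.

```python
from urllib.parse import quote_plus


def build_srv_uri(username: str, password: str, host: str) -> str:
    """Return an SRV connection string with percent-encoded credentials.

    `host` is the part after the @ sign, for example
    "<clusterName>.<hostname>.mongodb.net".
    """
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical values, for illustration only
example_uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(example_uri)
```

You could then assign the returned string to ATLAS_CONNECTION_STRING instead of typing the encoded form by hand.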
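Use Atlas as a Vector Store
You must use Atlas as the vector store for your data. You can instantiate a vector store by using an existing collection in Atlas.
Load the sample data.
Note
If you want to use your own data, see Get Started with LangChain or How to Create Vector Embeddings to learn how to ingest vector embeddings into Atlas.
Instantiate the vector store.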
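Paste and run the following code in your notebook to create a vector store instance named vector_store from the sample_mflix.embedded_movies namespace in Atlas. This code uses the from_connection_string method to create the MongoDBAtlasVectorSearch vector store and specifies the following parameters:
Your Atlas cluster's connection string.
An OpenAI embedding model as the model used to convert text into vector embeddings. By default, this model is text-embedding-ada-002.
sample_mflix.embedded_movies as the namespace to use.
plot as the field that contains the text.
plot_embedding as the field that contains the embeddings.
dotProduct as the relevance score function.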
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string = ATLAS_CONNECTION_STRING,
    embedding = OpenAIEmbeddings(disallowed_special=()),
    namespace = "sample_mflix.embedded_movies",
    text_key = "plot",
    embedding_key = "plot_embedding",
    relevance_score_fn = "dotProduct"
)
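Create the Indexes
Note
To create an Atlas Vector Search or Atlas Search index, you must have Project Data Access Admin or higher access to the Atlas project.
To enable hybrid search queries on your vector store, create an Atlas Vector Search index and an Atlas Search index on the collection. You can create the indexes by using either the LangChain helper methods or the PyMongo driver method:
Create the Atlas Search index.
Run the following code in your notebook to create a search index that indexes the plot field in the collection. This example uses the LangChain helper method.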
from langchain_mongodb.index import create_fulltext_search_index
from pymongo import MongoClient

# Connect to your cluster
client = MongoClient(ATLAS_CONNECTION_STRING)

# Use helper method to create the search index
create_fulltext_search_index(
    collection = client["sample_mflix"]["embedded_movies"],
    field = "plot",
    index_name = "search_index"
)
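Create the Atlas Vector Search index.
Run the following code to create a vector search index that indexes the plot_embedding field in the collection. This example uses the PyMongo driver.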
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Connect to your cluster
client = MongoClient(ATLAS_CONNECTION_STRING)
collection = client["sample_mflix"]["embedded_movies"]

# Create your vector search index model, then create the index
vector_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "plot_embedding",
                "numDimensions": 1536,
                "similarity": "dotProduct"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=vector_index_model)
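Create the Atlas Search index.
Run the following code to create a search index that indexes the plot field in the collection. This example uses the PyMongo driver and relies on the collection variable and the SearchIndexModel import from the PyMongo vector search index code. It creates an index with the same name, search_index, as the LangChain helper method example, so run only one of the two.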
# Reuses `collection` and `SearchIndexModel` from the PyMongo vector index code
# Create your search index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "mappings": {
            "dynamic": False,
            "fields": {
                "plot": {
                    "type": "string"
                }
            }
        }
    },
    name="search_index"
)
collection.create_search_index(model=search_index_model)
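Building the indexes takes about one minute. While they build, the indexes are in an initial sync state. After they finish building, you can start querying the data in your collection.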
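Rather than waiting a fixed amount of time, you might poll the index status before running queries. The following is a minimal sketch, not part of the original tutorial. The helper name wait_until_queryable is hypothetical, and it assumes that the object you pass behaves like a PyMongo collection whose list_search_indexes(name) call yields index documents with a boolean queryable field.

```python
import time


def wait_until_queryable(collection, index_name, timeout_s=300.0, poll_s=5.0):
    """Poll until the named search index reports queryable=True.

    `collection` is assumed to behave like a PyMongo Collection whose
    list_search_indexes(name) yields documents with a "queryable" field.
    Returns True if the index became queryable, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, you could call wait_until_queryable(collection, "vector_index") and then wait_until_queryable(collection, "search_index") before you run the hybrid search query.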
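Run a Hybrid Search Query
After Atlas builds your indexes, you can run hybrid search queries on your data. The following code uses the MongoDBAtlasHybridSearchRetriever retriever to perform a hybrid search for the string time travel. It also specifies the following parameters:
vectorstore: The name of the vector store instance.
search_index_name: The name of the Atlas Search index.
top_k: The number of documents to return.
fulltext_penalty: The penalty for full-text search. The lower the penalty, the higher the full-text search score.
vector_penalty: The penalty for vector search. The lower the penalty, the higher the vector search score.
The retriever returns a list of documents sorted by the sum of the full-text search score and the vector search score. The final output of the code example includes the title, the plot, and the individual scores for each document.
To learn more about hybrid search query results, see About the Query.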
from langchain_mongodb.retrievers.hybrid_search import MongoDBAtlasHybridSearchRetriever

# Initialize the retriever
retriever = MongoDBAtlasHybridSearchRetriever(
    vectorstore = vector_store,
    search_index_name = "search_index",
    top_k = 5,
    fulltext_penalty = 50,
    vector_penalty = 50
)

# Define your query
query = "time travel"

# Print results
documents = retriever.invoke(query)
for doc in documents:
    print("Title: " + doc.metadata["title"])
    print("Plot: " + doc.page_content)
    print("Search score: {}".format(doc.metadata["fulltext_score"]))
    print("Vector Search score: {}".format(doc.metadata["vector_score"]))
    print("Total score: {}\n".format(doc.metadata["fulltext_score"] + doc.metadata["vector_score"]))
Title: Timecop
Plot: An officer for a security agency that regulates time travel, must fend for his life against a shady politician who has a tie to his past.
Search score: 0.019230769230769232
Vector Search score: 0.01818181818181818
Total score: 0.03741258741258741

Title: The Time Traveler's Wife
Plot: A romantic drama about a Chicago librarian with a gene that causes him to involuntarily time travel, and the complications it creates for his marriage.
Search score: 0.0196078431372549
Vector Search score: 0
Total score: 0.0196078431372549

Title: Thrill Seekers
Plot: A reporter, learning of time travelers visiting 20th century disasters, tries to change the history they know by averting upcoming disasters.
Search score: 0
Vector Search score: 0.0196078431372549
Total score: 0.0196078431372549

Title: About Time
Plot: At the age of 21, Tim discovers he can travel in time and change what happens and has happened in his own life. His decision to make his world a better place by getting a girlfriend turns out not to be as easy as you might think.
Search score: 0
Vector Search score: 0.019230769230769232
Total score: 0.019230769230769232

Title: My iz budushchego
Plot: My iz budushchego, or We Are from the Future, is a movie about time travel. Four 21st century treasure seekers are transported back into the middle of a WWII battle in Russia. The movie's ...
Search score: 0.018867924528301886
Vector Search score: 0
Total score: 0.018867924528301886
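The scores in this output are consistent with a reciprocal rank formula of the form 1 / (rank + penalty), with ranks starting at 1: with a penalty of 50, the top result of either search scores 1/51 (about 0.0196) and the second scores 1/52 (about 0.0192). The following pure-Python sketch illustrates that idea under this assumption. It is not the retriever's actual implementation, the function names are hypothetical, and the two ranked lists are reconstructed from the output, with placeholder entries for the ranks that the output does not reveal.

```python
def reciprocal_rank_scores(ranked_ids, penalty):
    """Map each document id to 1 / (rank + penalty), with rank starting at 1."""
    return {
        doc_id: 1.0 / (rank + penalty)
        for rank, doc_id in enumerate(ranked_ids, start=1)
    }


def fuse(fulltext_ranked, vector_ranked, fulltext_penalty=50, vector_penalty=50, top_k=5):
    """Sum the reciprocal-rank scores of both lists and keep the top_k totals."""
    totals = reciprocal_rank_scores(fulltext_ranked, fulltext_penalty)
    for doc_id, score in reciprocal_rank_scores(vector_ranked, vector_penalty).items():
        totals[doc_id] = totals.get(doc_id, 0.0) + score
    # sorted() is stable, so tied documents keep their insertion order
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Rankings inferred from the output; "vector-rank-3" and "vector-rank-4"
# are placeholders for documents that the output does not show.
fulltext_ranked = ["The Time Traveler's Wife", "Timecop", "My iz budushchego"]
vector_ranked = ["Thrill Seekers", "About Time", "vector-rank-3", "vector-rank-4", "Timecop"]

results = fuse(fulltext_ranked, vector_ranked)
for title, total in results:
    print(title, total)
```

Under this formula, raising one penalty shrinks the score differences between ranks for that search type, so a document that appears in both result lists, such as Timecop here, tends to rise to the top.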
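Pass Results to a RAG Pipeline
You can pass your hybrid search results into your RAG pipeline to generate responses based on the retrieved documents. The sample code does the following:
Defines a LangChain prompt template that instructs the LLM to use the retrieved documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {query} variable.
Constructs a chain that specifies the following:
The hybrid search retriever that you defined to retrieve relevant documents.
The prompt template that you defined.
An LLM from OpenAI to generate a context-aware response. By default, this is the gpt-3.5-turbo model.
Prompts the chain with a sample query and returns the response. The generated response might vary.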
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Define a prompt template
template = """
   Use the following pieces of context to answer the question at the end.
   {context}
   Question: Can you recommend some movies about {query}?
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI()

# Construct a chain to answer questions on your data
chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Prompt the chain
query = "time travel"
answer = chain.invoke(query)
print(answer)
Based on the pieces of context provided, here are some movies about time travel that you may find interesting:

1. "Timecop" (1994) - A movie about a cop who is part of a law enforcement agency that regulates time travel, seeking justice and dealing with personal loss.
2. "The Time Traveler's Wife" (2009) - A romantic drama about a man with the ability to time travel involuntarily and the impact it has on his relationship with his wife.
3. "Thrill Seekers" (1999) - A movie about two reporters trying to prevent disasters by tracking down a time traveler witnessing major catastrophes.
4. "About Time" (2013) - A film about a man who discovers he can travel through time and uses this ability to improve his life and relationships.
5. "My iz budushchego" (2008) - A Russian movie where four treasure seekers from the 21st century are transported back to a WWII battle, exploring themes of action, drama, fantasy, and romance.

These movies offer a variety of perspectives on time travel and its impact on individuals and society.
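In the chain above, the retriever's output is passed to the {context} variable as a list of document objects. One optional variation is to format the documents into plain text first, so that the prompt contains only titles and plots. The following is a minimal sketch of such a formatter. The name format_docs is hypothetical, and the sample objects are stand-ins that only mimic the page_content and metadata attributes used elsewhere in this tutorial.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join retrieved documents into one context string, one movie per paragraph."""
    return "\n\n".join(
        f"Title: {doc.metadata.get('title', 'Unknown')}\nPlot: {doc.page_content}"
        for doc in docs
    )


# Stand-in objects with invented plot text, for illustration only
sample_docs = [
    SimpleNamespace(page_content="An officer regulates time travel.", metadata={"title": "Timecop"}),
    SimpleNamespace(page_content="A man discovers he can travel in time.", metadata={"title": "About Time"}),
]

context = format_docs(sample_docs)
print(context)
```

Assuming that LangChain's pipe operator accepts a plain function as a chain step, you could use retriever | format_docs in place of retriever for the context key.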