Add Memory and Semantic Caching to your RAG Applications with LangChain and MongoDB
This tutorial demonstrates how to enhance your RAG applications by adding conversation memory and semantic caching using the LangChain MongoDB integration.
Memory allows you to maintain conversation context across multiple user interactions.
Semantic caching reduces response latency by caching semantically similar queries.
Work with a runnable version of this tutorial as a Python notebook.
Prerequisites
Before you begin, ensure that you have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A Voyage AI API key. To learn more, see API Key and Python Client.
An OpenAI API Key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
Tip
We recommend completing the Get Started tutorial to learn how to create a naive RAG implementation before completing this tutorial.
Use Atlas as a Vector Store
In this section, you create a vector store instance using Atlas as the vector database.
Set up the environment.
Set up the environment for this tutorial.
Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.
To set up your notebook environment:
Run the following command in your notebook:
pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-voyageai langchain-openai pypdf
Set environment variables.
Run the following code to set the environment variables for this tutorial. Provide your Voyage API key, OpenAI API Key, and Atlas cluster's SRV connection string.
import os

os.environ["OPENAI_API_KEY"] = "<openai-key>"
os.environ["VOYAGE_API_KEY"] = "<voyage-key>"
MONGODB_URI = "<connection-string>"
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
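If you want to confirm that your connection string works before continuing, you can run a quick check with PyMongo, which is installed as a dependency of langchain-mongodb. This optional snippet isn't part of the tutorial steps; it just pings the cluster.

from pymongo import MongoClient

# Optional sanity check: ping the cluster to confirm connectivity
client = MongoClient(MONGODB_URI)
client.admin.command("ping")  # Raises an exception if the cluster is unreachable
print("Successfully connected to Atlas")
client.close()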
Instantiate the vector store.
Paste and run the following code in your notebook to create a vector store instance named vector_store using the langchain_db.rag_with_memory namespace in Atlas:
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings

# Use the voyage-3-large embedding model
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")

# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string = MONGODB_URI,
    embedding = embedding_model,
    namespace = "langchain_db.rag_with_memory"
)
Add data to the vector store.
Paste and run the following code in your notebook to ingest a sample PDF that contains a recent MongoDB earnings report into the vector store.
This code uses a text splitter to chunk the PDF data into smaller documents. It specifies the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks) for each document.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/13176/pdf")
data = loader.load()

# Split PDF into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents(data)

# Add data to the vector store
vector_store.add_documents(docs)
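If you want to verify how the splitter chunked the PDF, you can inspect a few of the resulting documents. This optional snippet assumes the docs list created by the previous code:

# Optional: inspect the first few chunks to see the effect of
# chunk_size=200 and chunk_overlap=20
print(f"Number of chunks: {len(docs)}")
for doc in docs[:3]:
    print(len(doc.page_content), "characters:", doc.page_content[:80])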
Tip
After running this code, you can view your vector embeddings in the Atlas UI by navigating to the langchain_db.rag_with_memory collection in your cluster.
Create the Atlas Vector Search index.
Run the following code to create the Atlas Vector Search index for the vector store to enable vector search over your data:
# Use LangChain helper method to create the vector search index
vector_store.create_vector_search_index(
    dimensions = 1024  # The dimensions of the vector embeddings to be indexed
)
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
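If you want to confirm that the index is ready before moving on, one option is to poll the search index metadata with PyMongo and then run a test query against the vector store. This is an optional sketch, not an official step: it assumes the default index name vector_index created by the helper method, and the queryable field reflects the Atlas Search index metadata returned by recent driver versions.

import time
from pymongo import MongoClient

# Optional: wait until the vector search index is queryable
collection = MongoClient(MONGODB_URI)["langchain_db"]["rag_with_memory"]

while True:
    indexes = list(collection.list_search_indexes("vector_index"))
    if indexes and indexes[0].get("queryable"):
        break
    time.sleep(5)

# Run a quick semantic query to confirm the vector store returns results
for doc in vector_store.similarity_search("MongoDB acquisition", k=2):
    print(doc.page_content[:100])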
Implement RAG with Memory
This section demonstrates how to implement RAG with conversation memory by using the LangChain MongoDB integration.
Define a function to get chat message history.
To maintain conversation history across multiple interactions, use the MongoDBChatMessageHistory class. It allows you to store chat messages in a MongoDB database and integrate them into your RAG chain to handle conversation context.
Paste and run the following code in your notebook to create a function named get_session_history that returns a MongoDBChatMessageHistory instance. This instance retrieves the chat history for a specific session.
from langchain_mongodb.chat_message_histories import MongoDBChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import MessagesPlaceholder

def get_session_history(session_id: str) -> MongoDBChatMessageHistory:
    return MongoDBChatMessageHistory(
        connection_string=MONGODB_URI,
        session_id=session_id,
        database_name="langchain_db",
        collection_name="rag_with_memory"
    )
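To see what this function stores, you can optionally write and read back messages for a throwaway session ID. The add_user_message, add_ai_message, messages, and clear members come from LangChain's chat message history interface; this check isn't part of the main flow.

# Optional: write and read back messages for a throwaway session
history = get_session_history("test_session")
history.add_user_message("Hello!")
history.add_ai_message("Hi! How can I help you?")
print(history.messages)  # List of HumanMessage/AIMessage objects
history.clear()          # Remove the test messages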
Create a RAG chain that handles chat message history.
Paste and run the following code snippets to create the RAG chain:
Specify the LLM to use.
from langchain_openai import ChatOpenAI

# Define the model to use for chat completion
llm = ChatOpenAI(model = "gpt-4o")
Define a prompt that summarizes the chat history for the retriever.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create a prompt to generate standalone questions from follow-up questions
standalone_system_prompt = """
Given a chat history and a follow-up question, rephrase the follow-up question to be a standalone question.
Do NOT answer the question, just reformulate it if needed, otherwise return it as is.
Only return the final standalone question.
"""

standalone_question_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", standalone_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)

# Parse output as a string
parse_output = StrOutputParser()

question_chain = standalone_question_prompt | llm | parse_output
Build a retriever chain that processes the chat history and retrieves documents.
from langchain_core.runnables import RunnablePassthrough

# Create a retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={ "k": 5 })

# Create a retriever chain that processes the question with history and retrieves documents
retriever_chain = RunnablePassthrough.assign(context=question_chain | retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])))
Define a prompt to generate an answer based on the chat history and retrieved context.
# Create a prompt template that includes the retrieved context and chat history
rag_system_prompt = """Answer the question based only on the following context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)
Implement RAG with memory.
Combine the components you defined into a complete RAG chain:
# Build the RAG chain
rag_chain = (
    retriever_chain
    | rag_prompt
    | llm
    | parse_output
)

# Wrap the chain with message history
rag_with_memory = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)
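Because the chain persists messages by session ID, re-running the tutorial keeps appending to the same session's history. If you want to start from a clean slate, you can optionally clear a session before testing, reusing the get_session_history function defined earlier:

# Optional: clear any existing history for the session before testing
get_session_history("user_1").clear()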
Test your RAG implementation.
Invoke the chain to answer questions. This chain maintains the conversation context and returns relevant answers that consider the previous interactions. Your responses might vary.
# First question
response_1 = rag_with_memory.invoke(
    {"question": "What was MongoDB's latest acquisition?"},
    {"configurable": {"session_id": "user_1"}}
)
print(response_1)
MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models for next-generation AI applications.
# Follow-up question that references the previous question
response_2 = rag_with_memory.invoke(
    {"question": "Why did they do it?"},
    {"configurable": {"session_id": "user_1"}}
)
print(response_2)
MongoDB acquired Voyage AI to enable organizations to easily build trustworthy AI applications by integrating advanced embedding and reranking models into their technology. This acquisition aligns with MongoDB's goal of helping businesses innovate at "AI speed" using its flexible document model and seamless scalability.
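To confirm that both interactions were persisted, you can optionally print the stored messages for the session by using the same get_session_history function:

# Optional: confirm that the conversation was stored for this session
for message in get_session_history("user_1").messages:
    print(f"{message.type}: {message.content[:80]}")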
Add Semantic Caching
This section adds semantic caching on top of your RAG chain. Semantic caching is a form of caching that retrieves cached prompts based on the semantic similarity between queries.
Note
You can use semantic caching independently of conversation memory, but this tutorial uses both features together.
For a video tutorial of this feature, see Learn by Watching.
Configure the semantic cache.
Run the following code to configure the semantic cache by using the MongoDBAtlasSemanticCache class:
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain_core.globals import set_llm_cache

# Configure the semantic cache
set_llm_cache(MongoDBAtlasSemanticCache(
    connection_string = MONGODB_URI,
    database_name = "langchain_db",
    collection_name = "semantic_cache",
    embedding = embedding_model,
    index_name = "vector_index",
    similarity_threshold = 0.5  # Adjust based on your requirements
))
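If you later change the similarity_threshold or the embedding model, existing cache entries can produce misleading hits. One optional approach, using PyMongo directly, is to empty the semantic_cache collection before re-testing:

# Optional: clear cached entries before re-testing with new settings
from pymongo import MongoClient
MongoClient(MONGODB_URI)["langchain_db"]["semantic_cache"].delete_many({})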
Test the semantic cache with your RAG chain.
The semantic cache automatically caches your prompts. Run the following sample queries; you should see a significant reduction in response time for the second query. Your responses and response times might vary.
Tip
You can view your cached prompts in the semantic_cache collection.
The semantic cache caches only the input to the LLM. When using it in retrieval chains, note that the retrieved documents can change between runs, resulting in cache misses for semantically similar queries.
%%time

# First query (not cached)
rag_with_memory.invoke(
    {"question": "What was MongoDB's latest acquisition?"},
    {"configurable": {"session_id": "user_2"}}
)
CPU times: user 54.7 ms, sys: 34.2 ms, total: 88.9 ms
Wall time: 7.42 s
"MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation AI applications."
%%time

# Second query (cached)
rag_with_memory.invoke(
    {"question": "What company did MongoDB acquire recently?"},
    {"configurable": {"session_id": "user_2"}}
)
CPU times: user 79.7 ms, sys: 24 ms, total: 104 ms
Wall time: 3.87 s
'MongoDB recently acquired Voyage AI.'
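To see what the cache stored, you can optionally inspect the semantic_cache collection with PyMongo. The exact field names in the cached documents depend on your langchain-mongodb version, so this snippet only counts the entries and prints the keys of one document:

# Optional: inspect the cached entries
from pymongo import MongoClient

cache_collection = MongoClient(MONGODB_URI)["langchain_db"]["semantic_cache"]
print("Cached entries:", cache_collection.count_documents({}))
for entry in cache_collection.find().limit(1):
    print(list(entry.keys()))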
Learn by Watching
Follow along with this video tutorial to learn more about semantic caching with LangChain and MongoDB.
Duration: 30 Minutes