Add Memory and Semantic Caching to your RAG Applications with LangChain and MongoDB
This tutorial demonstrates how to enhance your RAG applications by adding conversation memory and semantic caching using the LangChain MongoDB integration.
Memory allows you to maintain conversation context across multiple user interactions.
Semantic caching reduces response latency by caching semantically similar queries.
Work with a runnable version of this tutorial as a Python notebook.
Prerequisites
Before you begin, ensure that you have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A Voyage AI API key. To learn more, see API Key and Python Client.
An OpenAI API Key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
Tip
We recommend completing the Get Started tutorial to learn how to create a naive RAG implementation before completing this tutorial.
Use Atlas as a Vector Store
In this section, you create a vector store instance using Atlas as the vector database.
Set up the environment.
Set up the environment for this tutorial.
Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.
To set up your notebook environment:
Run the following command in your notebook:
pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-voyageai langchain-openai pypdf
Set environment variables.
Run the following code to set the environment variables for this tutorial. Provide your Voyage API key, OpenAI API Key, and Atlas cluster's SRV connection string.
import os

os.environ["OPENAI_API_KEY"] = "<openai-key>"
os.environ["VOYAGE_API_KEY"] = "<voyage-key>"
MONGODB_URI = "<connection-string>"
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
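If you want to confirm that your connection string works before continuing, you can run a quick check with PyMongo, which is installed as a dependency of langchain-mongodb. This optional snippet isn't part of the tutorial steps; it just pings the cluster.

from pymongo import MongoClient

# Optional sanity check: ping the cluster to confirm connectivity
client = MongoClient(MONGODB_URI)
client.admin.command("ping")  # Raises an exception if the cluster is unreachable
print("Successfully connected to Atlas")
client.close()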
Instantiate the vector store.
Paste and run the following code in your notebook to create a vector store instance named vector_store using the langchain_db.rag_with_memory namespace in Atlas:
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings

# Use the voyage-3-large embedding model
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")

# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string = MONGODB_URI,
    embedding = embedding_model,
    namespace = "langchain_db.rag_with_memory"
)
Add data to the vector store.
Paste and run the following code in your notebook to ingest a sample PDF that contains a recent MongoDB earnings report into the vector store.
This code uses a text splitter to chunk the PDF data into smaller documents. It specifies the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks) for each document.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/13176/pdf")
data = loader.load()

# Split PDF into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents(data)

# Add data to the vector store
vector_store.add_documents(docs)
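If you want to verify how the splitter chunked the PDF, you can inspect a few of the resulting documents. This optional snippet assumes the docs list created by the previous code:

# Optional: inspect the first few chunks to see the effect of
# chunk_size=200 and chunk_overlap=20
print(f"Number of chunks: {len(docs)}")
for doc in docs[:3]:
    print(len(doc.page_content), "characters:", doc.page_content[:80])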
Tip
After running this code, you can view your vector embeddings in the Atlas UI by navigating to the langchain_db.rag_with_memory collection in your cluster.
Create the Atlas Vector Search index.
Run the following code to create the Atlas Vector Search index for the vector store to enable vector search over your data:
# Use LangChain helper method to create the vector search index
vector_store.create_vector_search_index(
    dimensions = 1024  # The dimensions of the vector embeddings to be indexed
)
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
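If you want to confirm that the index is ready before moving on, one option is to poll the search index metadata with PyMongo and then run a test query against the vector store. This is an optional sketch, not an official step: it assumes the default index name vector_index created by the helper method, and the queryable field reflects the Atlas Search index metadata returned by recent driver versions.

import time
from pymongo import MongoClient

# Optional: wait until the vector search index is queryable
collection = MongoClient(MONGODB_URI)["langchain_db"]["rag_with_memory"]

while True:
    indexes = list(collection.list_search_indexes("vector_index"))
    if indexes and indexes[0].get("queryable"):
        break
    time.sleep(5)

# Run a quick semantic query to confirm the vector store returns results
for doc in vector_store.similarity_search("MongoDB acquisition", k=2):
    print(doc.page_content[:100])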
Implement RAG with Memory
This section demonstrates how to implement RAG with conversation memory by using the LangChain MongoDB integration.
Define a function to get chat message history.
To maintain conversation history across multiple interactions, use the MongoDBChatMessageHistory class. It allows you to store chat messages in a MongoDB database and integrate them into your RAG chain to handle conversation context.
Paste and run the following code in your notebook to create a function named get_session_history that returns a MongoDBChatMessageHistory instance. This instance retrieves the chat history for a specific session.
from langchain_mongodb.chat_message_histories import MongoDBChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import MessagesPlaceholder

def get_session_history(session_id: str) -> MongoDBChatMessageHistory:
    return MongoDBChatMessageHistory(
        connection_string=MONGODB_URI,
        session_id=session_id,
        database_name="langchain_db",
        collection_name="rag_with_memory"
    )
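To see what this function stores, you can optionally write and read back messages for a throwaway session ID. The add_user_message, add_ai_message, messages, and clear members come from LangChain's chat message history interface; this check isn't part of the main flow.

# Optional: write and read back messages for a throwaway session
history = get_session_history("test_session")
history.add_user_message("Hello!")
history.add_ai_message("Hi! How can I help you?")
print(history.messages)  # List of HumanMessage/AIMessage objects
history.clear()          # Remove the test messages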
Create a RAG chain that handles chat message history.
Paste and run the following code snippets to create the RAG chain:
Specify the LLM to use.
from langchain_openai import ChatOpenAI

# Define the model to use for chat completion
llm = ChatOpenAI(model = "gpt-4o")
Define a prompt that summarizes the chat history for the retriever.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create a prompt to generate standalone questions from follow-up questions
standalone_system_prompt = """
Given a chat history and a follow-up question, rephrase the follow-up question to be a standalone question.
Do NOT answer the question, just reformulate it if needed, otherwise return it as is.
Only return the final standalone question.
"""

standalone_question_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", standalone_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)

# Parse output as a string
parse_output = StrOutputParser()

question_chain = standalone_question_prompt | llm | parse_output
Build a retriever chain that processes the chat history and retrieves documents.
from langchain_core.runnables import RunnablePassthrough

# Create a retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={ "k": 5 })

# Create a retriever chain that processes the question with history and retrieves documents
retriever_chain = RunnablePassthrough.assign(context=question_chain | retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])))
Define a prompt to generate an answer based on the chat history and retrieved context.
# Create a prompt template that includes the retrieved context and chat history
rag_system_prompt = """Answer the question based only on the following context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)
Implement RAG with memory.
Combine the components you defined into a complete RAG chain:
# Build the RAG chain
rag_chain = (
    retriever_chain
    | rag_prompt
    | llm
    | parse_output
)

# Wrap the chain with message history
rag_with_memory = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)
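Because the chain persists messages by session ID, re-running the tutorial keeps appending to the same session's history. If you want to start from a clean slate, you can optionally clear a session before testing, reusing the get_session_history function defined earlier:

# Optional: clear any existing history for the session before testing
get_session_history("user_1").clear()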
Test your RAG implementation.
Invoke the chain to answer questions. This chain maintains the conversation context and returns relevant answers that consider the previous interactions. Your responses might vary.
# First question
response_1 = rag_with_memory.invoke(
    {"question": "What was MongoDB's latest acquisition?"},
    {"configurable": {"session_id": "user_1"}}
)
print(response_1)
MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models for next-generation AI applications.
# Follow-up question that references the previous question
response_2 = rag_with_memory.invoke(
    {"question": "Why did they do it?"},
    {"configurable": {"session_id": "user_1"}}
)
print(response_2)
MongoDB acquired Voyage AI to enable organizations to easily build trustworthy AI applications by integrating advanced embedding and reranking models into their technology. This acquisition aligns with MongoDB's goal of helping businesses innovate at "AI speed" using its flexible document model and seamless scalability.
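To confirm that both interactions were persisted, you can optionally print the stored messages for the session by using the same get_session_history function:

# Optional: confirm that the conversation was stored for this session
for message in get_session_history("user_1").messages:
    print(f"{message.type}: {message.content[:80]}")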
Add Semantic Caching
This section adds semantic caching on top of your RAG chain. Semantic caching is a form of caching that retrieves cached prompts based on the semantic similarity between queries.
Note
You can use semantic caching independently of conversation memory, but this tutorial uses both features together.
For a video tutorial of this feature, see Learn by Watching.
Configure the semantic cache.
Run the following code to configure the semantic cache by using the MongoDBAtlasSemanticCache class:
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain_core.globals import set_llm_cache

# Configure the semantic cache
set_llm_cache(MongoDBAtlasSemanticCache(
    connection_string = MONGODB_URI,
    database_name = "langchain_db",
    collection_name = "semantic_cache",
    embedding = embedding_model,
    index_name = "vector_index",
    similarity_threshold = 0.5  # Adjust based on your requirements
))
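If you later change the similarity_threshold or the embedding model, existing cache entries can produce misleading hits. One optional approach, using PyMongo directly, is to empty the semantic_cache collection before re-testing:

# Optional: clear cached entries before re-testing with new settings
from pymongo import MongoClient
MongoClient(MONGODB_URI)["langchain_db"]["semantic_cache"].delete_many({})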
Test the semantic cache with your RAG chain.
The semantic cache automatically caches your prompts. Run the following sample queries; you should see a significant reduction in response time for the second query. Your responses and response times might vary.
Tip
You can view your cached prompts in the semantic_cache collection.
The semantic cache caches only the input to the LLM. When using it in retrieval chains, note that the retrieved documents can change between runs, resulting in cache misses for semantically similar queries.
%%time

# First query (not cached)
rag_with_memory.invoke(
    {"question": "What was MongoDB's latest acquisition?"},
    {"configurable": {"session_id": "user_2"}}
)
CPU times: user 54.7 ms, sys: 34.2 ms, total: 88.9 ms
Wall time: 7.42 s
"MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation AI applications."
%%time

# Second query (cached)
rag_with_memory.invoke(
    {"question": "What company did MongoDB acquire recently?"},
    {"configurable": {"session_id": "user_2"}}
)
CPU times: user 79.7 ms, sys: 24 ms, total: 104 ms
Wall time: 3.87 s
'MongoDB recently acquired Voyage AI.'
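To see what the cache stored, you can optionally inspect the semantic_cache collection with PyMongo. The exact field names in the cached documents depend on your langchain-mongodb version, so this snippet only counts the entries and prints the keys of one document:

# Optional: inspect the cached entries
from pymongo import MongoClient

cache_collection = MongoClient(MONGODB_URI)["langchain_db"]["semantic_cache"]
print("Cached entries:", cache_collection.count_documents({}))
for entry in cache_collection.find().limit(1):
    print(list(entry.keys()))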
Learn by Watching
Follow along with this video tutorial to learn more about semantic caching with LangChain and MongoDB.
Duration: 30 Minutes