How to Improve LLM Applications With Parent Document Retrieval Using MongoDB and LangChain
Chunking, in the context of LLM applications, is the process of breaking down large pieces of text into smaller segments or chunks. Chunking is an important component of any LLM application that involves retrieving data from a knowledge base since it impacts the quality of everything downstream—from embeddings to retrieval, and the generation itself.
The main concern with chunking, however, is that you inevitably lose context in an attempt to keep chunks targeted and focused to maintain embedding quality. This can hurt generation quality since the information required to answer a particular question might get spread across multiple chunks.
This is where a technique called parent document retrieval can help. In this tutorial, we will see how this technique helps retain the benefits of chunking without impacting generation quality. Specifically, we will cover the following:
- What is parent document retrieval and when should you use it?
- How parent document retrieval works in MongoDB
- Implementing parent document retrieval using MongoDB’s LangChain integration
- Using parent document retrieval in retrieval augmented generation (RAG) and agentic workflows
When splitting documents for LLM applications, there are often conflicting considerations:
- Chunks should be small enough so that embeddings can accurately capture their meaning, resulting in good retrieval quality.
- Chunks should be large enough so as to not spread context across multiple chunks, resulting in good generation quality.
This is difficult to achieve with simple strategies that rely on a single pre-defined chunk size, such as fixed-size token splitting with overlap or recursive splitting with overlap. Parent document retrieval aims to strike a balance between the two requirements: it embeds and stores small chunks, but identifies and fetches the source document, or larger chunks, at retrieval time.
The main advantage of this technique is that it provides more complete context to the LLM, resulting in more contextualized responses. Some use cases where context expansion can prove useful are as follows:
- Legal case preparation: Expanding a response about a termination clause with information about dispute resolution and governing law from the same document.
- Documentation chatbots: Answering a question on API authentication with information about token expiration and refresh mechanisms.
- Scientific research: A query about "results from experiment A" expands to include methods, hypotheses, and limitations.
In this tutorial, we will use MongoDB’s LangChain integration, which provides a simple API for parent document retrieval. But first, let’s look at what happens under the hood.
At ingest time, documents are split into small chunks, embedded, and stored in a MongoDB collection. Each chunked document has a parent ID, which is a unique identifier for the parent document that the chunk came from. The parent documents are also stored in the same collection, with the `_id` field matching the parent ID contained in the corresponding document chunks. A visual representation of this process is as follows:
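To make the storage layout concrete, here is a rough sketch of what a parent document and one of its child chunks might look like in the collection. The field names (`doc_id` for the parent ID and `embedding` for the chunk embedding) are illustrative assumptions and may differ from the defaults used by the integration:

```python
# Illustrative only: approximate shape of the stored documents (field names are assumptions)
parent_document = {
    "_id": "9f2c4a",  # unique identifier of the parent document
    "page_content": "Full text of the original document...",
    "title": "View Database Access History",
}

child_chunk = {
    "_id": "b7e01d",
    "doc_id": "9f2c4a",  # parent ID pointing back to the parent document's _id
    "page_content": "A small chunk of the parent document...",
    "embedding": [0.01, -0.02, 0.03],  # vector embedding of the chunk (truncated)
}
```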
At retrieval time, the user query is embedded and the relevant chunks are retrieved using semantic search. A `$lookup` operation in MongoDB, akin to a left outer join, is performed to obtain the parent documents of the retrieved chunks from the same collection. The chunks themselves and any duplicate parent documents are then dropped, and unique parent documents are passed on to the LLM as context to answer the user query. All of this is achieved using MongoDB’s rich aggregation framework. A visual representation of the retrieval and generation process is as follows:
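Conceptually, the retrieval step resembles the aggregation pipeline sketched below. This is a simplified illustration rather than the exact pipeline the integration generates; `query_embedding` is a placeholder for the embedded user query, and the index, collection, and field names are assumptions:

```python
# Simplified sketch of the retrieval aggregation (not the exact pipeline the integration runs)
pipeline = [
    {
        "$vectorSearch": {  # semantic search over the embedded child chunks
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_embedding,  # placeholder: embedding of the user query
            "numCandidates": 100,
            "limit": 10,
        }
    },
    {
        "$lookup": {  # fetch each chunk's parent document from the same collection
            "from": "parent_doc",
            "localField": "doc_id",
            "foreignField": "_id",
            "as": "parent",
        }
    },
    {"$unwind": "$parent"},
    # Deduplicate so each parent document is passed to the LLM only once
    {"$group": {"_id": "$parent._id", "parent": {"$first": "$parent"}}},
]
```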
In this tutorial, we will implement parent document retrieval using MongoDB’s LangChain integration, and see how to use it in a RAG application as well as an AI agent. The Jupyter Notebook for this tutorial can be found on GitHub in our GenAI Showcase repository.
We will require the following libraries for this tutorial:
- datasets: Python package to download datasets from Hugging Face
- pymongo: Python driver for MongoDB
- langchain: Python package for LangChain's core modules
- langgraph: Python package to orchestrate LLM workflows as graphs
- langchain-mongodb: Python package to use MongoDB features in LangChain
- langchain-openai: Python package to use OpenAI models via LangChain
```python
! pip install -qU datasets pymongo langchain langgraph langchain-mongodb langchain-openai
```
We will use OpenAI as the embedding as well as chat completion model provider. To use their models, you need to obtain an OpenAI API key and set it as an environment variable:
1 os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
We will use MongoDB for parent document retrieval. But first, you will need a MongoDB Atlas account with a database cluster. Once you do that, you will need to get the connection string to connect to your cluster. Follow these steps to get set up:
- Obtain the connection string for your database cluster.
Once you have the connection string, set it in your code, instantiate the MongoDB client, and ensure that you are able to connect to your database using the `ping` command.

```python
from pymongo import MongoClient

MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")
mongodb_client = MongoClient(
    MONGODB_URI, appname="devrel.showcase.parent_doc_retrieval"
)
mongodb_client.admin.command("ping")
```
We will use a snapshot of MongoDB’s official documentation as the dataset for our tutorial. This dataset is available on Hugging Face. To download this dataset, you will need to request access to it and create a user access token. Follow the steps here to get set up:
- Request access to the dataset. Requests are approved automatically so you should have access to the dataset instantaneously.
Once you have the access token, set it as an environment variable:
1 os.environ["HF_TOKEN"] = getpass.getpass("Enter your HF Access Token:")
First, let’s download the MongoDB Docs dataset from Hugging Face.
```python
from datasets import load_dataset
import pandas as pd

data = load_dataset("mongodb-eai/docs", streaming=True, split="train")
data_head = data.take(1000)
df = pd.DataFrame(data_head)
```
We download the dataset in streaming mode so that we only pull a subset of records instead of downloading the entire dataset to disk.
The easiest way to use your data with LangChain features is by converting it into LangChain document objects (we will refer to these as “documents” in this tutorial). These objects consist of two attributes, namely `page_content` and `metadata`. `page_content`, as the name suggests, corresponds to the content of the document, and `metadata` is basic information about the document that you can customize or that LangChain extracts automatically.

```python
from langchain_core.documents import Document

docs = []
metadata_fields = ["updated", "url", "title"]
for _, row in df.iterrows():
    content = row["body"]
    metadata = row["metadata"]
    for field in metadata_fields:
        metadata[field] = row[field]
    docs.append(Document(page_content=content, metadata=metadata))
```
In the above code, we iterate through the rows of our Docs dataset and create a LangChain document per row. From each row, we extract the `body` field as the `page_content` of the document. We also extract `metadata` and a few other fields such as `url`, `title`, etc. as the `metadata` attribute of the document. An example of a LangChain document object is as follows:
```python
Document(page_content='# View Database Access History\n\n- This feature is not available for `M0` free clusters, `M2`, and `M5` clusters. To learn more, see Atlas M0 (Free Cluster), M2, and M5 Limits', metadata={'contentType': None, 'pageDescription': None, 'productName': 'MongoDB Atlas', 'tags': ['atlas', 'docs'], 'version': None, 'updated': {'$date': '2024-05-20T17:30:49.148Z'}, 'url': 'https://mongodb.com/docs/atlas/access-tracking/', 'title': 'View Database Access History'})
```
Whenever a MongoDB parent document retriever is instantiated using the `from_connection_string` method, it automatically creates an instance of the `MongoDBAtlasVectorSearch` vector store and the `MongoDBDocStore` document store. When documents are added to the retriever, the MongoDB Atlas vector store splits them into chunks (child documents), generates embeddings for the chunks, and ingests them into a MongoDB collection. The MongoDB document store ingests the parent documents into the same collection.

MongoDB Atlas is a unified platform for vector and operational data. This allows the same collection to act as both the vector store and the document store. In most other cases, you would need to use one platform as the vector store and another as the document store.
So let’s first specify the embedding model, the database, and the collection to ingest documents into, and define a helper function for chunking documents.
```python
from langchain_mongodb.retrievers import (
    MongoDBAtlasParentDocumentRetriever,
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

DB_NAME = "langchain"
COLLECTION_NAME = "parent_doc"

def get_splitter(chunk_size: int) -> RecursiveCharacterTextSplitter:
    """
    Returns a token-based text splitter with overlap

    Args:
        chunk_size (int): Chunk size in number of tokens

    Returns:
        RecursiveCharacterTextSplitter: Recursive text splitter object
    """
    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=0.15 * chunk_size,
    )
```
The above code:
- Initializes the embedding model. We are using OpenAI’s `text-embedding-3-small`.
- Specifies the database (`DB_NAME`) and collection (`COLLECTION_NAME`) to ingest data into.
- Defines a function called `get_splitter` for chunking documents. The function takes a `chunk_size` as an argument and returns an object of the `RecursiveCharacterTextSplitter` class. We use the `from_tiktoken_encoder` method of the class, which means texts will first be split by a list of characters and then merged into tokens until the specified `chunk_size` is reached. We also specify a `chunk_overlap` corresponding to 15% of the `chunk_size`.
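To sanity-check the splitter before wiring it into the retriever, you can split one of the documents created earlier and inspect the output. This is optional and shown only for illustration; it assumes the `docs` list from the previous step is in scope:

```python
# Optional sanity check: split a single document and inspect the resulting chunks
splitter = get_splitter(200)
sample_chunks = splitter.split_documents(docs[:1])
print(f"Split 1 document into {len(sample_chunks)} chunks")
print(sample_chunks[0].page_content[:200])
```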
Now, let’s instantiate the MongoDB parent document retriever:
```python
parent_doc_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(
    connection_string=MONGODB_URI,
    embedding_model=embedding_model,
    child_splitter=get_splitter(200),
    database_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    text_key="page_content",
    search_kwargs={"k": 10},
)
```
The above code uses the `from_connection_string` method with the following arguments to create an instance of `MongoDBAtlasParentDocumentRetriever`:
- connection_string: Connection string for your MongoDB Atlas cluster.
- embedding_model: Embedding model for the vector store. This was initialized previously.
- child_splitter: Uses the `get_splitter` function to create a text splitter for chunking documents according to the specified chunk size, in this case, 200 tokens.
- database_name: The MongoDB database to ingest parent and child documents into.
- collection_name: The MongoDB collection to ingest parent and child documents into.
- text_key: The field in the chunked documents that contains the raw text. In our documents, it is `page_content`.
- search_kwargs: Additional arguments for the search. We set `k` to 10 to retrieve the top 10 most relevant chunks during the semantic search that precedes parent document retrieval.
- kwargs: Any additional arguments to the parent document retriever.
You can also pass `parent_splitter` as an additional argument to the `from_connection_string` method. The idea here is to first split the raw documents into large chunks and then split those into smaller chunks. At retrieval time, instead of the full parent documents, the larger parent chunks are retrieved. You can instantiate a parent chunk retriever as follows:

```python
parent_chunk_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(
    connection_string=MONGODB_URI,
    embedding_model=embedding_model,
    child_splitter=get_splitter(200),
    parent_splitter=get_splitter(800),
    database_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    text_key="page_content",
    search_kwargs={"k": 10},
)
```
In the above example, the retriever will create parent chunks of size 800 tokens and child chunks of size 200 tokens.
We will use the `parent_doc_retriever` for the rest of the tutorial.

Now, let’s ingest documents into MongoDB using the retriever. We will ingest the documents asynchronously, which is especially useful when working with large datasets since you can process multiple batches of data concurrently, speeding up the ingest.
Let’s define some helper functions for the data ingest.
```python
import asyncio
from typing import Generator, List

BATCH_SIZE = 256
MAX_CONCURRENCY = 4

async def process_batch(batch: List[Document], semaphore: asyncio.Semaphore) -> None:
    """
    Ingest batches of documents into MongoDB

    Args:
        batch (List[Document]): Batch of documents to ingest
        semaphore (asyncio.Semaphore): Asyncio semaphore
    """
    async with semaphore:
        await parent_doc_retriever.aadd_documents(batch)
        print(f"Processed {len(batch)} documents")
```
The above code:
- Sets a batch size (`BATCH_SIZE`) that specifies the number of documents to process in a single task, and a concurrency limit (`MAX_CONCURRENCY`) which indicates the maximum number of tasks that can run simultaneously.
- Defines a function called `process_batch` which runs a batch of documents through the `parent_doc_retriever` using the `aadd_documents` method. As mentioned previously, the `parent_doc_retriever` will automatically chunk, embed, and ingest the documents via its vector and document stores.
Next, let’s define a function that creates the document batches, where each batch consists of `BATCH_SIZE` documents:

```python
def get_batches(docs: List[Document], batch_size: int) -> Generator:
    """
    Return batches of documents to ingest into MongoDB

    Args:
        docs (List[Document]): List of LangChain documents
        batch_size (int): Batch size

    Yields:
        Generator: Batch of documents
    """
    for i in range(0, len(docs), batch_size):
        yield docs[i : i + batch_size]
```
Finally, let’s define the main function that orchestrates the data ingest:
```python
async def process_docs(docs: List[Document]) -> List[None]:
    """
    Asynchronously ingest LangChain documents into MongoDB

    Args:
        docs (List[Document]): List of LangChain documents

    Returns:
        List[None]: Results of the task executions
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    batches = get_batches(docs, BATCH_SIZE)

    tasks = []
    for batch in batches:
        tasks.append(process_batch(batch, semaphore))
    # Gather results from all tasks
    results = await asyncio.gather(*tasks)
    return results
```
The above code:
- Splits up the list of documents to ingest (`docs`) into batches using the `get_batches` function defined previously.
- Creates a task for each batch using the `process_batch` function from before, imposing the concurrency limit using a semaphore.
- Uses `asyncio.gather` to execute tasks concurrently and collect their results. In our case, the tasks don’t return anything; they only ingest documents into MongoDB.
Now, let’s use the `process_docs` function above to ingest the LangChain documents from Step 4 into a MongoDB collection:

```python
collection = mongodb_client[DB_NAME][COLLECTION_NAME]
# Delete any existing documents from the collection
collection.delete_many({})
print("Deletion complete.")
# Ingest LangChain documents into MongoDB
results = await process_docs(docs)
```
The above code:
- Deletes any existing documents from the MongoDB collection that we want to ingest documents into.
- Asynchronously ingests the documents (`docs`) into MongoDB using the `process_docs` function defined previously.
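Once the ingest completes, a quick (optional) way to verify it is to count the documents in the collection. The check below assumes that child chunks store their embeddings under an `embedding` field, which matches the vector index definition used in the next step:

```python
# Optional sanity check: compare total documents vs. embedded child chunks
total_docs = collection.count_documents({})
chunk_docs = collection.count_documents({"embedding": {"$exists": True}})
print(f"{total_docs} documents in the collection, of which {chunk_docs} are embedded child chunks")
```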
Even in parent document retrieval, the first step is to retrieve the child chunks that are most relevant to the user query using semantic/vector search. To perform vector search in MongoDB Atlas, you first need to create a vector search index:
```python
from pymongo.operations import SearchIndexModel
from pymongo.errors import OperationFailure

VS_INDEX_NAME = "vector_index"

# Vector search index definition
model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1536,
                "similarity": "cosine",
            }
        ]
    },
    name=VS_INDEX_NAME,
    type="vectorSearch",
)

# Check if the index already exists, if not create it
try:
    collection.create_search_index(model=model)
    print(
        f"Successfully created index {VS_INDEX_NAME} for collection {COLLECTION_NAME}"
    )
except OperationFailure:
    print(
        f"Duplicate index {VS_INDEX_NAME} found for collection {COLLECTION_NAME}. Skipping index creation."
    )
```
The above code:
- Specifies the name of the vector search index (`VS_INDEX_NAME`).
- Creates the vector search index definition, which contains the path to the embeddings field in the documents (`path`), the number of embedding dimensions (`numDimensions`), and the similarity metric to find nearest neighbors (`similarity`).
- Checks if a vector search index with the name `VS_INDEX_NAME` exists on the `COLLECTION_NAME` collection. If it does not, only then does it create the vector search index.
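Note that index creation is asynchronous, so queries issued immediately afterward may return no results. If you want to wait until the index is ready, you can poll its status. The sketch below assumes a recent PyMongo version that supports `list_search_indexes` and that Atlas reports a `queryable` flag for the index:

```python
import time

# Sketch: poll until the vector search index is queryable (assumes PyMongo >= 4.5)
while True:
    indexes = list(collection.list_search_indexes(VS_INDEX_NAME))
    if indexes and indexes[0].get("queryable"):
        print(f"Index {VS_INDEX_NAME} is ready to query")
        break
    time.sleep(5)
```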
To bring this all together, let’s look at how to use parent document retrieval in RAG and agentic workflows.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Retrieve and parse documents
retrieve = {
    "context": parent_doc_retriever
    | (lambda docs: "\n\n".join([d.page_content for d in docs])),
    "question": RunnablePassthrough(),
}
template = """Answer the question based only on the following context. If no context is provided, respond with I DON'T KNOW: \
{context}

Question: {question}
"""
# Define the chat prompt
prompt = ChatPromptTemplate.from_template(template)
# Define the model to be used for chat completion
llm = ChatOpenAI(temperature=0, model="gpt-4o-2024-11-20")
# Parse output as a string
parse_output = StrOutputParser()
# Naive RAG chain
rag_chain = retrieve | prompt | llm | parse_output
```
The above code creates a RAG workflow with parent document retrieval in LangChain. At a high level, it does the following:
- Gathers context to answer questions using the `parent_doc_retriever` we created in Step 5
- Creates a prompt template (`prompt`) with a system prompt and placeholders for the context and user question
- Initializes the chat completion LLM (`llm`) to use for generating responses
- Creates a simple output parser (`parse_output`) to parse the LLM output as a string
- Chains all the above components using LangChain’s pipe (`|`) notation to create a simple RAG workflow (`rag_chain`)
An example response from the RAG chain is as follows:
```python
print(rag_chain.invoke("How do I improve slow queries in MongoDB?"))

To improve slow queries in MongoDB, you can follow these steps:

1. **Use the Performance Advisor**:
   - The Performance Advisor monitors slow queries and suggests new indexes to improve query performance.
   - Review the suggested indexes, especially those with high Impact scores and low Average Query Targeting scores, and create them if they align with your indexing strategies.

2. **Analyze Query Performance**:
   - Use the **Query Profiler** to explore slow-running operations and their key performance statistics for up to the last 24 hours.
   - Use the **Real-Time Performance Panel (RTPP)** to evaluate query execution times and the ratio of documents scanned to documents returned.

3. **Monitor Query Latency**:
   - Use **Namespace Insights** to monitor collection-level query latency and view query latency metrics and statistics.

4. **Fix Inefficient Queries**:
   - Address `Query Targeting` alerts by adding indexes to support inefficient queries.
   - Use the `cursor.explain()` command to analyze query plans and identify inefficiencies.

5. **Follow Best Practices**:
   - Create queries that are supported by existing indexes.
   - Avoid large array fields in documents that are costly to search and index.
   - Optimize and remove unused or inefficient indexes to balance read and write performance.
   - Perform rolling index builds to minimize performance impact on replica sets and sharded clusters.

6. **Configure Slow Query Threshold**:
   - Adjust the slow query threshold to identify slow queries more effectively. By default, Atlas dynamically adjusts this threshold, but you can set a fixed threshold of 100 milliseconds if needed.
...
   - Ensure queries are supported by indexes.
   - Optimize queries involving `$lookup` or large array fields.

By implementing these steps, you can identify and resolve slow queries, improving overall query performance in MongoDB.
```
Notice the very detailed response, which includes not only steps to fix slow queries but also ways to analyze and monitor query performance, as well as best practices for writing MongoDB queries.
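To see which parent documents were used as context for this answer, you can also invoke the retriever directly and inspect the metadata of the documents it returns:

```python
# Optional: inspect the parent documents retrieved for the same query
retrieved_docs = parent_doc_retriever.invoke("How do I improve slow queries in MongoDB?")
for doc in retrieved_docs:
    print(doc.metadata.get("title"), "->", doc.metadata.get("url"))
```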
In the context of AI agents, you can provide the parent document retriever as one of the tools that an agent can use. Let’s see how to create a basic tool-calling agent using LangGraph, a framework from LangChain that allows you to orchestrate LLM applications as graphs.
First, let’s convert the `parent_doc_retriever` into an agent tool. In LangChain, creating tools is as simple as using the `@tool` decorator on a Python function:

```python
from langchain.agents import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from typing import Annotated, Dict
from langgraph.graph.message import add_messages
from typing_extensions import TypedDict
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph import StateGraph, START, END

# Converting the retriever into an agent tool
@tool
def get_info_about_mongodb(user_query: str) -> str:
    """
    Retrieve information about MongoDB.

    Args:
        user_query (str): The user's query string.

    Returns:
        str: The retrieved information formatted as a string.
    """
    docs = parent_doc_retriever.invoke(user_query)
    context = "\n\n".join([d.page_content for d in docs])
    return context

tools = [get_info_about_mongodb]
```
Next, let’s define the prompt for the agent and give it access to the tool(s) defined above:
```python
# Define the LLM to use as the brain of the agent
llm = ChatOpenAI(temperature=0, model="gpt-4o-2024-11-20")
# Agent prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "You are a helpful AI assistant."
            " You are provided with tools to answer questions about MongoDB."
            " Think step-by-step and use these tools to get the information required to answer the user query."
            " Do not re-run tools unless absolutely necessary."
            " If you are not able to get enough information using the tools, reply with I DON'T KNOW."
            " You have access to the following tools: {tool_names}."
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)
# Partial the prompt with tool names
prompt = prompt.partial(tool_names=", ".join([tool.name for tool in tools]))
# Bind tools to LLM
llm_with_tools = prompt | llm.bind_tools(tools)
```
The above code:
- Instantiates the LLM (`llm`) we want to use as the “brain” of our agent.
- Defines the prompt (`prompt`) for the agent, with placeholders for the tool names and user messages.
- Binds the LLM with the tool(s) defined previously.
Now, let’s orchestrate the agent using LangGraph. LangGraph allows you to build LLM systems as graphs. The graph’s nodes are functions or tools to perform specific tasks, while the edges define routes between nodes—these can be fixed, conditional, or even cyclic. Each graph has a state which is a shared data structure that all the nodes can access and make updates to. Let’s go ahead and define the state, nodes, and edges of our agent’s graph:
```python
# Define graph state
class GraphState(TypedDict):
    messages: Annotated[list, add_messages]

def agent(state: GraphState) -> Dict[str, List]:
    """
    Agent node

    Args:
        state (GraphState): Graph state

    Returns:
        Dict[str, List]: Updates to the graph state
    """
    messages = state["messages"]
    response = llm_with_tools.invoke(messages)
    # We return a list, because this will get added to the existing list
    return {"messages": [response]}

# Convert tools into a graph node
tool_node = ToolNode(tools)

# Parameterize the graph with the state
graph = StateGraph(GraphState)
# Add graph nodes
graph.add_node("agent", agent)
graph.add_node("tools", tool_node)
# Add graph edges
graph.add_edge(START, "agent")
graph.add_edge("tools", "agent")
graph.add_conditional_edges(
    "agent",
    tools_condition,
    {"tools": "tools", END: END},
)
# Compile the graph
app = graph.compile()

# Execute the agent and view outputs
inputs = {
    "messages": [
        ("user", "How do I improve slow queries in MongoDB?"),
    ]
}

for output in app.stream(inputs):
    for key, value in output.items():
        print(f"Node {key}:")
        print(value)

print("---FINAL ANSWER---")
print(value["messages"][-1].content)
```
The above code:
- Defines the graph’s state (`GraphState`). In our graph, we only want to track the user inputs and LLM responses (`messages`) in the state, but you can track other custom attributes.
- Defines the agent node, which is essentially a Python function (`agent`). This function reads existing messages from the graph state, makes a call to the LLM, and appends the response back to the graph state.
- Converts the tool(s) defined previously into a node using the `ToolNode` class.
- Initializes the graph (`StateGraph`), parameterized by the graph’s state.
- Adds the nodes and edges to the graph. Notice the conditional edge that uses LangGraph’s pre-built `tools_condition` function to route to the `ToolNode` if the last message has tool calls; otherwise, it routes to the `END` node.
- Compiles the graph using the `compile()` method.
- Executes the graph in streaming mode using a test input.
In this tutorial, we learned about parent document retrieval and how it can help overcome the limitations of chunking at generation time while retaining its benefits for embedding. We also highlighted some use cases where this technique is particularly useful. Finally, we saw how parent document retrieval works in MongoDB and implemented it in RAG and Agentic workflows using MongoDB’s LangChain integration.
Now that you have a good understanding of this technique, check out the following tutorials to explore different chunking strategies with parent document retrieval, or evaluate this retrieval technique against others:
As always, if you have further questions as you build your AI applications, please reach out to us in our Generative AI community forums.