How to Build a RAG System With LlamaIndex, OpenAI, and MongoDB Vector Database

Richmond Alake10 min read • Published Feb 16, 2024 • Updated Feb 16, 2024
Facebook Icontwitter iconlinkedin icon


Large language models (LLMs) substantially benefit business applications, especially in use cases surrounding productivity. Although LLMs and their applications are undoubtedly advantageous, relying solely on the parametric knowledge of LLMs to respond to user inputs and prompts proves insufficient for private data or queries dependent on real-time data. This is why a non-parametric secure knowledge source that holds sensitive data and can be updated periodically is required to augment user inputs to LLMs with current and relevant information.
Retrieval-augmented generation (RAG) is a system design pattern that leverages information retrieval techniques and generative AI models to provide accurate and relevant responses to user queries by retrieving semantically relevant data to supplement user queries with additional context, combined as input to LLMs.
RAG_STACK_POLM Key takeaways:
  • Overview of RAG
  • Understanding the AI Stack
  • How to build an end-to-end RAG system with MongoDB, LlamaIndex, and OpenAI

What is an AI stack?

This tutorial will implement an end-to-end RAG system using the OLM (OpenAI, LlamaIndex, and MongoDB) or POLM (Python, OpenAI, LlamaIndex, MongoDB) AI Stack. The AI stack, or GenAI stack, refers to the composition of models, databases, libraries, and frameworks used to build and develop modern applications with generative AI capabilities.
A typical AI stack implements an infrastructure that leverages parametric knowledge from the LLM and non-parametric knowledge from data to augment user queries. Components of the AI stack include the following: models, orchestrators or integrators, and operational and vector databases. In this tutorial, MongoDB will act as both the operational and vector database.
AI stack componentOLM/POLM stack
LanguageP ython
Model providerO penAI (GPT-3.5, GPT-4)
LLM orchestratorL lamaIndex
Operational and vector databaseM ongoDB Atlas

Step 1: install libraries

The code snippet below installs various libraries that will provide functionalities to access LLMs, reranking models, databases, and collection methods, abstracting complexities associated with extensive coding into a few lines and method calls.
  • LlamaIndex : data framework that provides functionalities to connect data sources (files, PDFs, website or data source) to both closed (OpenAI, Cohere) and open source (Llama) large language models; the LlamaIndex framework abstracts complexities associated with data ingestion, RAG pipeline implementation, and development of LLM applications (chatbots, agents, etc.).
  • LlamaIndex (MongoDB): LlamaIndex extension library that imports all the necessary methods to connect to and operate with the MongoDB Atlas database.
  • LlamaIndex (OpenAI): LlamaIndex extension library that imports all the necessary methods to access the OpenAI embedding models.
  • PyMongo: a Python library for interacting with MongoDB that enables functionalities to connect to a cluster and query data stored in collections and documents.
  • Hugging Face datasets: Hugging Face library holds audio, vision, and text datasets.
  • Pandas : provides data structure for efficient data processing and analysis using Python.

Step 2: data sourcing and OpenAI key setup

The command below assigns an OpenAI API key to the environment variable OPENAI_API_KEY. This ensures LlamaIndex creates an OpenAI client with the provided OpenAI API key to access features such as LLM models (GPT-3, GPT-3.5-turbo, and GPT-4) and embedding models (text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large).
The data utilised in this tutorial is sourced from Hugging Face datasets, specifically the AIatMongoDB/embedded_movies dataset. A datapoint within the movie dataset contains information corresponding to a particular movie; plot, genre, cast, runtime, and more are captured for each data point. After loading the dataset into the development environment, it is converted into a Pandas data frame object, which enables data structure manipulation and analysis with relative ease.

Step 3: data cleaning, preparation, and loading

The operations within this step focus on enforcing data integrity and quality. The first process ensures that each data point's plot attribute is not empty, as this is the primary data we utilise in the embedding process. This step also ensures we remove the plot_embedding attribute from all data points as this will be replaced by new embeddings created with a different model, the text-embedding-3-small.
An embedding object is initialised from the OpenAIEmbedding model, part of the llama_index.embeddings module. Specifically, the OpenAIEmbedding model takes two parameters: the embedding model name, text-embedding-3-small for this tutorial, and the dimensions of the vector embedding.
The code snippet below configures the embedding model, and LLM utilised throughout the development environment. The LLM utilised to respond to user queries is the default OpenAI model enabled via LlamaIndex and is initialised with the OpenAI() class. To ensure consistency within all consumers of LLMs and their configuration, LlamaIndex provided the "Settings" module, which enables a global configuration of the LLMs and embedding models utilised in the environment.
Next, it's crucial to appropriately format the dataset and its contents for MongoDB ingestion. In the upcoming steps, we'll transform the current structure of the dataset, dataset_df — presently a DataFrame object — into a JSON string. This dataset conversion is done in the line documents = dataset_df.to_json(orient='records'), which assigns the JSON format to the documents.
By specifying orient='records', each row of the DataFrame is converted into a separate JSON object.
The following step creates a list of Python dictionaries, documents_list, each representing an individual record from the original DataFrame. The final step in this process is to convert each dictionary into manually constructed documents, which are first-class citizens that hold information extracted from a data source. Documents within LlamaIndex hold information, such as metadata, that is utilised in downstream processing and ingestion stages in a RAG pipeline.
One important point to note is that when creating a LlamaIndex document manually, it's possible to configure the attributes of the documents that are utilised when passed as input to embedding models and LLMs. The excluded_llm_metadata_keys and excluded_embed_metadata_keys arguments on the document class constructor take a list of attributes to ignore when generating inputs for downstream processes within a RAG pipeline. A reason for doing this is to limit the context utilised within embedding models for more relevant retrievals, and in the case of LLMs, this is used to control the metadata information combined with user queries. Without configuration of either of the two arguments, a document by default utilises all content in its metadata as embedding and LLM input.
At the end of this step, a Python list contains several documents corresponding to each data point in the preprocessed dataset.
The final step of processing before ingesting the data to the MongoDB vector store is to convert the list of LlamaIndex documents into another first-class citizen data structure known as nodes. Once we have the nodes generated from the documents, the next step is to generate embedding data for each node using the content in the text and metadata attributes.

Step 4: database setup and connection

Before moving forward, ensure the following prerequisites are met:
  • Database cluster setup on MongoDB Atlas
  • Obtained the URI to your cluster
For assistance with database cluster setup and obtaining the URI, refer to our guide for setting up a MongoDB cluster, and our guide to get your connection string. Alternatively, follow Step 5 of this article on using embeddings in a RAG system, which offers detailed instructions on configuring and setting up the database cluster.
Once you have successfully created a cluster, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database. The database will be named movies, and the collection will be named movies_records.
Creating a database and collection via MongoDB Atlas
Securely store the URI within your development environment after setting up the database and obtaining the Atlas cluster connection URI.
This guide uses Google Colab, which offers a feature for secure storage of environment secrets. These secrets can then be accessed within the development environment. Specifically, the code mongo_uri = userdata.get('MONGO_URI') retrieves the URI from the secure storage.
The code snippet below also utilises PyMongo to create a MongoDB client object, representing the connection to the cluster and enabling access to its databases and collections.
The following code guarantees that the current database collection is empty by executing the delete_many() operation on the collection.

Step 5: vector search index creation

Creating a vector search index within the movies_records collection is essential for efficient document retrieval from MongoDB into our development environment. To achieve this, refer to the official guide on vector search index creation.
In the creation of a vector search index using the JSON editor on MongoDB Atlas, ensure your vector search index is named vector_index and the vector search index definition is as follows:
After setting up the vector search index, data can be ingested and retrieved efficiently. Data ingestion is a trivial process achieved with less than three lines when leveraging LlamaIndex.

Step 6: data ingestion to vector database

Up to this point, we have successfully done the following:
  • Loaded data sourced from Hugging Face
  • Provided each data point with embedding using the OpenAI embedding model
  • Set up a MongoDB database designed to store vector embeddings
  • Established a connection to this database from our development environment
  • Defined a vector search index for efficient querying of vector embeddings
The code snippet below also initialises a MongoDB Atlas vector store object via the LlamaIndex constructor MongoDBAtlasVectorSearch. It's important to note that in this step, we reference the name of the vector search index previously created via the MongoDB Cloud Atlas interface. For this specific use case, the index name is vector_index.
The crucial method that executes the ingestion of nodes into a specified vector store is the .add() method of the LlamaIndex MongoDB instance.
The last line in the code snippet above creates a LlamaIndex index. Within LlamaIndex, when documents are loaded into any of the index abstraction classes — SummaryIndex, ``TreeIndex, KnowledgeGraphIndex, and especially VectorStoreIndex``` — an index that stores a representation of the original document is built in an in-memory vector store that also stores embeddings.
But since the MongoDB Atlas vector database is utilised in this RAG system to store the embeddings and also the index for our document, LlamaIndex enables the retrieval of the index from Atlas via the from_vector_store method of the VectorStoreIndex class.

Step 7: handling user queries

The next step involves creating a LlamaIndex query engine. The query engine enables the functionality to utilise natural language to retrieve relevant, contextually appropriate information from a data index. The as_query_engine method provided by LlamaIndex abstracts the complexities of AI engineers and developers writing the implementation code to process queries appropriately for extracting information from a data source.
For our use case, the query engine satisfies the requirement of building a question-and-answer application. However, LlamaIndex does provide the ability to construct a chat-like application with the Chat Engine functionality.


Incorporating RAG architectural design patterns improves LLM performance within modern generative AI applications and introduces a cost-conscious approach to building robust AI infrastructure. Building a robust RAG system with minimal code implementation with components such as MongoDB as a vector database and LlamaIndex as the LLM orchestrator is a straightforward process, as this article demonstrates.
In particular, this tutorial covered the implementation of a RAG system that leverages the combined capabilities of Python, OpenAI, LlamaIndex, and the MongoDB vector database, also known as the POLM AI stack.
It should be mentioned that fine-tuning is still a viable strategy for improving the capabilities of LLMs and updating their parametric knowledge. However, for AI engineers who consider the economics of building and maintaining GenAI applications, exploring cost-effective methods that improve LLM capabilities is worth considering, even if it is experimental.
The associated cost of data sourcing, the acquisition of hardware accelerators, and the domain expertise needed for fine-tuning LLMs and foundation models often entail significant investment, making exploring more cost-effective methods, such as RAG systems, an attractive alternative.
Notably, the cost implications of fine-tuning and model training underscore the need for AI engineers and developers to adopt a cost-saving mindset from the early stages of an AI project. Most applications today already, or will, have some form of generative AI capability supported by an AI infrastructure. To this point, it becomes a key aspect of an AI engineer's role to communicate and express the value of exploring cost-effective solutions to stakeholders and key decision-makers when developing AI infrastructure.
All code presented in this article is presented on GitHub. Happy hacking.


Q: What is a retrieval-augmented generation (RAG) system?
Retrieval-augmented generation (RAG) is a design pattern that improves the capabilities of LLMs by using retrieval models to fetch semantically relevant information from a database. This additional context is combined with the user's query to generate more accurate and relevant responses from LLMs.
Q: What are the key components of an AI stack in a RAG system?
The essential components include models (like GPT-3.5, GPT-4, or Llama), orchestrators or integrators for managing interactions between LLMs and data sources, and operational and vector databases for storing and retrieving data efficiently.
Top Comments in Forums
There are no comments on this article yet.
Start the Conversation

Facebook Icontwitter iconlinkedin icon
Rate this tutorial

Exact Matches in Atlas Search: Beginners Guide

Aug 30, 2022 | 6 min read

Using Atlas Data Federation to Control Access to Your Analytics Node

Jun 28, 2023 | 9 min read

Serverless Development with Kotlin, AWS Lambda, and MongoDB Atlas

Aug 01, 2023 | 6 min read

An Introduction to the MongoDB Atlas Data API

Sep 15, 2022 | 10 min read
Table of Contents