
What is retrieval-augmented generation (RAG)?

Large language models (LLMs) that power generative AI are amazing pieces of engineering and science, with an ability to reason as they create or generate something new. However, to make an LLM useful for your specific generative AI-powered application or project, you need to feed it your own relevant data. While LLMs are impressive, everyone has access to the same models, so your differentiation comes from feeding them your data - and that is what retrieval-augmented generation (RAG) enables you to do.


Large language models, or foundation models, are general-purpose models with broad world knowledge, but they lack proprietary, up-to-date information.

Large language models (LLMs) and foundation models are a type of artificial intelligence (AI) that can generate and understand multimodal data (e.g., text, code, images, video, audio, tables). They are trained on massive datasets and can be used for a variety of tasks, including translation, writing different kinds of creative content, composing video and music, answering questions in an informative way, and a whole lot more.

While LLMs seem to have access to all of the world's knowledge, they have some limitations. One limitation is that they can generate outputs that are not always accurate or up-to-date. This is because LLMs are trained on data that may have since become outdated, that is incomplete, or that lacks proprietary knowledge about a specific use case or domain. Additionally, LLMs can sometimes generate output that is biased or offensive.

Another limitation of LLMs is that they have difficulty accessing and manipulating knowledge from the real world. This is because LLMs are typically trained on static, text-based snapshots of data rather than live, real-world systems. As a result, LLMs may not have a good understanding of how the world works, or how to apply their knowledge to real-world problems.

The image shows an example of an LLM-powered application that does not use Retrieval-Augmented Generation.
Retrieval-augmented generation provides contextual, up-to-date data to make LLMs useful.

Retrieval-augmented generation is a technique that addresses the limitations of LLMs by providing them with access to contextual, up-to-date data. RAG implementations, sometimes referred to as RAG models or patterns, work by combining a pre-trained LLM with a retrieval system over readily accessible information. The retrieval system is responsible for finding relevant information in a knowledge library, such as a database. This enables the LLM, or foundation model, to generate a more accurate answer with up-to-date context that is relevant to the task at hand.

RAG models have been shown to be effective for a variety of knowledge-intensive tasks, including:

  • Language generation tasks like answering questions in a comprehensive and informative way, or generating different creative text formats, like poems, scripts, musical pieces, emails, letters, etc.
  • NLP tasks like providing summaries of conversations, audio recordings, and video calls.
  • Classification tasks, such as cybersecurity and compliance, or reasoning tasks, such as business planning.

RAG can also be used to let a generative AI-powered application observe some background state and tailor its generations accordingly. An example would be the ability to write code based on the code that a user is writing. Other examples include (see the sketch after this list):

  • Application context. Say you are building an AI-powered Excel assistant - it’d help if it knew the names of your sheets, the filename, selected cell ranges, etc. RAG feeds this “background activity info” into the prompt so that the LLM can tailor its help to your sheet.
  • Personal data (e.g., an agent-assist chatbot). Say you are building a customer support bot. The bot can pull in previous conversations and CRM history for that specific customer to tailor the conversation (not just the greeting, but the options offered, etc.). Without this history, the LLM can’t personalize effectively or help with existing issues.
  • Raw numbers, metrics, tabular data (e.g., CSV, Parquet, JSON). RAG is not limited to textual context; it works with quantitative information as well. A business intelligence (BI) chatbot would certainly be doing RAG on raw tabular data.
  • Other multimodal data types like images, video, and audio. Models such as DALL-E 2 can leverage text to create or augment images, and multimodal models can conversely describe images or video in natural language. With context from specific images or design form factors, RAG can make GenAI apps more powerful when it comes to creating marketing assets or generating summaries and translations from videos that contain very specific, context-heavy information.
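As a rough illustration of the agent-assist example above, here is a minimal sketch of augmenting an LLM prompt with a customer's prior interaction history. The names fetch_crm_history, build_augmented_prompt, and call_llm are hypothetical placeholders, not part of any particular library; substitute whatever CRM lookup and LLM client your application actually uses.

```python
# Minimal sketch: feed background state (CRM history) into the prompt.
# fetch_crm_history and call_llm are hypothetical placeholders.

def fetch_crm_history(customer_id: str) -> list[str]:
    # In a real application this would query your CRM or support database.
    return [
        "2024-01-12: Reported login failures after password reset.",
        "2024-01-13: Issue resolved by clearing stale sessions.",
    ]

def build_augmented_prompt(customer_id: str, question: str) -> str:
    history = "\n".join(fetch_crm_history(customer_id))
    return (
        "You are a customer support assistant.\n"
        f"Previous interactions with this customer:\n{history}\n\n"
        f"Customer question: {question}\n"
        "Answer using the history above where relevant."
    )

prompt = build_augmented_prompt(
    "cust-42", "My login is failing again. Is this related to my last ticket?"
)
# response = call_llm(prompt)  # send to whichever LLM your application uses
```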

Retrieval-augmented generation is also useful for data that can’t be incorporated as training data

  • Highly volatile / time-sensitive data: data such as stock market news quickly becomes stale. It therefore makes more sense to enhance the LLM with only the latest, up-to-date information, ideally at each request during inference time, as opposed to attempting to retrain LLMs with this corpus.
  • Sensitive data: Many of the top-performing LLMs (such as OpenAI's GPT or Anthropic's Claude) are pay-per-use and owned by those companies. Using personal, sensitive information in training data to fine-tune those LLMs can lead to private data leakage and is potentially dangerous. Thus, sometimes RAG is the only safe option.
Primary use cases of RAG.

Based on the above, the most suitable use cases of RAG include:

  • Question answering over any extrinsic domain knowledge, such as company-specific documentation and knowledge bases, live operational systems, back office systems, etc.: by definition, using LLMs with any data outside of the LLM’s knowledge cutoff requires RAG. The same goes for question answering over highly time-sensitive, fast-changing context - data that gets outdated quickly is impractical to integrate into LLMs via fine-tuning.
  • For reducing hallucinations and increasing factual accuracy: generally speaking, RAG can improve factual accuracy even for answering questions about information contained within the LLM’s training corpus. That’s because RAG turns the question answering task into an “open-book quiz” task, which is easier than an unbounded question answering task.
  • Personalization is a canonical use case for RAG. In this case, the prompt is augmented with user data. Optionally, any PII can be scrubbed before being inserted into the prompt (see the sketch after this list).
  • Providing contextual answers (inside copilots). As GitHub Copilot demonstrates, LLM generations are more relevant when grounded in the application state (the current document, the overall project metadata, which URL or page is currently being visited, etc.).
  • Any GenAI app that works with highly domain-specific contexts. Examples include healthcare, financial services, legal discovery, and science and engineering. In these domains, training data is often sparse, so RAG is essential to building useful GenAI apps.
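Continuing the personalization point above, the snippet below sketches a naive, purely illustrative way to scrub a couple of common PII patterns from user data before it is inserted into the prompt. The regular expressions and sample values are assumptions for the example; production systems typically rely on dedicated PII-detection tooling rather than hand-rolled regexes.

```python
import re

# Illustrative only: naive regex-based scrubbing of emails and phone numbers
# before user data is added to an LLM prompt.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

user_profile = "Jane Doe, jane@example.com, +1 (555) 010-0199, premium tier since 2021"
prompt = (
    "Personalize the response for this customer profile:\n"
    f"{scrub_pii(user_profile)}\n\n"
    "Question: Which plan upgrades am I eligible for?"
)
```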
Why retrieval-augmented generation? What are the alternatives to RAG when building generative AI applications?
  • Train your own LLM: While there could be justification for training your own LLM, it is likely going to be far too expensive and time-consuming to create anything competitive with the many commercial (OpenAI GPTs) and open source models (Meta’s LLaMA) available.
  • Fine-tuning an existing LLM: a technique where a pre-trained LLM is retrained on a smaller dataset of task-specific data. Fine-tuning can be effective for improving the performance of an LLM on a specific task, but it can also be time-consuming and expensive. Fine-tuning never ends: as new data becomes available, the model has to be fine-tuned again. And when your GenAI app demands access to live operational data, fine-tuning is not going to work for you.
Why retrieval-augmented generation over fine-tuning an LLM?
  • Fine-tuning is another way to use LLMs with “custom data,” but unlike RAG, which is like giving an LLM an open-book quiz, fine-tuning is like giving it entirely new memories or a lobotomy. Fine-tuning tailors the model so that you can change its performance, behavior, cost profile, etc. It is time- and resource-intensive, generally not viable for grounding LLMs in specific context, and especially unsuitable for live, operational data from your business.
The core building blocks of a basic RAG architecture.
A basic retrieval-augmented generation architecture consists of four main components (a minimal sketch follows the list):
  • A pre-trained LLM: The LLM is responsible for generating the response - text, and for multimodal models, images, audio, or video.
  • Vector search (or semantic search): The retrieval system is responsible for finding relevant information in a knowledge base that is external to the LLM. There are a variety of general-purpose databases and single-purpose vector databases to choose from that can store vector embeddings and run approximate nearest neighbor search queries against them. Vector search is core to precisely augmenting a general-purpose LLM with proprietary knowledge.
  • Vector embeddings: Sometimes referred to simply as "vectors" or "embeddings," vector embeddings are numerical representations capturing the semantic, or underlying, meaning of a piece of data. Generally speaking, they are an array of floats, where each float represents a single dimension of the representation.
  • Orchestration: This layer is responsible for combining the information from the retrieval system with the prompt sent to the LLM, and for assembling the final output.
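To make these components concrete, here is a minimal sketch of the retrieval side using MongoDB Atlas Vector Search via pymongo. It assumes a "retail.products" collection whose documents already store a vector in an "embedding" field, an Atlas Vector Search index named "product_index", and an embed() placeholder standing in for whatever embedding model you use. The index name, field names, and connection string are assumptions for this example; check the current Atlas Vector Search documentation for the exact $vectorSearch options.

```python
from pymongo import MongoClient

# Sketch of the retrieval component (assumed names: database "retail",
# collection "products", vector index "product_index", vector field "embedding").

def embed(text: str) -> list[float]:
    # Placeholder: call your embedding model here and return its vector.
    raise NotImplementedError

client = MongoClient("<your-atlas-connection-string>")
collection = client["retail"]["products"]

def retrieve_context(question: str, k: int = 5) -> list[dict]:
    results = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "product_index",        # name of the Atlas Vector Search index
                "path": "embedding",             # field holding the stored vectors
                "queryVector": embed(question),  # embedding of the user's question
                "numCandidates": 100,            # candidates considered by the ANN search
                "limit": k,                      # number of documents to return
            }
        },
        {"$project": {"_id": 0, "name": 1, "description": 1}},
    ])
    return list(results)
```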

The following diagram shows a basic RAG architecture with the same retail example as before:

A large language model being made useful in a generative AI application by leveraging retrieval-augmented generation.
As a workaround for this lack of domain-specific context, retrieval-augmented generation is performed as follows (a short sketch follows the list):
  • We fetch the most relevant product descriptions from a database (often a database with vector search) that contains the latest product catalog
  • Then, we insert (augment) these descriptions into the LLM prompt
  • Finally, we instruct the LLM to “reference” this up-to-date product information in answering the question
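Putting those three steps together, the sketch below reuses retrieve_context() from the earlier architecture sketch plus a hypothetical call_llm() placeholder: it retrieves product descriptions, augments the prompt with them, and instructs the LLM to reference them.

```python
# Inference-time RAG flow for the retail example, reusing retrieve_context()
# from the sketch above. call_llm() is a placeholder for your LLM client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # call your chosen LLM provider here

def answer_with_rag(question: str) -> str:
    # 1. Retrieve the most relevant product descriptions.
    docs = retrieve_context(question)
    context = "\n".join(f"- {d['name']}: {d['description']}" for d in docs)

    # 2. Augment the prompt with the retrieved descriptions.
    prompt = (
        "Answer the customer's question using only the product information below.\n"
        f"Product catalog excerpts:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Instruct the LLM to generate an answer grounded in that context.
    return call_llm(prompt)
```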
Three things to consider from the above:
  • Retrieval-augmented generation is a purely inference-time technique (no retraining required). Steps 1-3 above all happen at inference time. No changes to the model are required (e.g., modifying the model weights).
  • Retrieval-augmented generation is well suited for real-time customization of LLM generations. Because no retraining is involved and everything is done via in-context learning, the retrieval step itself is fast (typically well under 100 ms), making RAG well suited for use inside real-time operational applications.
  • Retrieval-augmented generation makes LLM generations more accurate and useful. Each time the context changes, the LLM will generate a different response. Thus, RAG makes LLM generations depend on whatever context was retrieved.
Keeping RAG simple with minimal complexity, yet sophisticated enough to perform reliably at scale.

Achieving a performant yet minimally complex RAG architecture starts with choosing the right systems. When choosing the systems, or technologies, for a RAG implementation, it is important to choose a system, or systems, that can:

  • Support new vector data requirements without adding tremendous sprawl, cost, and complexity to your IT operations.
  • Ensure that the generative AI experiences built have access to live data with minimal latency.
  • Have the flexibility to accommodate new data and app requirements and allow development teams to stay agile while doing so.
  • Best equip dev teams to bring the entire AI ecosystem to their data, not the other way around.

Options range from single-purpose vector databases, to document and relational databases with native vector capabilities, to data warehouses and lakehouses. However, single-purpose vector databases immediately add sprawl and complexity. Data warehouses and lakehouses are inherently designed for long-running, analytical queries on historic data, as opposed to the high-volume, low-latency, fresh-data requirements of the GenAI apps that RAG powers. Additionally, relational databases bring rigid schemas that limit the flexibility to accommodate new data and app requirements easily. That leaves document databases with native, or built-in, vector capabilities. In particular, MongoDB is built on the flexible document model and has native vector search, making it a vector database for RAG in addition to the industry-leading database for any modern application.

Taking the power of LLMs to the next level with additional capabilities in your RAG implementation.

In addition to the core components, there are a number of additional capabilities that can be added to a RAG implementation to take the power of LLMs to the next level. Some of these additional capabilities include:

  • Multimodality: Multimodal RAG models can generate text that is based on both text and non-text data, such as images, videos, and audio. Having this multimodal data stored side by side with live operational data makes the RAG implementation easier to design and manage.
  • Defining additional filters in the vector search query: The ability to add keyword search, geospatial search, and point and range filters to the same vector query can add accuracy and speed to the context provided to the LLM (see the sketch after this list).
  • Domain specificity: Domain-specific RAG models can be trained on data from a specific domain, such as healthcare or finance. This allows the RAG model to generate more accurate and relevant text for that domain.
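As an example of the filtering capability, the sketch below extends the earlier retrieve_context() example with a pre-filter. It assumes the product documents carry "category" and "in_stock" fields that are declared as filter fields in the Atlas Vector Search index definition; the exact filter syntax supported by $vectorSearch should be verified against the current documentation.

```python
# Vector search combined with a pre-filter (continuing the earlier sketch and
# reusing collection and embed()). Assumes "category" and "in_stock" are
# indexed as filter fields in the "product_index" index definition.

def retrieve_filtered_context(question: str, category: str, k: int = 5) -> list[dict]:
    return list(collection.aggregate([
        {
            "$vectorSearch": {
                "index": "product_index",
                "path": "embedding",
                "queryVector": embed(question),
                "numCandidates": 100,
                "limit": k,
                "filter": {
                    "$and": [
                        {"category": {"$eq": category}},  # restrict to one category
                        {"in_stock": {"$eq": True}},      # only items currently in stock
                    ]
                },
            }
        },
        {"$project": {"_id": 0, "name": 1, "description": 1}},
    ]))
```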
Ensuring your generative AI-powered application will be secure, performant, reliable, and scalable when it goes global.

There are a number of things that can be done to ensure that a GenAI-powered application built with a RAG architecture is secure, performant, reliable, and scalable when it goes global. Some of these include:

  • Use a platform that is secure and has the proper data governance capabilities: Data governance is a broad term encompassing everything you do to ensure data is secure, private, accurate, available, and usable. It includes the processes, policies, measures, technology, tools, and controls around the data lifecycle. Thus, the platform should be secure by default, have end-to-end encryption, and have achieved compliance at the highest levels.
  • Use a cloud-based platform: In addition to the security and scalability features cloud-based platforms provide, the major cloud providers are some of the leading innovators in AI infrastructure. Choosing a platform that is cloud agnostic allows teams to take advantage of AI innovations wherever they land.
  • Use a platform that can isolate vector workload infrastructure from other database infrastructure: It is important that regular OLTP workloads and vector workloads do not share infrastructure so that the two workloads can run on hardware optimized for each, and so that they do not compete for resources while still being able to leverage the same data.
  • Use a platform that has been proven at scale: It’s one thing for a provider to say it can scale, but does it have a history and a track record with global, enterprise customers? Does it have mission-critical fault tolerance and the ability to scale horizontally, and can it prove it with customer examples?

By following these tips, it is possible to build GenAI-powered applications with RAG architectures that are secure, performant, reliable, and scalable.

With the introduction of Atlas Vector Search, MongoDB’s leading developer data platform provides teams with a vector database that enables building sophisticated, performant RAG architectures that can perform at scale. All this while maintaining the highest levels of security and cloud agnosticism, and most importantly, without adding complexity and unnecessary costs.

Get Started With MongoDB Atlas
