
Using OpenAI Latest Embeddings in a RAG System With MongoDB

Richmond Alake15 min read • Published Feb 01, 2024 • Updated Feb 01, 2024
AI • Python • Atlas


Introduction

OpenAI recently released new embedding and moderation models. This article explores the step-by-step implementation process of utilizing one of the new embedding models, text-embedding-3-small, within a retrieval-augmented generation (RAG) system powered by the MongoDB Atlas vector database.

What is an embedding?

An embedding is a mathematical representation of data within a high-dimensional space, typically referred to as a vector space. Within a vector space, vector embeddings are positioned based on their semantic relationships, concepts, or contextual relevance. This spatial relationship within the vector space effectively mirrors the associations in the original data, making embeddings useful in various artificial intelligence domains, such as machine learning, deep learning, generative AI (GenAI), natural language processing (NLP), computer vision, and data science.
Creating an embedding involves mapping data related to entities like words, products, audio, and user profiles into a numerical format. In NLP, this process involves transforming words and phrases into vectors, converting their semantic meanings into a machine-readable form.
AI applications that utilize RAG architecture design patterns leverage embeddings to augment the large language model (LLM) generative process by retrieving relevant information from a data store such as MongoDB Atlas. By comparing embeddings of the query with those in the database, RAG systems incorporate external knowledge, improving the relevance and accuracy of the responses.
Naive RAG architecture diagram
OpenAI recently introduced two new embedding models: text-embedding-3-small and text-embedding-3-large. The text-embedding-3-small model offers a compact and highly efficient solution, ideal for applications requiring speed and agility, while the text-embedding-3-large model provides a more detailed and powerful vector representation suitable for complex and nuanced data processing tasks.
|                | ada v2 | text-embedding-3-small | text-embedding-3-large |
|----------------|--------|------------------------|------------------------|
| Embedding size | 1536   | 256, 512, and 1536     | 256, 1024, and 3072    |

Key takeaways

  • OpenAI's embedding models: Get introduced to OpenAI's new embedding models, text-embedding-3-small and text-embedding-3-large, and their applications.
  • Practical implementation steps: Follow through practical steps, including library installation, data loading and preprocessing, creating embeddings, and data ingestion into MongoDB.
  • Vector Search index in MongoDB: Learn to create and use a vector search index for efficient retrieval and user query processing.
  • AI-driven query responses: Understand how to handle user queries and generate AI responses, integrating RAG system insights for more accurate answers.
  • Real-world application insight: Gain hands-on experience in implementing an advanced RAG system for practical uses like a movie recommendation engine.
The following section introduces a series of steps that explain how to utilize the new OpenAI embedding model text-embedding-3-small to embed the plot data points of movies within a movie dataset, powering a RAG system that answers user queries based on the movie collection.
The steps also cover the typical stages within RAG systems and pipelines that AI engineers are likely to encounter:
  1. Data loading: importing and accessing datasets from various data sources for processing and analysis; the step involves making data available in the application environment.
  2. Data cleaning and preparation: refining the dataset by removing inaccuracies, filling missing values, and formatting data for use in the downstream stages in the pipeline.
  3. Data ingestion and indexing: moving the processed data into a data store such as MongoDB Atlas database and creating indexes to optimize retrieval efficiency and search performance.
  4. Querying: executing search queries against the database to retrieve relevant data based on specific criteria or user inputs.

Step 1: libraries installation

The development environment for demonstrating the text-embedding-3-small embedding model and the retrieval system requires setting up a few libraries and tools, installed with the Python package manager, pip.
Below are brief explanations of the tools and libraries utilized within the implementation code:
  • datasets: This library is part of the Hugging Face ecosystem. By installing 'datasets', we gain access to a number of pre-processed and ready-to-use datasets, which are essential for training and fine-tuning machine learning models or benchmarking their performance.
  • pandas: This is a data science library that provides robust data structures and methods for data manipulation, processing, and analysis.
  • openai: This is the official Python client library for accessing OpenAI's suite of AI models and tools, including GPT and embedding models.
  • pymongo: PyMongo is a Python toolkit for MongoDB. It enables interactions with a MongoDB database.
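The four libraries above can be installed in a single step; a typical pip invocation (version pins omitted) would look like:

```shell
pip install datasets pandas openai pymongo
```

In a Google Colab notebook, the same command is run by prefixing it with an exclamation mark (`!pip install ...`).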

Step 2: data loading

The code snippet below shows the data loading phase, where the load_dataset function from the Hugging Face datasets library and the pandas library, aliased as pd, are imported into the development environment. The load_dataset function provides access to a wide range of datasets available in Hugging Face's repository.
Load the dataset titled AIatMongoDB/embedded_movies. This dataset is a collection of movie-related details that include attributes such as the title, release year, cast, and plot. A unique feature of this dataset is the plot_embedding field for each movie. These embeddings are generated using OpenAI's text-embedding-ada-002 model.
After loading the dataset, it is converted into a pandas DataFrame; this data format simplifies data manipulation and analysis. Display the first five rows using the head(5) function to gain an initial understanding of the data. This preview provides a snapshot of the dataset's structure and its various attributes, such as genres, cast, and plot embeddings.
Import libraries:
  • from datasets import load_dataset: imports the load_dataset function from the Hugging Face datasets library; this function is used to load datasets from Hugging Face's extensive dataset repository.
  • import pandas as pd: imports the pandas library, a fundamental tool in Python for data manipulation and analysis, using the alias pd.
Load the dataset:
  • dataset = load_dataset("AIatMongoDB/embedded_movies"): Loads the dataset named embedded_movies from the Hugging Face datasets repository; this dataset is provided by MongoDB and is specifically designed for embedding and retrieval tasks.
Convert dataset to pandas DataFrame:
  • dataset_df = pd.DataFrame(dataset['train']): converts the training portion of the dataset into a pandas DataFrame.
Preview the dataset:
  • dataset_df.head(5): displays the first five entries of the DataFrame.

Step 3: data cleaning and preparation

The next step cleans the data and prepares it for the next stage, which creates a new embedding data point using the new OpenAI embedding model.
Removing incomplete data:
  • dataset_df = dataset_df.dropna(subset=['plot']): ensures data integrity by removing any data point/row where the “plot” column is missing data; since “plot” is a vital component for the new embeddings, its completeness affects the retrieval performance.
Preparing for new embeddings:
  • dataset_df = dataset_df.drop(columns=['plot_embedding']): removes the existing “plot_embedding” column; since new embeddings will be created with OpenAI's "text-embedding-3-small" model, the existing embeddings (generated by a different model) are no longer needed.
  • dataset_df.head(5): allows us to preview the first five rows of the updated DataFrame to confirm the removal of the “plot_embedding” column and the data's readiness.
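The two cleaning operations can be demonstrated on a small in-memory sample, a hypothetical stand-in for the real DataFrame loaded in step 2:

```python
import pandas as pd

# Hypothetical rows standing in for the embedded_movies DataFrame.
dataset_df = pd.DataFrame([
    {"title": "Movie A", "plot": "A heist goes wrong.", "plot_embedding": [0.1, 0.2]},
    {"title": "Movie B", "plot": None, "plot_embedding": [0.3, 0.4]},
])

# Remove rows where the "plot" column is missing.
dataset_df = dataset_df.dropna(subset=["plot"])

# Drop the stale ada-002 embeddings; they will be regenerated with text-embedding-3-small.
dataset_df = dataset_df.drop(columns=["plot_embedding"])

print(dataset_df.head(5))
```
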

Step 4: create embeddings with OpenAI

This stage focuses on generating new embeddings using OpenAI's advanced model.
This demonstration utilizes a Google Colab notebook, where environment variables are configured explicitly within the notebook's Secrets section and accessed using the userdata module. In a production environment, the environment variables that store secret keys are usually stored in a .env file or equivalent.
An OpenAI API key is required to ensure the successful completion of this step. More details on OpenAI's embedding models can be found on the official site.
Setting up OpenAI API:
  • Imports and API key: Import the openai library and retrieve the API key from Google Colab's userdata.
  • Model selection: Set the variable EMBEDDING_MODEL to text-embedding-3-small.
Embedding generation function:
  • get_embedding: converts text into embeddings; it takes both the string input and the embedding model as arguments and generates the text embedding using the specified OpenAI model.
  • Input validation and API call: validates the input to ensure it's a valid string, then calls the OpenAI API to generate the embedding.
  • If the process encounters any issues, such as invalid input or API errors, the function returns None.
  • Applying to dataset: The function get_embedding is applied to the “plot” column of the DataFrame dataset_df. Each plot is transformed into an embedding stored in a new column, plot_embedding_optimised.
  • Preview updated dataset: dataset_df.head() displays the first few rows of the DataFrame.

Step 5: Vector database setup and data ingestion

MongoDB acts as both an operational and a vector database. It offers a database solution that efficiently stores, queries, and retrieves vector embeddings; the advantages of this lie in the simplicity of database maintenance, management, and cost.
To create a new MongoDB database, set up a database cluster:
  1. Register for a free MongoDB Atlas account, or for existing users, sign into MongoDB Atlas.
  2. Select the “Database” option on the left-hand pane to navigate to the Database Deployment page, which lists the deployment specifications of any existing clusters. Create a new database cluster by clicking on the "+Create" button.
Database deployments and cluster dashboard
3.   Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the “Create Cluster” button to deploy the newly created cluster. MongoDB also enables the creation of free clusters on the “Shared Tab.”
Note: Don’t forget to whitelist the IP for the Python host or 0.0.0.0/0 for any IP when creating proof of concepts.
Cluster configuration overview
4. After successfully creating and deploying the cluster, the cluster becomes accessible on the “Database Deployment” page.
Cluster dashboard and overview
5. Click on the “Connect” button of the cluster to view the option to set up a connection to the cluster via various language drivers.
Options overview for connecting to cluster
6. This tutorial only requires the cluster's URI (uniform resource identifier). Grab the URI and copy it into the Google Colab Secrets environment in a variable named MONGO_URI, or place it in a .env file or equivalent.
Connecting to Atlas cluster via URI and Python driver
7. Database connection setup:
  • MongoDB connection function: The get_mongo_client function is defined to establish a connection to MongoDB using the provided URI. It includes error handling to manage connection failures.
8. Data ingestion process:
  • Retrieving MongoDB URI: The MongoDB URI, which is crucial for connecting to the database, is obtained from the environment variables using userdata.get('MONGO_URI').
  • Establishing database connection: The script attempts to connect to MongoDB using this URI.
  • Database and collection selection: Once connected, the script selects the movies database and the movie_collection collection. This specifies where the data will be stored in MongoDB. If the database or collection does not exist, MongoDB creates them automatically.
  • Data conversion and insertion: The DataFrame, now containing the new embeddings, is converted into a dictionary format suitable for MongoDB using to_dict('records'). The insert_many method then ingests the data in a single batch.

Step 6: create a Vector Search index

This next step is mandatory for conducting efficient and accurate vector-based searches based on the vector embeddings stored within the documents in the movie_collection collection. Creating a Vector Search index enables efficient traversal of the documents to retrieve those whose embeddings match the query embedding based on vector similarity. Read more about MongoDB Vector Search indexes.
1. Navigate to the movie_collection collection in the movies database. At this point, the database is populated with several documents containing information about various movies, particularly within the action and romance genres.
Overview of record within the movies collection
2. Select the “Atlas Search” tab option on the navigation pane to create an Atlas Vector Search index. Click the “Create Search Index” button to create an Atlas Vector Search Index. Overview of Atlas Search capabilities
3. On the page to create a Vector Search index, select the Atlas Vector Search option that enables the creation of a vector search index by defining the index using JSON.
Creating a vector search index via MongoDB Atlas interface
4. The page depicted below enables the definition of the index via JSON and also provides the ability to name the vector search index. The name given to the index will be referenced in the implementation code in the following steps. For this tutorial, the name “vector_index” will be used.
5. To complete the creation of the vector search index, select the appropriate database and collection for which the index should be created. In this scenario, it is the “movies” database and the “movie_collection” collection. The JSON entered into the JSON editor should look similar to the following:
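Reconstructed from the field descriptions in this tutorial, the index definition for the plot_embedding_optimised field is:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding_optimised",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```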
  • fields: This is a list that specifies the fields to be indexed in the MongoDB collection along with the definition of the characteristic of the index itself.
  • numDimensions: Within each field item, numDimensions specifies the number of dimensions of the vector data. In this case, it is set to 1536. This number must match the dimensionality of the vector data stored in the field; 1536 is one of the output dimensions supported by OpenAI's text-embedding-3-small model.
  • path: The path field indicates the path to the data within the database documents to be indexed. Here, it is set to plot_embedding_optimised.
  • similarity: The similarity field defines the type of similarity distance metric that will be used to compare vectors during a search. Here, it is set to cosine, which measures the cosine of the angle between two vectors, effectively determining how similar or different these vectors are in their orientation in the vector space. Other available similarity metrics are euclidean and dotProduct. Find more information about how to index vector embeddings for vector search.
  • type: This field specifies the data type the index will handle. In this case, it is set to vector, indicating that this index is specifically designed for handling and optimizing searches over vector data.
JSON Editor to define vector search index
Now, the vector search index should be created successfully. Navigating back to the Atlas Search page should show the index named vector_index with a status of active.
Atlas search index overview

Step 7: perform vector search on user queries

This step combines all the activities in the previous step to provide the functionality of conducting vector search on stored records based on embedded user queries.
This step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline. The pipeline, consisting of the $vectorSearch and $project stages, queries using the generated vector and formats the results to include only the required information, such as plot, title, and genres, while incorporating a search score for each result. This selective projection enhances query performance by reducing data transfer and optimizes the use of network and memory resources, which is especially critical when handling large datasets. For AI engineers and developers considering data security at an early stage, the chance of sensitive data leaking to the client side can be minimized by carefully excluding fields irrelevant to the user's query.
Depiction of the aggregated pipeline for executing vector search queries
1. Vector Search custom function:
  • The vector_search function is designed to perform a sophisticated search within a MongoDB collection, utilizing the vector embeddings stored in the database.
  • It accepts two parameters: user_query, a string representing the user's search query, and collection, the MongoDB collection to be searched.
2. Query embedding and search pipeline:
  • Embedding generation: The function begins by generating an embedding for the user query using the get_embedding function.
  • Defining the search pipeline: A MongoDB aggregation pipeline is defined for the vector search. This pipeline uses the $vectorSearch operator to find documents whose embeddings closely match the query embedding. The pipeline specifies the index to use, the query vector, and the path to the embeddings in the documents and limits the number of candidate matches and the number of results returned.
  • Projection of results: The $project stage formats the output by including relevant fields like the plot, title, genres, and search score while excluding the MongoDB document ID.
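The pipeline and search function described above can be sketched as follows. The index name vector_index comes from step 6; numCandidates=150 and limit=4 are illustrative choices, and the embedding function is passed in as a parameter (e.g., get_embedding from step 4) to keep the sketch self-contained:

```python
def build_vector_search_pipeline(query_embedding, index_name="vector_index", limit=4):
    """Aggregation pipeline: $vectorSearch to match embeddings, $project to shape results."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "queryVector": query_embedding,
                "path": "plot_embedding_optimised",
                "numCandidates": 150,  # candidate pool considered before the final limit
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,  # exclude the MongoDB document ID
                "plot": 1,
                "title": 1,
                "genres": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

def vector_search(user_query, collection, embed_fn):
    """Embed the user query with embed_fn and run the pipeline against the collection."""
    query_embedding = embed_fn(user_query)
    if query_embedding is None:
        return "Invalid query or embedding generation failed."
    return list(collection.aggregate(build_vector_search_pipeline(query_embedding)))
```
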

Step 8: handling user query and result

The final step in the implementation phase focuses on the practical application of our vector search functionality and AI integration to handle user queries effectively. The handle_user_query function performs a vector search on the MongoDB collection based on the user's query and utilizes OpenAI's GPT-3.5 model to generate context-aware responses.
1. Functionality for query handling:
  • The handle_user_query function takes a user's query and the MongoDB collection as inputs.
  • It starts by executing a vector search on the collection based on the user query, retrieving relevant movie documents.
2. Generating AI-driven responses:
  • Context compilation: Next, the function compiles a context string from the search results, concatenating titles and plots of the retrieved movies.
  • OpenAI model integration: The openai.chat.completions.create function is called with the model gpt-3.5-turbo.
  • System and user roles: In the message sent to the OpenAI model, two roles are defined: system, which establishes the role of the AI as a movie recommendation system, and user, which provides the actual user query and the context.
3. Executing and displaying responses:
  • The handle_user_query function returns the AI-generated response and the search result context used.
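The flow above can be sketched as follows. Unlike the tutorial's version, which takes the collection and runs vector_search internally, this sketch accepts the search results directly so it stays self-contained; openai_client is assumed to be an openai.OpenAI instance:

```python
def build_context(search_results):
    """Concatenate titles and plots of the retrieved movies into one context string."""
    return "\n".join(
        f"Title: {doc.get('title', 'N/A')}, Plot: {doc.get('plot', 'N/A')}"
        for doc in search_results
    )

def handle_user_query(query, search_results, openai_client):
    """Generate a context-aware answer with gpt-3.5-turbo from vector search results."""
    context = build_context(search_results)
    completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            # System role: establish the AI as a movie recommendation system.
            {"role": "system", "content": "You are a movie recommendation system."},
            # User role: the actual query plus the retrieved context.
            {
                "role": "user",
                "content": f"Answer this user query: {query} with the following context:\n{context}",
            },
        ],
    )
    return completion.choices[0].message.content, context
```
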

Conclusion

The new OpenAI embedding models promise better multi-language retrieval and task-specific accuracy than previously released OpenAI embedding models. This article outlined the implementation steps for a RAG system that leverages the latest embedding model. View the GitHub repo for the implementation code.
In practical scenarios, lower-dimension embeddings that can maintain a high level of semantic capture are beneficial for Generative AI applications where the relevance and speed of retrieval are crucial to user experience and value.
Further advantages of lower embedding dimensions with high performance are:
  • Improved user experience and relevance: Relevance of information retrieval is optimized, directly impacting the user experience and value in AI-driven applications.
  • Comparison with previous model: In contrast to the previous ada v2 model, which only provided embeddings at a dimension of 1536, the new models offer more flexibility. The text-embedding-3-large extends this flexibility further with dimensions of 256, 1024, and 3072.
  • Efficiency in data processing: The availability of lower-dimensional embeddings aids in more efficient data processing, reducing computational load without compromising the quality of results.
  • Resource optimization: Lower-dimensional embeddings are resource-optimized, beneficial for applications running on limited memory and processing power, and for reducing overall computational costs.
Future articles will cover advanced topics, such as benchmarking embedding models and handling migration of embeddings.

Frequently asked questions

1. What is an embedding?

An embedding is a technique where data — such as words, audio, or images — is transformed into mathematical representations, vectors of real numbers in a high-dimensional space referred to as a vector space. This process allows AI models to understand and process complex data by capturing the underlying semantic relationships and contextual nuances.

2. What is a vector store in the context of AI and databases?

A vector store, such as a MongoDB Atlas database, is a storage mechanism for vector embeddings. It allows efficient storing, indexing, and retrieval of vector data, essential for tasks like semantic search, recommendation systems, and other AI applications.

3. How does a retrieval-augmented generation (RAG) system utilize embeddings?

A RAG system uses embeddings to improve the response generated by a large language model (LLM) by retrieving relevant information from a knowledge store based on semantic similarities. The query embedding is compared with the knowledge store (database record) embedding to fetch contextually similar and relevant data, which improves the accuracy and relevance of generated responses by the LLM to the user’s query.
