
Building a RAG System With Google's Gemma, Hugging Face and MongoDB

Richmond Alake • 12 min read • Published Feb 22, 2024 • Updated Mar 21, 2024


Google recently released Gemma, a state-of-the-art open model, to the AI community. Specifically, Google released four variants: the Gemma 2B base model, Gemma 2B instruct model, Gemma 7B base model, and Gemma 7B instruct model. Gemma and its variants utilise building blocks similar to those of Gemini, Google’s most capable and efficient foundation model built with Mixture-of-Expert (MoE) architecture.
This article presents how to leverage Gemma as the foundation model in a retrieval-augmented generation (RAG) pipeline or system, with supporting models provided by Hugging Face, a repository for open-source models, datasets, and compute resources. The AI stack presented in this article utilises the GTE large embedding models from Hugging Face and MongoDB as the vector database.
Here’s what to expect from this article:
  • Quick overview of a RAG system
  • Information on Google’s latest open model, Gemma
  • Utilising Gemma in a RAG system as the base model
  • Building an end-to-end RAG system with an open-source base and embedding models from Hugging Face
Depiction of a RAG pipeline using MongoDB, Gemma and Hugging Face

Step 1: installing libraries

All implementation steps can be accessed in the repository, which has a notebook version of the RAG system presented in this article.
The shell command sequence below installs libraries for leveraging open-source large language models (LLMs), embedding models, and database interaction functionalities. These libraries simplify the development of a RAG system, reducing the complexity to a small amount of code:
  • PyMongo: A Python library for interacting with MongoDB that enables functionalities to connect to a cluster and query data stored in collections and documents.
  • Pandas: Provides a data structure for efficient data processing and analysis using Python
  • Hugging Face datasets: Holds audio, vision, and text datasets
  • Hugging Face Accelerate: Abstracts the complexity of writing code that leverages hardware accelerators such as GPUs. Accelerate is leveraged in the implementation to utilise the Gemma model on GPU resources.
  • Hugging Face Transformers: Access to a vast collection of pre-trained models
  • Hugging Face Sentence Transformers: Provides access to sentence, text, and image embeddings.
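The install command itself is not reproduced in this text. As a minimal sketch, assuming the PyPI package names pymongo, pandas, datasets, accelerate, transformers, and sentence-transformers (inferred from the list above), the installation could be driven from Python as follows. The helper names build_install_command and install are hypothetical; in a notebook, a single pip install line achieves the same result.

```python
import subprocess
import sys

# Assumed PyPI names for the libraries listed above.
PACKAGES = [
    "pymongo",
    "pandas",
    "datasets",
    "accelerate",
    "transformers",
    "sentence-transformers",
]


def build_install_command(packages):
    """Return a pip command that installs into the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "--quiet", *packages]


def install(packages=PACKAGES):
    """Run the install; call this once in a fresh environment."""
    subprocess.check_call(build_install_command(packages))


# Show the command without running it.
print(" ".join(build_install_command(PACKAGES)))
```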

Step 2: data sourcing and preparation

The data utilised in this tutorial is sourced from Hugging Face datasets, specifically the AIatMongoDB/embedded_movies dataset. 
A datapoint within the movie dataset contains attributes specific to an individual movie entry; plot, genre, cast, runtime, and more are captured for each data point. After loading the dataset into the development environment, it is converted into a Pandas DataFrame object, which enables efficient data structure manipulation and analysis.
The operations in the code snippet below focus on enforcing data integrity and quality.
  1. The first process ensures that each data point's fullplot attribute is not empty, as this is the primary data we utilise in the embedding process.
  2. The second process removes the plot_embedding attribute from all data points, as it will be replaced by new embeddings created with a different embedding model, gte-large.
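The two cleaning operations above can be sketched as follows. This is a minimal sketch, assuming the dataset exposes a "train" split; the function names load_movies and prepare_movies and the tiny sample frame are hypothetical and exist only so the cleaning step can be tried without downloading anything.

```python
import pandas as pd


def load_movies():
    """Download the dataset from Hugging Face (needs network access)."""
    from datasets import load_dataset  # imported lazily; only needed here

    dataset = load_dataset("AIatMongoDB/embedded_movies")
    # Assumption: the records live in the "train" split.
    return pd.DataFrame(dataset["train"])


def prepare_movies(df):
    """Drop rows without a fullplot, then drop the old plot_embedding column."""
    df = df.dropna(subset=["fullplot"])
    return df.drop(columns=["plot_embedding"], errors="ignore")


# Tiny stand-in frame (made-up rows) to exercise the cleaning step offline.
sample = pd.DataFrame(
    {
        "title": ["A", "B"],
        "fullplot": ["Two strangers meet on a train.", None],
        "plot_embedding": [[0.1], [0.2]],
    }
)
print(prepare_movies(sample))
```

With the real data, dataset_df = prepare_movies(load_movies()) would produce the DataFrame used in the following steps.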

Step 3: generating embeddings

Embedding models convert high-dimensional data such as text, audio, and images into a lower-dimensional numerical representation that captures the input data's semantics and context. This embedding representation of data can be used to conduct semantic searches based on the positions and proximity of embeddings to each other within a vector space.
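To make "proximity within a vector space" concrete, here is a minimal sketch of ranking by cosine similarity in pure Python. The three-dimensional vectors are made-up toy values, not real gte-large embeddings, and the function name cosine_similarity is chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query embedding and two document embeddings.
query = [1.0, 0.0, 1.0]
docs = {"romance": [0.9, 0.1, 0.8], "war": [0.0, 1.0, 0.1]}

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked)
```

A vector database performs the same kind of comparison, but with an index that avoids scoring every stored vector.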
The embedding model used in the RAG system is the General Text Embeddings (GTE) model, based on the BERT model. The GTE embedding models come in three variants, mentioned below, and were trained and released by Alibaba DAMO Academy, a research institution.
Model | Dimension | Massive Text Embedding Benchmark (MTEB) Leaderboard Retrieval (Average)
In the comparison between open-source embedding models GTE and embedding models provided by OpenAI, the GTE-large embedding model offers better performance on retrieval tasks but requires more storage for embedding vectors compared to the latest embedding models from OpenAI. Notably, the GTE embedding model can only be used on English texts.
The code snippet below demonstrates generating text embeddings based on the text in the "fullplot" attribute for each movie record in the DataFrame. Using the SentenceTransformers library, we get access to the "thenlper/gte-large" model hosted on Hugging Face. If your development environment has limited computational resources and cannot hold the embedding model in RAM, utilise other variants of the GTE embedding model: gte-base or gte-small.
The steps in the code snippets are as follows:
  1. Import the SentenceTransformer class to access the embedding models.
  2. Load the embedding model using the SentenceTransformer constructor to instantiate the gte-large embedding model.
  3. Define the get_embedding function, which takes a text string as input and returns a list of floats representing the embedding. The function first checks if the input text is not empty (after stripping whitespace). If the text is empty, it returns an empty list. Otherwise, it generates an embedding using the loaded model.
  4. Generate embeddings by applying the get_embedding function to the "fullplot" column of the dataset_df DataFrame, generating embeddings for each movie's plot. The resulting list of embeddings is assigned to a new column named embedding.
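The steps above can be sketched as follows. This is a minimal sketch rather than the article's exact code: here get_embedding takes the model as an explicit argument (an assumption made so the function can be exercised with a stand-in model), and load_embedding_model is a hypothetical helper name.

```python
def load_embedding_model(name="thenlper/gte-large"):
    """Instantiate the embedding model (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    return SentenceTransformer(name)


def get_embedding(text, model):
    """Return the embedding of text as a list of floats, or [] for empty input."""
    if not text or not text.strip():
        return []
    # encode() returns an array-like object; tolist() makes it JSON/BSON friendly.
    return model.encode(text).tolist()


# Usage with the DataFrame from the previous step (not executed here):
# embedding_model = load_embedding_model()
# dataset_df["embedding"] = dataset_df["fullplot"].apply(
#     lambda text: get_embedding(text, embedding_model)
# )
```

Swapping "thenlper/gte-large" for the gte-base or gte-small variant only changes the name passed to load_embedding_model.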
After this section, we now have a complete dataset with embeddings that can be ingested into a vector database, like MongoDB, where vector search operations can be performed.

Step 4: database setup and connection

Before moving forward, ensure the following prerequisites are met:
  • Database cluster set up on MongoDB Atlas
  • Obtained the URI to your cluster
For assistance with database cluster setup and obtaining the URI, refer to our guide for setting up a MongoDB cluster and getting your connection string. Alternatively, follow Step 5 of this article on using embeddings in a RAG system, which offers detailed instructions on configuring and setting up the database cluster.
Once you have created a cluster, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database. The database will be named movies, and the collection will be named movies_records.
Creating a database and collection
After setting up the database and obtaining the Atlas cluster connection URI, ensure the URI is securely stored within your development environment.
This guide uses Google Colab, which offers a feature for the secure storage of environment secrets. These secrets can then be accessed within the development environment. Specifically, the code mongo_uri = userdata.get('MONGO_URI') retrieves the URI from the secure storage. To set values for secrets, click the "key" icon in the sidebar of the Colab notebook.
The code snippet below also utilises PyMongo to create a MongoDB client object, representing the connection to the cluster and enabling access to its databases and collections.
The following code guarantees that the current database collection is empty by executing the delete_many() operation on the collection.
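The connection and clean-up steps described above can be sketched as follows. This is a minimal sketch: the environment-variable fallback for use outside Colab is an assumption added here, and the helper names read_mongo_uri, get_collection, and reset_collection are hypothetical.

```python
import os


def read_mongo_uri():
    """Read MONGO_URI from Colab secrets when available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab

        return userdata.get("MONGO_URI")
    except ImportError:
        # Assumption: outside Colab the URI is exported as an environment variable.
        return os.environ.get("MONGO_URI")


def get_collection(mongo_uri, db_name="movies", collection_name="movies_records"):
    """Create a client and return a handle to the target collection."""
    from pymongo import MongoClient  # lazy import

    client = MongoClient(mongo_uri)
    return client[db_name][collection_name]


def reset_collection(collection):
    """Remove every document so repeated runs start from an empty collection."""
    return collection.delete_many({})


# Usage (not executed here, needs a reachable cluster):
# collection = get_collection(read_mongo_uri())
# reset_collection(collection)
```

Note that delete_many({}) removes all documents in the collection, so it should only be pointed at the tutorial's own movies_records collection.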

Step 5: vector search index creation

Creating a vector search index within the movies_records collection is essential for efficient document retrieval from MongoDB into our development environment. To achieve this, refer to the official vector search index creation guide.
In the creation of a vector search index using the JSON editor on MongoDB Atlas, ensure your vector search index is named vector_index and the vector search index definition is as follows:
The 1024 value of the numDimensions field corresponds to the dimension of the vectors generated by the gte-large embedding model. If you use the gte-base or gte-small embedding model, set numDimensions in the vector search index to 768 or 384, respectively.
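As a sketch of what such a definition looks like, the snippet below builds the index definition as a Python dictionary and prints it as JSON for pasting into the Atlas JSON editor. The path value "embedding" matches the column created in Step 3; the "cosine" similarity setting and the helper name vector_index_definition are assumptions made for illustration.

```python
import json

# Vector sizes for the three GTE variants, as stated in the text above.
GTE_DIMENSIONS = {"gte-large": 1024, "gte-base": 768, "gte-small": 384}


def vector_index_definition(model="gte-large", path="embedding", similarity="cosine"):
    """Build an Atlas Vector Search index definition for the chosen GTE variant."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": GTE_DIMENSIONS[model],
                "similarity": similarity,  # assumed similarity function
            }
        ]
    }


print(json.dumps(vector_index_definition(), indent=2))
```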
Up to this point, we have successfully done the following:
  • Loaded data sourced from Hugging Face
  • Provided each data point with embedding using the GTE-large embedding model from Hugging Face
  • Set up a MongoDB database designed to store vector embeddings
  • Established a connection to this database from our development environment
  • Defined a vector search index for efficient querying of vector embeddings

Step 6: data ingestion and vector search

Ingesting data into a MongoDB collection from a pandas DataFrame is straightforward: convert the DataFrame into a list of dictionaries, then pass those records to the insert_many method on the collection.
The operations below are performed in the code snippet:
  1. Convert the dataset DataFrame to a list of dictionaries using the to_dict('records') method on dataset_df. The records parameter is crucial as it encapsulates each row as a single dictionary.
  2. Ingest data into the MongoDB vector database by calling the insert_many(documents) function on the MongoDB collection, passing it the list of dictionaries. MongoDB's insert_many function ingests each dictionary from the list as an individual document within the collection.
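The two ingestion operations can be sketched as follows. This is a minimal sketch; the function name ingest_dataframe is hypothetical, and collection is assumed to be a PyMongo collection (or any object with an insert_many method).

```python
import pandas as pd


def ingest_dataframe(df, collection):
    """Insert each DataFrame row as one document; return the number of rows sent."""
    # "records" yields one dictionary per row: [{column: value, ...}, ...]
    documents = df.to_dict("records")
    collection.insert_many(documents)
    return len(documents)


# Usage with the objects from the previous steps (not executed here):
# ingest_dataframe(dataset_df, collection)
```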
The following step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline. 
The pipeline, consisting of the $vectorSearch and $project stages, executes queries using the generated vector and formats the results to include only the required information, such as plot, title, and genres while incorporating a search score for each result.
The code snippet above conducts the following operations to allow semantic search for movies:
  1. Define the vector_search function that takes a user's query string and a MongoDB collection as inputs and returns a list of documents that match the query based on vector similarity search.
  2. Generate an embedding for the user's query by calling the previously defined function, get_embedding, which converts the query string into a vector representation.
  3. Construct a pipeline for MongoDB's aggregate function, incorporating two main stages: $vectorSearch and $project.
  4. The $vectorSearch stage performs the actual vector search. The index field specifies the vector index to utilise, and it should correspond to the name entered in the vector search index definition in the previous steps. The queryVector field takes the embedding representation of the user's query. The path field corresponds to the document field containing the embeddings. The numCandidates field specifies the number of candidate documents to consider, and the limit field sets the number of results to return.
  5. The $project stage formats the results to include only the required fields: plot, title, genres, and the search score. It explicitly excludes the _id field.
  6. The aggregate executes the defined pipeline to obtain the vector search results. The final operation converts the returned cursor from the database into a list.
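The operations listed above can be sketched as follows. This is a minimal sketch, not the article's exact code: the numCandidates and limit values (150 and 4) are assumed defaults, build_pipeline is a hypothetical helper, and the embedding function is passed in as embed so the example does not depend on a loaded model.

```python
def build_pipeline(query_embedding, index="vector_index", path="embedding",
                   num_candidates=150, limit=4):
    """Build the two-stage aggregation pipeline for a vector search."""
    return [
        {
            "$vectorSearch": {
                "index": index,                   # name of the vector search index
                "queryVector": query_embedding,   # embedding of the user's query
                "path": path,                     # document field holding embeddings
                "numCandidates": num_candidates,  # candidates considered (assumed value)
                "limit": limit,                   # results returned (assumed value)
            }
        },
        {
            "$project": {
                "_id": 0,
                "plot": 1,
                "title": 1,
                "genres": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def vector_search(user_query, collection, embed):
    """Return the documents most similar to user_query, or [] if it cannot be embedded."""
    query_embedding = embed(user_query)
    if not query_embedding:
        return []
    results = collection.aggregate(build_pipeline(query_embedding))
    return list(results)  # materialise the cursor
```

With the earlier sketch of get_embedding, embed could be lambda text: get_embedding(text, embedding_model).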

Step 7: handling user queries and loading Gemma

The code snippet defines the function get_search_result, a custom wrapper for performing the vector search using MongoDB and formatting the results to be passed to downstream stages in the RAG pipeline.
The formatting of the search results extracts the title and plot using the get method and provides default values ("N/A") if either field is missing. The returned results are formatted into a string that includes both the title and plot of each document, which is appended to search_result, with each document's details separated by a newline character.
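The formatting described above can be sketched as follows. This is a minimal sketch: the exact "Title: ..., Plot: ..." layout is an assumption, format_search_results is a hypothetical helper, and the search function is passed in as an argument so the example stands alone.

```python
def format_search_results(results):
    """Turn a list of result documents into one newline-separated string."""
    search_result = ""
    for result in results:
        title = result.get("title", "N/A")  # default when the field is missing
        plot = result.get("plot", "N/A")
        search_result += f"Title: {title}, Plot: {plot}\n"
    return search_result


def get_search_result(query, collection, search):
    """Run the vector search and format its results for the language model."""
    return format_search_results(search(query, collection))


# Made-up documents to show the output shape.
print(format_search_results([{"title": "Example Movie", "plot": "Two people fall in love."},
                             {"title": "Untitled"}]))
```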
The RAG system implemented in this use case is a query engine that conducts movie recommendations and provides a justification for its selection.
A user query is defined in the code snippet above; this query is the target for semantic search against the movie embeddings in the database collection. The query and vector search results are combined into a single string to pass as a full context to the base model for the RAG system. 
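Combining the query with the search results can be sketched as follows. This is a minimal sketch: the instruction wording inside the prompt and the function name build_combined_information are assumptions, and the source information shown is a made-up placeholder rather than real search output.

```python
def build_combined_information(query, source_information):
    """Join the user query and the retrieved context into a single prompt string."""
    return (
        f"Query: {query}\n"
        "Continue to answer the query by using the Search Results:\n"
        f"{source_information}"
    )


query = "What is the best romantic movie to watch and why?"
# Placeholder context; in the pipeline this comes from get_search_result.
source_information = "Title: Example Movie, Plot: Two people fall in love.\n"
combined_information = build_combined_information(query, source_information)
print(combined_information)
```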
The following steps load the Gemma-2b instruction model ("google/gemma-2b-it") into the development environment using the Hugging Face Transformers library. Specifically, the code snippet below loads a tokenizer and a model from the Transformers library by Hugging Face.
Here are the steps to load the Gemma open model:
  1. Import AutoTokenizer and AutoModelForCausalLM classes from the transformers module.
  2. Load the tokenizer using the AutoTokenizer.from_pretrained method to instantiate a tokenizer for the "google/gemma-2b-it" model. This tokenizer converts input text into a sequence of tokens that the model can process.
  3. Load the model using the AutoModelForCausalLM.from_pretrained method. Two options are provided for model loading, each accommodating a different computing environment:
     • CPU usage: For environments only utilising CPU for computations, the model can be loaded without specifying the device_map parameter.
     • GPU usage: The device_map="auto" parameter is included for environments with GPU support to map the model's components automatically to available GPU compute resources.
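The loading steps above can be sketched as follows. This is a minimal sketch; the helper names model_load_kwargs and load_gemma and the use_gpu flag are hypothetical. Downloading the Gemma weights may also require accepting the model's terms on Hugging Face and authenticating, which is not shown here.

```python
def model_load_kwargs(use_gpu):
    """Extra from_pretrained arguments for the CPU and GPU options."""
    # device_map="auto" lets Accelerate place model components on available GPUs.
    return {"device_map": "auto"} if use_gpu else {}


def load_gemma(model_id="google/gemma-2b-it", use_gpu=False):
    """Load the tokenizer and model for the given model id."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_load_kwargs(use_gpu))
    return tokenizer, model


# Usage (not executed here, downloads several gigabytes of weights):
# tokenizer, model = load_gemma(use_gpu=True)
```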
The steps to process user inputs and Gemma’s output are as follows:
  1. Tokenize the text input combined_information to obtain a sequence of numerical tokens as PyTorch tensors; the result of this operation is assigned to the variable input_ids.
  2. Move input_ids to the available GPU resource using the `.to("cuda")` method; the aim is to speed up the model’s computation.
  3. Generate a response from the model by invoking the model.generate function with the input_ids tensor. The max_new_tokens=500 parameter limits the length of the generated text, preventing the model from producing excessively long outputs.
  4. Finally, decode the model’s response using the tokenizer.decode method, which converts the generated tokens into a readable text string. The expression response[0] accesses the first sequence in the response tensor, which contains the generated tokens.
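The four steps above can be sketched as one function. This is a minimal sketch; the function name generate_answer and the optional device argument are hypothetical, and tokenizer and model are assumed to be the objects loaded from the Transformers library in the previous step.

```python
def generate_answer(prompt, tokenizer, model, device=None, max_new_tokens=500):
    """Tokenize the prompt, generate a continuation, and decode it to text."""
    # Step 1: tokenize into PyTorch tensors.
    input_ids = tokenizer(prompt, return_tensors="pt")
    # Step 2: optionally move the tensors to a GPU, for example device="cuda".
    if device is not None:
        input_ids = input_ids.to(device)
    # Step 3: generate, capping the number of newly produced tokens.
    response = model.generate(**input_ids, max_new_tokens=max_new_tokens)
    # Step 4: decode the first (and only) generated sequence.
    return tokenizer.decode(response[0])


# Usage (not executed here):
# print(generate_answer(combined_information, tokenizer, model, device="cuda"))
```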
Query: What is the best romantic movie to watch and why?
Gemma’s response: Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.


The implementation of a RAG system in this article utilised entirely open datasets, models, and embedding models available via Hugging Face. Utilising Gemma, it’s possible to build RAG systems with models that do not rely on the management and availability of models from closed-source model providers. 
The advantages of leveraging open models include transparency in the training details of models utilised, the opportunity to fine-tune base models for further niche task utilisation, and the ability to utilise private sensitive data with locally hosted models.
To better understand open vs. closed models and their application to a RAG system, we have an article that implements an end-to-end RAG system using the POLM stack, which leverages embedding models and LLMs provided by OpenAI.
All implementation steps can be accessed in the repository, which has a notebook version of the RAG system presented in this article.


1. What are the Gemma models?
Gemma models are a family of lightweight, state-of-the-art open models for text generation, including question-answering, summarisation, and reasoning. Inspired by Google's Gemini, they are available in 2B and 7B sizes, with pre-trained and instruction-tuned variants.
2. How do Gemma models fit into a RAG system?
In a RAG system, Gemma models are the base model for generating responses based on input queries and source information retrieved through vector search. Their efficiency and versatility in handling a wide range of text formats make them ideal for this purpose.
3. Why use MongoDB in a RAG system?
MongoDB is used for its robust management of vector embeddings, enabling efficient storage, retrieval, and querying of document vectors. It also provides traditional transactional database capabilities, so it can serve as both the operational and vector database for modern AI applications.
4. Can Gemma models run on limited resources?
Despite their advanced capabilities, Gemma models are designed to be deployable in environments with limited computational resources, such as laptops or desktops, making them accessible for a wide range of applications. Gemma models can also be deployed using deployment options enabled by Hugging Face, such as inference API, inference endpoints and deployment solutions via various cloud services.
