Implementing Robust RAG Pipelines: Integrating Google's Gemma 2 (2B) Open Model, MongoDB, and LLM Evaluation Techniques

Richmond Alake • 20 min read • Published Sep 12, 2024 • Updated Sep 12, 2024
During the early days of the generative AI era, a period defined by the emergence of large language models (LLMs) capable of advanced response synthesis, reasoning, and planning capabilities, the observable trend was an increase in the number of parameters within newly released models. GPT-3 was released with 175 billion parameters in 2020, marking a significant leap from its predecessor, GPT-2, which had 1.5 billion parameters. This trend continued with even larger models like GPT-4, estimated to have over a trillion parameters, although OpenAI has yet to disclose the exact number publicly.
However, a counteracting effort was also underway to reduce the size of these language models while maintaining the capabilities and quality of larger LLMs. Google is a player on both sides of the language model sizing effort. Gemini, Google’s multimodal LLM, was released in December 2023, with several versions following. These models offer the emergent abilities of large language models spanning billions of parameters and support extensive context window sizes for inputs to the LLM.
This tutorial's key focus, though, is Gemma 2, especially the variant with just two billion parameters.
As Google described, Gemma “is a family of lightweight, state-of-the-art open models” built with the same building blocks used to create the Gemini models. Small language and open models have their place in various use cases that call for efficiency, cost-effectiveness, privacy, and latency.
Below are some examples of use cases where Gemma models will be viable for developers.
| Industry | Use case | Description | Key benefit |
| --- | --- | --- | --- |
| Finance | Personalized budgeting assistant | A mobile app feature that analyzes spending patterns and offers tailored financial advice to users | A smaller model like Gemma 2B is perfect here: it's lightweight enough to run on a smartphone, respects user privacy by processing data locally, and can provide quick, contextual responses without sending sensitive financial data to a cloud server. |
| Healthcare | Medical symptom checker | A web application for initial patient triage, understanding user-described symptoms and suggesting potential causes or urgency levels | Gemma 2B can be fine-tuned on medical datasets to provide quick, accurate initial assessments without expensive infrastructure or compromising patient data privacy. |
| Customer service | Intelligent chatbot | An advanced customer service chatbot for a SaaS company that understands context and responds more naturally to customer queries | The model's small variants (2B and 9B) make it easy to deploy and update, allowing quick iterations based on user feedback and new product features. |
This tutorial covers building an asset management analyst assistant using Gemma 2. The assistant can analyze market reports stored in a MongoDB database based on user-provided queries, effectively augmenting some of an analyst's responsibilities.
The tutorial also introduces the topic of LLM evaluation, specifically focusing on assessing key components in a typical retrieval-augmented generation (RAG) pipeline.
What’s covered in this tutorial
  • Building an asset management analyst assistant with Gemma 2
  • Implementing a RAG pipeline with MongoDB
  • Vector search and semantic retrieval techniques
  • Generating responses using the Gemma 2 (2B) model
  • LLM evaluation for RAG components
All code and implementation steps in this tutorial can be found on GitHub.
RAG Pipeline

Step 1: Install libraries and set environment variables

The first step is to install all the libraries that provide the functionality to build the RAG pipeline components and to conduct LLM evaluation of the pipeline's retrieval and generation components.
  • PyMongo: Facilitates interaction with MongoDB databases
  • Pandas: Provides powerful data manipulation and analysis capabilities
  • Hugging Face (datasets): Offers access to a vast collection of pre-processed datasets
  • Hugging Face (Sentence Transformers): Provides access to sentence, text, and image embeddings
  • DeepEval: Provides metrics for assessing LLM performance across various dimensions
Execute the following command to install and upgrade the required libraries:
```
!pip install --upgrade --quiet datasets pandas pymongo sentence_transformers deepeval
```
Setting up environment variables is crucial to ensuring secure access to external services. This practice enhances security by keeping sensitive information out of your codebase. Ensure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below.
```python
import os
import getpass

# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below
# How to get a token: https://huggingface.co/docs/hub/en/security-tokens
os.environ["HF_TOKEN"] = getpass.getpass("Enter your Hugging Face token: ")
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")  # Needed for the evaluation step
```

Step 2: Data loading and preparation

This step loads a financial dataset from Hugging Face into the current development environment. The data preparation process then creates a new attribute for each data point by combining several existing attributes. This combined attribute is prepared for the subsequent embedding generation step.
Data loading and preparation are common processes in most data-related tasks and processing pipelines. Data loading refers to obtaining data from an external source and importing it into the current development environment for further downstream processing. Data preparation refers to organizing a dataset's content in the format required by downstream processes. In this tutorial, data is loaded from the MongoDB Hugging Face organization.
The dataset is designed for financial analysis and market research in the tech sector. It provides a comprehensive view of each company, including news, financial metrics, and analyst reports. It could be used for tasks such as sentiment analysis, market trend analysis, or as part of a financial recommendation system.
```python
import pandas as pd
from datasets import load_dataset

# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below
# How to get a token: https://huggingface.co/docs/hub/en/security-tokens
# https://huggingface.co/datasets/MongoDB/fake_tech_companies_market_reports
dataset = load_dataset("MongoDB/fake_tech_companies_market_reports", split="train", streaming=True)
dataset_df = dataset.take(100)

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset_df)
dataset_df.head(5)
```
Here are the steps taken in the code snippet above:
  • The load_dataset function is used to fetch the "fake_tech_companies_market_reports" dataset from the MongoDB organization on Hugging Face.
  • The "train" split is specified; it is typically used for the main dataset in machine learning tasks.
  • The streaming=True parameter is crucial here. It enables lazy dataset loading, allowing us to work with potentially very large datasets without loading everything into memory at once.
  • dataset_df = dataset.take(100) uses the take method to extract the first 100 samples from the dataset.
  • After extracting a manageable subset of the data, we use Pandas to create a DataFrame object for efficient data manipulation and visualization.
  • We use Pandas' head method to understand the dataset's structure and content.
The operations above are steps taken to load data into the development environment and provide domain-specific data for the problem we are trying to solve. Specifically, we are creating a RAG-enabled chatbot that provides market information on tech-based companies for the asset management use case. The second half of this section focuses on data preparation.
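For intuition, the lazy-loading behavior of streaming=True combined with take can be illustrated with a plain Python generator and itertools.islice; the generator below is just a stand-in for a streamed dataset and is not part of the tutorial's pipeline.

```python
from itertools import islice

def record_stream():
    # Stands in for a streamed dataset: values are produced lazily,
    # one at a time, and are never all materialized in memory.
    n = 0
    while True:
        yield n
        n += 1

# Analogous to dataset.take(100): pull only the first 100 items.
first_100 = list(islice(record_stream(), 100))
```

The stream could be arbitrarily large; only the 100 requested records are ever materialized.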
A crucial part of the RAG pipeline process is data embedding, which creates a semantic representation of each data point that can be retrieved using information retrieval techniques such as vector search. Understanding which data attributes best capture the semantics, or in simpler terms, the meaning of a single data point, is crucial for building accurate and reliable RAG applications. Therefore, practitioners must carefully select and preprocess the most relevant features for embedding.
This involves:
  • Conducting thorough exploratory data analysis to understand the dataset's structure and content.
  • Identifying key attributes that best represent the semantic essence of each data point.
  • Considering domain-specific knowledge to guide feature selection and engineering.
  • Evaluating and fine-tuning the embedding process regularly to ensure it accurately captures the nuanced meanings and relationships within the data.
For our use case and this tutorial, the combination of the data attributes report, company, and sector suffices to capture the semantic detail of a data point. The code snippet below shows the operation of selecting the attributes and then defining the function combine_attributes to return the concatenated string of the selected attributes.
```python
# Data preparation
def combine_attributes(row):
    combined = f"{row['company']} {row['sector']} "

    # Add report information
    for report in row['reports']:
        combined += f"{report['year']} {report['title']} {report['author']} {report['content']} "

    # Add recent news information
    for news in row['recent_news']:
        combined += f"{news['headline']} {news['summary']} "

    return combined.strip()
```
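To make the output of combine_attributes concrete, here is a minimal, self-contained example; the function is repeated so the snippet runs on its own, and the sample row is hypothetical, merely mirroring the dataset's schema.

```python
def combine_attributes(row):
    # Same logic as above, repeated so this example is self-contained.
    combined = f"{row['company']} {row['sector']} "
    for report in row['reports']:
        combined += f"{report['year']} {report['title']} {report['author']} {report['content']} "
    for news in row['recent_news']:
        combined += f"{news['headline']} {news['summary']} "
    return combined.strip()

# A hypothetical data point with the same shape as the dataset's records.
sample_row = {
    "company": "ExampleTech",
    "sector": "Technology",
    "reports": [
        {"year": 2024, "title": "Q1 Outlook", "author": "J. Doe", "content": "Revenue grew 10%."}
    ],
    "recent_news": [
        {"headline": "ExampleTech ships new product", "summary": "Positive reception."}
    ],
}

combined = combine_attributes(sample_row)
# All selected attributes are concatenated into one embedding-ready string.
```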
The next process applies combine_attributes to each data point in the dataset and stores the result in a new attribute: combined_attributes. This is the attribute that is passed into the embedding model.
```python
# Add the new column 'combined_attributes'
dataset_df['combined_attributes'] = dataset_df.apply(combine_attributes, axis=1)
```

Step 3: Embedding generation with GTE-Large

Be aware of the context window size of the embedding model you use. These models typically have a finite input capacity, often measured in tokens or characters. When input exceeds this threshold, the model may truncate the excess, potentially resulting in significant information loss. This truncation can severely impact the quality and relevance of the generated embeddings, compromising the overall performance of your RAG system.
If you have extensive data to embed, it's worth exploring different chunking strategies. The standard advice for practitioners is to implement chunking algorithms that divide longer documents into semantically coherent segments, each fitting within the model's context window, and to consider overlapping chunks to maintain contextual continuity.
For this use case and tutorial, the chunking strategy is to split any data point that exceeds the embedding model's maximum input token length and to attach the same metadata to each of the resulting chunks.
The GTE (General Text Embeddings) model series, developed by the Institute for Intelligent Computing, Alibaba Group, provides state-of-the-art performance in text embedding technology. This tutorial uses the GTE-Large English V1.5 model, which offers an optimal balance between performance and computational efficiency. The model achieves a score of 65.39 on the Massive Text Embedding Benchmark (MTEB) for English, indicating strong embedding performance across a diverse set of NLP tasks and datasets. For the information retrieval tasks in this tutorial, which involve finding relevant documents for an input query, GTE-Large English V1.5 demonstrates good performance, with an average retrieval score of 57.91 across 15 datasets.
```python
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np

# https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
embedding_model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

# Determine the maximum sequence length for the model
max_seq_length = embedding_model.max_seq_length


def chunk_text(text, tokenizer, max_length=8192, overlap=50):
    """
    Split the text into overlapping chunks based on token length.
    """
    tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunk_tokens = tokens[i:i + max_length]
        chunk = tokenizer.convert_tokens_to_string(chunk_tokens)
        chunks.append(chunk)
    return chunks


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Get the tokenizer from the model
    tokenizer = embedding_model.tokenizer

    # Split text into chunks if it's too long
    chunks = chunk_text(text, tokenizer, max_length=max_seq_length)

    if len(chunks) == 1:
        # If text fits in one chunk, embed as usual
        embedding = embedding_model.encode(text)
    else:
        # If text was split, embed each chunk and average the results
        chunk_embeddings = embedding_model.encode(chunks)
        embedding = np.mean(chunk_embeddings, axis=0)

    return embedding.tolist()


# Apply the embedding function with a progress bar
tqdm.pandas(desc="Generating embeddings")
dataset_df["embedding"] = dataset_df['combined_attributes'].progress_apply(get_embedding)
```
The code snippet above performs the following operations.
Chunk the input data if it exceeds the embedding model's maximum input length:
  • Utilize the chunk_text function to tokenize and split long texts.
  • Ensure each chunk fits within the model's max_seq_length (8192 tokens for GTE-Large English V1.5).
  • Implement an overlap of 50 tokens between chunks to maintain contextual continuity.
Pass the chunk(s) into the embedding model:
  • For single chunks, directly encode using embedding_model.encode(text).
  • For multiple chunks, encode each chunk separately.
  • Handle potential empty inputs, returning an empty list if any empty inputs are present.
Store the embedding data as a new attribute embedding for each data point:
  • Use tqdm to display a progress bar for all data points during the embedding process.
  • Add the resulting embeddings as a new 'embedding' column in the dataset_df DataFrame.
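The stepping-and-overlap behavior of chunk_text can be verified in isolation by running the same loop over a plain token list in place of a real tokenizer; this is a sketch for inspection, not part of the pipeline.

```python
def chunk_with_overlap(tokens, max_length, overlap):
    # Same stepping logic as chunk_text above, but over a plain list,
    # so the overlap behavior is easy to inspect without a tokenizer.
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunks.append(tokens[i:i + max_length])
    return chunks

tokens = list(range(200))  # stand-in for a 200-token document
chunks = chunk_with_overlap(tokens, max_length=100, overlap=50)
# Chunks start at positions 0, 50, 100, 150: each consecutive
# pair of chunks shares exactly 50 tokens.
```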

Step 4: MongoDB vector database and connection setup

In this tutorial and many RAG applications, MongoDB acts as an operational and a vector database. MongoDB Atlas specifically provides a database solution that efficiently stores, queries, and retrieves vector embeddings.
Creating a database and collection within MongoDB is made simple with MongoDB Atlas.
  1. First, register for a MongoDB Atlas account. Existing users can sign into MongoDB Atlas.
  2. Follow the instructions. Select Atlas UI as the procedure to deploy your first cluster.
  3. Create the database: asset_management_use_case.
  4. Within the database asset_management_use_case, create the collection market_reports.
  5. Create a vector search index named vector_index for the market_reports collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. Below is the JSON definition of the collection's vector search index.
Your vector search index created on MongoDB Atlas should look like the following:
```json
{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
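One detail worth double-checking: numDimensions must equal the output dimension of the embedding model (1024 for gte-large-en-v1.5); mismatched dimensions are a common cause of empty vector search results. The helper below is a hypothetical sanity check, not part of the tutorial's code, illustrating the idea.

```python
# The index definition mirrored as a Python dict.
index_definition = {
    "fields": [
        {
            "numDimensions": 1024,
            "path": "embedding",
            "similarity": "cosine",
            "type": "vector",
        }
    ]
}

def embedding_matches_index(embedding, definition):
    # Compare an embedding's length against the vector field's
    # declared dimensionality before ingesting documents.
    vector_field = next(f for f in definition["fields"] if f["type"] == "vector")
    return len(embedding) == vector_field["numDimensions"]
```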
Follow MongoDB’s steps to get the connection string from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.
```python
os.environ["MONGO_URI"] = getpass.getpass("Enter your MongoDB URI: ")
```
```python
import pymongo


def get_mongo_client(mongo_uri):
    """Establish and validate connection to the MongoDB."""

    client = pymongo.MongoClient(mongo_uri, appname="devrel.showcase.rag.cohere_mongodb.python")

    # Validate the connection
    ping_result = client.admin.command('ping')
    if ping_result.get('ok') == 1.0:
        # Connection successful
        print("Connection to MongoDB successful")
        return client
    else:
        print("Connection to MongoDB failed")
        return None


MONGO_URI = os.environ["MONGO_URI"]

if not MONGO_URI:
    print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(MONGO_URI)

DB_NAME = "asset_management_use_case"
COLLECTION_NAME = "market_reports"

db = mongo_client.get_database(DB_NAME)
collection = db.get_collection(COLLECTION_NAME)
```
For this tutorial, to ensure we are working with a clean collection, you can run the code below to clear the collection of any existing data.
```python
# Delete any existing records in the collection
collection.delete_many({})
```

Step 5: Data ingestion

MongoDB's document-oriented structure offers several advantages for data storage and manipulation in Python applications. At its core is the use of BSON (Binary JSON) for data storage, which aligns naturally with Python's dictionary data type. This alignment facilitates data representation using key-value pairs, making it intuitive for Python developers to work with MongoDB.
One of MongoDB's key features is its schema flexibility. Unlike traditional relational databases, MongoDB is schema-less, allowing each document in a collection to have a different structure. This flexibility is particularly advantageous in Python environments, as it complements Python's dynamic nature. Developers can ingest varied data structures without predefined schemas, offering greater data handling and storage adaptability.
Another significant benefit of working with Python is MongoDB's data ingestion efficiency. The close similarity between Python dictionaries and MongoDB documents enables direct data ingestion without complex transformations. This streamlined process results in faster data insertion and reduced processing overhead, making MongoDB an excellent choice for applications that require rapid data storage and retrieval in Python-based systems.
And that’s why the ingestion process within this tutorial is completed in one or two lines:
```python
documents = dataset_df.to_dict('records')
collection.insert_many(documents)
print("Data ingestion into MongoDB completed")
```

Step 6: Vector search

MongoDB's query language is designed to work well with document structures, making it easy to query and manipulate ingested data using familiar Python-like syntax. Queries are executed using MongoDB's aggregation pipeline, a powerful feature that allows for complex data processing and analysis within the database.
An aggregation pipeline can be thought of similarly to pipelines in data engineering or machine learning, where processes operate sequentially. Each stage takes an input, performs operations, and provides an output for the next stage.
Stages are the building blocks of an aggregation pipeline. Each stage represents a specific data transformation or analysis operation. Common stages include:
  • $match: Filters documents (similar to WHERE in SQL)
  • $group: Groups documents by specified fields
  • $sort: Sorts the documents
  • $project: Reshapes documents (select, rename, compute fields)
  • $limit: Limits the number of documents
  • $unwind: Deconstructs array fields
  • $lookup: Performs left outer joins with other collections
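In PyMongo, a pipeline is simply an ordered list of stage dictionaries. The illustrative pipeline below filters documents by sector, sorts by company name, and keeps two fields; it is a sketch against this tutorial's collection, not a snippet from the tutorial itself.

```python
# An aggregation pipeline is an ordered list of stage dictionaries.
pipeline = [
    {"$match": {"sector": "Technology"}},               # filter stage
    {"$sort": {"company": 1}},                          # sort ascending by company
    {"$project": {"_id": 0, "company": 1, "sector": 1}},  # keep two fields
]

# It would be executed against a live collection with:
# results = list(collection.aggregate(pipeline))
```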
The code snippet below defines a vector search function demonstrating semantic search implementation using MongoDB's vector search capabilities. At its core, the function leverages dense vector embeddings to find documents semantically similar to a user's query. It begins by converting the input query into a vector embedding, then utilizes MongoDB's $vectorSearch operator to search through a pre-indexed collection of document embeddings efficiently.
```python
def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.

    Returns:
        list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    vector_search_stage = {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "numCandidates": 150,  # Number of candidate matches to consider
            "limit": 2  # Return top 2 matches
        }
    }

    unset_stage = {
        "$unset": "embedding"  # Exclude the 'embedding' field from the results
    }

    project_stage = {
        "$project": {
            "_id": 0,  # Exclude the _id field
            "company": 1,  # Include the company field
            "reports": 1,  # Include the reports field
            "combined_attributes": 1,  # Include the combined_attributes field
            "score": {
                "$meta": "vectorSearchScore"  # Include the search score
            }
        }
    }

    pipeline = [vector_search_stage, unset_stage, project_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
```
The code snippet above does the following operations:
  • Converts the user's text query into a vector embedding using the get_embedding function defined earlier
  • Utilizes MongoDB's $vectorSearch operator to find semantically similar documents
  • Searches the vector_index using the query embedding; ensure the vector search index has been created on MongoDB Atlas and is referenced in the query by the name specified at creation
  • Sets numCandidates to 150 for broad initial matching
  • Limits the final results to the top two matches
  • Uses an $unset stage to remove the "embedding" field from results, reducing data transfer
  • Uses a $project stage to selectively include relevant fields: "company", "reports", and "combined_attributes"
  • Adds a similarity score using $meta: "vectorSearchScore"
  • Combines the stages into a single aggregation pipeline
  • Executes the search using collection.aggregate()
  • Returns the results as a list of documents

Step 7: Handling user queries

After creating the information retrieval component of the RAG pipeline, the next step is to handle its results: essentially, formatting the retrieved documents in a manner that is consumable by the LLM in the generation step of the RAG pipeline.
In the code snippet below, the get_search_result function serves as a wrapper for the vector search operation, transforming raw search results into a more user-friendly format. The function returns a formatted string containing summarized information from the top search results.
```python
def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ''
    for result in get_knowledge:
        search_result += f"Company: {result.get('company', 'N/A')}, Combined Attributes: {result.get('combined_attributes', 'N/A')}\n"

    return search_result
```
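To see the format get_search_result produces without a live database, the same loop can be run over hypothetical documents standing in for the vector_search output.

```python
# Hypothetical retrieved documents, mocked in place of vector_search output.
mock_results = [
    {"company": "ExampleTech", "combined_attributes": "ExampleTech Technology 2024 Q1 Outlook ..."},
    {"company": "SampleSoft"},  # missing combined_attributes falls back to 'N/A'
]

search_result = ''
for result in mock_results:
    search_result += f"Company: {result.get('company', 'N/A')}, Combined Attributes: {result.get('combined_attributes', 'N/A')}\n"
```

Each retrieved document becomes one "Company: ..., Combined Attributes: ..." line, and dict.get supplies 'N/A' for any missing field.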
The next step is to define an input query for the RAG pipeline. This mirrors chatbot interfaces, which take a prompt from the user and provide a response. For the use case in this tutorial, the code snippet below demonstrates the practical application of the semantic search functionality in a question-answering context.
```python
# Conduct query with the retrieval of sources
query = "What companies have negative market reports or negative sentiment that might deter from investment in the long term"
source_information = get_search_result(query, collection)
combined_information = f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."

print(combined_information)
```

Step 8: Load Gemma 2 (2B)

This section of the tutorial demonstrates the initialization of Gemma 2 (2B), a two-billion-parameter open language model, and its associated tokenizer. This particular variant of the model is instruction-tuned, which means it has been specifically fine-tuned on a dataset of instructions and responses, enhancing its ability to understand and follow user prompts accurately.
The code snippet below uses the Hugging Face transformers library to load the tokenizer and model for google/gemma-2-2b-it, using the library's AutoTokenizer and AutoModelForCausalLM classes. The tokenizer converts text into tokens that can be passed as inputs to the model.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", torch_dtype=torch.bfloat16)
```
The code snippet below extracts the response from the Gemma 2 (2B) model.
```python
def extract_model_response(response):
    # Split the response at the start of the model's turn
    parts = response.split("<start_of_turn>model")
    # If there's a model response, it will be in the last part
    if len(parts) > 1:
        model_response = parts[-1].strip()

        # Remove any potential end-of-turn markers
        model_response = model_response.split("<end_of_turn>")[0].strip()

        return model_response
    else:
        return "No model response found."
```
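To see extract_model_response in action without running the model, here is a self-contained example (the function is repeated, and the decoded output string is hypothetical, following Gemma's turn-based format):

```python
def extract_model_response(response):
    # Same logic as above, repeated so this example runs standalone.
    parts = response.split("<start_of_turn>model")
    if len(parts) > 1:
        model_response = parts[-1].strip()
        model_response = model_response.split("<end_of_turn>")[0].strip()
        return model_response
    return "No model response found."

# A hypothetical decoded output in Gemma's turn-based format:
raw = (
    "<bos><start_of_turn>user\nWhat is MongoDB?<end_of_turn>\n"
    "<start_of_turn>model\nMongoDB is a document database.<end_of_turn>"
)
answer = extract_model_response(raw)
```

Only the text of the model's turn survives; the user turn and the turn markers are stripped away.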
After initializing the Gemma 2 (2B) model and its associated tokenizer, the next step is to generate a response using a chat-style input format. This tutorial leverages the model's chat template functionality to create a more conversational experience. The Gemma tokenizer's apply_chat_template method is utilized to properly format the input for the instruction-tuned model.
```python
chat = [
    {"role": "user", "content": combined_information},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150, do_sample=True, temperature=0.7)

response = tokenizer.decode(outputs[0])
```
The code snippet above performs the following operations:
  1. Creation of chat-style input: A single user message containing the combined_information is formatted as a chat-style input.
  2. Application of chat template: The tokenizer's apply_chat_template method formats this input according to Gemma's specific chat template. This process includes adding a generation prompt to guide the model's response.
  3. Tokenization and encoding: The formatted prompt is tokenized and encoded, preparing it for input into the model.
  4. Response generation: The model generates a response using specified parameters, such as max_new_tokens and temperature. These parameters control the length and randomness of the output.
  5. Decoding and extraction: Finally, the generated output is decoded and processed to extract the model's response, making it ready for response extraction and other downstream processes.
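For intuition, a single user turn formatted for Gemma looks roughly like the sketch below. This is a hand-written approximation of what apply_chat_template produces for the instruction-tuned model; real code should always rely on the tokenizer's template rather than constructing the prompt by hand.

```python
def format_gemma_prompt(user_message: str) -> str:
    # Approximation of a single-turn Gemma chat prompt with the
    # generation prompt appended; the authoritative format comes
    # from tokenizer.apply_chat_template.
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("Summarize the market report.")
```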
```python
model_output = extract_model_response(response)
print(model_output)
```

Step 9: LLM evaluation

LLM evaluation, also referred to as LLM evals, is the systematic process of profiling foundation models, or their fine-tuned variants, to understand their performance on specialized or general-purpose tasks, their reliability under certain conditions, their effectiveness in particular use cases, and other measurement criteria that together give an overview of a model's overall ability.
The particular category of LLM evaluation performed in this tutorial is LLM system evaluation, defined as the end-to-end performance overview of a system or infrastructure that incorporates an LLM; examples of such systems are RAG pipelines and agentic systems.
Although the RAG pipeline has multiple components, this tutorial will only evaluate the generation component. The generation component involves using the retrieved relevant information from the database to produce a coherent and accurate response to the user's query. Specifically, it includes:
  1. Integrating the retrieved context (source_information) with the original query (query).
  2. Prompting the language model with this combined information (combined_information).
  3. Generating a response (model_output) that addresses the user's question or request.
  4. Ensuring the generated content is relevant, coherent, and faithful to the retrieved information. This is what we are doing in this section.
This component is crucial as it determines the final output quality of the RAG system, directly impacting user experience and the system's overall effectiveness. The evaluation metrics and factors for the generation component of a RAG pipeline are relevancy, faithfulness, coherence, and accuracy.
The LLM evaluation framework leveraged in this tutorial is DeepEval. DeepEval provides the evaluation metrics: answer relevance, faithfulness, hallucination, and others, as well as the ability to define custom metrics. This tutorial will utilize just two metrics: answer relevance and faithfulness.
Faithfulness: This quantifies the extent to which the factual information in the generated text aligns with the retrieved documents or provided context. In the DeepEval library, faithfulness is measured by using an LLM to extract the claims made in the model's response to a user query and the provided context, and then using the same LLM to determine whether each extracted claim is supported by the retrieved context. The faithfulness score is the number of supported claims in the model's response divided by the total number of claims in the response.
```
faithfulness = (Number of truthful claims) / (Total number of claims in response)
```
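As a worked example with hypothetical claim counts: if the evaluator extracts 15 claims from the model's response and finds 14 of them supported by the retrieved context, the score works out to 0.9333, the same value that appears in the example output later in this section.

```python
# Hypothetical claim counts for illustration.
truthful_claims = 14
total_claims = 15
faithfulness = truthful_claims / total_claims  # 0.9333...
```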
To begin the evaluation step, ensure that the DeepEval Python library is installed within your development environment. Next, import LLMTestCase and FaithfulnessMetric from the DeepEval library.
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric
```
```python
actual_output = model_output

retrieval_context = [source_information]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```
The code snippet above sets up the variables actual_output and retrieval_context to store the model's output and the source information used as context. A FaithfulnessMetric object is instantiated with a threshold of 0.7; outputs whose faithfulness score falls below this threshold are flagged as failing in the evaluation output.
The evaluation procedure uses "GPT-4" as the evaluation model and is configured to include a reason for its assessment, using the LLM as a judge approach for evaluation. An LLMTestCase is constructed, encompassing the input query, the model's actual output, and the retrieval context. The code snippet shows the invocation of the metric's measure method, passing the test case as an argument. Finally, the code snippet outputs the calculated faithfulness score and the accompanying explanation for the evaluation.
Below is an example of the output of the faithfulness evaluation process.
Event loop is already running. Applying nest_asyncio patch to allow async execution...
0.9333333333333333
The score is 0.93 because there is a minor discrepancy with the year of CDDY's strategic acquisitions. The actual output mistakenly indicates they occurred in 2023, while the context clearly states they happened in 2024.
The faithfulness score is 0.93, which can be interpreted as Gemma 2 (2B)'s response being 93% faithful to the provided context. The accompanying explanation, generated by the evaluation LLM, states the rationale for the score. This human-readable explanation feature of the DeepEval library enhances the interpretability of the evaluation results, providing a clear rationale for the faithfulness score and highlighting discrepancies or alignments between the model's output and the source context.
Answer relevance: This quantifies how well the generated response aligns with the initial input query. Answer relevance assesses the association between the response and the query without evaluating factual accuracy.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# The model's generated response to the user query
actual_output = model_output

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

test_case = LLMTestCase(
    input=query,
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
The result of the answer relevance evaluative process is as follows:
Event loop is already running. Applying nest_asyncio patch to allow async execution...
1.0
The score is 1.00 because the response accurately selected a company from the provided information and justified why it's a safe long-term investment, addressing all aspects of the input.
In this section, you were introduced to LLM evaluation, specifically focusing on LLM system evaluation within the context of a RAG pipeline. You implemented two key evaluative methods for the generation component of a RAG pipeline, answer relevance and faithfulness, utilizing the DeepEval library.
By leveraging the DeepEval library, you could quantify these aspects of the model's performance and gain valuable insights into its strengths and areas for improvement. The library's ability to provide numerical scores and human-readable explanations enhances the interpretability of your evaluation results.
Moving forward, you can explore the other evaluative metrics the DeepEval library offers.

Conclusion

In this tutorial, you explored building an asset management analyst assistant using Google's Gemma 2 (2B) open model, a RAG pipeline with MongoDB, and LLM evaluation techniques. This showcase demonstrates the potential of open models in creating efficient, cost-effective AI solutions for specific use cases.
The implementation of Gemma 2 (2B), a lightweight two billion-parameter model, highlights the growing capabilities of open-source models in real-world applications and environments with limited compute availability. This allows developers to create AI-driven solutions that balance performance with resource efficiency.
MongoDB's operational and vector database role underscores its flexibility and scalability in modern AI applications and infrastructure. Its vector search capabilities and Python integration make it well-suited for RAG systems, enabling efficient information retrieval.
The focus on LLM evaluation using the DeepEval library emphasizes the importance of assessing AI system performance. By implementing metrics like faithfulness and answer relevance, you gained insights into the model's strengths and areas for improvement.
Moving forward, you can explore more evaluation approaches or chunking strategies to optimize the performance and accuracy of RAG pipelines.

FAQs

1. What is Gemma 2, and how does it differ from other language models?
Gemma 2 is a family of lightweight, state-of-the-art open models developed by Google. It's built using the same building blocks as the larger Gemini models but is designed to be more efficient and suitable for use cases that require lower computational resources. The Gemma 2 (2B) variant has two billion parameters and is particularly useful for applications that need efficiency, cost-effectiveness, privacy, and low latency.
2. How does MongoDB contribute to the RAG pipeline?
MongoDB serves as both an operational and vector database in this RAG application. It efficiently stores, queries, and retrieves vector embeddings. The tutorial demonstrates how to set up a MongoDB Atlas cluster, create a vector search index, and use MongoDB's aggregation pipeline for semantic search capabilities, which are crucial for the retrieval component of the RAG system.
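To illustrate the retrieval step the FAQ refers to, a semantic search can be expressed as an aggregation pipeline built around the $vectorSearch stage. The index name, field names, and parameter values below are placeholder assumptions, not values from the tutorial:

```python
def vector_search_pipeline(query_embedding, index_name="vector_index",
                           path="embedding", limit=4):
    """Build a MongoDB aggregation pipeline for semantic search.

    The $vectorSearch stage performs an approximate nearest-neighbor search
    over the indexed embedding field; $project shapes the returned documents.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,           # name of the Atlas Vector Search index
                "path": path,                  # document field holding the embedding
                "queryVector": query_embedding,
                "numCandidates": 150,          # candidates considered before ranking
                "limit": limit,                # number of results to return
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,                     # assumed field containing the chunk text
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

pipeline = vector_search_pipeline([0.1, 0.2, 0.3])
print(pipeline[0]["$vectorSearch"]["limit"])  # 4
```

With a pymongo collection handle, the pipeline would be executed via collection.aggregate(pipeline).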
3. What is LLM evaluation, and which metrics are used in this tutorial?
LLM evaluation, or LLM evals, is the systematic process of assessing language models' performance on various tasks and use cases. In this tutorial, the focus is on LLM system evaluation, specifically for a RAG pipeline. Two key metrics are used: faithfulness and answer relevance. Faithfulness measures how well the generated response aligns with the provided context, while answer relevance assesses how well the response addresses the initial query.
4. What is the purpose of the DeepEval library in this tutorial?
The DeepEval library is used to conduct LLM evaluations. It provides metrics such as FaithfulnessMetric and AnswerRelevancyMetric to assess the quality of the generated responses. The library allows for setting evaluation thresholds and using models like GPT-4. It includes features for providing human-readable explanations of the evaluation results, enhancing the interpretability of the assessment process.