You Don't Always Need Frontier Models to Power Your RAG Architecture

Ashwin Gangadhar

Frontier AI models are driving the widespread adoption of generative AI by demonstrating unprecedented capabilities. However, their deployment often entails significant costs. The strategic partnership between MongoDB and Fireworks.AI addresses these cost implications by offering solutions that optimize performance and resource utilization. This collaboration leverages MongoDB's efficient data management alongside Fireworks.AI's model optimization tools to enhance speed and efficiency while minimizing operational expenses.

In the current AI environment, achieving high performance is crucial, but equally important is optimizing the total cost of ownership (TCO). Businesses must focus on the price-performance ratio, ensuring that improvements in speed and efficiency lead to real cost savings.

This article will address the following topics:

  • How to build an agentic RAG using a Fireworks AI hosted LLM and MongoDB Atlas for retrieval.

  • Strategies for optimizing retrieval-augmented generation (RAG) applications using MongoDB Atlas and large language models (LLMs) through effective query and response caching.

  • Techniques on the Fireworks AI platform for fine-tuning models, accelerating LLM inference, and reducing hardware needs.

  • Steps to fine-tune a pretrained small language model (SLM) with PEFT techniques using the Fireworks platform.

Readers will gain a practical, in-depth strategy to improve AI performance while lowering costs. This will be demonstrated with examples and performance data.

Unlocking efficiency and performance with MongoDB and Fireworks AI

MongoDB Atlas is renowned for its flexible schema, efficient indexing, and distributed architecture, allowing organizations to scale their data infrastructure on demand. It is a general-purpose database that combines flexibility, suitability for AI workloads, and ACID transactions, and it lets users run their applications anywhere without compromising on security. In short, MongoDB offers a comprehensive, secure, and efficient database solution for modern applications, catering to a range of technical and strategic needs.

Fireworks AI is recognized for its suite of technologies focused on optimizing the performance and efficiency of large language models (LLMs). Their offerings span model optimization tools, a specialized FireOptimizer framework, and innovative attention mechanisms like FireAttention. These solutions aim to enhance inference speeds, reduce operational costs, and improve resource utilization. Furthermore, Fireworks AI provides parameter-efficient fine-tuning methods and adaptive speculative execution to tailor models for specific applications. Their advancements also include optimized processing for long-context tasks and techniques to maximize throughput and cost-effectiveness in model serving. Fireworks also provides model serving for a catalog of readily available models, as well as a platform for customers to host and serve custom LLM implementations.

Core capabilities: FireOptimizer and FireAttention

FireOptimizer is Fireworks AI's adaptation engine for customizing AI model performance in production environments. It automates latency and quality optimization for unique inference workloads, tailoring performance across the hardware, model, and software layers with techniques like customizable quantization, fine-tuning, and adaptive caching. Its hallmark feature, adaptive speculative execution, automatically trains workload-specific draft models to parallelize token generation, achieving up to 3x latency improvements compared to generic speculative decoding. By raising the draft model's hit rate, this method significantly boosts responsiveness without compromising accuracy.

Figure 1. FireOptimizer platform.
Diagram showing at a high level how the FireOptimizer platform works. Optimization goals and production workloads and profiles go into the tool. FireOptimizer, with adaptive speculative execution, adaptive caching, adaptive hardware mapping, customized quantization, supervised fine-tuning, and reinforcement learning tuning then produces faster, higher quality inferences.

FireAttention, Fireworks AI's custom-built inference engine, significantly enhances LLM inference speed on GPUs. It achieves this by utilizing a novel micro-precision data format and rewriting key GPU kernels (such as attention and matrix multiplication) from scratch, aligning them with underlying hardware instructions. While FireAttention prioritizes speed, potentially at the cost of initial accuracy, this is mitigated through Quantization-Aware Training (QAT). This approach allows fine-tuned models to maintain high precision while reducing their memory footprint. Benchmarks demonstrate FireAttention V4's superior performance over SGLang on H200 and TRT-LLM on B200, particularly in MMLU Pro tests. Overall, FireAttention V4 represents a breakthrough in achieving low-latency, high-efficiency LLM inference, especially beneficial for frontier models like DeepSeek R1.

Key benefits:

  • Faster inference: FireOptimizer's adaptive speculative execution has demonstrated up to 3x latency improvements in production workloads across various models, ensuring highly responsive applications.

  • Hassle-free optimization: FireOptimizer automates the complexities of optimization, allowing users to concentrate on application development.

FireOptimizer

FireOptimizer improves batch inference by integrating with MongoDB for efficient model fine-tuning and streamlined deployment. This multi-layered customization is vital for compound AI systems, ensuring consistent model alignment. Available for enterprise on-premises and own-cloud deployments, FireOptimizer enhances traditional inference performance through techniques like adaptive speculative execution, caching, customizable quantization, personalized fine-tuning at scale, and customizable hardware mapping.

In this blog post, we’ll explore how FireOptimizer can be used to perform Parameter-Efficient Fine-Tuning (PEFT), so that a small language model (SLM) can carry out personalized tasks such as RAG over a private dataset. This exercise demonstrates how generative AI can be adopted effectively at scale, including in critical domains.

Survey of fine-tuning strategies for smaller, efficient models

Smaller language models present significant opportunities for tailored adaptation while using fewer resources. The ongoing evolution in this field is fueled by increasing demand for deploying optimized LLMs across diverse environments, including cloud platforms, edge devices, and specialized hardware. These fine-tuning approaches can be categorized as follows:

  • Additive parameter-efficient fine-tuning (PEFT): This class of methods augments pre-trained models with new trainable parameters without altering the original weights.
    • Adapters: These involve inserting small, trainable modules, known as adapters, within the pre-trained model's layers. These adapters learn task-specific adjustments, enabling adaptation to new tasks without changing the pre-existing parameters.
    • Soft prompts: These are trainable vector embeddings appended to the input sequence, acting as guiding signals to influence the model's output for a specific task.
    • Prefix tuning: This technique adds a trainable prefix to the input sequence. This prefix learns task-specific information without requiring modifications to the core model architecture.
  • Reparametrization PEFT: This approach reduces the number of trainable parameters by reparameterizing existing model weights using low-rank approximations.
    • Low-Rank Adaptation (LoRA): LoRA approximates weight updates in the attention layers of a pre-trained model using low-rank matrices, significantly decreasing the number of trainable parameters.
    • Quantized LoRA (QLoRA): QLoRA builds upon LoRA by integrating quantization methods, further decreasing memory footprint and computational expenses.
  • Selective fine-tuning: This category focuses on fine-tuning only specific parameters of the pre-trained model, leading to improved computational efficiency.
    • BitFit: This method fine-tunes only the bias terms, or other designated parameters, of the pre-trained model, enhancing computational efficiency.
    • DiffPruning: This technique learns a sparse, task-specific difference vector over the pretrained weights, so only the parameters that matter for the task are effectively updated, reducing the number of trainable parameters.
  • Layer freezing strategies: These strategies involve selectively freezing certain layers of the pre-trained model while fine-tuning others to optimize the adaptation process.
    • Freeze and reconfigure (FAR): FAR involves freezing specific layers of the pre-trained model and fine-tuning the remaining layers to optimize model adaptation.
    • FishMask: This technique uses a mask to selectively freeze or fine-tune layers, optimizing adaptation for specific tasks.

Parameter-Efficient Fine-Tuning (PEFT) is a popular technique for adapting small pre-trained models to niche tasks. By adjusting only a small portion of the model's parameters, PEFT prevents overfitting, especially on smaller datasets, and greatly reduces computational and memory demands compared to full fine-tuning. Additionally, PEFT helps mitigate catastrophic forgetting in LLMs. This approach allows for efficient model customization in resource-constrained environments without the need for complete retraining.

Leveraging PEFT LoRA techniques in Fireworks AI, combined with the availability of trace data and labeled data, allows for efficient fine-tuning of smaller models.
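As a rough illustration of why LoRA is so parameter-efficient, the short calculation below (a hedged sketch with hypothetical layer sizes, not a measurement of any Fireworks model) counts the trainable parameters for a single weight matrix under full fine-tuning versus a rank-8 LoRA update.

# Illustrative LoRA parameter count for one weight matrix W of shape (d, k).
# LoRA freezes W and learns a low-rank update B @ A, with B: (d, r) and A: (r, k).
d, k, r = 4096, 4096, 8            # hypothetical hidden sizes and LoRA rank

full_params = d * k                # parameters updated by full fine-tuning of this matrix
lora_params = d * r + r * k        # parameters in the low-rank factors B and A

print(f"Full fine-tuning: {full_params:,} trainable parameters")   # 16,777,216
print(f"LoRA (rank {r}):    {lora_params:,} trainable parameters")  # 65,536
print(f"Reduction:        {full_params // lora_params}x fewer trainable parameters")  # 256x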

To demonstrate the practical implications of using a small language model (SLM), we will build an agentic RAG application using MongoDB Atlas, illustrating how MongoDB can power semantic search and also serve as a semantic caching layer. The application doubles as a step-by-step guide: we first build a simple, task-driven application using a frontier LLM such as Llama Maverick, then use the data generated in this setting to fine-tune an SLM that performs a similar operation satisfactorily while consuming far fewer resources.

Step-by-step guide for building an agentic RAG application with MongoDB Atlas

The sample code below demonstrates an end-to-end Agentic Retrieval-Augmented Generation (RAG) workflow using LangChain, MongoDB Atlas Vector Search, and Fireworks LLMs. Below is a summary of the key steps and components:

1. Data loading & preprocessing

  • PDF loading: The EU Act regulations PDF is loaded using PDFLoader.
  • Text splitting: The document is split into manageable chunks using RecursiveCharacterTextSplitter for efficient retrieval and embedding.
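
A minimal sketch of this loading and chunking step might look like the following; the file path and chunk sizes are illustrative, and the notebook may use slightly different loader classes.

# Load the EU AI Act PDF and split it into overlapping chunks for retrieval.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("eu_ai_act.pdf")   # hypothetical local path to the regulations PDF
docs = loader.load()

# Chunk sizes here are illustrative; tune them for your documents.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)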

2. Embedding & vector store setup

  • Embeddings: Sentence-transformers MPNet model is used to generate vector embeddings for each text chunk.
  • MongoDB Atlas Vector Search: The embeddings and text chunks are stored in MongoDB, and a vector search index is created for similarity search.
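
Under the same assumptions, embedding the chunks and wiring them into Atlas Vector Search could look roughly like this; the database, collection, and index names are placeholders, and the Atlas Vector Search index itself must be created separately (via the Atlas UI, CLI, or driver).

# Embed the chunks with an MPNet sentence-transformers model and store them in Atlas.
from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

client = MongoClient("<mongodb_atlas_connection_string>")
collection = client["agenticrag"]["documents"]          # hypothetical database/collection names

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=chunks,                  # chunks produced in the previous step
    embedding=embeddings,
    collection=collection,
    index_name="vector_index",         # placeholder Atlas Vector Search index name
)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})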

3. LLM & caching

  • LLM setup: Meta Llama Maverick is used as the main LLM, with a custom output parser to clean up responses.
  • Semantic cache: MongoDB Atlas Semantic Cache is configured to cache LLM responses and avoid redundant computation.
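
A hedged sketch of this configuration is shown below; the Fireworks model ID and cache index name are assumptions, and the exact cache parameters in the notebook may differ.

# Configure the Fireworks-hosted LLM and an Atlas-backed semantic cache.
from langchain_core.globals import set_llm_cache
from langchain_fireworks import ChatFireworks
from langchain_mongodb import MongoDBAtlasSemanticCache

llm = ChatFireworks(
    model="accounts/fireworks/models/llama4-maverick-instruct-basic",  # assumed model ID
    temperature=0,
)

# Cache responses keyed by semantic similarity of prompts to skip repeated LLM calls.
set_llm_cache(
    MongoDBAtlasSemanticCache(
        connection_string="<mongodb_atlas_connection_string>",
        database_name="agenticrag",
        collection_name="cache",
        embedding=embeddings,            # reuse the embedding model from the previous step
        index_name="cache_vector_index", # placeholder cache index name
    )
)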

4. Agentic RAG workflow

  • StateGraph construction: The workflow is modeled as a state machine with the following steps (a minimal skeleton of this graph is sketched below):
    • plan_step: Reformulates the user query for optimal retrieval.
    • retrieve_documents_step: Retrieves relevant documents from the vector store.
    • execute_step: Generates an answer using the LLM and the retrieved context.
    • validate_step: Uses the LLM to validate the relevance of the answer.
    • should_continue: Decides whether to proceed to the execute step or go back to the plan step.

Steps to build the Agentic RAG as described above are available in the notebook here.
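
For orientation, a minimal skeleton of such a LangGraph state machine is sketched below; the node bodies are intentionally elided and the routing logic is simplified, so treat this as an outline of the notebook's graph rather than a drop-in implementation.

# Skeleton of the agentic workflow graph (node bodies elided; see the notebook).
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class RAGState(TypedDict):
    question: str
    documents: List[str]
    answer: str
    score: float

def plan_step(state: RAGState) -> dict: ...                 # reformulate the user query
def retrieve_documents_step(state: RAGState) -> dict: ...   # query Atlas Vector Search
def execute_step(state: RAGState) -> dict: ...              # answer with the LLM + context
def validate_step(state: RAGState) -> dict: ...             # score answer relevance with the LLM

def should_continue(state: RAGState) -> str:
    # Simplified routing: answer if documents were retrieved, otherwise re-plan.
    return "execute_step" if state.get("documents") else "plan_step"

workflow = StateGraph(RAGState)
workflow.add_node("plan_step", plan_step)
workflow.add_node("retrieve_step", retrieve_documents_step)
workflow.add_node("execute_step", execute_step)
workflow.add_node("validate_step", validate_step)

workflow.add_edge(START, "plan_step")
workflow.add_edge("plan_step", "retrieve_step")
workflow.add_conditional_edges("retrieve_step", should_continue,
                               {"execute_step": "execute_step", "plan_step": "plan_step"})
workflow.add_edge("execute_step", "validate_step")
workflow.add_edge("validate_step", END)

app = workflow.compile()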

Once built, the graph for your agentic workflow looks as shown in Figure 2.

Figure 2. Agentic RAG workflow graph.
This workflow starts at the top with _start_. From here, it goes to plan_step. From here, there is a path to the left and a path to the right. The path to the right jumps to the bottom of the diagram to validate_step and then to _end_. The left side path goes from plan_step to retrieve_step, then to execute_step, then to validate_step and finally to _end_.

Running the agentic RAG workflow

Invoke the workflow with a user query:

query = "In the EU act what are the various biometric categorizations of data?"
app.invoke({"question": query}, config={"recursion_limit": 30})

Response:

In EU legislation, biometric data is categorized and regulated under several key frameworks, primarily focusing on its use, protection, and specific applications. Based on the provided context and relevant regulations:

### 1. **Biometric Verification (Authentication)**  
   - **Definition**: Automated one-to-one comparison of biometric data (e.g., fingerprints, facial recognition) to verify an individual’s identity against previously stored data.  
   - **Purpose**: Authentication (e.g., unlocking devices, accessing services).  
   - **Legal Reference**: Article 3(36) of the cited regulations.

### 2. **Biometric Identification (Matching)**  
   - **Definition**: One-to-many comparison of biometric data against databases to establish identity (e.g., border control, law enforcement).  
   - **Examples**:  
     - **Eurodac** (Regulation 2024/1358): Compares biometric data (e.g., fingerprints) to identify illegally staying third-country nationals or stateless persons.  
     - **Law Enforcement**: Used to locate suspects, victims of trafficking, or missing persons under strict conditions (Article 3(38), 3(39), and provisions like point (i)–(iii)).  

### 3. **Special Categories of Personal Data**  
   - **Status**: Biometric data is classified as a "special category" under:  
     - **GDPR (Article 9(1), Regulation (EU) 2016/679)**: Requires enhanced protections due to its sensitivity.  
     - **Directive (EU) 2016/680** and **Regulation (EU) 2018/1725**: Extend these protections to law enforcement and EU institutions.  
   - **Safeguards**: Pseudonymization, strict access controls, confidentiality obligations, and mandatory deletion after retention periods (points (c)–(e) in the context).  

### 4. **Operational and Sensitive Data**  
   - **Sensitive Operational Data**: Biometric data used in criminal investigations or counter-terrorism, where disclosure could jeopardize proceedings (Article 3(38)).  
   - **Emotion Recognition Systems**: While not explicitly labeled as biometric, these systems infer emotions/intentions (Article 3(39)) and may intersect with biometric processing if tied to identifiable individuals.  

### 5. **Law Enforcement Exceptions**  
   - Biometric data may be processed for:  
     - Preventing terrorist attacks or imminent threats (point (ii)).  
     - Investigating serious crimes (punishable by ≥4 years’ imprisonment) under Annex II (point (iii)).  

### Key Requirements:  
   - **Security**: State-of-the-art measures, pseudonymization, and access documentation (point (c)).  
   - **Restrictions**: Prohibition on unauthorized transfers (point (d)).  
   - **Retention**: Deletion after correcting bias or reaching retention limits (point (e)).  

These categorizations ensure biometric data is used proportionally, with stringent safeguards to protect privacy and fundamental rights under EU law.

Validation Score:

Score: 0.9

This notebook provides a modular, agentic RAG pipeline that can be adapted for various document retrieval and question-answering tasks using MongoDB and LLMs.

Step-by-step guide for fine-tuning a small language model with Fireworks AI

Current challenges with frontier models

The large language model used in the preceding example, accounts/fireworks/models/deepseek-r1, can result in slow application response times due to the significant computational resources required for its billions of parameters. An agentic RAG task involves multiple LLM invocations for steps such as generating retrieval questions, producing answers, and validating the generated results against the user's question; with each query potentially taking 5 or more seconds, the total response time can stretch to 30-40 seconds. Additionally, deploying and scaling LLMs for a large user base can be complex and expensive. The example code mitigates part of this with a semantic cache; however, caching only addresses repeated queries to the system.

By leveraging small language models (SLMs), enterprises can achieve significant gains in processing speed and cost-efficiency. SLMs require less computational power, making them ideal for resource-constrained devices, while delivering faster response times and lower operational costs. There is a significant caveat, however: SLMs come with limitations such as reduced generalization, limited context retention, and lower accuracy on complex tasks compared to larger models. They may struggle with nuanced reasoning, exhibit increased biases, and generate hallucinations due to their constrained training data and fewer parameters. While they are computationally efficient and well-suited for lightweight applications, their ability to adapt across domains remains restricted; for example, a pretrained SLM such as accounts/fireworks/models/deepseek-r1-distill-qwen-1p5b does not produce satisfactory results in our agentic RAG setting. It is unable to perform validation scoring and tends to hallucinate, generating a response even when context is provided.

Adapting a pre-trained Small Language Model (SLM) for specialized applications such as agentic Retrieval-Augmented Generation (RAG) utilizing private knowledge bases offers a cost-effective alternative to frontier models while maintaining similar performance levels. This strategy also provides scalability for numerous clients, ensuring Service Level Agreements (SLAs) are met.

Parameter-Efficient Fine-Tuning (PEFT) techniques such as Quantized Low-Rank Adaptation (QLoRA) substantially improve efficiency by focusing optimization on a limited set of parameters, lowering memory demands and operational expenses. Integrating with MongoDB streamlines data management and supports efficient model fine-tuning workflows.
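
As a rough, back-of-the-envelope illustration (approximate numbers, ignoring activations, adapter optimizer state, and framework overhead), quantizing the weights of a 1.5B-parameter SLM shows why QLoRA fits on modest hardware:

# Approximate weight memory for a 1.5B-parameter SLM at different precisions.
params = 1.5e9

fp16_gb = params * 2 / 1e9         # 16-bit weights: ~3.0 GB
int4_gb = params * 0.5 / 1e9       # 4-bit quantized weights: ~0.75 GB

print(f"FP16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")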

MongoDB's unique value

MongoDB is integral, providing seamless data management and real-time integration that improves operational efficiency. By storing trace data as JSON and enabling efficient retrieval and storage, MongoDB adds substantial value to the process of fine-tuning models. MongoDB also doubles as a caching layer, avoiding unnecessary LLM invocations on repeated requests for the same data.

The following steps walk through how to use the Fireworks platform and tooling to fine-tune an SLM.

Figure 3. The fine-tuning process explained.
This diagram starts on the left with user input, which funnels into the agentic RAG workflow. This workflow connects with MongoDB Atlas and the LLM. Atlas sends data to the task-driven training data generation step, which then feeds into Fireworks and then into the LLM. The LLM and the SLM + LoRA adapter then produce the output.

To enhance RAG applications, the initial step involves collecting data relevant to the specific task for fine-tuning. MongoDB Atlas, a flexible database, can be utilized to store LLM responses in a cache. For example, in our agentic RAG approach, we can create questions using diverse datasets and store their corresponding answers in MongoDB Atlas. A powerful LLM is useful for generating these initial responses or task-specific data during this simulation phase; even a small-scale fine-tuning process requires at least 1,000 examples.

Subsequently, these generated responses need to be converted into the required format for the Fireworks.ai platform to begin the fine-tuning process. The cache.jsonl file, used later in fine-tuning, can be created by executing the provided code.

from pymongo import MongoClient
import pandas as pd
import json

# Connect to the Atlas cluster and read the cache collection populated by the agentic RAG runs.
client = MongoClient("<mongodb_atlas_connection_string>")
cache_col = client["agenticrag"]["cache"]
df = pd.DataFrame.from_records(cache_col.find())

# Each cache document stores the serialized prompt ("text") and the serialized
# LLM generation ("return_val"). Unpack both into chat-style message pairs.
user_msgs = [{"role": "user", "content": json.loads(text)[0]["kwargs"]["content"]}
             for text in df.text]
assistant_msgs = [{"role": "assistant", "content": json.loads(json.loads(text)[0])["kwargs"]["text"]}
                  for text in df.return_val]

# Wrap every (user, assistant) pair in the {"messages": [...]} structure used for
# conversational fine-tuning, and write one JSON object per line.
messages = [{"messages": [user, assistant]} for user, assistant in zip(user_msgs, assistant_msgs)]
with open("cache.jsonl", "w") as f:
    for item in messages:
        f.write(json.dumps(item) + "\n")
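
Each line of the resulting cache.jsonl holds one chat-style training example; an illustrative (truncated) line looks like this:

{"messages": [{"role": "user", "content": "In the EU act what are the various biometric categorizations of data?"}, {"role": "assistant", "content": "In EU legislation, biometric data is categorized and regulated under several key frameworks, ..."}]}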

Now that we have prepared the dataset and generated our cache.jsonl file, we can fine-tune the pre-trained deepseek-r1-distill-qwen-1p5b model by following the steps below.

Prerequisites:

  • Install firectl: Use the command pip install firectl to install the Fireworks command-line tool.
  • Authenticate: Log in to your Fireworks account using firectl login.
  • Prepare Dataset: Ensure your fine-tuning dataset (created during the data generation process) is ready.

Steps:

1. Upload dataset: Upload your prepared dataset to the Fireworks platform using the following command, replacing <dataset_name> with your desired name and cache.jsonl with your dataset file:

firectl create dataset <dataset_name> cache.jsonl

2. Create fine-tuning job: Initiate a fine-tuning job by specifying the base model, dataset, output model name, LoRA rank, and number of epochs. For example:

firectl create sftj --base-model accounts/fireworks/models/deepseek-r1-distill-qwen-1p5b \
  --dataset <dataset_name> --output-model ragmodel --lora-rank 8 --epochs 1

The output will provide details about the job, including its name, creation time, dataset used, current state, and the name of the output model.

3. Monitor fine-tuning: Track the progress of your fine-tuning job using the Fireworks AI portal. This allows you to ensure the process is running as expected.

4. Deploy fine-tuned model: Once the fine-tuning is complete, deploy the model for inference on the Fireworks platform. This involves two steps:

Deploy the base model used for fine-tuning:

firectl create deployment accounts/fireworks/models/deepseek-r1-distill-qwen-1p5b --enable-addons --wait

Deploy the fine-tuned LoRA adapter:

firectl load-lora ragmodel --deployment <deployment_id>

5. Use deployed model: After deployment, the model ID (e.g., models/ragmodel) can be used to invoke the fine-tuned language model via your preferred LLM framework, leveraging the Fireworks platform's serverless API, as sketched below.
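
As a hedged example, invoking the fine-tuned adapter from LangChain might look like the snippet below; the account and model identifiers are placeholders, and this llm object could be swapped into the agentic RAG graph in place of the frontier model.

# Call the fine-tuned SLM served on Fireworks (identifiers are placeholders).
from langchain_fireworks import ChatFireworks

slm = ChatFireworks(model="accounts/<your_account_id>/models/ragmodel", temperature=0)
response = slm.invoke("In the EU act what are the various biometric categorizations of data?")
print(response.content)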

Summary

Fine-tuning smaller language models (SLMs) for Retrieval Augmented Generation (RAG) using platforms like Fireworks AI offers significant advantages over relying solely on large frontier models. This approach drastically improves response times, reducing latency from around 5 seconds with a large LLM to 2.3 seconds with a fine-tuned SLM, while also substantially decreasing memory and hardware requirements. By leveraging parameter-efficient fine-tuning techniques and integrating with data management solutions like MongoDB, businesses can achieve faster, more cost-effective AI performance for RAG applications, making advanced AI capabilities more accessible and sustainable.

Conclusion

The collaboration between MongoDB and Fireworks AI offers a powerful synergy for enhancing the efficiency and affordability of Large Language Model (LLM) training and deployment. Fireworks AI's utilization of Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA significantly curtails the computational resources necessary for fine-tuning LLMs by focusing on low-rank adaptation and quantization. This directly translates to substantial reductions in the costs associated with this crucial process. Complementarily, MongoDB's robust infrastructure, characterized by its distributed architecture, flexible schema, and efficient indexing capabilities, provides the ideal data management foundation. It allows for on-demand scaling of data infrastructure while minimizing storage expenses, thereby contributing to lower capital and operational expenditures.

This integration further fosters streamlined workflows between data and AI processes. MongoDB's capacity for real-time data integration ensures that AI models have immediate access to the most current information, thereby improving operational efficiency and the relevance of the models' insights. When combined with Fireworks AI's fine-tuning tools, this creates a cohesive environment where AI models can be continuously updated and refined. Moreover, the partnership simplifies the development of robust Retrieval Augmented Generation (RAG) solutions. MongoDB Atlas offers a scalable platform for storing embeddings, while Fireworks AI provides managed LLM hosting and other essential features. This seamless combination enables the creation of scalable and intelligent systems that significantly enhance user experience through more effective and relevant information retrieval.

Organizations adopting this strategy can achieve accelerated AI performance, resource savings, and future-proof solutions—driving innovation and competitive advantage across different sectors.

Further reading:

  • Atlas Vector Search: Learn about AI and vector search; generate, store, index, and search embeddings in MongoDB Atlas for semantic search; build hybrid search with Atlas Search and Atlas Vector Search; use vector search for a RAG chatbot; and manage indexes with the Atlas CLI and MongoDB Shell.

  • FireAttention V4: Enables cost-effective GPU inference and provides industry-leading latency and cost efficiency with FP4.

  • FireOptimizer: Allows users to customize latency and quality for production inference workloads.