MongoDB Developer Blog

Deep dives into technical concepts, architectures, and innovations with MongoDB.

You Don't Always Need Frontier Models to Power Your RAG Architecture

Frontier AI models are driving the widespread adoption of generative AI by demonstrating unprecedented capabilities, but deploying them often carries significant costs. The strategic partnership between MongoDB and Fireworks AI addresses these cost implications by combining MongoDB's efficient data management with Fireworks AI's model optimization tools to improve speed and efficiency while minimizing operational expenses. In the current AI environment, achieving high performance is crucial, but so is optimizing the total cost of ownership (TCO). Businesses must focus on the price-performance ratio, ensuring that improvements in speed and efficiency translate into real cost savings.

This article will address the following topics:

- How to build an agentic RAG application using a Fireworks AI-hosted LLM and MongoDB Atlas for retrieval.
- Strategies for optimizing retrieval-augmented generation (RAG) applications using MongoDB Atlas and large language models (LLMs) through effective query and response caching.
- Techniques on the Fireworks AI platform for fine-tuning models, accelerating LLM inference, and reducing hardware needs.
- Steps to fine-tune a pretrained SLM with PEFT techniques using the Fireworks platform.

Readers will gain a practical, in-depth strategy for improving AI performance while lowering costs, demonstrated with examples and performance data.

Unlocking efficiency and performance with MongoDB and Fireworks AI

MongoDB Atlas is renowned for its flexible schema, efficient indexing, and distributed architecture, allowing organizations to scale their data infrastructure on demand. It is a general-purpose database that combines flexibility, suitability for AI workloads, and ACID transactions, and it lets users run their applications anywhere without compromising on security. MongoDB offers a comprehensive, secure, and efficient database solution for modern applications, catering to a wide range of technical and strategic needs.

Fireworks AI is recognized for its suite of technologies focused on optimizing the performance and efficiency of large language models (LLMs). Its offerings span model optimization tools, the specialized FireOptimizer framework, and innovative attention mechanisms like FireAttention. These solutions aim to increase inference speed, reduce operational costs, and improve resource utilization. Fireworks AI also provides parameter-efficient fine-tuning methods and adaptive speculative execution to tailor models for specific applications, along with optimized processing for long-context tasks and techniques that maximize throughput and cost-effectiveness in model serving. In addition to serving a catalog of readily available models, Fireworks AI provides a platform for customers to host and serve custom LLM implementations.

Core capabilities: FireOptimizer and FireAttention

FireOptimizer is Fireworks AI's adaptation engine for customizing AI model performance in production environments. It automates latency and quality optimization for unique inference workloads, tailoring performance across the hardware, model, and software layers using techniques like customizable quantization, fine-tuning, and adaptive caching.
Its hallmark feature, adaptive speculative execution, automatically trains workload-specific draft models to parallelize token generation, achieving up to 3x latency improvements compared to generic speculative decoding. By increasing the hit rate, this method significantly boosts responsiveness without compromising accuracy.

Figure 1. FireOptimizer platform.

FireAttention, Fireworks AI's custom-built inference engine, significantly enhances LLM inference speed on GPUs. It achieves this by utilizing a novel micro-precision data format and rewriting key GPU kernels (such as attention and matrix multiplication) from scratch, aligning them with the underlying hardware instructions. While FireAttention prioritizes speed, potentially at the cost of initial accuracy, this is mitigated through quantization-aware training (QAT), which allows fine-tuned models to maintain high precision while reducing their memory footprint. Benchmarks demonstrate FireAttention V4's superior performance over SGLang on H200 and TRT-LLM on B200, particularly in MMLU Pro tests. Overall, FireAttention V4 represents a breakthrough in low-latency, high-efficiency LLM inference, especially beneficial for frontier models like DeepSeek R1.

Key benefits:

- Faster inference: FireOptimizer's adaptive speculative execution has demonstrated up to 3x latency improvements in production workloads across various models, ensuring highly responsive applications.
- Hassle-free optimization: FireOptimizer automates the complexities of optimization, allowing users to concentrate on application development.

FireOptimizer

FireOptimizer improves batch inference by integrating with MongoDB for efficient model fine-tuning and streamlined deployment. This multi-layered customization is vital for compound AI systems, ensuring consistent model alignment. Available for enterprise on-premises and own-cloud deployments, FireOptimizer enhances traditional inference performance through techniques like adaptive speculative execution, caching, customizable quantization, personalized fine-tuning at scale, and customizable hardware mapping.

In this blog post, we'll use FireOptimizer to perform parameter-efficient fine-tuning (PEFT) so that a small language model (SLM) can carry out personalized tasks such as RAG over a private dataset. This exercise demonstrates how generative AI can be adopted effectively at scale and in critical domains.

Survey of fine-tuning strategies for smaller, efficient models

Smaller language models present significant opportunities for tailored adaptation while using fewer resources. The ongoing evolution in this field is fueled by increasing demand for deploying optimized LLMs across diverse environments, including cloud platforms, edge devices, and specialized hardware. These fine-tuning approaches can be categorized as follows:

Additive parameter-efficient fine-tuning (PEFT): This class of methods augments pre-trained models with new trainable parameters without altering the original weights.

- Adapters: Small, trainable modules inserted within the pre-trained model's layers. These adapters learn task-specific adjustments, enabling adaptation to new tasks without changing the pre-existing parameters.
- Soft prompts: Trainable vector embeddings appended to the input sequence, acting as guiding signals that influence the model's output for a specific task.
- Prefix tuning: This technique adds a trainable prefix to the input sequence; the prefix learns task-specific information without requiring modifications to the core model architecture.
Reparametrization PEFT: This approach reduces the number of trainable parameters by reparameterizing existing model weights using low-rank approximations.

- Low-Rank Adaptation (LoRA): LoRA approximates weight updates in the attention layers of a pre-trained model using low-rank matrices, significantly decreasing the number of trainable parameters.
- Quantized LoRA (QLoRA): QLoRA builds upon LoRA by integrating quantization methods, further decreasing memory footprint and computational expense.

Selective fine-tuning: This category focuses on fine-tuning only specific parameters of the pre-trained model, improving computational efficiency.

- BitFit: This method fine-tunes only the bias terms, or other designated parameters, of the pre-trained model.
- DiffPruning: This technique identifies and removes parameters that have minimal impact on the model's performance, reducing the number of trainable parameters.

Layer freezing strategies: These strategies selectively freeze certain layers of the pre-trained model while fine-tuning others to optimize the adaptation process.

- Freeze and reconfigure (FAR): FAR freezes specific layers of the pre-trained model and fine-tunes the remaining layers.
- FishMask: This technique uses a mask to selectively freeze or fine-tune layers, optimizing adaptation for specific tasks.

Parameter-efficient fine-tuning (PEFT) is a popular technique for adapting small pre-trained models to niche tasks. By adjusting only a small portion of the model's parameters, PEFT prevents overfitting, especially on smaller datasets, and greatly reduces computational and memory demands compared to full fine-tuning. PEFT also helps mitigate catastrophic forgetting in LLMs. This approach allows for efficient model customization in resource-constrained environments without the need for complete retraining. Leveraging PEFT LoRA techniques in Fireworks AI, combined with the availability of trace data and labeled data, allows for efficient fine-tuning of smaller models.

To demonstrate the practical implications of using a small language model (SLM), we will build an agentic RAG application with MongoDB Atlas, showing how MongoDB can power semantic search and also serve as a semantic caching layer. The application is a step-by-step demonstration: we first build a simple, task-driven application using a frontier LLM such as Llama Maverick, then use the data generated in this setting to fine-tune an SLM that performs a similar operation satisfactorily while consuming fewer resources.

Step-by-step guide for building an agentic RAG application with MongoDB Atlas

The sample code below demonstrates an end-to-end agentic retrieval-augmented generation (RAG) workflow using LangChain, MongoDB Atlas Vector Search, and Fireworks LLMs. Below is a summary of the key steps and components:

1. Data loading & preprocessing
- PDF loading: The EU Act regulations PDF is loaded using PDFLoader.
- Text splitting: The document is split into manageable chunks using RecursiveCharacterTextSplitter for efficient retrieval and embedding.

2. Embedding & vector store setup
- Embeddings: A sentence-transformers MPNet model is used to generate vector embeddings for each text chunk.
- MongoDB Atlas Vector Search: The embeddings and text chunks are stored in MongoDB, and a vector search index is created for similarity search.

3. LLM & caching
- LLM setup: Meta Llama Maverick is used as the main LLM, with a custom output parser to clean up responses.
- Semantic cache: MongoDB Atlas Semantic Cache is configured to cache LLM responses and avoid redundant computation.

4. Agentic RAG workflow
- StateGraph construction: The workflow is modeled as a state machine with the following steps:
  - plan_step: Reformulates the user query for optimal retrieval.
  - retrieve_documents_step: Retrieves relevant documents from the vector store.
  - execute_step: Generates an answer using the LLM and the retrieved context.
  - validate_step: Uses the LLM to validate the relevance of the answer.
  - should_continue: Decides whether to proceed to the execute step or go back to the plan step.

Steps to build the agentic RAG application as described above are available in the notebook here. Once built, the graph for your agentic workflow looks as shown in Figure 2.

Figure 2. Agentic RAG workflow graph.

Running the agentic RAG workflow

Invoke the workflow with a user query:

query = "In the EU act what are the various biometric categorizations of data?"
app.invoke({"question": query}, config={"recursion_limit": 30})

Response:

In EU legislation, biometric data is categorized and regulated under several key frameworks, primarily focusing on its use, protection, and specific applications. Based on the provided context and relevant regulations:

### 1. **Biometric Verification (Authentication)**
- **Definition**: Automated one-to-one comparison of biometric data (e.g., fingerprints, facial recognition) to verify an individual's identity against previously stored data.
- **Purpose**: Authentication (e.g., unlocking devices, accessing services).
- **Legal Reference**: Article 3(36) of the cited regulations.

### 2. **Biometric Identification (Matching)**
- **Definition**: One-to-many comparison of biometric data against databases to establish identity (e.g., border control, law enforcement).
- **Examples**:
  - **Eurodac** (Regulation 2024/1358): Compares biometric data (e.g., fingerprints) to identify illegally staying third-country nationals or stateless persons.
  - **Law Enforcement**: Used to locate suspects, victims of trafficking, or missing persons under strict conditions (Article 3(38), 3(39), and provisions like point (i)–(iii)).

### 3. **Special Categories of Personal Data**
- **Status**: Biometric data is classified as a "special category" under:
  - **GDPR (Article 9(1), Regulation (EU) 2016/679)**: Requires enhanced protections due to its sensitivity.
  - **Directive (EU) 2016/680** and **Regulation (EU) 2018/1725**: Extend these protections to law enforcement and EU institutions.
- **Safeguards**: Pseudonymization, strict access controls, confidentiality obligations, and mandatory deletion after retention periods (points (c)–(e) in the context).

### 4. **Operational and Sensitive Data**
- **Sensitive Operational Data**: Biometric data used in criminal investigations or counter-terrorism, where disclosure could jeopardize proceedings (Article 3(38)).
- **Emotion Recognition Systems**: While not explicitly labeled as biometric, these systems infer emotions/intentions (Article 3(39)) and may intersect with biometric processing if tied to identifiable individuals.

### 5. **Law Enforcement Exceptions**
- Biometric data may be processed for:
  - Preventing terrorist attacks or imminent threats (point (ii)).
  - Investigating serious crimes (punishable by ≥4 years' imprisonment) under Annex II (point (iii)).

### Key Requirements:
- **Security**: State-of-the-art measures, pseudonymization, and access documentation (point (c)).
- **Restrictions**: Prohibition on unauthorized transfers (point (d)).
- **Retention**: Deletion after correcting bias or reaching retention limits (point (e)).

These categorizations ensure biometric data is used proportionally, with stringent safeguards to protect privacy and fundamental rights under EU law.

Validation score: 0.9

This notebook provides a modular, agentic RAG pipeline that can be adapted for various document retrieval and question-answering tasks using MongoDB and LLMs.

Step-by-step guide for fine-tuning a small language model with Fireworks AI

Current challenges with frontier models

The large language model used in the preceding example, accounts/fireworks/models/deepseek-r1, can result in slow application response times due to the significant computational resources required for its billions of parameters. An agentic RAG task involves multiple LLM invocations for steps such as generating retrieval questions, producing answers, and comparing user questions to the generated results. These repeated LLM queries extend the total response time to 30-40 seconds, with each query potentially taking 5 or more seconds. Additionally, deploying and scaling LLMs for a large user base can be complex and expensive. To mitigate this, the example code demonstrates the use of a semantic cache; however, this only helps with repeated queries to the system.

By leveraging small language models (SLMs), enterprises can achieve significant gains in processing speed and cost-efficiency. SLMs require less computational power, making them ideal for resource-constrained devices, while delivering faster response times and lower operational costs. There is a significant caveat to using SLMs, however: they come with limitations such as reduced generalization, limited context retention, and lower accuracy on complex tasks compared to larger models. They may struggle with nuanced reasoning, exhibit increased biases, and generate hallucinations due to their constrained training data and fewer parameters. While they are computationally efficient and well suited for lightweight applications, their ability to adapt across domains remains restricted. For example, a pretrained SLM such as accounts/fireworks/models/deepseek-r1-distill-qwen-1p5b does not produce satisfactory results in our agentic RAG setting out of the box: it is unable to perform validation scoring and tends to hallucinate, generating a response even when relevant context is provided.

Adapting a pre-trained small language model (SLM) for specialized applications, such as agentic retrieval-augmented generation (RAG) over a private knowledge base, offers a cost-effective alternative to frontier models while maintaining similar performance levels. This strategy also provides scalability for numerous clients, ensuring service level agreements (SLAs) are met. Parameter-efficient fine-tuning (PEFT) techniques such as Quantized Low-Rank Adaptation (QLoRA) substantially improve efficiency by focusing optimization on a limited set of parameters, lowering memory demands and operational expenses. Integrating with MongoDB streamlines data management and supports efficient model fine-tuning workflows.
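As a reference point, the semantic cache mentioned above can be attached to LangChain in a few lines so that every LLM call is first checked against semantically similar cached prompts in MongoDB Atlas. The following is a minimal sketch, assuming the langchain-mongodb and langchain-huggingface packages; the index name is illustrative, while the database and collection names mirror the agenticrag.cache collection that the next section reads from.

# Minimal sketch: MongoDB Atlas as a semantic cache for LLM calls (assumes the
# langchain-mongodb and langchain-huggingface packages; names are illustrative).
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb.cache import MongoDBAtlasSemanticCache

MONGODB_URI = "<mongodb_atlas_connection_string>"

# Same MPNet sentence-transformers family used for the vector store embeddings.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# LLM calls made through LangChain are answered from the cache when a
# semantically similar prompt has already been stored; only misses hit the LLM.
set_llm_cache(
    MongoDBAtlasSemanticCache(
        connection_string=MONGODB_URI,
        embedding=embeddings,
        database_name="agenticrag",
        collection_name="cache",
        index_name="vector_index",
    )
)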
MongoDB's unique value

MongoDB is integral to this process, providing seamless data management and real-time integration that improve operational efficiency. By storing trace data as JSON and enabling efficient retrieval and storage, MongoDB adds substantial value to the model fine-tuning workflow. MongoDB also doubles as a caching layer, avoiding unnecessary LLM invocations for repeated requests on the same data. The following steps walk through how to use the platform to fine-tune an SLM.

Figure 3. The fine-tuning process explained.

To enhance RAG applications, the initial step involves collecting data relevant to the specific task for fine-tuning. MongoDB Atlas, a flexible database, can be used to store LLM responses in a cache. For example, in our agentic RAG approach, we can create questions using diverse datasets and store their corresponding answers in MongoDB Atlas. While a powerful LLM is useful for generating these initial responses or task-specific data during this simulation phase, even a smaller-scale fine-tuning run requires at least 1,000 examples. These generated responses then need to be converted into the format required by the Fireworks AI platform before fine-tuning can begin. The cache.jsonl file, used later in fine-tuning, can be created by executing the following code.

from pymongo import MongoClient
import pandas as pd
import json

# Read the cached prompt/response pairs collected by the agentic RAG application.
client = MongoClient("<mongodb_atlas_connection_string>")
cache_col = client["agenticrag"]["cache"]
df = pd.DataFrame.from_records(cache_col.find())

# Convert each cached prompt (text field) and LLM response (return_val field)
# into a chat-style user/assistant pair.
vals = list(zip(
    [{"role": "user", "content": json.loads(text)[0]["kwargs"]["content"]} for text in df.text],
    [{"role": "assistant", "content": json.loads(json.loads(text)[0])["kwargs"]["text"]} for text in df.return_val],
))

messages = []
for val in vals:
    messages += [{"messages": list(val)}]

# Write the pairs in the JSONL format expected by the Fireworks fine-tuning service.
with open("cache.jsonl", "w") as f:
    for item in messages:
        f.write(json.dumps(item) + "\n")

Now that we have prepared the dataset and generated our cache.jsonl file, we can fine-tune the pre-trained deepseek-r1-distill-qwen-1p5b model by following the steps below.

Prerequisites:

- Install firectl: Use the command pip install firectl to install the Fireworks command-line tool.
- Authenticate: Log in to your Fireworks account using firectl login.
- Prepare dataset: Ensure your fine-tuning dataset (created during the data generation process) is ready.

Steps:

1. Upload dataset: Upload your prepared dataset to the Fireworks platform using the following command, replacing <dataset_name> with your desired name and cache.jsonl with your dataset file:

firectl create dataset <dataset_name> cache.jsonl

2. Create fine-tuning job: Initiate a fine-tuning job by specifying the base model, dataset, output model name, LoRA rank, and number of epochs. For example:

firectl create sftj --base-model accounts/fireworks/models/deepseek-r1-distill-qwen-1p5b \
  --dataset <dataset_name> --output-model ragmodel --lora-rank 8 --epochs 1

The output will provide details about the job, including its name, creation time, dataset used, current state, and the name of the output model.

3. Monitor fine-tuning: Track the progress of your fine-tuning job using the Fireworks AI portal. This allows you to ensure the process is running as expected.

4. Deploy fine-tuned model: Once the fine-tuning is complete, deploy the model for inference on the Fireworks platform.
This involves two steps:

- Deploy the base model used for fine-tuning:

firectl create deployment accounts/fireworks/models/deepseek-r1-distill-qwen-1p5b --enable-addons --wait

- Deploy the fine-tuned LoRA adapter:

firectl load-lora ragmodel --deployment <deployment_id>

5. Use deployed model: After deployment, the model ID (e.g., models/ragmodel) can be used to invoke the fine-tuned language model via your preferred LLM framework, leveraging the Fireworks platform's serverless API (see the sketch at the end of this article).

Summary

Fine-tuning smaller language models (SLMs) for retrieval-augmented generation (RAG) using platforms like Fireworks AI offers significant advantages over relying solely on large frontier models. This approach drastically improves response times, reducing latency from around 5 seconds with a large LLM to 2.3 seconds with a fine-tuned SLM, while also substantially decreasing memory and hardware requirements. By leveraging parameter-efficient fine-tuning techniques and integrating with data management solutions like MongoDB, businesses can achieve faster, more cost-effective AI performance for RAG applications, making advanced AI capabilities more accessible and sustainable.

Conclusion

The collaboration between MongoDB and Fireworks AI offers a powerful synergy for enhancing the efficiency and affordability of large language model (LLM) training and deployment. Fireworks AI's use of parameter-efficient fine-tuning (PEFT) techniques like LoRA and QLoRA significantly curtails the computational resources necessary for fine-tuning LLMs by focusing on low-rank adaptation and quantization. This translates directly into substantial reductions in the costs associated with this crucial process. Complementarily, MongoDB's robust infrastructure, characterized by its distributed architecture, flexible schema, and efficient indexing capabilities, provides the ideal data management foundation. It allows for on-demand scaling of data infrastructure while minimizing storage expenses, thereby contributing to lower capital and operational expenditures.

This integration further fosters streamlined workflows between data and AI processes. MongoDB's capacity for real-time data integration ensures that AI models have immediate access to the most current information, improving operational efficiency and the relevance of the models' insights. When combined with Fireworks AI's fine-tuning tools, this creates a cohesive environment where AI models can be continuously updated and refined. Moreover, the partnership simplifies the development of robust retrieval-augmented generation (RAG) solutions. MongoDB Atlas offers a scalable platform for storing embeddings, while Fireworks AI provides managed LLM hosting and other essential features. This seamless combination enables the creation of scalable, intelligent systems that significantly enhance user experience through more effective and relevant information retrieval. Organizations adopting this strategy can achieve accelerated AI performance, resource savings, and future-proof solutions—driving innovation and competitive advantage across different sectors.

Further reading:

- Atlas Vector Search: Learn AI and vector search; generate, store, index, and search embeddings in MongoDB Atlas for semantic search. Build hybrid search with Atlas Search and Atlas Vector Search. Use vector search for a RAG chatbot. Manage indexes with the Atlas CLI and MongoDB Shell.
- FireAttention V4: Enables cost-effective GPU inference and provides industry-leading latency and cost efficiency with FP4.
- FireOptimizer: Allows users to customize latency and quality for production inference workloads.
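As referenced in the "Use deployed model" step above, the deployed adapter can be called like any other Fireworks-hosted model. The following is a minimal sketch that assumes Fireworks' OpenAI-compatible serverless endpoint and the openai Python package; the account ID, model ID, and API key are placeholders.

# Minimal sketch: invoking the fine-tuned SLM served by Fireworks through its
# OpenAI-compatible API (account ID, model ID, and API key are placeholders).
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/<account_id>/models/ragmodel",  # deployed LoRA adapter
    messages=[
        {
            "role": "user",
            "content": "In the EU act what are the various biometric categorizations of data?",
        }
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)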

August 11, 2025
Developer Blog

Kubernetes, Crossplane, and Atlas—Better Together

More and more companies are moving away from the original approach to DevOps, where application developers were given access to—and often responsibility for—the tooling used to spin up infrastructure and deploy workloads. While this definitely ticked the self-service box, the overhead in cognitive load was very high for developers; they now had to learn and support what was previously centrally managed (even if that meant a slow ticket-ops approach to provisioning and deploying).

At MongoDB, we're seeing more and more customers moving towards the concept of an Internal Developer Platform. Tooling and governance are centrally managed and developed, often by central teams with titles like "platform engineering." Application developers retain the self-service that was the great value proposition of DevOps, but thanks to the central ownership, they don't suffer the overhead of maintaining (or even needing to fully understand) the tooling they leverage. The tooling often abstracts the developers from many of the optional or typically unchanged options and settings, leaving them a minimal number of decisions to make to meet their needs.

This also offers many benefits to the company: not only are developers empowered to move and deliver faster, but governance and compliance become easier thanks to centrally enforced settings (e.g., security best practices like TLS). Such tooling also typically makes it far easier for the company to roll out changes to the centrally owned templates used by application developers, whether that be new defaults or new enforced security settings. Properly enabling application developers makes their lives easier, and delivering customer value becomes faster than with either a fully centralized approach (with ticket-ops) or a totally decentralized 'classic' DevOps approach. The company retains security, governance, and monitoring oversight, helping meet business requirements. Everyone wins.

Atlas Kubernetes Operator

MongoDB Atlas is a fully managed cloud database service that simplifies deploying, managing, and scaling MongoDB clusters. With features like automated backups, advanced security controls, global clusters, and real-time performance metrics, Atlas is designed to help teams move faster while ensuring the reliability and security of their data infrastructure. MongoDB Atlas offers a wide range of programmatic management options.

Many MongoDB customers are leveraging Kubernetes-native workflows, either where applications are deployed to Kubernetes, or where a centrally managed Internal Developer Platform is run through Kubernetes (regardless of where the applications run). To support this, MongoDB provides the Atlas Kubernetes Operator—an open-source operator that lets you manage Atlas resources declaratively through Custom Resource Definitions (CRDs). This means you can define projects, clusters, database users, IP access lists, and more, directly in YAML files, and the operator will reconcile these specs with the actual state in MongoDB Atlas—in other words, apply your declarative configuration to Atlas. The value for customers and application development teams is that you can manage Atlas through the same workflow (often GitOps, leveraging tooling like ArgoCD) that you already use to configure your applications running in Kubernetes, or through a Kubernetes-based Internal Developer Platform.
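For illustration, the following is a minimal sketch of what such YAML can look like for a project and a small dedicated cluster; the resource names and sizing are placeholders, and the field names follow the AtlasProject and AtlasDeployment examples that appear later in this post.

# Minimal sketch of Atlas Kubernetes Operator resources (names and sizing are
# placeholders; field names follow the examples later in this post).
apiVersion: atlas.mongodb.com/v1
kind: AtlasProject
metadata:
  name: example-project
spec:
  name: example-project
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDeployment
metadata:
  name: example-cluster
spec:
  projectRef:
    name: example-project
  deploymentSpec:
    name: example-cluster
    clusterType: REPLICASET
    backupEnabled: false
    replicationSpecs:
      - regionConfigs:
          - providerName: AWS
            regionName: US_EAST_1
            priority: 7
            electableSpecs:
              instanceSize: M10
              nodeCount: 3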
By bridging the gap between GitOps and database management, the Atlas Kubernetes Operator empowers platform engineers to treat Atlas resources as first-class citizens in their Kubernetes ecosystem—just like pods, deployments, or services.

Crossplane

Various tools and solutions are competing to provide a base for a Kubernetes-native Internal Developer Platform. Crossplane is a tool at the forefront of this, and we're gradually hearing more interest in it among MongoDB customers. It is a powerful open-source framework that extends Kubernetes into a universal control plane for managing infrastructure and services across environments using standard Kubernetes APIs—whether those services run in Kubernetes or outside. Rather than relying on separate tools or external scripts, Crossplane lets users define and manage infrastructure and workloads in a consistent, declarative, and version-controlled way.

At its core, Crossplane works by installing custom controllers and CRDs into a Kubernetes cluster. These controllers can manage internal or external systems, including cloud services, databases, and more, via APIs, using native Kubernetes resources. This makes it possible to describe infrastructure as code, enforce organizational policies, and integrate infrastructure provisioning into existing GitOps workflows. The standardization even simplifies any further abstraction a company might implement for its application developers, for example, a GUI for simplified infrastructure provisioning.

Crossplane compositions

Crossplane is very well aligned with the value proposition of an Internal Developer Platform thanks to its ability to abstract infrastructure provisioning behind Compositions and Composite Resources (XRs). This allows platform engineers (with input from other teams, such as security) to define reusable blueprints (sometimes called "golden paths" in the context of Internal Developer Platforms) for common services. The standardization and simplification possible with these templates make things far easier for application developers—whether they use the simplified declarative configuration directly or it is abstracted behind a customer's user interface—all without exposing the underlying complexity or provider-specific details.

Crossplane's flexibility lies in its ability to connect to multiple external systems via providers, with custom templates that:

- Enforce standards through mandatory settings aligned with the organization's policies (e.g., TLS)
- Reduce cognitive load for application developers through abstraction that cuts down on what they have to think about when trying to provision something like a MongoDB Atlas cluster
- Enable centrally implemented changes through central management of the templates, addressing a common problem with other configuration tooling

Key concepts

- Managed Resource: This is the end result and the resulting configuration in Kubernetes that defines the thing actually being managed—this might be a server, a VM, or a much lower-level object like an IP access list. For Atlas, a managed resource could be a custom resource defining an Atlas project. This is the output of Crossplane and the configuration ultimately applied to do something, like create or update an Atlas project.
- Composite Resource (XR): This is the input to Crossplane. It's the template, created by a platform team and used (either directly or indirectly) by the application developers.
A composite resource might define a single resource in Kubernetes or a cloud service, or represent something more akin to a whole stack. For example, a composite resource might describe a full MongoDB Atlas setup, including a project, a cluster, and a user.
- Composite Resource Definition (XRD): This defines a new custom resource type for Crossplane—essentially the schema of the template. This enables customers to request infrastructure via a Composite Resource (XR). The central platform team creates this type, but application developers do not directly use it.
- Composition: This governs the translation of a Composite Resource (XR) into one or more Managed Resources. An example would be translating a Composite Resource defining an Atlas project, cluster, and user into distinct custom resources, which the application developer doesn't need to see but which are then applied to Atlas via the Atlas Kubernetes Operator.
- Claim (optional): A developer-friendly alias for an XR. Think of it as a simplified interface to create a Composite Resource without exposing platform-specific naming. An optional—but powerful and recommended—further abstraction and simplification.
- Crossplane Provider: Crossplane Providers are secondary to the inbuilt capabilities in the previous concepts. Providers are extensions to Crossplane that typically enable management of a specific service or workload. They're akin to Terraform Providers, and many Crossplane Providers are even built from an existing Terraform Provider. A Crossplane Provider is one option for actually applying Managed Resources—for example, applying the configuration to an external service via APIs.

Kubernetes Operators and Crossplane Kubernetes Providers

Kubernetes Operators are a well-established concept in the Kubernetes ecosystem. They can run services in Kubernetes (e.g., the MongoDB Controllers for Kubernetes Operator, which supports running MongoDB Community or Enterprise Advanced in Kubernetes) or manage external services via APIs (e.g., the MongoDB Atlas Kubernetes Operator, which supports managing Atlas). When using Crossplane, any Kubernetes Operator can be used thanks to the Crossplane Kubernetes Provider, which enables Crossplane to provide a consistent, simplified, and centrally managed interface in front of any number of existing Kubernetes Operators.

How it works

Below is a visual representation of how Crossplane Compositions work—from claim to managed infrastructure:

Figure 1. Crossplane compositions at work.

Workflow:

1. The application developer applies a Claim to Kubernetes, likely via a GitOps flow. (This may also be generated through some further abstraction for application developers, e.g., a GUI, and automatically applied to Kubernetes.)
2. Crossplane creates a Composite Resource (XR) from the claim.
3. Crossplane selects the matching Composition.
4. The Composition generates one or more Managed Resources, which are actioned/applied using Crossplane Providers—for example, the Crossplane Kubernetes Provider, which enables the use of the Atlas Kubernetes Operator to apply configuration to Atlas.

In essence, Crossplane shifts the responsibility of infrastructure design and lifecycle management to the central platform team, while giving application development teams a clean, consistent, simplified interface to request the services they need.

MongoDB's support of Crossplane

As described above, Crossplane provides a centrally managed self-service interface for a wide array of services thanks to its flexibility and native support.
Though not directly supported by MongoDB, Crossplane can be used in conjunction with tools like the MongoDB Atlas Kubernetes Operator, thanks to the already-mentioned Crossplane Kubernetes Provider, which enables Crossplane to work with any operator. The following is an example of how this can be done. Bear in mind that this blog is not official guidance on Crossplane (consult the official Crossplane documentation) and is not guaranteed to be kept up to date. It's meant as an illustration of using Crossplane with the Crossplane Kubernetes Provider and the Atlas Kubernetes Operator. As a result, MongoDB is not able to offer anything beyond best-efforts guidance on using the Atlas Kubernetes Operator with Crossplane, though MongoDB officially supports its use by Atlas customers.

Example use case: Self-service data platform for microservices teams

We are going to define a Composite Resource Definition called ProjectEnvironment (which allows users to define specific ProjectEnvironments) and a Composition that governs how a ProjectEnvironment is broken down into the underlying Managed Resources, which in this case define Atlas resources, including a project, a deployment, an IP access list, and database users. So, with one ProjectEnvironment Composite Resource (perhaps applied to Kubernetes via GitOps), our Crossplane Composition will generate several custom resources in Kubernetes. The Atlas Kubernetes Operator will then read these and apply them via the Atlas Admin API to create the various resources in Atlas.

Before diving into the examples, let's make sure you have the necessary setup in place.

Prerequisites

To follow the examples below, ensure the following:

- A Kubernetes cluster is available and running (local or managed).
- Crossplane is installed in the cluster. You can follow the official installation guide.
- The Crossplane Kubernetes Provider is installed.
- The functions patch-and-transform, template-go, and auto-ready are installed.
- The Atlas Kubernetes Operator is installed. Refer to the official quick start guide.
- An Atlas organization has been created, and API keys have been set up and put in place for the operator to use (all covered in the quickstart guide, but stop after step 4, as we'll be using the operator via Crossplane!). Ensure your API keys include permission to manage:
  - Projects
  - Deployments (clusters)
  - Database custom roles
  - Database users

Note: In order to allow the Crossplane provider-kubernetes to manage Atlas Kubernetes Operator objects, the following RBAC permissions must be granted:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crossplane:provider:provider-kubernetes:mongodb-atlas
rules:
  - apiGroups:
      - ""
    resources:
      - serviceaccounts
    verbs:
      - '*'
  - apiGroups:
      - atlas.mongodb.com
    resources:
      - "*"
    verbs:
      - "*"
  - apiGroups:
      - rbac.authorization.k8s.io
    resources:
      - "*"
    verbs:
      - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: crossplane:provider:provider-kubernetes:mongodb-atlas
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: crossplane:provider:provider-kubernetes:mongodb-atlas
subjects:
  - kind: ServiceAccount
    name: upbound-provider-kubernetes-beb1eef47cde
    namespace: crossplane-system

Step 1: Define the Composite Resource Definition (XRD)

A Composite Resource Definition (XRD) defines a custom API for your platform, offering a self-service interface for users to provision resources. It acts as a template for the Composite Resources (XRs) that your users will create.
While it looks like a Kubernetes Custom Resource Definition (CRD), an XRD provides a higher-level abstraction designed for Crossplane. The following XRD defines a ProjectEnvironment resource. This custom API allows development teams to provision a complete MongoDB Atlas environment, including a project, a database deployment, access lists, and users, by creating a single, simple Kubernetes object. apiVersion: apiextensions.crossplane.io/v1 kind: CompositeResourceDefinition metadata: name: projectenvironments.platform.example.org spec: group: platform.example.org names: kind: ProjectEnvironment plural: projectenvironments claimNames: kind: ProjectEnvironmentClaim plural: projectenvironmentsclaims defaultCompositeDeletePolicy: Foreground spec.group : Defines the API group for your composite resource, used for organizing related resources. spec.names : Sets the name for the Composite Resource (XR) that platform administrators will see. In this case, it's ProjectEnvironment. spec.claimNames : Defines the developer-facing resource, known as a Composite Resource Claim. Developers create a ProjectEnvironmentClaim , and the platform automatically provisions a corresponding ProjectEnvironment resource based on this definition. spec.defaultCompositeDeletePolicy : Specifies what happens when a ProjectEnvironment is deleted. Background : Deletes the Composite Resource first and cleans up the underlying cloud resources (like the Atlas project and database) in the background. This is faster but can leave resources orphaned if cleanup fails. Foreground : Ensures all underlying cloud resources are successfully deleted before removing the Composite Resource from Kubernetes. This is safer and prevents orphaned resources. The schema block defines the structure of your new API, including the fields that users can configure ( spec ) and the information that will be reported back ( status ). versions: - name: v1alpha1 served: true referenceable: true schema: openAPIV3Schema: type: object properties: spec: type: object required: - project - users properties: project: type: string environment: type: string enum: ["dev", "qa", "prod"] default: dev version: type: string enum: ["6.0", "7.0", "8.0"] default: "8.0" users: type: array items: type: object required: ["username", "secret"] properties: username: type: string secret: type: string status: type: object properties: id: type: string connectionStrings: type: object properties: standard: type: string standardSrv: type: string private: type: string privateSrv: type: string mongoDBVersion: type: string additionalPrinterColumns: - name: Project type: string jsonPath: ".spec.project" - name: MongoDB Version type: string jsonPath: ".status.mongoDBVersion" versions : Contains one or more versions of your API schema. served : When true, this API version is enabled and can be used on the cluster. referenceable : Allows other resources to reference this one, which is useful for building complex compositions where one resource depends on another. openAPIV3Schema : Defines the data structure for your API. spec : The fields that users configure when creating a ProjectEnvironmentClaim . This example includes required fields like project and users, along with optional fields like environment and MongoDB version that have default values. status : The fields that Crossplane will populate with information from the provisioned cloud resources, such as the Atlas project id and database connectionStrings. 
For a comprehensive list of all attributes of a CompositeResourceDefinition, see its API reference.

Step 2: Define the composition

This Composition defines the platform logic that translates a ProjectEnvironment into Managed Resources representing all the elements of a full MongoDB Atlas environment. The Managed Resources (representing project, deployment, access list, and users) are then consumed and applied to MongoDB Atlas by the MongoDB Atlas Kubernetes Operator. In other words, the Composition turns a high-level ProjectEnvironment resource into a complete, ready-to-use MongoDB Atlas environment. It acts as the "brain" of the platform, orchestrating the creation of several underlying resources in a specific order. It uses Pipeline mode, which allows for a sequence of steps, each powered by a Composition Function. This mode is ideal for complex scenarios that require conditional logic, looping, and custom processing beyond simple field mapping.

Step 2.1: Composition definition

This initial block declares that this Composition is responsible for fulfilling requests for ProjectEnvironment resources and that it will use Pipeline mode for its logic.

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: projectenvironments.platform.example.org
spec:
  compositeTypeRef:
    apiVersion: platform.example.org/v1alpha1
    kind: ProjectEnvironment
  mode: Pipeline

- compositeTypeRef: Specifies the Composite Resource type this Composition applies to. In our case, it refers to ProjectEnvironment Composite Resources.
- mode: Defines how the composition is executed. Pipeline indicates that the Composition specifies a pipeline of Composition Functions, each of which is responsible for producing composed resources that Crossplane should create or update. Resource indicates that a Composition uses "Patch & Transform" (P&T) composition, with an array of resources, each a template for a composed resource.

The pipeline defines the sequence of operations required to build the full environment. Each step can use a different function and is executed in order.

Step 2.2: Atlas project and IP access list

These steps use the patch-and-transform function to create the foundational AtlasProject and a corresponding AtlasIPAccessList.
pipeline: - step: atlas-project-with-ip-access-list functionRef: name: function-patch-and-transform input: apiVersion: pt.fn.crossplane.io/v1beta1 kind: Resources patchSets: - name: project-ref patches: - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.name" toFieldPath: "spec.forProvider.manifest.spec.projectRef.name" transforms: - type: string string: type: Format fmt: "%s-project" - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.namespace" toFieldPath: "spec.forProvider.manifest.spec.projectRef.namespace" resources: - name: atlas-project base: apiVersion: kubernetes.crossplane.io/v1alpha2 kind: Object spec: readiness: policy: AllTrue deletionPolicy: Delete forProvider: manifest: apiVersion: atlas.mongodb.com/v1 kind: AtlasProject providerConfigRef: name: kubernetes-provider patches: - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.name" toFieldPath: "metadata.name" transforms: - type: string string: type: Format fmt: "%s-project-object" - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.name" toFieldPath: "spec.forProvider.manifest.metadata.name" transforms: - type: string string: type: Format fmt: "%s-project" - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.namespace" toFieldPath: "spec.forProvider.manifest.metadata.namespace" - type: FromCompositeFieldPath fromFieldPath: "spec.project" toFieldPath: "spec.forProvider.manifest.spec.name" - type: ToCompositeFieldPath fromFieldPath: "status.atProvider.manifest.status.id" toFieldPath: "status.id" - name: atlas-ip-access-list base: apiVersion: kubernetes.crossplane.io/v1alpha2 kind: Object spec: readiness: policy: AllTrue deletionPolicy: Delete references: - dependsOn: {} forProvider: manifest: apiVersion: atlas.mongodb.com/v1 kind: AtlasIPAccessList spec: entries: - cidrBlock: "10.0.16.0/20" comment: "Company Office Network" providerConfigRef: name: kubernetes-provider patches: - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.name" toFieldPath: "metadata.name" transforms: - type: string string: type: Format fmt: "%s-ial-object" - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.name" toFieldPath: "spec.references[0].dependsOn.name" transforms: - type: string string: type: Format fmt: "%s-project-object" - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.name" toFieldPath: "spec.forProvider.manifest.metadata.name" transforms: - type: string string: type: Format fmt: "%s-ial" - type: FromCompositeFieldPath fromFieldPath: "spec.claimRef.namespace" toFieldPath: "spec.forProvider.manifest.metadata.namespace" - type: PatchSet patchSetName: project-ref This step performs direct field mappings from the incoming ProjectEnvironmentClaim to the new AtlasProject and AtlasIPAccessList resources. It uses a PatchSet to define a reusable way to reference the project's name and namespace, ensuring consistency. For simplicity, the IP access list is hardcoded to a specific CIDR block, but could be parameterized by consuming an input from our composite or by applying custom logic using a custom function. 
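As an aside, a minimal sketch of that parameterization, assuming a hypothetical officeCidr field were added to the XRD's spec, would replace the hardcoded entry with a patch along these lines:

# Hypothetical sketch: patching the CIDR block from the composite instead of
# hardcoding it (assumes an officeCidr field has been added to the XRD schema).
- type: FromCompositeFieldPath
  fromFieldPath: "spec.officeCidr"
  toFieldPath: "spec.forProvider.manifest.spec.entries[0].cidrBlock"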
Step 2.3: Atlas deployment - step: atlas-deployment functionRef: name: function-template-go input: apiVersion: gotemplating.fn.crossplane.io/v1beta1 kind: GoTemplate source: Inline inline: template: | apiVersion: kubernetes.crossplane.io/v1alpha2 kind: Object metadata: name: {{ printf "%s-deployment-object" .observed.composite.resource.spec.claimRef.name }} annotations: {{ setResourceNameAnnotation (printf "%s-deployment" .observed.composite.resource.spec.claimRef.name) }} spec: readiness: policy: AllTrue deletionPolicy: Delete references: - dependsOn: name: {{ printf "%s-project-object" .observed.composite.resource.spec.claimRef.name }} providerConfigRef: name: kubernetes-provider forProvider: manifest: apiVersion: atlas.mongodb.com/v1 kind: AtlasDeployment metadata: name: {{ printf "%s-deployment" .observed.composite.resource.spec.claimRef.name }} namespace: {{ .observed.composite.resource.spec.claimRef.namespace }} labels: environment: {{ .observed.composite.resource.spec.environment }} spec: projectRef: name: {{ printf "%s-project" .observed.composite.resource.spec.claimRef.name }} {{- $env := .observed.composite.resource.spec.environment }} {{- if eq $env "dev" }} flexSpec: name: {{ printf "%s-deployment" .observed.composite.resource.spec.project }} terminationProtectionEnabled: false providerSettings: backingProviderName: AWS regionName: US_EAST_1 {{- else }} deploymentSpec: tags: - key: environment value: {{ $env }} name: {{ printf "%s-deployment" .observed.composite.resource.spec.project }} clusterType: REPLICASET mongoDBMajorVersion: "{{ .observed.composite.resource.spec.version }}" backupEnabled: false replicationSpecs: - regionConfigs: - providerName: AWS regionName: US_EAST_1 priority: 7 electableSpecs: instanceSize: M10 nodeCount: 3 {{- if eq $env "prod" }} readonlySpecs: instanceSize: M40 nodeCount: 2 autoscaling: compute: enabled: true maxInstanceSize: M50 diskGB: enabled: true {{- end }} {{- end }} A Go template generates the deployment's specification dynamically. It inspects the environment field from the user's claim and creates a cost-effective flexSpec deployment for dev environments, or a more robust deploymentSpec for qa and prod. It also uses references to ensure this deployment is only created after the project from Step 1 is ready. 
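To make the conditional logic concrete, for the dev claim shown in Step 3 below (claim name dev-env-claim, project payments), the template above renders roughly the following Object, abridged here for readability:

# Approximate rendering of the Go template for a dev environment
# (claim name dev-env-claim, project payments); abridged for readability.
apiVersion: kubernetes.crossplane.io/v1alpha2
kind: Object
metadata:
  name: dev-env-claim-deployment-object
spec:
  readiness:
    policy: AllTrue
  deletionPolicy: Delete
  references:
    - dependsOn:
        name: dev-env-claim-project-object
  providerConfigRef:
    name: kubernetes-provider
  forProvider:
    manifest:
      apiVersion: atlas.mongodb.com/v1
      kind: AtlasDeployment
      metadata:
        name: dev-env-claim-deployment
        labels:
          environment: dev
      spec:
        projectRef:
          name: dev-env-claim-project
        flexSpec:
          name: payments-deployment
          terminationProtectionEnabled: false
          providerSettings:
            backingProviderName: AWS
            regionName: US_EAST_1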
Step 2.4: Atlas database users - step: atlas-database-users functionRef: name: function-template-go input: apiVersion: gotemplating.fn.crossplane.io/v1beta1 kind: GoTemplate source: Inline inline: template: | {{- range $index, $user := .observed.composite.resource.spec.users }} --- apiVersion: kubernetes.crossplane.io/v1alpha2 kind: Object metadata: name: {{ printf "%s-user-%s" $.observed.composite.resource.spec.claimRef.name $user.username }} annotations: {{ setResourceNameAnnotation (printf "%s-user-%s" $.observed.composite.resource.spec.claimRef.name $user.username) }} spec: readiness: policy: AllTrue deletionPolicy: Delete references: - dependsOn: name: {{ printf "%s-project-object" $.observed.composite.resource.spec.claimRef.name }} providerConfigRef: name: kubernetes-provider forProvider: manifest: apiVersion: atlas.mongodb.com/v1 kind: AtlasDatabaseUser metadata: name: {{ printf "%s-user-%s" $.observed.composite.resource.spec.claimRef.name $user.username }} namespace: {{ $.observed.composite.resource.spec.claimRef.namespace }} spec: projectRef: name: {{ printf "%s-project" $.observed.composite.resource.spec.claimRef.name }} namespace: {{ $.observed.composite.resource.spec.claimRef.namespace }} username: {{ $user.username }} passwordSecretRef: name: {{ $user.secret }} namespace: {{ $.observed.composite.resource.spec.claimRef.namespace }} databaseName: admin roles: - roleName: readWriteAnyDatabase databaseName: admin scopes: - name: {{ printf "%s-deployment" $.observed.composite.resource.spec.project }} type: CLUSTER {{- end }} This step also uses the template-go function, but this time to loop through the user's request and create multiple database users. The template uses a range block to iterate over the users array specified in the ProjectEnvironmentClaim . For each entry in the array, it generates a complete AtlasDatabaseUser resource, linking it to the correct project and password secret. This allows a single claim to stamp out multiple, similar resources. Step 2.5: Status update - step: custom-status-update functionRef: name: function-template-go input: apiVersion: gotemplating.fn.crossplane.io/v1beta1 kind: GoTemplate source: Inline inline: template: | {{ if .observed.resources }} {{ $project := index .observed.resources "atlas-project" }} {{ $deployment := index .observed.resources (printf "%s-deployment" .observed.composite.resource.spec.claimRef.name) }} apiVersion: platform.example.org/v1alpha1 kind: ProjectEnvironment status: {{ if $project.resource.status.atProvider.manifest }} id: {{ $project.resource.status.atProvider.manifest.status.id }} {{ end }} {{ if $deployment.resource.status.atProvider.manifest }} mongoDBVersion: {{ $deployment.resource.status.atProvider.manifest.status.mongoDBVersion }} connectionStrings: {{ $deployment.resource.status.atProvider.manifest.status.connectionStrings | toJson }} {{ end }} {{ end }} This crucial step uses a Go template not to create a cloud resource, but to feed information back to the user. It inspects the resources created in the previous steps, extracts key information like the project ID, MongoDB version, and connection strings, and patches them into the status field of the ProjectEnvironment resource. This makes vital information available to the developer directly on their claim object. Step 2.6: Readiness - step: readiness-check functionRef: name: function-auto-ready The final step uses the function-auto-ready function to determine when the entire composition is complete and healthy. 
This function automatically inspects all the resources managed by the pipeline and updates the ProjectEnvironment's status conditions accordingly. It signals that the environment is fully provisioned and ready for use only when every component (project, deployment, users) reports a ready state.

For a comprehensive list of all attributes of a Composition resource, see its API reference.

Step 3: Create a composite resource

This is the final step. All of the previous elements are the cogs in the machine, and this is a specific instance of a Composite Resource that is our input to Crossplane. This is what an application development team would use (directly or with further abstraction) to request a MongoDB environment in a self-service manner, using a single claim and secret.

apiVersion: platform.example.org/v1alpha1
kind: ProjectEnvironmentClaim
metadata:
  name: dev-env-claim
spec:
  compositionRef:
    name: projectenvironments.platform.example.org
  project: payments
  environment: dev
  users:
    - username: user1
      secret: user-secret
    - username: user2
      secret: user-secret
    - username: user3
      secret: user-secret

The following is a standard Kubernetes Secret that securely stores the password for the database users.

apiVersion: v1
kind: Secret
metadata:
  name: user-secret
  labels:
    atlas.mongodb.com/type: credentials
type: Opaque
stringData:
  password: myH4rdP@ssw0rd

Once this ProjectEnvironmentClaim is applied to the Kubernetes cluster where Crossplane is running:

1. The Composition is selected by Crossplane via compositionRef.name in our ProjectEnvironmentClaim.
2. The Composition instructs Crossplane to create custom resources in Kubernetes for:
  - A MongoDB project
  - A dev environment deployment
  - A CIDR-bound IP access list
  - Three database users, sharing the same password by using the same secret (for simplicity)
3. Once created in Kubernetes, those custom resources are read by the MongoDB Atlas Kubernetes Operator and applied to Atlas via the Atlas Admin API.
4. The status of the claim reflects the readiness of the composed resources. This means that application developers have visibility into the success/failure/status of what's ultimately being created in MongoDB Atlas. How developers see the claim status depends on how they've applied the claim to Kubernetes, e.g., it might be surfaced in a GUI or through a tool like ArgoCD.

Conclusion

Crossplane is a powerful tool for providing central management and governance while enabling developer self-service with minimal cognitive load. Though MongoDB does not directly support Crossplane, it's entirely possible to leverage the MongoDB Atlas Kubernetes Operator through Crossplane and the Crossplane Kubernetes Provider. The self-service data platform design using Crossplane Compositions, CompositeResourceDefinitions, and Kubernetes Provider Object resources empowers microservices teams with declarative infrastructure provisioning for MongoDB Atlas. This approach offers several key benefits:

- Application developer self-service: Developers write a simplified claim without needing to understand the underlying mechanics or infrastructure.
- Central enablement and governance: Platform engineers define and version the Composition, while developers only need to interact with a simplified custom API tailored to their use case. Platform engineers can enforce mandatory settings (enabling central governance) and recommend sensible defaults that simplify application developer use.
Flexible environment profiles Environment-specific configurations (dev, qa, prod) can be baked into the templates, enabling consistent infrastructure provisioning with variable capacity and redundancy. Extensibility The use of template-go and patch-and-transform functions allows for rich customization logic, templating, and conditional behavior, going beyond what vanilla Crossplane patching can offer. Namespace isolation & multi-tenancy Claims can be namespace-scoped and operate safely in multi-tenant environments using namespaced secrets and resources. Despite its many strengths, there are known limitations to this approach, mainly due to the current state of interoperability between Crossplane and external operators: Deletion ordering & finalizer issues While the Crossplane Kubernetes Provider coordinates deletion in a dependency-aware manner, a bug in the provider prevents it from properly setting a finalizer on dependent resources. This means that a parent resource (e.g., AtlasProject) may be deleted before its child resources (e.g., AtlasDatabaseUser), and those child resources will become orphaned. This is especially problematic because: Some Atlas Kubernetes Operator resources do not support independent references, so the orphaned child resources may be left in an error state due to their dependence on the deleted parent resource. Manual cleanup may be required for such orphaned resources. Readiness may not be fully tracked When using go-template to generate the resource, readiness is not automatically checked. This means that the status of resources that a claim generates or manages might not be up to date by default. This must be mitigated by explicitly generating ClaimConditions to update the status, or by using the `auto-ready` function. With Resources mode, or when using Resources in the Pipeline mode, this is not an issue, as readiness is automatically checked. Visit our docs page to learn more about the MongoDB Atlas Kubernetes Operator.

July 28, 2025
Developer Blog

Build Scalable RAG With MongoDB Atlas and Cohere Command R+

Retrieval-augmented generation (RAG) is becoming increasingly vital for developing sophisticated AI applications that not only generate fluent text but also ensure precision and contextual relevance by grounding responses in real, factual data. This approach significantly mitigates hallucinations and enhances the reliability of AI outputs. This guide provides a detailed exploration of an open-source solution designed to facilitate the deployment of a production-ready RAG application by using the powerful combination of MongoDB Atlas and Cohere Command R+. This solution is built upon and extends the foundational principles demonstrated in the official Cohere plus MongoDB RAG documentation available at Build Chatbots with MongoDB and Cohere . To provide you with in-depth knowledge and practical skills in several key areas, this comprehensive walkthrough will: Show you how to build a complete RAG pipeline using MongoDB Atlas and Cohere APIs Focus on data flow, retrieval, and generation Enable you to enhance answer quality through reranking to improve relevance and accuracy Enable detailed, flexible deployment with Docker Compose for local or cloud environments Explain MongoDB’s dual role as a vector store and chat memory for a seamless RAG application Reasons to choose MongoDB and Cohere for RAG The convergence of powerful technologies— MongoDB Atlas and Cohere Command R+ —unlocks significant potential for creating sophisticated, scalable, and high-performance systems for grounded generative AI (gen AI). This synergistic approach provides a comprehensive toolkit to handle the unique demands of modern AI applications. MongoDB Atlas and Cohere Command R+ facilitate the development of scalable, high-performing, and grounded AI applications. MongoDB Atlas provides a scalable, flexible, reliable, and fast database for managing large datasets used to ground generative models. Cohere Command R+ offers a sophisticated large language model (LLM) for natural language understanding and generation, incorporating retrieved data for factual accuracy and rapid inference. The combined use of MongoDB Atlas and Cohere Command R+ results in applications with fast and accurate responses, scalable architectures, and outputs informed by real-world data. This powerful combination represents a compelling approach to building the next generation of gen AI applications, facilitating innovation and unlocking novel opportunities across various sectors. Architecture overview In this section, we’ll look at the implementation architecture of the application and how the mixture of Cohere and MongoDB components flow underneath. Figure 1. Reference architecture, with Cohere and MongoDB components. The following list divides and explains the architecture components: 1. Document ingestion, chunking, and embedding with Cohere The initial step involves loading your source documents, which can be in various formats. These documents are then intelligently segmented into smaller, semantically meaningful chunks to optimize retrieval and processing. Cohere’s powerful embedding models generate dense vector representations of these text chunks, capturing their underlying meaning and semantic relationships. 2. Scalable vector and text storage in MongoDB Atlas MongoDB Atlas , a fully managed and scalable database service, serves as the central repository for both the original text chunks and their corresponding vector embeddings. 
MongoDB Atlas's built-in vector search capabilities (with MongoDB Atlas Vector Search) enable efficient and high-performance similarity searches based on the generated embeddings. This enables the scalable storage and retrieval of vast amounts of textual data and their corresponding vector representations. 3. Query processing and semantic search with MongoDB Atlas When a user poses a query, it undergoes a similar embedding process, using Cohere to generate a vector representation of the search intent. MongoDB Atlas then uses this query vector to perform a semantic search within its vector index. MongoDB Atlas efficiently identifies the most relevant document chunks based on their vector similarity to the query vector, surpassing simple keyword matching to comprehend the underlying meaning. 4. Reranking with Cohere To further refine the relevance of the retrieved document chunks, you can employ Cohere's reranking models. The reranker analyzes the initially retrieved chunks in the context of the original query, scoring and ordering them based on a more nuanced understanding of their relevance. This step ensures that you're prioritizing the most pertinent information for the final answer generation. 5. Grounded answer generation with Cohere Command R+ The architecture then passes the top-ranked document chunks to Cohere's Command R+ LLM. Command R+ uses its extensive knowledge and understanding of language to generate a grounded and coherent answer to the user's query, with direct support from the information extracted from the retrieved documents. This ensures that the answers are accurate, contextually relevant, and traceable to the source material. 6. Context-aware interactions and memory with MongoDB To enable more natural and conversational interactions, you can store the history of the conversation in MongoDB. This enables the RAG application to maintain context across multiple turns, referencing previous queries and responses to provide more informed and relevant answers. By incorporating conversation history, the application gains memory and can engage in more meaningful dialogues with users. For a better understanding of what each technical component does, reference the following table, which shows how the architecture assigns roles to each component:
MongoDB Atlas: Stores text chunks, vector embeddings, and chat logs
Cohere Embed API: Converts text into dense vector representations
MongoDB Atlas Vector Search: Performs efficient semantic retrieval via cosine similarity
Cohere Rerank API: Prioritizes the most relevant results from the retrieval
Cohere Command R+: Generates final responses grounded in top documents
In summary, this architecture provides a robust and scalable framework for building RAG applications. It integrates the document processing and embedding capabilities of Cohere with the scalable storage and vector search functionalities of MongoDB Atlas. By combining this with the generative power of Command R+, developers can create intelligent applications that provide accurate, contextually relevant, and grounded answers to user queries, while also maintaining conversational context for an enhanced user experience. Application setup The application requires the following components, ideally prepared beforehand: A MongoDB Atlas cluster (free tier is fine) A Cohere account and API key Python 3.8+ Docker and Docker Compose A configured AWS CLI Deployment steps 1.
Clone the repository. git clone https://github.com/mongodb-partners/maap-cohere-qs.git cd maap-cohere-qs 2. Configure the one-click.ksh script: Open the script in a text editor and fill in the required values for various environment variables: AWS Auth: Specify the AWS_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY for deployment. EC2 instance types: Choose suitable instance types for your workload. Network configuration: Update key names, subnet IDs, security group IDs, etc. Authentication keys: Fetch the project ID and the API public and private keys for MongoDB Atlas cluster setup, and update the script file with the values for APIPUBLICKEY, APIPRIVATEKEY, and GROUPID accordingly. 3. Deploy the application. chmod +x one-click.ksh ./one-click.ksh 4. Access the application: http://<ec2-instance-ip>:8501 Core workflow 1. Load and chunk data: Currently, data is loaded from a static, dummy source. However, you can update this to a live data source to ensure the latest data and reports are always available. For details on data loading, refer to the documentation. 2. Embed and store: Each chunk is embedded using embed-english-v3.0, and both the original chunk and the vector are stored in a MongoDB collection: model = "embed-english-v3.0" response = self.co.embed( texts=[text], model=model, input_type=input_type, embedding_types=['float'] ) 3. Semantic retrieval with vector search: Create a vector search index on top of your collection: index_models = [ { "database": "asset_management_use_case", "collection": "market_reports", "index_model": SearchIndexModel( definition={ "fields": [ { "type": "vector", "path": "embedding", "numDimensions": 1024, "similarity": "cosine" }, { "type": "filter", "path": "key_metrics.p_e_ratio" }, { "type": "filter", "path": "key_metrics.market_cap" }, { "type": "filter", "path": "key_metrics.dividend_yield" }, { "type": "filter", "path": "key_metrics.current_stock_price" } ] }, name="vector_index", type="vectorSearch", ), } ] A vector index in MongoDB enables fast, cosine-similarity-based lookups. MongoDB Atlas returns the top-k semantically similar documents, on top of which you can apply additional post filters to get a more fine-grained result set within a bounded space. 4. Re-ranking for accuracy: Instead of relying solely on vector similarity, the retrieved documents are reranked using Cohere's Rerank API, which is trained to order results by relevance. This dramatically improves answer quality and prevents irrelevant context from polluting the response. response = self.co.rerank( query=query, documents=rerank_docs, top_n=top_n, model="rerank-english-v3.0", rank_fields=["company", "combined_attributes"] ) The importance of reranking A common limitation in RAG systems is that dense vector search alone may retrieve documents that are semantically close but not contextually relevant. The Cohere Rerank API solves this by using a lightweight model to score query-document pairs for relevance. The ability to combine everything The end application runs on a Streamlit UI, as displayed below. Figure 2. Working application with UI. To achieve more direct and nuanced responses in data retrieval and analysis, you'll find that the strategic implementation of prefilters is paramount. Prefilters act as an initial, critical layer of data reduction, sifting through larger datasets to present a more manageable and relevant subset for subsequent, more intensive processing (a minimal sketch follows below).
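The sketch below is not the quick-start's actual code; it is a hedged illustration of the prefiltering pattern that embeds a user query with Cohere and runs an Atlas $vectorSearch stage filtered on the key_metrics fields registered in the index above. The connection string, API key, and threshold values are placeholders.

```python
import cohere
from pymongo import MongoClient

co = cohere.Client("<COHERE_API_KEY>")  # placeholder API key
client = MongoClient("<your_mongodb_connection_string>")  # placeholder connection string
reports = client["asset_management_use_case"]["market_reports"]

# Embed the user's query with the same Cohere model used at ingestion time.
query = "Stable companies with attractive dividends"
query_vector = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings[0]

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": 5,
            # Prefilter on fields declared as filter fields in the index definition;
            # the thresholds here are illustrative only.
            "filter": {
                "key_metrics.p_e_ratio": {"$lt": 20},
                "key_metrics.dividend_yield": {"$gte": 0.03},
            },
        }
    },
    {"$project": {"_id": 0, "company": 1, "combined_attributes": 1}},
]

results = list(reports.aggregate(pipeline))
```

Because the filter is evaluated inside the vector search stage, only documents that satisfy the key_metrics constraints are considered for similarity scoring, which is what keeps the result set bounded.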
This not only significantly enhances the efficiency of queries but also refines the precision and interpretability of the results. For instance, instead of analyzing sales trends across an entire product catalogue, a prefilter can limit the analysis to a specific product line, thereby revealing more granular insights into its performance, customer demographics, or regional variations. This level of specificity enables the extraction of more subtle patterns and relationships that might otherwise be obscured within a broader, less filtered dataset. Figure 3. Prefilters to be applied on top of MongoDB Atlas Vector Search. Conclusion Just by using MongoDB Atlas and Cohere’s API suite, you can deploy a fully grounded, semantically aware RAG system that is cost effective, flexible, and production grade. This quick-start enables your developers to build AI assistants that reason with your data without requiring extensive infrastructure. Start building intelligent AI agents powered by MongoDB Atlas. Visit our GitHub repo to try out the quick-start and unlock the full potential of semantic search, secure automation, and real-time analytics. Your AI-agent journey starts now. Ready to learn more about building AI applications with MongoDB? Head over to our AI Learning Hub .

July 23, 2025
Developer Blog

Transforming Financial Services with MongoDB and IBM Watsonx.ai

Financial institutions around the world are increasingly adopting AI-driven solutions to enhance user experiences, streamline operations, and deliver personalized financial insights. As a part of the MongoDB AI Applications Program (MAAP), IBM's Watsonx.ai and MongoDB Atlas unite to deliver scalable, enterprise-grade AI development. By integrating MongoDB Atlas and IBM Watsonx.ai, we've built an intelligent finance assistant that combines cutting-edge database management and generative AI (gen AI) capabilities. Modern financial institutions face challenges in delivering personalized, real-time assistance to their customers. Generic chatbots or static systems often fail to address nuanced queries, limiting their utility and customer satisfaction. By using MongoDB Atlas Vector Search and IBM Watsonx.ai's gen AI models, we can create a finance assistant capable of handling complex queries, retrieving relevant financial data, and providing actionable insights. This blog post will walk you through: The core architecture behind the finance assistant. The ways that MongoDB Atlas and IBM Watsonx.ai complement each other in building AI-driven financial solutions. The method of building an intelligent finance assistant with MongoDB Atlas and IBM Watsonx.ai. Architecture overview The architecture of the finance assistant integrates advanced Vector Search capabilities with IBM Watsonx.ai's reasoning and language generation models. The system provides an end-to-end pipeline for handling user queries, from natural language understanding to intelligent data retrieval and response generation. Figure 1. Finance assistant architecture. Key components Each component plays a specific role in enabling natural language understanding, data retrieval, and intelligent response generation. User input: Users interact with the finance assistant using natural language queries like: "What are my last three transactions?" "How can I improve my savings?" IBM Watsonx.ai Granite embedding models: Convert user queries into high-dimensional vector embeddings that represent semantic meaning. Granite language models: Generate intelligent and context-aware responses by reasoning over retrieved data. MongoDB Atlas Vector search index: Stores vector embeddings of transactional and financial data for fast, accurate similarity-based retrieval. Hybrid search: Combines keyword search with vector similarity for holistic data retrieval. Operational data store: Maintains structured and unstructured financial data in a scalable and secure database. LangChain Orchestrates the flow between MongoDB Atlas and IBM Watsonx.ai. Implements retrieval-augmented generation (RAG) for real-time query handling and response generation. The flow of the architecture Describes how user queries are transformed into insights through embedding, retrieval, and AI-driven response generation. Preprocessing Financial data, such as customer transactions or private knowledge bases, is vectorized using IBM Watsonx.ai embedding models. Vector embeddings are stored in MongoDB Atlas alongside metadata. Query execution User input is processed into embeddings and matched against the Vector Search index. Relevant data is retrieved and passed to IBM Watsonx.ai for contextual reasoning. Response generation Watsonx.ai generates intelligent, explainable recommendations based on the retrieved data. The response is delivered to the user in natural language. Why MongoDB Atlas?
MongoDB Atlas provides a powerful platform for managing and querying large-scale data using its vector search capabilities. When building a RAG pipeline, it simplifies the process by enabling the storage of vectorized embeddings alongside metadata in a flexible schema. Its hybrid search capabilities—combining traditional keyword searches with vector similarity searches—make it ideal for efficiently retrieving relevant documents or financial data based on user input. MongoDB Atlas also provides scalability and real-time data updates, making it a robust operational data layer for dynamic RAG workflows. By seamlessly integrating vector search with existing data, MongoDB Atlas minimizes latency and complexity so that your gen AI applications can retrieve the right context every time. Why IBM Watsonx.ai? IBM Watsonx.ai brings enterprise-grade foundation models to power the reasoning and generative components of a RAG pipeline. Watsonx.ai’s foundation models, such as the Granite series, offer robust embeddings and advanced reasoning capabilities, enabling the system to process retrieved documents and generate natural language responses tailored to the user’s query. With its focus on transparency, security, and customization, Watsonx.ai is particularly suited for regulated industries like finance. Its integration with tools like LangChain facilitates seamless orchestration between retrieval and generation, enabling RAG systems to go beyond static responses by delivering personalized, insightful, and context-rich outputs. Method for building an intelligent finance assistant with MongoDB Atlas and IBM Watsonx.ai For this tutorial, we will be using a financial dataset containing customer details, transactions, spending insights, and metadata. These records represent real-world information such as payments, savings, and expenses, making the dataset highly relevant for building an intelligent finance assistant. To generate the vector embeddings for storing and retrieving this data, we will use the Granite embedding models from IBM Watsonx.ai. These embeddings capture the semantic meaning of financial data, enabling efficient similarity searches and contextual data retrieval. To follow along, you will need an integrated development environment, a MongoDB Atlas account for data storage and indexing, and an IBM Watsonx.ai account for generating embeddings. By the end of this tutorial, you’ll have a functional system ready to support real-time financial assistance and personalized recommendations. Prerequisites Before starting the implementation, ensure you have the following set up: MongoDB Atlas: Cluster with transaction and customer data collections. MongoDB Atlas will be the primary database for storing and querying transaction and customer data. Steps: Create a MongoDB Atlas account Visit MongoDB Atlas and click “ Get Started .” Sign up using your email or log in with Google, GitHub, or Microsoft. Set up a cluster Click “ Build a Cluster ” after logging in. Choose a free tier cluster or upgrade for more features. Select your cloud provider (AWS, Google Cloud, or Azure) and region . Click “ Create Cluster ” to deploy (this may take a few minutes). Configure your cluster Go to “ Database Access ” and create a user with a username, password, and role (e.g., “Read and Write to Any Database”). In “ Network Access ,” add your IP address or allow all IPs (0.0.0.0/0) for unrestricted development access. IBM Watsonx.ai: API key for accessing large language models (LLMs). 
IBM Watsonx.ai will handle the reasoning and generative tasks. Steps: Create an IBM Cloud account Visit IBM Cloud and sign up for a free account. Set up Watsonx.ai Log in and search for "Watsonx.ai" in the catalog. Create an instance; a sandbox environment will be set up automatically. Generate an API key Go to "Manage," then "Access (IAM)" in the IBM Cloud dashboard. Click "Create API Key," name it (e.g., "watsonx_key"), and save it securely. Retrieve the service URL Find the service URL (e.g., https://us-south.ml.cloud.ibm.com) in the Watsonx.ai instance dashboard. You're ready to start building your finance assistant! Implementation steps To set up and run your finance assistant, follow the steps below to clone, configure, and execute the code. Ensure that your MongoDB Atlas cluster and IBM Watsonx.ai configurations are ready before proceeding. Step 1: Clone the code repository. The demo code is available on GitHub. Clone the project repository from the provided GitHub link, using this command: git clone <repository_url> cd <repository_directory> #Install dependencies (requires Python version 3.11 or higher) pip install -r requirements.txt This repository contains all the necessary files, including preprocessing.py, processing.py, and the HTML templates. Step 2: Configure the preprocessing script. Open the preprocessing.py file. This script is responsible for ingesting and vectorizing the financial data into MongoDB Atlas. Locate the MONGO_CONN variable and replace it with your MongoDB Atlas connection string: MONGO_CONN = "<your_mongodb_connection_string>" Save the file. Step 3: Run the preprocessing script. Execute the preprocessing.py script to preprocess and ingest the financial data into MongoDB Atlas: python preprocessing.py If the script runs successfully: A new database named banking_quickstart will be created in your MongoDB Atlas cluster. The following collections will appear: faqs customers_details transactions_details spending_insight_details The script will also generate vector embeddings for textual data, enabling efficient similarity searches in MongoDB Atlas. Create a vector search index for each of the four collections, changing the embedding field name accordingly. Step 4: Configure the processing script. Open the processing.py file. This script integrates IBM Watsonx.ai for reasoning and query handling. Update the following variables: MONGO_CONN (your MongoDB Atlas connection string) and the Watsonx.ai configuration (API key and service URL). These configurations enable secure access to both MongoDB Atlas and Watsonx.ai for data retrieval and AI-powered query handling. Step 5: Run the processing script. Execute the processing.py file to start the backend Flask server: python processing.py If the server starts successfully, the application will be hosted locally at 127.0.0.1:5000. Step 6: Access the application. Open your browser and navigate to the following URL: http://127.0.0.1:5000/login You will see the finance assistant login page. Use the provided credentials (or modify the preprocessing.py script to create custom login data). Use any customer ID numbered between 1 and 1000 (e.g., CUST0571). Figure 2. Customer login portal. Figure 3. Finance assistant dashboard. Additional technical details Preprocessing with Watsonx.ai: During the execution of preprocessing.py, the Granite embedding models from Watsonx.ai are used to vectorize textual data (e.g., transaction descriptions). The generated embeddings are stored in MongoDB Atlas for similarity-based queries; a hedged sketch of this embed-and-store step follows below.
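The sketch below is a rough illustration of that preprocessing step, not the repository's actual preprocessing.py. It assumes the ibm-watsonx-ai SDK's Embeddings interface; the credentials, project ID, and exact Granite model ID are placeholders you would replace with your own values.

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings
from pymongo import MongoClient

# Placeholder credentials and project ID; the model ID is an assumed Granite embedding model.
credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<IBM_CLOUD_API_KEY>")
embedder = Embeddings(
    model_id="ibm/granite-embedding-107m-multilingual",
    credentials=credentials,
    project_id="<WATSONX_PROJECT_ID>",
)

client = MongoClient("<your_mongodb_connection_string>")  # placeholder connection string
transactions = client["banking_quickstart"]["transactions_details"]

descriptions = ["Card payment of $42.10 at a grocery store on 2025-07-01"]
vectors = embedder.embed_documents(texts=descriptions)

# Store each description next to its embedding so a vector search index on the
# "embedding" field can serve similarity queries later.
transactions.insert_many(
    [{"description": d, "embedding": v} for d, v in zip(descriptions, vectors)]
)
```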
API configuration: The processing.py script integrates with IBM Watsonx.ai’s Granite language models to process natural language queries and generate meaningful responses. Server logs: Check the terminal logs for any errors or status updates during the execution of the Flask server. Logs provide insights into API calls, database interactions, and AI responses. The power of advanced vector search and enterprise AI Building a finance assistant using MongoDB Atlas and IBM Watsonx.ai demonstrates the power of combining advanced vector search capabilities with enterprise-grade AI models. This architecture not only provides real-time, accurate, and personalized financial insights but also highlights the scalability and flexibility needed for modern financial applications. In this tutorial, you’ve learned how to: Preprocess financial data using Watsonx.ai’s Granite embedding models to create vector embeddings. Store and query data efficiently in MongoDB Atlas using its Vector Search index and hybrid search capabilities. Integrate IBM Watsonx.ai’s foundation models for intelligent reasoning and natural language understanding. Build a seamless user interface to enable customers to access their financial information intuitively. And with this system, you can deliver: Personalized financial insights: Deliver tailored responses for individual users based on their financial data. Scalable performance: Effortlessly handle large datasets and complex queries. Enhanced user experiences: Provide customers with real-time, explainable, and context-aware recommendations. As the financial services sector continues to evolve, combining tools like MongoDB Atlas and IBM Watsonx.ai will become essential for delivering smarter AI-driven solutions. You can easily extend this architecture to include advanced analytics, fraud detection, or even investment forecasting, making it a robust foundation for future innovation. Ready to take your finance assistant to the next level? Start experimenting with more data, refining AI prompts, or exploring MongoDB Atlas and Watsonx.ai’s advanced features to unlock even greater potential! To fast-track your AI journey, explore the MongoDB AI Applications Program (MAAP). It brings together cutting-edge technologies and expert services from top AI and tech leaders including IBM to help your organization move seamlessly from concept to road map, prototype, and full-scale production.

July 21, 2025
Developer Blog

Embedded Objects and Other Index Gotchas

In a recent design review , the customer's application was in production, but performance had taken a nosedive as data volumes grew. It turned out that the issue was down to how they were indexing the embedded objects in their documents. This article explains why their indexes were causing problems, and how they could be fixed. Note that I've changed details for this use case to obfuscate the customer and application. All customer information shared in a design review is kept confidential. We looked at the schema, and things looked good. They'd correctly split their claim information across two documents: One contained a modest amount of queryable data (20 KB per claim). These documents included the _id of the second document in case the application needed to fetch it (which was relatively rare). The second contained the bulky raw data that's immutable, unindexed, and rarely read. They had 110K queryable documents in the first collection—claims. With 2.2 GB of documents (before compression, which only reduces on-disk size) and 4 GB of cache, there shouldn't have been any performance issues. We looked at some of the queries, and there was a pretty wide set of keys being filtered on and in different combinations, but none of them returned massive numbers of documents. Some queries were taking tens of seconds. It made no sense. Even a full collection scan should take well under a second for this configuration. And they'd even added indexes for their common queries. So then, we looked at the indexes… Figure 1. Collection size report in MongoDB Atlas. 15 indexes on one collection is on the high side and could slow down your writes, but it's the read performance that we were troubleshooting. But, those 15 indexes are consuming 85 GB of space. With the 4 GB of cache available on their M30 Atlas nodes, that’s a huge problem! There wasn't enough RAM in the system for the indexes to fit in cache. The result was that when MongoDB navigated an index, it would repeatedly hit branches that weren't yet in memory and then have to fetch them from disk. That’s slow. Taking a look at one of the indexes… Figure 2. Index definition in MongoDB Atlas. It's a compound index on six fields, but the first five of those fields are objects, and the sixth is an array of objects—this explains why the indexes were so large. Avoiding indexes on objects Even ignoring the size of the index, adding objects to an index can be problematic. Querying on embedded objects doesn't behave in the way that many people expect. If an index on an embedded object is to be used, then the query needs to include every field in the embedded object. 
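To see the difference in index definitions, here is a small PyMongo sketch (the article's own examples use mongosh; the connection string and database name are placeholders) that contrasts indexing the whole policy_holder object with indexing individual fields inside it.

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("<your_mongodb_connection_string>")  # placeholder
claims = client["insurance"]["claim"]  # assumed database name; collection name from the article

# Index on the embedded object itself: only queries that supply the entire
# policy_holder document, field for field, can use this index.
claims.create_index([("policy_holder", ASCENDING)], name="policy_holder_object")

# Index on individual fields within the object: usable by queries that filter
# on dotted paths such as policy_holder.first_name and policy_holder.last_name.
claims.create_index(
    [("policy_holder.first_name", ASCENDING), ("policy_holder.last_name", ASCENDING)],
    name="policy_holder_name",
)
```

The whole-object index only helps when the query matches the embedded document in its entirety, which is exactly the behavior demonstrated next.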
E.g., if I execute this query, then it matches exactly one of the documents in the database: db.getCollection('claim').findOne( { "policy_holder": { "first_name": "Janelle", "last_name": "Nienow", "dob": new Date("2024-12-16T23:56:49.643Z"), "location": { "street": "67628 Warren Road", "city": "Padbergstead", "state": "Minnesota", "zip_code": "44832-7187" }, "contact": { "email": "Janelle.Nienow@noxious-flood.org" } } } ); It delivers this result: { "_id": { "$oid": "67d801b7ad415ad6165ccd5f" }, "region": 12, "policy_holder": { "first_name": "Janelle", "last_name": "Nienow", "dob": { "$date": "2024-12-16T23:56:49.643Z" }, "location": { "street": "67628 Warren Road", "city": "Padbergstead", "state": "Minnesota", "zip_code": "44832-7187" }, "contact": { "email": "Janelle.Nienow@noxious-flood.org" } }, "policy_details": { "policy_number": "POL554359100", "type": "Home Insurance", "coverage": { "liability": 849000000, "collision": 512000, "comprehensive": 699000 } }, ... } The explain plan confirmed that MongoDB was able to use one of the defined indexes: Figure 3. The visual explain plan tool in MongoDB Atlas displaying that the compound index on policy_holder and messages was used. If just one field from the embedded object isn't included in the query, then no documents will match: db.getCollection('claim').findOne( { "policy_holder": { "first_name": "Janelle", "last_name": "Nienow", "dob": new Date("2024-12-16T23:56:49.643Z"), "location": { "street": "67628 Warren Road", "city": "Padbergstead", "state": "Minnesota", // "zip_code": "44832-7187" }, "contact": { "email": "Janelle.Nienow@noxious-flood.org" } } } ); This resulted in no matches—though the index is at least still used. If we instead pick out individual fields from the object to query on, then we get the results we expect: db.getCollection('claim').findOne( { "policy_holder.first_name": "Janelle", "policy_holder.last_name": "Nienow" } ); { "_id": { "$oid": "67d801b7ad415ad6165ccd5f" }, "region": 12, "policy_holder": { "first_name": "Janelle", "last_name": "Nienow", "dob": { "$date": "2024-12-16T23:56:49.643Z" }, "location": { "street": "67628 Warren Road", "city": "Padbergstead", "state": "Minnesota", "zip_code": "44832-7187" }, "contact": { "email": "Janelle.Nienow@noxious-flood.org" } }, "policy_details": { "policy_number": "POL554359100", "type": "Home Insurance", "coverage": { "liability": 849000000, "collision": 512000, "comprehensive": 699000 } }, ... } Unfortunately, none of the indexes that included policy_holder could be used as they were indexing the value of the complete embedded object, not the individual fields within it, and so a full collection scan was performed: Figure 4. The visual explain plan too warning that no index was available. Using compound indexes instead If we instead add a compound index that leads with the fields from the object we need to filter on, then that index will be used: Figure 5. Creating an index in MongoDB Atlas. Figure 6. Explain plan providing information for the compound index. As a quick refresher on using compound indexes, that index will be used if we query on just first_name: db.getCollection('claim').findOne( { "policy_holder.first_name": "Janelle", // "policy_holder.last_name": "Nienow" } ); Figure 7. Explain plan showing that the compound index was used. If we don't include the first key in the compound index, then it won't be used: db.getCollection('claim').findOne( { // "policy_holder.first_name": "Janelle", "policy_holder.last_name": "Nienow" } ); Figure 8. 
Explain plan providing more information on the query. However, you can use the index if you artificially include the leading keys in the query (though it would be more efficient if last_name had been the first key in the index): db.getCollection('claim').findOne( { "policy_holder.first_name": {$exists: true}, "policy_holder.last_name": "Nienow" } ); Figure 9. Explain plan showing the data for the index. Incompletely indexed queries While having indexes for your queries is critical, there is a cost to having too many, or to having indexes that include too many fields: writes get slower and pressure increases on cache occupancy. Sometimes, it's enough to have an index that does part of the work, and then rely on a scan of the documents found by the index to check the remaining keys. For example, the policy holder's home state isn't included in our compound index, but we can still query on it: db.getCollection('claim').findOne( { "policy_holder.first_name": "Janelle", "policy_holder.location.state": "Kentucky" } ); Figure 10. Explain plan showing that the index narrowed down the search. The explain plan shows that the index narrowed down the search from 110,000 documents to 111, which were then scanned to find the three matching documents. If it's rare for the state to be included in the query, then this can be a good solution. Partial indexes The main challenge in this design review was the size of the indexes, and so it's worth looking into another approach to limit the size of an index. Imagine that we need to be able to check on the names and email addresses of witnesses to accidents. We can add an index on the relevant fields: Figure 11. Adding an index to the relevant fields in Atlas. This index consumes 9.8 MB of cache space and must be updated when any document is added, or when any of these three fields are updated. Even if a document has null values for the indexed fields, or if the fields aren't even present in the document, the document will still be included in the index. If we look deeper into the requirements, we might establish that we only need to query this data for fraudulent claims. That means that we're wasting space in our index on entries for all of the other claims. We can exploit this requirement by creating a partial index, setting the partial filter expression to { "claim.status": "Fraud" }. Only documents that match that pattern will be included in the index. Figure 12. Creating a partial filter in Atlas. That reduces the size of the index to 57 KB (a saving of more than 99%): Figure 13. Index sizing report. Note that queries must include { "claim.status": "Fraud" } for this index to be used: db.getCollection('claim').findOne( { "witnesses.email": "Sammy.Bergstrom@hotmail.com", "claim.status": "Fraud" } ); Figure 14. Explain plan providing details on the index keys and documents examined. Conclusion Indexes are critical to database performance, whether you're using an RDBMS or MongoDB. MongoDB allows polymorphic documents, arrays, and embedded objects that aren't available in a traditional RDBMS. This leads to extra indexing opportunities, but also potential pitfalls. You should have indexes to optimize all of your frequent queries, but use the wrong type or too many of them and things could backfire. We saw that in this case with indexes taking up too much space and not being as general purpose as the developer believed. To compound problems, the database may perform well in development and for the early days in production.
Things go wrong over time as the collections grow and extra indexes are added. As soon as the working data set (indexes and documents) doesn’t fit in the cache, performance quickly declines. Well-informed use of compound and partial indexes will ensure that MongoDB delivers the performance your application needs, even as your database grows. Learn more about MongoDB design reviews Design reviews are a chance for a design expert from MongoDB to advise you on how best to use MongoDB for your application. The reviews are focused on making you successful using MongoDB. It's never too early to request a review. By engaging us early (perhaps before you've even decided to use MongoDB), we can advise you when you have the best opportunity to act on it. This article explained how using a MongoDB schema and set of indexes that match how your application works with data can meet your performance requirements. If you want help to come up with that schema, then a design review is how to get that help. Would your application benefit from a review? Schedule your design review today . Want to read more from Andrew? Head to his website .

July 16, 2025
Developer Blog

Matryoshka Embeddings: Smarter Embeddings with Voyage AI

In the realm of AI, embedding models are the bedrock of advanced applications like retrieval-augmented generation (RAG), semantic search, and recommendation systems. These models transform unstructured data (text, images, audio) into high-dimensional numerical vectors, allowing us to perform similarity searches and power intelligent features. However, traditional embedding models often generate fixed-size vectors, leading to trade-offs between performance and computational overhead. This post will dive deep into Matryoshka Representation Learning (MRL), a novel approach that creates flexible, multi-fidelity embeddings. We'll compare and contrast MRL with traditional embeddings and quantization, detailing its unique training process and showcasing how Voyage AI's voyage-3-large and the recently released voyage-3.5 models leverage MRL as well as quantization to deliver unparalleled efficiency with MongoDB Atlas Vector Search. Understanding embedding models At their core, embedding models learn to represent discrete items (words, sentences, documents) as continuous vectors in a multi-dimensional space. The key principle is that items with similar meanings or characteristics are mapped to points that are close to each other in this vector space. This spatial proximity then allows for efficient similarity comparisons using metrics like cosine similarity. For example, in a semantic search application, when a user queries "best vegan restaurants," the embedding model converts this query into a vector. It then compares this vector against a database of pre-computed embeddings for restaurant descriptions. Restaurants whose embeddings are "nearby" the query embedding are deemed relevant and returned to the user. Figure 1. Example embedding model. Image credit: Hugging Face Blog Challenges with traditional embeddings Historically, embedding models have generated vectors of a fixed size, for example, 768, 1024, or 4096 dimensions. While effective, this fixed-size nature presents challenges: Inflexibility: A model trained for, say, 768-dimensional embeddings will suffer a significant performance drop if you simply truncate its vectors to a smaller size, like 256 dimensions, without retraining. This means you're locked into a specific dimension size, even if a smaller representation would suffice for certain tasks. High computational load: Higher-dimensional vectors demand more computational resources for storage, transfer, and similarity calculations. In scenarios with large datasets or real-time inference, this can lead to increased latency and operational costs. Information loss on truncation: Without specific training, truncating traditional embeddings inevitably leads to substantial information loss, compromising the quality of downstream tasks. Matryoshka Representation Learning MRL, introduced by researchers from the University of Washington, Google Research, and Harvard University in 2022, offers an elegant solution to these challenges. Inspired by the Russian nesting dolls, MRL trains a single embedding model such that its full-dimensional output can be truncated to various smaller dimensions while still retaining high semantic quality. The magic lies in how the model is trained to ensure that the initial dimensions of the embedding are the most semantically rich, with subsequent dimensions adding progressively finer-grained information. This means you can train a model to produce, say, a 1024-dimensional embedding (a short slicing sketch appears below).
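The NumPy sketch below is a minimal illustration of truncation; the random vector simply stands in for a real MRL embedding, and the sliced vector is re-normalized so that cosine similarity remains well behaved.

```python
import numpy as np

# Stand-in for a full-length MRL embedding (e.g., 1024 dimensions).
full_embedding = np.random.default_rng(42).normal(size=1024)
full_embedding /= np.linalg.norm(full_embedding)

def truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize the result."""
    head = embedding[:dims]
    return head / np.linalg.norm(head)

for dims in (256, 512, 1024):
    print(dims, truncate(full_embedding, dims).shape)
```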
Then, for different use cases or performance requirements, you can simply take the first 256, 512, or any other number of dimensions from that same 1024-dimensional vector. Each truncated vector is still a valid and semantically meaningful representation, just at a different level of detail. Figure 2. Matryoshka embedding model truncating the output. Image Credit: &nbsp; Hugging Face Blog Understanding MRL with an analogy Imagine a movie. A 2048-dimensional MRL embedding might represent the "Full Movie". Truncating it to: 1024 dimensions: Still provides enough information for a "Movie Trailer." 512 dimensions: Gives a "Plot Summary & Movie Details." 256 dimensions: Captures the "Movie Title & Plot One-liner." This "coarse-to-fine" property ensures that each prefix of the full vector remains semantically rich and usable. You simply keep the first N dimensions from the full vector to truncate it. Figure 3. Visualizing the Matryoshka doll analogy for MRL. The unseen hand: How the loss function shapes embedding quality To truly grasp what makes MRL distinct, we must first understand the pivotal role of the loss function in the training of any embedding model. This mathematical function is the core mechanism that teaches these sophisticated models to understand and represent meaning. During a typical training step, an embedding model processes a batch of input data, producing a set of predicted output vectors. The loss function (“J” in the below diagram) then steps in, comparing these predicted embeddings (“y_pred”) against known "ground truth" or expected target values (“y”). It quantifies the discrepancy between what the model predicts and what it should ideally produce, effectively gauging the "error" in its representations. A high loss value signifies a significant deviation – a large "penalty" indicating the model is failing to capture the intended relationships (e.g., placing semantically similar items far apart in the vector space). Conversely, a low loss value indicates accurate capture of these relationships, ensuring that similar concepts (like different images of cats) are mapped close together, while dissimilar ones remain distant. Figure 4. Training workflow including the loss function. The iterative training process, guided by an optimizer, continuously adjusts the model's internal weights with the sole aim of minimizing this loss value. This relentless pursuit of a lower loss is precisely how an embedding model learns to generate high-quality, semantically meaningful vectors. MRL training process The key differentiator for MRL lies in its training methodology. Unlike traditional embeddings, where a single loss value is computed for the full vector, MRL training involves: Multiple loss values: Separate loss values are computed for multiple truncated prefixes of the vector (e.g., at 256, 512, 1024, and 2048 dimensions). Loss averaging: These individual losses are averaged (or summed), to calculate a total loss. Incentivized information packing: The model is trained to minimize this total loss. This process penalizes even the smallest prefixes if their loss is high, strongly incentivizing the model to pack the most crucial information into the earliest dimensions of the vector. This results in a model where information is "front-loaded" into early dimensions, ensuring accuracy remains strong even with fewer dimensions, unlike traditional models where accuracy drops significantly upon truncation. Examples of MRL-trained models include voyage-3-large and voyage-3.5 . MRL vs. 
quantization It's important to differentiate MRL from quantization, another common technique for reducing embedding size. While both aim to make embeddings more efficient, their approaches and benefits differ fundamentally. Quantization techniques compress existing high-dimensional embeddings into a more compact form by reducing the precision of the numerical values (e.g., from float32 to int8). The following table describes the precise differences between MRL and quantization:
Goal: MRL reduces embedding dimensionality (e.g., keeping 256 out of 2048 dims); quantization reduces embedding precision (e.g., int8 or binary embeddings instead of fp32).
Output type: MRL produces float32 vectors of varying lengths; quantization produces fixed-length vectors with lower-bit representations.
Training awareness: MRL uses multi-loss training across dimensions; quantization often uses quantization-aware training (QAT).
Use case: MRL trades off accuracy against compute and memory at inference; quantization minimizes storage and accelerates vector math operations.
Example (Voyage AI): voyage-3-large @ 512-dim-fp32 (MRL) vs. voyage-3-large @ 2048-dim-int8 (quantization).
Flexibility and efficiency with MRL The core benefit of MRL is its unparalleled flexibility and efficiency. Instead of being locked into a single, large vector size, you can: Choose what you need: Generate a full 2048-dimensional vector and then slice it to 256, 512, or 1024 dimensions based on your specific needs. One vector, multiple fidelities: A single embedding provides multiple levels of detail and accuracy. Lower compute, bandwidth, and storage: By using smaller vector dimensions, you drastically reduce the computational load for indexing, query processing, and data transfer, as well as the storage footprint in your database. Efficient computation: The embedding is computed once, and then you simply slice it to the desired dimensions, making it highly efficient. Voyage AI, in particular, leverages MRL by default across its models, including voyage-3-large and the latest voyage-3.5, enabling scalable embeddings with one model and multiple dimensions. This allows you to dynamically choose between space/latency and quality at query time, leading to efficient retrieval with minimal accuracy loss. Voyage AI's dual approach: MRL and quantization for ultimate efficiency Voyage AI models maximize efficiency by combining MRL and quantization. MRL enables flexible embeddings by allowing you to select the optimal vector length (for instance, using 512 instead of 2048 dimensions), resulting in significant reductions in size and computational overhead with minimal accuracy loss. Quantization further compresses these vectors by reducing their bit precision, which cuts storage needs and speeds up similarity search operations. This synergy allows you to choose embeddings tailored to your application's requirements: a voyage-3-large embedding can be used as a compact 512-dimensional floating-point vector (leveraging MRL) or as a full 2048-dimensional 8-bit integer vector (via quantization). The dual approach empowers you to balance accuracy, storage, and performance, ensuring highly efficient, flexible embeddings for your workload. As a result, Voyage AI models deliver faster inferences and help reduce infrastructure costs when powering applications with MongoDB Atlas Vector Search. Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.

July 14, 2025
Developer Blog

Don’t Just Build Agents, Build Memory-Augmented AI Agents

Insight Breakdown: This piece aims to reveal that, regardless of architectural approach (whether Anthropic's multi-agent coordination or Cognition's single-threaded consolidation), sophisticated memory management emerges as the fundamental determinant of agent reliability, believability, and capability. It marks the evolution from stateless AI applications toward truly intelligent, memory-augmented systems that learn and adapt over time. AI agents are intelligent computational systems that can perceive their environment, make informed decisions, use tools, and, in some cases, maintain persistent memory across interactions, evolving beyond stateless chatbots toward autonomous action. Multi-agent systems coordinate multiple specialized agents to tackle complex tasks, like a research team where different agents handle searching, fact-checking, citations, and research synthesis. Recently, two major players in the AI space released different perspectives on how to build these systems. Anthropic released an insightful piece highlighting their learnings on building multi-agent systems for deep research use cases. Cognition also released a post titled "Don't Build Multi-Agents," which appears to contradict Anthropic's approach directly. Two things stand out: Both pieces are right Yes, this sounds contradictory, but working with customers building agents of all scales and sizes in production, we find that both the use case and the application mode, in particular, are key factors to consider when determining how to architect your agent(s). Anthropic's multi-agent approach makes sense for deep research scenarios where sustained, comprehensive analysis across multiple domains over extended periods is required. Cognition's single-agent approach is optimal for conversational agents or coding tasks where consistency and coherent decision-making are paramount. The application mode (whether research assistant, conversational agent, or coding assistant) fundamentally shapes the optimal memory architecture. Anthropic also highlights this point when discussing the downsides of multi-agent architecture. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. (Anthropic, Building Multi-Agent Research System) Both pieces are saying the same thing Memory is the foundational challenge that determines agent reliability, believability, and capability. Anthropic emphasizes sophisticated memory management techniques (compression, external storage, context handoffs) for multi-agent coordination. Cognition emphasizes context engineering and continuous memory flow to prevent the fragmentation that destroys agent reliability. Both teams arrived at the same core insight: agents fail without robust memory management. Anthropic chose to solve memory distribution across multiple agents, while Cognition chose to solve memory consolidation within single agents. The key takeaway from both pieces for AI Engineers, or anyone developing an agentic platform, is to not just build agents, but to build memory-augmented AI agents. With that out of the way, the rest of this piece will provide you with the essential insights from both pieces that we think are important, and point to the memory management principles and design patterns we've observed among our customers building agents.
The key insights If you are building your agentic platform from scratch, you can extract much value from Anthropic's approach to building multi-agent systems, particularly their sophisticated memory management principles, which are essential for effective agentic systems. Their implementation reveals critical design considerations, including techniques to overcome context window limitations through compression, function calling, and storage functions that enable sustained reasoning across extended multi-agent interactions: foundational elements that any serious agentic platform must address from the architecture phase. Key insights: Agents are overthinkers Multi-agent systems trade efficiency for capability Systematic agent observation reveals failure patterns Context windows remain insufficient for extended sessions Context compression enables distributed memory management Let's go a bit deeper into how these insights translate into practical implementation strategies. Agents are overthinkers Anthropic researchers mentioned using explicit guidelines to steer agents into allocating the right amount of resources (tool calls, sub-agent creation, etc.), or else they tend to overengineer solutions. Without proper constraints, the agents would spawn excessive subagents for simple queries, conduct endless searches for nonexistent information, and apply complex multi-step processes to tasks requiring straightforward responses. Explicit guidance for agent behavior isn't entirely new; system prompts and instructions are typical parameters in most agent frameworks. However, the key insight here goes deeper than traditional prompting approaches. When agents are given access to resources such as data, tools, and the ability to create sub-agents, there needs to be explicit, unambiguous direction on how these resources are expected to be leveraged to address specific tasks. This goes beyond system prompts and instructions into resource allocation guidance, operational constraints, and decision-making boundaries that prevent agents from overengineering solutions or misusing available capabilities. Take, for example, the OpenAI Agents SDK, which exposes several parameters that describe the expected behavior of resources to the agent. The handoff_description argument specifies how a subagent should be leveraged in a multi-agent system built with the SDK, and the tool_use_behavior argument describes how a tool should be used, as the name suggests (a hedged sketch of these parameters appears a little further below). The key takeaway for AI Engineers is that multi-agent system implementation requires an extensive thinking process that involves what tools the agents are expected to leverage, the subagents in the system, and how resource utilization is communicated to the calling agent in a multi-agent system. When implementing resource allocation constraints for your agents, consider that traditional approaches of managing multiple specialized databases (vector DB for embeddings, graph DB for relationships, relational DB for structured data) compound the complexity problem and introduce tech stack sprawl, an anti-pattern to rapid AI innovation. Multi-agent systems trade efficiency for capability While multi-agent architectures can utilize more tokens and parallel processing for complex tasks, Anthropic found operational costs significantly higher due to coordination overhead, context management, and the computational expense of maintaining a coherent state across multiple agents.
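Picking up the OpenAI Agents SDK parameters mentioned above, here is a hedged sketch of how handoff_description and tool_use_behavior communicate resource expectations to the calling agent; the agent names, instructions, and constraints are purely illustrative.

```python
from agents import Agent

# A specialist sub-agent; handoff_description tells the coordinating agent
# when this specialist is worth invoking, which helps curb overengineering.
search_agent = Agent(
    name="Literature Search Agent",
    handoff_description="Use only for multi-source research questions; never for simple factual lookups.",
    instructions="Run at most three focused searches, then return a short, sourced summary.",
    tool_use_behavior="run_llm_again",  # feed tool results back to the model (the SDK default)
)

coordinator = Agent(
    name="Research Coordinator",
    instructions="Answer directly when you can; hand off only when the task genuinely needs deep research.",
    handoffs=[search_agent],
)
```

With guardrails like these in place, the cost side of the trade-off is still worth examining closely.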
In some cases, two heads are better than one, but they are also expensive within multi-agent systems. One thing we note here is that the use case used in Anthropic's multi-agent system is deep research. This use case requires extensive exploration of resources, including heavily worded research papers, sites, and documentation, to accumulate enough information to formulate the result of this use case (which is typically a 2000+ word essay on the user’s starting prompt). In other use cases, such as automated workflow with agents representing processes within the workflow, there might not be as much token consumption, especially if the process encapsulates deterministic steps such as database reads and write operations, and its output is execution results that are sentences or short summaries. The coordination overhead challenge becomes particularly acute when agents need to share state across different storage systems. Rather than managing complex data synchronization between specialized databases, MongoDB's native ACID compliance ensures that multi-agent handoffs maintain data integrity without external coordination mechanisms. This unified approach reduces both the computational overhead of distributed state management and the engineering complexity of maintaining consistency across multiple storage systems. Context compression enables distributed memory management Beyond reducing inference costs, compression techniques allow multi-agent systems to maintain shared context across distributed agents. Anthropic's approach involves summarizing completed work phases and storing essential information in external memory before agents transition to new tasks. This, coupled with the insight that Context windows remain insufficient for extended sessions, points to the fact that prompt compression or compaction techniques are still relevant and useful in a world where LLMs have extensive context windows. Even with a 200K token (approximately 150,000 words) capacity, Anthropic’s agents in multi-round conversations require sophisticated context management strategies, including compression, external memory offloading, and spawning fresh agents when limits are reached. We previously partnered with Andrew Ng and DeepLearning AI on a course on prompt compression techniques and retrieval-augmented generation (RAG) optimization. Systematic agent observation reveals failure patterns Systematic agent observation represents one of Anthropic's most practical insights. Essentially, rather than relying on guesswork (or vibes), the team built detailed simulations using identical production prompts and tools and then systematically observed step-by-step execution to identify specific failure modes. This phase in an agentic system has an extensive operational cost. From our perspective, working with customers building agents in production, this methodology addresses a critical gap most teams face: understanding how your agents actually behave versus how you think they should behave . Anthropic's approach immediately revealed concrete failure patterns that many of us have encountered but struggled to diagnose systematically. Their observations uncovered agents overthinking simple tasks, like we mentioned earlier, using verbose search queries that reduced effectiveness, and selecting inappropriate tools for specific contexts. 
As they note in their piece: " This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent. " The key insight here is moving beyond trial-and-error prompt engineering toward purposeful debugging . Instead of making assumptions about what should work, Anthropic demonstrates the value of systematic behavioral observation to identify the root causes of poor performance. This enables targeted prompt improvements based on actual evidence rather than intuition. We find that gathering, tracking, and storing agent process memory serves a dual critical purpose: not only is it vital for agent context and task performance, but it also provides engineers with the essential data needed to evolve and maintain agentic systems over time. Agent memory and behavioral logging remain the most reliable method for understanding system behavior patterns, debugging failures, and optimizing performance, regardless of whether you implement a single comprehensive agent or a system of specialized subagents collaborating to solve problems. MongoDB's flexible document model naturally accommodates the diverse logging requirements for both operational memory and engineering observability within a single, queryable system. One key piece that would be interesting to know from the Anthropic research team is what evaluation metrics they use. We’ve spoken extensively about evaluating LLMs in RAG pipelines, but what new agentic system evaluation metrics are developers working towards? We are answering these questions ourselves and have partnered with Galileo, a key player in the AI Stack, whose focus is purely on evaluating RAG and Agentic applications and making these systems reliable for production. Our learning will be shared in this upcoming webinar , taking place on July 17, 2025. However, for anyone building agentic systems, this represents a shift in development methodology—building agents requires building the infrastructure to understand them, and sandbox environments might become a key component of the evaluation and observability stack for Agents. Advanced implementation patterns Beyond the aforementioned core insights, Anthropic's research reveals several advanced patterns worth examining: The Anthropic piece hints at the implementation of advanced retrieval mechanisms that go beyond vector-based similarity between query vectors and stored information. Their multi-agent architecture enables sub-agents to call tools (an approach also seen in MemGPT ) to store their work in external systems, then pass lightweight references—presumably unique identification numbers of summarized memory components—back to the coordinator. We generally emphasize the importance of the multi-model retrieval approach to our customers and developers, where hybrid approaches combine multiple retrieval methods—using vector search to understand intent while simultaneously performing text search for specific product details. MongoDB's native support for vector similarity search and traditional indexing within a single system eliminates the need for complex reference management across multiple databases, simplifying the coordination mechanisms that Anthropic's multi-agent architecture requires. 
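A minimal sketch of that offload-and-reference pattern with PyMongo might look like the following. The database and collection names, document fields, and helper functions are assumptions for illustration; they are not Anthropic's actual implementation.

```python
from datetime import datetime, timezone
from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
memory_nodes = client["agent_memory"]["memory_nodes"]  # assumed shared memory collection

def offload_phase(agent_id: str, full_output: str, summary: str) -> str:
    """Persist a completed work phase and return a lightweight reference for the coordinator."""
    doc = {
        "agent_id": agent_id,
        "summary": summary,          # compressed form the coordinator keeps in its context window
        "full_output": full_output,  # complete work product, fetched only when actually needed
        "created_at": datetime.now(timezone.utc),
    }
    return str(memory_nodes.insert_one(doc).inserted_id)

def rehydrate(reference_id: str) -> dict:
    """Fetch the full work product behind a reference passed between agents."""
    return memory_nodes.find_one({"_id": ObjectId(reference_id)})

# The sub-agent stores its transcript and hands back only a short summary plus an ID.
ref = offload_phase(
    agent_id="search_subagent_1",
    full_output="...long tool-call transcript and source excerpts...",
    summary="Identified three relevant sources on the research question.",
)
```

Because the summary and the full output live in the same document, the same collection can later be indexed for the hybrid text-plus-vector retrieval described above.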
The Anthropic team implements continuity in the agent execution process by establishing clear boundaries between task completion and summarizing the current phase before moving to the next task. This creates a scalable system where memory constraints don't bottleneck the research process, allowing for truly deep and comprehensive analysis that spans beyond what any single context window could accommodate. In a multi-agent pipeline, each sub-agent produces partial results—intermediate summaries, tool outputs, extracted facts—and then hands them off into a shared “memory” database. Downstream agents will then read those entries, append their analyses, and write updated records back. Because these handoffs happen in parallel, you must ensure that one agent’s commit doesn’t overwrite another’s work or that a reader doesn’t pick up a half-written summary. Without atomic transactions and isolation guarantees, you risk: Lost updates , where two agents load the same document, independently modify it, and then write back, silently discarding one agent’s changes. Dirty or non-repeatable reads , where an agent reads another’s uncommitted or rolled-back write, leading to decisions based on phantom data. To coordinate these handoffs purely in application code would force you to build locking layers or distributed consensus, quickly becoming a brittle, error-prone web of external orchestrators. Instead, you want your database to provide those guarantees natively so that each read-modify-write cycle appears to execute in isolation and either fully succeeds or fully rolls back. MongoDB's ACID compliance becomes crucial here, ensuring that these boundary transitions maintain data integrity across multi-agent operations without requiring external coordination mechanisms that could introduce failure points. Application mode is crucial when discussing memory implementation . In Anthropic's case, the application functions as a research assistant, while in other implementations, like Cognition's approach, the application mode is conversational. This distinction significantly influences how agents operate and manage memory based on their specific application contexts. Through our internal work and customer engagements, we extend this insight to suggest that application mode affects not only agent architecture choices but also the distinct memory types used in the architecture. AI agents need augmented memory Anthropic’s research makes one thing abundantly clear: context window is not all you need. This extends to the key point that memory and agent engineering are two sides of the same coin. Reliable, believable, and truly capable agents depend on robust, persistent memory systems that can store, retrieve, and update knowledge over long, complex workflows. As the AI ecosystem continues to innovate on memory mechanisms, mastering sophisticated context and memory management approaches will be the key differentiator for the next generation of successful agentic applications. Looking ahead, we see “Memory Engineering” or “Memory Management” emerge as a key specialization within AI Engineering, focused on building the foundational infrastructure that lets agents remember, reason, and collaborate at scale. For hands-on guidance on memory management, check out our webinar on YouTube, which covers essential concepts and proven techniques for building memory-augmented agents. Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.
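As a hands-on coda to the handoff discussion above, here is a minimal sketch of a multi-document handoff wrapped in a MongoDB transaction with PyMongo. The collection names and document shape are assumptions for illustration, and transactions require a replica set or an Atlas cluster.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
db = client["agent_memory"]
memory_nodes = db["memory_nodes"]   # shared partial results written by sub-agents
plans = db["plans"]                 # coordinator's view of which phase each node is in

def hand_off(node_id, agent_id: str, phase_summary: str) -> None:
    """Append one agent's phase summary and advance the plan as a single atomic step."""
    with client.start_session() as session:
        with session.start_transaction():
            memory_nodes.update_one(
                {"_id": node_id},
                {"$push": {"phase_summaries": {"agent": agent_id, "summary": phase_summary}},
                 "$inc": {"revision": 1}},
                session=session,
            )
            plans.update_one(
                {"node_id": node_id},
                {"$set": {"status": "ready_for_next_agent", "last_writer": agent_id}},
                session=session,
            )
            # Leaving the block commits both writes together; an exception aborts both,
            # so a concurrent reader never observes a half-completed handoff.
```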

July 9, 2025
Developer Blog

Real-Time Threat Detection With MongoDB & PuppyGraph

Security operations teams face an increasingly complex environment. Cloud-native applications, identity sprawl, and continuous infrastructure changes generate a flood of logs and events. From API calls in AWS to lateral movement between virtual machines, the volume of telemetry is enormous—and it’s growing. The challenge isn’t just scale. It’s structure. Traditional security tooling often looks at events in isolation, relying on static rules or dashboards to highlight anomalies. But real attacks unfold as chains of related actions: A user assumes a role, launches a resource, accesses data, and then pivots again. These relationships are hard to capture with flat queries or disconnected logs. That’s where graph analytics comes in. By modeling your data as a network of users, sessions, identities, and events, you can trace how threats emerge and evolve. And with PuppyGraph, you don’t need a separate graph database or batch pipelines to get there. In this post, we’ll show how to combine MongoDB and PuppyGraph to analyze AWS CloudTrail data as a graph—without moving or duplicating data. You’ll see how to uncover privilege escalation chains, map user behavior across sessions, and detect suspicious access patterns in real time. Why MongoDB for cybersecurity data MongoDB is a popular choice for managing security telemetry. Its document-based model is ideal for ingesting unstructured and semi-structured logs like those generated by AWS CloudTrail, GuardDuty, or Kubernetes audit logging. Events are stored as flexible JSON documents, which evolve naturally as logging formats change. This flexibility matters in security, where schemas can shift as providers update APIs or teams add new context to events. MongoDB handles these changes without breaking pipelines or requiring schema migrations. It also supports high-throughput ingestion and horizontal scaling, making it well-suited for operational telemetry. Many security products and SIEM backends already support MongoDB as a destination for real-time event streams. That makes it a natural foundation for graph-based security analytics: The data is already there—rich, semi-structured, and continuously updated. Why graph analytics for threat detection Modern security incidents rarely unfold as isolated events. Attackers don’t just trip a single rule—they navigate through systems, identities, and resources, often blending in with legitimate activity. Understanding these behaviors means connecting the dots across multiple entities and actions. That’s precisely what graph analytics excels at. By modeling users, sessions, events, and assets as interconnected nodes and edges, analysts can trace how activity flows through a system. This structure makes it easy to ask questions that involve multiple hops or indirect relationships—something traditional queries often struggle to express. For example, imagine you’re investigating activity tied to a specific AWS account. You might start by counting how many sessions are associated with that account. Then, you might break those sessions down by whether they were authenticated using MFA. If some weren’t, the next question becomes: What resources were accessed during those sessions without MFA? This kind of multi-step investigation is where graph queries shine. Instead of scanning raw logs or filtering one table at a time, you can traverse the entire path from account to identity to session to event to resource, all in a single query.
You can also group results by attributes like resource type to identify which services were most affected. And when needed, you can go beyond metrics and pivot to visualization, mapping out full access paths to see how a specific user or session interacted with sensitive infrastructure. This helps surface lateral movement, track privilege escalation, and uncover patterns that static alerts might miss. Graph analytics doesn’t replace your existing detection rules; it complements them by revealing the structure behind security activity. It turns complex event relationships into something you can query directly, explore interactively, and act on with confidence. Query MongoDB data as a graph without ETL MongoDB is a popular choice for storing security event data, especially when working with logs that don’t always follow a fixed structure. Services like AWS CloudTrail produce large volumes of JSON-based records with fields that can differ across events. MongoDB’s flexible schema makes it easy to ingest and query that data as it evolves. PuppyGraph builds on this foundation by introducing graph analytics—without requiring any data movement. Through the MongoDB Atlas SQL Interface , PuppyGraph can connect directly to your collections and treat them as relational tables. From there, you define a graph model by mapping key fields into nodes and relationships. Figure 1. Architecture of the integration of MongoDB and PuppyGraph. This makes it possible to explore questions that involve multiple entities and steps, such as tracing how a session relates to an identity or which resources were accessed without MFA. The graph itself is virtual. There’s no ETL process or data duplication. Queries run in real time against the data already stored in MongoDB. While PuppyGraph works with tabular structures exposed through the SQL interface, many security logs already follow a relatively flat pattern: consistent fields like account IDs, event names, timestamps, and resource types. That makes it straightforward to build graphs that reflect how accounts, sessions, events, and resources are linked. By layering graph capabilities on top of MongoDB, teams can ask more connected questions of their security data, without changing their storage strategy or duplicating infrastructure. Investigating CloudTrail activity using graph queries To demonstrate how graph analytics can enhance security investigations, we’ll explore a real-world dataset of AWS CloudTrail logs. This dataset originates from flaws.cloud , a security training environment developed by Scott Piper. The dataset comprises anonymized CloudTrail logs collected over 3.5 years, capturing a wide range of simulated attack scenarios within a controlled AWS environment. It includes over 1.9 million events, featuring interactions from thousands of unique IP addresses and user agents. The logs encompass various AWS API calls, providing a comprehensive view of potential security events and misconfigurations. For our demonstration, we imported a subset of approximately 100,000 events into MongoDB Atlas. By importing this dataset into MongoDB Atlas and applying PuppyGraph’s graph analytics capabilities, we can model and analyze complex relationships between accounts, identities, sessions, events, and resources. Demo Let’s walk through the demo step by step! We have provided all the materials for this demo on GitHub . Please download the materials or clone the repository directly. 
If you’re new to integrating MongoDB Atlas with PuppyGraph, we recommend starting with the MongoDB Atlas + PuppyGraph Quickstart Demo to get familiar with the setup and core concepts. Prerequisites A MongoDB Atlas account (free tier is sufficient) Docker Python 3 Set up MongoDB Atlas Follow the MongoDB Atlas Getting Started guide to: Create a new cluster (free tier is fine). Add a database user. Configure IP access. Note your connection string for the MongoDB Python driver (you’ll need it shortly). Download and import CloudTrail logs Run the following commands to fetch and prepare the dataset: wget https://summitroute.com/downloads/flaws_cloudtrail_logs.tar mkdir -p ./raw_data tar -xvf flaws_cloudtrail_logs.tar --strip-components=1 -C ./raw_data gunzip ./raw_data/*.json.gz Create a virtual environment and install dependencies: # On some Linux distributions, install `python3-venv` first. sudo apt-get update sudo apt-get install python3-venv # Create a virtual environment, activate it, and install the necessary packages python -m venv venv source venv/bin/activate pip install ijson faker pandas pymongo Import the first chunk of CloudTrail data (replace the connection string with your Atlas URI): export MONGODB_CONNECTION_STRING="your_mongodb_connection_string" python import_data.py raw_data/flaws_cloudtrail00.json --database cloudtrail This creates a new cloudtrail database and loads the first chunk of data containing 100,000 structured events. Enable Atlas SQL interface and get JDBC URI To enable graph access: Create an Atlas SQL Federated Database instance. Ensure the schema is available (generate from sample, if needed). Copy the JDBC URI from the Atlas SQL interface. See PuppyGraph’s guide for setting up MongoDB Atlas SQL . Start PuppyGraph and upload the graph schema Start the PuppyGraph container: docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \ -e PUPPYGRAPH_PASSWORD=puppygraph123 \ -d --name puppy --rm --pull=always puppygraph/puppygraph:stable Log in to the web UI at http://localhost:8081 with: Username: puppygraph. Password: puppygraph123. Upload the schema: Open schema.json. Fill in your JDBC URI, username, and password. Upload via the Upload Graph Schema JSON section or run: curl -XPOST -H "content-type: application/json" \ --data-binary @./schema.json \ --user "puppygraph:puppygraph123" localhost:8081/schema Wait for the schema to upload and initialize (approximately five minutes). Figure 2: A graph visualization of the schema, which models the graph from relational data. Run graph queries to investigate security activity Once the graph is live, open the Query panel in PuppyGraph’s UI. Let's say we want to investigate the activity of a specific account. First, we count the number of sessions associated with the account. Cypher: MATCH (a:Account)-[:HasIdentity]->(i:Identity) -[:HasSession]->(s:Session) WHERE id(a) = "Account[811596193553]" RETURN count(s) Gremlin: g.V("Account[811596193553]") .out("HasIdentity").out("HasSession").count() Figure 3. Graph query in the PuppyGraph UI. Then, we want to see how many of these sessions are MFA-authenticated or not. Cypher: MATCH (a:Account)-[:HasIdentity]->(i:Identity) -[:HasSession]->(s:Session) WHERE id(a) = "Account[811596193553]" RETURN s.mfa_authenticated AS mfaStatus, count(s) AS count Gremlin: g.V("Account[811596193553]") .out("HasIdentity").out("HasSession") .groupCount().by("mfa_authenticated") Figure 4. Graph query results in the PuppyGraph UI. 
Next, we investigate those sessions that are not MFA authenticated and see what resources they accessed. Cypher: MATCH (a:Account)-[:HasIdentity]-> (i:Identity)-[:HasSession]-> (s:Session {mfa_authenticated: false}) -[:RecordsEvent]->(e:Event) -[:OperatesOn]->(r:Resource) WHERE id(a) = "Account[811596193553]" RETURN r.resource_type AS resourceType, count(r) AS count Gremlin: g.V("Account[811596193553]").out("HasIdentity") .out("HasSession") .has("mfa_authenticated", false) .out('RecordsEvent').out('OperatesOn') .groupCount().by("resource_type") Figure 5. PuppyGraph UI showing results that are not MFA authenticated. We show those access paths in a graph. Cypher: MATCH path = (a:Account)-[:HasIdentity]-> (i:Identity)-[:HasSession]-> (s:Session {mfa_authenticated: false}) -[:RecordsEvent]->(e:Event) -[:OperatesOn]->(r:Resource) WHERE id(a) = "Account[811596193553]" RETURN path Gremlin: g.V("Account[811596193553]").out("HasIdentity").out("HasSession").has("mfa_authenticated", false) .out('RecordsEvent').out('OperatesOn') .path() Figure 6. Graph visualization in PuppyGraph UI. Tear down the environment When you’re done: docker stop puppy Your MongoDB data will persist in Atlas, so you can revisit or expand the graph model at any time. Conclusion Security data is rich with relationships, between users, sessions, resources, and actions. Modeling these connections explicitly makes it easier to understand what’s happening in your environment, especially when investigating incidents or searching for hidden risks. By combining MongoDB Atlas and PuppyGraph, teams can analyze those relationships in real time without moving data or maintaining a separate graph database . MongoDB provides the flexibility and scalability to store complex, evolving security logs like AWS CloudTrail, while PuppyGraph adds a native graph layer for exploring that data as connected paths and patterns. In this post, we walked through how to import real-world audit logs, define a graph schema, and investigate access activity using graph queries. With just a few steps, you can transform a log collection into an interactive graph that reveals how activity flows across your cloud infrastructure. If you’re working with security data and want to explore graph analytics on MongoDB Atlas , try PuppyGraph’s free Developer Edition . It lets you query connected data, such as users, sessions, events, and resources, all without ETL or infrastructure changes.
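If you want to poke at the imported data directly before or after building the graph, a small PyMongo sanity check like the one below can help. The collection and field names are assumptions (inspect the cloudtrail database to see exactly what import_data.py created in your cluster), and flat queries like this complement, rather than replace, the graph traversals shown above.

```python
import os
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_CONNECTION_STRING"])
db = client["cloudtrail"]  # database created during the import step

print("Collections:", db.list_collection_names())

events = db["events"]  # assumed collection name; adjust to match the import script's output
print("Event count:", events.estimated_document_count())

# Top API calls, assuming the standard CloudTrail eventName field survived the import.
pipeline = [
    {"$group": {"_id": "$eventName", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for row in events.aggregate(pipeline):
    print(row["_id"], row["count"])
```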

July 7, 2025
Developer Blog

Natural-Language Agents: MongoDB Text-to-MQL + LangChain

The text-to-MQL capability available in the LangChain MongoDB package converts natural language into MongoDB Query Language, enabling applications to process queries like, "Show me movies from the 1990s with ratings above 8.0," and automatically generate the corresponding MongoDB operations. This guide demonstrates how to build production-ready applications that leverage text-to-MQL for conversational database interfaces, covering agent architectures, conversation memory, and reliable database interactions at scale. Understanding text-to-MQL: Beyond simple query translation Text-to-MQL shifts database interaction from manual query construction to natural language processing. Traditional database applications require developers to parse user intent, construct queries, handle validation, and format results. Text-to-MQL applications can accept natural language directly: # Traditional approach def get_top_movies_by_rating(min_rating, limit): return db.movies.aggregate([ {"$match": {"imdb.rating": {"$gte": min_rating}}}, {"$sort": {"imdb.rating": -1}}, {"$limit": limit} ]) # Text-to-MQL approach def process_natural_language_query(user_query): return agent.invoke({"messages": [("user", user_query)]}) This transformation enables natural language interfaces for complex database operations, making data access intuitive for end users while reducing development effort for database interaction logic. The MongoDB agent toolkit: Implementing text-to-MQL Install the agent toolkit and its companion packages: pip install langchain-mongodb langchain-openai langgraph The LangChain MongoDB agent_toolkit provides four core tools that work together to implement text-to-MQL functionality: from langchain_mongodb.agent_toolkit import MongoDBDatabase, MongoDBDatabaseToolkit db = MongoDBDatabase.from_connection_string(connection_string, database="sample_mflix") toolkit = MongoDBDatabaseToolkit(db=db, llm=llm) tools = toolkit.get_tools() Set up MongoDB Atlas with our sample movie dataset and start experimenting with text-to-MQL in minutes: get started with the Atlas Free Tier, load the sample MFlix dataset, and follow the full notebook demonstration. The text-to-MQL workflow When a user asks, "Which theaters are furthest west?" the text-to-MQL system follows this process:

| Step | Tool | What happens | Example |
| --- | --- | --- | --- |
| 1. Discovery | mongodb_list_collections | Agent identifies available data | Finds theaters, movies, and users collections |
| 2. Schema understanding | mongodb_schema | Agent examines relevant collection structure | Discovers location.geo field in theaters |
| 3. Query generation | LLM reasoning | Natural language converts to MongoDB syntax | Creates geospatial aggregation pipeline |
| 4. Validation | mongodb_query_checker | Agent verifies query correctness | Checks syntax and field references |
| 5. Execution | mongodb_query | Agent runs the validated query | Returns sorted theaters by longitude |

This workflow handles complex operations automatically—including geospatial queries, aggregations, and multi-collection operations—without requiring manual aggregation pipeline development. Building your first agent? Follow our step-by-step guide: Build Agents with LangGraph and MongoDB. Complex query examples Text-to-MQL handles sophisticated analytical queries: def demo_basic_queries(): queries = [ "List the top 5 movies with highest IMDb ratings", "Who are the top 10 most active commenters?", # ... additional queries for theaters, geographic analysis, director analytics ] for i, query in enumerate(queries): # Execute each text-to-MQL query in a separate conversation thread execute_graph_with_memory(f"demo_{i}", query) Query complexity examples: Temporal analysis: "Show me movie rating trends by decade for sci-fi films"—automatically filters by genre, groups by decade, and calculates statistical aggregations. Geographic intelligence: "Which states have the most theaters and what's their average capacity?"—discovers geographic fields, groups by state boundaries, and calculates regional statistics. Cross-collection analytics: "Find directors with at least 10 films who have the highest average ratings"—joins movie and director data, applies complex filtering and ranking logic. See these workflows in action Our interactive notebook demonstrates each step with live code examples you can run and modify. Explore the complete notebook in our Gen AI Showcase. Agent architecture patterns for text-to-MQL Two proven patterns address different text-to-MQL requirements based on your application's predictability needs. Pattern 1: ReAct agents for dynamic processing ReAct (Reasoning + Acting) agents provide flexible text-to-MQL processing where the optimal query strategy isn't predetermined: from langgraph.prebuilt import create_react_agent def create_flexible_text_to_mql_agent(): # Create ReAct agent with MongoDB tools and conversation memory checkpointer = MongoDBSaver(client) return create_react_agent(llm, toolkit.get_tools(), checkpointer=checkpointer) # Usage: Create agent and execute queries with conversation context agent = create_flexible_text_to_mql_agent() config = {"configurable": {"thread_id": "exploration_session"}} agent.invoke({"messages": [("user", "Find anomalies in user behavior patterns")]}, config) For more details, see how the MongoDBDatabaseToolkit can be used to develop ReAct-style agents. Pattern 2: Structured workflows for predictable operations For applications requiring consistent text-to-MQL behavior, implement deterministic workflows: def list_collections(state: MessagesState): # Call mongodb_list_collections tool to discover available data # Returns updated message state with collection list return {"messages": [call_msg, tool_response]} def generate_query(state: MessagesState): # Use LLM with MongoDB tools to convert natural language to MQL # Returns updated message state with generated query return {"messages": [llm_response]} # See notebook for complete node implementations def create_langgraph_agent_with_enhanced_memory(): summarizing_checkpointer = LLMSummarizingMongoDBSaver(client, llm) g = StateGraph(MessagesState) g.add_node("list_collections", list_collections) g.add_node("get_schema", schema_node) # ... add nodes for: generate_query, run_query, format_answer g.add_edge(START, "list_collections") g.add_edge("list_collections", "get_schema") # ...
connect remaining edges: get_schema → generate_query → run_query → format_answer → END return g.compile(checkpointer=summarizing_checkpointer) Choosing your agent pattern

| Considerations | ReAct agents | Structured workflows |
| --- | --- | --- |
| Exploratory analytics | ✅ Adapts to unpredictable queries by dynamically selecting and chaining appropriate tools at runtime | ❌ Too rigid for exploration, as a fixed workflow must be manually updated to support new "what-if" paths |
| Interactive dashboards | ✅ Flexible drill-down capabilities, enabling on-the-fly responses to any dashboard interaction | ❌ Fixed workflow is limiting, because a structured graph requires drill-down paths to be enumerated in advance |
| API endpoint optimization | ❌ Unpredictable response times, since ReAct's dynamic reasoning loops can lead to variable per-request latency | ✅ Consistent performance, as a structured agent runs the same sequence of steps |
| Customer-facing apps | ❌ Variable behavior, as ReAct may choose different tool paths for identical inputs | ✅ Predictable user experience, since a fixed workflow yields the same sequence and similar output the majority of the time |
| Automated systems | ❌ Hard to debug failures, as troubleshooting requires tracing through a dynamic chain of LLM decisions and tool calls | ✅ Clear failure isolation, where failures immediately point to the specific node that broke, speeding up diagnostics |

Conversational text-to-MQL: Maintaining query context Text-to-MQL's real power emerges in multi-turn conversations where users can build complex analytical workflows through natural dialogue. LangGraph's MongoDB checkpointing implementation preserves conversation context across interactions. LangGraph MongoDB checkpointer for stateful text-to-MQL Install the MongoDBSaver checkpointer with the following command: pip install -U langgraph-checkpoint-mongodb pymongo The MongoDBSaver checkpointer transforms text-to-MQL from isolated query translation into conversational analytics: from langgraph.checkpoint.mongodb import MongoDBSaver class LLMSummarizingMongoDBSaver(MongoDBSaver): def __init__(self, client, llm): super().__init__(client) self.llm = llm # ... initialize summary cache def put(self, config, checkpoint, metadata, new_versions): # Generate human-readable step summary using LLM step_summary = self.summarize_step(checkpoint) # Add summary to checkpoint metadata for debugging enhanced_metadata = metadata.copy() if metadata else {} enhanced_metadata['step_summary'] = step_summary # ... add timestamp and other metadata return super().put(config, checkpoint, enhanced_metadata, new_versions) def create_react_agent_with_enhanced_memory(): # Create ReAct agent with intelligent conversation memory summarizing_checkpointer = LLMSummarizingMongoDBSaver(client, llm) return create_react_agent(llm, toolkit.get_tools(), checkpointer=summarizing_checkpointer) Conversational workflows in practice The checkpointer enables sophisticated, multi-turn text-to-MQL conversations: def demo_conversation_memory(): thread_id = f"conversation_demo_{uuid.uuid4().hex[:8]}" conversation = [ "List the top 3 directors by movie count", "What was the movie count for the first director?", # ... additional contextual follow-up questions ] for query in conversation: # Execute each query in same thread to maintain conversation context execute_graph_with_memory(thread_id, query) What conversation memory enables: Contextual follow-ups: Users can ask "What about comedies?" after querying movie genres.
Progressive refinement: Each query builds on previous results for natural drill-down analysis. Session persistence: Conversations survive application restarts and resume exactly where they left off. Multi-user isolation: Different users maintain separate conversation threads. This creates readable execution logs for debugging: Step 1 [14:23:45] User asks about movie trends Step 2 [14:23:46] Text-to-MQL discovers movies collection Step 3 [14:23:47] Generated aggregation pipeline Step 4 [14:23:48] Query validation successful Step 5 [14:23:49] Returned 15 trend results Production implementation guide Moving text-to-MQL applications from development to production requires addressing performance, monitoring, testing, and integration concerns. Performance optimization Text-to-MQL applications face unique challenges: LLM API calls are expensive while generated queries can be inefficient. Implement comprehensive optimization: # Optimization ideas for production text-to-MQL systems: class OptimizedTextToMQLAgent: def __init__(self): # Cache frequently requested queries and schema information self.query_cache = {} self.schema_cache = {} def process_query(self, user_query): # Check cache for similar queries to reduce LLM API calls if cached_result := self.check_query_cache(user_query): return cached_result # Generate new query and cache result mql_result = self.agent.invoke({"messages": [("user", user_query)]}) # ... cache result for future use return mql_result def optimize_generated_mql(query, collection_name): # Add performance hints and limits to agent-generated queries # Example: Add index hints for known collections if collection_name == 'movies' and '$sort' in str(query): query.append({'$hint': {'imdb.rating': -1}}) # Always limit result sets to prevent runaway queries if not any('$limit' in stage for stage in query): query.append({'$limit': 1000}) return query Optimization strategies: Query caching: Cache based on semantic similarity rather than exact string matching. Index hints: Map common query patterns to existing indexes for better performance. Result limits: Always add limits to prevent runaway queries from returning entire collections. Schema caching: Cache collection schemas to reduce repeated discovery operations. Read more about how to implement caching using the MongoDBCache module. Monitoring and testing Unlike traditional database applications, text-to-MQL systems require monitoring conversation state and agent decision-making: def memory_system_stats(): # Monitor text-to-MQL conversation system health db_checkpoints = client['checkpointing_db'] total_checkpoints = checkpoints.count_documents({}) total_threads = len(checkpoints.distinct('thread_id')) # ... additional metrics like average session length, memory usage return {"checkpoints": total_checkpoints, "threads": total_threads} def test_enhanced_summarization(): # Test agent with variety of query patterns test_queries = [ "How many movies are in the database?", "Find the average rating of all movies", # ... 
additional test queries covering different analytical patterns ] # Execute all queries in same thread to test conversation flow for query in test_queries: execute_graph_with_memory(thread_id, query) # Inspect results to verify LLM summarization quality inspect_thread_history(thread_id) def compare_agents_with_memory(query: str): # Compare ReAct vs structured workflow performance # Execute same query with both agent types execute_react_with_memory(react_thread, query) execute_graph_with_memory(graph_thread, query) return {"react_thread": react_thread, "graph_thread": graph_thread} Essential monitoring Monitoring is crucial for maintaining the reliability and performance of your text-to-MQL agents. Start by tracking conversation thread growth and average session length to understand usage patterns and memory demands over time. Keep a close eye on query success rates, response times, and large language model (LLM) API usage to identify potential performance bottlenecks. Following MongoDB monitoring best practices can help you set up robust observability across your stack. Additionally, set alerts for any degradation in key text-to-MQL performance metrics, such as increased latency or failed query generation. Finally, implement automated cleanup policies to archive or delete stale conversation threads, ensuring that your system remains performant and storage-efficient. Testing strategies Thorough testing ensures your agents produce consistent and accurate results under real-world conditions. Begin by testing semantically similar natural language queries to validate that they generate equivalent MQL results. It's also helpful to regularly compare the behavior and output of different agent execution modes—such as ReAct-style agents versus structured workflow agents—to benchmark performance and consistency. Establish baseline metrics for success rates and response times so you can track regressions or improvements over time. Don’t forget to simulate concurrent conversations and introduce varying query complexity in your tests to evaluate how your system handles real-time load and edge cases. Integration patterns Text-to-MQL agents can be integrated into applications in several ways, depending on your architecture and latency requirements. One common pattern is exposing agent functionality via RESTful endpoints or WebSocket streams, allowing client apps to send natural language queries and receive real-time responses. Alternatively, you can deploy agents as dedicated microservices, making it easier to scale, monitor, and update them independently from the rest of your system. For deeper integration, agents can be embedded directly into existing data access layers, enabling seamless transitions between traditional query logic and natural language interfaces without major architectural changes. Security and access control To safely run text-to-MQL agents in production, robust security practices must be in place. Start by implementing role-based query restrictions so that different agents or user groups have tailored access to specific data. Logging all agent-generated queries—along with the user identities and corresponding natural language inputs—creates an audit trail for traceability and debugging. To prevent runaway queries or abuse, enforce limits on query complexity and result set size. Lastly, use connection pooling strategies that can scale with agent activity while maintaining session isolation, ensuring responsiveness and security across high-traffic workloads. 
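One way to implement the audit-trail recommendation above is a thin wrapper around agent invocation that records the user identity, the natural language input, and the outcome in a MongoDB collection. The database and collection names and the helper below are illustrative assumptions, not part of the LangChain MongoDB toolkit.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

mongo = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
audit_log = mongo["observability"]["agent_query_audit"]  # assumed audit collection

def invoke_with_audit(agent, user_id: str, thread_id: str, question: str):
    """Run a text-to-MQL agent and persist an audit record of the interaction."""
    config = {"configurable": {"thread_id": thread_id}}
    started = datetime.now(timezone.utc)
    status, result = "ok", None
    try:
        result = agent.invoke({"messages": [("user", question)]}, config)
    except Exception as exc:  # keep a record of failed generations as well
        status, result = "error", {"error": str(exc)}
    audit_log.insert_one({
        "user_id": user_id,
        "thread_id": thread_id,
        "question": question,
        "status": status,
        "latency_ms": (datetime.now(timezone.utc) - started).total_seconds() * 1000,
        "timestamp": started,
    })
    return result
```

Because audit records are ordinary documents, the same collection can also feed the monitoring metrics discussed earlier (success rates, latency, per-user volume) through simple aggregations.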
Production Deployment Checklist Before deploying your text-to-MQL agent system to production, it’s important to implement safeguards and best practices that ensure reliability, security, and maintainability. Start by setting appropriate resource limits, such as timeouts for both LLM API calls and MongoDB queries, to prevent long-running or stalled requests from impacting performance. Incorporate robust error handling to ensure the system can gracefully degrade or return fallback messages when query generation or execution fails. To protect your system from abuse or unintentional overuse, enforce rate limiting with per-user query limits. Maintain clear environment separation by using different agents and database connections for development, staging, and production environments, reducing the risk of cross-environment interference. Adopt configuration management practices by externalizing critical parameters such as the LLM model being used, timeout thresholds, and database settings—making it easier to update or tune the system without redeploying code. Make sure your monitoring integration includes text-to-MQL-specific metrics, tracked alongside broader application health metrics. Finally, establish a robust backup strategy that ensures conversation history and agent memory are backed up according to your organization’s data retention and recovery policies. Together, these practices create a resilient foundation for deploying intelligent agents at scale. Atlas database features supporting agents Atlas offers powerful core database features that make it a strong foundation for LangChain text-to-MQL agents. While these features aren’t specific to text-to-MQL, they provide the performance, scalability, and flexibility needed to support production-grade agentic systems. 3-in-one backend architecture Atlas can serve as a unified backend that fulfills three critical roles in an agentic stack by acting as the: Primary data store , housing your queryable application collections—such as movies, users, or analytics. Vector store for embedding-based semantic search if you’re leveraging vector search capabilities. Memory store , enabling conversation history persistence and agent checkpointing across user interactions. This 3-in-one architecture reduces the need for external services and simplifies your overall infrastructure. Single connection benefits By using a single Atlas cluster to manage your data, vectors, and memory, you streamline the development and deployment process. This unified approach minimizes configuration complexity and makes it easier to maintain your system. It also provides performance advantages through data locality—allowing your agent to query related information efficiently without needing to switch between services or endpoints. Logical database organization To keep your agent system organized and maintainable, you can logically separate storage needs within your Atlas cluster. Application data can reside in collections like movies, users, or analytics. Agent-related infrastructure—such as conversation state and memory—can be stored in a dedicated checkpointing_db . If your agent uses semantic search, vector embeddings can be stored in purpose-built vector_search collections. This structure supports clear boundaries between functionality while maintaining the simplicity of a single database backend. 
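A minimal sketch of that 3-in-one organization from application code might look like the following, reusing the names mentioned above (sample_mflix, checkpointing_db, vector_search); the connection URI and collection names are placeholders to adapt.

```python
from pymongo import MongoClient
from langgraph.checkpoint.mongodb import MongoDBSaver

# One client, one Atlas cluster, three logical roles.
client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI

app_db = client["sample_mflix"]                      # primary application data (movies, users, comments)
checkpointer = MongoDBSaver(client)                  # agent memory and conversation state (checkpointing_db)
embeddings = client["vector_search"]["embeddings"]   # optional vector store for semantic search

# Application queries and agent memory share a single connection pool and security model.
top_movie = app_db["movies"].find_one(sort=[("imdb.rating", -1)])
threads = client["checkpointing_db"]["checkpoints"].distinct("thread_id")  # assumed checkpoint collection
```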
Future directions for text-to-MQL applications Text-to-MQL represents the foundation for several emerging application patterns: Multi-modal data interfaces: Applications that combine text-to-MQL with vector search and graph queries, enabling users to ask questions that span structured data, semantic search, and relationship analysis within single conversations. Autonomous data exploration: Text-to-MQL agents that can suggest follow-up questions and identify interesting patterns in data, guiding users through exploratory analysis workflows. Intelligent query optimization: Text-to-MQL systems that learn from usage patterns to automatically optimize query generation, suggest more efficient question phrasings, and recommend database schema improvements. Collaborative analytics: Multi-user text-to-MQL environments where teams can share conversation contexts and build on each other's analytical discoveries through natural language interfaces. These trends point toward a future where natural language becomes a powerful, flexible layer for interacting with data across every stage of the analytics and application lifecycle. Conclusion The text-to-MQL capabilities available in the LangChain MongoDB package provide the foundation for building data-driven applications with conversational interfaces. The architectural patterns shown here—ReAct agents for flexibility and structured workflows for predictability—address different technical requirements while sharing common patterns for memory management and error handling. When choosing between these patterns, consider your specific requirements: ReAct agents work well for flexible data exploration and dynamic query generation, while structured workflows provide predictable performance and easier debugging. The memory systems and production patterns demonstrated here help ensure these agents can operate reliably at scale. These implementation patterns show how to move beyond basic database APIs toward more natural, conversational data interfaces. The LangChain text-to-MQL toolkit provides the building blocks, and these patterns provide the architectural guidance for building reliable, production-ready systems. The future of application development increasingly lies in natural language interfaces for data. Text-to-MQL provides the technical foundation to build that future today, enabling applications that understand what users want to know and automatically translate those questions into precise database operations. Start building conversational database apps today The LangChain MongoDB text-to-MQL package gives you everything needed to build production-ready applications with natural language database interfaces. What's next? Get hands-on: Load the MFlix sample dataset and run your first text-to-MQL queries. Go deeper: Implement conversation memory and production patterns from our notebook . Get support: Join thousands of developers building AI-powered apps with MongoDB. Join the MongoDB Developer Community to learn about MongoDB events, discuss relevant topics, and meet other community members from around the world. Visit the MongoDB AI Learning Hub to learn more about how MongoDB can support your AI use case. Get implementation support through the MongoDB support portal. The complete implementation demonstrating these text-to-MQL patterns is available in our companion notebook, which includes both agent architectures with conversation memory and production-grade debugging capabilities specifically designed for natural language database interfaces.

June 30, 2025
Developer Blog

Dynamic Term-Based Boosting in MongoDB Atlas Search

Search relevance is the bedrock of any modern user experience. While MongoDB Atlas Search offers a fantastic out-of-the-box relevance model with BM25, its standard approach treats all search terms with a uniform level of importance. For applications that demand precision, this isn't enough. What if you need to boost content from an expert author? Or prioritize a trending topic for the next 48 hours? Or ensure a specific promotional product always appears at the top? Relying on query-time boosting alone can lead to complex, brittle queries that are a nightmare to maintain. There's a more elegant solution. Enter the embedded scoring pattern—an advanced technique in Atlas Search that allows you to embed term-level boosting logic directly within your documents. It's a powerful way to make your relevance scoring data-driven, adaptable, and incredibly precise without ever changing your query structure. Why you need embedded scoring: From uniform to granular The standard approach to boosting is like using a single volume knob for an entire orchestra. The embedded scoring pattern, on the other hand, gives you a mixing board with a dedicated slider for every single instrument. This enables application owners to seamlessly build business-focused use cases, such as: Prioritizing authority: Elevate content from verified experts or high-authority authors. Boosting trends: Dynamically increase the rank of time-sensitive or trending topics. Elevating promotions: Ensure seasonal or promotional products get the visibility they need. By encoding scoring logic alongside your content, you solve the "one-size-fits-all" limitation and give yourself unparalleled control. Under the hood: Building the embedded scoring pattern Let's get practical. Implementing this pattern involves two key steps: designing the index and structuring your documents. 1. The index design: Defining your boosts First, you need to tell Atlas Search how to understand your custom boosts. You do this by defining a field with the embeddedDocuments type in your search index. This creates a dedicated space for your term-boost pairs. { "mappings": { "dynamic": true, "fields": { "indexed_terms": { "type": "embeddedDocuments", "dynamic": false, "fields": { "term": { "type": "string" }, "boost": { "type": "number" } } } } } } This index definition creates a special array, indexed_terms , ready to hold our custom scoring rules. 2. The document structure: Encoding relevance With the index in place, you can now add the indexed_terms array to your documents. Each object in this array contains a term and a corresponding boost value. Consider this sample document: { "id": "content_12345", "title": "Advanced Machine Learning Techniques for Natural Language Processing", "description": "Comprehensive guide covering transformer models and neural networks", "tags": ["technology", "AI", "tutorial"], "author": "Dr. Sarah Chen", "indexed_terms": [ { "term": "machine learning", "boost": 25.0 }, // High boost for the primary topic { "term": "dr. sarah chen", "boost": 20.0 }, // High boost for an expert author { "term": "tutorial", "boost": 8.0 } // Lower boost for the content format ] } As you can see, we've assigned a high score to the core topic ("machine learning") and the expert author, ensuring this document ranks highly for those queries. The query: Putting embedded scores into action Now for the magic. The query below uses the compound operator to combine our new embedded scoring with traditional field-based search. 
[ { "$search": { "index": "default", "compound": { "should": [ { // Clause 1: Use our embedded scores "embeddedDocument": { "path": "indexed_terms", "operator": { "text": { "path": "indexed_terms.term", "query": "machine learning", "score": { // Use the boost value from the document! "function": { "path": { "value": "indexed_terms.boost", "undefined": 0.0 } } } } } } }, { // Clause 2: Standard search across other fields "text": { "path": ["title", "description"], "query": "machine learning", // Add a small constant score for matches in these fields "score": { "constant": { "value": 5 } } } } ] }, "scoreDetails": true } }, { "$project": { "_id": 0, "id": 1, "title": 1, "author": 1, "relevanceScore": { "$meta": "searchScore" }, "scoreDetails": { "$meta": "searchScoreDetails" } } } ] In this query, a user searches for "machine learning". If our sample document is part of the index, the final score is a combination of our boosts: 25 points from the indexed_terms match. 5 points from the match in the title field. Total Score: 30 This gives us precise, predictable, and highly tunable ranking behavior. Aggregation strategies You can even control how multiple matches within the indexed_terms array contribute to the score. The three main strategies are:

| Strategy | Use case |
| --- | --- |
| maximum | Highlights the single most relevant term that matched. |
| sum | Accumulates the score across all matching terms. |
| mean | Normalizes the score by averaging the boost of all matching terms. |

Power comes with responsibility: Performance considerations While powerful, this pattern requires foresight and planning. Embedding terms increases your index size. If 1 million documents each get five embedded terms, your index now has to manage 6 million entries. To keep things snappy and scalable, follow these best practices: Be selective: Only embed high-impact terms. Don't use it for your entire vocabulary. Quantize boosts: Use discrete boost levels (e.g., 5, 10, 15, 20) instead of hyper-specific decimals. This improves caching and consistency. Perform regular cleanup: Create processes to remove obsolete or low-performing terms from the indexed_terms arrays. Always monitor your index size, query latency, and memory usage in the Atlas UI to ensure your implementation remains performant. Take control of your search destiny The embedded scoring pattern in MongoDB Atlas Search is a game-changer for anyone serious about search relevance. It moves beyond static, one-size-fits-all ranking and gives you dynamic, context-aware control directly within your data. You can use this pattern to implement business-driven ranking logic, enable real-time personalization, and achieve full transparency for tuning and debugging your search scores. While this article gives you a powerful head start, your journey into advanced relevance doesn't end here. For more in-depth implementation examples, guidance on operational analytics, and best practices to ensure your embedded boost values stay aligned with business goals, we highly recommend diving into the official MongoDB Atlas Search documentation. It's the perfect resource for taking this pattern from concept to production. Stop letting your search engine make all the decisions. Try the embedded scoring pattern today and unlock a new level of precision and power in Atlas Search.
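To experiment with the pattern from application code, here is a minimal PyMongo sketch that runs a pipeline like the one above. The database, collection, and index names are assumptions; swap in the ones from your own Atlas project.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
content = client["media"]["articles"]  # assumed collection indexed with the mapping shown earlier

def search_with_embedded_boosts(query_text: str, limit: int = 10):
    """Combine document-embedded term boosts with a standard text clause, as described above."""
    pipeline = [
        {"$search": {
            "index": "default",
            "compound": {"should": [
                {"embeddedDocument": {
                    "path": "indexed_terms",
                    "operator": {"text": {
                        "path": "indexed_terms.term",
                        "query": query_text,
                        "score": {"function": {"path": {"value": "indexed_terms.boost", "undefined": 0.0}}},
                    }},
                }},
                {"text": {
                    "path": ["title", "description"],
                    "query": query_text,
                    "score": {"constant": {"value": 5}},
                }},
            ]},
        }},
        {"$limit": limit},
        {"$project": {"_id": 0, "id": 1, "title": 1, "relevanceScore": {"$meta": "searchScore"}}},
    ]
    return list(content.aggregate(pipeline))

for doc in search_with_embedded_boosts("machine learning"):
    print(doc.get("relevanceScore"), doc.get("title"))
```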

June 24, 2025
Developer Blog

Build AI Memory Systems with MongoDB Atlas, AWS and Claude

When working with conversational AI, most developers fall into a familiar trap: They treat memory as simple storage—write data in, read data out. But human memory doesn't work this way. Our brains actively evaluate information importance, strengthen connections through repetition, and let irrelevant details fade over time. This disconnect creates AI systems that either remember too much (overwhelming users with irrelevant details) or too little (forgetting critical context). The stakes are significant: Without sophisticated memory management, AI assistants can't provide truly personalized experiences, maintain consistent personalities, or build meaningful relationships with users. The application we're exploring represents a paradigm shift—treating AI memory not as a database problem but as a cognitive architecture challenge. This transforms AI memory from passive storage into an active, evolving knowledge network. A truly intelligent cognitive memory isn't one that never forgets, but one that forgets with intention and remembers with purpose. Imagine an AI assistant that doesn't just store information but builds a living, adaptive memory system that carefully evaluates, reinforces, and connects knowledge just like a human brain. This isn't science fiction—it's achievable today by combining MongoDB Atlas Vector Search with AWS Bedrock and Anthropic's Claude. You'll move from struggling with fragmented AI memory systems to building sophisticated knowledge networks that evolve organically, prioritize important information, and recall relevant context exactly when needed. The cognitive architecture of AI memory At its simplest, our memory system mimics three core aspects of human memory: Importance-weighted storage: Not all memories are equally valuable. Reinforcement through repetition: Important concepts strengthen over time. Contextual retrieval: Memories are recalled based on relevance to current context. This approach differs fundamentally from traditional conversation storage:

| Traditional conversation storage | Cognitive memory architecture |
| --- | --- |
| Flat history retention | Hierarchical knowledge graph |
| Equal weighting of all information | Importance-based prioritization |
| Keyword or vector-only search | Hybrid semantic & keyword retrieval |
| Fixed memory lifetime | Dynamic reinforcement & decay |
| Isolated conversation fragments | Connected knowledge network |

The practical implication is an AI that "thinks" before remembering—evaluating what information to prioritize, how to connect it with existing knowledge, and when to let less important details fade. Let's build a minimum viable implementation of this cognitive memory architecture using MongoDB Atlas, AWS Bedrock, and Anthropic's Claude. Our focus will be on creating the fundamental components that make this system work. Service architecture The following service architecture defines the foundational components and their interactions that power the cognitive memory system. Figure 1. AI memory service architecture. Built on AWS infrastructure, this comprehensive architecture connects user interactions with sophisticated memory management processes. The User Interface (Client application) serves as the entry point where humans interact with the system, sending messages and receiving AI responses enriched with conversation summaries and relevant contextual memories.
At the centre sits the AI Memory Service, the critical processing hub that coordinates information flow, processes messages, and manages memory operations across the entire system. MongoDB Atlas provides a scalable, secure, multi-cloud database foundation. The system processes data through the following key functions: Bedrock Titan Embeddings for converting text to vector representations. Memory Reinforcement for strengthening important information. Relevance-based Retrieval for finding contextually appropriate memories. Anthropic’s Claude LLM handles the importance assessment to evaluate long-term storage value, memory merging for efficient information organization, and conversation summary generation. This architecture ultimately enables AI systems to maintain contextual awareness across conversations, providing more natural, consistent, and personalized interactions over time. Database structure The database structure organizes information storage with specialized collections and indexes that enable efficient semantic retrieval and importance-based memory management. Figure 2. Example of a database structure. The database design strategically separates raw conversation data from processed memory nodes to optimize performance and functionality. The Conversations Collection maintains chronological records of all interactions, preserving the complete historical context, while the Memory Nodes Collection stores higher-level semantic information with importance ratings that facilitate cognitive prioritization. Vector Search Indexes enable efficient semantic similarity searches with O(log n) performance, allowing the system to rapidly identify contextually relevant information regardless of database size. To manage storage growth automatically, TTL(Time-To-Live) Indexes expire older conversations based on configurable retention policies. Finally, Importance and User ID indexes optimize retrieval patterns critical to the system's function, ensuring that high-priority information and user-specific context can be accessed with minimal latency. Memory node structure The Memory node structure defines the data schemas that combine content with cognitive metadata to enable human-like memory operations. Figure 3. The memory node structure. Each node includes an importance score that enables memory prioritization similar to human memory processes, allowing the system to focus on what matters most. The structure tracks access count, which facilitates reinforcement learning by recording how frequently memories are retrieved. A critical feature is the summary field, providing quick semantic access without processing the full content, significantly improving efficiency. Vector embeddings within each node enable powerful semantic search capabilities that mirror human associative thought, connecting related concepts across the knowledge base. Complementing this, the ConversationMessage structure preserves raw conversational context without interpretation, maintaining the original exchange integrity. Both structures incorporate vector embeddings as a unifying feature, enabling sophisticated semantic operations that allow the system to navigate information based on meaning rather than just keywords, creating a more human-like cognitive architecture. 
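A minimal sketch of how these collections and indexes might be created with PyMongo is shown below. The database name, TTL window, index names, and embedding dimension are assumptions to adapt; the vector index requires an Atlas cluster and a recent PyMongo driver.

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING, DESCENDING
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
db = client["ai_memory"]  # assumed database name

conversations = db["conversations"]
memory_nodes = db["memory_nodes"]

# TTL index: expire raw conversation messages after a retention window (here 30 days).
conversations.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

# Retrieval-pattern indexes described above: per-user lookups ordered by importance.
memory_nodes.create_index([("user_id", ASCENDING), ("importance", DESCENDING)])

# Atlas Vector Search index on the embedding field for semantic retrieval.
memory_nodes.create_search_index(SearchIndexModel(
    name="memory_vector_index",
    type="vectorSearch",
    definition={"fields": [{
        "type": "vector",
        "path": "embedding",
        "numDimensions": 1024,   # match your embedding model's output size (assumption)
        "similarity": "cosine",
    }]},
))

# Example memory node following the structure described above.
memory_nodes.insert_one({
    "user_id": "user_123",
    "summary": "Prefers concise answers; working on a FastAPI project.",
    "content": "...full memory content...",
    "importance": 7,            # 1-10 score assigned by the LLM evaluator
    "access_count": 1,          # incremented on retrieval to support reinforcement
    "embedding": [0.0] * 1024,  # in practice, a vector from Bedrock Titan Embeddings
    "created_at": datetime.now(timezone.utc),
})
```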
Memory creation process

The memory creation process transforms conversational exchanges into structured memory nodes through a cognitive pipeline that mimics human memory formation, thoughtfully evaluating new information against existing knowledge rather than indiscriminately storing everything.

Figure 4. The memory creation process.

Through repetition, memories are strengthened via reinforcement, similar to human cognitive processes. At its core, the LLM functions as an "importance evaluator" that assigns each memory a value on a 1-10 scale, reflecting how humans naturally prioritize information based on relevance, uniqueness, and utility. This importance rating directly affects a memory's persistence, recall probability, and survival during pruning operations. As the system evolves, memory merging simulates the human brain's ability to consolidate related concepts over time, while importance updating reflects how new discoveries change our perception of existing knowledge. The framework's pruning mechanism mirrors our natural forgetting of less significant information. Rather than simply accumulating data, this dynamic system creates an evolving memory architecture that continuously refines itself through processes remarkably similar to human cognition.

Memory retrieval process

The memory retrieval process leverages multiple search methodologies that optimize both recall and precision to find and contextualize relevant information across conversations and memory nodes.

Figure 5. The memory retrieval process.

When initiated, the system converts user queries into vector embeddings while simultaneously executing parallel operations to enhance performance. The core of this system is its hybrid search methodology, which combines vector-based semantic understanding with traditional text-based keyword search, allowing it to capture both conceptual similarities and exact term matches. The process directly searches memory nodes and applies different weighting algorithms to combine scores from the various search methods, producing a comprehensive relevance ranking. After identifying relevant memories, the system fetches surrounding conversation context to ensure retrieved information maintains appropriate background, followed by generating concise summaries that distill essential insights. A key innovation is the effective importance calculation, which dynamically adjusts memory significance based on access patterns and other usage metrics. The final step involves building a comprehensive response package that integrates the original memories, their summaries, relevance scores, and contextual information, providing users with a complete understanding of retrieved information without requiring exhaustive reading of all content. This multi-faceted approach ensures that memory retrieval is both comprehensive and precisely tailored to user needs.

Code execution flowchart

The code execution flowchart provides a comprehensive mapping of how API requests navigate through the system architecture, illuminating the runtime path from initial client interaction to final response delivery.

Figure 6. The code execution flowchart.

When a request enters the system, it first encounters the FastAPI endpoint, which serves as the primary entry point for all client communications. From there, specialized API route handlers direct the request to appropriate processing functions based on its type and intent.
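Before following the request any further, it helps to make the importance-evaluator step from the memory creation process concrete. The following is a minimal sketch using the Amazon Bedrock Converse API; the prompt wording, model ID, and clamping logic are assumptions, not the reference implementation.

    import boto3

    # Hedged sketch of the LLM "importance evaluator" described above.
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def assess_importance(message: str) -> int:
        """Ask Claude for a single 1-10 importance score for a user message."""
        prompt = (
            "Rate the long-term importance of remembering the following user message "
            "on a scale of 1 to 10. Respond with a single integer only.\n\n"
            f"Message: {message}"
        )
        response = bedrock.converse(
            modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model ID
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 5, "temperature": 0.0},
        )
        text = response["output"]["message"]["content"][0]["text"].strip()
        return max(1, min(10, int(text)))  # clamp to the 1-10 scale

    print(assess_importance("I prefer to be contacted by email."))

Constraining the model to a numeric-only reply is what makes the score easy to parse and store on the memory node.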
Returning to the request flow: during processing, the system creates and stores message objects in the database, ensuring a permanent record of all conversation interactions. For human-generated messages meeting specific significance criteria, a parallel memory creation branch activates, analyzing the content for long-term storage. This selective approach preserves only meaningful information while reducing storage overhead. The system then processes queries through embedding generation, transforming natural language into vector representations that enable semantic understanding. One of the most sophisticated aspects is the implementation of parallel search functions that simultaneously execute different retrieval strategies, dramatically improving response times while maintaining comprehensive result quality. These searches connect to MongoDB Atlas to perform complex database operations against the stored knowledge base. Retrieved information undergoes context enrichment and summary generation, where the AWS Bedrock (Anthropic's Claude) LLM augments raw data with contextual understanding and concise overviews of relevant conversation history. Finally, the response combination module assembles diverse data components—semantic matches, text-based results, contextual information, and generated summaries—into a coherent, tailored response that addresses the original request. The system's behavior can be fine-tuned through configurable parameters that govern memory processing, AI model selection, database structure, and service operations, allowing for optimization without code modifications.

Memory updating process

The memory updating process dynamically adjusts memory importance through sophisticated reinforcement and decay mechanisms that mimic human cognitive functions.

Figure 7. The memory updating process.

When new information arrives, the system first retrieves all existing user memories from the database, then methodically calculates similarity scores between this new content and each stored memory. Memories exceeding a predetermined similarity threshold are identified as conceptually related and undergo importance reinforcement and access count incrementation, strengthening their position in the memory hierarchy. Simultaneously, unrelated memories experience gradual decay as their importance values diminish over time, creating a naturally evolving memory landscape. This balanced approach prevents memory saturation by ensuring that frequently accessed topics remain prominent while less relevant information gracefully fades. The system maintains a comprehensive usage history through access counts, which informs more effective importance calculations and provides valuable metadata for memory management. All these adjustments are persistently stored in MongoDB Atlas, ensuring continuity across user sessions and maintaining a dynamic memory ecosystem that evolves with each interaction.

Client integration flow

The following diagram illustrates the complete interaction sequence between client applications and the memory system, from message processing to memory retrieval. This flow encompasses two primary pathways:

Message sending flow: When a client sends a message, it triggers a sophisticated processing chain where the API routes it to the Conversation Service, which generates embeddings via AWS Bedrock.
After storing the message in MongoDB Atlas, the Memory Service evaluates it for potential memory creation, performing importance assessment and summary generation before creating or updating a memory node in the database. The flow culminates with a confirmation response returning to the client. Check out the code reference on GitHub.

Memory retrieval flow: During retrieval, the client's request initiates parallel search operations where query embeddings are generated simultaneously across conversation history and memory nodes. These dual search paths—conversation search and memory node search—produce results that are intelligently combined and summarized to provide contextual understanding. The client ultimately receives a comprehensive memory package containing all relevant information. Check out the code reference on GitHub.

Figure 8. The client integration flow.

The architecture deliberately separates conversation storage from memory processing, with MongoDB Atlas serving as the central persistence layer. Each component maintains clear responsibilities and interfaces, ensuring that despite complex internal processing, clients receive unified, coherent responses.

Action plan: Bringing your AI memory system to life

To implement your own AI memory system:

Start with the core components: MongoDB Atlas, AWS Bedrock, and Anthropic's Claude.
Focus on cognitive functions: Importance assessment, memory reinforcement, relevance-based retrieval, and memory merging.
Tune parameters iteratively: Start with the defaults provided, then adjust based on your application's needs.
Measure the right metrics: Track uniqueness of memories, retrieval precision, and user satisfaction—not just storage efficiency.

To evaluate your implementation, ask these questions:

Does your system effectively prioritize truly important information?
Can it recall relevant context without excessive prompting?
Does it naturally form connections between related concepts?
Can users perceive the system's improving memory over time?

Real-world applications and insights

Case study: From repetitive Q&A to evolving knowledge

A customer service AI using traditional approaches typically needs to relearn user preferences repeatedly. With our cognitive memory architecture:

First interaction: User mentions they prefer email communication. The system stores this with moderate importance.
Second interaction: User confirms email preference. The system reinforces this memory, increasing its importance.
Future interactions: The system consistently recalls the email preference without asking again, but might still verify after long periods due to natural decay.

The result? A major reduction in repetitive questions, leading to a significantly better user experience.

Benefits

Applications implementing this approach achieved unexpected benefits:

Emergent knowledge graphs: Over time, the system naturally forms conceptual clusters of related information.
Insight mining: Analysis of high-importance memories across users reveals shared concerns and interests not obvious from raw conversation data.
Reduced compute costs: Despite the sophisticated architecture, the selective nature of memory storage reduces overall embedding and storage costs compared to retaining full conversation histories.

Limitations

When implementing this system, teams typically face three key challenges:

Configuration tuning: Finding the right balance of importance thresholds, decay rates, and reinforcement factors requires experimentation.
Prompt engineering: Getting consistent, numeric importance ratings from LLMs requires careful prompt design. Our implementation uses clear constraints and numeric-only output requirements.
Memory sizing: Determining the optimal memory depth per user depends on the application context. Too shallow and the AI seems forgetful; too deep and it becomes sluggish.

Future directions

The landscape for AI memory systems is evolving rapidly. Here are key developments on the horizon:

Short-term developments

Emotion-aware memory: Extending importance evaluation to include emotional salience, remembering experiences that evoke strong reactions.
Temporal awareness: Adding time-based decay that varies by information type (factual vs. preferential).
Multi-modal memory: Incorporating image and voice embeddings alongside text for unified memory systems.

Long-term possibilities

Self-supervised memory optimization: Systems that learn optimal importance ratings, decay rates, and memory structures based on user satisfaction.
Causal memory networks: Moving beyond associative memory to create causal models of user intent and preferences.
Privacy-preserving memory: Implementing differential privacy and selective forgetting capabilities to respect user privacy boundaries.

This approach to AI memory is still evolving. The future of AI isn't just about more parameters or faster inference—it's about creating systems that learn and remember more like humans do. With the cognitive memory architecture we've explored, you're well on your way to building AI that remembers what matters.

Transform your AI applications with cognitive memory capabilities today. Get started with MongoDB Atlas for free and implement vector search in minutes. For hands-on guidance, explore our GitHub repository containing complete implementation code and examples.

June 18, 2025
Developer Blog

Scaling Vector Search with MongoDB Atlas Quantization & Voyage AI Embeddings

Key takeaways

Vector quantization fundamentals: A technique that compresses high-dimensional embeddings from 32-bit floats to lower-precision formats (scalar/int8 or binary/1-bit), enabling significant performance gains while maintaining semantic search capabilities.
Performance vs. precision trade-offs: Binary quantization provides maximum speed (80% faster queries) with minimal resources; scalar quantization offers balanced performance and accuracy; float32 maintains the highest fidelity at significant resource cost.
Resource optimization: Vector quantization can reduce RAM usage by up to 24x (binary) or 3.75x (scalar); the storage footprint decreases by 38% using the BSON binary format.
Scaling benefits: Performance advantages multiply at scale and are most significant for vector databases exceeding 1M embeddings.
Semantic preservation: Quantization-aware models like Voyage AI's retain high representation capacity even after compression.
Search quality control: Binary quantization may require rescoring for maximum accuracy; scalar quantization typically maintains 90%+ retention of float32 results.
Implementation ease: MongoDB's automatic quantization requires minimal code changes to leverage quantization techniques.

As vector databases scale into the millions of embeddings, the computational and memory requirements of high-dimensional vector operations become critical bottlenecks in production AI systems. Without effective scaling strategies, organizations face:

Infrastructure costs that grow exponentially with data volume
Unacceptable query latency that degrades user experience and limits real-time applications
Limited deployment options, particularly on edge devices or in resource-constrained environments
Diminished competitive advantage as AI capabilities become limited by technical constraints and bottlenecks rather than use-case innovation

This technical guide demonstrates advanced techniques for optimizing vector search operations through precision-controlled quantization—transforming resource-intensive 32-bit float embeddings into performance-optimized representations while preserving semantic fidelity. By leveraging MongoDB Atlas Vector Search's automatic quantization capabilities with Voyage AI's quantization-aware embedding models, we'll implement systematic optimization strategies that dramatically reduce both computational overhead and memory footprint. This guide provides an empirical analysis of the critical performance metrics:

Retrieval latency benchmarking: Quantitative comparison of search performance across binary, scalar, and float32 precision levels, with controlled evaluation of HNSW (hierarchical navigable small world) graph exploration parameters and k-retrieval variations.
Representational capacity retention: Precise measurement of semantic information preservation through direct comparison of quantized vector search results against full-fidelity retrieval, with particular attention to retention curves across varying retrieval depths.

We'll present implementation strategies and evaluation methodologies for vector quantization that simultaneously optimize for both computational efficiency and semantic fidelity—enabling you to make evidence-based architectural decisions for production-scale AI retrieval systems handling millions of embeddings.
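Before diving into the implementation, the short sketch below illustrates what scalar and binary quantization do to a single 1024-dimensional float32 vector. It is purely conceptual; MongoDB Atlas applies its own quantization automatically at index time, so this is not its internal algorithm.

    import numpy as np

    # Conceptual illustration only, not MongoDB Atlas's internal quantization.
    embedding = np.random.uniform(-1, 1, 1024).astype(np.float32)  # a 1024-dim float32 vector

    # Scalar quantization: map each float into the 8-bit integer range.
    lo, hi = embedding.min(), embedding.max()
    scalar_q = np.round((embedding - lo) / (hi - lo) * 255 - 128).astype(np.int8)

    # Binary quantization: keep only the sign of each dimension, packed to 1 bit per value.
    binary_q = np.packbits(embedding > 0)

    print(embedding.nbytes, scalar_q.nbytes, binary_q.nbytes)  # 4096 vs. 1024 vs. 128 bytes

The per-vector footprint drops roughly 4x for int8 and 32x for binary, which is where the memory savings discussed throughout this guide come from.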
The techniques demonstrated here are directly applicable to enterprise-grade RAG architectures, recommendation engines, and semantic search applications where millisecond-level latency improvements and dramatic RAM reduction translate to significant infrastructure cost savings. The full end-to-end implementation for automatic vector quantization and the other operations involved in RAG/agent pipelines can be found in our GitHub repository.

Auto-quantization of Voyage AI embeddings with MongoDB

Our approach addresses the complete optimization cycle for vector search operations, covering:

Generating embeddings with quantization-aware models
Implementing automatic vector quantization in MongoDB Atlas
Creating and configuring specialized vector search indices
Measuring and comparing latency across different quantization strategies
Quantifying representational capacity retention
Analyzing performance trade-offs between binary, scalar, and float32 implementations
Making evidence-based architectural decisions for production AI retrieval systems

Figure 1. Vector quantization architecture with MongoDB Atlas and Voyage AI.

Using text data as an example, we convert documents into numerical vector embeddings that capture semantic relationships. MongoDB then indexes and stores these embeddings for efficient similarity searches. By comparing queries run against float32, int8, and binary embeddings, you can gauge the trade-offs between precision and performance and better understand which quantization strategy best suits large-scale, high-throughput workloads.

One key takeaway from this article is that representational capacity retention is highly dependent on the embedding model used. With quantization-aware models like Voyage AI's voyage-3-large at appropriate dimensionality (1024 dimensions), our tests demonstrate that we can achieve 95%+ recall retention at reasonable numCandidates values. This means organizations can significantly reduce memory and computational requirements while preserving semantic search quality, provided they select embedding models specifically designed to maintain their representation capacity after quantization. For more information on why vector quantization is crucial for AI workloads, refer to this blog post.

Dataset information

Our quantization evaluation framework leverages two complementary datasets designed specifically to benchmark semantic search performance across different precision levels.

Primary dataset (Wikipedia-22-12-en-voyage-embed): Contains approximately 300,000 Wikipedia article fragments with pre-generated 1024-dimensional embeddings from Voyage AI's voyage-3-large model. This dataset serves as a diverse vector corpus for testing vector quantization effects in semantic search. Throughout this tutorial, we'll use the primary dataset to demonstrate the technical implementation of quantization.

Embedding generation with Voyage AI

For generating new embeddings for AI search applications, we use Voyage AI's voyage-3-large model, which is specifically designed to be quantization-aware. The voyage-3-large model generates 1024-dimensional vectors and has been specifically trained to maintain semantic properties even after quantization, making it ideal for our AI retrieval optimization strategy. For more information on how MongoDB and Voyage AI work together for optimal retrieval, see our previous article, Rethinking Information Retrieval with MongoDB and Voyage AI.
import voyageai

# Initialize the Voyage AI client
client = voyageai.Client()

def get_embedding(text, task_prefix="document"):
    """
    Generate embeddings using the voyage-3-large model for AI Retrieval.

    Parameters:
        text (str): The input text to be embedded.
        task_prefix (str): A prefix describing the task; this is prepended to the text.

    Returns:
        list: The embedding vector (1024 dimensions).
    """
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Call the Voyage API to generate the embedding
    result = client.embed([text], model="voyage-3-large", input_type=task_prefix)

    # Return the first embedding from the result
    return result.embeddings[0]

Converting embeddings to BSON BinData format

A critical optimization step is converting embeddings to MongoDB's BSON BinData format, which significantly reduces storage and memory requirements. The BinData vector format provides significant advantages:

Reduces disk space by approximately 3x compared to arrays
Enables more efficient indexing with alternate types (int8, binary)
Reduces RAM usage by 3.75x for scalar and 24x for binary quantization

from bson.binary import Binary, BinaryVectorDtype

def generate_bson_vector(array, data_type):
    return Binary.from_vector(array, BinaryVectorDtype(data_type))

# Convert embeddings to BSON BinData vector format
wikipedia_data_df["embedding"] = wikipedia_data_df["embedding"].apply(
    lambda x: generate_bson_vector(x, BinaryVectorDtype.FLOAT32)
)

Vector index creation with different quantization strategies

The cornerstone of our performance optimization framework lies in creating specialized vector indices with different quantization strategies. This process leverages MongoDB both as a general-purpose database and, more specifically, as a high-performance vector database capable of efficiently handling million-scale embedding collections. This implementation step sets up MongoDB's vector search capabilities with automatic quantization, focusing on two primary quantization strategies: scalar (int8) and binary. Three indices are created to measure and evaluate the retrieval latency and recall performance of the various precision data types, including the full-fidelity vector representation.

MongoDB Atlas Vector Search uses HNSW, a graph-based indexing algorithm that organizes vectors in a hierarchical structure of layers. In this structure, vector data points within a layer are contextually similar, while higher layers are sparse compared to lower layers, which are denser and contain more vector data points.

The code snippet below showcases the implementation of the two quantization strategies in parallel; this enables the systematic evaluation of the latency, memory usage, and representational capacity trade-offs across the precision spectrum, enabling data-driven decisions about the optimal approach for specific application requirements. MongoDB Atlas automatic quantization is activated entirely through the vector index definition. By including the "quantization" attribute and setting its value to either "scalar" or "binary", you enable automatic compression of your embeddings at index creation time. This declarative approach means no separate preprocessing of vectors is required—MongoDB handles the compression transparently while maintaining the original embeddings for potential rescoring operations.
from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """Setup a vector search index with the specified configuration"""
    ...

# 1. Scalar Quantized Index (int8)
vector_index_definition_scalar_quantized = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "quantization": "scalar",  # Uses int8 quantization
            "numDimensions": 1024,
            "similarity": "cosine",
        }
    ]
}

# 2. Binary Quantized Index (1-bit)
vector_index_definition_binary_quantized = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "quantization": "binary",  # Uses binary (1-bit) quantization
            "numDimensions": 1024,
            "similarity": "cosine",
        }
    ]
}

# 3. Float32 ANN Index (no quantization)
vector_index_definition_float32_ann = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1024,
            "similarity": "cosine",
        }
    ]
}

# Create the indices
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_scalar_quantized,
    "vector_index_scalar_quantized",
)
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_binary_quantized,
    "vector_index_binary_quantized",
)
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_float32_ann,
    "vector_index_float32_ann",
)

Implementing vector search functionality

Vector search serves as the computational foundation of modern generative AI systems. While LLMs provide reasoning and generation capabilities, vector search delivers the contextual knowledge necessary for grounding these capabilities in relevant information. This semantic retrieval operation forms the backbone of RAG architectures that power enterprise-grade AI applications, such as knowledge-intensive chatbots and domain-specific assistants. In more advanced implementations, vector search enables agentic RAG systems where autonomous agents dynamically determine what information to retrieve, when to retrieve it, and how to incorporate it into complex reasoning chains. The implementation below provides the technical overview that transforms raw embedding vectors into intelligent search components that move beyond lexical matching to true semantic understanding.

Our implementation below supports both approximate nearest neighbor (ANN) search and exact nearest neighbor (ENN) search through the use_full_precision parameter:

Approximate nearest neighbor (ANN) search: When use_full_precision=False, the system performs an approximate search using:

The specified quantized index (binary or scalar)
The HNSW graph navigation algorithm
A controlled exploration breadth via numCandidates

This approach sacrifices perfect accuracy for dramatic performance gains, particularly at scale. The HNSW algorithm enables sub-linear time complexity by intelligently sampling the vector space, making it possible to search billions of vectors in milliseconds instead of seconds. When combined with quantization, ANN delivers order-of-magnitude improvements in both speed and memory efficiency.

Exact nearest neighbor (ENN) search: When use_full_precision=True, the system performs an exact search using:

The original float32 embeddings (regardless of the index specified)
An exhaustive comparison approach
The exact=True directive to bypass approximation techniques

ENN guarantees finding the mathematically optimal nearest neighbors by computing distances between the query vector and every single vector in the database.
This brute-force approach provides perfect recall but scales linearly with collection size, becoming prohibitively expensive as vector counts increase beyond millions. We include both search modes for several critical reasons:

Establishing ground truth: ENN provides the "perfect" baseline against which we measure the quality degradation of approximation techniques. The representational retention metrics discussed later directly compare ANN results against this ENN ground truth.
Varying application requirements: Not all AI applications prioritize the same metrics. Time-sensitive applications (real-time customer service) might favor ANN's speed, while high-stakes applications (legal document analysis) might require ENN's accuracy.

def custom_vector_search(
    user_query,
    collection,
    embedding_path,
    vector_search_index_name="vector_index",
    top_k=5,
    num_candidates=25,
    use_full_precision=False,
):
    """
    Perform vector search with configurable precision and parameters
    for AI Search applications.
    """
    # Generate embedding for the query
    query_embedding = get_embedding(user_query, task_prefix="query")

    # Define the vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_search_index_name,
            "queryVector": query_embedding,
            "path": embedding_path,
            "limit": top_k,
        }
    }

    # Configure search precision approach
    if not use_full_precision:
        # For approximate nearest neighbor (ANN) search
        vector_search_stage["$vectorSearch"]["numCandidates"] = num_candidates
    else:
        # For exact nearest neighbor (ENN) search
        vector_search_stage["$vectorSearch"]["exact"] = True

    # Project only needed fields
    project_stage = {
        "$project": {
            "_id": 0,
            "title": 1,
            "text": 1,
            "wiki_id": 1,
            "url": 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    }

    # Build and execute the pipeline
    pipeline = [vector_search_stage, project_stage]
    ...

    # Execute the query
    results = list(collection.aggregate(pipeline))

    return {"results": results, "execution_time_ms": execution_time_ms}

Measuring the retrieval latency of various quantized vectors

In production AI retrieval systems, query latency directly impacts user experience, operational costs, and system throughput capacity. Vector search operations typically constitute the primary performance bottleneck in RAG architectures, making latency optimization a critical engineering priority. Sub-100ms response times are often necessary for interactive and mission-critical applications, while batch processing systems may tolerate higher latencies but require consistent predictability for resource planning.

Our latency measurement methodology employs a systematic, parameterized approach that models real-world query patterns while isolating the performance characteristics of different quantization strategies. This parameterized benchmarking enables us to:

Construct detailed latency profiles across varying retrieval depths
Identify performance inflection points where quantization benefits become significant
Map the scaling curves of different precision levels as the data volume increases
Determine optimal configuration parameters for specific throughput targets

def measure_latency_with_varying_topk(
    user_query,
    collection,
    vector_search_index_name,
    use_full_precision=False,
    top_k_values=[5, 10, 50, 100],
    num_candidates_values=[25, 50, 100, 200, 500, 1000, 2000],
):
    """
    Measure search latency across different configurations.
""" results_data = [] for top_k in top_k_values: for num_candidates in num_candidates_values: # Skip invalid configurations if num_candidates < top_k: continue # Get precision type from index name precision_name = vector_search_index_name.split("vector_index")[1] precision_name = precision_name.replace("quantized", "").capitalize() if use_full_precision: precision_name = "_float32_ENN" # Perform search and measure latency vector_search_results = custom_vector_search( user_query=user_query, collection=collection, embedding_path="embedding", vector_search_index_name=vector_search_index_name, top_k=top_k, num_candidates=num_candidates, use_full_precision=use_full_precision, ) latency_ms = vector_search_results["execution_time_ms"] # Store results results_data.append({ "precision": precision_name, "top_k": top_k, "num_candidates": num_candidates, "latency_ms": latency_ms, }) print(f"Top-K: {top_k}, NumCandidates: {num_candidates}, " f"Latency: {latency_ms} ms, Precision: {precision_name}") return results_data Latency results analysis Our systematic benchmarking reveals dramatic performance differences between quantization strategies across different retrieval scenarios. The visualizations below capture these differences for top-k=10 and top-k=100 configurations. Figure 2. Search latency vs the number candidates for top-k=10 Figure 3. Search latency vs the number of candidates for top-k=100. Several critical patterns emerge from these latency profiles: Quantization delivers exponential performance gains: The float32_ENN approach (purple line) demonstrates latency measurements an order of magnitude higher than any quantized approach. At top-k=10, ENN latency starts at ~1600ms and never drops below 500ms, while quantized approaches maintain sub-100ms performance until extremely high candidate counts. This performance gap widens further as data volume scales. Scalar quantization offers the best performance profile: Somewhat surprisingly, scalar quantization (orange line) consistently outperforms both binary quantization and float32 ANN across most configurations. This is particularly evident at higher num_candidates values, where scalar quantization maintains near-flat latency scaling. This suggests scalar quantization achieves an optimal balance in the memory-computation trade-off for HNSW traversal. Binary quantization shows linear latency scaling: While binary quantization (red line) starts with excellent performance, its latency increases more steeply as num_candidates grows, eventually exceeding scalar quantization at very high exploration depths. This suggests that while binary vectors require less memory, their distance computation savings are partially offset by the need for more complex traversal patterns in the HNSW graph and rescoring. All quantization methods maintain interactive-grade performance: Even with 10,000 candidate explorations and top-k=100, all quantized approaches maintain sub-200ms latency, well within interactive application requirements. This demonstrates that quantization enables order-of-magnitude increases in exploration depth without sacrificing user experience, allowing for dramatic recall improvements while maintaining acceptable latency. These empirical results validate our theoretical understanding of quantization benefits and provide concrete guidance for production deployment: scalar quantization offers the best general-purpose performance profile, while binary quantization excels in memory-constrained environments with moderate exploration requirements. 
In the images below, we employ logarithmic scaling for both axes in our latency analysis because search performance data typically spans multiple orders of magnitude. When comparing different precision types (scalar, binary, float32_ann) across varying numbers of candidates, the latency values can range from milliseconds to seconds, while candidate counts may vary from hundreds to millions. Linear plots would compress smaller values and make it difficult to observe performance trends across the full range (as we see above). Logarithmic scaling transforms exponential relationships into linear ones, making it easier to identify proportional changes, compare relative performance improvements, and detect patterns that would otherwise be obscured. This visualization approach is particularly valuable for understanding how each precision type scales with increasing workload and for identifying the optimal operating ranges where certain methods outperform others (as shown below).

Figure 4. Search latency vs the number of candidates (log scale) for top-k=10.

Figure 5. Search latency vs the number of candidates (log scale) for top-k=100.

The performance characteristics observed in the logarithmic plots above directly reflect the architectural differences inherent in binary quantization's two-stage retrieval process. Binary quantization employs a coarse-to-fine search strategy: an initial fast retrieval phase using low-precision binary representations, followed by a refinement phase that rescores the top-k candidates using full-precision vectors to restore accuracy. This dual-phase approach creates a fundamental performance trade-off that manifests differently across varying candidate pool sizes. For smaller candidate sets, the computational savings from binary operations during the initial retrieval phase can offset the rescoring overhead, making binary quantization competitive with other methods. However, as the candidate pool expands, the rescoring phase—which must compute full-precision similarity scores for an increasing number of retrieved candidates—begins to dominate the total latency profile.

Measuring representational capacity retention

While latency optimization is critical for operational efficiency, the primary concern for AI applications remains semantic accuracy. Vector quantization introduces a fundamental trade-off: computational efficiency versus representational capacity. Even the most performant quantization approach is useless if it fails to maintain the semantic relationships encoded in the original embeddings. To quantify this critical quality dimension, we developed a systematic methodology for measuring representational capacity retention—the degree to which quantized vectors preserve the same nearest-neighbor relationships as their full-precision counterparts. This approach provides an objective, reproducible framework for evaluating semantic fidelity across different quantization strategies.

def measure_representational_capacity_retention_against_float_enn(
    ground_truth_collection,
    collection,
    quantized_index_name,
    top_k_values,
    num_candidates_values,
    num_queries_to_test=1,
):
    """
    Compare quantized search results against full-precision baseline.

    For each test query:
    1. Perform baseline search with float32 exact search
    2. Perform same search with quantized vectors
    3. Calculate retention as % of baseline results found in quantized results
    """
    retention_results = {"per_query_retention": {}}
    overall_retention = {}

    # Initialize tracking structures
    for top_k in top_k_values:
        overall_retention[top_k] = {}
        for num_candidates in num_candidates_values:
            if num_candidates < top_k:
                continue
            overall_retention[top_k][num_candidates] = []

    # Get precision type
    precision_name = quantized_index_name.split("vector_index")[1]
    precision_name = precision_name.replace("quantized", "").capitalize()

    # Load test queries from ground truth annotations
    ground_truth_annotations = list(
        ground_truth_collection.find().limit(num_queries_to_test)
    )

    # For each annotation, test all its questions
    for annotation in ground_truth_annotations:
        ground_truth_wiki_id = annotation["wiki_id"]
        ...

    # Calculate average retention for each configuration
    avg_overall_retention = {}
    for top_k, cand_dict in overall_retention.items():
        avg_overall_retention[top_k] = {}
        for num_candidates, retentions in cand_dict.items():
            if retentions:
                avg = sum(retentions) / len(retentions)
            else:
                avg = 0
            avg_overall_retention[top_k][num_candidates] = avg

    retention_results["average_retention"] = avg_overall_retention
    return retention_results

Our methodology takes a rigorous approach to retention measurement:

Establishing ground truth: We use float32 exact nearest neighbor (ENN) search as the baseline "perfect" result set, acknowledging that these are the mathematically optimal neighbors.
Controlled comparison: For each query in our annotation dataset, we perform parallel searches using different quantization strategies, carefully controlling for top-k and num_candidates parameters.
Retention calculation: We compute retention as the ratio of overlapping results between the quantized search and the ENN baseline: |quantized_results ∩ baseline_results| / |baseline_results|.
Statistical aggregation: We average retention scores across multiple queries to account for query-specific variations and produce robust, generalizable metrics.

This approach provides a direct, quantitative measure of how much semantic fidelity is preserved after quantization. A retention score of 1.0 indicates that the quantized search returns exactly the same results as the full-precision search, while lower scores indicate divergence.

Representational capacity results analysis

The findings from the representational capacity retention evaluation provide empirical validation that properly implemented quantization—particularly scalar quantization—can maintain semantic fidelity while dramatically reducing computational and memory requirements. Note that in the charts below, the scalar curve exactly matches the float32_ann performance—so much so that the float32_ann line is completely hidden beneath the scalar curve. The near-perfect retention of scalar quantization should alleviate concerns about quality degradation, while binary quantization's retention profile suggests it's suitable for applications with higher performance demands that can tolerate slight quality trade-offs or compensate with increased exploration depth.

Figure 6. Retention score vs the number of candidates for top-k=10.

Figure 7. Retention score vs the number of candidates for top-k=50.

Figure 8. Retention score vs the number of candidates for top-k=100.

Scalar quantization achieves near-perfect retention: The scalar quantization approach (orange line) demonstrates extraordinary representational capacity preservation, achieving 98-100% retention across nearly all configurations.
At top-k=10, it reaches perfect 1.0 retention with just 100 candidates, effectively matching full-precision ENN results while using 4x less memory. This remarkable performance validates the effectiveness of int8 quantization when implemented with MongoDB's automatic quantization.

Binary quantization shows a retention-exploration trade-off: Binary quantization (red line) exhibits a clear correlation between exploration depth and retention quality. At top-k=10, it starts at ~91% retention with minimal candidates but improves to 98% at 500 candidates. The effect is more pronounced at higher top-k values (50 and 100), where initial retention drops to ~74% but recovers substantially with increased exploration. This suggests that binary quantization's information loss can be effectively mitigated by exploring more of the vector space.

Retention dynamics change with retrieval depth: As top-k increases from 10 to 100, the retention patterns become more differentiated between quantization strategies. This reflects the increasing challenge of maintaining accurate rankings as more results are requested. While scalar quantization remains relatively stable across different top-k values, binary quantization shows more sensitivity, indicating it's better suited for targeted retrieval scenarios (low top-k) than for broad exploration.

Exploration depth compensates for precision loss: A fascinating pattern emerges across all quantization methods: increased num_candidates consistently improves retention. This demonstrates that reduced precision can be effectively counterbalanced by broader exploration of the vector space. For example, binary quantization at 500 candidates achieves better retention than scalar quantization at 25 candidates, despite using 32x less memory per vector.

Float32 ANN vs. scalar quantization: The float32 ANN approach (blue line) shows virtually identical retention to scalar quantization at higher top-k values, while consuming 4x more memory. This suggests scalar quantization represents an optimal balance point, offering full-precision quality with significantly reduced resource requirements.

Conclusion

This guide has demonstrated the powerful impact of vector quantization in optimizing vector search operations through MongoDB Atlas Vector Search's automatic quantization feature, using Voyage AI embeddings. The findings confirm that properly implemented quantization, particularly scalar quantization, maintains semantic fidelity while dramatically reducing computational and memory requirements:

Binary quantization achieves optimal latency and resource efficiency, particularly valuable for high-scale deployments where speed is critical.
Scalar quantization provides an effective balance between performance and precision, suitable for most production applications.
Float32 maintains maximum accuracy but incurs significant performance and memory costs.

Figure 9. Performance and memory usage metrics for binary quantization, scalar quantization, and float32 implementation.
Based on the figure above, our implementation demonstrated substantial efficiency gains:

Binary Quantized Index achieves the most compact disk footprint at 407.66MB, representing approximately 4KB per document. This compression comes from representing high-dimensional vectors as binary bits, dramatically reducing storage requirements while maintaining retrieval capability.

Float32 ANN Index requires 394.73MB of disk space, slightly less than binary due to optimized index structures, but demands the full storage footprint be loaded into memory for optimal performance.

Scalar Quantized Index shows the largest storage requirement at 492.83MB (approximately 5KB per document), suggesting this method maintains higher precision than binary while still applying compression techniques, resulting in a middle-ground approach between full precision and extreme quantization.

The most striking difference lies in memory requirements. Binary quantization demonstrates a 23:1 memory efficiency ratio, requiring only 16.99MB in RAM versus the 394.73MB needed by float32_ann. Scalar quantization provides a 3:1 memory optimization, requiring 131.42MB compared to float32_ann's full memory footprint.

For production AI retrieval implementations, general guidance is as follows:

Use scalar quantization for general use cases requiring a good balance of speed and accuracy.
Use binary quantization for large-scale applications (1M+ vectors) where speed is critical.
Use float32 only for applications requiring maximum precision, where accuracy is paramount.

Vector quantization becomes particularly valuable for databases exceeding 1M vectors, where it enables significant scalability improvements without compromising retrieval accuracy. When combined with MongoDB Atlas Search Nodes, this approach effectively addresses both cost and performance constraints in advanced vector search applications.

Boost your MongoDB skills today through our Atlas Learning Hub. Head over to our quick start guide to get started with Atlas Vector Search.

June 10, 2025
Developer Blog
