MongoDB Developer Blog
Deep dives into technical concepts, architectures, and innovations with MongoDB.
Matryoshka Embeddings: Smarter Embeddings with Voyage AI
In the realm of AI, embedding models are the bedrock of advanced applications like retrieval augmented generation (RAG), semantic search, and recommendation systems. These models transform unstructured data (text, images, audio) into high-dimensional numerical vectors, allowing us to perform similarity searches and power intelligent features. However, traditional embedding models often generate fixed-size vectors, leading to trade-offs between performance and computational overhead. This post will dive deep into Matryoshka Representation Learning (MRL), a novel approach that creates flexible, multi-fidelity embeddings. We'll compare and contrast MRL with traditional embeddings and quantization, detailing its unique training process and showcasing how Voyage AI's voyage-3-large and the recently released voyage-3.5 models leverage MRL as well as quantization to deliver unparalleled efficiency with MongoDB Atlas Vector Search.

Understanding embedding models

At their core, embedding models learn to represent discrete items (words, sentences, documents) as continuous vectors in a multi-dimensional space. The key principle is that items with similar meanings or characteristics are mapped to points that are close to each other in this vector space. This spatial proximity then allows for efficient similarity comparisons using metrics like cosine similarity. For example, in a semantic search application, when a user queries "best vegan restaurants," the embedding model converts this query into a vector. It then compares this vector against a database of pre-computed embeddings for restaurant descriptions. Restaurants whose embeddings are "nearby" the query embedding are deemed relevant and returned to the user.

Figure 1. Example embedding model. Image Credit: Hugging Face Blog

Challenges with traditional embeddings

Historically, embedding models generate vectors of a fixed size, for example, 768, 1024, or 4096 dimensions. While effective, this fixed-size nature presents challenges:

Inflexibility: A model trained for, say, 768-dimensional embeddings will suffer a significant performance drop if you simply truncate its vectors to a smaller size, like 256 dimensions, without retraining. This means you're locked into a specific dimension size, even if a smaller representation would suffice for certain tasks.
High computational load: Higher-dimensional vectors demand more computational resources for storage, transfer, and similarity calculations. In scenarios with large datasets or real-time inference, this can lead to increased latency and operational costs.
Information loss on truncation: Without specific training, truncating traditional embeddings inevitably leads to substantial information loss, compromising the quality of downstream tasks.

Matryoshka Representation Learning

MRL, introduced by researchers from the University of Washington, Google Research, and Harvard University in 2022, offers an elegant solution to these challenges. Inspired by Russian nesting dolls, MRL trains a single embedding model such that its full-dimensional output can be truncated to various smaller dimensions while still retaining high semantic quality. The magic lies in how the model is trained to ensure that the initial dimensions of the embedding are the most semantically rich, with subsequent dimensions adding progressively finer-grained information. This means you can train a model to produce, say, a 1024-dimensional embedding.
Then, for different use cases or performance requirements, you can simply take the first 256, 512, or any other number of dimensions from that same 1024-dimensional vector. Each truncated vector is still a valid and semantically meaningful representation, just at a different level of detail.

Figure 2. Matryoshka embedding model truncating the output. Image Credit: Hugging Face Blog

Understanding MRL with an analogy

Imagine a movie. A 2048-dimensional MRL embedding might represent the "Full Movie". Truncating it to:
1024 dimensions: Still provides enough information for a "Movie Trailer."
512 dimensions: Gives a "Plot Summary & Movie Details."
256 dimensions: Captures the "Movie Title & Plot One-liner."
This "coarse-to-fine" property ensures that each prefix of the full vector remains semantically rich and usable. You simply keep the first N dimensions from the full vector to truncate it.

Figure 3. Visualizing the Matryoshka doll analogy for MRL.

The unseen hand: How the loss function shapes embedding quality

To truly grasp what makes MRL distinct, we must first understand the pivotal role of the loss function in the training of any embedding model. This mathematical function is the core mechanism that teaches these sophisticated models to understand and represent meaning. During a typical training step, an embedding model processes a batch of input data, producing a set of predicted output vectors. The loss function ("J" in the diagram below) then steps in, comparing these predicted embeddings ("y_pred") against known "ground truth" or expected target values ("y"). It quantifies the discrepancy between what the model predicts and what it should ideally produce, effectively gauging the "error" in its representations. A high loss value signifies a significant deviation: a large "penalty" indicating the model is failing to capture the intended relationships (e.g., placing semantically similar items far apart in the vector space). Conversely, a low loss value indicates accurate capture of these relationships, ensuring that similar concepts (like different images of cats) are mapped close together, while dissimilar ones remain distant.

Figure 4. Training workflow including the loss function.

The iterative training process, guided by an optimizer, continuously adjusts the model's internal weights with the sole aim of minimizing this loss value. This relentless pursuit of a lower loss is precisely how an embedding model learns to generate high-quality, semantically meaningful vectors.

MRL training process

The key differentiator for MRL lies in its training methodology. Unlike traditional embeddings, where a single loss value is computed for the full vector, MRL training involves:
Multiple loss values: Separate loss values are computed for multiple truncated prefixes of the vector (e.g., at 256, 512, 1024, and 2048 dimensions).
Loss averaging: These individual losses are averaged (or summed) to calculate a total loss.
Incentivized information packing: The model is trained to minimize this total loss. This process penalizes even the smallest prefixes if their loss is high, strongly incentivizing the model to pack the most crucial information into the earliest dimensions of the vector.
This results in a model where information is "front-loaded" into early dimensions, ensuring accuracy remains strong even with fewer dimensions, unlike traditional models where accuracy drops significantly upon truncation. Examples of MRL-trained models include voyage-3-large and voyage-3.5.
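To make the truncation step concrete, here is a minimal Python sketch (not from the original post): it cuts a full MRL embedding down to its first N dimensions and re-normalizes it before a cosine comparison. The random vectors are placeholders standing in for real model output such as a voyage-3-large embedding.

```python
import numpy as np

def truncate_embedding(full_vec, dims):
    """Keep the first `dims` values of an MRL embedding and re-normalize
    so cosine similarity stays meaningful at the reduced size."""
    prefix = np.asarray(full_vec[:dims], dtype=np.float32)
    norm = np.linalg.norm(prefix)
    return prefix / norm if norm > 0 else prefix

def cosine_similarity(a, b):
    # Both inputs are already unit-normalized, so the dot product is the cosine.
    return float(np.dot(a, b))

# Placeholder 2048-dim vectors standing in for embeddings from an MRL-trained model.
query_full = np.random.rand(2048).tolist()
doc_full = np.random.rand(2048).tolist()

# Slice both to 256 dimensions and compare, exactly as the post describes.
query_256 = truncate_embedding(query_full, 256)
doc_256 = truncate_embedding(doc_full, 256)
print(cosine_similarity(query_256, doc_256))
```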
MRL vs. quantization

It's important to differentiate MRL from quantization, another common technique for reducing embedding size. While both aim to make embeddings more efficient, their approaches and benefits differ fundamentally. Quantization techniques compress existing high-dimensional embeddings into a more compact form by reducing the precision of the numerical values (e.g., from float32 to int8). The following table describes the precise differences between MRL and quantization.

| Aspect | MRL | Quantization |
| --- | --- | --- |
| Goal | Reduce embedding dimensionality (e.g., 256 out of 2048 dims) | Reduce embedding precision (e.g., int8/binary embeddings instead of fp32) |
| Output type | Float32 vectors of varying lengths | Fixed-length vectors with lower bit representations |
| Training awareness | Uses multi-loss training across dimensions | Often uses quantization-aware training (QAT) |
| Use case | Trade off accuracy vs. compute/memory at inference | Minimize storage and accelerate vector math operations |
| Example (Voyage AI) | voyage-3-large @ 512-dim fp32 | voyage-3-large @ 2048-dim int8 |

Flexibility and efficiency with MRL

The core benefit of MRL is its unparalleled flexibility and efficiency. Instead of being locked into a single, large vector size, you can:
Choose what you need: Generate a full 2048-dimensional vector and then slice it to 256, 512, or 1024 dimensions based on your specific needs.
One vector, multiple fidelities: A single embedding provides multiple levels of detail and accuracy.
Lower compute, bandwidth, and storage: By using smaller vector dimensions, you drastically reduce the computational load for indexing, query processing, and data transfer, as well as the storage footprint in your database.
Efficient computation: The embedding is computed once, and then you simply slice it to the desired dimensions, making it highly efficient.
Voyage AI, in particular, leverages MRL by default across its models, including voyage-3-large and the latest voyage-3.5, enabling scalable embeddings with one model and multiple dimensions. This allows you to dynamically choose between space/latency and quality at query time, leading to efficient retrieval with minimal accuracy loss.

Voyage AI's dual approach: MRL and quantization for ultimate efficiency

Voyage AI models maximize efficiency by combining MRL and quantization. MRL enables flexible embeddings by allowing you to select the optimal vector length—for instance, using 512 instead of 2048 dimensions—resulting in significant reductions in size and computational overhead with minimal accuracy loss. Quantization further compresses these vectors by reducing their bit precision, which cuts storage needs and speeds up similarity search operations. This synergy allows you to choose embeddings tailored to your application's requirements: a voyage-3-large embedding can be used as a compact 512-dimensional floating-point vector (leveraging MRL) or as a full 2048-dimensional 8-bit integer vector (via quantization). The dual approach empowers you to balance accuracy, storage, and performance, ensuring highly efficient, flexible embeddings for your workload. As a result, Voyage AI models deliver faster inferences and help reduce infrastructure costs when powering applications with MongoDB Atlas Vector Search.

Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.
Don’t Just Build Agents, Build Memory-Augmented AI Agents
Insight Breakdown: This piece aims to reveal that regardless of architectural approach—whether Anthropic's multi-agent coordination or Cognition's single-threaded consolidation—sophisticated memory management emerges as the fundamental determinant of agent reliability, believability, and capability. It marks the evolution from stateless AI applications toward truly intelligent, memory-augmented systems that learn and adapt over time.

AI agents are intelligent computational systems that can perceive their environment, make informed decisions, use tools, and, in some cases, maintain persistent memory across interactions—evolving beyond stateless chatbots toward autonomous action. Multi-agent systems coordinate multiple specialized agents to tackle complex tasks, like a research team where different agents handle searching, fact-checking, citations, and research synthesis. Recently, two major players in the AI space released different perspectives on how to build these systems. Anthropic released an insightful piece highlighting their learnings on building multi-agent systems for deep research use cases. Cognition also released a post titled "Don't Build Multi-Agents," which appears to contradict Anthropic's approach directly. Two things stand out:

Both pieces are right

Yes, this sounds contradictory, but working with customers building agents of all scales and sizes in production, we find that both the use case and the application mode, in particular, are key factors to consider when determining how to architect your agent(s). Anthropic's multi-agent approach makes sense for deep research scenarios where sustained, comprehensive analysis across multiple domains over extended periods is required. Cognition's single-agent approach is optimal for conversational agents or coding tasks where consistency and coherent decision-making are paramount. The application mode—whether research assistant, conversational agent, or coding assistant—fundamentally shapes the optimal memory architecture. Anthropic also highlights this point when discussing the downsides of multi-agent architecture:

"For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time." (Anthropic, Building Multi-Agent Research System)

Both pieces are saying the same thing

Memory is the foundational challenge that determines agent reliability, believability, and capability. Anthropic emphasizes sophisticated memory management techniques (compression, external storage, context handoffs) for multi-agent coordination. Cognition emphasizes context engineering and continuous memory flow to prevent the fragmentation that destroys agent reliability. Both teams arrived at the same core insight: agents fail without robust memory management. Anthropic chose to solve memory distribution across multiple agents, while Cognition chose to solve memory consolidation within single agents. The key takeaway from both pieces, for AI engineers or anyone developing an agentic platform, is: don't just build agents, build memory-augmented AI agents.

With that out of the way, the rest of this piece will provide you with the essential insights from both pieces that we think are important, and point to the memory management principles and design patterns we've observed from customers building agents.
The key insights

If you are building your agentic platform from scratch, you can extract much value from Anthropic's approach to building multi-agent systems, particularly their sophisticated memory management principles, which are essential for effective agentic systems. Their implementation reveals critical design considerations, including techniques to overcome context window limitations through compression, function calling, and storage functions that enable sustained reasoning across extended multi-agent interactions—foundational elements that any serious agentic platform must address from the architecture phase.

Key insights:
Agents are overthinkers
Multi-agent systems trade efficiency for capability
Systematic agent observation reveals failure patterns
Context windows remain insufficient for extended sessions
Context compression enables distributed memory management

Let's go a bit deeper into how these insights translate into practical implementation strategies.

Agents are overthinkers

Anthropic researchers mention using explicit guidelines to steer agents into allocating the right amount of resources (tool calls, sub-agent creation, etc.); otherwise, they tend to overengineer solutions. Without proper constraints, the agents would spawn excessive subagents for simple queries, conduct endless searches for nonexistent information, and apply complex multi-step processes to tasks requiring straightforward responses.

Explicit guidance for agent behavior isn't entirely new—system prompts and instructions are typical parameters in most agent frameworks. However, the key insight here goes deeper than traditional prompting approaches. When agents are given access to resources such as data, tools, and the ability to create sub-agents, there needs to be explicit, unambiguous direction on how these resources are expected to be leveraged to address specific tasks. This goes beyond system prompts and instructions into resource allocation guidance, operational constraints, and decision-making boundaries that prevent agents from overengineering solutions or misusing available capabilities.

Take, for example, the OpenAI Agents SDK, which exposes several parameters for describing the behavior of resources to the agent, such as handoff_description, which specifies how a subagent should be leveraged in a multi-agent system built with the SDK, and tool_use_behavior, which, as the name suggests, describes to the agent how a tool should be used. A minimal sketch appears at the end of this section.

The key takeaway for AI engineers is that multi-agent system implementation requires an extensive thinking process about which tools the agents are expected to leverage, which subagents belong in the system, and how resource utilization is communicated to the calling agent. When implementing resource allocation constraints for your agents, also consider that the traditional approach of managing multiple specialized databases (vector DB for embeddings, graph DB for relationships, relational DB for structured data) compounds the complexity problem and introduces tech stack sprawl, an anti-pattern to rapid AI innovation.
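As an illustration of that kind of resource guidance, the sketch below uses the OpenAI Agents SDK's handoff_description and tool_use_behavior parameters mentioned above. The agent names, tool, and instructions are invented for illustration, and exact parameter behavior can vary across SDK versions, so treat this as a sketch rather than a reference implementation.

```python
from agents import Agent, Runner, function_tool

@function_tool
def search_docs(query: str) -> str:
    """Search internal documentation (placeholder implementation)."""
    return f"Top result for: {query}"

# Sub-agent: handoff_description tells the coordinator *when* delegation is appropriate.
research_agent = Agent(
    name="Research agent",
    handoff_description="Use only for questions that require searching documentation.",
    instructions="Answer using at most two search calls; do not spawn further work.",
    tools=[search_docs],
    tool_use_behavior="stop_on_first_tool",  # return the tool result directly, no extra LLM pass
)

# Coordinator with explicit resource-allocation guidance in its instructions.
coordinator = Agent(
    name="Coordinator",
    instructions=(
        "Answer simple questions yourself. Hand off to the research agent "
        "only when a documentation lookup is genuinely required."
    ),
    handoffs=[research_agent],
)

result = Runner.run_sync(coordinator, "What is our retention policy for audit logs?")
print(result.final_output)
```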
Multi-agent systems trade efficiency for capability

While multi-agent architectures can utilize more tokens and parallel processing for complex tasks, Anthropic found operational costs significantly higher due to coordination overhead, context management, and the computational expense of maintaining a coherent state across multiple agents. In some cases, two heads are better than one, but they are also expensive within multi-agent systems.

One thing we note here is that the use case in Anthropic's multi-agent system is deep research. This use case requires extensive exploration of resources, including heavily worded research papers, sites, and documentation, to accumulate enough information to formulate the result (typically a 2,000+ word essay on the user's starting prompt). In other use cases, such as automated workflows with agents representing processes within the workflow, there might not be as much token consumption, especially if the process encapsulates deterministic steps such as database read and write operations and its output is execution results in the form of sentences or short summaries.

The coordination overhead challenge becomes particularly acute when agents need to share state across different storage systems. Rather than managing complex data synchronization between specialized databases, MongoDB's native ACID compliance ensures that multi-agent handoffs maintain data integrity without external coordination mechanisms. This unified approach reduces both the computational overhead of distributed state management and the engineering complexity of maintaining consistency across multiple storage systems.

Context compression enables distributed memory management

Beyond reducing inference costs, compression techniques allow multi-agent systems to maintain shared context across distributed agents. Anthropic's approach involves summarizing completed work phases and storing essential information in external memory before agents transition to new tasks. This, coupled with the insight that context windows remain insufficient for extended sessions, points to the fact that prompt compression or compaction techniques are still relevant and useful in a world where LLMs have extensive context windows. Even with a 200K-token (approximately 150,000 words) capacity, Anthropic's agents in multi-round conversations require sophisticated context management strategies, including compression, external memory offloading, and spawning fresh agents when limits are reached. We previously partnered with Andrew Ng and DeepLearning.AI on a course covering prompt compression techniques and retrieval-augmented generation (RAG) optimization. A minimal sketch of this compaction pattern follows.
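The following sketch illustrates the compaction idea under a few assumptions: a MongoDB collection acts as the external memory, a placeholder summarize() function stands in for whatever LLM summarization call you use, and the token heuristic, budget, and names are illustrative rather than values from Anthropic's system.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
external_memory = client["agent_db"]["external_memory"]

TOKEN_BUDGET = 150_000   # illustrative budget, below the model's hard limit
KEEP_RECENT = 10         # always keep the most recent turns verbatim

def approx_tokens(messages):
    # Rough heuristic: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact_context(agent_id, messages, summarize):
    """If the running context exceeds the budget, summarize the oldest turns,
    persist the summary as external memory, and keep only the recent turns."""
    if approx_tokens(messages) < TOKEN_BUDGET:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(old)  # placeholder: any LLM summarization call
    external_memory.insert_one(
        {"agent_id": agent_id, "type": "phase_summary", "summary": summary}
    )
    return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + recent
```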
Systematic agent observation reveals failure patterns

Systematic agent observation represents one of Anthropic's most practical insights. Essentially, rather than relying on guesswork (or vibes), the team built detailed simulations using identical production prompts and tools and then systematically observed step-by-step execution to identify specific failure modes. This phase in an agentic system carries an extensive operational cost. From our perspective, working with customers building agents in production, this methodology addresses a critical gap most teams face: understanding how your agents actually behave versus how you think they should behave. Anthropic's approach immediately revealed concrete failure patterns that many of us have encountered but struggled to diagnose systematically. Their observations uncovered agents overthinking simple tasks, as we mentioned earlier, using verbose search queries that reduced effectiveness, and selecting inappropriate tools for specific contexts. As they note in their piece: "This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent."

The key insight here is moving beyond trial-and-error prompt engineering toward purposeful debugging. Instead of making assumptions about what should work, Anthropic demonstrates the value of systematic behavioral observation to identify the root causes of poor performance. This enables targeted prompt improvements based on actual evidence rather than intuition. We find that gathering, tracking, and storing agent process memory serves a dual critical purpose: not only is it vital for agent context and task performance, but it also provides engineers with the essential data needed to evolve and maintain agentic systems over time. Agent memory and behavioral logging remain the most reliable method for understanding system behavior patterns, debugging failures, and optimizing performance, regardless of whether you implement a single comprehensive agent or a system of specialized subagents collaborating to solve problems. MongoDB's flexible document model naturally accommodates the diverse logging requirements for both operational memory and engineering observability within a single, queryable system.

One key piece that would be interesting to know from the Anthropic research team is what evaluation metrics they use. We've spoken extensively about evaluating LLMs in RAG pipelines, but what new agentic system evaluation metrics are developers working toward? We are answering these questions ourselves and have partnered with Galileo, a key player in the AI stack whose focus is purely on evaluating RAG and agentic applications and making these systems reliable for production. Our learnings will be shared in this upcoming webinar, taking place on July 17, 2025. However, for anyone building agentic systems, this represents a shift in development methodology—building agents requires building the infrastructure to understand them, and sandbox environments might become a key component of the evaluation and observability stack for agents.

Advanced implementation patterns

Beyond the aforementioned core insights, Anthropic's research reveals several advanced patterns worth examining. The Anthropic piece hints at the implementation of advanced retrieval mechanisms that go beyond vector-based similarity between query vectors and stored information. Their multi-agent architecture enables sub-agents to call tools (an approach also seen in MemGPT) to store their work in external systems, then pass lightweight references—presumably unique identifiers of summarized memory components—back to the coordinator. We generally emphasize the importance of a multi-model retrieval approach to our customers and developers, where hybrid approaches combine multiple retrieval methods—using vector search to understand intent while simultaneously performing text search for specific product details. MongoDB's native support for vector similarity search and traditional indexing within a single system eliminates the need for complex reference management across multiple databases, simplifying the coordination mechanisms that Anthropic's multi-agent architecture requires. A sketch of the reference-passing pattern follows.
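Here is a minimal sketch of that reference-passing pattern with pymongo: a sub-agent writes its full work product to a shared collection and hands only the inserted _id back to the coordinator, which dereferences it when detail is actually needed. Collection and field names are illustrative and the connection string is a placeholder.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
artifacts = client["agent_db"]["subagent_artifacts"]

def store_subagent_result(task_id, full_output, summary):
    """Persist the sub-agent's full work product; return a lightweight reference."""
    inserted = artifacts.insert_one({
        "task_id": task_id,
        "summary": summary,          # small enough to sit in the coordinator's context
        "full_output": full_output,  # fetched only if the coordinator needs the detail
    })
    return inserted.inserted_id

def load_subagent_result(ref_id):
    return artifacts.find_one({"_id": ref_id})

# Coordinator keeps the reference and the short summary, not the full output.
ref = store_subagent_result("research-001", "...long findings...", "Key facts about topic X")
print(load_subagent_result(ref)["summary"])
```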
The Anthropic team implements continuity in the agent execution process by establishing clear boundaries between task completion and summarizing the current phase before moving to the next task. This creates a scalable system where memory constraints don't bottleneck the research process, allowing for truly deep and comprehensive analysis that spans beyond what any single context window could accommodate.

In a multi-agent pipeline, each sub-agent produces partial results—intermediate summaries, tool outputs, extracted facts—and then hands them off into a shared "memory" database. Downstream agents then read those entries, append their analyses, and write updated records back. Because these handoffs happen in parallel, you must ensure that one agent's commit doesn't overwrite another's work and that a reader doesn't pick up a half-written summary. Without atomic transactions and isolation guarantees, you risk:

Lost updates, where two agents load the same document, independently modify it, and then write back, silently discarding one agent's changes.
Dirty or non-repeatable reads, where an agent reads another's uncommitted or rolled-back write, leading to decisions based on phantom data.

Coordinating these handoffs purely in application code would force you to build locking layers or distributed consensus, quickly becoming a brittle, error-prone web of external orchestrators. Instead, you want your database to provide those guarantees natively, so that each read-modify-write cycle appears to execute in isolation and either fully succeeds or fully rolls back. MongoDB's ACID compliance becomes crucial here, ensuring that these boundary transitions maintain data integrity across multi-agent operations without requiring external coordination mechanisms that could introduce failure points. The sketch below shows what such a transactional handoff can look like.
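A minimal pymongo sketch of such a handoff, assuming an Atlas cluster or replica set (multi-document transactions require one); the collection and field names are illustrative and the connection string is a placeholder.

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
shared_memory = client["agent_db"]["shared_memory"]

def append_analysis(doc_id, agent_name, analysis):
    """Read-modify-write a shared memory record inside a transaction so a
    concurrent agent's update is not silently lost."""
    def txn(session):
        record = shared_memory.find_one({"_id": doc_id}, session=session)
        analyses = record.get("analyses", []) + [{"agent": agent_name, "analysis": analysis}]
        shared_memory.update_one(
            {"_id": doc_id}, {"$set": {"analyses": analyses}}, session=session
        )

    with client.start_session() as session:
        # with_transaction retries automatically on transient transaction errors.
        session.with_transaction(
            txn,
            read_concern=ReadConcern("snapshot"),
            write_concern=WriteConcern("majority"),
        )
```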
Application mode is crucial when discussing memory implementation. In Anthropic's case, the application functions as a research assistant, while in other implementations, like Cognition's approach, the application mode is conversational. This distinction significantly influences how agents operate and manage memory based on their specific application contexts. Through our internal work and customer engagements, we extend this insight to suggest that application mode affects not only agent architecture choices but also the distinct memory types used in the architecture.

AI agents need augmented memory

Anthropic's research makes one thing abundantly clear: context window is not all you need. This extends to the key point that memory and agent engineering are two sides of the same coin. Reliable, believable, and truly capable agents depend on robust, persistent memory systems that can store, retrieve, and update knowledge over long, complex workflows. As the AI ecosystem continues to innovate on memory mechanisms, mastering sophisticated context and memory management approaches will be the key differentiator for the next generation of successful agentic applications. Looking ahead, we see "memory engineering" or "memory management" emerging as a key specialization within AI engineering, focused on building the foundational infrastructure that lets agents remember, reason, and collaborate at scale.

For hands-on guidance on memory management, check out our webinar on YouTube, which covers essential concepts and proven techniques for building memory-augmented agents. Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.

Real-Time Threat Detection With MongoDB & PuppyGraph
Security operations teams face an increasingly complex environment. Cloud-native applications, identity sprawl, and continuous infrastructure changes generate a flood of logs and events. From API calls in AWS to lateral movement between virtual machines, the volume of telemetry is enormous—and it's growing. The challenge isn't just scale. It's structure. Traditional security tooling often looks at events in isolation, relying on static rules or dashboards to highlight anomalies. But real attacks unfold as chains of related actions: A user assumes a role, launches a resource, accesses data, and then pivots again. These relationships are hard to capture with flat queries or disconnected logs.

That's where graph analytics comes in. By modeling your data as a network of users, sessions, identities, and events, you can trace how threats emerge and evolve. And with PuppyGraph, you don't need a separate graph database or batch pipelines to get there. In this post, we'll show how to combine MongoDB and PuppyGraph to analyze AWS CloudTrail data as a graph—without moving or duplicating data. You'll see how to uncover privilege escalation chains, map user behavior across sessions, and detect suspicious access patterns in real time.

Why MongoDB for cybersecurity data

MongoDB is a popular choice for managing security telemetry. Its document-based model is ideal for ingesting unstructured and semi-structured logs like those generated by AWS CloudTrail, GuardDuty, or Kubernetes audit logs. Events are stored as flexible JSON documents, which evolve naturally as logging formats change. This flexibility matters in security, where schemas can shift as providers update APIs or teams add new context to events. MongoDB handles these changes without breaking pipelines or requiring schema migrations. It also supports high-throughput ingestion and horizontal scaling, making it well-suited for operational telemetry. Many security products and SIEM backends already support MongoDB as a destination for real-time event streams. That makes it a natural foundation for graph-based security analytics: The data is already there—rich, semi-structured, and continuously updated.

Why graph analytics for threat detection

Modern security incidents rarely unfold as isolated events. Attackers don't just trip a single rule—they navigate through systems, identities, and resources, often blending in with legitimate activity. Understanding these behaviors means connecting the dots across multiple entities and actions. That's precisely what graph analytics excels at. By modeling users, sessions, events, and assets as interconnected nodes and edges, analysts can trace how activity flows through a system. This structure makes it easy to ask questions that involve multiple hops or indirect relationships—something traditional queries often struggle to express.

For example, imagine you're investigating activity tied to a specific AWS account. You might start by counting how many sessions are associated with that account. Then, you might break those sessions down by whether they were authenticated using MFA. If some weren't, the next question becomes: What resources were accessed during those unauthenticated sessions? This kind of multi-step investigation is where graph queries shine. Instead of scanning raw logs or filtering one table at a time, you can traverse the entire path from account to identity to session to event to resource, all in a single query.
You can also group results by attributes like resource type to identify which services were most affected. And when needed, you can go beyond metrics and pivot to visualization, mapping out full access paths to see how a specific user or session interacted with sensitive infrastructure. This helps surface lateral movement, track privilege escalation, and uncover patterns that static alerts might miss. Graph analytics doesn't replace your existing detection rules; it complements them by revealing the structure behind security activity. It turns complex event relationships into something you can query directly, explore interactively, and act on with confidence.

Query MongoDB data as a graph without ETL

MongoDB is a popular choice for storing security event data, especially when working with logs that don't always follow a fixed structure. Services like AWS CloudTrail produce large volumes of JSON-based records with fields that can differ across events. MongoDB's flexible schema makes it easy to ingest and query that data as it evolves. PuppyGraph builds on this foundation by introducing graph analytics—without requiring any data movement. Through the MongoDB Atlas SQL Interface, PuppyGraph can connect directly to your collections and treat them as relational tables. From there, you define a graph model by mapping key fields into nodes and relationships.

Figure 1. Architecture of the integration of MongoDB and PuppyGraph.

This makes it possible to explore questions that involve multiple entities and steps, such as tracing how a session relates to an identity or which resources were accessed without MFA. The graph itself is virtual. There's no ETL process or data duplication. Queries run in real time against the data already stored in MongoDB. While PuppyGraph works with tabular structures exposed through the SQL interface, many security logs already follow a relatively flat pattern: consistent fields like account IDs, event names, timestamps, and resource types. That makes it straightforward to build graphs that reflect how accounts, sessions, events, and resources are linked. By layering graph capabilities on top of MongoDB, teams can ask more connected questions of their security data, without changing their storage strategy or duplicating infrastructure.

Investigating CloudTrail activity using graph queries

To demonstrate how graph analytics can enhance security investigations, we'll explore a real-world dataset of AWS CloudTrail logs. This dataset originates from flaws.cloud, a security training environment developed by Scott Piper. The dataset comprises anonymized CloudTrail logs collected over 3.5 years, capturing a wide range of simulated attack scenarios within a controlled AWS environment. It includes over 1.9 million events, featuring interactions from thousands of unique IP addresses and user agents. The logs encompass various AWS API calls, providing a comprehensive view of potential security events and misconfigurations. For our demonstration, we imported a subset of approximately 100,000 events into MongoDB Atlas. By importing this dataset into MongoDB Atlas and applying PuppyGraph's graph analytics capabilities, we can model and analyze complex relationships between accounts, identities, sessions, events, and resources.

Demo

Let's walk through the demo step by step! We have provided all the materials for this demo on GitHub. Please download the materials or clone the repository directly.
If you're new to integrating MongoDB Atlas with PuppyGraph, we recommend starting with the MongoDB Atlas + PuppyGraph Quickstart Demo to get familiar with the setup and core concepts.

Prerequisites
A MongoDB Atlas account (free tier is sufficient)
Docker
Python 3

Set up MongoDB Atlas
Follow the MongoDB Atlas Getting Started guide to:
Create a new cluster (free tier is fine).
Add a database user.
Configure IP access.
Note your connection string for the MongoDB Python driver (you'll need it shortly).

Download and import CloudTrail logs
Run the following commands to fetch and prepare the dataset:

wget https://summitroute.com/downloads/flaws_cloudtrail_logs.tar
mkdir -p ./raw_data
tar -xvf flaws_cloudtrail_logs.tar --strip-components=1 -C ./raw_data
gunzip ./raw_data/*.json.gz

Create a virtual environment and install dependencies:

# On some Linux distributions, install `python3-venv` first.
sudo apt-get update
sudo apt-get install python3-venv

# Create a virtual environment, activate it, and install the necessary packages
python -m venv venv
source venv/bin/activate
pip install ijson faker pandas pymongo

Import the first chunk of CloudTrail data (replace the connection string with your Atlas URI):

export MONGODB_CONNECTION_STRING="your_mongodb_connection_string"
python import_data.py raw_data/flaws_cloudtrail00.json --database cloudtrail

This creates a new cloudtrail database and loads the first chunk of data containing 100,000 structured events.

Enable Atlas SQL interface and get JDBC URI
To enable graph access:
Create an Atlas SQL Federated Database instance.
Ensure the schema is available (generate from sample, if needed).
Copy the JDBC URI from the Atlas SQL interface.
See PuppyGraph's guide for setting up MongoDB Atlas SQL.

Start PuppyGraph and upload the graph schema
Start the PuppyGraph container:

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -d --name puppy --rm --pull=always puppygraph/puppygraph:stable

Log in to the web UI at http://localhost:8081 with:
Username: puppygraph
Password: puppygraph123

Upload the schema:
Open schema.json.
Fill in your JDBC URI, username, and password.
Upload via the Upload Graph Schema JSON section or run:

curl -XPOST -H "content-type: application/json" \
  --data-binary @./schema.json \
  --user "puppygraph:puppygraph123" localhost:8081/schema

Wait for the schema to upload and initialize (approximately five minutes).

Figure 2. A graph visualization of the schema, which models the graph from relational data.

Run graph queries to investigate security activity
Once the graph is live, open the Query panel in PuppyGraph's UI. Let's say we want to investigate the activity of a specific account. First, we count the number of sessions associated with the account.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN count(s)

Gremlin:

g.V("Account[811596193553]")
 .out("HasIdentity").out("HasSession").count()

Figure 3. Graph query in the PuppyGraph UI.

Then, we want to see how many of these sessions are MFA-authenticated or not.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN s.mfa_authenticated AS mfaStatus, count(s) AS count

Gremlin:

g.V("Account[811596193553]")
 .out("HasIdentity").out("HasSession")
 .groupCount().by("mfa_authenticated")

Figure 4. Graph query results in the PuppyGraph UI.
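The same traversals can also be run programmatically instead of through the UI. The sketch below uses the gremlinpython driver and assumes the container's Gremlin endpoint on port 8182 accepts the demo credentials shown above; check PuppyGraph's documentation for the exact connection settings before relying on it.

```python
# pip install gremlinpython
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Port 8182 is the Gremlin endpoint exposed by the container started above.
conn = DriverRemoteConnection(
    "ws://localhost:8182/gremlin", "g",
    username="puppygraph", password="puppygraph123",
)
g = traversal().withRemote(conn)

# Same question as Figure 4: sessions per MFA status for one account.
mfa_counts = (
    g.V("Account[811596193553]")
     .out("HasIdentity").out("HasSession")
     .groupCount().by("mfa_authenticated")
     .next()
)
print(mfa_counts)
conn.close()
```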
Next, we investigate those sessions that are not MFA-authenticated and see what resources they accessed.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session {mfa_authenticated: false})-[:RecordsEvent]->(e:Event)-[:OperatesOn]->(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN r.resource_type AS resourceType, count(r) AS count

Gremlin:

g.V("Account[811596193553]").out("HasIdentity")
 .out("HasSession")
 .has("mfa_authenticated", false)
 .out('RecordsEvent').out('OperatesOn')
 .groupCount().by("resource_type")

Figure 5. PuppyGraph UI showing results that are not MFA-authenticated.

We can also show those access paths as a graph.

Cypher:

MATCH path = (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session {mfa_authenticated: false})-[:RecordsEvent]->(e:Event)-[:OperatesOn]->(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN path

Gremlin:

g.V("Account[811596193553]").out("HasIdentity").out("HasSession").has("mfa_authenticated", false)
 .out('RecordsEvent').out('OperatesOn')
 .path()

Figure 6. Graph visualization in PuppyGraph UI.

Tear down the environment
When you're done:

docker stop puppy

Your MongoDB data will persist in Atlas, so you can revisit or expand the graph model at any time.

Conclusion

Security data is rich with relationships between users, sessions, resources, and actions. Modeling these connections explicitly makes it easier to understand what's happening in your environment, especially when investigating incidents or searching for hidden risks. By combining MongoDB Atlas and PuppyGraph, teams can analyze those relationships in real time without moving data or maintaining a separate graph database. MongoDB provides the flexibility and scalability to store complex, evolving security logs like AWS CloudTrail, while PuppyGraph adds a native graph layer for exploring that data as connected paths and patterns.

In this post, we walked through how to import real-world audit logs, define a graph schema, and investigate access activity using graph queries. With just a few steps, you can transform a log collection into an interactive graph that reveals how activity flows across your cloud infrastructure. If you're working with security data and want to explore graph analytics on MongoDB Atlas, try PuppyGraph's free Developer Edition. It lets you query connected data, such as users, sessions, events, and resources, all without ETL or infrastructure changes.
Dynamic Term-Based Boosting in MongoDB Atlas Search
Search relevance is the bedrock of any modern user experience. While MongoDB Atlas Search offers a fantastic out-of-the-box relevance model with BM25, its standard approach treats all search terms with a uniform level of importance. For applications that demand precision, this isn't enough. What if you need to boost content from an expert author? Or prioritize a trending topic for the next 48 hours? Or ensure a specific promotional product always appears at the top? Relying on query-time boosting alone can lead to complex, brittle queries that are a nightmare to maintain.

There's a more elegant solution. Enter the embedded scoring pattern—an advanced technique in Atlas Search that allows you to embed term-level boosting logic directly within your documents. It's a powerful way to make your relevance scoring data-driven, adaptable, and incredibly precise without ever changing your query structure.

Why you need embedded scoring: From uniform to granular

The standard approach to boosting is like using a single volume knob for an entire orchestra. The embedded scoring pattern, on the other hand, gives you a mixing board with a dedicated slider for every single instrument. This enables application owners to seamlessly build business-focused use cases, such as:
Prioritizing authority: Elevate content from verified experts or high-authority authors.
Boosting trends: Dynamically increase the rank of time-sensitive or trending topics.
Elevating promotions: Ensure seasonal or promotional products get the visibility they need.
By encoding scoring logic alongside your content, you solve the "one-size-fits-all" limitation and give yourself unparalleled control.

Under the hood: Building the embedded scoring pattern

Let's get practical. Implementing this pattern involves two key steps: designing the index and structuring your documents.

1. The index design: Defining your boosts

First, you need to tell Atlas Search how to understand your custom boosts. You do this by defining a field with the embeddedDocuments type in your search index. This creates a dedicated space for your term-boost pairs.

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "indexed_terms": {
        "type": "embeddedDocuments",
        "dynamic": false,
        "fields": {
          "term": { "type": "string" },
          "boost": { "type": "number" }
        }
      }
    }
  }
}

This index definition creates a special array, indexed_terms, ready to hold our custom scoring rules.

2. The document structure: Encoding relevance

With the index in place, you can now add the indexed_terms array to your documents. Each object in this array contains a term and a corresponding boost value. Consider this sample document:

{
  "id": "content_12345",
  "title": "Advanced Machine Learning Techniques for Natural Language Processing",
  "description": "Comprehensive guide covering transformer models and neural networks",
  "tags": ["technology", "AI", "tutorial"],
  "author": "Dr. Sarah Chen",
  "indexed_terms": [
    { "term": "machine learning", "boost": 25.0 },  // High boost for the primary topic
    { "term": "dr. sarah chen", "boost": 20.0 },    // High boost for an expert author
    { "term": "tutorial", "boost": 8.0 }            // Lower boost for the content format
  ]
}

As you can see, we've assigned a high score to the core topic ("machine learning") and the expert author, ensuring this document ranks highly for those queries.
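Because the boosts live in the documents themselves, adjusting relevance is just a normal database write. The sketch below shows one way to set or refresh a document's indexed_terms with pymongo; the database, collection, and helper names are illustrative and the connection string is a placeholder. Atlas Search picks up the change as the index refreshes.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
content = client["media"]["articles"]  # illustrative database and collection names

def set_term_boosts(content_id, boosts):
    """Write term-level boosts into the document; no query changes required."""
    indexed_terms = [{"term": t.lower(), "boost": float(b)} for t, b in boosts.items()]
    content.update_one({"id": content_id}, {"$set": {"indexed_terms": indexed_terms}})

# Promote the primary topic and the expert author for the sample document.
set_term_boosts("content_12345", {
    "machine learning": 25.0,
    "dr. sarah chen": 20.0,
    "tutorial": 8.0,
})
```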
[ { "$search": { "index": "default", "compound": { "should": [ { // Clause 1: Use our embedded scores "embeddedDocument": { "path": "indexed_terms", "operator": { "text": { "path": "indexed_terms.term", "query": "machine learning", "score": { // Use the boost value from the document! "function": { "path": { "value": "indexed_terms.boost", "undefined": 0.0 } } } } } } }, { // Clause 2: Standard search across other fields "text": { "path": ["title", "description"], "query": "machine learning", // Add a small constant score for matches in these fields "score": { "constant": { "value": 5 } } } } ] }, "scoreDetails": true } }, { "$project": { "_id": 0, "id": 1, "title": 1, "author": 1, "relevanceScore": { "$meta": "searchScore" }, "scoreDetails": { "$meta": "searchScoreDetails" } } } ] In this query, a user searches for "machine learning". If our sample document is part of the index, the final score is a combination of our boosts: 25 points from the indexed_terms match. 5 points from the match in the title field. Total Score: 30 This gives us precise, predictable, and highly tunable ranking behavior. Aggregation strategies You can even control how multiple matches within the indexed_terms array contribute to the score. The three main strategies are: table, th, td { border: 1px solid black; border-collapse: collapse; } th, td { padding: 5px; } Strategy Use Case maximum Highlights the single most relevant term that matched. sum Accumulates the score across all matching terms. mean Normalizes the score by averaging the boost of all matching terms. Power comes with responsibility: Performance considerations While powerful, this pattern requires foresight and planning. Embedding terms increases your index size. If 1 million documents each get five embedded terms, your index now has to manage 6 million entries. To keep things snappy and scalable, follow these best practices: Be selective: Only embed high-impact terms. Don't use it for your entire vocabulary. Quantize boosts: Use discrete boost levels (e.g., 5, 10, 15, 20) instead of hyper-specific decimals. This improves caching and consistency. Perform regular cleanup: Create processes to remove obsolete or low-performing terms from the indexed_terms arrays. Always monitor your index size, query latency, and memory usage in the Atlas UI to ensure your implementation remains performant. Take control of your search destiny The embedded scoring pattern in MongoDB Atlas Search is a game-changer for anyone serious about search relevance. It moves beyond static, one-size-fits-all ranking and gives you dynamic, context-aware control directly within your data. You can use this pattern to implement business-driven ranking logic, enable real-time personalization, and achieve full transparency for tuning and debugging your search scores. While this article gives you a powerful head start, your journey into advanced relevance doesn't end here. For more in-depth implementation examples, guidance on operational analytics, and best practices to ensure your embedded boost values stay aligned with business goals, we highly recommend diving into the official MongoDB Atlas Search documentation . It's the perfect resource for taking this pattern from concept to production. Stop letting your search engine make all the decisions. Try the embedded scoring pattern today and unlock a new level of precision and power in Atlas Search.
Build AI Memory Systems with MongoDB Atlas, AWS and Claude
When working with conversational AI, most developers fall into a familiar trap: They treat memory as simple storage—write data in, read data out. But human memory doesn't work this way. Our brains actively evaluate information importance, strengthen connections through repetition, and let irrelevant details fade over time. This disconnect creates AI systems that either remember too much (overwhelming users with irrelevant details) or too little (forgetting critical context). The stakes are significant: Without sophisticated memory management, AI assistants can't provide truly personalized experiences, maintain consistent personalities, or build meaningful relationships with users.

The application we're exploring represents a paradigm shift—treating AI memory not as a database problem but as a cognitive architecture challenge. This transforms AI memory from passive storage into an active, evolving knowledge network. A truly intelligent cognitive memory isn't one that never forgets, but one that forgets with intention and remembers with purpose. Imagine an AI assistant that doesn't just store information but builds a living, adaptive memory system that carefully evaluates, reinforces, and connects knowledge just like a human brain. This isn't science fiction—it's achievable today by combining MongoDB Atlas Vector Search with AWS Bedrock and Anthropic's Claude. You'll move from struggling with fragmented AI memory systems to building sophisticated knowledge networks that evolve organically, prioritize important information, and recall relevant context exactly when needed.

The cognitive architecture of AI memory

At its simplest, our memory system mimics three core aspects of human memory:
Importance-weighted storage: Not all memories are equally valuable.
Reinforcement through repetition: Important concepts strengthen over time.
Contextual retrieval: Memories are recalled based on relevance to current context.
This approach differs fundamentally from traditional conversation storage:

| Traditional conversation storage | Cognitive memory architecture |
| --- | --- |
| Flat history retention | Hierarchical knowledge graph |
| Equal weighting of all information | Importance-based prioritization |
| Keyword or vector-only search | Hybrid semantic & keyword retrieval |
| Fixed memory lifetime | Dynamic reinforcement & decay |
| Isolated conversation fragments | Connected knowledge network |

The practical implication is an AI that "thinks" before remembering—evaluating what information to prioritize, how to connect it with existing knowledge, and when to let less important details fade. Let's build a minimum viable implementation of this cognitive memory architecture using MongoDB Atlas, AWS Bedrock, and Anthropic's Claude. Our focus will be on creating the fundamental components that make this system work.

Service architecture

The following service architecture defines the foundational components and their interactions that power the cognitive memory system.

Figure 1. AI memory service architecture.

Built on AWS infrastructure, this comprehensive architecture connects user interactions with sophisticated memory management processes. The User Interface (client application) serves as the entry point where humans interact with the system, sending messages and receiving AI responses enriched with conversation summaries and relevant contextual memories.
At the center sits the AI Memory Service, the critical processing hub that coordinates information flow, processes messages, and manages memory operations across the entire system. MongoDB Atlas provides a scalable, secure, multi-cloud database foundation. The system processes data through the following key functions:
Bedrock Titan Embeddings for converting text to vector representations.
Memory reinforcement for strengthening important information.
Relevance-based retrieval for finding contextually appropriate memories.
Anthropic's Claude LLM handles the importance assessment to evaluate long-term storage value, memory merging for efficient information organization, and conversation summary generation. This architecture ultimately enables AI systems to maintain contextual awareness across conversations, providing more natural, consistent, and personalized interactions over time.

Database structure

The database structure organizes information storage with specialized collections and indexes that enable efficient semantic retrieval and importance-based memory management.

Figure 2. Example of a database structure.

The database design strategically separates raw conversation data from processed memory nodes to optimize performance and functionality. The Conversations collection maintains chronological records of all interactions, preserving the complete historical context, while the Memory Nodes collection stores higher-level semantic information with importance ratings that facilitate cognitive prioritization. Vector Search indexes enable efficient semantic similarity searches with O(log n) performance, allowing the system to rapidly identify contextually relevant information regardless of database size. To manage storage growth automatically, TTL (time-to-live) indexes expire older conversations based on configurable retention policies. Finally, importance and user ID indexes optimize retrieval patterns critical to the system's function, ensuring that high-priority information and user-specific context can be accessed with minimal latency.

Memory node structure

The memory node structure defines the data schemas that combine content with cognitive metadata to enable human-like memory operations.

Figure 3. The memory node structure.

Each node includes an importance score that enables memory prioritization similar to human memory processes, allowing the system to focus on what matters most. The structure tracks access count, which facilitates reinforcement learning by recording how frequently memories are retrieved. A critical feature is the summary field, providing quick semantic access without processing the full content, significantly improving efficiency. Vector embeddings within each node enable powerful semantic search capabilities that mirror human associative thought, connecting related concepts across the knowledge base. Complementing this, the ConversationMessage structure preserves raw conversational context without interpretation, maintaining the original exchange integrity. Both structures incorporate vector embeddings as a unifying feature, enabling sophisticated semantic operations that allow the system to navigate information based on meaning rather than just keywords, creating a more human-like cognitive architecture. The sketch below shows what these collections and a memory node document can look like.
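As a rough sketch of how these structures might be created with pymongo, the snippet below sets up a TTL index on raw conversations, a user/importance index on memory nodes, and inserts an example memory node. All names, values, and retention periods are illustrative rather than taken from the reference implementation, and the Atlas Vector Search index on the embedding field is defined separately (for example, in the Atlas UI).

```python
import datetime
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")  # placeholder URI
db = client["ai_memory"]

# Raw conversations expire automatically after 30 days (TTL index).
db.conversations.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

# Retrieval patterns described above: by user, then by importance.
db.memory_nodes.create_index([("user_id", ASCENDING), ("importance", DESCENDING)])

# Example memory node combining content with cognitive metadata.
db.memory_nodes.insert_one({
    "user_id": "user_42",
    "summary": "Prefers email over phone for follow-ups",
    "content": "User said: 'Please only contact me by email.'",
    "importance": 7,            # 1-10 score assigned by the LLM
    "access_count": 1,          # incremented on each retrieval (reinforcement)
    "embedding": [0.0] * 1024,  # placeholder for the Titan embedding vector
    "created_at": datetime.datetime.now(datetime.timezone.utc),
})
```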
Memory creation process

The memory creation process transforms conversational exchanges into structured memory nodes through a cognitive pipeline that mimics human memory formation by thoughtfully evaluating new information against existing knowledge, rather than indiscriminately storing everything.

Figure 4. The memory creation process.

Through repetition, memories are strengthened via reinforcement, similar to human cognitive processes. At its core, the LLM functions as an "importance evaluator" that assigns each memory a value on a 1-10 scale, reflecting how humans naturally prioritize information based on relevance, uniqueness, and utility. This importance rating directly affects a memory's persistence, recall probability, and survival during pruning operations. As the system evolves, memory merging simulates the human brain's ability to consolidate related concepts over time, while importance updating reflects how new discoveries change our perception of existing knowledge. The framework's pruning mechanism mirrors our natural forgetting of less significant information. Rather than simply accumulating data, this dynamic system creates an evolving memory architecture that continuously refines itself through processes remarkably similar to human cognition.

Memory retrieval process

The memory retrieval process leverages multiple search methodologies that optimize both recall and precision to find and contextualize relevant information across conversations and memory nodes.

Figure 5. The memory retrieval process.

When initiated, the system converts user queries into vector embeddings while simultaneously executing parallel operations to enhance performance. The core of this system is its hybrid search methodology, which combines vector-based semantic understanding with traditional text-based keyword search, allowing it to capture both conceptual similarities and exact term matches. The process directly searches memory nodes and applies different weighting algorithms to combine scores from various search methods, producing a comprehensive relevance ranking. After identifying relevant memories, the system fetches surrounding conversation context to ensure retrieved information maintains appropriate background, followed by generating concise summaries that distill essential insights. A key innovation is the effective importance calculation that dynamically adjusts memory significance based on access patterns and other usage metrics. The final step involves building a comprehensive response package that integrates the original memories, their summaries, relevance scores, and contextual information, providing users with a complete understanding of retrieved information without requiring exhaustive reading of all content. This multi-faceted approach ensures that memory retrieval is both comprehensive and precisely tailored to user needs. A sketch of the hybrid search step follows.
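A simplified sketch of that hybrid step: an Atlas $vectorSearch query and a $search text query are run separately and their scores combined with configurable weights. The index names, fields, and weights are illustrative; the two score scales differ in practice, so a real implementation would normalize them first, and the user_id filter assumes that field is indexed as a filter field in the vector index.

```python
def hybrid_search(db, user_id, query_text, query_vector, vector_weight=0.7, text_weight=0.3):
    """Combine vector and keyword relevance into a single ranking over memory nodes."""
    vector_hits = db.memory_nodes.aggregate([
        {"$vectorSearch": {
            "index": "memory_vector_index",   # illustrative Atlas Vector Search index name
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": 20,
            "filter": {"user_id": user_id},   # assumes user_id is a filter field in the index
        }},
        {"$project": {"summary": 1, "score": {"$meta": "vectorSearchScore"}}},
    ])
    text_hits = db.memory_nodes.aggregate([
        {"$search": {
            "index": "memory_text_index",     # illustrative Atlas Search index name
            "text": {"query": query_text, "path": ["summary", "content"]},
        }},
        {"$limit": 20},
        {"$project": {"summary": 1, "score": {"$meta": "searchScore"}}},
    ])

    # Weighted combination of the two result sets (scores unnormalized for brevity).
    combined = {}
    for hit in vector_hits:
        combined[hit["_id"]] = vector_weight * hit["score"]
    for hit in text_hits:
        combined[hit["_id"]] = combined.get(hit["_id"], 0.0) + text_weight * hit["score"]
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```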
During processing, the system creates and stores message objects in the database, ensuring a permanent record of all conversation interactions. For human-generated messages meeting specific significance criteria, a parallel memory creation branch activates, analyzing the content for long-term storage. This selective approach preserves only meaningful information while reducing storage overhead. The system then processes queries through embedding generation, transforming natural language into vector representations that enable semantic understanding. One of the most sophisticated aspects is the implementation of parallel search functions that simultaneously execute different retrieval strategies, dramatically improving response times while maintaining comprehensive result quality. These searches connect to MongoDB Atlas to perform complex database operations against the stored knowledge base. Retrieved information undergoes context enrichment and summary generation, where the AWS Bedrock (Anthropic’s Claude) LLM augments raw data with contextual understanding and concise overviews of relevant conversation history. Finally, the response combination module assembles diverse data components—semantic matches, text-based results, contextual information, and generated summaries—into a coherent, tailored response that addresses the original request. The system's behavior can be fine-tuned through configurable parameters that govern memory processing, AI model selection, database structure, and service operations, allowing for optimization without code modifications.

Memory updating process

The memory updating process dynamically adjusts memory importance through reinforcement and decay mechanisms that mimic human cognitive functions.

Figure 7. The memory updating process.

When new information arrives, the system first retrieves all existing user memories from the database, then calculates similarity scores between the new content and each stored memory. Memories exceeding a predetermined similarity threshold are identified as conceptually related and undergo importance reinforcement and access count incrementation, strengthening their position in the memory hierarchy. Simultaneously, unrelated memories experience gradual decay as their importance values diminish over time, creating a naturally evolving memory landscape. This balanced approach prevents memory saturation by ensuring that frequently accessed topics remain prominent while less relevant information gracefully fades. The system maintains a comprehensive usage history through access counts, which informs more effective importance calculations and provides valuable metadata for memory management. All these adjustments are persistently stored in MongoDB Atlas, ensuring continuity across user sessions and maintaining a dynamic memory ecosystem that evolves with each interaction.
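A minimal sketch of this reinforcement-and-decay loop follows. The similarity threshold, reinforcement bonus, decay factor, and collection and field names are illustrative assumptions to be tuned for your application, not values from the reference implementation.

SIMILARITY_THRESHOLD = 0.85  # assumed value; tune per application
REINFORCEMENT_BONUS = 1.0    # added to the importance of related memories (capped at 10)
DECAY_FACTOR = 0.98          # multiplicative decay applied to unrelated memories

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def update_memories(db, user_id, new_embedding):
    # Reinforce memories related to the new content and let the rest decay.
    for memory in db.memory_nodes.find({"user_id": user_id}):
        similarity = cosine_similarity(new_embedding, memory["embedding"])
        if similarity >= SIMILARITY_THRESHOLD:
            new_importance = min(10, memory["importance"] + REINFORCEMENT_BONUS)
            db.memory_nodes.update_one(
                {"_id": memory["_id"]},
                {"$set": {"importance": new_importance}, "$inc": {"access_count": 1}},
            )
        else:
            db.memory_nodes.update_one(
                {"_id": memory["_id"]},
                {"$set": {"importance": memory["importance"] * DECAY_FACTOR}},
            )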
Client integration flow

The following diagram illustrates the complete interaction sequence between client applications and the memory system, from message processing to memory retrieval. This flow encompasses two primary pathways:

Message sending flow: When a client sends a message, it triggers a processing chain where the API routes it to the Conversation Service, which generates embeddings via AWS Bedrock. After storing the message in MongoDB Atlas, the Memory Service evaluates it for potential memory creation, performing importance assessment and summary generation before creating or updating a memory node in the database. The flow culminates with a confirmation response returning to the client. Check out the code reference on GitHub.

Memory retrieval flow: During retrieval, the client's request initiates parallel search operations where query embeddings are generated simultaneously across conversation history and memory nodes. These dual search paths—conversation search and memory node search—produce results that are intelligently combined and summarized to provide contextual understanding. The client ultimately receives a comprehensive memory package containing all relevant information. Check out the code reference on GitHub.

Figure 8. The client integration flow.

The architecture deliberately separates conversation storage from memory processing, with MongoDB Atlas serving as the central persistence layer. Each component maintains clear responsibilities and interfaces, ensuring that despite complex internal processing, clients receive unified, coherent responses.

Action plan: Bringing your AI memory system to life

To implement your own AI memory system:

Start with the core components: MongoDB Atlas, AWS Bedrock, and Anthropic’s Claude.
Focus on cognitive functions: Importance assessment, memory reinforcement, relevance-based retrieval, and memory merging.
Tune parameters iteratively: Start with the defaults provided, then adjust based on your application's needs.
Measure the right metrics: Track uniqueness of memories, retrieval precision, and user satisfaction—not just storage efficiency.

To evaluate your implementation, ask these questions: Does your system effectively prioritize truly important information? Can it recall relevant context without excessive prompting? Does it naturally form connections between related concepts? Can users perceive the system's improving memory over time?

Real-world applications and insights

Case study: From repetitive Q&A to evolving knowledge

A customer service AI using traditional approaches typically needs to relearn user preferences repeatedly. With our cognitive memory architecture:

First interaction: User mentions they prefer email communication. The system stores this with moderate importance.
Second interaction: User confirms the email preference. The system reinforces this memory, increasing its importance.
Future interactions: The system consistently recalls the email preference without asking again, but might still verify after long periods due to natural decay.

The result? A major reduction in repetitive questions, leading to a significantly better user experience.

Benefits

Applications implementing this approach achieved unexpected benefits:

Emergent knowledge graphs: Over time, the system naturally forms conceptual clusters of related information.
Insight mining: Analysis of high-importance memories across users reveals shared concerns and interests not obvious from raw conversation data.
Reduced compute costs: Despite the sophisticated architecture, the selective nature of memory storage reduces overall embedding and storage costs compared to retaining full conversation histories.

Limitations

When implementing this system, teams typically face three key challenges:

Configuration tuning: Finding the right balance of importance thresholds, decay rates, and reinforcement factors requires experimentation.
Prompt engineering: Getting consistent, numeric importance ratings from LLMs requires careful prompt design. Our implementation uses clear constraints and numeric-only output requirements (see the sketch after this list).
Memory sizing: Determining the optimal memory depth per user depends on the application context. Too shallow and the AI seems forgetful; too deep and it becomes sluggish.
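The exact prompt from the reference implementation isn't reproduced here; the sketch below simply illustrates the "clear constraints, numeric-only output" idea against the Anthropic Claude messages format on Amazon Bedrock. The prompt wording, model ID, and fallback score are assumptions.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials and region are configured

IMPORTANCE_PROMPT = (
    "Rate the long-term importance of the following user message for future conversations "
    "on a scale of 1 to 10, where 1 is trivial small talk and 10 is critical, durable "
    "information about the user.\n\n"
    "Message: {message}\n\n"
    "Respond with a single integer between 1 and 10 and nothing else."
)

def assess_importance(message: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> int:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "messages": [{"role": "user", "content": IMPORTANCE_PROMPT.format(message=message)}],
    }
    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    text = json.loads(response["body"].read())["content"][0]["text"].strip()
    try:
        return max(1, min(10, int(text)))   # clamp and guard against malformed output
    except ValueError:
        return 5                            # fall back to a neutral score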
Future directions

The landscape for AI memory systems is evolving rapidly. Here are key developments on the horizon:

Short-term developments

Emotion-aware memory: Extending importance evaluation to include emotional salience, remembering experiences that evoke strong reactions.
Temporal awareness: Adding time-based decay that varies by information type (factual vs. preferential).
Multi-modal memory: Incorporating image and voice embeddings alongside text for unified memory systems.

Long-term possibilities

Self-supervised memory optimization: Systems that learn optimal importance ratings, decay rates, and memory structures based on user satisfaction.
Causal memory networks: Moving beyond associative memory to create causal models of user intent and preferences.
Privacy-preserving memory: Implementing differential privacy and selective forgetting capabilities to respect user privacy boundaries.

This approach to AI memory is still evolving. The future of AI isn't just about more parameters or faster inference—it's about creating systems that learn and remember more like humans do. With the cognitive memory architecture we've explored, you're well on your way to building AI that remembers what matters.

Transform your AI applications with cognitive memory capabilities today. Get started with MongoDB Atlas for free and implement vector search in minutes. For hands-on guidance, explore our GitHub repository containing complete implementation code and examples.
Scaling Vector Search with MongoDB Atlas Quantization & Voyage AI Embeddings
Key Takeaways

Vector quantization fundamentals: A technique that compresses high-dimensional embeddings from 32-bit floats to lower precision formats (scalar/int8 or binary/1-bit), enabling significant performance gains while maintaining semantic search capabilities.
Performance vs. precision trade-offs: Binary quantization provides maximum speed (80% faster queries) with minimal resources; scalar quantization offers balanced performance and accuracy; float32 maintains the highest fidelity at significant resource cost.
Resource optimization: Vector quantization can reduce RAM usage by up to 24x (binary) or 3.75x (scalar); storage footprint decreases by 38% using the BSON binary format.
Scaling benefits: Performance advantages multiply at scale and are most significant for vector databases exceeding 1M embeddings.
Semantic preservation: Quantization-aware models like Voyage AI's retain high representation capacity even after compression.
Search quality control: Binary quantization may require rescoring for maximum accuracy; scalar quantization typically maintains 90%+ retention of float32 results.
Implementation ease: MongoDB's automatic quantization requires minimal code changes to leverage quantization techniques.

As vector databases scale into the millions of embeddings, the computational and memory requirements of high-dimensional vector operations become critical bottlenecks in production AI systems. Without effective scaling strategies, organizations face:

Infrastructure costs that grow exponentially with data volume.
Unacceptable query latency that degrades user experience and limits real-time applications.
Restricted deployment options, particularly on edge devices or in resource-constrained environments.
Diminished competitive advantage as AI capabilities become limited by technical constraints and bottlenecks rather than use case innovation.

This technical guide demonstrates advanced techniques for optimizing vector search operations through precision-controlled quantization—transforming resource-intensive 32-bit float embeddings into performance-optimized representations while preserving semantic fidelity. By leveraging MongoDB Atlas Vector Search's automatic quantization capabilities with Voyage AI's quantization-aware embedding models, we'll implement systematic optimization strategies that dramatically reduce both computational overhead and memory footprint. This guide provides an empirical analysis of the critical performance metrics:

Retrieval latency benchmarking: Quantitative comparison of search performance across binary, scalar, and float32 precision levels, with controlled evaluation of HNSW (hierarchical navigable small world) graph exploration parameters and k-retrieval variations.
Representational capacity retention: Precise measurement of semantic information preservation through direct comparison of quantized vector search results against full-fidelity retrieval, with particular attention to retention curves across varying retrieval depths.

We'll present implementation strategies and evaluation methodologies for vector quantization that simultaneously optimize for both computational efficiency and semantic fidelity—enabling you to make evidence-based architectural decisions for production-scale AI retrieval systems handling millions of embeddings.
The techniques demonstrated here are directly applicable to enterprise-grade RAG architectures, recommendation engines, and semantic search applications where millisecond-level latency improvements and dramatic RAM reduction translate to significant infrastructure cost savings. The full end-to-end implementation for automatic vector quantization and the other operations involved in RAG/agent pipelines can be found in our GitHub repository.

Auto-quantization of Voyage AI embeddings with MongoDB

Our approach addresses the complete optimization cycle for vector search operations, covering:

Generating embeddings with quantization-aware models
Implementing automatic vector quantization in MongoDB Atlas
Creating and configuring specialized vector search indices
Measuring and comparing latency across different quantization strategies
Quantifying representational capacity retention
Analyzing performance trade-offs between binary, scalar, and float32 implementations
Making evidence-based architectural decisions for production AI retrieval systems

Figure 1. Vector quantization architecture with MongoDB Atlas and Voyage AI.

Using text data as an example, we convert documents into numerical vector embeddings that capture semantic relationships. MongoDB then indexes and stores these embeddings for efficient similarity searches. By comparing queries run against float32, int8, and binary embeddings, you can gauge the trade-offs between precision and performance and better understand which quantization strategy best suits large-scale, high-throughput workloads.

One key takeaway from this article is that representational capacity retention is highly dependent on the embedding model used. With quantization-aware models like Voyage AI's voyage-3-large at appropriate dimensionality (1024 dimensions), our tests demonstrate that we can achieve 95%+ recall retention at reasonable numCandidates values. This means organizations can significantly reduce memory and computational requirements while preserving semantic search quality, provided they select embedding models specifically designed to maintain their representation capacity after quantization. For more information on why vector quantization is crucial for AI workloads, refer to this blog post.

Dataset information

Our quantization evaluation framework leverages two complementary datasets designed specifically to benchmark semantic search performance across different precision levels.

Primary dataset (Wikipedia-22-12-en-voyage-embed): Contains approximately 300,000 Wikipedia article fragments with pre-generated 1024-dimensional embeddings from Voyage AI's voyage-3-large model. This dataset serves as a diverse vector corpus for testing vector quantization effects in semantic search. Throughout this tutorial, we'll use the primary dataset to demonstrate the technical implementation of quantization.

Embedding generation with Voyage AI

For generating new embeddings for AI search applications, we use Voyage AI's voyage-3-large model, which is specifically designed to be quantization-aware. The voyage-3-large model generates 1024-dimensional vectors and has been specifically trained to maintain its semantic properties even after quantization, making it ideal for our AI retrieval optimization strategy. For more information on how MongoDB and Voyage AI work together for optimal retrieval, see our previous article, Rethinking Information Retrieval with MongoDB and Voyage AI.
import voyageai

# Initialize the Voyage AI client
client = voyageai.Client()

def get_embedding(text, task_prefix="document"):
    """
    Generate embeddings using the voyage-3-large model for AI retrieval.

    Parameters:
        text (str): The input text to be embedded.
        task_prefix (str): The Voyage input type for the text ("document" or "query").

    Returns:
        list: The embedding vector (1024 dimensions).
    """
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Call the Voyage API to generate the embedding
    result = client.embed([text], model="voyage-3-large", input_type=task_prefix)

    # Return the first embedding from the result
    return result.embeddings[0]

Converting embeddings to BSON BinData format

A critical optimization step is converting embeddings to MongoDB's BSON BinData format, which significantly reduces storage and memory requirements. The BinData vector format provides significant advantages:

Reduces disk space by approximately 3x compared to arrays
Enables more efficient indexing with alternate types (int8, binary)
Reduces RAM usage by 3.75x for scalar and 24x for binary quantization

from bson.binary import Binary, BinaryVectorDtype

def generate_bson_vector(array, data_type):
    return Binary.from_vector(array, BinaryVectorDtype(data_type))

# Convert embeddings to BSON BinData vector format
wikipedia_data_df["embedding"] = wikipedia_data_df["embedding"].apply(
    lambda x: generate_bson_vector(x, BinaryVectorDtype.FLOAT32)
)

Vector index creation with different quantization strategies

The cornerstone of our performance optimization framework lies in creating specialized vector indices with different quantization strategies. This process leverages MongoDB for general-purpose database functionality and, more specifically, its high-performance vector database capabilities for efficiently handling million-scale embedding collections. This implementation step sets up MongoDB's vector search capabilities with automatic quantization, focusing on two primary quantization strategies: scalar (int8) and binary. Two quantized indices, plus a full-fidelity float32 index, are created so we can measure and evaluate retrieval latency and recall performance across precision data types.

The MongoDB database uses the HNSW vector index, a graph-based indexing algorithm that organizes vectors in a hierarchical structure of layers. In this structure, vector data points within a layer are contextually similar, while higher layers are sparse compared to lower layers, which are denser and contain more vector data points.

The code snippet below showcases the implementation of the two quantization strategies in parallel; this enables systematic evaluation of the latency, memory usage, and representational capacity trade-offs across the precision spectrum, enabling data-driven decisions about the optimal approach for specific application requirements. MongoDB Atlas automatic quantization is activated entirely through the vector index definition. By including the "quantization" attribute and setting its value to either "scalar" or "binary", you enable automatic compression of your embeddings at index creation time. This declarative approach means no separate preprocessing of vectors is required—MongoDB handles the dimensional reduction transparently while maintaining the original embeddings for potential rescoring operations.
from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """Setup a vector search index with the specified configuration."""
    ...

# 1. Scalar quantized index (int8)
vector_index_definition_scalar_quantized = {
    "fields": [{
        "type": "vector",
        "path": "embedding",
        "quantization": "scalar",  # Uses int8 quantization
        "numDimensions": 1024,
        "similarity": "cosine",
    }]
}

# 2. Binary quantized index (1-bit)
vector_index_definition_binary_quantized = {
    "fields": [{
        "type": "vector",
        "path": "embedding",
        "quantization": "binary",  # Uses binary (1-bit) quantization
        "numDimensions": 1024,
        "similarity": "cosine",
    }]
}

# 3. Float32 ANN index (no quantization)
vector_index_definition_float32_ann = {
    "fields": [{
        "type": "vector",
        "path": "embedding",
        "numDimensions": 1024,
        "similarity": "cosine",
    }]
}

# Create the indices
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_scalar_quantized,
    "vector_index_scalar_quantized",
)
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_binary_quantized,
    "vector_index_binary_quantized",
)
setup_vector_search_index(
    wiki_data_collection,
    vector_index_definition_float32_ann,
    "vector_index_float32_ann",
)

Implementing vector search functionality

Vector search serves as the computational foundation of modern generative AI systems. While LLMs provide reasoning and generation capabilities, vector search delivers the contextual knowledge necessary for grounding these capabilities in relevant information. This semantic retrieval operation forms the backbone of RAG architectures that power enterprise-grade AI applications, such as knowledge-intensive chatbots and domain-specific assistants. In more advanced implementations, vector search enables agentic RAG systems where autonomous agents dynamically determine what information to retrieve, when to retrieve it, and how to incorporate it into complex reasoning chains.

The implementation below transforms raw embedding vectors into an intelligent search component that moves beyond lexical matching to true semantic understanding. It supports both approximate nearest neighbor (ANN) search and exact nearest neighbor (ENN) search through the use_full_precision parameter:

Approximate nearest neighbor (ANN) search: When use_full_precision = False, the system performs an approximate search using the specified quantized index (binary or scalar), the HNSW graph navigation algorithm, and a controlled exploration breadth via numCandidates. This approach sacrifices perfect accuracy for dramatic performance gains, particularly at scale. The HNSW algorithm enables sub-linear time complexity by intelligently sampling the vector space, making it possible to search billions of vectors in milliseconds instead of seconds. When combined with quantization, ANN delivers order-of-magnitude improvements in both speed and memory efficiency.

Exact nearest neighbor (ENN) search: When use_full_precision = True, the system performs an exact search using the original float32 embeddings (regardless of the index specified), an exhaustive comparison approach, and the exact = True directive to bypass approximation techniques. ENN guarantees finding the mathematically optimal nearest neighbors by computing distances between the query vector and every single vector in the database.
This brute-force approach provides perfect recall but scales linearly with collection size, becoming prohibitively expensive as vector counts increase beyond millions. We include both search modes for several critical reasons:

Establishing ground truth: ENN provides the "perfect" baseline against which we measure the quality degradation of approximation techniques. The representational retention metrics discussed later directly compare ANN results against this ENN ground truth.
Varying application requirements: Not all AI applications prioritize the same metrics. Time-sensitive applications (real-time customer service) might favor ANN's speed, while high-stakes applications (legal document analysis) might require ENN's accuracy.

def custom_vector_search(
    user_query,
    collection,
    embedding_path,
    vector_search_index_name="vector_index",
    top_k=5,
    num_candidates=25,
    use_full_precision=False,
):
    """
    Perform vector search with configurable precision and parameters
    for AI search applications.
    """
    # Generate embedding for the query
    query_embedding = get_embedding(user_query, task_prefix="query")

    # Define the vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_search_index_name,
            "queryVector": query_embedding,
            "path": embedding_path,
            "limit": top_k,
        }
    }

    # Configure search precision approach
    if not use_full_precision:
        # For approximate nearest neighbor (ANN) search
        vector_search_stage["$vectorSearch"]["numCandidates"] = num_candidates
    else:
        # For exact nearest neighbor (ENN) search
        vector_search_stage["$vectorSearch"]["exact"] = True

    # Project only needed fields
    project_stage = {
        "$project": {
            "_id": 0,
            "title": 1,
            "text": 1,
            "wiki_id": 1,
            "url": 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    }

    # Build and execute the pipeline
    pipeline = [vector_search_stage, project_stage]
    ...

    # Execute the query
    results = list(collection.aggregate(pipeline))
    return {"results": results, "execution_time_ms": execution_time_ms}

Measuring the retrieval latency of various quantized vectors

In production AI retrieval systems, query latency directly impacts user experience, operational costs, and system throughput capacity. Vector search operations typically constitute the primary performance bottleneck in RAG architectures, making latency optimization a critical engineering priority. Sub-100ms response times are often necessary for interactive and mission-critical applications, while batch processing systems may tolerate higher latencies but require consistent predictability for resource planning.

Our latency measurement methodology employs a systematic, parameterized approach that models real-world query patterns while isolating the performance characteristics of different quantization strategies. This parameterized benchmarking enables us to:

Construct detailed latency profiles across varying retrieval depths
Identify performance inflection points where quantization benefits become significant
Map the scaling curves of different precision levels as the data volume increases
Determine optimal configuration parameters for specific throughput targets

def measure_latency_with_varying_topk(
    user_query,
    collection,
    vector_search_index_name,
    use_full_precision=False,
    top_k_values=[5, 10, 50, 100],
    num_candidates_values=[25, 50, 100, 200, 500, 1000, 2000],
):
    """
    Measure search latency across different configurations.
    """
    results_data = []

    for top_k in top_k_values:
        for num_candidates in num_candidates_values:
            # Skip invalid configurations
            if num_candidates < top_k:
                continue

            # Get precision type from index name
            precision_name = vector_search_index_name.split("vector_index")[1]
            precision_name = precision_name.replace("quantized", "").capitalize()
            if use_full_precision:
                precision_name = "_float32_ENN"

            # Perform search and measure latency
            vector_search_results = custom_vector_search(
                user_query=user_query,
                collection=collection,
                embedding_path="embedding",
                vector_search_index_name=vector_search_index_name,
                top_k=top_k,
                num_candidates=num_candidates,
                use_full_precision=use_full_precision,
            )
            latency_ms = vector_search_results["execution_time_ms"]

            # Store results
            results_data.append({
                "precision": precision_name,
                "top_k": top_k,
                "num_candidates": num_candidates,
                "latency_ms": latency_ms,
            })
            print(
                f"Top-K: {top_k}, NumCandidates: {num_candidates}, "
                f"Latency: {latency_ms} ms, Precision: {precision_name}"
            )

    return results_data

Latency results analysis

Our systematic benchmarking reveals dramatic performance differences between quantization strategies across different retrieval scenarios. The visualizations below capture these differences for top-k=10 and top-k=100 configurations.

Figure 2. Search latency vs. the number of candidates for top-k=10.

Figure 3. Search latency vs. the number of candidates for top-k=100.

Several critical patterns emerge from these latency profiles:

Quantization delivers order-of-magnitude performance gains: The float32_ENN approach (purple line) shows latency measurements an order of magnitude higher than any quantized approach. At top-k=10, ENN latency starts at ~1600ms and never drops below 500ms, while quantized approaches maintain sub-100ms performance until extremely high candidate counts. This performance gap widens further as data volume scales.

Scalar quantization offers the best performance profile: Somewhat surprisingly, scalar quantization (orange line) consistently outperforms both binary quantization and float32 ANN across most configurations. This is particularly evident at higher num_candidates values, where scalar quantization maintains near-flat latency scaling. This suggests scalar quantization achieves an optimal balance in the memory-computation trade-off for HNSW traversal.

Binary quantization shows steeper latency scaling: While binary quantization (red line) starts with excellent performance, its latency increases more steeply as num_candidates grows, eventually exceeding scalar quantization at very high exploration depths. This suggests that while binary vectors require less memory, their distance computation savings are partially offset by the need for more complex traversal patterns in the HNSW graph and rescoring.

All quantization methods maintain interactive-grade performance: Even with 10,000 candidate explorations and top-k=100, all quantized approaches maintain sub-200ms latency, well within interactive application requirements. This demonstrates that quantization enables order-of-magnitude increases in exploration depth without sacrificing user experience, allowing for dramatic recall improvements while maintaining acceptable latency.

These empirical results validate our theoretical understanding of quantization benefits and provide concrete guidance for production deployment: scalar quantization offers the best general-purpose performance profile, while binary quantization excels in memory-constrained environments with moderate exploration requirements.
In the images below, we employ logarithmic scaling for both axes in our latency analysis because search performance data typically spans multiple orders of magnitude. When comparing different precision types (scalar, binary, float32_ann) across varying numbers of candidates, the latency values can range from milliseconds to seconds, while candidate counts may vary from hundreds to millions. Linear plots would compress smaller values and make it difficult to observe performance trends across the full range (as we see above). Logarithmic scaling transforms exponential relationships into linear ones, making it easier to identify proportional changes, compare relative performance improvements, and detect patterns that would otherwise be obscured. This visualization approach is particularly valuable for understanding how each precision type scales with increasing workload and for identifying the optimal operating ranges where certain methods outperform others (as shown below).

Figure 4. Search latency vs. the number of candidates (log scale) for top-k=10.

Figure 5. Search latency vs. the number of candidates (log scale) for top-k=100.

The performance characteristics observed in the logarithmic plots above directly reflect the architectural differences inherent in binary quantization's two-stage retrieval process. Binary quantization employs a coarse-to-fine search strategy: an initial fast retrieval phase using low-precision binary representations, followed by a refinement phase that rescores the top-k candidates using full-precision vectors to restore accuracy. This dual-phase approach creates a fundamental performance trade-off that manifests differently across varying candidate pool sizes. For smaller candidate sets, the computational savings from binary operations during the initial retrieval phase can offset the rescoring overhead, making binary quantization competitive with other methods. However, as the candidate pool expands, the rescoring phase—which must compute full-precision similarity scores for an increasing number of retrieved candidates—begins to dominate the total latency profile.

Measuring representational capacity retention

While latency optimization is critical for operational efficiency, the primary concern for AI applications remains semantic accuracy. Vector quantization introduces a fundamental trade-off: computational efficiency versus representational capacity. Even the most performant quantization approach is useless if it fails to maintain the semantic relationships encoded in the original embeddings. To quantify this critical quality dimension, we developed a systematic methodology for measuring representational capacity retention—the degree to which quantized vectors preserve the same nearest-neighbor relationships as their full-precision counterparts. This approach provides an objective, reproducible framework for evaluating semantic fidelity across different quantization strategies.

def measure_representational_capacity_retention_against_float_enn(
    ground_truth_collection,
    collection,
    quantized_index_name,
    top_k_values,
    num_candidates_values,
    num_queries_to_test=1,
):
    """
    Compare quantized search results against the full-precision baseline.

    For each test query:
    1. Perform a baseline search with float32 exact search
    2. Perform the same search with quantized vectors
    3. Calculate retention as the % of baseline results found in the quantized results
    """
    retention_results = {"per_query_retention": {}}
    overall_retention = {}

    # Initialize tracking structures
    for top_k in top_k_values:
        overall_retention[top_k] = {}
        for num_candidates in num_candidates_values:
            if num_candidates < top_k:
                continue
            overall_retention[top_k][num_candidates] = []

    # Get precision type
    precision_name = quantized_index_name.split("vector_index")[1]
    precision_name = precision_name.replace("quantized", "").capitalize()

    # Load test queries from ground truth annotations
    ground_truth_annotations = list(
        ground_truth_collection.find().limit(num_queries_to_test)
    )

    # For each annotation, test all its questions
    for annotation in ground_truth_annotations:
        ground_truth_wiki_id = annotation["wiki_id"]
        ...

    # Calculate average retention for each configuration
    avg_overall_retention = {}
    for top_k, cand_dict in overall_retention.items():
        avg_overall_retention[top_k] = {}
        for num_candidates, retentions in cand_dict.items():
            if retentions:
                avg = sum(retentions) / len(retentions)
            else:
                avg = 0
            avg_overall_retention[top_k][num_candidates] = avg

    retention_results["average_retention"] = avg_overall_retention
    return retention_results

Our methodology takes a rigorous approach to retention measurement:

Establishing ground truth: We use float32 exact nearest neighbor (ENN) search as the baseline "perfect" result set, acknowledging that these are the mathematically optimal neighbors.
Controlled comparison: For each query in our annotation dataset, we perform parallel searches using different quantization strategies, carefully controlling for top-k and num_candidates parameters.
Retention calculation: We compute retention as the ratio of overlapping results between the quantized search and the ENN baseline: |quantized_results ∩ baseline_results| / |baseline_results|.
Statistical aggregation: We average retention scores across multiple queries to account for query-specific variations and produce robust, generalizable metrics.

This approach provides a direct, quantitative measure of how much semantic fidelity is preserved after quantization. A retention score of 1.0 indicates that the quantized search returns exactly the same results as the full-precision search, while lower scores indicate divergence.
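As a small, hypothetical illustration of the retention formula above (the document IDs are made up):

# Hypothetical IDs returned for one query by the float32 ENN baseline vs. a quantized ANN search.
baseline_ids = ["a", "b", "c", "d", "e"]    # float32 exact search (ground truth)
quantized_ids = ["a", "b", "c", "e", "f"]   # e.g., binary quantized ANN search

retention = len(set(baseline_ids) & set(quantized_ids)) / len(baseline_ids)
print(retention)  # 0.8 -> the quantized index recovered 4 of the 5 baseline results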
Representational capacity results analysis

The findings from the representational capacity retention evaluation provide empirical validation that properly implemented quantization—particularly scalar quantization—can maintain semantic fidelity while dramatically reducing computational and memory requirements. Note that in the chart below, the scalar curve (yellow) exactly matches the float32_ann performance (blue)—so much so that the blue line is completely hidden beneath the yellow. The near-perfect retention of scalar quantization should alleviate concerns about quality degradation, while binary quantization's retention profile suggests it's suitable for applications with higher performance demands that can tolerate slight quality trade-offs or compensate with increased exploration depth.

Figure 6. Retention score vs. the number of candidates for top-k=10.

Figure 7. Retention score vs. the number of candidates for top-k=50.

Figure 8. Retention score vs. the number of candidates for top-k=100.

Scalar quantization achieves near-perfect retention: The scalar quantization approach (orange line) demonstrates extraordinary representational capacity preservation, achieving 98-100% retention across nearly all configurations. At top-k=10, it reaches perfect 1.0 retention with just 100 candidates, effectively matching full-precision ENN results while using 4x less memory. This remarkable performance validates the effectiveness of int8 quantization when implemented with MongoDB's automatic quantization.

Binary quantization shows a retention-exploration trade-off: Binary quantization (red line) exhibits a clear correlation between exploration depth and retention quality. At top-k=10, it starts at ~91% retention with minimal candidates but improves to 98% at 500 candidates. The effect is more pronounced at higher top-k values (50 and 100), where initial retention drops to ~74% but recovers substantially with increased exploration. This suggests that binary quantization's information loss can be effectively mitigated by exploring more of the vector space.

Retention dynamics change with retrieval depth: As top-k increases from 10 to 100, the retention patterns become more differentiated between quantization strategies. This reflects the increasing challenge of maintaining accurate rankings as more results are requested. While scalar quantization remains relatively stable across different top-k values, binary quantization shows more sensitivity, indicating it's better suited for targeted retrieval scenarios (low top-k) than for broad exploration.

Exploration depth compensates for precision loss: A fascinating pattern emerges across all quantization methods: increased num_candidates consistently improves retention. This demonstrates that reduced precision can be effectively counterbalanced by broader exploration of the vector space. For example, binary quantization at 500 candidates achieves better retention than scalar quantization at 25 candidates, despite using 32x less memory per vector.

Float32 ANN vs. scalar quantization: The float32 ANN approach (blue line) shows virtually identical retention to scalar quantization at higher top-k values, while consuming 4x more memory. This suggests scalar quantization represents an optimal balance point, offering full-precision quality with significantly reduced resource requirements.

Conclusion

This guide has demonstrated the powerful impact of vector quantization in optimizing vector search operations through MongoDB Atlas Vector Search and its automatic quantization feature, using Voyage AI embeddings. The results confirm that, with quantization-aware embeddings, compressed vectors preserve semantic fidelity while dramatically reducing computational and memory requirements:

Binary quantization achieves optimal latency and resource efficiency, particularly valuable for high-scale deployments where speed is critical.
Scalar quantization provides an effective balance between performance and precision, suitable for most production applications.
Float32 maintains maximum accuracy but incurs significant performance and memory costs.

Figure 9. Performance and memory usage metrics for binary quantization, scalar quantization, and float32 implementation.
Based on the figure above, our implementation demonstrated substantial efficiency gains:

Binary Quantized Index has a disk footprint of 407.66MB, approximately 4KB per document. Its compression comes from representing high-dimensional vectors as single bits, dramatically reducing in-memory requirements while maintaining retrieval capability.
Float32 ANN Index requires 394.73MB of disk space, slightly less than binary due to optimized index structures, but demands that the full storage footprint be loaded into memory for optimal performance.
Scalar Quantized Index shows the largest storage requirement at 492.83MB (approximately 5KB per document), suggesting this method maintains higher precision than binary while still applying compression techniques, resulting in a middle-ground approach between full precision and extreme quantization.

The most striking difference lies in memory requirements. Binary quantization demonstrates a 23:1 memory efficiency ratio, requiring only 16.99MB in RAM versus the 394.73MB needed by float32_ann. Scalar quantization provides a 3:1 memory optimization, requiring 131.42MB compared to float32_ann's full memory footprint.

For production AI retrieval implementations, the general guidance is as follows:

Use scalar quantization for general use cases requiring a good balance of speed and accuracy.
Use binary quantization for large-scale applications (1M+ vectors) where speed is critical.
Use float32 only for applications requiring maximum precision, where accuracy is paramount.

Vector quantization becomes particularly valuable for databases exceeding 1M vectors, where it enables significant scalability improvements without compromising retrieval accuracy. When combined with MongoDB Atlas Search Nodes, this approach effectively addresses both cost and performance constraints in advanced vector search applications.

Boost your MongoDB skills today through our Atlas Learning Hub. Head over to our quick start guide to get started with Atlas Vector Search.
Strategic Database Architecture for AI - Unified vs. Split
Key takeaways: A unified architecture significantly reduces development complexity by eliminating synchronization challenges between separate vector and operational databases. Data consistency is guaranteed through atomic transactions in unified systems, preventing "ghost documents" and other split architecture failures. The total cost of ownership is typically lower with unified architectures due to consolidated infrastructure and reduced maintenance burden. Developer velocity increases with unified approaches as teams can focus on building features rather than integration code and error handling. MongoDB Atlas provides future-proofing benefits with integrated AI capabilities like vector search, automatic quantization and more. AI demands more from databases, and the architectural decisions organizations make today directly affect their time‑to‑market and competitive edge. In the generative AI era, your database must support both high‑dimensional vector searches and fast transactional workloads to keep pace with rapid business and technological change. In this piece, we examine the architectural considerations technology leaders and architects should consider when managing AI applications’ diverse data requirements, including high-dimensional vector embeddings for semantic search alongside traditional operational data (user profiles, content metadata, etc.). This dichotomy presents two distinct architectural approaches— split versus unified —each with significant implications for application performance, consistency, and developer experience. Note: For technical leaders who want to equip their teams with the nuts and bolts details—or who need solid evidence to win over skeptical developers—we've published a comprehensive implementation guide . While this article focuses on the strategic considerations, the guide dives into the code-level realities that your development team will appreciate. Why data architecture matters Building successful AI products and features involves thinking ahead about the speed and cost of intelligence at scale . Whether you’re implementing semantic search for a knowledge base or powering a real-time recommendation engine, your database architecture underpins how quickly and reliably you can bring those features to market. In the AI era, success no longer hinges solely on having innovative algorithms—it's fundamentally determined by output accuracy and relevancy. This represents a profound shift: data architecture, once relegated to IT departments, has become everyone's strategic concern. It directly influences how quickly your developers can innovate ( developer velocity ), how rapidly you can introduce new capabilities to the market ( time-to-market ), and how reliably your systems perform under real-world conditions ( operational reliability ). In essence, your data architecture has become the foundation upon which your entire AI strategy either thrives or falters. Your data architecture is your data foundation. Unlike traditional applications that dealt mostly with structured data and simple CRUD queries, AI applications generate and query vector representations of unstructured data (like text, images, and audio) to find “similar” items. These vectors are often stored in dedicated vector databases or search engines optimized for similarity search. At the same time, applications still need traditional queries (exact lookups, aggregations, transactions on business data). 
This raises a fundamental architectural question: Do we use separate specialized databases for these different workloads and data structures, or unify them in one system? Let's also take the opportunity to briefly address the concept of an "AI database" that has emerged to describe a system that handles both standard operational workloads and AI-specific operations such as vector search. In short, behind the AI search capabilities of modern AI applications are AI retrieval techniques enabled by databases optimized for AI workloads.

Split architecture: Integrating a separate vector store

In a split architecture, vector operations and transactional data management are delegated to separate, specialized systems. A general-purpose database (e.g., MongoDB, PostgreSQL) maintains operational data, while a dedicated vector store (e.g., Elasticsearch, Pinecone) manages embeddings and similarity search operations. On the surface, this divide-and-conquer approach lets each system do what it's best at. The search engine or dedicated vector store can specialize in vector similarity queries, while the operational database handles updates and persistence. This leverages specialized optimizations in each system but introduces synchronization requirements between data stores. Many AI teams have implemented semantic search and other AI functionalities this way, using an external vector index alongside their application database, with both systems kept in sync through custom middleware or application-level logic.

Split architecture characteristics:

Specialized systems: Each database is optimized for its role (e.g., the operational DB ensures fast writes, ACID transactions, and rich queries; the vector search engine provides efficient similarity search using indexes like HNSW for approximate nearest neighbor).
Data duplication: Vector embeddings (and often some identifiers or metadata) are duplicated in the vector store. The primary ID or key exists in both systems to link results.
Synchronization logic: The application must handle synchronization – for every create/update/delete of a record, you need to also update or delete the corresponding vector entry in the search index. This can be done via event streams, change capture, or application code calling two systems.
Data querying: Multi-stage query patterns require cross-system coordination.
Example stack: Using MongoDB as the source of truth for product documents, and Elasticsearch as a vector search engine for product description embeddings. The app writes to MongoDB, then indexes the embedding into Elasticsearch; at query time it does a vector search in Elasticsearch, then fetches the full document from MongoDB by ID.

This system pattern is what we hear from a number of AI teams that leverage MongoDB and... well, just about anything else that promises to make vectors dance faster. It's the architectural equivalent of wearing both a belt and suspenders—sure, your pants aren't falling down, but you're working awfully hard to solve what could be a simpler problem. These teams often find themselves building more synchronization code than actual features, turning what should be AI innovation into a complex juggling act of database coordination.

Figure 1. Split architecture: MongoDB operational database + Elasticsearch vector store.

Putting belts and suspenders aside, the notable point is that splitting the architecture comes at a cost. You now have two sources of truth that need to stay in sync.
Every time you add or update data, you must index the vector in the search engine. Every query involves multiple round trips – one to the search service to find relevant items, and another to the database to fetch full details. This added complexity can slow development and introduces potential points of failure. Operating a split system introduces challenges, as we’ll discuss, around consistency (e.g. “ghost” records when the two systems get out of sync) and added complexity in development and maintenance. In extremely high-scale or ultra-low-latency use cases (e.g., >1B vectors or <1 ms NN SLAs), a dedicated vector engine such as FAISS or Milvus may still outperform a general-purpose database on raw similarity-search throughput. However, MongoDB Atlas’s Search Nodes isolate vector search workloads onto separate, memory-optimized instances—allowing you to scale and tune search performance independently of your database nodes, often delivering the low-latency guarantees modern AI applications require. Unified architecture with MongoDB Atlas: One platform for AI data In a unified architecture , a single database platform handles both operational data and vector search functionalities. MongoDB Atlas Vector Search integrates vector indexing and search directly into the MongoDB database. This architectural pattern simplifies the data model by storing embeddings alongside associated data in the same document structure. The database system internally manages vector indexing (using algorithms like HNSW) and provides integrated query capabilities across both vector and traditional data patterns. In practice, this means your application can execute one query (to MongoDB) that filters and finds data based on vector similarity, without needing a second system. This means all data – your application’s documents and their vector representations – live in one place, under one ACID-compliant transactional system for your AI workload. Unified architecture characteristics: Single source of truth: Both the raw data and the vector indexes reside in one database. For example, MongoDB Atlas allows storing vector fields in documents and querying them with integrated vector search operators. There is no need to duplicate or sync data between different systems. Atomic operations: Updates to a document and its vector embedding occur in one atomic transaction or write operation. This guarantees strong consistency – your vector index can’t drift from your document data. If a transaction fails, none of the changes (neither the document nor its embedding) are committed. This eliminates issues like “ghost documents” (we’ll define this shortly) because it's impossible to have an embedding without its corresponding document in the same database. Unified query capabilities: The query language (e.g. MongoDB’s MQL) can combine traditional filters, full-text search, and vector similarity search in one query. This hybrid search capability means you can, for instance, find documents where category = "Tech" and embedding is similar to a query vector – all in one go. You don’t have to do two queries in different systems and then merge results in your application. Operational simplicity: There’s only one system to manage, secure, scale, and monitor. 
In a managed cloud platform like MongoDB Atlas, you get a fully managed service that handles both operational and vector workloads, often with features to optimize each (for example, dedicated "search nodes" that handle search indexing and queries so that heavy vector searches don't impact transactional workload performance).

Figure 2. Unified Architecture: MongoDB Atlas with integrated Vector Search.

MongoDB Atlas integrates an Atlas Vector Search engine (built on Apache Lucene, the same technology used in some dedicated vector search engines) directly into the database. This allows developers to store high-dimensional vectors in documents and run similarity searches using indexes powered by algorithms like HNSW (Hierarchical Navigable Small World graphs) for approximate nearest neighbor (ANN) search. Additional features like vector quantization (to compress vectors for efficiency) and hybrid search (combining vector and text searches) are supported out of the box and constructed with the MongoDB Query Language (MQL). All of this occurs under the umbrella of the MongoDB Atlas database's transaction engine and security architecture. In short, the unified approach aims to provide the best of both worlds – the rich functionality of a specialized vector store and the reliability and consistency of a single operational datastore.
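To make the single-query pattern concrete, here is a minimal sketch of a filtered vector search expressed as one aggregation against one system. The connection string, collection, index, and field names are assumptions, and the category filter presumes that field is indexed as a filter field in the vector index definition.

from pymongo import MongoClient

client = MongoClient("<your-atlas-connection-string>")   # hypothetical URI
articles = client["kb"]["articles"]                       # hypothetical database/collection

query_embedding = [0.02] * 1024   # placeholder; use your embedding model's output here

pipeline = [
    {
        "$vectorSearch": {
            "index": "article_vector_index",     # assumed vector index with a filter field on "category"
            "path": "embedding",
            "queryVector": query_embedding,
            "filter": {"category": "Tech"},      # operational filter and vector similarity in one stage
            "numCandidates": 200,
            "limit": 10,
        }
    },
    {"$project": {"title": 1, "body": 1, "score": {"$meta": "vectorSearchScore"}}},
]

results = list(articles.aggregate(pipeline))   # one round trip: no second system, no ID re-fetch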
A strategic consideration for decision makers

For technical leaders managing both innovation and budgets, the unified approach presents a compelling financial case alongside its technical merits. If your organization is already leveraging MongoDB as your operational database—as thousands of enterprises worldwide do—the path to AI enablement becomes remarkably streamlined. Rather than allocating budget for an entirely new vector database system, with all the associated licensing, infrastructure, and staffing costs, you can extend your existing MongoDB investment to handle vector workloads. Your teams already understand MongoDB's architecture, security model, and operational characteristics. Adding vector capabilities becomes an incremental skill addition rather than a steep learning curve for an entirely new system. For projects already in flight, migrating vector data or generating new embeddings within your existing MongoDB infrastructure can be accomplished without disrupting ongoing operations.

Technical overview of split vs. unified architecture

To illustrate the practical implications of each architecture, let's look at high-level implementation and operational considerations for a knowledge base question-answering application. Both approaches enable vector similarity search, but with notable differences in implementation complexity and consistency guarantees.

Figure 3. Split architecture: The hidden cost.

In a split architecture (e.g., using MongoDB + Elasticsearch): We store the article content and metadata in MongoDB, and store the embedding vectors in an Elasticsearch index. At query time, we'll search the Elasticsearch index by vector similarity to get a list of top article IDs, then retrieve those articles from MongoDB by their IDs. Several key operations are involved in a dual-database architecture:

Creation: During document creation, the application must coordinate insertions across both systems. First, the document is stored in MongoDB, then its vector embedding is generated and stored in Elasticsearch. If either operation fails, manual rollback logic is needed to maintain consistency. For example, if the MongoDB insertion succeeds but the Elasticsearch indexing fails, developers must implement custom cleanup code to delete the orphaned MongoDB document.

Read: Vector search becomes a multi-stage process in a split architecture. The application first queries Elasticsearch to find similar vectors, retrieves only the document IDs, then makes a second round trip to MongoDB to fetch the complete documents matching those IDs. This introduces additional network latency and requires error handling for cases where documents exist in one system but not the other.

Update: Updating content presents significant synchronization challenges. After updating a document in MongoDB, the application must also update the corresponding vector in Elasticsearch. If the Elasticsearch update fails after the MongoDB update succeeds, the systems become out of sync, with the vector search returning outdated or incorrect results. There's no atomic transaction spanning both systems, requiring complex recovery mechanisms.

Deletion: Deletion operations face similar synchronization issues. When a document is deleted from MongoDB but the corresponding deletion in Elasticsearch fails, "ghost documents" appear in search results—vectors pointing to documents that no longer exist. Users receive search results they cannot access, creating a confusing experience and potential security concerns if sensitive information remains indirectly accessible through preview content stored in Elasticsearch.

Each of these operations requires careful error handling, retry mechanisms, monitoring systems, and background reconciliation processes to maintain consistency between the two databases. Notably, the complexity compounds over time, with synchronization issues becoming more difficult to detect and resolve as the data volume grows, ultimately impacting both developer productivity and user experience.

Figure 4. CRUD operations in a unified architecture: MongoDB Atlas with vector search.

In a unified architecture (using MongoDB Atlas Vector Search): We store both the article data and its embedding vector in a single MongoDB document. An Atlas Vector Search index on the embedding field allows us to perform a similarity search directly within MongoDB using a single query. The database internally uses the vector index to find nearest neighbors and return the documents. Let's examine how the same operations simplify dramatically in a unified architecture:

Creation: Document creation becomes an atomic operation. The application stores both the document and its vector embedding in a single MongoDB document with one insert operation. Either the entire document (with its embedding) is stored successfully, or nothing is stored at all. There's no need for custom rollback logic or cleanup code, since MongoDB's transaction guarantees ensure data integrity without additional application code.

Read: Vector search is streamlined into a single step. Using MongoDB's aggregation pipeline with Atlas Vector Search, the application queries for similar vectors and retrieves the complete documents in a single round trip. There's no need to coordinate between separate systems or handle inconsistencies, as the vector search is directly integrated with document retrieval, substantially reducing both latency and code complexity.

Update: Document updates maintain perfect consistency. When updating a document's content, the application can atomically update both the document and its vector embedding in a single operation.
Update: Document updates maintain perfect consistency. When updating a document's content, the application can atomically update both the document and its vector embedding in a single operation. MongoDB's transactional guarantees ensure that either both are updated or neither is, eliminating the possibility of out-of-sync data representations. Developers no longer need to implement complex recovery mechanisms for partial failures.

Deletion: The ghost document problem vanishes entirely. When a document is deleted, its vector embedding is automatically removed as well, since they exist in the same document. There's no possibility of orphaned vectors or inconsistent search results. This ensures that search results always reflect the current state of the database, improving both reliability and security.

This unified approach eliminates the entire category of synchronization challenges inherent in split architectures. Developers can focus on building features rather than synchronization mechanisms, monitoring tools, and recovery processes. The system naturally scales without increasing complexity, maintaining consistent performance and reliability even as data volumes grow. Beyond the technical benefits, this translates to faster development cycles, more reliable applications, and ultimately a better experience for end users who receive consistently accurate search results. The vector search and document retrieval happen in one round-trip to the database, which fundamentally transforms both the performance characteristics and operational simplicity of AI-powered applications.

Syncing data: Challenges and "ghost documents"

One of the biggest challenges with the split architecture is data synchronization. Because there are two sources of truth (the operational DB and the vector index), any change to data must be propagated to both. In practice, perfect synchronization is hard — network glitches, bugs, or process failures can result in one store updating while the other doesn't. This can lead to inconsistencies that are difficult to detect and resolve. A notorious example in a split setup is the "ghost document" scenario. A ghost document refers to a situation where the vector search returns a reference to a document that no longer exists (or no longer matches criteria) in the primary database. For instance, suppose an article was deleted or marked private in MongoDB but its embedding was not removed from Elasticsearch. A vector search might still retrieve its ID as a top result – leading your application to try to fetch a document that isn't there or shouldn't be shown. From a user's perspective, this could surface a result that is broken or stale.

Let's go back to our earlier practical scenario: imagine a knowledge base system for customer support where articles are constantly being updated and occasionally removed when they become outdated. When a support agent deletes an article about a discontinued product, the deletion occurs in MongoDB successfully, but due to a network timeout, the corresponding vector deletion in Elasticsearch fails. And yes, that happens, especially with applications handling millions of requests daily. Later, when a customer searches for solutions related to that discontinued product, the vector search in Elasticsearch identifies the now-deleted article as highly relevant and returns its ID. When the application attempts to fetch the full content from MongoDB using this ID, it discovers the document no longer exists. The customer sees a broken link or an error message instead of helpful content, creating a confusing and frustrating experience.
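In code, the split read path ends up carrying a defensive check for exactly this case. A minimal sketch, reusing the hypothetical clients, collection, and index names from the creation sketch above and assuming an Elasticsearch 8.x-style kNN search:

import logging
from bson import ObjectId

log = logging.getLogger("kb-search")

def split_search(query_embedding: list[float], k: int = 5) -> list[dict]:
    # Step 1: similarity search in Elasticsearch returns only IDs.
    hits = es.search(
        index="articles",
        knn={"field": "embedding", "query_vector": query_embedding,
             "k": k, "num_candidates": 100},
    )["hits"]["hits"]
    ids = [ObjectId(h["_id"]) for h in hits]

    # Step 2: second round trip to MongoDB for the actual content.
    docs = {d["_id"]: d for d in mongo_articles.find({"_id": {"$in": ids}})}

    # Ghost documents: IDs the vector index still knows about, but that
    # no longer exist in the operational database.
    ghosts = [i for i in ids if i not in docs]
    if ghosts:
        log.warning("Vector index references deleted documents: %s", ghosts)
    return [docs[i] for i in ids if i in docs]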
What's particularly insidious about this problem is that it can manifest in various ways across the application. Beyond complete document deletion issues, you might encounter:

Stale embeddings: A document is updated in MongoDB with new content, but the vector in Elasticsearch still represents the old version, causing search results that don't match the actual content.

Permission inconsistencies: A document's access permissions change in MongoDB (e.g., from public to private), but it still appears in vector search results for users who shouldn't access it.

Partial updates: Only some fields get updated across the systems, leading to mismatched metadata between what's shown in search previews versus the actual document.

In production environments, development teams often resort to implementing complex workarounds to mitigate these synchronization issues:

Background reconciliation jobs that periodically compare documents across both systems and repair inconsistencies

Outbox patterns where operations are logged to a separate store and retried until successful

Custom monitoring systems specifically designed to detect and alert on cross-database inconsistencies

Manual intervention processes for support teams to address user-reported discrepancies

All these mechanisms represent significant development effort that could otherwise be directed toward building features that deliver real business value. They also introduce additional points of failure and operational complexity. Crucially, a unified architecture avoids this entire class of problems. Since there is only one database, a document that is deleted is automatically removed from any associated indexes within the same transaction. A unified data model makes it practically impossible to have a vector without its document, because they live in the same document. As a result, issues like ghost documents, stale vector references, or needing to reconcile two datastores simply go away. No synchronization needed – when documents and embeddings live in one database, you'll reduce the risk of ghost documents or inconsistent reads.
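To see the contrast in code, here is a minimal sketch of an update and a delete in the unified model, reusing the placeholder names from the earlier sketches; because the document and its embedding travel together, these single-document operations are atomic and need no cross-system cleanup.

def update_article(articles, article_id, new_content: str) -> None:
    # Content and embedding change in one atomic, single-document update.
    articles.update_one(
        {"_id": article_id},
        {"$set": {"content": new_content,
                  "embedding": generate_embedding(new_content)}},
    )

def delete_article(articles, article_id) -> None:
    # Deleting the document removes its vector with it -- nothing is left
    # behind for the search index to point at.
    articles.delete_one({"_id": article_id})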
Trade-offs and considerations

There are several key trade-offs you have to weigh when comparing split and unified architectures for AI data. As mentioned, your choice will affect system complexity, performance characteristics, scalability, cost, and development agility. For AI project leads and enterprise AI leaders, it's vital to understand these considerations; below are a few:

Figure 5. Trade-off comparison: Split vs. MongoDB Unified Architecture.

System complexity vs. data consistency: Maintaining consistency in a split setup requires additional logic and increases system complexity. Every piece of data is effectively handled twice, introducing opportunities for inconsistency and complex failure modes. In a unified architecture, ACID transactions ensure that updates to data and its embedding vector occur together or not at all, simplifying the design and reducing custom error handling code.

Operational overhead vs. performance: A split architecture can leverage specialized engines optimized for similarity queries, but introduces network latency with multiple round trips and increases operational overhead with two systems to monitor. Unified architectures eliminate the extra network hop, potentially reducing query latency. MongoDB Atlas offers optimizations like vector quantization and dedicated search processing nodes that can match or exceed the performance of separate search engines.

Scalability vs. cost efficiency: Split architectures allow independent scaling of components but come with infrastructure cost duplication and data redundancy. A unified architecture consolidates resources while still enabling workload isolation through features like Atlas Search Nodes. This simplifies capacity planning and helps avoid over-provisioning multiple systems.

Maintenance burden vs. developer velocity: Split architectures require substantial "glue code" for integration, dual writes, and synchronization, slowing development and complicating schema changes. Unified architectures let developers focus on application logic with fewer moving parts and a single query language, potentially accelerating time-to-market for AI features.

Future-proofing: Simpler unified architectures make it easier and faster to adopt new capabilities as AI technology evolves. Split systems accumulate technical debt with each component upgrade, while unified platforms can incorporate new features transparently without redesigning integration points.

While some organizations may initially choose a split approach due to legacy systems or specialized requirements, MongoDB's unified architecture with Atlas Vector Search now addresses many historical reasons for separate search engines, offering hybrid search capabilities, accuracy options, and optimization tools within a single database environment.

Choosing the right architecture for AI workloads

When should you choose a split architecture, and when does a unified architecture make more sense? The answer ultimately depends on your specific requirements and constraints. Consider a Split Architecture if you already have significant infrastructure built around a specialized search or vector database and it’s meeting your needs. In some cases, extremely high-scale search applications might be deeply tuned on a separate engine, or regulatory requirements might dictate separate data stores. A split approach can also make sense if one type of workload far outstrips the other (e.g., you perform vector searches on billions of items, but have relatively light transactional operations – though even then, a unified solution with the right indexing can handle surprising scale). Just be prepared to invest in the tooling and engineering effort to keep the two systems in harmony. If you go this route, design your sync processes carefully and consider using change streams or event buses to propagate changes reliably. Also, weigh the operational cost: maintaining expertise in two platforms and the integration between them is non-trivial.

Consider a Unified Architecture if you are building a new AI-powered application or modernizing an existing one, and you want simplicity, consistency, and speed of development. If avoiding the pitfalls of data sync and reducing operational complexity are priorities, unified is a great choice. A unified platform shines when your application needs tight integration between operational and vector data – for example, performing a semantic search with runtime filters on metadata (see the sketch below), or updating content and immediately reflecting it in search results. With a solution like MongoDB’s modern data platform, you get a fully managed, cloud-ready database that can handle both your online application needs and AI search needs under one roof. This leads to faster development cycles (since your team can work with one system and one query language) and greater confidence that your search results reflect the true state of your data at any moment.
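As an illustration of that kind of query, the $vectorSearch stage accepts a pre-filter on indexed metadata fields in the same round trip. A minimal sketch, reusing the collection handle from the earlier unified read sketch and assuming query_embedding holds the query vector; the status field and its value are placeholders and would need to be declared as filter fields in the vector index definition:

pipeline = [
    {"$vectorSearch": {
        "index": "article_vector_index",
        "path": "embedding",
        "queryVector": query_embedding,
        "numCandidates": 200,
        "limit": 10,
        # Runtime metadata filter applied alongside the similarity search.
        "filter": {"status": {"$eq": "published"}},
    }},
    {"$project": {"title": 1, "content": 1, "status": 1,
                  "score": {"$meta": "vectorSearchScore"}}},
]
filtered_results = list(articles.aggregate(pipeline))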
Figure 6. Unified architecture benefits in MongoDB Atlas Vector Search.

Looking ahead, a unified architecture is arguably the more future-proof approach. AI capabilities evolve at an accelerated pace, so having your data in one place allows you to leverage new features immediately. We work with AI customers building sophisticated AI applications, and one key observation is the requirement to streamline data processing operations within AI applications that leverage RAG pipelines or Agentic AI. Critical operations include chunking, embedding generation, vector search, and reranking. We've also brought Voyage AI’s state-of-the-art embedding models and rerankers to MongoDB. Soon, these models will reside within MongoDB Atlas, enabling the conversion of data objects into embeddings and enforcing an additional layer of data management in retrieval pipelines, all within MongoDB Atlas. This step is one of the key ways MongoDB continues to bring intelligence to the data layer and create a truly intelligent data foundation for AI applications. MongoDB's Atlas platform is continually expanding its AI-focused features – from vector search improvements to integration with data streams and real-time analytics – all while ensuring the core database guarantees (like ACID transactions and high availability) remain solid. This means you don't have to re-architect your data layer to adopt the next big advancement in AI; your existing platform grows to support it.

Ultimately, the split vs. unified architecture debate is a classic example of balancing specialization against simplicity. Split systems can offer best-of-breed components for each task, but at the cost of complexity and potential inconsistency. Unified systems offer elegance and ease, bundling capabilities in one place, and have rapidly closed the gap in terms of features and performance. Let’s end on this: MongoDB was built for change, and that ethos is exactly what organizations need as they navigate the AI revolution. By consolidating your data infrastructure and embracing technologies that unify capabilities, you equip your teams with the freedom to experiment and the confidence to execute. The future will belong to those who can harness AI and data together seamlessly. It’s time to evaluate your own architecture and make sure it enables you to ride the wave of AI innovation, and not be washed away by it. In an AI-first era, the ability to adapt quickly and execute with excellence is what separates and defines leaders. The choice of database infrastructure is a pivotal part of that execution. Choose wisely – your next breakthrough might depend on it. Try MongoDB Atlas for free today, or head over to our Atlas Learning Hub to boost your MongoDB Atlas skills!
People Who Ship: Building Centralized AI Tooling
Welcome to People Who Ship! In this new video and blog series, we'll be bringing you behind-the-scenes stories and hard-won insights from developers building and shipping production-grade AI applications using MongoDB. In each month's episode, your host—myself, Senior AI Developer Advocate at MongoDB—will chat with developers from both inside and outside MongoDB about their projects, tools, and lessons learned along the way. Are you a developer? Great! This is the place for you; People Who Ship is by developers, for developers. And if you're not (yet) a developer, that's great too! Stick around to learn how your favorite applications are built.

In this episode, John Ziegler, Engineering Lead on MongoDB's internal generative AI (Gen AI) tooling team, shares technical decisions made and practical lessons learned while developing a centralized infrastructure called Central RAG (RAG = Retrieval Augmented Generation), which enables teams at MongoDB to rapidly build RAG-based chatbots and copilots for diverse use cases.

John’s top three insights

During our conversation, John shared a number of insights learned during the Central RAG project. Here are the top three:

1. Enforce access controls across all operations

Maintaining data sensitivity and privacy is a key requirement when building enterprise-grade AI applications. This is especially important when curating data sources and building centralized infrastructure that teams and applications across the organization can use. In the context of Central RAG, for example, users should only be able to select or link data sources that they have access to, as knowledge sources for their LLM applications. Even at query time, the LLM should only pull information that the querying user has access to, as context to answer the user's query. Access controls are typically enforced by an authentication service using access control lists (ACLs) that define the relationships between users and resources. In Central RAG, this is managed by Credal’s permissions service. You can check out this article that shows you how to build an authentication layer using Credal’s permissions service, and other tools like OpenFGA.

2. Anchor your evaluations in the problem you are trying to solve

Evaluation is a critical aspect of shipping software, including LLM applications. It is not a one-and-done process—each time you change any component of the system, you need to ensure that it does not adversely impact the system's performance. The evaluation metrics depend on your application's specific use cases. For Central RAG, which aims to help teams securely access relevant and up-to-date data sources for building LLM applications, the team incorporates the following checks in the form of integration and end-to-end tests in their CI/CD pipeline:

Ensure access controls are enforced when adding data sources.

Ensure access controls are enforced when retrieving information from data sources.

Ensure that data retention policies are respected, so that removed data sources are no longer retrieved or referenced downstream.

LLM-as-a-judge to evaluate response quality across various use cases with a curated dataset of question-answer pairs (see the sketch below).

If you would like to learn more about evaluating LLM applications, we have a detailed tutorial with code.
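The LLM-as-a-judge check mentioned above can be as simple as a test that grades the application's answer against a curated reference. The sketch below is purely illustrative; the client, model name, prompt, and application hook are assumptions, not Central RAG's actual test suite:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a RAG chatbot.\n"
    "Question: {question}\nReference answer: {reference}\nModel answer: {answer}\n"
    "Reply with only PASS or FAIL."
)

def rag_app(question: str) -> str:
    # Placeholder for the application under test.
    raise NotImplementedError

def judge(question: str, reference: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

def test_response_quality():
    # One curated question-answer pair; a real suite would loop over a dataset.
    question = "How do I request access to a new data source?"
    reference = "Submit a request through the data source catalog and wait for owner approval."
    assert judge(question, reference, rag_app(question))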
3. Educate your users on what’s possible and what’s not

User education is critical yet often overlooked when deploying software. This is especially true for this new generation of AI applications, where explaining best practices and setting clear expectations can prevent data security issues and user frustration. For Central RAG, teams must review the acceptable use policies, legal guidelines, and documentation on available data sources and appropriate use cases before gaining access to the platform. These materials also highlight scenarios to avoid, such as connecting sensitive data sources, and provide guidance on prompting best practices to ensure users can effectively leverage the platform within its intended boundaries.

John’s AI tool recommendations

The backbone of Central RAG is a tool called Credal. Credal provides a platform for teams to quickly create AI applications on top of their data. As maintainers of Central RAG, John’s team uses Credal to create a curated list of data sources for teams to choose from and to manage applications created by different teams. Teams can choose from the curated list or connect custom data sources via connectors, select from an exhaustive list of large language models (LLMs), configure system prompts, and deploy their applications to platforms like Slack, etc., directly from the Credal UI or via their API.

Surprising and delighting users

Overall, John describes his team’s goal with Central RAG as “making it stunningly easy for teams to build RAG applications that surprise and delight people.” We see several organizations adopting this central RAG model both to democratize the development of AI applications and to reduce their teams' time to impact. If you are working on similar problems and want to learn about how MongoDB can help, submit a request to speak with one of our specialists. If you would like to explore on your own, check out our self-paced AI Learning Hub and our gen AI examples GitHub repository.
Advancing Integration Between Drupal and MongoDB
MongoDB regularly collaborates with open source innovators like David Bekker, a Drupal core contributor with over 600 commit credits. David's expertise lies in Drupal's Database API and database driver modules, and he's passionate about delivering business value through open source development. Drupal is a widely used open-source content management system known for its robustness and flexibility, enabling users to create everything from personal blogs to enterprise-level applications. While Drupal typically relies on relational databases (e.g., MySQL), there has been growing interest in the Drupal community in exploring how modern databases like MongoDB can improve efficiency. In this guest post, David explores integrating MongoDB with Drupal to enhance its performance and scalability, helping Drupal remain competitive in the digital landscape. - Rishabh Bisht, Product Manager, Developer Experience

Who am I?

Hello! My name is David Bekker (a.k.a. daffie), and I’m a seasoned Drupal core contributor with over 600 commit credits. I maintain Drupal’s Database API and database driver modules. My passion lies in open source development, driven by a desire to create maximum business value. When I was looking for a new high-impact project to work on, I chose to develop a MongoDB driver for Drupal—one that stores entity instances as JSON objects. This project addresses Drupal’s evolving needs in a meaningful way.

User-centric innovation: Drupal’s next evolution

Drupal is rapidly evolving, making it particularly suitable for community and client portal solutions. This progression introduces new technical requirements, especially for authenticated, session-based scenarios like intranets and dashboards, which benefit from more adaptable storage solutions. While Drupal's abstract database layer remains tied to the relational model, embracing non-relational databases would better support its evolving needs for modern applications. To understand why this shift is crucial, let's compare this transition to a challenge Drupal faced years ago: optimizing sites for mobile devices. Back then, significant changes were needed to enhance mobile usability. Now, we face a similar paradigm shift as the market evolves from sites for anonymous users to those centered on authenticated users. Drupal must adapt, and Drupal on MongoDB is the key to this transformation. MongoDB, with its flexible, JSON-based structure, complements Drupal's architecture well. A robust integration with MongoDB would enhance capabilities and better equip Drupal to meet the expanding demands of enterprises. Beyond traditional use cases, Drupal on MongoDB is also ideal as a backend for iOS, Android, and JavaScript applications, providing personalized and scalable solutions.

Redefining data storage and retrieval

Drupal on MongoDB is more than just a new database option. It enhances Drupal’s ability to compete in a changing digital landscape. Drupal’s robust entity system provides a solid foundation where everything is structured as an entity. Traditionally, Drupal leverages relational databases like MySQL or MariaDB, efficiently managing data across multiple tables. This approach performs well for sites with a large number of anonymous users. However, for sites with many authenticated users, the complexity of retrieving entity data from multiple tables can introduce performance challenges. Optimizing data retrieval can significantly enhance the user experience, making Drupal even more powerful for dynamic, user-centric applications.
With MongoDB, every Drupal entity instance is stored as a single JSON object, including all revisions, translations, and field data. This streamlined data structure allows for significantly faster retrieval, making Drupal a stronger solution for personalized, user-focused experiences. As the market shifts toward authentication-driven sites, supporting MongoDB ensures that Drupal remains a competitive and scalable option. Rather than replacing Drupal’s strengths, this integration enhances them, allowing Drupal to meet modern performance demands while maintaining its flexibility and power.

Scalability: Why MongoDB makes sense for large Drupal projects

The scalability of non-relational databases like MongoDB sets them apart from traditional relational databases such as MySQL or MariaDB. While relational databases typically rely on a single-server model, MongoDB supports horizontal scaling, enabling distributed setups with thousands of servers acting as a unified database. This architecture provides the performance needed for large-scale projects with millions of authenticated users. As community-driven software, Drupal is built to support interactive, user-focused experiences, including forums, profiles, and content management. Traditionally, its relational model organizes data across multiple tables—similar to storing the chapters of a book separately in a library. This approach ensures data consistency and flexibility, making it highly effective for managing structured content. However, as the demand for authentication-heavy sites grows, the way data is stored becomes a crucial factor in performance. MongoDB offers a more efficient alternative by storing entire entities as JSON objects—just like keeping an entire book intact rather than splitting it into separate chapters across different locations. This eliminates the need for complex table joins, significantly accelerating data retrieval and making MongoDB well suited for personalized dashboards and dynamic content feeds. For small-scale sites, both relational and non-relational approaches work. But when scalability, speed, and efficiency become priorities—particularly for sites with millions of authenticated users—MongoDB provides a natural and powerful solution for taking Drupal to the next level.

Example of a user entity stored in MongoDB

The sample document below is an example of what a user entity could look like in MongoDB, containing fields like _id, uid, uuid, and langcode. It includes an embedded user_translations array that holds user details such as name, email, timezone, status, and timestamps for various activities.
{
  _id: ObjectId('664afdd4a3a001e71e0b49c7'),
  uid: 1,
  uuid: '841149cd-fe56-47c4-a112-6d23f561332f',
  langcode: 'en',
  user_translations: [
    {
      uid: 1,
      uuid: '841149cd-fe56-47c4-a112-6d23f561332f',
      langcode: 'en',
      preferred_langcode: 'en',
      name: 'root',
      pass: '$2y$10$kjGuIsPOTDa2TseuWMFGS.veLzH/khl0SfsuZNAeRPRtABgfq5GSC',
      mail: 'admin@example.com',
      timezone: 'Europe/Amsterdam',
      status: true,
      created: ISODate('2024-05-20T07:37:54.000Z'),
      changed: ISODate('2024-05-20T07:42:08.000Z'),
      access: ISODate('2024-05-20T08:46:47.000Z'),
      login: ISODate('2024-05-20T07:44:16.000Z'),
      init: 'admin@example.com',
      default_langcode: true,
      user_translations__roles: [
        {
          bundle: 'user',
          deleted: false,
          langcode: 'en',
          entity_id: 1,
          revision_id: 1,
          delta: 0,
          roles_target_id: 'administrator'
        }
      ]
    }
  ],
  login: ISODate('2024-05-20T07:44:16.000Z'),
  access: ISODate('2024-05-20T08:46:47.000Z')
}

Optimizing data storage for performance

Switching to MongoDB alone isn’t enough to make Drupal a top-tier solution for sites with a high number of authenticated users. Developers must also rethink how data is stored. In traditional Drupal setups optimized for anonymous users, caching mechanisms like Redis compensate for slow database queries. However, for authenticated users, where content is dynamic and personalized, this approach falls short. Drupal itself needs to be fast, not just its caching layer. MongoDB enables developers to store data in the way the application uses it, reducing the need for complex queries that slow down performance. Instead of relying on costly operations like joins and subqueries, simple and efficient queries should be the norm. Tools like materialized views—precomputed query results stored as database tables—help achieve this, ensuring faster data retrieval while keeping the database structured for high performance (a short sketch appears before the conclusion below).

Why MongoDB for Drupal?

While many databases support JSON storage, MongoDB is the only one that fully meets Drupal’s needs. Its capabilities extend beyond basic JSON support, making it the optimal choice for storing entity instances efficiently. Additionally, MongoDB offers several key advantages that align with Drupal’s evolving requirements:

Horizontal scaling: Easily distribute database load across multiple servers, making it scalable for large user bases.

Integrated file storage: Store user-uploaded files directly in the database instead of on the web server, simplifying hosting.

Built-in full-text search: Eliminates the need for separate search solutions like SOLR, reducing infrastructure complexity.

AI capabilities: Supports AI vectors, allowing for features like advanced search and personalization tailored to a site’s content.

Current status

Drupal’s journey to embracing more flexible data storage solutions is advancing with promising developments:

The MongoDB driver for Drupal is available as a contrib module for Drupal 11, with over 99% of core tests passing.

Discussions are ongoing to merge MongoDB support into Drupal core, pending community contributions.

Finalist / Tech Blog is already running entirely on MongoDB.

These steps mark a significant transition for Drupal, showcasing its evolution towards accommodating non-relational databases like MongoDB. It paves the way for broader applications and more robust infrastructure by leveraging MongoDB’s strengths in flexibility and scalability.
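To ground these ideas in code, here is a minimal sketch in Python with pymongo; the connection string, database, and collection names are placeholders rather than the driver's actual naming. It shows the access pattern this model enables: one query fetches a complete entity, and a $merge aggregation maintains a simple on-demand materialized view.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["drupal"]

# One round trip returns the complete user entity shown above --
# translations, roles, and timestamps included -- with no table joins.
user = db["users"].find_one({"uid": 1})

# A simple on-demand materialized view: precompute the number of active
# users per timezone and store the result in its own collection.
db["users"].aggregate([
    {"$unwind": "$user_translations"},
    {"$match": {"user_translations.status": True}},
    {"$group": {"_id": "$user_translations.timezone",
                "active_users": {"$sum": 1}}},
    {"$merge": {"into": "active_users_by_timezone",
                "whenMatched": "replace", "whenNotMatched": "insert"}},
])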
Conclusion

As the web moves toward more personalized, user-centric experiences, Drupal must evolve to remain competitive. MongoDB is a key enabler of this evolution, providing faster, more scalable solutions for authenticated user-heavy sites. By embracing MongoDB, Drupal developers can unlock new performance possibilities, simplify infrastructure, and build future-ready web applications. Check out the tutorial on how to run Drupal on MongoDB Atlas and start experiencing the benefits of this powerful integration today! Want to get involved? Join the conversation in the Drupal community via Slack in the #mongodb and #contribute channels. Let’s shape the future of Drupal together!