

Innovating with MongoDB | Customer Successes, July 2025

How time flies! Summer is in full swing, and it’s already time for another MongoDB customer success roundup. This month, we’re focusing on customers who have combined the flexibility of MongoDB Atlas with cutting-edge AI advancements to unlock insights and fuel innovation.

Let’s be honest—AI is everywhere. We’re intrigued, inspired, and maybe a little overwhelmed by its possibilities. But the hype exists for a good reason: AI is a groundbreaking technology that’s poised to transform every industry, job, and task, and it’s fundamentally changing how software interacts with data. We’re quickly learning, though, that delivering meaningful outcomes with AI requires the right infrastructure. With MongoDB Atlas, companies are leveraging vector search, seamless document modeling, and large language model (LLM) integrations to make smarter use of their data in real time. Whether that means enhancing engagement, simplifying decision-making, or enabling more efficient processes, MongoDB is helping organizations redefine how they leverage AI to solve critical challenges and create lasting impact.

In this issue, I’m particularly excited to highlight an impactful platform developed by CentralReach in the autism care space – a cause near and dear to my family. They, along with customers like the Financial Times, Ubuy, and Base39, are demonstrating AI’s possibilities and transforming how data powers success.

Ubuy

Ubuy, an e-commerce platform serving customers in over 180 countries, needed a faster, more scalable solution to manage its catalog of over 300 million products. They were facing significant search performance bottlenecks, which impacted user experience and limited growth potential. By migrating from MySQL to MongoDB Atlas and leveraging Atlas Search and Atlas Vector Search, Ubuy reduced search response times from 4–5 seconds to milliseconds and enabled intent-driven product discovery powered by AI. Now, Ubuy easily handles over 150 million searches annually while delivering personalized recommendations and seamless scalability. AI-driven search enhancements have boosted customer engagement and SEO visibility, transforming global e-commerce and redefining how Ubuy customers access international products.

Financial Times

The Financial Times (FT), a global leader in business journalism, wanted to deliver a hybrid search experience that combined traditional keyword precision with AI-driven discovery. With over a million daily searches, scaling this innovative solution quickly was critical. Using MongoDB Atlas—including Atlas Vector Search—the FT developed its AI-powered hybrid search in just 18 weeks. By blending full-text and semantic search capabilities, the solution delivers relevant recommendations instantly, enhancing content discovery for time-strapped readers. Partnering with MongoDB streamlined deployment, enabling the FT to surface hyper-relevant results while positioning itself as a leader in media innovation. With plans to roll out hybrid search across mobile apps and specialist titles next, the FT continues redefining how readers engage with trusted journalism in an AI-enabled world.

Through our partnership with The Stack, learn how our customers are achieving extraordinary results with MongoDB. This exclusive content could spark the insights you need to drive your business forward.
CentralReach

CentralReach, a global leader in autism and intellectual and developmental disability (IDD) care technology, faced the challenge of managing 4 billion clinical data points annually while reducing administrative burdens for behavioral analysts. By building its Care360 platform on MongoDB Atlas and joining the MongoDB AI Applications Program (MAAP), CentralReach unified data across 62 million service appointments per year. With flexible document modeling, vector search, and advanced AI pipelines, the platform enables seamless access to patient records and intelligent querying, reducing manual workflows and improving care consistency. CentralReach’s AI-powered solution has streamlined processes, reduced documentation errors, and helped expand access to care for hundreds of thousands globally. With MongoDB Atlas’s scalability and powerful AI integrations, CentralReach is redefining autism care delivery.

Base39

Base39, a Brazil-based fintech, set out to streamline complex credit analysis using AI-driven insights. Manual processes and data scarcity limited efficiency and accuracy, often delaying loan assessments by up to 10 days. By leveraging MongoDB Atlas on AWS, as well as Atlas Vector Search and LLM integrations, Base39 transformed its workflow. With agentic AI and predictive algorithms, loan applications are now assessed in minutes, achieving 96% cost reductions and improved data insights. MongoDB’s flexible schema and native vector search capabilities helped boost productivity while cutting infrastructure costs by 84%. By empowering developers to focus on innovation instead of management, Base39 has set a new standard in AI-powered credit analysis.

Video spotlight: Cisco

Before you go, check out how Cisco is redefining innovation with generative AI while prioritizing security. Omar Santos, Distinguished Engineer at Cisco, shares how MongoDB Atlas Vector Search accelerated development and saved millions through smarter, safer AI applications.

Want to get inspired by your peers and discover all the ways we empower businesses to innovate for the future? Visit MongoDB’s Customer Success Stories hub to see why these customers, and so many more, build modern applications with MongoDB.

July 17, 2025

PLAID, Inc. Optimizes Real-Time Data With MongoDB Atlas Stream Processing

A MongoDB customer since 2015, Tokyo, Japan-based PLAID, Inc. works to “maximize the value of people with the power of data,” according to the company’s mission statement. PLAID’s customer experience platform, KARTE, analyzes and visualizes website and application users’ data in real time, offering the company’s customers a one-stop solution that helps them better understand their customers and provide personalized experiences.

After running a self-hosted instance of MongoDB for several years, in 2021, PLAID adopted MongoDB Atlas, a fully managed suite of cloud database services. Subsequently, however, the company ran into real-time data challenges. Specifically, PLAID faced challenges when trying to migrate an existing batch processing system that sent real-time data from MongoDB Atlas to Google BigQuery, which helps organizations “go from data to AI action faster.” While their initial cloud setup with Kafka connectors provided valuable streaming capabilities by capturing events from MongoDB and streaming them to BigQuery, the complexity tied to the number of pipelines became a concern. The staging environment, which required duplicate pipelines, further exacerbated the issue, and rising costs could hinder PLAID's ability to scale and expand its real-time data processing system efficiently.

Easy event data processing with Atlas Stream Processing

To address these challenges, PLAID turned to MongoDB Atlas Stream Processing, which enables development teams to process streams of complex data using the same query API used in their MongoDB Atlas databases. Atlas Stream Processing provided PLAID with a cost-effective way of acquiring and processing event data in real time, all while being natively integrated within their existing MongoDB Atlas environment for a seamless developer experience. This allowed them to replace some of their costly Kafka source connectors while maintaining the overall data flow to BigQuery via their existing Confluent Cloud Kafka setup. Key aspects of the solution included:

- Replacing Kafka source connectors: Atlas Stream Processing efficiently captures event data from MongoDB Atlas databases and writes it to Kafka, reducing costs associated with the previous Kafka source connectors.
- Stream processing instances (SPIs): PLAID used SPIs, where cost is determined by the instance tier and the number of workers, which in turn depends on the number of stream processors. This offered a more optimized cost structure compared to the previous connector-task-based pricing.
- Connection management: Atlas Stream Processing simplifies connection management. Connecting to Atlas databases is straightforward, and a single connection can be used for the Kafka cluster.
- Stream processors: These processing units perform data transformation and routing with the same aggregation pipelines used by MongoDB databases. The PLAID team was thus able to leverage their existing MongoDB knowledge to define pipeline logic, making the transition smoother (see the pipeline sketch below).
- Custom backfill mechanism: To address the lack of a backfill feature in Atlas Stream Processing, PLAID developed a custom application to synchronize existing data.
- Custom metric collection: Since native monitoring integration with Datadog was unavailable, PLAID created a bot to collect Atlas Stream Processing metrics and send them to Datadog for monitoring and alerting.
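To make the "stream processors" point more concrete, here is a minimal sketch of what such a pipeline could look like, expressed as a Python data structure. The connection, database, collection, and topic names are illustrative assumptions, and this is not PLAID's actual pipeline; in practice the pipeline is registered with an Atlas Stream Processing instance (for example via mongosh or the Atlas UI).

```python
import json

# Hypothetical Atlas Stream Processing pipeline: read change events from an
# Atlas collection and emit them to a Kafka topic for downstream delivery
# (e.g., to BigQuery via an existing sink connector).
pipeline = [
    {
        # Read the change stream of a collection through a pre-configured
        # Atlas connection (connection names are defined on the SPI).
        "$source": {
            "connectionName": "atlasCluster",  # assumed connection name
            "db": "karte",                     # assumed database
            "coll": "events",                  # assumed collection
        }
    },
    {
        # Only forward inserts and updates (for updates, fullDocument
        # availability depends on the source's fullDocument setting).
        "$match": {"operationType": {"$in": ["insert", "update"]}}
    },
    {
        # Reshape the event to just the fields downstream consumers need.
        "$project": {
            "_id": 0,
            "type": "$operationType",
            "payload": "$fullDocument",
            "observedAt": "$clusterTime",
        }
    },
    {
        # Write the transformed events to Kafka.
        "$emit": {
            "connectionName": "confluentKafka",  # assumed Kafka connection
            "topic": "karte.events.bq",          # assumed topic name
        }
    },
]

# The pipeline is plain JSON, so it can be reviewed and version-controlled
# like any other aggregation pipeline.
print(json.dumps(pipeline, indent=2))
```

Because the pipeline is just an aggregation pipeline, the same skills a team already uses for queries in Atlas carry over directly.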
“Atlas Stream Processing provided us with a robust solution for real-time data processing, which has significantly reduced costs and improved scalability throughout our platform.”
— Hajime Shiozawa, senior software engineer, PLAID, Inc.

The outcome: Lower costs, improved efficiency

By implementing MongoDB Atlas Stream Processing, PLAID achieved significant improvements, from reduced costs to operational efficiencies:

- Reduced costs: PLAID eliminated the cost structure that was proportional to the number of pipelines, resulting in substantial cost savings. The new cost model based on Atlas Stream Processing workers offered a more scalable and predictable pricing structure.
- Improved scalability: The optimized architecture allowed PLAID to scale their real-time data processing system efficiently, supporting the addition of new products and Atlas clusters without escalating costs.
- Simplified management: Because Stream Processing is a native MongoDB Atlas capability, it simplified connection management and pipeline configuration, reducing operational overhead.
- Stable operation: PLAID successfully deployed and operated more than 20 pipelines, processing over 3 million events per day to BigQuery.
- Enhanced real-time data capabilities: The improved system strengthened the real-time nature of their data, improving operational efficiency.

MongoDB Atlas Stream Processing provided PLAID with a robust and cost-effective solution for real-time data processing to BigQuery. By replacing costly Kafka source connectors and optimizing their architecture, PLAID significantly reduced costs and improved scalability. The seamless integration with MongoDB Atlas and the developer-friendly API further enhanced their operational efficiency. PLAID’s success with Atlas Stream Processing demonstrates that it is a valuable tool for organizations looking to streamline their data integration pipelines and leverage real-time data effectively.

To learn how Atlas Stream Processing helps organizations integrate MongoDB with Apache Kafka to build event-driven applications, see the MongoDB Atlas Stream Processing page.

July 17, 2025

Embedded Objects and Other Index Gotchas

In a recent design review, the customer's application was in production, but performance had taken a nosedive as data volumes grew. It turned out that the issue was down to how they were indexing the embedded objects in their documents. This article explains why their indexes were causing problems, and how they could be fixed. Note that I've changed details for this use case to obfuscate the customer and application. All customer information shared in a design review is kept confidential.

We looked at the schema, and things looked good. They'd correctly split their claim information across two documents:

- One contained a modest amount of queryable data (20 KB per claim). These documents included the _id of the second document in case the application needed to fetch it (which was relatively rare).
- The second contained the bulky raw data that's immutable, unindexed, and rarely read.

They had 110K queryable documents in the first collection—claims. With 2.2 GB of documents (before compression, which only reduces on-disk size) and 4 GB of cache, there shouldn't have been any performance issues.

We looked at some of the queries. A fairly wide set of keys was being filtered on, in different combinations, but none of the queries returned massive numbers of documents. Yet some queries were taking tens of seconds. It made no sense. Even a full collection scan should take well under a second for this configuration. And they'd even added indexes for their common queries.

So then, we looked at the indexes…

Figure 1. Collection size report in MongoDB Atlas.

Fifteen indexes on one collection is on the high side and could slow down your writes, but it's the read performance that we were troubleshooting. Those 15 indexes, though, were consuming 85 GB of space. With the 4 GB of cache available on their M30 Atlas nodes, that’s a huge problem! There wasn't enough RAM in the system for the indexes to fit in cache. The result was that when MongoDB navigated an index, it would repeatedly hit branches that weren't yet in memory and have to fetch them from disk. That’s slow.

Taking a look at one of the indexes…

Figure 2. Index definition in MongoDB Atlas.

It's a compound index on six fields, but the first five of those fields are objects, and the sixth is an array of objects—this explains why the indexes were so large.

Avoiding indexes on objects

Even ignoring the size of the index, adding objects to an index can be problematic. Querying on embedded objects doesn't behave in the way that many people expect. If an index on an embedded object is to be used, then the query needs to include every field in the embedded object.
E.g., if I execute this query, then it matches exactly one of the documents in the database:

```javascript
db.getCollection('claim').findOne(
  {
    "policy_holder": {
      "first_name": "Janelle",
      "last_name": "Nienow",
      "dob": new Date("2024-12-16T23:56:49.643Z"),
      "location": {
        "street": "67628 Warren Road",
        "city": "Padbergstead",
        "state": "Minnesota",
        "zip_code": "44832-7187"
      },
      "contact": {
        "email": "Janelle.Nienow@noxious-flood.org"
      }
    }
  }
);
```

It delivers this result:

```javascript
{
  "_id": { "$oid": "67d801b7ad415ad6165ccd5f" },
  "region": 12,
  "policy_holder": {
    "first_name": "Janelle",
    "last_name": "Nienow",
    "dob": { "$date": "2024-12-16T23:56:49.643Z" },
    "location": {
      "street": "67628 Warren Road",
      "city": "Padbergstead",
      "state": "Minnesota",
      "zip_code": "44832-7187"
    },
    "contact": { "email": "Janelle.Nienow@noxious-flood.org" }
  },
  "policy_details": {
    "policy_number": "POL554359100",
    "type": "Home Insurance",
    "coverage": {
      "liability": 849000000,
      "collision": 512000,
      "comprehensive": 699000
    }
  },
  ...
}
```

The explain plan confirmed that MongoDB was able to use one of the defined indexes:

Figure 3. The visual explain plan tool in MongoDB Atlas displaying that the compound index on policy_holder and messages was used.

If just one field from the embedded object isn't included in the query, then no documents will match:

```javascript
db.getCollection('claim').findOne(
  {
    "policy_holder": {
      "first_name": "Janelle",
      "last_name": "Nienow",
      "dob": new Date("2024-12-16T23:56:49.643Z"),
      "location": {
        "street": "67628 Warren Road",
        "city": "Padbergstead",
        "state": "Minnesota",
        // "zip_code": "44832-7187"
      },
      "contact": {
        "email": "Janelle.Nienow@noxious-flood.org"
      }
    }
  }
);
```

This resulted in no matches—though the index is at least still used.

If we instead pick out individual fields from the object to query on, then we get the results we expect:

```javascript
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": "Janelle",
    "policy_holder.last_name": "Nienow"
  }
);
```

```javascript
{
  "_id": { "$oid": "67d801b7ad415ad6165ccd5f" },
  "region": 12,
  "policy_holder": {
    "first_name": "Janelle",
    "last_name": "Nienow",
    "dob": { "$date": "2024-12-16T23:56:49.643Z" },
    "location": {
      "street": "67628 Warren Road",
      "city": "Padbergstead",
      "state": "Minnesota",
      "zip_code": "44832-7187"
    },
    "contact": { "email": "Janelle.Nienow@noxious-flood.org" }
  },
  "policy_details": {
    "policy_number": "POL554359100",
    "type": "Home Insurance",
    "coverage": {
      "liability": 849000000,
      "collision": 512000,
      "comprehensive": 699000
    }
  },
  ...
}
```

Unfortunately, none of the indexes that included policy_holder could be used, as they index the value of the complete embedded object rather than the individual fields within it, and so a full collection scan was performed:

Figure 4. The visual explain plan tool warning that no index was available.

Using compound indexes instead

If we instead add a compound index that leads with the fields from the object we need to filter on, then that index will be used:

Figure 5. Creating an index in MongoDB Atlas.

Figure 6. Explain plan providing information for the compound index.

As a quick refresher on using compound indexes, that index will be used if we query on just first_name:

```javascript
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": "Janelle",
    // "policy_holder.last_name": "Nienow"
  }
);
```

Figure 7. Explain plan showing that the compound index was used.

If we don't include the first key in the compound index, then it won't be used:

```javascript
db.getCollection('claim').findOne(
  {
    // "policy_holder.first_name": "Janelle",
    "policy_holder.last_name": "Nienow"
  }
);
```
Figure 8. Explain plan providing more information on the query.

However, you can use the index if you artificially include the leading keys in the query (though it would be more efficient if last_name had been the first key in the index):

```javascript
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": { $exists: true },
    "policy_holder.last_name": "Nienow"
  }
);
```

Figure 9. Explain plan showing the data for the index.

Incompletely indexed queries

While having indexes for your queries is critical, there is a cost to having too many of them, or to having indexes that include too many fields—writes get slower and pressure increases on cache occupancy. Sometimes, it's enough to have an index that does part of the work, and then rely on a scan of the documents found by the index to check the remaining keys. For example, the policy holder’s home state isn't included in our compound index, but we can still query on it:

```javascript
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": "Janelle",
    "policy_holder.location.state": "Kentucky"
  }
);
```

Figure 9. Explain plan shows that the index narrowed down the problem.

The explain plan shows that the index narrowed down the search from 110,000 documents to 111, which were then scanned to find the three matching documents. If it's rare for the state to be included in the query, then this can be a good solution.

Partial indexes

The main challenge in this design review was the size of the indexes, so it's worth looking into another approach to limit the size of an index. Imagine that we need to be able to check on the names and email addresses of witnesses to accidents. We can add an index on the relevant fields:

Figure 10. Adding an index to the relevant fields in Atlas.

This index consumes 9.8 MB of cache space and must be updated when any document is added, or when any of these three fields are updated. Even if a document has null values for the indexed fields, or if the fields aren’t even present in the document, the document will still be included in the index.

If we look deeper into the requirements, we might establish that we only need to query this data for fraudulent claims. That means that we're wasting space in our index on entries for all of the other claims. We can exploit this requirement by creating a partial index, setting the partial filter expression to { "claim.status": "Fraud" }. Only documents that match that pattern will be included in the index.

Figure 11. Creating a partial filter in Atlas.

That reduces the size of the index to 57 KB (a saving of more than 99%):

Figure 12. Index sizing report.

Note that queries must include { "claim.status": "Fraud" } for this index to be used:

```javascript
db.getCollection('claim').findOne(
  {
    "witnesses.email": "Sammy.Bergstrom@hotmail.com",
    "claim.status": "Fraud"
  }
);
```

Figure 13. Explain plan providing details on the index keys and document details.

Conclusion

Indexes are critical to database performance, whether you're using an RDBMS or MongoDB. MongoDB allows polymorphic documents, arrays, and embedded objects that aren't available in a traditional RDBMS. This leads to extra indexing opportunities, but also potential pitfalls. You should have indexes to optimize all of your frequent queries, but use the wrong type, or too many of them, and things could backfire. We saw that in this case, with indexes taking up too much space and not being as general purpose as the developer believed. To compound problems, the database may perform well in development and for the early days in production.
Things go wrong over time as the collections grow and extra indexes are added. As soon as the working data set (indexes and documents) no longer fits in the cache, performance quickly declines. Well-informed use of compound and partial indexes will ensure that MongoDB delivers the performance your application needs, even as your database grows.

Learn more about MongoDB design reviews

Design reviews are a chance for a design expert from MongoDB to advise you on how best to use MongoDB for your application. The reviews are focused on making you successful with MongoDB. It's never too early to request a review. By engaging us early (perhaps before you've even decided to use MongoDB), you get advice while you still have the best opportunity to act on it.

This article explained how a MongoDB schema and set of indexes that match how your application works with data can meet your performance requirements. If you want help coming up with that schema, a design review is how to get it. Would your application benefit from a review? Schedule your design review today.

Want to read more from Andrew? Head to his website.

July 16, 2025

Revolutionizing Inventory Classification with Generative AI

In today's volatile geopolitical environment, the global automotive industry faces compounding disruptions that require a fundamental rethink of data and operations strategy. After decades of low import taxes, the return of tariffs as a tool of economic negotiations has led the global automotive industry to delay model-year transitions and disrupt traditional production and release cycles. As of June 2025, only 3% of US automotive inventory comprises next-model-year vehicles—less than half the number seen at this time in previous years. This severe decline in new-model availability, compounded by a 12.2% year-over-year drop in overall inventory, is pressuring consumer pricing and challenging traditional dealer inventory management. In this environment of constrained supply, better tools are urgently needed to classify and control vehicle, spare part, and raw material inventories for both dealers and manufacturers.

Traditionally, dealerships and automakers have relied on ABC analysis to segment and control inventory by value. This widely used method classifies items into Category A, B, or C. For example, Category A items typically represent just 20% of stock but drive 80% of sales, while Category C items might comprise half the inventory yet contribute only 5% to the bottom line. This approach effectively helps prioritize resource allocation and promotional efforts.

Figure 1. ABC analysis for inventory classification.

While ABC analysis is known for its ease of use, it has been criticized for its focus on dollar usage. For example, not all Category C items are necessarily low-priority, as some may be next-model-year units arriving early or aging stock affected by shifting consumer preferences. Other criteria—such as lead time, commonality, obsolescence, durability, inventory cost, and order size requirements—have also been recognized as critical for inventory classification. A multi-criteria inventory classification (MCIC) methodology, therefore, adds additional criteria to dollar usage. MCIC can be achieved with methods like statistical clustering or unsupervised machine learning techniques.

Yet, a significant blind spot remains: the vast amount of unstructured data that organizations must deal with; unstructured data accounts for an estimated 80% of the world's total. Traditional ABC analysis—and even MCIC—often overlook the growing influence of insights gleaned from unstructured sources like customer sentiment and product reviews on digital channels. But now, valuable intelligence from reviews, social media posts, and dealer feedback can be vectorized and transformed into actionable features using large language models (LLMs). For instance, analyzing product reviews can yield qualitative metrics like the probability of recommending or repurchasing a product, or insights into customer expectations vs. the reality of ownership. This textual analysis can also reveal customers' product perspectives, directly informing future demand. By integrating these signals into inventory classification models, businesses can gain a deeper understanding of true product value and demand elasticity. This fusion of structured and unstructured data represents a crucial shift from reactive inventory management to predictive and customer-centric decision-making. In this blog post, we propose a novel methodology to convert unstructured data into powerful feature sets for augmenting inventory classification models.

Figure 2. Transforming unstructured data into features for machine learning models.
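As a reference point for the ABC method described above, here is a small, self-contained sketch of a classic ABC split. The cutoffs (80%/95% of cumulative value) and the sample data are illustrative assumptions, not figures from this article.

```python
def abc_classify(items, a_cutoff=0.80, b_cutoff=0.95):
    """Assign A/B/C classes by cumulative share of annual dollar usage.

    items: list of (sku, annual_dollar_usage) tuples.
    The cutoffs are illustrative; real deployments tune them per business.
    """
    ranked = sorted(items, key=lambda kv: kv[1], reverse=True)
    total = sum(value for _, value in ranked) or 1.0
    classes, cumulative = {}, 0.0
    for sku, value in ranked:
        cumulative += value / total
        if cumulative <= a_cutoff:
            classes[sku] = "A"
        elif cumulative <= b_cutoff:
            classes[sku] = "B"
        else:
            classes[sku] = "C"
    return classes


# Hypothetical dealer stock: a few SKUs and their annual dollar usage.
stock = [("brake-pads", 120_000), ("ev-battery", 900_000),
         ("wiper-blades", 15_000), ("infotainment", 300_000),
         ("floor-mats", 8_000)]
print(abc_classify(stock))
```

The gen AI approach described in the rest of this post augments exactly this kind of value-based baseline with additional features derived from unstructured data.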
How MongoDB enables AI-driven inventory classification

So, how does MongoDB empower the next generation of AI-driven inventory classification? It comes down to four crucial steps, and MongoDB provides the technology and features to support every one of them.

Figure 3. Methodology and requirements for gen AI-powered inventory classification.

Step 1: Create and store vector embeddings from unstructured data

MongoDB Atlas enables modern vector search workflows. Unstructured data like product reviews, supplier notes, or customer support transcripts can be vectorized via embedding models (such as Voyage AI models) and ingested into MongoDB Atlas, where the embeddings are stored next to the original text chunks. This data then becomes searchable using MongoDB Atlas Vector Search, which allows you to run native semantic search queries directly inside the database. Unlike solutions that require separate databases for structured and vector data, MongoDB stores them side by side using the flexible document model, enabling unified access via one API. This reduces system complexity, technical debt, and infrastructure footprint—and allows for low-latency semantic searches.

Figure 4. Product reviews can be stored as vector embeddings in MongoDB Atlas.

Step 2: Design and store evaluation criteria

In a gen AI-powered inventory classification system, evaluation criteria are no longer a set of static rules stored in a spreadsheet. Instead, the criteria are dynamic and data-backed; they are generated by an AI agent using structured and unstructured data, and enriched by domain experts using business objectives and constraints. As shown in Figure 5, the criteria for features like “Product Durability” can be defined based on relevant unstructured data stored in MongoDB (product reviews, audit reports) as well as structured data like inventory turnover and sales history. Such criteria are not just instructions or rules, but knowledge objects with structure and semantic depth.

The AI agent uses tools such as generate_criteria and embed_criteria and iterates over each product in the inventory. It leverages the LLM to create the criteria definition and uses an embedding model (e.g., voyage-3-large) to generate embeddings of each definition. MongoDB Atlas is uniquely suited to storing these dynamic criteria. Each rule is modeled as a flexible JSON document containing the name of the feature, the criteria definition, the data sources used, and the embeddings. Since there are different types of products (different car models/makes and different car parts), the documents can evolve over time without requiring schema migrations, and they can be queried and retrieved by the AI agent in real time. MongoDB Atlas provides all the necessary tools for this design—a flexible document model database, vector search, and full-text search—that can be leveraged by the AI agent to create the criteria.

Figure 5. Unstructured and structured data are used by the AI agent to create criteria for feature generation.

Step 3: Create an agentic application to perform transformation based on the criteria

In the third step, another AI agent operates over products, criteria, and unstructured data to generate enriched feature sets. This agent iterates over every product and uses MongoDB Atlas Vector Search to find the relevant customer reviews to apply the criteria to, and calculates a numerical feature score (a sketch of this step follows below). The new features are added to the original features JSON document in MongoDB.
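A rough sketch of what this feature-scoring step could look like with PyMongo and Atlas Vector Search is shown below. The collection, index, and field names (reviews, reviews_vector_index, embedding, criteria, products) are illustrative assumptions, as is the scoring rule; the actual agent implementation behind this solution may differ.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")
db = client["inventory"]  # assumed database name

def score_feature(product_id: str, criterion: dict, top_k: int = 20) -> float:
    """Score one feature (e.g., 'durability') for a product by averaging the
    vector-search relevance of its reviews against the criterion's embedding."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "reviews_vector_index",   # assumed vector index name
                "path": "embedding",               # assumed vector field
                "queryVector": criterion["embedding"],
                # assumes product_id is declared as a filter field in the index
                "filter": {"product_id": product_id},
                "numCandidates": 200,
                "limit": top_k,
            }
        },
        {"$project": {"_id": 0, "score": {"$meta": "vectorSearchScore"}}},
    ]
    scores = [doc["score"] for doc in db["reviews"].aggregate(pipeline)]
    return round(sum(scores) / len(scores), 3) if scores else 0.0

# Persist the new feature alongside the product's existing features.
criterion = db["criteria"].find_one({"feature": "durability"})
value = score_feature("SKU-12345", criterion)
db["products"].update_one(
    {"product_id": "SKU-12345"},
    {"$set": {"features.durability": value}},
)
```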
In Figure 6, the agent has created “durability” and “criticality” features from the product reviews. MongoDB Atlas is the ideal foundation for this agentic architecture: it gives the agent the tools it needs for features to evolve, adding new dimensions without requiring a schema redesign. This results in an adaptive classification dataset that contains both structured and unstructured data.

Figure 6. An AI agent enriches product features with vectorized review data to generate new features.

Step 4: Rerun the inventory classification model with the new features added

As a final step, inventory classification domain experts can assign or rebalance weights across existing and new features, choose a classification technique, and rerun the inventory classification to find new inventory classes. Figure 7 shows the process in which generative AI features are used in the existing inventory classification algorithm.

Figure 7. Domain experts can rerun classification after balancing weights.

Figure 8 shows the solution in action. The customer satisfaction score is created by an LLM using the vectorized customer-reviews collection and is then used in the inventory classification model with a weight of 0.2.

Figure 8. Inventory classification using generative AI.

Driving smarter inventory decisions

As the automotive industry navigates slowing sales and uneven inventory, traditional inventory classification techniques also need to evolve. Though such techniques provide a solid foundation, they fall short in the face of geopolitical uncertainty, tariff-driven supply shifts, and fast-evolving consumer expectations. By combining structured sales and consumption data with unstructured insights, and by enabling agentic AI using MongoDB, the automotive industry can usher in a new era of inventory intelligence in which products are dynamically classified based on all available data—both structured and unstructured.

Clone the GitHub repository if you are interested in trying out this solution yourself. To learn more about MongoDB’s role in the manufacturing industry, please visit our manufacturing and automotive webpage.

July 16, 2025

Introducing MongoDB’s Multimodal Search Library For Python

AI applications increasingly rely on a variety of data types—text, images, charts, and complex documents—to drive rich user experiences. For developers building these applications, determining how to effectively search and retrieve information that spans these data types presents a challenge. Developers have to consider different chunking strategies, figure out how to incorporate figures and tables, and manage context that could bleed across chunks.

To simplify this, we're excited to announce the public preview of MongoDB’s Multimodal Search Python Library. This new library makes it easy to build sophisticated applications using multimodal data, providing a single interface for integrating MongoDB Atlas Vector Search, AWS S3, and Voyage AI's multimodal embedding model, voyage-multimodal-3. The library handles:

- Processing and storage: It interacts with S3 for storing PDFs from a URL or referring to a PDF already stored in S3. PDFs are then turned into single-page images and stored in S3.
- Generating embeddings: The page images are embedded with voyage-multimodal-3 to produce high-quality embeddings.
- Vector indexing: Finally, it indexes the embeddings using Atlas Vector Search and provides a reference back to S3.

The power of multimodal

Traditional search methods often struggle when dealing with documents that contain text alongside visual elements like charts and graphs, which are common in research papers, financial reports, and more. Developers typically need to build complex, custom pipelines to handle image storage, embedding generation, and vector indexing. Our Multimodal Search Library abstracts this complexity away, using the best-in-class voyage-multimodal-3. It empowers developers to build applications that can understand and search the content of images just as easily as text. This enables accurate and efficient information retrieval and richer user experiences when working with multimodal data or visually rich PDFs.

Figure 1. Traditional chunking vs. multimodal embedding.

Imagine you're a financial analyst sifting through hundreds of annual reports—dense PDFs filled with text, tables, and charts—to find a specific trend. With our Multimodal Search Library, you can simply ask a question in natural language, like: "Show me all the charts illustrating revenue growth over the past three years." The library will process the query and retrieve pages containing the relevant charts from your corpus of knowledge.

Likewise, consider an e-commerce platform with a large product catalog. A shopper might be looking for a specific style of shoes but may not know the right keywords to describe exactly what they are looking for. By leveraging multimodal search, the user could upload an image of the shoes they like, and the application finds visually similar in-stock items, creating a seamless product discovery journey.

Learn how to get started

To get started, you’ll need:

- A MongoDB Atlas cluster (sign up for the free tier)
- A MongoDB collection in that cluster
- A MongoDB Atlas Vector Search index
- A Voyage AI API key (sign up)
- An S3 bucket (sign up)

Installation and setup

First, we’ll ensure that we can connect to MongoDB Atlas, AWS S3, and Voyage AI.
```shell
pip install pymongo-voyageai-multimodal
```

```python
import os

from pymongo import MongoClient
from pymongo_voyageai_multimodal import PyMongoVoyageAI

client = PyMongoVoyageAI.from_connection_string(
    connection_string=os.environ["MONGODB_ATLAS_CONNECTION_STRING"],
    database_name="db_name",
    collection_name="collection_name",
    s3_bucket_name=os.environ["S3_BUCKET_NAME"],
    voyageai_api_key=os.environ["VOYAGEAI_API_KEY"],
)
```

Adding documents

Next, we’ll add relevant documents for embedding generation.

```python
from pymongo_voyageai_multimodal import TextDocument, ImageDocument

text = TextDocument(text="foo", metadata={"baz": "bar"})
images = client.url_to_images(
    "https://www.fdrlibrary.org/documents/356632/390886/readingcopy.pdf"
)
documents = [text, images[0], images[1]]
ids = ["1", "2", "3"]
client.add_documents(documents=documents, ids=ids)
```

Performing search

Finally, we’ll search for the content most semantically similar to our query.

```python
results = client.similarity_search(query="example", k=1)
for doc in results:
    print(f"* {doc['id']} [{doc['inputs']}]")
```

Loading data already stored in S3

Developers can also query against documents already stored in S3. See more information in the documentation.

```python
import os

from pymongo_voyageai_multimodal import PyMongoVoyageAI

client = PyMongoVoyageAI(
    voyageai_api_key=os.environ["VOYAGEAI_API_KEY"],
    s3_bucket_name=os.environ["S3_BUCKET_NAME"],
    mongo_connection_string=os.environ["MONGODB_URI"],
    collection_name="test",
    database_name="test_db",
)

query = "The consequences of a dictator's peace"
url = "s3://my-bucket-name/readingcopy.pdf"
images = client.url_to_images(url)
resp = client.add_documents(images)
client.wait_for_indexing()
data = client.similarity_search(query, extract_images=True)
print(f"Found {len(data)} relevant pages")
client.close()
```

A few important notes:

- Automatic updates to source data are not supported. Changes to indexed data need to be made via application code calling the client using the add_documents and delete functions.
- This library is primarily meant to support integrating multimodal embeddings and MongoDB Atlas on relatively static datasets. It is not intended to support sophisticated aggregation pipelines that combine multiple stages or data that updates frequently.
- voyage-multimodal-3 is the only embedding model supported directly, and AWS is the only cloud provider supported directly.

Ready to try it yourself? Learn more in our documentation, and please share feedback. We can't wait to see what you build!

July 16, 2025

“Hello, Community!”: Meet the 2025 MongoDB Community Champions!

We are so excited to announce this year’s new cohort of MongoDB Community Champions! Community Champions are the connective tissue between MongoDB and our community, keeping them informed about MongoDB’s latest developments and offerings. Community Champions also share their knowledge and experiences with others through a variety of media channels and event engagements.

“The MongoDB Community Champions program is one of the best influencer programs,” says Shrey Batra, Head of Engineering and a fifth-year returning Champion. “We can contribute directly to the product development, participate in developer outreach, get developer feedback to the right people, and so much more!”

This year’s 47-member group includes 21 new champions. They come to us from countries all over the world, including Canada, the United States, South Korea, Malaysia, China, Australia, Serbia, Germany, India, Portugal, and Brazil. As a group, they represent a broad range of expertise and serve in a variety of community and professional roles—ranging from engineering leads to chief architects to heads of developer relations.

“I’m excited to join the MongoDB Community Champions program because it brings together engineers who are deeply invested in solving real-world data challenges,” says Ruthvik Reddy Anumasu, Principal Database Engineer and a first-year Champion. “As someone who’s worked on scaling, securing, and optimizing critical data systems, I see this as a chance to both share practical insights and learn from others pushing boundaries.”

Each Community Champion demonstrates exceptional leadership in advancing the growth and knowledge of MongoDB’s brand and technology. “Being part of the MongoDB Community Champions program is like a solo leveling process—from gathering like-minded personnel to presenting valuable insights that help others in their careers,” says Lai Kai Yong, a Software Engineer and first-year Champion. “I’m excited to continue shipping things, as I believe MongoDB is not only a great product and an amazing company, but also a vibe.”

As members of this program, Community Champions gain a variety of experiences—including exclusive access to executives, product roadmaps, preview programs, and an annual Champions Summit with product leaders—and relationships that grow their professional stature as MongoDB practitioners, helping them be seen as leaders in the technology community.

“After working with MongoDB for more than a decade, I’m happy to be a MongoDB Community Champion,” says Patrick Pittich-Rinnerthaler, Hands-on Web Architect and first-year Champion. “One of the things I’m interested in, in particular, is the connection to other Champions and Engineers. Together, we enable customers and users to do more with MongoDB.”

And now, without further ado, let’s meet the 2025 cohort of Community Champions!

NEW COMMUNITY CHAMPIONS: Maria Khalusova, Margaret Menzin, Samuel Molling, Karen Zhang, Shaun Roberts, Joey Marburger, Steve Jones, Ruthvik Reddy Anumasu, Karen Huaulme, Lai Kai Yong, XiaoLei Dai, Luke Thompson, Darae Park, Kim Joong Hui, Rishi Agrawal, Sachin Hejip, Sachin Gupta, Patrick Pittich-Rinnerthaler, Marko Aleksendrić, PhD, Markus Wildgruber, Carla Barata.
RETURNING COMMUNITY CHAMPIONS: Abirami Sukumaran, Arek Borucki, Azri Azmi, Christoph Strobl, Christopher Dellaway, Claudia Cardeno Cano, Elie Hannouch, Flavia da Silva Bomfim Policante, Igor Alekseev, Justin Jenkins, Kevin Smith, Leandro Domingues, Malak Abu Hammad, Mateus Leonardi, Michael Höller, Mustafa Kadioglu, Nancy Agarwal, Nenad Milosavljevic, Nilesh Soni, Nuri Halperin, Rajesh Nair, Roman Right, Shrey Batra, Tamara Manzi de Azevedo, Vivekanandan Sakthivelu, Zidan M. For more, visit our MongoDB Community Champions page. If you’d like to connect with your local MongoDB community, check out our MongoDB User Groups on Meetup .

July 15, 2025

Matryoshka Embeddings: Smarter Embeddings with Voyage AI

In the realm of AI, embedding models are the bedrock of advanced applications like retrieval-augmented generation (RAG), semantic search, and recommendation systems. These models transform unstructured data (text, images, audio) into high-dimensional numerical vectors, allowing us to perform similarity searches and power intelligent features. However, traditional embedding models often generate fixed-size vectors, leading to trade-offs between performance and computational overhead. This post will dive deep into Matryoshka Representation Learning (MRL), a novel approach that creates flexible, multi-fidelity embeddings. We'll compare and contrast MRL with traditional embeddings and quantization, detailing its unique training process and showcasing how Voyage AI's voyage-3-large and the recently released voyage-3.5 models leverage MRL as well as quantization to deliver unparalleled efficiency with MongoDB Atlas Vector Search.

Understanding embedding models

At their core, embedding models learn to represent discrete items (words, sentences, documents) as continuous vectors in a multi-dimensional space. The key principle is that items with similar meanings or characteristics are mapped to points that are close to each other in this vector space. This spatial proximity then allows for efficient similarity comparisons using metrics like cosine similarity. For example, in a semantic search application, when a user queries "best vegan restaurants," the embedding model converts this query into a vector. It then compares this vector against a database of pre-computed embeddings for restaurant descriptions. Restaurants whose embeddings are "nearby" the query embedding are deemed relevant and returned to the user.

Figure 1. Example embedding model. Image credit: Hugging Face Blog.

Challenges with traditional embeddings

Historically, embedding models generate vectors of a fixed size, for example, 768, 1024, or 4096 dimensions. While effective, this fixed-size nature presents challenges:

- Inflexibility: A model trained for, say, 768-dimensional embeddings will suffer a significant performance drop if you simply truncate its vectors to a smaller size, like 256 dimensions, without retraining. This means you're locked into a specific dimension size, even if a smaller representation would suffice for certain tasks.
- High computational load: Higher-dimensional vectors demand more computational resources for storage, transfer, and similarity calculations. In scenarios with large datasets or real-time inference, this can lead to increased latency and operational costs.
- Information loss on truncation: Without specific training, truncating traditional embeddings inevitably leads to substantial information loss, compromising the quality of downstream tasks.

Matryoshka Representation Learning

MRL, introduced by researchers from the University of Washington, Google Research, and Harvard University in 2022, offers an elegant solution to these challenges. Inspired by Russian nesting dolls, MRL trains a single embedding model such that its full-dimensional output can be truncated to various smaller dimensions while still retaining high semantic quality. The magic lies in how the model is trained to ensure that the initial dimensions of the embedding are the most semantically rich, with subsequent dimensions adding progressively finer-grained information. This means you can train a model to produce, say, a 1024-dimensional embedding.
Then, for different use cases or performance requirements, you can simply take the first 256, 512, or any other number of dimensions from that same 1024-dimensional vector. Each truncated vector is still a valid and semantically meaningful representation, just at a different level of detail.

Figure 2. Matryoshka embedding model truncating the output. Image credit: Hugging Face Blog.

Understanding MRL with an analogy

Imagine a movie. A 2048-dimensional MRL embedding might represent the "Full Movie". Truncating it to:

- 1024 dimensions: Still provides enough information for a "Movie Trailer."
- 512 dimensions: Gives a "Plot Summary & Movie Details."
- 256 dimensions: Captures the "Movie Title & Plot One-liner."

This "coarse-to-fine" property ensures that each prefix of the full vector remains semantically rich and usable. You simply keep the first N dimensions from the full vector to truncate it.

Figure 3. Visualizing the Matryoshka doll analogy for MRL.

The unseen hand: How the loss function shapes embedding quality

To truly grasp what makes MRL distinct, we must first understand the pivotal role of the loss function in the training of any embedding model. This mathematical function is the core mechanism that teaches these sophisticated models to understand and represent meaning. During a typical training step, an embedding model processes a batch of input data, producing a set of predicted output vectors. The loss function (J in the diagram below) then steps in, comparing these predicted embeddings (y_pred) against known "ground truth" or expected target values (y). It quantifies the discrepancy between what the model predicts and what it should ideally produce, effectively gauging the "error" in its representations. A high loss value signifies a significant deviation – a large "penalty" indicating the model is failing to capture the intended relationships (e.g., placing semantically similar items far apart in the vector space). Conversely, a low loss value indicates accurate capture of these relationships, ensuring that similar concepts (like different images of cats) are mapped close together, while dissimilar ones remain distant.

Figure 4. Training workflow including the loss function.

The iterative training process, guided by an optimizer, continuously adjusts the model's internal weights with the sole aim of minimizing this loss value. This relentless pursuit of a lower loss is precisely how an embedding model learns to generate high-quality, semantically meaningful vectors.

MRL training process

The key differentiator for MRL lies in its training methodology. Unlike traditional embeddings, where a single loss value is computed for the full vector, MRL training involves:

- Multiple loss values: Separate loss values are computed for multiple truncated prefixes of the vector (e.g., at 256, 512, 1024, and 2048 dimensions).
- Loss averaging: These individual losses are averaged (or summed) to calculate a total loss.
- Incentivized information packing: The model is trained to minimize this total loss. This process penalizes even the smallest prefixes if their loss is high, strongly incentivizing the model to pack the most crucial information into the earliest dimensions of the vector.

This results in a model where information is "front-loaded" into early dimensions, ensuring accuracy remains strong even with fewer dimensions, unlike traditional models where accuracy drops significantly upon truncation. Examples of MRL-trained models include voyage-3-large and voyage-3.5.
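To make the truncation idea concrete, here is a minimal, self-contained Python sketch that mimics MRL-style truncation: keep the first N dimensions, re-normalize, and compare similarities. The vectors below are randomly generated stand-ins; with a real MRL-trained model such as voyage-3-large you would use actual embeddings, and the training scheme described above is what keeps similarity rankings largely intact at smaller dimensions.

```python
import numpy as np

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)

# Stand-in "embeddings": 1024-dim vectors, random here for illustration only.
query = rng.normal(size=1024)
doc_a = query + 0.30 * rng.normal(size=1024)   # semantically close to the query
doc_b = rng.normal(size=1024)                  # unrelated document

for dims in (1024, 512, 256):
    q, a, b = (truncate(v, dims) for v in (query, doc_a, doc_b))
    print(f"{dims:>4} dims: sim(query, doc_a)={cosine(q, a):.3f}  "
          f"sim(query, doc_b)={cosine(q, b):.3f}")
```

With random vectors, every prefix behaves the same by construction; the point of MRL training is to make this hold for real embeddings by front-loading the most important information into the earliest dimensions.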
MRL vs. quantization

It's important to differentiate MRL from quantization, another common technique for reducing embedding size. While both aim to make embeddings more efficient, their approaches and benefits differ fundamentally. Quantization techniques compress existing high-dimensional embeddings into a more compact form by reducing the precision of the numerical values (e.g., from float32 to int8). The following table describes the precise differences between MRL and quantization.

| Aspect | MRL | Quantization |
| --- | --- | --- |
| Goal | Reduce embedding dimensionality (e.g., 256 out of 2048 dims) | Reduce embedding precision (e.g., int8/binary embeddings instead of fp32) |
| Output type | Float32 vectors of varying lengths | Fixed-length vectors with lower-bit representations |
| Training awareness | Uses multi-loss training across dimensions | Often uses quantization-aware training (QAT) |
| Use case | Trade off accuracy vs. compute/memory at inference | Minimize storage and accelerate vector math operations |
| Example (Voyage AI) | voyage-3-large @ 512-dim fp32 | voyage-3-large @ 2048-dim int8 |

Flexibility and efficiency with MRL

The core benefit of MRL is its unparalleled flexibility and efficiency. Instead of being locked into a single, large vector size, you can:

- Choose what you need: Generate a full 2048-dimensional vector and then slice it to 256, 512, or 1024 dimensions based on your specific needs.
- One vector, multiple fidelities: A single embedding provides multiple levels of detail and accuracy.
- Lower compute, bandwidth, and storage: By using smaller vector dimensions, you drastically reduce the computational load for indexing, query processing, and data transfer, as well as the storage footprint in your database.
- Efficient computation: The embedding is computed once, and then you simply slice it to the desired dimensions, making it highly efficient.

Voyage AI, in particular, leverages MRL by default across its models, including voyage-3-large and the latest voyage-3.5, enabling scalable embeddings with one model and multiple dimensions. This allows you to dynamically choose between space/latency and quality at query time, leading to efficient retrieval with minimal accuracy loss.

Voyage AI's dual approach: MRL and quantization for ultimate efficiency

Voyage AI models maximize efficiency by combining MRL and quantization. MRL enables flexible embeddings by allowing you to select the optimal vector length—for instance, using 512 instead of 2048 dimensions—resulting in significant reductions in size and computational overhead with minimal accuracy loss. Quantization further compresses these vectors by reducing their bit precision, which cuts storage needs and speeds up similarity search operations. This synergy allows you to choose embeddings tailored to your application’s requirements: a voyage-3-large embedding can be used as a compact 512-dimensional floating-point vector (leveraging MRL) or as a full 2048-dimensional 8-bit integer vector (via quantization). The dual approach empowers you to balance accuracy, storage, and performance, ensuring highly efficient, flexible embeddings for your workload. As a result, Voyage AI models deliver faster inferences and help reduce infrastructure costs when powering applications with MongoDB Atlas Vector Search.

Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.

July 14, 2025

Improving Industrial Safety with Game Theory and MongoDB

In industrial operations, safety is both a business and a human imperative. Heavy-asset industries like aerospace, shipbuilding, and construction constantly invest in better safety systems and policies to keep their staff safe. But a variety of factors—tight physical environments, time pressures, and steep production targets—can lead workers to take unsafe shortcuts to meet quotas. For instance, the European Maritime Safety Agency (EMSA) cited 650 fatalities and over 7,600 injuries linked to marine incidents involving EU-registered ships between 2014 and 2023, and human factors contributed to 80% of these incidents.

Traditional safety incident reporting tools focus on retrospective data. Such systems capture and document safety incidents only after they have occurred, meaning that companies are reacting to events rather than proactively preventing them. On the ground, factory and shipyard workers often find themselves having to make split-second choices: safety versus speed, following protocols versus meeting production targets, and so on. To move beyond hindsight—and to proactively guarantee safety—organizations must be able to model and analyze these behavioral trade-offs in real time to build informed policy (as well as an organizational culture) that supports safe behavior on the ground.

In this blog post, we’ll dive into how organizations can leverage MongoDB as a unified operational data store for time series sensor telemetry, worker decisions, and contextual factors. By consolidating this information into a single database, MongoDB makes it possible to easily generate proactive insights into how workers will act under different conditions, thereby improving safety policies and incentives.

Modeling human decisions and trade-offs in industrial environments

Game theory, a mathematical framework used to model and analyze strategic interactions between individuals or entities, can be leveraged here to better anticipate and influence operational decisions. Let’s use the example of a shipyard, in which workers must constantly weigh critical decisions—balancing safety against speed, following rules versus meeting deadlines, deciding whether to take a shortcut that helps them hit a deadline. These decisions are not random; they are shaped by peer pressure, working conditions, management oversight, and the incentive structures in place. So in an industrial context, game theory allows us to simulate these decisions as an ongoing, repeated game. For example: “If a policy is too strict, do workers take more risks to save time?” or “If incentives favor speed, does safety compliance drop?” and, most importantly, “How do these patterns evolve as conditions and oversight change?”

By modeling these choices as part of a repeated game, we can simulate how workers behave under different combinations of policy strictness and incentive strength. To create such a game-theoretic system, we need to bring together different datasets—real-time environmental sensor telemetry, worker profiles, operational context, etc.—and use this data to drive a game-theoretic model. A behavior-aware safety simulation engine powered by MongoDB enables this approach; the engine brings together disparate data and models it using MongoDB’s flexible document model.
The document model can easily adapt to fast-changing, real-time conditions, meaning that companies can leverage MongoDB to build data-driven, dynamic safety policy tuning systems that predict where, when, and why risky behavior might occur during daily operations.

MongoDB Atlas: Turning game theory into industrial intelligence

To bring this model to life, we need to simulate, store, and analyze decision flows in real time. This is where MongoDB Atlas plays a central role. In this example, we will build this solution for shipyard operations. Figure 1 shows the conceptual architecture of our simulation engine, in which MongoDB acts as both the behavioral memory and analytical core, capturing decisions, scoring risk, and enabling feedback-driven policy experimentation.

Figure 1. A closed feedback loop for safer shipyards.

Per below, each element of the architecture drives smarter decision-making, with seamless, real-time integration:

- Time series data storage: All worker actions/decisions and sensor data (temperature, gas, humidity, etc.) are stored in MongoDB collections, a central, flexible operational database.
- Game-theoretic decision modeling: A game theory-based simulator models worker trade-offs under different policy and incentive setups.
- Data contextualization and storage: MongoDB stores not just the raw sensor data but context as well, including payoff and risk. The flexibility of the document model enables easy data modeling.
- Risk scoring and analysis: MongoDB’s aggregation framework helps analyze trends over time to detect rising risk profiles or policy blind spots (see the aggregation sketch after the data models below).
- Adaptive safety design: Safety teams can tweak policies and incentives directly, shaping safer behavior before incidents occur.

MongoDB acts as the data backbone for the entire solution, storing three key datasets. The snippets below show the document model for each collection in Atlas.

Environmental telemetry (sensor_data time series collection) from simulated or actual sensors in the shipyard:

```json
{
  "timestamp": { "$date": "2025-06-06T20:00:22.970Z" },
  "zone": "Tank Zone",
  "run_id": "9722c0e7-c10d-4526-a1a1-2647c9731589",
  "_id": { "$oid": "684348d687d59464d1f498d0" },
  "temperature": 42.6,
  "gas": "normal"
}
```

Worker profiles (workers collection) capturing static attributes and evolving risk indicators:

```json
{
  "_id": "W539",
  "name": "Worker89",
  "role": "Welder",
  "risk_profile": {
    "avg_shortcut_rate": 0,
    "historical_decision_trends": [
      { "policy": "strict", "incentive": "high", "rate": 0 }
    ]
  },
  "metadata": {
    "ppe_compliance": "good",
    "training_completed": [ "confined space", "hazmat" ]
  }
}
```

Behavior logs (worker_behavior time series collection) recording every simulated or real decision made in context (policy, incentive, zone):

```json
{
  "timestamp": "2025-04-15T01:57:04.938Z",
  "workerId": "W539",
  "zone": "Tank Zone",
  "environment": { "temperature": 35.3, "gas": "normal" },
  "incentive": "high",
  "decision": "followed_procedure",
  "policy": "strict",
  "computed": { "risk_score": 0.24, "payoff": 3 },
  "_id": { "$oid": "67fdbcf0b9b3624b42add7b4" }
}
```

Figure 2, meanwhile, shows the physical architecture of the behavior-aware simulation system. Here, MongoDB acts as the central data backbone, providing data to the risk and decision dashboard for trend analysis and policy experimentation.

Figure 2. Physical architecture of the behavior-aware simulation system.
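As a hint of what the risk scoring and analysis step could look like, below is a small PyMongo sketch that aggregates the worker_behavior documents shown above into a per-zone daily risk trend. It assumes the field names from the sample documents (zone, timestamp, computed.risk_score) and a hypothetical database name; the actual dashboard queries in the solution may differ.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")
behavior = client["shipyard"]["worker_behavior"]  # assumed database name

# Average risk score per zone per day, most recent first: a rising average
# for a zone is an early signal of a policy blind spot.
# (timestamp is a BSON date in the time series collection; it only renders
# as a string in the JSON sample above.)
pipeline = [
    {
        "$group": {
            "_id": {
                "zone": "$zone",
                "day": {"$dateTrunc": {"date": "$timestamp", "unit": "day"}},
            },
            "avg_risk": {"$avg": "$computed.risk_score"},
            "decisions": {"$sum": 1},
        }
    },
    {"$sort": {"_id.zone": 1, "_id.day": -1}},
]

for row in behavior.aggregate(pipeline):
    print(row["_id"]["zone"], row["_id"]["day"],
          round(row["avg_risk"], 3), row["decisions"])
```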
MongoDB provides all the foundational building blocks to power our simulation engine from end to end. Time series collections enable high-speed ingestion of sensor data, while built-in compression and windowing functions support efficient risk scoring and trend analysis at scale. This eliminates the need for an external time series database. Change streams and Atlas Stream Processing power real-time dashboards and risk analytics pipelines that respond to new inputs as they occur. As policies, sensors, or simulator logic evolve over time, MongoDB's flexible schema ensures that you do not need to rework your data model or incur any downtime. Finally, Atlas Vector Search can help derive insights from unstructured text data such as incident reports or operator feedback.

Figure 3 shows the solution in action: over time, the risk profiles of simulated workers rise because of policy leniency and low incentive levels. The figure highlights how even well-meaning safety policies can unintentionally encourage risky behavior and even lead to workplace accidents—which is why it's critical to simulate and evaluate a policy's impact before deploying it in the real world.

Figure 3. Game-theoretic safety simulation overview.

With these safety insights stored and analyzed in MongoDB, organizations can run what-if scenarios, adjust policy configurations, and measure predicted behavioral outcomes in advance. The organizational impact of such a system is significant because safety leaders can move away from reactive investigations toward proactive policy design. For example, a shipyard might decide to introduce targeted safety training for specific zones, or fine-tune supervision protocols based on simulation outcomes, rather than waiting for an actual incident to occur. Together, these features make MongoDB uniquely suited to drive safety innovation where real-world complexity demands flexible and scalable infrastructure.

Check out the solution repo, which you can clone and try out yourself. To learn more about MongoDB's role in the manufacturing industry, please visit our manufacturing and automotive page.

July 14, 2025

Don’t Just Build Agents, Build Memory-Augmented AI Agents

Insight Breakdown: This piece aims to reveal that regardless of architectural approach—whether Anthropic's multi-agent coordination or Cognition's single-threaded consolidation—sophisticated memory management emerges as the fundamental determinant of agent reliability, believability, and capability. It marks the evolution from stateless AI applications toward truly intelligent, memory-augmented systems that learn and adapt over time.

AI agents are intelligent computational systems that can perceive their environment, make informed decisions, use tools, and, in some cases, maintain persistent memory across interactions—evolving beyond stateless chatbots toward autonomous action. Multi-agent systems coordinate multiple specialized agents to tackle complex tasks, like a research team where different agents handle searching, fact-checking, citations, and research synthesis. Recently, two major players in the AI space released different perspectives on how to build these systems. Anthropic released an insightful piece highlighting their learnings on building multi-agent systems for deep research use cases. Cognition also released a post titled "Don't Build Multi-Agents," which appears to contradict Anthropic's approach directly. Two things stand out:

Both pieces are right

Yes, this sounds contradictory, but working with customers building agents of all scales and sizes in production, we find that both the use case and the application mode, in particular, are key factors to consider when determining how to architect your agent(s). Anthropic's multi-agent approach makes sense for deep research scenarios where sustained, comprehensive analysis across multiple domains over extended periods is required. Cognition's single-agent approach is optimal for conversational agents or coding tasks where consistency and coherent decision-making are paramount. The application mode—whether research assistant, conversational agent, or coding assistant—fundamentally shapes the optimal memory architecture. Anthropic also highlights this point when discussing the downsides of multi-agent architecture: "For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time." (Anthropic, Building Multi-Agent Research System)

Both pieces are saying the same thing

Memory is the foundational challenge that determines agent reliability, believability, and capability. Anthropic emphasizes sophisticated memory management techniques (compression, external storage, context handoffs) for multi-agent coordination. Cognition emphasizes context engineering and continuous memory flow to prevent the fragmentation that destroys agent reliability. Both teams arrived at the same core insight: agents fail without robust memory management. Anthropic chose to solve memory distribution across multiple agents, while Cognition chose to solve memory consolidation within single agents. The key takeaway from both pieces, for AI engineers or anyone developing an agentic platform, is to not just build agents, but to build memory-augmented AI agents. With that out of the way, the rest of this piece will provide you with the essential insights from both pieces that we think are important, and point to the memory management principles and design patterns we've observed among our customers building agents.
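Before diving in, here is what "memory-augmented" can look like in its simplest form: a minimal sketch of an agent persisting and recalling memories in MongoDB. The database, collection, and field names are illustrative and are not taken from either Anthropic's or Cognition's implementations.

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")  # placeholder URI
memories = client["agent_platform"]["agent_memory"]  # hypothetical memory collection

def remember(agent_id: str, kind: str, content: str) -> None:
    """Persist a memory item (e.g., a summarized work phase) for an agent."""
    memories.insert_one({
        "agent_id": agent_id,
        "kind": kind,  # e.g., "phase_summary", "tool_result", "user_preference"
        "content": content,
        "created_at": datetime.now(timezone.utc),
    })

def recall(agent_id: str, kind: str, limit: int = 5) -> list[str]:
    """Fetch the most recent memories of a given kind to rebuild the agent's context."""
    cursor = (memories.find({"agent_id": agent_id, "kind": kind})
                      .sort("created_at", -1)
                      .limit(limit))
    return [doc["content"] for doc in cursor]

remember("researcher-1", "phase_summary", "Compared three vector index options; shortlisted HNSW.")
print(recall("researcher-1", "phase_summary"))

Everything that follows is about making this simple pattern reliable, scalable, and shared safely across agents.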
The key insights

If you are building your agentic platform from scratch, you can extract much value from Anthropic's approach to building multi-agent systems, particularly their sophisticated memory management principles, which are essential for effective agentic systems. Their implementation reveals critical design considerations, including techniques to overcome context window limitations through compression, function calling, and storage functions that enable sustained reasoning across extended multi-agent interactions—foundational elements that any serious agentic platform must address from the architecture phase. Key insights:

Agents are overthinkers.
Multi-agent systems trade efficiency for capability.
Systematic agent observation reveals failure patterns.
Context windows remain insufficient for extended sessions.
Context compression enables distributed memory management.

Let's go a bit deeper into how these insights translate into practical implementation strategies.

Agents are overthinkers

Anthropic researchers mention using explicit guidelines to steer agents into allocating the right amount of resources (tool calls, sub-agent creation, etc.); otherwise, agents tend to overengineer solutions. Without proper constraints, agents would spawn excessive subagents for simple queries, conduct endless searches for nonexistent information, and apply complex multi-step processes to tasks requiring straightforward responses. Explicit guidance for agent behavior isn't entirely new—system prompts and instructions are typical parameters in most agent frameworks. However, the key insight here goes deeper than traditional prompting approaches. When agents are given access to resources such as data, tools, and the ability to create sub-agents, there needs to be explicit, unambiguous direction on how those resources are expected to be leveraged for specific tasks. This goes beyond system prompts and instructions into resource allocation guidance, operational constraints, and decision-making boundaries that prevent agents from overengineering solutions or misusing available capabilities. Take, for example, the OpenAI Agents SDK, which has several parameters for describing resource behavior to the agent, such as handoff_description, which specifies how a subagent should be leveraged in a multi-agent system, or tool_use_behavior, which, as the name suggests, describes to the agent how a tool should be used. The key takeaway for AI engineers is that implementing a multi-agent system requires extensive thinking about which tools the agents are expected to leverage, which subagents belong in the system, and how resource utilization is communicated to the calling agent. When implementing resource allocation constraints for your agents, also consider that the traditional approach of managing multiple specialized databases (a vector DB for embeddings, a graph DB for relationships, a relational DB for structured data) compounds the complexity problem and introduces tech stack sprawl, an anti-pattern for rapid AI innovation.

Multi-agent systems trade efficiency for capability

While multi-agent architectures can utilize more tokens and parallel processing for complex tasks, Anthropic found operational costs significantly higher due to coordination overhead, context management, and the computational expense of maintaining a coherent state across multiple agents.
In some cases, two heads are better than one, but they are also more expensive in multi-agent systems. One thing worth noting here is that the use case behind Anthropic's multi-agent system is deep research. This use case requires extensive exploration of resources, including text-heavy research papers, sites, and documentation, to accumulate enough information to formulate the final output (typically a 2,000+ word essay on the user's starting prompt). In other use cases, such as automated workflows with agents representing processes within the workflow, there might not be as much token consumption, especially if the process encapsulates deterministic steps such as database reads and writes, and its output consists of execution results expressed as sentences or short summaries. The coordination overhead challenge becomes particularly acute when agents need to share state across different storage systems. Rather than managing complex data synchronization between specialized databases, MongoDB's native ACID compliance ensures that multi-agent handoffs maintain data integrity without external coordination mechanisms. This unified approach reduces both the computational overhead of distributed state management and the engineering complexity of maintaining consistency across multiple storage systems.

Context compression enables distributed memory management

Beyond reducing inference costs, compression techniques allow multi-agent systems to maintain shared context across distributed agents. Anthropic's approach involves summarizing completed work phases and storing essential information in external memory before agents transition to new tasks. This, coupled with the insight that context windows remain insufficient for extended sessions, points to the fact that prompt compression and compaction techniques are still relevant and useful in a world where LLMs have extensive context windows. Even with a 200K-token (approximately 150,000-word) capacity, Anthropic's agents in multi-round conversations require sophisticated context management strategies, including compression, external memory offloading, and spawning fresh agents when limits are reached. We previously partnered with Andrew Ng and DeepLearning.AI on a course covering prompt compression techniques and retrieval-augmented generation (RAG) optimization.

Systematic agent observation reveals failure patterns

Systematic agent observation represents one of Anthropic's most practical insights. Essentially, rather than relying on guesswork (or vibes), the team built detailed simulations using identical production prompts and tools and then systematically observed step-by-step execution to identify specific failure modes. This phase of an agentic system carries substantial operational cost. From our perspective, working with customers building agents in production, this methodology addresses a critical gap most teams face: understanding how your agents actually behave versus how you think they should behave. Anthropic's approach immediately revealed concrete failure patterns that many of us have encountered but struggled to diagnose systematically. Their observations uncovered agents overthinking simple tasks (as mentioned earlier), using verbose search queries that reduced effectiveness, and selecting inappropriate tools for specific contexts.
As they note in their piece: "This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent."

The key insight here is moving beyond trial-and-error prompt engineering toward purposeful debugging. Instead of making assumptions about what should work, Anthropic demonstrates the value of systematic behavioral observation to identify the root causes of poor performance. This enables targeted prompt improvements based on actual evidence rather than intuition. We find that gathering, tracking, and storing agent process memory serves a critical dual purpose: not only is it vital for agent context and task performance, but it also provides engineers with the essential data needed to evolve and maintain agentic systems over time. Agent memory and behavioral logging remain the most reliable method for understanding system behavior patterns, debugging failures, and optimizing performance, regardless of whether you implement a single comprehensive agent or a system of specialized subagents collaborating to solve problems. MongoDB's flexible document model naturally accommodates the diverse logging requirements for both operational memory and engineering observability within a single, queryable system.

One thing it would be interesting to learn from the Anthropic research team is which evaluation metrics they use. We've spoken extensively about evaluating LLMs in RAG pipelines, but what new agentic system evaluation metrics are developers working toward? We are answering these questions ourselves and have partnered with Galileo, a key player in the AI stack whose focus is purely on evaluating RAG and agentic applications and making these systems reliable for production. Our learnings will be shared in an upcoming webinar taking place on July 17, 2025. However, for anyone building agentic systems, this represents a shift in development methodology: building agents requires building the infrastructure to understand them, and sandbox environments might become a key component of the evaluation and observability stack for agents.

Advanced implementation patterns

Beyond the core insights above, Anthropic's research reveals several advanced patterns worth examining. The Anthropic piece hints at the implementation of advanced retrieval mechanisms that go beyond vector-based similarity between query vectors and stored information. Their multi-agent architecture enables sub-agents to call tools (an approach also seen in MemGPT) to store their work in external systems, then pass lightweight references—presumably unique identifiers of summarized memory components—back to the coordinator. We generally emphasize the importance of hybrid retrieval to our customers and developers, where multiple retrieval methods are combined—for example, using vector search to understand intent while simultaneously performing text search for specific product details. MongoDB's native support for vector similarity search and traditional indexing within a single system eliminates the need for complex reference management across multiple databases, simplifying the coordination mechanisms that Anthropic's multi-agent architecture requires.
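As an illustration of that hybrid pattern, here is a minimal sketch that runs a vector query and a full-text query against the same MongoDB collection and fuses the results in application code. The collection, index names, and fusion constant are assumptions for illustration, not details from Anthropic's system.

from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")  # placeholder URI
memories = client["agent_platform"]["agent_memory"]  # hypothetical collection of stored summaries

def hybrid_retrieve(query_text, query_vector, k=5):
    # Semantic arm: Atlas Vector Search over an embedding field (index name assumed).
    vector_hits = list(memories.aggregate([
        {"$vectorSearch": {"index": "memory_vector_index", "path": "embedding",
                           "queryVector": query_vector, "numCandidates": 100, "limit": k}},
        {"$project": {"content": 1}},
    ]))
    # Lexical arm: Atlas Search full-text query over the content field (index name assumed).
    text_hits = list(memories.aggregate([
        {"$search": {"index": "memory_text_index",
                     "text": {"query": query_text, "path": "content"}}},
        {"$limit": k},
        {"$project": {"content": 1}},
    ]))
    # Reciprocal-rank fusion in application code: reward documents surfaced by either arm.
    scores = {}
    for hits in (vector_hits, text_hits):
        for rank, doc in enumerate(hits):
            scores[doc["_id"]] = scores.get(doc["_id"], 0.0) + 1.0 / (60 + rank + 1)
    top_ids = sorted(scores, key=scores.get, reverse=True)[:k]
    return list(memories.find({"_id": {"$in": top_ids}}))

Because both arms run against the same collection, a coordinator only needs to pass around lightweight _id references rather than reconciling results across separate engines.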
The Anthropic team implements continuity in the agent execution process by establishing clear boundaries between task completion and summarizing the current phase before moving to the next task. This creates a scalable system where memory constraints don't bottleneck the research process, allowing for truly deep and comprehensive analysis that spans beyond what any single context window could accommodate. In a multi-agent pipeline, each sub-agent produces partial results—intermediate summaries, tool outputs, extracted facts—and then hands them off to a shared "memory" database. Downstream agents then read those entries, append their analyses, and write updated records back. Because these handoffs happen in parallel, you must ensure that one agent's commit doesn't overwrite another's work and that a reader doesn't pick up a half-written summary. Without atomic transactions and isolation guarantees, you risk:

Lost updates, where two agents load the same document, independently modify it, and then write back, silently discarding one agent's changes.
Dirty or non-repeatable reads, where an agent reads another's uncommitted or rolled-back write, leading to decisions based on phantom data.

Coordinating these handoffs purely in application code would force you to build locking layers or distributed consensus, which quickly becomes a brittle, error-prone web of external orchestrators. Instead, you want your database to provide those guarantees natively, so that each read-modify-write cycle appears to execute in isolation and either fully succeeds or fully rolls back. MongoDB's ACID compliance becomes crucial here, ensuring that these boundary transitions maintain data integrity across multi-agent operations without requiring external coordination mechanisms that could introduce failure points.

Application mode is crucial when discussing memory implementation. In Anthropic's case, the application functions as a research assistant, while in other implementations, like Cognition's approach, the application mode is conversational. This distinction significantly influences how agents operate and manage memory in their specific application contexts. Through our internal work and customer engagements, we extend this insight to suggest that application mode affects not only agent architecture choices but also the distinct memory types used in the architecture.

AI agents need augmented memory

Anthropic's research makes one thing abundantly clear: the context window is not all you need. This extends to the key point that memory and agent engineering are two sides of the same coin. Reliable, believable, and truly capable agents depend on robust, persistent memory systems that can store, retrieve, and update knowledge over long, complex workflows. As the AI ecosystem continues to innovate on memory mechanisms, mastering sophisticated context and memory management approaches will be the key differentiator for the next generation of successful agentic applications. Looking ahead, we see "Memory Engineering" or "Memory Management" emerging as a key specialization within AI Engineering, focused on building the foundational infrastructure that lets agents remember, reason, and collaborate at scale. For hands-on guidance on memory management, check out our webinar on YouTube, which covers essential concepts and proven techniques for building memory-augmented agents. Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.

July 9, 2025

Build an AI-Ready Data Foundation with MongoDB Atlas on Azure

It's time for a database reality check. While conversations around AI usually focus on its immense potential, these advancements are also bringing developers face to face with an immediate challenge: their organizations' data infrastructure isn't ready for AI. Many developers now find themselves trying to build tomorrow's applications on yesterday's foundations. But what if your database could shift from bottleneck to breakthrough?

Is your database holding you back?

Traditional databases were built for structured data in a pre-AI world—they're simply not designed to handle today's need for flexible, real-time data processing. Rigid schemas force developers to spend time managing database structure instead of building features, while separate systems for operational data and analytics create costly delays and complexity. Your data architecture might be holding you back if:

Your developers spend more time wrestling with data than innovating.
AI implementation feels like forcing a square peg into a round hole.
Real-time analytics are anything but real-time.

Go from theory to practice: Examples of modern data architecture at work

Now is the time to rethink your data foundation by moving from rigid to flexible schemas that adapt as applications evolve. Across industries, leading organizations are unifying operational and analytical structures to eliminate costly synchronization processes. Most importantly, they're embracing databases that speak developers' language. In the retail sector, business demands include dynamic pricing that responds to market conditions in real time. Using MongoDB Atlas with Azure OpenAI from Microsoft Azure, retailers are implementing sophisticated pricing engines that analyze customer behavior and market conditions, enabling data-driven decisions at scale. In the healthcare sector, organizations can connect MongoDB Atlas to Microsoft Fabric for advanced imaging analysis and results management, streamlining the flow of critical diagnostic information while maintaining security and compliance. More specifically, when digital collaboration platform Mural faced a 1,700% surge in users, MongoDB Atlas on Azure handled its unstructured application data. The results reflect modern data principles: Mural's small infrastructure team maintained performance during massive growth, while other engineers were able to focus on innovation rather than database management. As noted by Mural's Director of DevOps, Guido Vilariño, this approach enabled Mural's team to "build faster, ship faster, and ultimately provide more expeditious value to customers." This is exactly what happens when your database becomes a catalyst rather than an obstacle.

Shift from "database as storage" to "database as enabler"

Modern databases do more than store information—they actively participate in application intelligence. When your database becomes a strategic asset rather than just a record-keeping necessity, development teams can focus on innovation instead of infrastructure management. What becomes possible when data and AI truly connect?

Intelligent applications can combine operational data with Azure AI services.
Vector search capabilities can enhance AI-driven features with contextual data.
Applications can handle unpredictable workloads through automated scaling.
Seamless integration occurs between data processing and AI model deployment.
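To make the vector search point concrete, here is a minimal sketch of the pattern under assumed names: the Azure OpenAI embedding deployment (here called "text-embedding-3-small"), the MongoDB collection, and the "product_vector_index" vector search index are all placeholders you would replace with your own.

from openai import AzureOpenAI
from pymongo import MongoClient

# Assumed endpoint, key, deployment, and index names, for illustration only.
ai = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<azure-openai-key>",
    api_version="2024-02-01",
)
products = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")["retail"]["products"]

def embed(text: str) -> list[float]:
    # "text-embedding-3-small" stands in for whatever embedding deployment you created in Azure OpenAI.
    return ai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

# Store an operational document together with its embedding...
products.insert_one({
    "name": "Trail running shoe",
    "description": "Lightweight shoe with aggressive grip for wet terrain",
    "embedding": embed("Lightweight shoe with aggressive grip for wet terrain"),
})

# ...and serve an AI-driven feature (semantic product discovery) from the same database.
results = products.aggregate([
    {"$vectorSearch": {
        "index": "product_vector_index",
        "path": "embedding",
        "queryVector": embed("shoes for muddy trail runs"),
        "numCandidates": 100,
        "limit": 5,
    }},
    {"$project": {"name": 1, "score": {"$meta": "vectorSearchScore"}}},
])
for doc in results:
    print(doc["name"], round(doc["score"], 3))

The same collection serves both the operational write path and the AI-driven read path, which is the "database as enabler" shift described above.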
Take the path to a modern data architecture

The deep integration between MongoDB Atlas and Microsoft's Intelligent Data Platform eliminates complex middleware, so organizations can streamline their data architecture while maintaining enterprise-grade security. The platform unifies operational data, analytics, and AI capabilities—enabling developers to build modern applications without switching between multiple tools or managing separate systems. This unified approach means security and compliance aren't bolt-on features—they're core capabilities. From Microsoft Entra ID integration for access control to Azure Key Vault for data protection, the platform provides comprehensive security while simplifying the development experience. As your applications scale, the infrastructure scales with you, handling everything from routine workloads to unexpected traffic spikes without adding operational complexity.

Make your first move

Starting your modernization journey doesn't require a complete infrastructure overhaul or the disruption of existing operations. You can follow a gradual migration path that prioritizes business continuity and addresses specific challenges. The key is having clear steps for moving from legacy to modern architecture. Make decisions that simplify rather than complicate:

Choose platforms that reduce complexity rather than add to it.
Focus on developer experience and productivity.
Prioritize solutions that scale with your needs.

For example, you can begin with a focused proof of concept that addresses a specific challenge—perhaps an AI feature that's been difficult to implement or a data bottleneck that's slowing development. Small wins in these areas demonstrate value quickly and build momentum for broader adoption. As you expand your implementation, focus on measurable results that matter to your organization. Tracking these metrics—whether they're developer productivity, application performance, or new capabilities—helps justify further investment and refine your approach.

Avoid these common pitfalls

As you undertake your modernization journey, avoid these pitfalls:

Attempting to modernize everything simultaneously: This often leads to project paralysis. Instead, prioritize applications based on business impact and technical feasibility.
Creating new data silos: In your modernization efforts, the goal must be integration and simplification.
Adding complexity: Remember that while simplicity scales, complexity compounds. Each decision should move you toward a more streamlined architecture, not a more convoluted one.

The path to a modern, AI-ready data architecture is an evolution, not a revolution. Each step builds on the last, creating a foundation that supports not just today's applications but also tomorrow's innovations.

Take the next step: Ready to modernize your data architecture for AI? Explore these capabilities further by watching the webinar "Enhance Developer Agility and AI-Readiness with MongoDB Atlas on Azure." Then get started on your modernization journey! Visit the MongoDB AI Learning Hub to learn more about building AI applications with MongoDB.

July 8, 2025

Why Relational Databases Are So Expensive to Enterprises

Relational databases were designed with a foundational architecture based on the premise of normalization. This principle—often termed "3rd Normal Form"—dictates that repeating groups of information are systematically cast out into child tables, allowing them to be referenced by other entities. While this design inherently reduces redundancy, it significantly complicates the underlying data structures.

Figure 1. Relational database normalization structure for insurance policy data.

Every entity in a business process, its attributes, and their complex interrelations must be dissected and spread across multiple tables—policies, coverages, and insured items each becoming a distinct table. This traditional decomposition results in a convoluted network of interconnected tables that developers must constantly navigate to piece back together the information they need.

The cost of relational databases

Shrewd C-levels and enterprise portfolio managers are interested in managing cost and risk, not technology. Full stop. This decomposition into countless interconnected tables comes at a significant cost across multiple layers of the organization. Let's break down the cost of relational databases for three different personas/layers.

Developer and software layer

Let's imagine that as a developer you're dealing with a business application that must create and manage customers and their related insurance policies. That customer has addresses, coverages, and policies. Each policy has insured objects, and each object has its own specificities. If you're building on a relational database, you're likely dealing with a dozen or more database objects that represent the aggregate business object of a policy. In this design, all of these tables require you to break up the logical dataset into many parts, insert that data across many tables, and then execute complex JOIN operations when you wish to retrieve and edit it. As a developer, you're familiar with working with object-oriented design, and to you, all of those tables likely represent one to two major business objects: the customer and the policy. With MongoDB, these dozen or more relational database tables can be modeled as one single object (see Figure 2).

Figure 2. Relational database complexity vs. MongoDB document model for insurance policy data.

At actual business-application scale, with production data volumes, we start to see just how complicated this gets for developers. In order to render the policy meaningfully in the application user interface, its data must constantly be joined back together. When it's edited, it must again be split apart and saved into those dozen or more underlying database tables. Relational is therefore not only a more complex storage model, but it's also cognitively harder to figure out. It's not uncommon for a developer who didn't design the original database, and who is newer to the application team, to struggle to understand, or even misinterpret, a legacy relational model. Additionally, the normalized relational model requires more code to be written for basic create, update, and read operations. An object-relational mapping layer will often be introduced to help translate the split-apart representation in the database into an interpretation that the application code can more easily navigate. Why is this so relevant? Because more code equals more developer time and, ultimately, more cost.
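To illustrate the single-object model from Figure 2, here is a minimal PyMongo sketch of an insurance policy aggregate stored as one document. All field names and values are invented for illustration; they are not taken from the figure.

from pymongo import MongoClient

policies = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")["insurance"]["policies"]

# One document captures what a normalized model spreads across a dozen tables:
# the policy, its customer, the coverages, and the insured objects with their specifics.
policies.insert_one({
    "policyNumber": "POL-2025-0042",
    "customer": {
        "name": "Jane Doe",
        "addresses": [
            {"type": "home", "street": "12 Harbour Rd", "city": "Rotterdam"},
        ],
    },
    "coverages": [
        {"type": "collision", "deductible": 500, "limit": 25000},
        {"type": "liability", "limit": 1000000},
    ],
    "insuredObjects": [
        {"kind": "vehicle", "vin": "WDB12345", "year": 2022},
        {"kind": "boat", "hullId": "NL-8842", "length_m": 7.5},  # different shape, same array
    ],
})

# Reading the whole business object back is a single query, with no JOINs to reassemble.
policy = policies.find_one({"policyNumber": "POL-2025-0042"})

Note how the two insured objects have different attributes yet live side by side in the same array, which is exactly the kind of variability that forces additional child tables in a normalized design.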
Overall, it takes noticeably longer to design, build, and test a business feature using a relational database than it would with a database like MongoDB. Finally, changing a relational schema is a cumbersome process. ALTER TABLE statements are required to change the underlying database object structure. Since relational tables are like spreadsheets, they can only have one schema at any given point in time. Your business feature requires you to add new fields? You must alter the single, fixed schema that is bound to the underlying table. This might seem to be a quick and easy process to execute in a development environment, but by the time you get to the production database, deliberate care and caution must be applied, and extra steps are mandatory to ensure that you do not jeopardize the integrity of the business applications that use the database. Altering production table objects incurs significant risk, so organizations must put in place lengthy and methodical processes that ensure change is thoroughly tested and scheduled, in order to minimize possible disruption. The fundamental premise of normalization, and its corresponding single, rigid, and predefined table structures, is a constant bottleneck when it comes to speed and cost to market.

Infrastructure administrator

Performing JOIN operations across multiple database objects at runtime requires more computational resources than retrieving all of the data you need from a single database object. If your applications are running against well-designed, normalized relational databases, your infrastructure is most certainly feeling the resource impact of those joins. Across a portfolio of applications, the hardware costs of normalization add up. For a private data center, it can mean the need to procure additional, expensive hardware. For the cloud, it likely means your overall spending is higher than that of a portfolio running on a more efficient design (like MongoDB's Document Model). Ultimately, MongoDB allows more data-intensive workloads to be run on the same server infrastructure than relational databases do, and this directly translates to lower infrastructure costs.

In addition to being inefficient at the hardware layer, normalized relational tables result in complex ways in which the data must be conditionally joined together and queried, especially within the context of actual business rules. Application developers have long pushed this complex logic "down to the database" in an effort to reduce complexity at the application layer, as well as to preserve application-tier memory and CPU. This decades-long practice can be found across every industry, and in nearly every flavor and variant of relational database platform. The impact is manifold. Database administrators, or those specialized in writing and modifying complex SQL stored procedures, are often called upon to augment the application developers who maintain code at the application tier. This external dependency certainly slows down delivery teams tasked with making changes to these applications, but it's just the tip of the iceberg. Below the waterline, there exists a wealth of complexity. Critical application business logic ends up bifurcated: some in the database as SQL, and some in the application tier in a programming language. The impact on teams wishing to modernize or refactor legacy applications is significant in terms of the level of complexity that must be dealt with.
At the root of this complexity is the premise of normalized database objects, which would otherwise be a challenge to join and search if handled at the application tier.

Portfolio manager

An Application Portfolio Manager is responsible for overseeing an organization's suite of software applications, ensuring they align with business goals, provide value, and are managed efficiently. The role typically involves evaluating, categorizing, and rationalizing application catalogs to reduce redundancy, lower costs, and enhance the overall ability to execute the business strategy. In short, the portfolio manager cares deeply about speed, complexity, and cost to market. At a macro level, a portfolio built on relational databases translates into slower teams that deliver fewer features per agile cycle. In addition, a larger staff is needed, as database and infrastructure admins become a necessary interface between the developers and the database. Unlike relational databases, MongoDB allows developers to maintain more than one version of a schema at a given time. In addition, documents contain both data and structure, which means you don't need the complex, lengthy, and risky change cycles that relational databases demand simply to add or edit fields in the database. The result? Software teams deliver more features than is possible with relational databases, with less time, cost, and complexity. That's something the business owners of the portfolio will certainly appreciate, even if they don't understand the underlying technology. Add in the fact that MongoDB runs more efficiently than relational databases on the same hardware, and your portfolio will see even more cost benefits.

Beyond relational databases: A new path to efficiency and agility

The fundamental premise of normalization, and its corresponding single, rigid, and predefined table structures, is a constant bottleneck when it comes to speed, cost, and complexity to market. At a time when the imperative is to leverage AI to lower operating expenses, the cost, complexity, and agility of the underlying database infrastructure need to be scrutinized. In contrast, MongoDB's flexible Document Model offers a superior, generational step-change forward: one that enables your developers to move more quickly, runs more efficiently on any hardware (yours or a cloud data center's), and increases your application portfolio's speed to market for advancing the business agenda. Transform your enterprise data architecture today. Start with our free Overview of MongoDB and the Document Model course at MongoDB University, then experience the speed and flexibility firsthand with a free MongoDB Atlas cluster.

July 7, 2025

Real-Time Threat Detection With MongoDB & PuppyGraph

Security operations teams face an increasingly complex environment. Cloud-native applications, identity sprawl, and continuous infrastructure changes generate a flood of logs and events. From API calls in AWS to lateral movement between virtual machines, the volume of telemetry is enormous—and it's growing. The challenge isn't just scale. It's structure. Traditional security tooling often looks at events in isolation, relying on static rules or dashboards to highlight anomalies. But real attacks unfold as chains of related actions: a user assumes a role, launches a resource, accesses data, and then pivots again. These relationships are hard to capture with flat queries or disconnected logs. That's where graph analytics comes in. By modeling your data as a network of users, sessions, identities, and events, you can trace how threats emerge and evolve. And with PuppyGraph, you don't need a separate graph database or batch pipelines to get there. In this post, we'll show how to combine MongoDB and PuppyGraph to analyze AWS CloudTrail data as a graph—without moving or duplicating data. You'll see how to uncover privilege escalation chains, map user behavior across sessions, and detect suspicious access patterns in real time.

Why MongoDB for cybersecurity data

MongoDB is a popular choice for managing security telemetry. Its document-based model is ideal for ingesting unstructured and semi-structured logs like those generated by AWS CloudTrail, GuardDuty, or Kubernetes audit logs. Events are stored as flexible JSON documents, which evolve naturally as logging formats change. This flexibility matters in security, where schemas can shift as providers update APIs or teams add new context to events. It also supports high-throughput ingestion and horizontal scaling, making it well suited for operational telemetry. Many security products and SIEM backends already support MongoDB as a destination for real-time event streams. That makes it a natural foundation for graph-based security analytics: the data is already there—rich, semi-structured, and continuously updated.

Why graph analytics for threat detection

Modern security incidents rarely unfold as isolated events. Attackers don't just trip a single rule—they navigate through systems, identities, and resources, often blending in with legitimate activity. Understanding these behaviors means connecting the dots across multiple entities and actions. That's precisely what graph analytics excels at. By modeling users, sessions, events, and assets as interconnected nodes and edges, analysts can trace how activity flows through a system. This structure makes it easy to ask questions that involve multiple hops or indirect relationships—something traditional queries often struggle to express. For example, imagine you're investigating activity tied to a specific AWS account. You might start by counting how many sessions are associated with that account. Then, you might break those sessions down by whether they were authenticated using MFA. If some weren't, the next question becomes: what resources were accessed during those unauthenticated sessions? This kind of multi-step investigation is where graph queries shine. Instead of scanning raw logs or filtering one table at a time, you can traverse the entire path from account to identity to session to event to resource, all in a single query.
You can also group results by attributes like resource type to identify which services were most affected. And when needed, you can go beyond metrics and pivot to visualization, mapping out full access paths to see how a specific user or session interacted with sensitive infrastructure. This helps surface lateral movement, track privilege escalation, and uncover patterns that static alerts might miss. Graph analytics doesn't replace your existing detection rules; it complements them by revealing the structure behind security activity. It turns complex event relationships into something you can query directly, explore interactively, and act on with confidence.

Query MongoDB data as a graph without ETL

MongoDB is a popular choice for storing security event data, especially when working with logs that don't always follow a fixed structure. Services like AWS CloudTrail produce large volumes of JSON-based records with fields that can differ across events. MongoDB's flexible schema makes it easy to ingest and query that data as it evolves. PuppyGraph builds on this foundation by introducing graph analytics—without requiring any data movement. Through the MongoDB Atlas SQL Interface, PuppyGraph can connect directly to your collections and treat them as relational tables. From there, you define a graph model by mapping key fields into nodes and relationships.

Figure 1. Architecture of the integration of MongoDB and PuppyGraph.

This makes it possible to explore questions that involve multiple entities and steps, such as tracing how a session relates to an identity or which resources were accessed without MFA. The graph itself is virtual. There's no ETL process or data duplication. Queries run in real time against the data already stored in MongoDB. While PuppyGraph works with tabular structures exposed through the SQL interface, many security logs already follow a relatively flat pattern: consistent fields like account IDs, event names, timestamps, and resource types. That makes it straightforward to build graphs that reflect how accounts, sessions, events, and resources are linked. By layering graph capabilities on top of MongoDB, teams can ask more connected questions of their security data, without changing their storage strategy or duplicating infrastructure.

Investigating CloudTrail activity using graph queries

To demonstrate how graph analytics can enhance security investigations, we'll explore a real-world dataset of AWS CloudTrail logs. This dataset originates from flaws.cloud, a security training environment developed by Scott Piper. The dataset comprises anonymized CloudTrail logs collected over 3.5 years, capturing a wide range of simulated attack scenarios within a controlled AWS environment. It includes over 1.9 million events, featuring interactions from thousands of unique IP addresses and user agents. The logs encompass various AWS API calls, providing a comprehensive view of potential security events and misconfigurations. For our demonstration, we imported a subset of approximately 100,000 events into MongoDB Atlas. By importing this dataset into MongoDB Atlas and applying PuppyGraph's graph analytics capabilities, we can model and analyze complex relationships between accounts, identities, sessions, events, and resources.

Demo

Let's walk through the demo step by step! We have provided all the materials for this demo on GitHub. Please download the materials or clone the repository directly.
If you're new to integrating MongoDB Atlas with PuppyGraph, we recommend starting with the MongoDB Atlas + PuppyGraph Quickstart Demo to get familiar with the setup and core concepts.

Prerequisites

A MongoDB Atlas account (free tier is sufficient)
Docker
Python 3

Set up MongoDB Atlas

Follow the MongoDB Atlas Getting Started guide to:

Create a new cluster (free tier is fine).
Add a database user.
Configure IP access.
Note your connection string for the MongoDB Python driver (you'll need it shortly).

Download and import CloudTrail logs

Run the following commands to fetch and prepare the dataset:

wget https://summitroute.com/downloads/flaws_cloudtrail_logs.tar
mkdir -p ./raw_data
tar -xvf flaws_cloudtrail_logs.tar --strip-components=1 -C ./raw_data
gunzip ./raw_data/*.json.gz

Create a virtual environment and install dependencies:

# On some Linux distributions, install `python3-venv` first.
sudo apt-get update
sudo apt-get install python3-venv

# Create a virtual environment, activate it, and install the necessary packages
python -m venv venv
source venv/bin/activate
pip install ijson faker pandas pymongo

Import the first chunk of CloudTrail data (replace the connection string with your Atlas URI):

export MONGODB_CONNECTION_STRING="your_mongodb_connection_string"
python import_data.py raw_data/flaws_cloudtrail00.json --database cloudtrail

This creates a new cloudtrail database and loads the first chunk of data containing 100,000 structured events.

Enable Atlas SQL interface and get JDBC URI

To enable graph access:

Create an Atlas SQL Federated Database instance.
Ensure the schema is available (generate from sample, if needed).
Copy the JDBC URI from the Atlas SQL interface.

See PuppyGraph's guide for setting up MongoDB Atlas SQL.

Start PuppyGraph and upload the graph schema

Start the PuppyGraph container:

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -d --name puppy --rm --pull=always puppygraph/puppygraph:stable

Log in to the web UI at http://localhost:8081 with:

Username: puppygraph
Password: puppygraph123

Upload the schema:

Open schema.json.
Fill in your JDBC URI, username, and password.
Upload via the Upload Graph Schema JSON section or run:

curl -XPOST -H "content-type: application/json" \
  --data-binary @./schema.json \
  --user "puppygraph:puppygraph123" localhost:8081/schema

Wait for the schema to upload and initialize (approximately five minutes).

Figure 2. A graph visualization of the schema, which models the graph from relational data.

Run graph queries to investigate security activity

Once the graph is live, open the Query panel in PuppyGraph's UI. Let's say we want to investigate the activity of a specific account. First, we count the number of sessions associated with the account.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN count(s)

Gremlin:

g.V("Account[811596193553]")
  .out("HasIdentity").out("HasSession").count()

Figure 3. Graph query in the PuppyGraph UI.

Then, we want to see how many of these sessions are MFA-authenticated or not.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN s.mfa_authenticated AS mfaStatus, count(s) AS count

Gremlin:

g.V("Account[811596193553]")
  .out("HasIdentity").out("HasSession")
  .groupCount().by("mfa_authenticated")

Figure 4. Graph query results in the PuppyGraph UI.
Next, we investigate the sessions that are not MFA-authenticated and see what resources they accessed.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session {mfa_authenticated: false})
      -[:RecordsEvent]->(e:Event)-[:OperatesOn]->(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN r.resource_type AS resourceType, count(r) AS count

Gremlin:

g.V("Account[811596193553]").out("HasIdentity")
  .out("HasSession").has("mfa_authenticated", false)
  .out("RecordsEvent").out("OperatesOn")
  .groupCount().by("resource_type")

Figure 5. PuppyGraph UI showing results that are not MFA-authenticated.

Finally, we show those access paths as a graph.

Cypher:

MATCH path = (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session {mfa_authenticated: false})
             -[:RecordsEvent]->(e:Event)-[:OperatesOn]->(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN path

Gremlin:

g.V("Account[811596193553]").out("HasIdentity").out("HasSession").has("mfa_authenticated", false)
  .out("RecordsEvent").out("OperatesOn")
  .path()

Figure 6. Graph visualization in PuppyGraph UI.

Tear down the environment

When you're done:

docker stop puppy

Your MongoDB data will persist in Atlas, so you can revisit or expand the graph model at any time.

Conclusion

Security data is rich with relationships between users, sessions, resources, and actions. Modeling these connections explicitly makes it easier to understand what's happening in your environment, especially when investigating incidents or searching for hidden risks. By combining MongoDB Atlas and PuppyGraph, teams can analyze those relationships in real time without moving data or maintaining a separate graph database. MongoDB provides the flexibility and scalability to store complex, evolving security logs like AWS CloudTrail, while PuppyGraph adds a native graph layer for exploring that data as connected paths and patterns. In this post, we walked through how to import real-world audit logs, define a graph schema, and investigate access activity using graph queries. With just a few steps, you can transform a log collection into an interactive graph that reveals how activity flows across your cloud infrastructure. If you're working with security data and want to explore graph analytics on MongoDB Atlas, try PuppyGraph's free Developer Edition. It lets you query connected data, such as users, sessions, events, and resources, all without ETL or infrastructure changes.

July 7, 2025