Unlock Multi-Agent AI Predictive Maintenance with MongoDB

Humza Akhtar, Rami Pinto Prieto, and Raphael Schor
August 18, 2025

The manufacturing sector is navigating a growing number of challenges: evolving customer demands, intricate software-mechanical product integrations, just-in-time global supply chains, and a shrinking skilled labor force. Meanwhile, the entire sector is working under intense pressure to improve productivity, manage energy consumption, and keep costs in check. To stay competitive, the industry is undergoing a digital transformation—and data is at the center of that shift.

Data-driven manufacturing offers a powerful answer to many of these challenges. On the shop floor, one of the most critical and high-impact applications of these strategies is predictive maintenance. Downtime isn’t just inconvenient; it’s expensive. For example, every unproductive hour in the automotive sector now costs $2.3 million, according to Siemens’ "The True Cost of Downtime 2024" report. For manufacturers across all sectors, predictive maintenance is no longer optional. It’s a foundational pillar of operational excellence.

At its core, predictive maintenance is about using data to anticipate machine failures before they happen. It began with traditional statistical models, evolved with machine learning, and is now entering a new era. As equipment ages and failure behaviors shift, models must adapt. This has led to the adoption of more advanced approaches, including generative AI with retrieval-augmented generation (RAG) capabilities.

But the next frontier is multi-agent systems—AI-powered agents working together to monitor, reason, and act. We’ve explored how generative AI powers predictive maintenance in previous posts. In this blog post, we’ll go deeper into multi-agent systems and how MongoDB makes it easy to build and scale them for smart, responsive maintenance strategies.

Advance your data-driven manufacturing strategy with Agentic AI

AI agents combine large language models (LLMs) with tools, memory, and logic to autonomously handle complex tasks. On the shop floor, this means agents can automate inspections, reoptimize production schedules, assist with fault diagnostics, and more. According to a LangChain survey, 78% of companies are actively developing AI agents, and over half already have at least one agent in production. Manufacturing companies can especially benefit from agentic capabilities across a great variety of practical use cases, as shown in Figure 1.

Figure 1. Agent capabilities and related practical use cases in manufacturing.

On the left of this diagram is a list of agent capabilities, which includes managing multi-step tasks, automating repetitive tasks, task routing and collaboration, and human-like reasoning. On the right is a list of practical use cases, which includes production scheduling, supply chain orchestration, multi-stage machine fault diagnostics, auto-generated work orders, quality inspection reports, data logging and compliance, work order routing, inter-departmental collaboration, recall coordination, production re-optimization, complex fault diagnostics, and context-aware maintenance assistance.

But leveraging AI agents in industrial environments presents unique challenges. Integration with industrial protocols like Modbus or PROFINET is complex. Governance and security requirements are strict, especially when agents interact with production equipment. Latency is also a concern as AI models need fast, reliable data access to support real-time responses. And with agents generating and consuming large volumes of data, companies need a data foundation that is reliable and can scale without sacrificing performance.

Many of these challenges are not new to manufacturers—and MongoDB has a proven track record of addressing them. Industry leaders in manufacturing and automotive trust MongoDB to power critical IoT and telemetry use cases. Bosch, for example, uses MongoDB to store, manage, and analyze huge amounts of data to power its Bosch IoT Insights solution. MongoDB’s flexible document model is ideal for diverse sensor inputs and machine telemetry, while allowing systems to iterate and evolve quickly.

It’s important to remember that, at its core, MongoDB was built for change, so when it comes to integrating AI on the shop floor, it’s no surprise that MongoDB is emerging as the ideal data layer foundation. Companies like Novo Nordisk and Cisco rely on MongoDB to build and scale their AI capabilities, and leading platforms like XMPro APEX AI leverage MongoDB Atlas to create and manage advanced AI agents for industrial applications.

MongoDB Atlas makes it easy to build AI agents and operate them at scale. As both a vector and a document database, Atlas supports various search methods for agentic RAG, while also enabling agents to store short- and long-term memory in the same database. The result is a unified data layer that bridges industrial IoT and agentic AI. Predictive maintenance is a perfect example of how these capabilities come together to drive real impact on the shop floor. In the next section, we’ll walk through a practical blueprint for building a multi-agent predictive maintenance system using MongoDB Atlas.

Building a multi-agent predictive maintenance system

This solution demonstrates how to build a multi-agent predictive maintenance system using MongoDB Atlas, LangGraph, and Amazon Bedrock. This system can streamline complex processes, such as detecting equipment anomalies, diagnosing root causes, generating work orders, and scheduling maintenance. At a high level, this solution leverages MongoDB Atlas as the unified data layer. LangGraph provides the orchestration layer, enabling graph-based coordination among agents, while Amazon Bedrock provides the underlying foundation models that the agents use to reason and make decisions.

The architecture follows a supervisor-agent pattern. The supervisor coordinates tasks and delegates to three specialized agents:

Failure agent, which performs root cause analysis and generates incident reports.
Work order agent, which drafts maintenance work orders with detailed requirements.
Planning agent, which identifies the optimal time slot for the maintenance task based on availability and production constraints.

Figure 2. High-level architecture of a multi-agent predictive maintenance system.

On the top left of this diagram is machine telemetry, which sends data to failure prediction ML inference. This in turn sends alerts to MongoDB Atlas and to LangGraph. At the middle left are users, who connect to the maintenance application, which connects to LangGraph. LangGraph contains the AI agents and sends data to Amazon Bedrock and MongoDB Atlas.
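The supervisor-agent pattern described above can be illustrated with a small, framework-free sketch. This is plain Python rather than LangGraph code, and every name in it (`WorkflowState`, `supervisor`, the three placeholder agent functions) is an assumption made for illustration. In a real deployment each agent would call an LLM and read from or write to the database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WorkflowState:
    """Shared state handed from the supervisor to each agent (hypothetical shape)."""
    alert: dict
    incident_report: Optional[str] = None
    work_order: Optional[dict] = None
    schedule: Optional[str] = None
    history: List[str] = field(default_factory=list)


# Placeholder agents: each fills in one field of the shared state.
def failure_agent(state: WorkflowState) -> WorkflowState:
    state.incident_report = f"Incident report for machine {state.alert['machine_id']}"
    return state


def work_order_agent(state: WorkflowState) -> WorkflowState:
    state.work_order = {"summary": state.incident_report, "status": "draft"}
    return state


def planning_agent(state: WorkflowState) -> WorkflowState:
    state.schedule = "next available maintenance window"
    return state


AGENTS: Dict[str, Callable[[WorkflowState], WorkflowState]] = {
    "failure_agent": failure_agent,
    "work_order_agent": work_order_agent,
    "planning_agent": planning_agent,
}


def supervisor(state: WorkflowState) -> Optional[str]:
    """Decide which agent runs next based on what is still missing."""
    if state.incident_report is None:
        return "failure_agent"
    if state.work_order is None:
        return "work_order_agent"
    if state.schedule is None:
        return "planning_agent"
    return None  # nothing left to do


def run_workflow(state: WorkflowState) -> WorkflowState:
    """Loop: ask the supervisor for the next agent until the workflow is done."""
    next_agent = supervisor(state)
    while next_agent is not None:
        state = AGENTS[next_agent](state)
        state.history.append(next_agent)
        next_agent = supervisor(state)
    return state
```

Calling `run_workflow(WorkflowState(alert={"machine_id": "M-1"}))` runs the three agents in order. An orchestration framework adds what this loop lacks, such as persistence, branching, and human-in-the-loop interrupts.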

This modular design enables the system to scale easily and adapt to different operational needs. Let’s walk through the full process in four key steps.

Step 1: Failure prediction kicks off the agentic workflow

The process begins with an alert: something unusual in the machine data or logs that could point to a potential failure. MongoDB provides a unified view of operational data, real-time processing capabilities, and seamless compatibility with machine learning tools. Sensor data is processed in real time using Atlas Stream Processing integrated with ML inference models. Native support for time series data, together with Online Archive, helps manage telemetry data efficiently at scale. Meanwhile, Atlas Triggers, Change Streams, and Atlas Charts keep downstream applications up to date with the latest notifications and dashboards. From there, the supervisor agent takes over and coordinates the next steps.

Figure 3. End-to-end failure prediction process that generates the alerts.

The left of this diagram begins with the shop floor, which sends data to the cloud that contains the Kafka Cluster, the machine learning inference, and MongoDB Atlas. Atlas then generates data, which creates real-time alerts, real-time visualization, and batch analytics.
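As a rough illustration of the alerting step, the sketch below turns one telemetry reading into an alert document when a failure probability crosses a threshold. The field names, the thresholds, and `toy_model` are all assumptions for illustration. `toy_model` is only a stand-in for the ML inference model, which the post does not specify.

```python
from typing import Callable, Optional


def build_alert(
    reading: dict,
    predict_failure_probability: Callable[[dict], float],
    threshold: float = 0.8,  # assumed cut-off, not taken from the solution
) -> Optional[dict]:
    """Return an alert document for a reading, or None if it looks healthy."""
    probability = predict_failure_probability(reading)
    if probability < threshold:
        return None
    return {
        "machine_id": reading["machine_id"],
        "ts": reading["ts"],
        "failure_probability": probability,
        "severity": "high" if probability >= 0.95 else "medium",
        # Keep the raw sensor values so later agents can use them as context.
        "telemetry": {
            key: value
            for key, value in reading.items()
            if key not in ("machine_id", "ts")
        },
        "status": "open",
    }


def toy_model(reading: dict) -> float:
    """Stand-in for a real model: maps 60..100 degrees C linearly onto 0..1."""
    return max(0.0, min(1.0, (reading["temperature_c"] - 60) / 40))
```

The resulting dictionary could be inserted into an alerts collection, where a change stream or trigger would notify the supervisor agent.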

Step 2: Leverage your data for root cause analysis

The supervisor notifies the Failure Agent about the alert. Manually diagnosing a machine can take hours of sifting through manuals, historical logs, and environmental data. The AI agent automates this process. It collects relevant documents, retrieves contextual insights using Atlas Vector Search, and analyzes environmental conditions stored in the database, such as temperature or humidity at the time of failure. With this data, the agent performs a root cause analysis and proposes corrective actions. It generates a concise incident report and shares it with the supervisor agent, which then moves the workflow forward.

Figure 4. Failure Agent performing root cause analysis.

The left of this diagram starts with Alerts, which connects to the failure agent. The failure agent then sends data to the LLM inference and embedding models, which send data to MongoDB Atlas. Atlas also ingests data from external sources, such as documentation, ERP/MES and environment. Atlas then generates incident reports.
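The retrieval part of this step can be sketched as an aggregation pipeline built around the `$vectorSearch` stage. The index name, the field names (`embedding`, `machine_type`, `text`, `source`), and the candidate multiplier are assumptions for illustration. The query vector is assumed to come from the same embedding model that was used to index the documents.

```python
from typing import List, Optional


def build_manual_search_pipeline(
    query_vector: List[float],
    machine_type: Optional[str] = None,
    limit: int = 5,
) -> List[dict]:
    """Build an aggregation pipeline that retrieves the most similar manual chunks."""
    vector_stage = {
        "index": "manuals_vector_index",  # hypothetical index name
        "path": "embedding",              # hypothetical field holding the vectors
        "queryVector": query_vector,
        "numCandidates": limit * 20,      # assumed oversampling factor
        "limit": limit,
    }
    if machine_type is not None:
        # Pre-filtering requires the field to be declared as a filter
        # field in the vector index definition.
        vector_stage["filter"] = {"machine_type": {"$eq": machine_type}}
    return [
        {"$vectorSearch": vector_stage},
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "source": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With PyMongo, the returned list would be passed to `collection.aggregate(pipeline)`. The agent can then place the retrieved text in the LLM prompt alongside the alert and the environmental readings.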

Step 3: Work order process automation

The Work Order Agent receives the incident report and drafts a comprehensive maintenance work order. It pulls from previous similar tasks to estimate time requirements, identify the necessary materials, and ensure the right skill sets are listed. All of this is pre-filled into a standardized work order template and saved back into MongoDB Atlas. This step also includes a human-in-the-loop checkpoint. Technicians or supervisors can review and modify the draft before it is finalized.

Figure 5. Work Order Agent generating a draft work order and routing it for human validation.

This diagram is a continuation of the last one. Starting on the left with incident reports, the reports connect to the work order agent, which sends data to MongoDB Atlas via LLM inference and embedding models. Atlas again pulls in data from external sources. Finally, Atlas generates work orders.
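The drafting logic and the human-in-the-loop checkpoint can be approximated as follows. This is a minimal sketch under stated assumptions: the document fields are hypothetical, and taking the median duration of similar past work orders is one plausible estimation rule, not necessarily the one the solution uses.

```python
from statistics import median
from typing import List


def draft_work_order(incident_report: dict, similar_orders: List[dict]) -> dict:
    """Pre-fill a work order from an incident report and similar past orders."""
    durations = [order["duration_hours"] for order in similar_orders]
    materials = sorted({m for order in similar_orders for m in order["materials"]})
    skills = sorted({s for order in similar_orders for s in order["skills"]})
    return {
        "machine_id": incident_report["machine_id"],
        "description": incident_report["root_cause"],
        "estimated_hours": median(durations) if durations else None,
        "materials": materials,
        "skills": skills,
        "status": "pending_review",  # waits for a human before it is finalized
    }


def approve(work_order: dict, reviewer: str, **changes) -> dict:
    """Apply the reviewer's edits and mark the work order as approved."""
    return {**work_order, **changes, "status": "approved", "approved_by": reviewer}
```

Keeping the draft in a `pending_review` state until a technician or supervisor calls something like `approve` is what keeps a person in control of the final work order.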

Step 4: Finding the optimal maintenance schedule

Once the work order is approved, the Planning Agent steps in. Its task is to schedule the maintenance activity without disrupting production. The agent queries the production calendar, checks staff shift schedules, and verifies inventory availability for required materials. It considers alert severity and rescheduling constraints to find the most efficient time slot. Once the optimal window is identified, the agent sends the updated plan to the scheduling system.

Figure 6. Planning Agent evaluating constraints to identify the optimal maintenance schedule.

This diagram again continues the last. On the left are work orders, which are sent to the planning agent, which sends data to MongoDB Atlas via LLM inference and embedding models. Atlas takes in external data and then generates a new production plan.
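To make the scheduling constraints concrete, here is a simplified earliest-fit search. It is a sketch under stated assumptions: it checks only the production calendar and technician shifts, and it leaves out inventory, alert severity, and rescheduling constraints. All function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Window = Tuple[datetime, datetime]  # (start, end)


def overlaps(start: datetime, end: datetime, window: Window) -> bool:
    """True if the interval [start, end) intersects the given window."""
    return start < window[1] and window[0] < end


def find_maintenance_slot(
    duration: timedelta,
    production_runs: List[Window],
    technician_shifts: List[Window],
    search_start: datetime,
    search_end: datetime,
    step: timedelta = timedelta(minutes=30),
) -> Optional[Window]:
    """Return the earliest slot that avoids production and fits inside a shift."""
    start = search_start
    while start + duration <= search_end:
        end = start + duration
        machine_free = not any(overlaps(start, end, run) for run in production_runs)
        staffed = any(s <= start and end <= e for s, e in technician_shifts)
        if machine_free and staffed:
            return (start, end)
        start += step
    return None  # no feasible slot in the search horizon
```

For example, with production runs from 06:00 to 12:00 and 14:00 to 22:00 and a technician shift from 08:00 to 16:00, a two-hour task lands in the 12:00 to 14:00 gap, while a three-hour task finds no slot that day. A production planner would more likely score candidate windows by severity and cost than take the first feasible one.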

While we focused on a predictive maintenance workflow, this architecture can be easily extended. Need agents for compliance reporting, spare parts procurement, or shift planning? No problem. With the right foundation, the possibilities are endless.

Unlocking manufacturing excellence with Agentic AI

Agentic AI represents a new chapter in the evolution of predictive maintenance, enabling manufacturers to move from reactive responses to intelligent, autonomous decision-making. By combining AI agents with real-time telemetry and a unified data foundation, teams can reduce downtime, cut maintenance costs, and boost equipment reliability. But to work at scale, these systems need flexible, high-performance infrastructure. With native support for time series data, vector search, stream processing, and more, MongoDB makes it easier to build, operate, and evolve multi-agent solutions in complex industrial environments. The result is smarter operations, greater resilience, and a clear path to manufacturing excellence.

Clone the GitHub repository if you are interested in trying out this solution yourself. To learn more about MongoDB’s role in the manufacturing industry, please visit our manufacturing and automotive webpage.
