DaaS with MongoDB and Confluent

An operational data layer (ODL) is an architectural pattern that centrally integrates and organizes siloed enterprise data, making it available to consuming applications. It enables a range of board-level strategic initiatives such as legacy modernization and data as a service (DaaS), and use cases such as single view, real-time analytics, and mainframe offload. The simplest representation of this pattern is something like the diagram shown in Figure 1. An ODL is an intermediary between existing data sources and consumers that need to access that data.

An ODL deployed in front of legacy systems can enable new business initiatives and meet new requirements that the existing architecture can't handle -- without the difficulty and risk of a full rip and replace of legacy systems. It can reduce the workload on source systems, improve availability, reduce end-user response times, combine data from multiple systems into a single repository, serve as a foundation for re-architecting a monolithic application into a suite of microservices, and more. The ODL becomes a system of innovation, allowing the business to take an iterative approach to digital transformation.

Figure 1: An ODL centrally integrates and organizes siloed enterprise data, making it available to consuming applications.

Architecture

Figure 2: ODL architecture

Source Systems and Data Producers

Source systems and data producers are usually databases, but sometimes they are other systems or data stores. Generally, they are systems of record for one or more applications, either off-the-shelf packaged apps (ERP, CRM, and so forth) or internally developed custom apps.

In some cases, there may be only one source system feeding the ODL. Usually, this is the case if the main goal of implementing an ODL is to add an abstraction layer on top of that single system. This could be for the purpose of caching or offloading queries from the source system, or it could be to create an opportunity to revise the data model for modernization or new uses that don't fit with the structure of the existing source system. An ODL with a single source system is most useful when the source is a highly used system of record and/or is unable to handle new demands being placed on it; often, this is a mainframe. More often, there are multiple source systems. In this case, the ODL can unify disparate datasets, providing a complete picture of data that otherwise would not be available.

Consuming Systems

An ODL can support any consuming systems that require access to data. These can be either internal or customer-facing. Existing applications can be updated to access the ODL instead of the source systems, while new applications (often delivered as domains of microservices) typically will use the ODL first and foremost. The requirements of a single application may drive the initial implementation of an ODL, but usage usually expands to additional applications once the ODL’s value has been demonstrated to the business. An ODL can also feed analytics, providing insights that were not possible without a unified data system. Ad hoc analytical tools can connect to an ODL for an up-to-the-minute view of the company — without interfering with operational workloads — while the data can also support programmatic real-time analytics to drive richer user experiences with dashboards and aggregations embedded directly into applications.
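
As a concrete illustration, the short sketch below runs the kind of real-time aggregation a dashboard might embed, directly against the ODL. It is a minimal Python example using the PyMongo driver; the connection string, database, collection, and field names are all hypothetical.

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

# Hypothetical ODL endpoint and collection names.
client = MongoClient("mongodb://odl.example.net:27017")
orders = client["odl"]["orders"]

# Revenue per product category over the last 24 hours, computed in place
# against the ODL rather than against the operational source systems.
pipeline = [
    {"$match": {"created_at": {"$gte": datetime.utcnow() - timedelta(hours=24)}}},
    {"$group": {"_id": "$category", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["revenue"])
```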

Data Loading

For a successful ODL implementation, the data must be kept in sync with the source systems. Once the source systems and data producers have been identified, it's important to understand the frequency and volume of data changes in those systems. Similarly, consuming systems should have clear requirements for data currency. Once you understand both, it's much easier to develop an appropriate data loading strategy.

1. Batch extract and load. This typically is used for an initial one-time operation to load data from source systems. Batch operations extract all required records from the source systems and load them into the ODL for subsequent merging. If none of the consuming systems requires up-to-the-second data currency and overall data volumes are low, it may also suffice to refresh the entire dataset with periodic (daily/weekly) data refreshes. Batch operations are also a good fit for producers that are reference data sources, where data changes are typically less frequent — for example, country codes, office locations, tax codes, and similar data. Commercial extract, transform, and load (ETL) tools or custom implementations with Confluent (built on Apache Kafka) are used for carrying out batch operations: extracting data from producers, transforming the data as needed, and then loading it into the ODL. If, after the initial load, the development team discovers that the transformation logic needs further refinement, the related data may need to be dropped from the ODL and the initial load repeated. A minimal sketch of such a batch load follows this list.

2. Delta extract and load. This is an ongoing operation that propagates incremental updates committed to the source systems into the ODL, in real time. To maintain synchronization between the source systems and the ODL, it's important that the delta load starts immediately following the initial batch load. The frequency of delta operations can vary drastically. In some cases, they may be captured and propagated at regular intervals, for example every few hours. In other cases, they are event-based, propagated to the ODL as soon as new data is committed to the source systems. To keep the ODL current, most implementations use change data capture (CDC) mechanisms to catch the changes to source systems as they happen. Confluent is often used to store these real-time changes captured by the CDC mechanism, thanks to the many connectors available for a wide range of source technologies. After the changes are safely stored in Kafka, you can use a streaming application, an ETL process, or custom handlers to transform the data into the required format for the ODL (a sketch of a simple handler follows this list). Once the data is in the right format, you can use the MongoDB Connector for Apache Kafka as a sink to stream the new delta changes into the ODL. Increasingly, the streaming platform itself transforms the data, removing the need for a separate ETL mechanism.
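
For the batch path, the sketch below is one minimal way to perform the initial extract and load with Python and the PyMongo driver. The relational source, table, and field names are hypothetical stand-ins for whatever the real producer systems are; idempotent upserts make the load safe to repeat if the transformation logic changes.

```python
import sqlite3  # stand-in for any relational source system
from pymongo import MongoClient, ReplaceOne

source = sqlite3.connect("crm.db")  # hypothetical source database
odl = MongoClient("mongodb://odl.example.net:27017")["odl"]["customers"]

# Extract all required records, transform each row into a document,
# and load the documents into the ODL in one bulk operation.
rows = source.execute("SELECT id, name, country_code FROM customers")
ops = [
    ReplaceOne(
        {"_id": cid},
        {"_id": cid, "name": name, "country": country},
        upsert=True,  # re-running the load simply overwrites prior documents
    )
    for cid, name, country in rows
]
if ops:
    odl.bulk_write(ops, ordered=False)
```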
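
For the delta path, the sketch below shows a bare-bones custom handler: it consumes CDC events from a Kafka topic with Confluent's Python client and applies them to the ODL. In practice the MongoDB Connector for Apache Kafka sink usually replaces hand-written code like this; the broker address, topic name, and CDC event shape are assumptions.

```python
import json
from confluent_kafka import Consumer
from pymongo import MongoClient

consumer = Consumer({
    "bootstrap.servers": "broker.example.net:9092",  # hypothetical broker
    "group.id": "odl-delta-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["crm.customers.changes"])  # hypothetical CDC topic

customers = MongoClient("mongodb://odl.example.net:27017")["odl"]["customers"]

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    change = json.loads(msg.value())  # assumes JSON-encoded CDC events
    if change.get("op") == "delete":
        customers.delete_one({"_id": change["id"]})
    else:
        # Upsert keeps the ODL current whether the change is an insert or an update.
        customers.replace_one({"_id": change["id"]}, change["after"], upsert=True)
```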

Why MongoDB for DaaS?

Unified Data Infrastructure

The move to the cloud has brought efficiency and a self-service mindset by removing the operational and administrative blockers of traditional on-premises environments. However, developer workflows have remained relatively unchanged, as cloud lift-and-shift initiatives often replicate pre-existing data infrastructure complexities, including technology sprawl. MongoDB Atlas unifies transactional, operational, and real-time analytics into a single cloud-native platform and API for MongoDB users. This delivers a far better developer experience, because it makes data easier to manipulate, find, and analyze by eliminating the need for migrations across fragmented data services.

  • Atlas Online Archive: Gives users the ability to age out older data into cost-effective storage while still letting them easily query both warm and cold data with a single query.
  • Atlas Data Lake: Query heterogeneous data stored in Amazon S3 and MongoDB Atlas in place and in its native format by using the MongoDB Query Language (MQL).
  • Atlas Search: Build fast, Apache Lucene-based search capabilities on top of data in Atlas without the need to migrate it to a separate search platform (a query sketch follows this list).
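
As an example of the last point, the sketch below issues an Atlas Search query through the regular aggregation pipeline, so search results come from the same cluster that serves operational reads. It assumes a search index named "default" has already been created on the collection in Atlas; the cluster address, collection, and field names are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://cluster0.example.mongodb.net")  # hypothetical Atlas cluster
products = client["odl"]["products"]

# Full-text search over the "description" field, ranked by relevance score.
results = products.aggregate([
    {"$search": {
        "index": "default",
        "text": {"query": "wireless headphones", "path": "description"},
    }},
    {"$limit": 10},
    {"$project": {"name": 1, "score": {"$meta": "searchScore"}}},
])
for doc in results:
    print(doc["name"], doc["score"])
```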

Seamless Application Development

MongoDB Realm Mobile Database allows developers to store data locally on iOS and Android devices as well as IoT edge gateways by using a rich data model that's intuitive to them. Combined with MongoDB Realm Sync to Atlas, Realm makes it simple to build reactive, reliable apps that work even when users are offline.

MongoDB Realm allows developers to validate and build key features quickly. Application development services such as Realm Sync provide out-of-the-box bidirectional synchronization between the cloud and your devices. Realm offers additional services, including a GraphQL service that lets any GraphQL client query your data, as well as functions, triggers, and data access rules — ultimately simplifying the code required and enabling you to focus on adding business value to your applications instead of writing boilerplate code.

MongoDB is the Best Way for an ODL to Work with Data

Ease. MongoDB’s document model makes it simple to model — or remodel — data in a way that fits the needs of your applications. Documents are a natural way to describe data.

Flexibility. With MongoDB, there’s no need to predefine a schema. Documents are polymorphic: fields can vary from document to document within a single collection.

Speed. Using MongoDB for an ODL means you can get better performance when accessing data, and write less code to do so. In most legacy systems, accessing data for an entity, such as a customer, typically requires JOINing multiple tables together. JOINs entail a performance penalty, even when optimized — which takes time, effort, and advanced SQL skills.
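
The sketch below illustrates both points: a single customer document embeds data that a normalized relational schema would spread across several tables, so one read replaces a multi-table JOIN, and documents in the same collection are free to carry different fields. All names and values are illustrative.

```python
from pymongo import MongoClient

customers = MongoClient("mongodb://odl.example.net:27017")["odl"]["customers"]

# One document holds the whole entity; a relational model would need
# separate customer, address, and order tables joined at query time.
customers.insert_one({
    "_id": "C-1001",
    "name": "Acme Corp",
    "addresses": [
        {"type": "billing", "city": "Berlin"},
        {"type": "shipping", "city": "Hamburg"},
    ],
    "open_orders": [{"order_id": "O-42", "total": 1280.00}],
})

# A single round trip retrieves the complete entity -- no JOINs required.
customer = customers.find_one({"_id": "C-1001"})
```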

Versatility. Building on the ease, flexibility, and speed of the document model, MongoDB enables developers to satisfy a range of application requirements, both in the way data is modeled and in how it is queried.

Data access and APIs. Consuming systems require powerful and secure methods for accessing the data in the ODL. If the ODL also writes back to source systems, that channel needs to be handled as well. MongoDB’s drivers provide access to a MongoDB-based ODL from the language of your choice.

MongoDB Lets You Intelligently Distribute an ODL

Consuming systems depend on an ODL. It needs to be reliable and scalable and to offer a high degree of control over data distribution to meet latency and data sovereignty requirements.

Availability. MongoDB maintains multiple copies of data by using replica sets. Replica sets are self-healing, because failover and recovery are fully automated, so it is not necessary to manually intervene to restore a system in the event of a failure, or to add the additional clustering frameworks and agents that are needed for many legacy relational databases.
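 
A replica set is transparent to applications: with PyMongo, for example, the driver discovers the topology from the connection string and retries writes across a failover. The hostnames and replica set name below are hypothetical.

```python
from pymongo import MongoClient

# The driver discovers the replica set topology from the connection string,
# monitors all members, and reroutes operations automatically if the primary
# steps down; retryable writes cover the brief election window.
client = MongoClient(
    "mongodb://odl-node1:27017,odl-node2:27017,odl-node3:27017/"
    "?replicaSet=odl-rs&retryWrites=true&w=majority"
)
client["odl"]["customers"].update_one(
    {"_id": "C-1001"}, {"$set": {"status": "active"}}
)
```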

Scalability. To meet the needs of an ODL with large datasets and high throughput requirements, MongoDB provides horizontal scale-out on low-cost, commodity hardware or cloud infrastructure by using sharding.
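
As a rough sketch, sharding a large ODL collection takes only a couple of administrative commands once a sharded cluster is running. The database, collection, and shard key below are assumptions; choosing the right shard key for real workloads deserves more care than this example implies.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.net:27017")  # assumes a mongos router

# Enable sharding for the ODL database, then distribute the collection
# across shards on a hashed key so writes spread evenly.
client.admin.command("enableSharding", "odl")
client.admin.command("shardCollection", "odl.customers", key={"customer_id": "hashed"})
```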

Workload isolation. MongoDB’s replication provides a foundation for combining different classes of workload on the same MongoDB cluster, each workload operating against its own copy of the data.
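
One common way to express this with the driver is a tag-aware read preference: analytics queries are routed to secondaries tagged for that workload while operational traffic stays on the primary. The tag name and query below are illustrative and assume the replica set members have been configured with matching tags.

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://odl.example.net:27017/?replicaSet=odl-rs")

# Direct this handle's reads at secondaries tagged for analytics so heavy
# aggregations never compete with operational reads on the primary.
analytics = client["odl"].get_collection(
    "customers",
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}]),
)
by_region = list(analytics.aggregate([
    {"$group": {"_id": "$region", "customers": {"$sum": 1}}}
]))
```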

Data locality. MongoDB allows precise control over where data is physically stored in a single logical cluster. For example, data placement can be controlled by geographic region for latency and governance requirements, or by hardware configuration and application features to meet specific classes of service for different consuming systems.

MongoDB Gives You the Freedom to Run Anywhere

Portability. MongoDB runs the same everywhere: on premises in your data centers, on developers’ laptops, in the cloud, or as an on-demand fully managed DaaS: MongoDB Atlas.

Global coverage. MongoDB’s distributed architecture allows a single logical cluster to be distributed around the world, situating data close to users. When you use MongoDB Atlas, global coverage is even easier; Atlas supports more than 70 regions across all the major cloud providers.

No lock-in. With MongoDB, you can reap the benefits of a multi-cloud strategy. Since Atlas clusters can be deployed on all major cloud providers, you get the advantage of an elastic, fully managed service without being locked into a single cloud provider.

What Role Does Confluent Play?

Confluent builds an enterprise-ready platform that complements Apache Kafka with advanced capabilities designed to accelerate application development and connectivity, enable event transformations via stream processing, simplify enterprise operations at scale, and meet stringent architectural and security requirements.

One of Confluent’s goals is to democratize Kafka for a wider range of developers and accelerate how quickly they can build event streaming applications. Confluent enables this through a set of features, including the ability to use Kafka from languages other than Java, a rich prebuilt ecosystem of more than 100 connectors so developers don’t have to spend time building connectors themselves, and stream processing with the ease and familiarity of SQL.
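
As a small illustration of the first point, the sketch below produces an event from Python with Confluent's Kafka client (confluent-kafka); the broker address, topic, and payload are hypothetical.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker.example.net:9092"})

def on_delivery(err, msg):
    # Invoked once the broker confirms (or rejects) delivery of the message.
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce(
    "customer-events",
    key="C-1001",
    value=b'{"event": "profile_updated"}',
    on_delivery=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```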

Kafka can sometimes be complex and difficult to operate at scale. Confluent makes it easier with GUI-based management and monitoring, DevOps automation (including a Kubernetes Operator), and dynamic performance and elasticity when deploying Kafka.

Also, Confluent offers a set of features many organizations consider prerequisites when deploying mission-critical apps on Kafka. These include security features that control who has access to what, audit logs for investigating potential security incidents, schema validation to ensure that only well-formed (“clean”) data enters the system, and resilience features so that if a data center goes down, for example, customer-facing applications stay running.
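
On the schema point, one client-side complement to broker-side schema validation is serializing events against a schema registered in Schema Registry, so malformed records fail before they ever reach Kafka. The sketch below is a minimal Python illustration; the URLs, topic, and Avro schema are assumptions.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical Avro schema for customer events.
schema_str = """
{"type": "record", "name": "Customer",
 "fields": [{"name": "id", "type": "string"},
            {"name": "name", "type": "string"}]}
"""
registry = SchemaRegistryClient({"url": "http://schema-registry.example.net:8081"})
serialize = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "broker.example.net:9092"})
event = {"id": "C-1001", "name": "Acme Corp"}  # anything not matching the schema fails to serialize
producer.produce(
    "customer-events",
    value=serialize(event, SerializationContext("customer-events", MessageField.VALUE)),
)
producer.flush()
```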

Confluent offers all of this with freedom of choice, meaning you can choose self-managed software you can deploy anywhere, including on premises or in a public cloud, private cloud, containers, or Kubernetes. Or, you can choose Confluent’s fully managed cloud service, available on all three major cloud providers.

And underpinning all of this is the Confluent committer-led expertise. Confluent has more than 1 million hours of Kafka expertise and offers support, professional services, training, and a full partner ecosystem. Simply put, there is no other organization in the world better suited to be an enterprise partner, and no organization in the world that is more capable of ensuring your success. This means everything to the organizations Confluent works with.

Figure 3: Confluent platform benefits