A data platform is an integrated set of technologies that collectively meets an organization’s end-to-end data needs. It enables the acquisition, storage, preparation, delivery, and governance of your data, and provides a security layer for users and applications. A data platform is key to unlocking the value of your data.
But data platforms can be a complex subject. What exactly is behind a data platform? How do you approach designing one? And what’s the difference between a customer data platform, a big data platform, and an operational data platform?
Over the last 20 years, IT vendors have been trying to develop and offer solutions to address the flood of data that companies face from both inside and outside the business.
Cloud is the new norm, and cloud-native data warehouses now use massively parallel processing (MPP). Data pipelines can handle terabytes of data. Storage has become cheap and fast, and data processing frameworks like Spark can handle large volumes of data. NoSQL databases augment relational databases, graph query languages augment traditional languages like SQL, and AI/ML applications have proliferated.
Although these individual pieces of technology have matured, most enterprises have been unable to integrate these tools. The result is data silos that are often unscalable, contain duplicate and often out-of-date data, are locked into proprietary solutions, and have no single security layer.
A modern data platform tries to solve this problem. It’s a combination of interoperable, scalable, and replaceable technologies working together to deliver an enterprise’s overall data needs.
People often refer to data platforms by different names. Sometimes these names mean the same thing. Sometimes they reflect differences in the type of data the platforms host and the type of workload they process. To make things even more complicated, there’s an overlap between some of their use cases.
A Modern Data Platform is a natural evolution from the Enterprise Data Platform (EDP). It has a broader set of flexible and future-proof capabilities in addition to those of an EDP. Generally, a Modern Data Platform is born out of the necessity to store and process different varieties and volumes of data. For example, it may enable processing streaming data in addition to the EDP’s more traditional batch workload. It may allow native processing of structured, semi-structured, or unstructured data at a massive scale, developing AI/ML applications, and performing complex operations like Natural Language Processing (NLP).
Modern Data Platforms often use cloud technologies for affordable cost models, elastic scalability, and flexible managed services. However, it’s important to remember that a Modern Data Platform is not always entirely cloud-based.
A Cloud Data Platform (not to be confused with CDP—Customer Data Platform) is a catch-all term for data platforms entirely built with cloud computing technologies and data stores. For example, a Cloud Data Platform can consist of unlimited object storage, managed relational and NoSQL databases, MPP data warehouses, Spark clusters, Analytics Notebooks, and message queues and middleware that glue them all together.
Modern Data Platforms can straddle both EDP and Cloud Data Platforms. For example, an enterprise’s EDP may consist of its ERP, Supply Chain Management, CRM, and Finance data stores. The business may decide to enhance this capability by adding more services to it. Those services could all be from a Cloud Data Platform.
Several cloud and database vendors have created solutions that allow customers to store and process huge volumes of data in multiple formats in managed platforms.
Cloud databases are part of public cloud suites. These are relational and non-relational databases entirely managed as a service, including the software, infrastructure, patching, high availability, scalability, and backup. Customers don’t have to worry about database operations.
A Data Analytics Platform, Big Data Platform, or Big Data Analytics Platform is a specialized data platform for data analytics purposes. It’s a collection of services and features that enables users to run complex queries on massive amounts of data in any form, then analyze, combine, and explore those query results to create meaningful visualizations. Data Analytics Platforms often combine several big data tools and utilities in one place, and take care of scalability, availability, security, and performance behind the scenes. More often than not, Data Analytics Platforms are part of a cloud suite or a SaaS solution, and are offered as Data-as-a-Service (DaaS). Their capabilities go far beyond running traditional SQL on structured data. Often, Data Analytics Platforms are used in conjunction with the operational data from Enterprise, Modern, or Customer Data Platforms.
A Customer Data Platform (CDP) focuses solely on customer-related data. It brings together customer data from multiple sources such as CRM, transactional systems, social media, emails, websites, digital ads, or eCommerce stores. The aggregated data builds a complete user profile that can be used for marketing and other business purposes, like behavior segmentation. Although traditional CRMs often talk about providing a 360-degree customer view, unlike a CRM, a CDP can aggregate both known and anonymous customer data from multiple sources.
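To make the idea of aggregating known and anonymous customer data more concrete, here is a minimal Python sketch. It assumes a hypothetical event shape (`source`, `visitor_id`, `attributes`) and a hypothetical `identity_links` mapping that ties an anonymous identifier to a known customer; neither comes from any particular CDP product.

```python
from collections import defaultdict


def build_profiles(events, identity_links):
    """Merge events from many sources into one profile per customer.

    events: iterable of dicts with 'source', 'visitor_id', and 'attributes'
            (a hypothetical shape chosen for this sketch).
    identity_links: maps an anonymous visitor_id to a known customer id,
            for example captured when a visitor logs in.
    """
    profiles = defaultdict(lambda: {"sources": set(), "attributes": {}})
    for event in events:
        # Resolve an anonymous id to a known customer where a link exists;
        # otherwise the anonymous id keeps its own profile.
        key = identity_links.get(event["visitor_id"], event["visitor_id"])
        profile = profiles[key]
        profile["sources"].add(event["source"])
        profile["attributes"].update(event["attributes"])
    return dict(profiles)


events = [
    {"source": "crm", "visitor_id": "cust-1", "attributes": {"email": "ana@example.com"}},
    {"source": "web", "visitor_id": "cookie-9", "attributes": {"last_page": "/pricing"}},
    {"source": "ads", "visitor_id": "cookie-4", "attributes": {"campaign": "spring"}},
]
identity_links = {"cookie-9": "cust-1"}  # cookie-9 later logged in as cust-1

profiles = build_profiles(events, identity_links)
```

In this sketch the web visit from `cookie-9` is folded into the known customer `cust-1`, while `cookie-4` remains an anonymous profile until a link for it appears.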
Building a modern data platform requires adopting a Modern Data Architecture (MDA) that specifies how data will be collected, cleansed, stored, transformed, processed, and made available to consumers. Modern data architecture has the following characteristics:
End users are at the center of a modern data platform architecture. Rather than being confined to a set of pre-developed data assets and their sources, users can bring their own data to the platform and develop their own pipeline to ingest, cleanse, analyze, and report on that data.
The modern data platform adopts the best of both the on-premises and cloud worlds. Keeping systems on-premises minimizes changes to legacy applications, while the cloud provides scalable and elastic capacity, processing power, high availability, pre-built applications, and security.
At the core of a modern data platform is the virtual data storage layer that can handle diverse data formats and workloads. For example, the platform can support different data storage formats for the operational/transactional databases supporting real-time interactions, the data lakes containing unstructured data, and the data warehouses needed for the structured datasets required for known analytics jobs.
The storage layer is therefore more of an “abstraction” over other platform components. At a low level, users and applications will access it using a common set of protocols and standards like REST APIs. From a usage perspective, this data will be transparently federated and virtualized, allowing users to share and collaborate on it.
Ingestion, validation, cleansing, and preparation are key to a data platform. A flexible data architecture uses scalable pipelines that can handle different scenarios: batch ingestion from legacy sources using APIs, pub/sub for asynchronous event messages, and stream processing for real-time, high-velocity data.
A modern data platform’s processing architecture allows developing and reusing service-oriented applications. These applications take care of domain-specific functions and are often based on open-source technologies. In the most advanced cases, the platform can also allow developing next-generation applications based on AI and ML logic in different workspaces.
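One way to let a single validation and cleansing step serve both batch and streaming ingestion is to write it as a lazy generator. The following Python sketch illustrates that idea under stated assumptions: the `REQUIRED_FIELDS` schema and the record shape are hypothetical, and a real pipeline would typically route rejected records to a quarantine area rather than drop them.

```python
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("id", "amount")  # hypothetical schema for this sketch


def cleanse(records: Iterable[dict]) -> Iterator[dict]:
    """Validate and normalize records lazily, so the same step works on a
    finite batch (a list) or an unbounded stream (a generator)."""
    for record in records:
        # Validation: skip records that are missing a required field.
        if any(record.get(name) in (None, "") for name in REQUIRED_FIELDS):
            continue
        try:
            amount = float(record["amount"])
        except (TypeError, ValueError):
            continue  # Validation: the amount must be numeric.
        # Cleansing: trim whitespace and standardize types.
        yield {"id": str(record["id"]).strip(), "amount": amount}


batch = [
    {"id": " a1 ", "amount": "10.5"},
    {"id": "a2", "amount": "not-a-number"},
    {"id": None, "amount": "3"},
]
cleaned = list(cleanse(batch))
```

Because `cleanse` only iterates over its input, the same function can be wrapped around a message-queue consumer or a stream reader without modification.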
This pluggable architecture allows users to build their applications seamlessly from a standard set of interoperable components.
Data is automatically classified and tagged in a data platform. This metadata powers a comprehensive data catalog that users can search for self-service data discovery. The governance model also allows users to check the quality and sensitivity of data. Finally, data lineage reporting can show a data element’s journey through the system at any time.
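As a rough illustration of how classification, catalog search, and lineage fit together, consider the Python sketch below. The tagging rules, dataset names, and the `CatalogEntry` structure are all hypothetical, and matching on column names is a deliberately simplified stand-in for whatever classification a real platform performs.

```python
import re
from dataclasses import dataclass, field

# Hypothetical classification rules: column-name pattern -> tag.
RULES = {
    r"email|phone|ssn": "pii",
    r"amount|price|revenue": "financial",
}


@dataclass
class CatalogEntry:
    name: str
    columns: list
    upstream: list = field(default_factory=list)  # datasets this one derives from
    tags: set = field(default_factory=set)


def classify(entry: CatalogEntry) -> CatalogEntry:
    """Tag a dataset based on its column names."""
    for column in entry.columns:
        for pattern, tag in RULES.items():
            if re.search(pattern, column, re.IGNORECASE):
                entry.tags.add(tag)
    return entry


def search(catalog: dict, tag: str) -> list:
    """Return the names of all datasets that carry the given tag."""
    return sorted(name for name, entry in catalog.items() if tag in entry.tags)


def lineage(catalog: dict, name: str) -> list:
    """Walk upstream links to list every dataset that `name` derives from."""
    seen, stack = [], list(catalog[name].upstream)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(catalog[current].upstream)
    return seen


catalog = {
    entry.name: classify(entry)
    for entry in [
        CatalogEntry("raw_orders", ["order_id", "customer_email", "amount"]),
        CatalogEntry("clean_orders", ["order_id", "amount"], upstream=["raw_orders"]),
        CatalogEntry("daily_revenue", ["day", "revenue"], upstream=["clean_orders"]),
    ]
}
```

Here `search(catalog, "pii")` finds the one dataset with a sensitive column, and `lineage(catalog, "daily_revenue")` traces the report back through `clean_orders` to `raw_orders`.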
The analytics layer allows developing, distributing, and sharing self-service dashboards, reports, and notebooks based on flexible technologies. Organizations can make use of their existing analytics applications by using different integration libraries.
Modern data architecture heavily relies on automation for two purposes: infrastructure and data onboarding.
The first category ensures all physical elements of the platform like servers, backups, storage, and load balancers can be easily recreated from scratch if needed.
The second type of automation ensures data pipelines, workspaces, notebooks, and functions are created from standard templates whenever onboarding a new data source.
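Template-driven onboarding can be sketched in a few lines of Python. The template fields below (`pipeline`, `schedule`, `steps`, `target`) are illustrative assumptions rather than the format of any specific tool; the point is only that every new source gets the same standard pipeline definition with a few values filled in.

```python
import json
from string import Template

# Hypothetical standard template; the field names are illustrative only.
PIPELINE_TEMPLATE = Template("""{
  "pipeline": "ingest_${source}",
  "schedule": "${schedule}",
  "steps": ["extract", "validate", "cleanse", "load"],
  "target": "lake/${source}/"
}""")


def onboard(source: str, schedule: str = "daily") -> dict:
    """Render the standard pipeline template for a new data source."""
    rendered = PIPELINE_TEMPLATE.substitute(source=source, schedule=schedule)
    return json.loads(rendered)


config = onboard("crm")
```

A production version would validate the source name before substituting it into the template, since this sketch assumes simple alphanumeric names.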
Finally, a modern data architecture’s security layer abstracts the individual applications’ access mechanisms. It can use an enterprise-wide Identity Provider (IdP) for authentication and role-based authorizations for access. A solid data architecture also ensures data is protected in compliance with regulatory standards.
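Role-based authorization can be reduced to a simple lookup, as the Python sketch below shows. The role names and permission strings are hypothetical; in practice the roles would arrive from the identity provider after authentication, and the mapping would live in a policy store rather than in code.

```python
# Hypothetical roles and permissions for this sketch; a real platform would
# load these from its identity provider or policy store.
ROLE_PERMISSIONS = {
    "analyst": {"catalog:read", "dataset:read"},
    "engineer": {"catalog:read", "dataset:read", "dataset:write", "pipeline:run"},
}


def is_allowed(user_roles, permission):
    """Return True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in user_roles
    )
```

Because every application asks the same `is_allowed` question, access rules stay in one place instead of being reimplemented per tool. Unknown roles grant nothing, so the check fails closed.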
Building a modern data platform needs the right data strategy. Although it’s a large topic in itself, here’s a five-point primer.
The data platform types we’ve talked about so far primarily deal with aggregating data from different sources, and using that aggregated data to answer business analytics questions.
Another type of data platform deals with operational, high-volume data used for developing applications. These “operational” and application data platforms are increasingly cloud-hosted for scalability and ease of use, have built-in high-availability and disaster recovery, offer strong data security at rest and in transit, and allow workload isolation, performance monitoring, and alerting.
One such platform is MongoDB Atlas. Atlas is a Database-as-a-Service (DBaaS) from MongoDB that allows organizations to spin up MongoDB clusters in the cloud—without worrying about provisioning infrastructure, patching, scaling, performance monitoring, high availability, security, backups, disaster recovery, or database administration.
MongoDB Atlas can seamlessly work with other data platforms to augment their capabilities. For example, it can natively run federated queries across AWS S3 and Atlas clusters, allowing you to combine both operational data and historical object store data in virtual databases and collections on Atlas Data Lake.
In addition, most SQL-based BI tools can connect to Atlas and analyze its data.
Data platforms are key to understanding, governing, and accessing your organization’s data. In the end, it comes down to what you want to do with your data and how you want to do it. Whether you build a customer data platform, a big data platform, or use an operational data platform like MongoDB Atlas, data platforms can unlock the potential and the revenue your data has been hiding.