What Is a Data Platform?

A data platform is an integrated set of technologies that collectively meets an organization’s end-to-end data needs. It enables the acquisition, storage, preparation, delivery, and governance of your data, and provides a security layer for users and applications. A data platform is key to unlocking the value of your data.

But data platforms can be a complex subject. What exactly is behind a data platform? How do you approach designing one? And what’s the difference between a customer data platform, a big data platform, and an operational data platform?

Advantages of Data Platforms

Over the last 20 years, IT vendors have been trying to develop and offer solutions to address the flood of data that companies face from both inside and outside the business.

Cloud is the new norm, and cloud-native data warehouses now use massively parallel processing (MPP). Data pipelines can handle terabytes of data. Storage has become cheap and fast, and data processing frameworks like Spark can handle large volumes of data. NoSQL databases augment relational databases, graph query languages augment traditional languages like SQL, and AI/ML applications have proliferated.

Although these individual pieces of technology have matured, most enterprises have been unable to integrate them. The result is data silos that are often unscalable, contain duplicate and out-of-date data, are locked into proprietary solutions, and have no single security layer.

A modern data platform tries to solve this problem. It’s a combination of interoperable, scalable, and replaceable technologies working together to deliver an enterprise’s overall data needs.

Data Platforms vs Big Data Platforms

People often refer to data platforms by different names. Sometimes these names mean the same thing. Sometimes they reflect differences in the types of data the platforms host and the workloads they process. To make things even more complicated, there’s an overlap between some of their use cases.

  • An Enterprise Data Platform (EDP) provides centralized access to an enterprise’s data assets. Typically, EDPs exist in the on-premise or hybrid world and are made up of traditional data sources. For example, an EDP can include OLTP databases, data warehouses, and a data lake. EDPs also include tools and processes for data acquisition, preparation, and analytical reporting.
  • A Modern Data Platform is a natural evolution from the EDP. It has a broader set of flexible and future-proof capabilities in addition to those of an EDP. Generally, a Modern Data Platform is born out of necessity to store and process different varieties and volumes of data. For example, it may enable processing streaming data in addition to the EDP’s more traditional batch workload. It may allow native processing of structured, semi-structured, or unstructured data at a massive scale, developing AI/ML applications, and performing complex operations like Natural Language Processing (NLP).
    Modern Data Platforms often use cloud technologies for affordable cost models, elastic scalability, and flexible managed services. However, it’s important to remember that Modern Data Platforms are not always entirely cloud-based.

  • A Cloud Data Platform (not to be confused with CDP—Customer Data Platform) is a catch-all term for data platforms entirely built with cloud computing technologies and data stores. For example, a Cloud Data Platform can consist of unlimited object storage, managed relational and NoSQL databases, MPP data warehouses, Spark clusters, Analytics Notebooks, and message queues and middleware that glue them all together.
    Modern Data Platforms can straddle both EDPs and Cloud Data Platforms. For example, an enterprise’s EDP may consist of its ERP, Supply Chain Management, CRM, and Finance data stores. The business may decide to enhance this capability by adding more services to it. Those services could all be from a Cloud Data Platform.

    Several cloud and database vendors have created solutions that allow customers to store and process huge volumes of data in multiple formats in managed platforms.

    Cloud databases are part of public cloud suites. These are relational and non-relational databases entirely managed as a service, including the software, infrastructure, patching, high availability, scalability, and backup. Customers don’t have to worry about database operations.

  • A Data Analytics Platform, Big Data Platform, or Big Data Analytics Platform is a specialized data platform for data analytics purposes. It’s a collection of services and features that enables users to run complex queries on massive amounts of data in any form, then analyze, combine, and explore those query results to create meaningful visualizations. Data Analytics Platforms often combine several big data tools and utilities in one place, and take care of scalability, availability, security, and performance behind the scenes. More often than not, Data Analytics Platforms are part of a cloud suite or a SaaS solution, and are offered as Data-as-a-Service (DaaS). Their capabilities go far beyond running traditional SQL on structured data. Often, Data Analytics Platforms are used in conjunction with the operational data from Enterprise, Modern, or Customer Data Platforms.

  • A Customer Data Platform (CDP) focuses solely on customer-related data. It brings together customer data from multiple sources such as CRM, transactional systems, social media, emails, websites, digital ads, or eCommerce stores. The aggregated data builds a complete user profile that can be used for marketing and other business purposes, like behavior segmentation. Although traditional CRMs often talk about providing a 360-degree customer view, unlike a CRM, a CDP can aggregate both known and anonymous customer data from multiple sources.
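
As a rough illustration of how a CDP might stitch known and anonymous records into one profile, the sketch below links an anonymous cookie to an email address whenever both appear on the same event. The field names (`email`, `cookie_id`, `source`) and the `build_profiles` helper are hypothetical, and real CDPs use far more sophisticated identity resolution.

```python
from collections import defaultdict

def build_profiles(events):
    """Group events into profiles keyed by email, or by cookie when no email is known."""
    # First pass: learn which anonymous cookies belong to a known email.
    cookie_to_email = {}
    for event in events:
        if event.get("email") and event.get("cookie_id"):
            cookie_to_email[event["cookie_id"]] = event["email"]

    # Second pass: assign every event to a profile.
    profiles = defaultdict(lambda: {"sources": set(), "events": 0})
    for event in events:
        cookie = event.get("cookie_id")
        # Prefer a known email, then a linked cookie, then stay anonymous.
        key = event.get("email") or cookie_to_email.get(cookie) or f"anon:{cookie}"
        profiles[key]["sources"].add(event["source"])
        profiles[key]["events"] += 1
    return dict(profiles)

# Hypothetical events from four different sources.
events = [
    {"source": "web", "cookie_id": "c1"},
    {"source": "crm", "email": "ana@example.com"},
    {"source": "ecommerce", "email": "ana@example.com", "cookie_id": "c1"},
    {"source": "ads", "cookie_id": "c2"},
]
profiles = build_profiles(events)
```

Here the anonymous web visit with cookie `c1` is folded into the known profile, while the `c2` ad click stays anonymous until some later event links it to a person.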

Modern Data Architecture: Elements of a Data Platform

Building a modern data platform requires adopting a Modern Data Architecture (MDA) that specifies how data will be collected, cleansed, stored, transformed, processed, and made available to consumers. Modern data architecture has the following characteristics:

Figure: A modern data platform

Power to the User

End users are at the center of a modern data platform architecture. Rather than being confined to a set of pre-developed data assets and their sources, users can bring their own data to the platform and develop their own pipeline to ingest, cleanse, analyze, and report on that data.

Power of the Hybrid Cloud

The modern data platform adopts the best of both the on-premise and cloud worlds. Keeping systems on-premise minimizes changes to legacy applications, while the cloud provides scalable and elastic capacity, processing power, high availability, pre-built applications, and security.

Shared, Virtual Data Layer

At the core of a modern data platform is the virtual data storage layer that can handle diverse data formats and workloads. For example, the platform can support different data storage formats for the operational/transactional databases supporting real-time interactions, the data lakes containing unstructured data, and the data warehouses needed for the structured datasets required for known analytics jobs.

The storage layer is therefore more of an “abstraction” over other platform components. At a low level, users and applications will access it using a common set of protocols and standards like REST APIs. From a usage perspective, this data will be transparently federated and virtualized, allowing users to share and collaborate on it.
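
One way to picture this abstraction is a thin layer that routes reads to whichever store holds a dataset and can present several datasets as one virtual collection. The sketch below is only illustrative: `VirtualDataLayer`, its methods, and the dataset names are hypothetical, and in-memory lists stand in for an operational database and an archive in object storage.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

class VirtualDataLayer:
    """Routes reads to whichever backend holds a dataset (hypothetical sketch)."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[], Iterable[Record]]] = {}

    def register(self, dataset: str, reader: Callable[[], Iterable[Record]]) -> None:
        """Attach a backend-specific reader under a platform-wide dataset name."""
        self._readers[dataset] = reader

    def read(self, dataset: str) -> List[Record]:
        """Read one dataset without the caller knowing where it is stored."""
        return list(self._readers[dataset]())

    def federate(self, *datasets: str) -> List[Record]:
        """Return records from several datasets as one virtual collection."""
        combined: List[Record] = []
        for name in datasets:
            # Tag each record with its origin so consumers can still tell them apart.
            combined.extend({**row, "_dataset": name} for row in self.read(name))
        return combined

# In-memory stand-ins for two different storage systems.
operational_orders = [{"order_id": 1, "status": "open"}]
archived_orders = [{"order_id": 0, "status": "closed"}]

layer = VirtualDataLayer()
layer.register("orders_live", lambda: operational_orders)
layer.register("orders_archive", lambda: archived_orders)

all_orders = layer.federate("orders_live", "orders_archive")
```

Because consumers only see dataset names, a backend can be swapped by registering a different reader under the same name.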

Scalable Data Integration

Ingestion, validation, cleansing, and preparation are key to a data platform. A flexible data architecture uses scalable pipelines that can handle different scenarios: batch ingestion from legacy sources using APIs, pub/sub for asynchronous event messages, and stream processing for real-time, high-velocity data.
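
The validate-then-cleanse flow can be sketched with a lazy generator, so the same code can consume a finite batch or an unbounded stream. This is a minimal sketch under assumed rules: the `id` and `amount` fields and the function names are hypothetical, not part of any specific product.

```python
from typing import Dict, Iterable, Iterator

Record = Dict[str, object]

def validate(record: Record) -> bool:
    """Keep only records that carry the required fields."""
    return "id" in record and record.get("amount") is not None

def cleanse(record: Record) -> Record:
    """Normalize types and strip stray whitespace."""
    return {
        "id": str(record["id"]).strip(),
        "amount": float(record["amount"]),
    }

def run_pipeline(source: Iterable[Record]) -> Iterator[Record]:
    """Process records one at a time, so a list or a stream both work as input."""
    for record in source:
        if validate(record):
            yield cleanse(record)

# A small batch; the second record is dropped because it has no amount.
batch = [{"id": " a1 ", "amount": "10.5"}, {"id": "a2"}, {"id": 3, "amount": 2}]
prepared = list(run_pipeline(batch))
```

Passing a generator that reads from a message queue instead of `batch` would reuse the same stages for streaming input.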

Extensible Processing Logic

A modern data platform’s processing architecture allows developing and reusing service-oriented applications. These applications take care of domain-specific functions and are often based on open-source technologies. In the most advanced cases, the platform can also allow developing next-generation applications based on AI and ML logic in different workspaces.

This pluggable architecture allows users to build their applications seamlessly from a standard set of interoperable components.

End-to-End Governance

Data is automatically classified and tagged in a data platform. This metadata powers a comprehensive data catalog that users can search for self-service data discovery. The governance model also allows users to check the quality and sensitivity of data. Finally, data lineage reporting can show a data element’s journey through the system at any time.
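
To make the catalog idea concrete, the sketch below tags datasets by matching column names against a few patterns, lets users search by tag, and walks a dataset's upstream chain for lineage. Everything here is an assumption for illustration: the tag names, the patterns, and the `Catalog` class are hypothetical, and each dataset is assumed to have at most one upstream parent with no cycles.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical classification rules: tag name -> pattern matched against column names.
SENSITIVE_PATTERNS = {
    "pii.email": re.compile(r"email", re.IGNORECASE),
    "pii.phone": re.compile(r"phone", re.IGNORECASE),
}

@dataclass
class CatalogEntry:
    name: str
    columns: List[str]
    upstream: Optional[str] = None  # dataset this one was derived from
    tags: Set[str] = field(default_factory=set)

class Catalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def add(self, entry: CatalogEntry) -> None:
        """Classify the entry automatically, then store it."""
        for tag, pattern in SENSITIVE_PATTERNS.items():
            if any(pattern.search(column) for column in entry.columns):
                entry.tags.add(tag)
        self._entries[entry.name] = entry

    def search(self, tag: str) -> List[str]:
        """Self-service discovery: names of datasets carrying a tag."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def lineage(self, name: str) -> List[str]:
        """Follow upstream links from a dataset back to its original source."""
        path = [name]
        entry = self._entries.get(name)
        while entry is not None and entry.upstream is not None:
            path.append(entry.upstream)
            entry = self._entries.get(entry.upstream)
        return path

catalog = Catalog()
catalog.add(CatalogEntry("raw_signups", ["user_email", "signup_ts"]))
catalog.add(CatalogEntry("clean_signups", ["user_email", "signup_date"], upstream="raw_signups"))
catalog.add(CatalogEntry("daily_counts", ["signup_date", "count"], upstream="clean_signups"))

email_datasets = catalog.search("pii.email")
counts_lineage = catalog.lineage("daily_counts")
```

Production catalogs typically classify on data content as well as column names; name matching is used here only to keep the example short.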

Self-Service Analytics

The analytics layer allows developing, distributing, and sharing self-service dashboards, reports, and notebooks based on flexible technologies. Organizations can make use of their existing analytics applications by using different integration libraries.

Automation for Flexibility

Modern data architecture heavily relies on automation for two purposes: infrastructure and data onboarding.

The first category ensures all physical elements of the platform like servers, backups, storage, and load balancers can be easily recreated from scratch if needed.

The second type of automation ensures data pipelines, workspaces, notebooks, and functions are created from standard templates whenever onboarding a new data source.
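
Template-driven onboarding might look like the following sketch, where every new source gets the same pipeline steps, notebook path, and storage target unless explicitly overridden. The template keys, paths, and the `onboard_source` helper are hypothetical.

```python
from copy import deepcopy

# Hypothetical standard template shared by every onboarded data source.
PIPELINE_TEMPLATE = {
    "schedule": "daily",
    "steps": ["ingest", "validate", "cleanse", "publish"],
    "notebook": "notebooks/{source}_exploration",
    "target": "lake/{source}/",
}

def onboard_source(source: str, **overrides) -> dict:
    """Create a pipeline config for a new source from the standard template."""
    config = deepcopy(PIPELINE_TEMPLATE)  # never mutate the shared template
    config.update(overrides)
    config["source"] = source
    # Fill in the source name wherever the template has a placeholder.
    for key in ("notebook", "target"):
        config[key] = config[key].format(source=source)
    return config

crm_pipeline = onboard_source("crm")
clickstream_pipeline = onboard_source("clickstream", schedule="hourly")
```

Keeping the defaults in one template means a change to the standard steps reaches every source onboarded afterward.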

Single Security Layer

Finally, a modern data architecture’s security layer abstracts the individual applications’ access mechanisms. It can use an enterprise-wide Identity Provider (IdP) for authentication and role-based authorizations for access. A solid data architecture also ensures data is protected by being in compliance with regulatory standards.
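
A role-based authorization check placed in front of every platform component could be sketched as below. It assumes authentication has already happened at the Identity Provider and that the resulting identity carries a set of roles; the role names, permission strings, and `is_allowed` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set

# Hypothetical mapping from roles to the permissions they grant.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "analyst": {"warehouse:read", "catalog:read"},
    "engineer": {"warehouse:read", "warehouse:write", "pipeline:run"},
}

@dataclass(frozen=True)
class Identity:
    """An already-authenticated user, as asserted by the Identity Provider."""
    user: str
    roles: FrozenSet[str]

def is_allowed(identity: Identity, permission: str) -> bool:
    """Grant access if any of the user's roles includes the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in identity.roles
    )

ana = Identity("ana", frozenset({"analyst"}))
can_read = is_allowed(ana, "warehouse:read")
can_write = is_allowed(ana, "warehouse:write")
```

Because every component calls the same check, access rules live in one place instead of in each application's own mechanism.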

How to Build a Data Platform

Building a modern data platform needs the right data strategy. Although it’s a large topic in itself, here’s a five-point primer.

  1. Engage the Best SMEs: Organizations should seek the best subject matter experts for the project and bring them to the team. This team will be a mixture of non-technical and technical experts and can often include outside resources.
  2. Focus on People and Processes: Focus on the end user and current business processes. Think about the talent and structures needed for managing and using the platform.
  3. Gather Business Requirements: Data must address business needs in order to generate real value. The requirements should include end user personas, use cases, existing and possible new data sources, security requirements, current applications and so on.
  4. Build Incrementally: Adopt an agile approach for incremental wins. The entire project may be divided into multiple sub-projects with each small project handling one aspect of the platform or functionality. For example, there may be a project to standardize the data capture tools and another to build a common data sharing capability.
  5. Use What’s Already Available: A data platform arranges and augments existing processes and data for maximum benefit. Start with the data that you already have and implement the workflow that has the greatest opportunity for impact.

Operational Data

The data platform types we’ve talked about so far primarily deal with aggregating data from different sources, and using that aggregated data to answer business analytics questions.

Another type of data platform deals with operational, high-volume data used for developing applications. These “operational” and application data platforms are increasingly cloud-hosted for scalability and ease of use, have built-in high-availability and disaster recovery, offer strong data security at rest and in transit, and allow workload isolation, performance monitoring, and alerting.

One such platform is MongoDB Atlas. Atlas is a Database-as-a-Service (DBaaS) from MongoDB that allows organizations to spin up MongoDB clusters in the cloud—without worrying about provisioning infrastructure, patching, scaling, performance monitoring, high availability, security, backups, disaster recovery, or database administration.

MongoDB Atlas can seamlessly work with other data platforms to augment their capabilities. For example, it can natively run federated queries across AWS S3 and Atlas clusters, which lets you combine operational data and historical object store data in virtual databases and collections on Atlas Data Lake.

In addition, most SQL-based BI tools can connect to Atlas and analyze its data.

Conclusion

Data platforms are key to understanding, governing, and accessing your organization’s data. In the end, it comes down to what you want to do with your data and how you want to do it. Whether you build a customer data platform, a big data platform, or use an operational data platform like MongoDB Atlas, data platforms can unlock the potential and the revenue your data has been hiding.

FAQs

What are data platform services?

There are many services or functionalities that glue together the components of a data platform. Examples include a data acquisition service, Data Quality Service (DQS), Master Data Management (MDM) service, streaming service, message bus, authentication service, and so on.

What is the best big data platform?

It really depends on the user’s perspective. You can build your own big data platform using applications created by the Apache Software Foundation (ASF), or opt to use a commercial offering. Big data platforms are offered by MongoDB, Amazon (AWS), Microsoft (Azure), Google (GCP), and Cloudera, to name just a few.

What is modern data architecture?

A modern data architecture is the blueprint for building a modern data platform capable of handling any type and volume of data. It specifies how data will be collected, cleansed, stored, transformed, processed, and made available to consumers.

What is an enterprise data platform?

An enterprise data platform is made up of an organization’s existing data sources and applications like data warehouses and data marts, transactional databases, and other legacy data platforms. It can have both cloud and on-premise components. An EDP can be considered a modern data platform when it has ensured any new data source can be seamlessly integrated in the future without making significant changes.