The Great Data Divide: Here's What's Hindering Your AI Goals

Jeff Needham

Organizational data is arguably the lifeblood of most digital-era companies. And yet, despite its importance to the organization as a whole, the creation and subsequent management of data in most organizations is bifurcated: split according to whether the data is transactional (operational) or analytical (after-the-fact and historical).

Between these two worlds, a great divide exists. Like the equator, which divides our planet into northern and southern hemispheres, many of our organizations operate with separate transactional and analytics hemispheres. Because of long-standing hardware and software limitations, transactional and analytics data processing workloads run on different systems and hardware, which are in turn run by separate teams.

This has been an effective strategy for managing organizational data assets for a very long time, but advances in hardware and software, and the availability of cloud infrastructure, have changed what is possible. If we want to deploy AI at scale and deliberately increase our organization’s overall data processing proficiency, we need to change this approach to processing and managing data.

We’re here to suggest a different operating model, one that’s based on the collective experience of working with over 40K data-processing customers, many of whom are leading the way in reorganizing themselves for high data proficiency to support their AI ambitions and programs.

Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

Treating data as a product

Let’s erase the line between transactional and analytics for a moment, and instead view the overall flow and use of information within an organization. It’s created, it’s updated, and it’s read by employees, customers, data and analytics workers, and executives. Sometimes it sits inside an application, and sometimes it appears in a month-end report. Sometimes it’s used to train and retrain machine learning models. It’s this last scenario that’s starting to reveal significant deficiencies in the traditional methods of managing data found within many organizations.

Thanks to mobile, cloud, and IoT, data is moving at a breakneck pace. Forty years ago, we primarily transacted in a business application and then shuttled the deltas overnight into a data warehouse. Why? Because it was simply not possible to execute analytics queries against a running transactional system. At best, the queries would time out. At worst, you would slow or halt transactional business processing and bring the business to a standstill.

In addition, all analytics were after-the-fact. We didn’t need to execute analytics queries against transactional systems. A single enterprise data warehouse repository was good enough to satisfy the reporting demands the organization placed on the data. Today, however, our historical data assets are becoming ever more significant, sometimes even within real-time transactional business processing.

Insights gleaned from historical data can be fed into decision-making transactional systems to drive better or more efficient outcomes. Think of automated decisions and inferences. Machine learning models are now supplementing some of the data analysis and decision-making that humans have traditionally had to perform.

As these models become more commonplace within transactional business systems, it’s important that they make accurate decisions, especially in heavily regulated industries such as insurance and financial services. To stay accurate, a machine learning model may need to be retrained often, and many models now demand access to data that is real-time, or as near real-time as possible. It’s this hunger for data that is causing AI models to cross over the great data equator. No longer satisfied with historical data alone, these models increasingly demand to be trained and retrained on data that is as fresh as possible, as close as it can be to the moment it was created or updated.

When we treat our data as a product, we see it as a thing, a business entity, or a noun: a customer, a policy, a claim, and so on. However, it also has characteristics like state, age, and context. Is it in motion, or is it at rest? Has it just been created? Is it in the process of being updated, or is it years old, sitting in a warehouse? For which business context is it being leveraged: a customer browsing products, or a data scientist looking for trends in past sales? Across all of these characteristics and contexts, the data itself isn’t any more or less important. It’s simply important because it’s the data.
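
To make the idea concrete, here is a minimal Python sketch of one business entity carrying its own state, age, and context. It is only an illustration: the class name DataProduct, its fields, the claim key, and the five-minute freshness threshold are all hypothetical choices for this example, not part of any particular product or API.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from enum import Enum


class State(Enum):
    IN_MOTION = "in motion"  # just created or currently being updated
    AT_REST = "at rest"      # settled, for example sitting in a warehouse


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical model of a business entity treated as a product."""
    entity: str              # the noun: "customer", "policy", "claim"
    key: str                 # hypothetical business identifier
    state: State
    last_modified: datetime
    context: str             # who is consuming it, and why

    def age(self, now: datetime) -> timedelta:
        """How long ago the record was last created or updated."""
        return now - self.last_modified

    def is_fresh(self, now: datetime, max_age: timedelta) -> bool:
        """True if the record is recent enough for the given consumer."""
        return self.age(now) <= max_age


# Fixed reference time so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# A claim updated two minutes ago, as seen by a transactional application.
claim_in_app = DataProduct(
    entity="claim",
    key="CLM-001",
    state=State.IN_MOTION,
    last_modified=now - timedelta(minutes=2),
    context="customer checking claim status",
)

# The same kind of entity with a different state, age, and context:
# a years-old record viewed by a data scientist.
claim_in_warehouse = replace(
    claim_in_app,
    state=State.AT_REST,
    last_modified=now - timedelta(days=3 * 365),
    context="data scientist studying past claims",
)

five_minutes = timedelta(minutes=5)  # assumed freshness threshold
print(claim_in_app.is_fresh(now, max_age=five_minutes))
print(claim_in_warehouse.is_fresh(now, max_age=five_minutes))
```

Both objects describe the same entity and key. Only the state, age, and context differ, which is the point of the product view: the consumer decides how fresh the data needs to be, while the data keeps a single identity.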

Worlds apart

However, when we task entirely separate teams (transactional vs analytics) to manage the data, we lose this holistic data-as-a-product perspective. Instead, we put on very different lenses depending on whether we’re looking at a software delivery team or a data engineering team supporting data scientists. The meaning of data after it’s transacted, for example, may change once it has landed in the enterprise data warehouse or data lake. Transformations and manipulations are applied to it as it crosses over the great data equator, sometimes creating very different instances of the data. The journey often alters the data from its original ground-truth state, somewhere between being copied from a transactional database and being loaded into an analytics one. After that data lands in analytics databases and platforms, it’s often further transformed and copied into even more databases and platforms.

For the past decade, most AI efforts have been executed within the analytics hemisphere. Historical data assets in our data warehouses and data lakes have been sufficient to serve experiments and even production AI use cases. The more AI becomes commonplace, however, the more we can expect that AI models will want both historical and real-time data. As such, we should be re-aligning our bifurcated transactional and analytics organizations to help them operate as efficiently as possible, to serve the right state of the data to the right consumer, for the right context.

Uniting with Domain-Driven Design

Some of the best outcomes of software delivery organizations embracing Domain-Driven Design come from aligning developers, architects, business SMEs, and scrum masters into the same team, or team of teams. The result is a bounded context in which all the folks who care about, interact with, or manipulate the software and the data can work together without having to cross departmental boundaries or fight bureaucracy that causes friction when trying to deliver working software.

If we consider the goal of being highly proficient and effective with data, especially complicated data (data whose state and context change quickly), it stands to reason that an Agile team of teams, or Bounded Context, should include not only the business SMEs, the software developers, and the architects and site reliability engineers (SREs) who maintain applications, but also the data engineers and the data scientists who currently manage after-the-fact data assets and use them to bring AI models to life.

If we truly want to treat data as a product, however, we need to eradicate the notion that data should be managed in two different hemispheres across the organization, transactional and analytics. The data will change state often, and will continue to do so for the foreseeable future. Engineering the organization for success (efficiency and accuracy in data processing) requires deliberateness: we have to actively set goals and work to make them happen. Those goals should focus on removing known friction points, the junctions at which the exchange or processing of information is inefficient, struggling to scale, costing too much effort and money, or all of the above.

All hands on deck

When it comes to building sophisticated digital applications, when it comes to managing data (in whatever state or context), when it comes to building and maintaining AI models, when it comes to incorporating those models into actual business workflows and applications, it truly takes a village.

As AI begins to accelerate the ability to write and deploy code, for example, the pace of application feature delivery in most organizations will increase. In short, we’re going to be expected to do more, in less time, thanks to the forthcoming generation of AI-enabling assistants. This will place even greater demands and expectations on the organization's technology and data workers, and especially the data infrastructure. Similarly, as AI models consume either real-time or historical data, our ability to accurately, efficiently, and quickly process and manage all of this data will need to increase significantly.

The way forward

Aligning people and resources to common goals is an effective way to transform an organization. Setting goals like treating data as a product, and embracing the principles of domain-driven design in an organization’s data-engineering practices, can help tremendously in moving towards more accurate, efficient, and performant data processing. In the organizations we work with, large and small, this transformation is beginning, and it’s erasing the hard line that has existed between two distinct data hemispheres in the organization.

As AI becomes more significant, so do your developers, data scientists, and data engineers. We need them working together as efficiently and effectively as possible to meet our organization’s aspirations. One way to achieve this is to reduce the friction of working with data for developers, data scientists, and AI models alike. We invite you to learn more about our work in insurance.