April 17, 2017
Welcome to part 2 of 3-part single view blog series
- In Part 1 we reviewed the business drivers behind single view projects, introduced a proven and repeatable 10-step methodology to creating the single view, and discussed the initial "Discovery" stage of the project
- In today’s part 2, we dive deeper into the methodology by looking at the development and deployment phases of the project
- In Part 3, we wrap up with the single view maturity model, look at required database capabilities to support the single view, and present a selection of case studies.
If you want to get started right now, download the complete 10-Step Methodology to Creating a Single View whitepaper.
Develop & Deploy Phases of the Single View Methodology
As a reminder, figure 1 shows the 10-step methodology to creating the single view.
In part 1 of the blog series, we covered the Discover phase. We’ll now dive into the Develop and Deploy phases, starting with Step 4 of the methodology.
Step 4: Appoint Data Stewards
A data steward is appointed for each data source identified step 3 of the methodology. The steward needs to command a deep understanding of the source database, with specific knowledge of:
- The schema that stores the source data, and an understanding of which tables store the required attributes, and in what format;
- The clients and applications that generate the source data;
- The clients and applications that consume the source data.
The data steward should also be able to define how the required data can be extracted from the source database to meet the single view requirements (e.g., frequency of data transfer), without impacting either the current producing or consuming applications.
Step 5: Develop the Single View Data Model
With an understanding of both what data is needed, and how it will be queried by the consuming applications, the development team can begin the process of designing the single view schema.
Identify Common Attributes
An important consideration at this stage is to define the common attributes that must appear in every record. Using our customer single view as an example, every customer document should contain a unique customer identifier such as a customer number or email address. This is the field that the consuming applications will use by default to query the single view, and would be indexed as the record’s primary key. Analyzing common query access patterns will also identify the secondary indexes that need to be created for each record. For example, we may regularly query customers against location and products or services they have purchased. Creating secondary indexes on these attributes is necessary to ensure such queries are efficiently serviced.
There may also be many fields that vary from record to record. For example, some customers may have multiple telephone numbers for home, office, and cell phones, while others have only a cell number. Some customers may have social media accounts against which we can track interests and measure sentiments, while other customers have no social presence. MongoDB’s flexible document model with dynamic schema is a huge advantage as we develop our single view. Each record can vary in structure, and so we can avoid the need to define every possible field in the initial schema design, while using MongoDB document validation to enforce specific rules on mandatory fields.
Define Canonical Field Formats
The developers also need to define the canonical format of field names and data attributes. For example, a customer phone number may be stored as a string data type in one source system, and an integer in another, so the development team needs to define what standardized format will be used for the single view schema. We can use approaches such as MongoDB’s native document validation to create and enforce rules governing the presence of mandatory fields and data types.
Define MongoDB Schema
With a data model that allows embedding of rich data structures, such as arrays and sub-documents, within a single localized document, all required data for a business entity can be accessed in a single call to MongoDB. This design results in dramatically improved query latency and throughput when compared to having to JOIN records from multiple relational database tables.
Data modeling is an extensive topic with design decisions ultimately affecting query performance, access patterns, ACID guarantees, data growth, and lifecycle management. The MongoDB data model design documentation provides a good introduction to the factors that need to be considered. In addition, the MongoDB Development Rapid Start service offers custom consulting and training to assist customers in schema design for their specific projects.
Step 6: Data Loading & Standardization
With our data model defined, we are ready to start loading source data into our single view system. Note that the load step is only concerned with capturing the required data, and transforming it into a standardized record format. In Step 7 that follows, we will create the single view data set by merging multiple source records from the load step.
There will be two distinct phases of the data load:
- Initial load. Typically a one-time operation that extracts all required attributes from the source databases, loading them into the single view system for subsequent merging;
- Delta load. An ongoing operation that propagates updates committed to the source databases into the single view. To maintain synchronization between the source and single view systems, it is important that the delta load starts immediately following the initial load.
For all phases of the data load, developers should ensure they capture data in full fidelity, so as not to lose data types. If files are being emitted, then write them out in a JSON format, as this will simplify data interchange between different databases. If possible, use MongoDB Extended JSON as this allows temporal and binary data formats to be preserved.
Several approaches can be used to execute the initial load. An off-the-shelf ETL (Extract, Transform & Load) tool can be used to migrate the required data from the source systems, mapping the attributes and transforming data types into the single view target schema. Alternatively, custom data loaders can be developed, typically used when complex merging between multiple records is required. MongoDB consulting engineers can advise on which approach and tools are most suitable in your context. If after the initial load the development team discovers that additional refinements are needed to the transformation logic, then the single view data should be erased, and the initial load should be repeated.
The appropriate tool for delta loads will be governed by the frequency required for propagating updates from source systems into the single view. In some cases, batch loads taken at regular intervals, for example every 24 hours, may suffice. In this scenario, the ETL or custom loaders used for the initial load would generally be suitable. If data volumes are low, then it may be practical to reload the entire data set from the source system. A more common approach is to reload data only from those customers where a timestamp recorded in the source system indicates a change. More sophisticated approaches track individual attributes and reload only those changed values, even keeping track of the last-modification time in the single-view schema.
If the single view needs to be maintained in near real time with the source databases, then a message queue would be more appropriate. An increasingly common design pattern we have observed is using Apache Kafka to stream updates into the single view schema as they are committed to the source system. Download our Data Streaming with Kafka and MongoDB white paper to learn more about this approach.
Note that in this initial phase of the single view project, we are concerned with moving data from source systems to the single view. Updates to source data will continue to be committed directly to the source systems, and propagated from there to the single view. We have seen customers in more mature phases of single view projects write to the single view, and then propagate updates back to the source systems, which serve as systems of record. This process is beyond the scope of this initial phase.
In a perfect world, an entity’s data would be consistently represented across multiple systems. In the real world, however, this is rarely the case. Instead, the same attributes are often captured differently in each system, described by different field names and stored as different data types. To better understand the challenges, take the example below. We are attempting to build a single view of our frequent travelers, with data currently strewn across our hotel, flight, and car reservation systems. Each system uses different field names and data types to represent the same customer information.
During the load phase, we need to transform the data into the standardized formats defined during the design of the single view data model. This standardized format makes it much simpler to query, compare, and sort our data.
Step 7: Match, Merge, and Reconcile
Even after standardizing divergent field names and data types during the data load, inconsistencies can often exist in the data itself. Accurately merging disparate records is one the toughest challenges in building a single view. The good news is that MongoDB has developed tools that can assist in this process.
Looking again at our frequent traveler example above, we can see that the customer names are slightly different. These variances in the first and last names would result in storing three separate customer records, rather than aggregating the data into our desired single view.
It is not practical, or necessary, to compare each customer record to every other customer record loaded from the source systems. Instead, we can use a grouping function to cluster records with similar matching attributes. This should be executed as an iterative process:
- Start by matching records against unique, authoritative attributes, for example by email address or credit card number;
- Group remaining records by matching combinations of attributes – for example a last_name, date_of_birth, and zip_code triple;
- Finally, we can apply fuzzy matching algorithms such as Levenshtein distance, cosine similarity, and locality sensitive hashing to catch data errors in attributes such as names.
Using the process above, a confidence factor can be applied to each match. For those matches where confidence is high, i.e. 95%+, the records can be automatically merged and written to the authoritative single view. Note that the actual confidence factor can vary by use case, and is often dependent on data quality contained in the source systems. For matches below the desired threshold, the merged record with its conflicting attributes can be written to a pending single view record for manual intervention. Inspecting the record to resolve conflicts might be performed by the data steward, or by the application user when they access the record.
To assist customers, MongoDB consulting engineers have developed tools to facilitate the process above:
- A Workers framework that parallelizes document-to-document comparisons. The framework allows long running jobs to be partitioned and run over collections of records, maintaining progress of grouping and matching.
- A Grouping tool allows records to be clustered based on attribute similarity, using algorithms such as Levenshtein to calculate the distance between different documents, and then single-linkage clustering to create precise matches for merging.
By combining the Workers framework and Grouping tool, merged master data sets are generated, allowing the project team to begin testing the resulting single view.
Step 8: Architecture Design
While the single view may initially address a subset of users, well-implemented solutions will quickly gain traction across the enterprise. The project team therefore needs to have a well-designed plan for scaling the service and delivering continuous uptime with robust security controls.
MongoDB’s Production Readiness consulting engagement will help you achieve just that. Our consulting engineer will collaborate with your devops team to configure MongoDB to satisfy your application’s availability, performance, and security needs.
Step 9: Modify the Consuming Systems
With the merged data set created and systems provisioned, we can begin modifying the applications that will consume the single view.
The first step will be to create an API that exposes the single view. This will typically be a RESTful web service that abstracts access to the underlying data set. Any number of consuming applications – whether customer-facing web and mobile services, or backend enterprise and analytics applications – can be repointed to the web service, with no or minimal modification the application’s underlying logic. Note that write operations will continue to be committed directly to the source systems.
It is generally a best practice to modify one consuming application at a time, thus phasing the development team’s effort to ensure correct operation, while minimizing business risk.
Step 10: Implement Maintenance Processes
No organization is static. Digital transformation initiatives supported by agile development methodologies are enabling enterprises to innovate faster – whether through launching new services or evolving existing applications. Our single view data model needs to maintain pace with business change. This change can manifest itself in adding new attributes from existing source systems, onboarding entirely new data sources, or creating new application uses for the single view.
The project team needs to institutionalize governance around these maintenance processes, defining a strategy on how application changes that generate new attributes of value are integrated into the single view schema. The steps defined above – scoping the required changes, identifying the data producers and stewards, updating the schema, and determining the load and merge strategies – are all essential to maintaining the single view. In some more mature single view projects, the application team may decide to write new attributes directly to the single view, thus avoiding the need to update the legacy relational schemas of source systems. This is discussed in more detail in the Maturity Model section of the whitepaper.
As we consider the maintenance process, the benefits of a flexible schema – such as that offered by MongoDB’s document data model – cannot be underestimated. As we will see in the Case Studies section in part 3, the rigidity of a traditional relational data model has prevented the single view schema from evolving as source systems are updated. This inflexibility has scuppered many single view projects in the past.
Wrapping Up Part 2
That wraps up the second part of our 3-part blog series. In the final Part 3, we will discuss the single view maturity model, explore required database capabilities to host the single view, and present a selection of case studies. Remember, if you want to get started right now, download the complete 10-Step Methodology to Creating a Single View whitepaper.