Creating a Single View Part 3: Data Design and Loading Strategies

Facebook ShareLinkedin ShareReddit ShareTwitter Share

In part 1 of this series on single views, we wrote about why to build a single view with MongoDB, while in part 2, we looked at the implementation of an example single view data model. This week, we’re talking about how to actually get your data into your now single view.

It’s 2015. Moving data from one place to another should be simple, right? Of course, anyone who’s ever tried it knows that it’s not.

Why Is This Still Hard?

It seems simple. You have data in one system, and it works great. You have other data in another system, and it works great too. Shouldn’t you be able to combine it into one new system to do exciting new things, and have that work great too?

Wouldn’t It Be Nice?

Of course, the problem is that those simple-looking blue arrows actually represent a lot of different operations. You might be extracting lots of tables into lots of files, fixing encoding and formatting issues, performing reconciliation, taking different cuts of the data for different purposes, etc.

At its root, the problem is that in the origin systems, the data has a (hopefully) logical representation of the entities in question – products, customers, or whatever. But when you try to extract the data, all too often you lose data fidelity.

For example, if your origin system is a relational database, you may have an ORM sitting on top of it, which turns sensible objects into reasonable relational data. You should be able to reverse the process and extract sensible objects, right?. Well, all too often, the chosen method of getting data back out is open a backdoor instead, and just dump all of the relational data into CSV files. And if you’ve ever tried to load data via a CSV, you’ll know that it’s brittle, hard to understand, and makes it nearly impossible to go revise your data without going back for a whole new cut.

XML was supposed to be better. In theory, it can maintain the structure of your data, so that your object...

class Product {
String productName;
List<Features> ff;
Date introDate;
List<Date> versDates;
int[] unitBundles;

Can turn into something that looks manageable in XML:

 <text>good texture</text>

But there are plenty of drawbacks to XML. It doesn’t handle arrays natively and the conventions for both arrays and attributes vary wildly. The low-level methods to parse XML are performant and flexible but are difficult to turn into well-typed objects. And high-level/easy methods to parse XML into objects are compile-time / configuration-file dependent, impairing flexibility and dynamic data handling. So people end up emitting flattened, attribute-rich XML that ironically looks a lot like the CSV they sought to avoid:

<p name=“widget1” ftxt1=“good texture” ftyp1=“A” idt=“20140203” …
<p name=“widget2” ftxt1=“not fragile” ftyp1=“A” idt=“20110117” …
<p name=“widget3” ftxt1=“dense” idt=“20140203” …
<p name=“widget4” idt=“20140203” versD=“20130403,20130104,20100605” …

But beyond the format of the content, we also have the problem of reacting to changes in it over time, especially as the number of feeder systems grows to 10, 20, or more. In traditional systems, every time a feeder changes, the target schema must change as well. This effort grows nonlinearly with respect to the number of data sources because “friction” starts to occur as multiple change requests start to overlap and must be managed through testing and release. With only 2 or 3 data sources, the destination team might be able to keep up, but with 20 or more, it is likely that the destination system management is in a permanent state of release conflation and triage.

So what’s the solution? Here are a few simple guidelines for how to make your life easier when extracting and loading data for a single view:

  1. Base your exported data design on entities, not the physical database. Yes, when you loaded your Product objects (from the example above) into a relational database, you had to flatten them and make them fit into tables. But that doesn’t mean you should emit tables! Emit the entities in the structure that makes sense for them.
  2. Capture data in full fidelity. It’s tempting to drop portions of data that you think you don’t need. It’s even more tempting to shortcut the proper representation of types and emit things as strings, especially Dates. Resist the urge and let the consumer of the data decide how to deal with the data.
  3. Perform cross-reference and additional logic at the single view, not in transit. The biggest reason for this is so that you can keep track of what has happened to your data. If you need to perform reconciliation down the line, you can demonstrate that you took the data exactly as it came from the source, and then point at the audit log to show how you transformed it. If you changed the data somewhere in the middle of that blue line with a perl script, it’s difficult to look back and figure out what happened. This is particularly important when dozens of inputs need to be contextualized in terms of a day’s worth of data.
And how do you move that data? Typically, the best option is to write a feeder program that grabs the data from the source – could be real-time, could be batch – and moves it over into MongoDB. Yes, creating that program will require an initial investment of time. But in the long run, it will be so much better than the other options. Once you’ve developed the feeder program, the effort to extract new data is minimal. The effort to load the extracted data into MongoDB is easy regardless of the complexity of the shape.

And what about that formatting? Well, if you can, avoid files all together! MongoDB has APIs in more languages than you can shake a stick at. If you lean on those, you can just pipe your objects directly into MongoDB and be done with it!

The Magic of APIs

If for some reason you can’t use one of our APIs, then emit JSON. And not just regular JSON – JSON with MongoDB metadata for types, especially for Dates. Remember, JSON does not natively have a representation for Dates and you don’t want to sacrifice fidelity. Use the 100% JSON-compliant conventions used by MongoDB to identify types other than strings, numbers, arrays, or structures.

Objects –> JSON

Ultimately, all of this comes down to a few big pieces of advice.

Don’t emit data in the form you twisted it into for the existing database. Do emit it the way it makes sense for the business entities. Don’t dump your data out into CSV or XML and hope for the best. Do lean on the APIs, or emit JSON that preserves the object structure. Don’t throw out parts of your data, or try to change it on the fly. Do get it into your destination single view database and manipulate it there.

So now you know the basics of building a single view with MongoDB. We’ve covered what a single view is, looked at a single view data model, and now reviewed data loading strategies.

If you want to learn more about the business benefits a real company realized with a single view built on MongoDB, you can download a white paper to read about MetLife’s single view of the customer.

Read about MetLife

<< Read Part 2



About the Author - Eric
Eric Holzhauer is a Product Marketing Manager at MongoDB.