GIANT Stories at MongoDB

Why XML-in-RDBMS != MongoDB

Introduction

There are many use cases where an application must store complex data structures that are either:

  1. Sufficiently complex so as to make proper relational modeling burdensome.
  2. "Application bespoke" the data structure is flexible and dynamic and it is inefficient or effectively impossible to model in the RDBMS.

The MongoDB document model and APIs and query language make these tasks easy to implement, grow, and maintain.

Why the XML-as-Column Approach Comes with a Large Set of Gotchas

Of late, an alternative design option using XML in an RDBMS has surfaced as a way to address these use cases. The basic approach is to encode the complex structure into XML and store it as a column element "peer" to other simpler, scalar fields in a table. There are many reasons why this is not "the same thing" as what can be done in MongoDB and why the XML-as-column approach comes with a large set of gotchas.

Let's explore some of the more important reasons, and in doing so, we'll use a small piece of rich data for reference.

data = {
   name: "buzz",
   lastUpdate: ISODate("20140606"),
   pets: [ "dog", "cat" ],
   milestones: [
     { what: "bday", d: ISODate("20140101") }
     { what: "hireDate", d: ISODate("20150101") }
   ]
}

The reference example above is Javascript which is the easiest to use to get the point across without extra language overhead. It looks a lot like JSON but it is not. Although it is often convenient to think about MongoDB documents in terms of JSON from a data modeling standpoint, in fact very few of the APIs deal directly with JSON as a string. It is undesirable to deal with JSON as a string for some of the same reasons as dealing with XML as a string, which we'll see in a moment.

Rules and Assumptions:

  1. We'll use Java as the language for implementation but the concepts apply to any typed language.
  2. The data access layer (DAL) communicates to the application with real types: dates, doubles, lists, and maps.

Issues with XML and the Advantage of the MongoDB Document Model

1. XML is a string

There are no dates or lists or binary data in XML. It is a formatted string, period. This means that something in the data access layer must be responsible for converting incoming types like dates into some appropriate string representation -- and therein loses type fidelity:

String xmlString = convertMapToXML(data);
 
<data>
  <name>buzz</name>
  <lastUpdate>20140606</lastUpdate>
  <pets>
    <pet>dog</pet>
    <pet>cat</pet>
  </pets>
  <milestones>
    <milestone>    
      <what>bday</what>
      <d>20140101</d>
    <milestone>    
    <milestone>    
      <what>hireDate</what>
      <d>20150101</d>
    <milestone>    
  </milestones>
<data>

If we save this XML in a column and then retrieve it, how do we properly and accurately reproduce the original structure in a generic way, i.e. without custom coding the XML-to-data logic?

Map data = convertXMLtoMap(xmlString); // how do we know 20150101 is a date?

A solution seems to be storing the XSD for the XML in another peer column and using it to create the proper types. Now, beyond the obvious performance implications of both extracting the XSD and running it through a validating/data conversion utility, the more subtle issue is that there are great many options available to parse, validate, and convert XML. It is difficult to enforce and test the combinations of these options to ensure that, say, a floating point number in Java encoded to XML can be properly decoded by a python script into the same number. The problem is magnified when XML attributes are brought into the picture.

With MongoDB, the Map -- lists, substructures, dates, and all -- is simply set into a field and persisted. No conversions, no loss of type fidelity, no extra software stack dependencies, and a guarantee that data saved by a Java program:           collection.insert(data); can be extracted precisely by a python program:           d = collection.find_one({&#x201C;name&#x201D;:&#x201D;buzz&#x201D;})

2. XML has poor handling for arrays

XML provides no native way to tell a consumer if an element is a scalar or an enclosing array. In the worst case, without a convention for enclosing tags (like "pets" in the above example), you could encounter this:

     dog // Is this a scalar or a single entry for an array?

How about this?

    <pet>dog</pet>   
    <pet>cat</pet>

Without an accompanying XSD or a strong set of conventions like

  <pet_ARRAY>
    <pet>dog</pet>
    <pet>cat</pet>
  </pet_ARRAY>

there is no way to precisely reproduce the list datatype in the host language.

3. Attributes in XML are on a different axis than nested tags

The following XML is legal but clearly confusing:

<name type="foo">
      <type>bar</type>
</name>

Which field named type “wins” when creating the object in the data access layer? In virtually all host languages, the available Map and similar objects have only one key axis, so some sort of convention must be applied to disambiguate “type” as an attribute from “type” as a nested tag. Typically, the approach looks like this:

name: {
  @type: "foo",
  type: "bar"
}

This irritation is a hallmark of XML's legacy as a SGML text markup format. It was not designed to easily and robustly carry rich structured data, and it can be avoided by religiously not using attributes.

4. Features/Capabilities on XML in a column vary widely from RDBMS to RDBMS

In the worst case, the XML is really treated as a string and no query or update capabilities exist; the entire XML CLOB is fetched and saved. In the best case, XPath operators exist but the following important caveats exist:

a. Indexing capabilities, including supported types and options, vary widely if they exist at all b. The XPath query language is very different than SQL and it is a non-trivial exercise to robust construct queries that blend both. c. XPath is even harder to dynamically construct than SQL.

Here is a representative example of the complexity of RDBMS XML using SQLServer2005. Product ID is a numeric type (e.g. not a string) and ProductDate is a peer column to ProductXML.

SELECT ProductXML.query('//description') FROM ProductList 
WHERE ProductXML.exist('//product/header/[number(./@id[1]) = sql:variable("@product_id")]') = 1 AND ProductDate > ‘20140101’

In MongoDB, bespoke data, dynamic data, and "regular" data are all treated exactly the same way. The query language lends itself to powerful parameterized construction and is applied easily and uniformly to the entire data shape regardless of its complexity:

ProductList.find({
  "product.header.id", productId,
  “ProductDate", {“$gt”: ISODate("20140101")}
  },{ "description":1 });

5. Interaction with XML varies across languages

Because XML is fundamentally a string, software must exist in each language to both turn it into some set of objects that can be used, and to format those objects to an XML string. It becomes difficult and error-prone to maintain and test this software across languages, and it is even harder to ensure and enforce its use in all circumstances. Failure to do so will result in subtle errors where, for example, a floating point number persisted in one language will be fetched with a different precision in another. Another common gotcha is inconsistent (or outright incorrect) handling of datetime and timezone.

All datatypes supported by MongoDB behave the same way in each of the 10 drivers supported by the company.

Summary

Designing rich data in XML is not the difficult part of the overall problem. One hard part is precisely converting it to the objects you are using in your software. And an even greater challenge is easily and performantly interacting with the database to query, filter, and index this rich data. The XML-in-RDBMS architecture quickly leads to a platform where an excessive amount of developer-maintained utility code must be created and maintained and impairs the smooth operation of SQL-based tools and products that interact well only with the basic scalar types. A flexible, rich shape document model coupled with a query language designed to work well with these rich shapes makes MongoDB the better solution.

If you’re interested in learning more about the architecture of MongoDB, read our guide:


MongoDB vs SQL: Day 30

Buzz Moschetti

Technical

Welcome back to our blog series comparing building an application with MongoDB versus building the same application with SQL. Today, we’re going to switch gears a little bit and talk about rapid application development (RAD), specifically using Python as an example.

First of all, why would we want to do RAD?

It should be obvious by now that assuming that we’ll only ever address our data with a single language is not a good idea. If you’ll recall from the first post in this series, there are now a plethora of languages we can use, and an even larger ecosystem of frameworks that we can utilize to manipulate our data. We have to assume at this point that the rate of change of technology, drivers, and languages is only going to increase. We want to bring together the power of all of these languages but we want to do it in a way that makes sense for each language. Let’s begin by reintroducing the Contacts theme we have been exploring for a few weeks.

When you look at the code above, you’ll notice that when we construct this piece of data to save into MongoDB with Python, much of the overhead of explicitly constructing Maps (name/value pairs) and Lists goes away. Instead, we can just use the syntax of Python to describe it and save it.

If we wanted to go back and add titles and hire dates to existing documents, the backfill logic would be about ten lines worth of code. Of particular interest is the simplicity of the predicate expression at the bottom of the code compared to when we tried to do the same thing using Java; again, with Python we no longer have to be burdened with the explicit construction of Maps and Lists. Also, the functions in the output formatting (upper(), sorted()) are not MongoDB, they’re native Python. MongoDB’s drivers expose all of this rich data in a form most convenient to the host language. This gives us the capability to use all of the tools, tips and tricks from third parties and the open source community to operate on our data.

Probably the most important thing that ties all of this together is polymorphism, which is our fancy term for being able to store more than one kind of a shape inside of one collection. This capability is available in any language but it’s easily visualized inside of a scripting language like Python.

What we have above is an information architecture where name, ID and personalData are well known. There’s probably going to be an index on name and/or ID. The field name of personalData is well known within our information architecture, but across different documents, the contents, and the shape of personalData are different. For Bob, we have preferred airports and travel time, while Steve is more interested in the last account visited and his favorite number, a floating point approximation of pi. What makes RAD really powerful in MongoDB is that we can let “the data do the talking.” Through index optimization we can quickly navigate to a small set of documents and then, for each document retrieved, have the host code ask “what is your type?” and react dynamically to the response. In the case of Maps, we can recursively ‘walk’ the map until we get to name-scalar value pairs that can be easily formatted to appear on a GUI. Obviously, we can also build specific, visually optimized GUIs that ask for very specific parts of the content.

This is a very powerful feature and it’s one of the reasons why the ER diagram versus the simplified MongoDB representation isn’t so far off from reality; capabilities like this allow us to rethink how we want to structure our data. We don’t need 22 extra tables to hang on to different shapes of things. We can place everything inside one collection and let the data do the talking, so to speak.

We don’t really have to spend much time on the code snippet above because the current convention to deal with polymorphic data in SQL is to just BLOB it, either via serialized Java, XML, JSON as a single string, third party representations (like Avro), or something home-grown. With all of these approaches we will lose query capability on our data, and with most we will also lose cross-language type fidelity information on it, meaning if we store a floating point number, we have to make sure our post-API representation is consistent across all versions of Java, C#, JavaScript, Python, Perl, etc. These are very important considerations if we want to focus on working with our data and not waste time trying to build adapters and conformance layers to ensure that the data is properly serialized and de-serialized.

At a fundamental level, what has changed?

We have to look back and once again understand why we started this series with a little bit of a history lesson. In the old days, when RDBMS systems were conceived, CPU was slow. Disk was slow. Memory in particular was very expensive. There was no malloc() in the old days; we couldn’t code “new MyObject()”, never mind assess the performance optimization around making the call in the first place. Everything was compile-time bound and locked into a small set of datatype primitives that could be efficiently handled by the CPU and the operating system. In the year 2014, we have a lot more flexibility. We have a lot more power at our fingertips. We can afford the few extra cycles to let the data tell the code what it is, in exchange for a much more versatile and adaptable data access layer.

More broadly, this power allows us to construct software that operates on generalized sets of problems, independent of the specific business domain (finance, manufacturing, retail, etc.). Consider the examples below:

Everybody has suffered through the pain of satisfying a requirement like “How do I do reconciliation or find the version delta?” Whether it’s a trade or a product or a recipe or a catalog entry; it doesn’t matter. Unless you’re in single table world - and that’s essentially never - you have a problem. You can do one of two things: You can build a piece of software that will hydrate your RDBMS world into an object then hydrate another object and do an object-to-object compare. If you’re lucky the people doing all that will have implemented Comparable (in the case of Java), and maybe you can iterate through the results that way. But it’s still quite a bit of work. If you do it at the RDBMS level by dumping tables and getting rows of things and processing CSVs, you are setting yourself up for a world of pain of brittle feeds, vaguely typed formatted data, and lots of commas. We all live this pain every single day. We’ve just grown accustomed to it, but it’s not the way you’d want to do it.

The example at the top of the image above shows how you would do it in MongoDB combined with generic software. We can generically ask for an entire set of data, walk the structure, and accurately capture value and datatype differences. There is an investment here in the MapDiff.difference() function but once completed, it can be reused across any information architecture.

Another use case that comes up very often is ‘how do you pull together sets of data and continually add “layers” of overrides’? You’ve got your baseline, let’s say preferences for a community, and then when a new user is added there are defaults at a company level, then defaults at a group level, and finally the user’s preferences. You want to be able to overlay these things. Traditionally overlaying anything in the RDBMS world is really tough, and largely what we’ve done in the past is hydrate things into bespoke objects and do the layering logic there. Unfortunately, this also introduces a whole set of compile-time dependences.

With MongoDB it’s very easy just to iteratively extract shapes expressed as a map, “stack” the maps, and at the very end produce a “top-down look” of the stack. It is easy to add a feature that allows us to ask “Which Map in the stack produces the value we see in the top-down look?” You get all that kind of flexibility because you’re in the rich map ecosystem. You’re not just dealing in the flat ResultSet world of ints, doubles, dates, and strings in a code framework that’s heavily geared towards the database. With MongoDB, you’re now geared toward the structures and objects that are natively and fluidly manipulated inside the host language.

Lastly, let’s spend just a couple of minutes on the command-line interface (CLI). What’s important (and exciting!) to know is that in MongoDB, our CLI is actually just the V8 JavaScript engine (the same as in Chrome) that loads the JavaScript driver for MongoDB plus a few extra syntactic add-ons. In other words, with our CLI, you can program JavaScript in it all day long and never even touch MongoDB. This is a very different paradigm from iSQL and PSQL and other SQL CLIs that largely only manipulate SQL, not programming elements like variables, functions, and branch control.

Because the MongoDB CLI is “Javascript first, MongoDB second,” there’s a whole host of powerful native applications or utilities that you can write in JavaScript that fluidly manipulate all kinds of things in addition to manipulating MongoDB. An example appears in the image below. We’re asking for contacts with sequence numbers greater than 10,000. Then we invoke the explain method to get back information from the engine on how long it took, the number of items it scanned, etc.

I can easily put that into a loop, and then push it on to a array “v” to produce an array of explain() output. Because this is just JavaScript in our CLI, I can grab the jStat package off the Internet and I can run standard deviations or anything else I want. Lastly, I can not only capture these things as they are admitted as rich shapes, but also turn them right around and insert them back in. For example, I could adapt this to have a while loop around it and it will continuously run a little script that sees what the timings are for a particular kind of query against my database, take the results, and place them back into the database, creating a performance “log.” And this log itself is rich shapes, available for full fidelity querying and manipulation! Simple. Symmetric. Efficient.

In conclusion, it’s our belief that once you move beyond trivial use cases, it’s actually easier to use MongoDB to interact with your data than RDBMS for some of your bigger problems. MongoDB harmonizes much better with modern programming languages and ecosystems than RDBMS. When we take that and layer in some of the things that we didn’t cover in this series like robust indexing, horizontal scaling, and isomorphic high availability and disaster recovery, MongoDB becomes the modern database you’re better off with for your modern solutions.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Click back to MongoDB vs SQL: Day 14 (Part 2)


About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 14 - Queries

Buzz Moschetti

Business

Welcome back to our blog series highlighting the differences between developing an app with MongoDB vs. with RDBMS/SQL. Last week, we began to cover Day 14 and added a list of startup apps organized by region. We also added data that was provided to us by an external entitlements service. This week we’re going to continue our discussion of Day 14 by diving into queries.

Before we begin, let's refresh our memories with the framework we’ve been using to stage our discussion.

  • We are using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the date counts as we go through the examples, treat them as relative progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

MongoDB vs SQL: Day 14 - Queries

At this point, we’ve covered how to address broad queries that enable us to get everything or one thing from a collection. But we know the hallmark of SQL is its rich querying capability. Lucky for us, MongoDB has rich querying capabilities as well.

The big difference between SQL and MongoDB’s query language is that the latter is not a single string “sentence.” It doesn’t require white spaces in between words, or commas, or parentheses, or quoted characters. Instead, MongoDB uses a “cascade” of operator/operand structures, typically in name:value pairs. In other words, the operand in one structure can be another operator/operand structure.

This makes things very exciting for developers because the same techniques we use in our code to manipulate rich shapes of data going into and coming out of MongoDB - whether it’s Java, Python, JavaScript, etc. - can be used to construct, manipulate, and “parse” our query expressions. In fact, no parser is necessary. It is trivially easy to walk the cascade with standard programming techniques and find fields, values, etc. Because the query expression is a structured cascade and not a single string, this also means that it is easy to incrementally add subexpressions to the query without first having to break it apart into component pieces.

MongoDB vs SQL: Query Examples

Now let’s look at some compare and contrast, side-by-side code examples for queries.

First, we see one of the popular command line interpreters for SQL which is going to fetch us some contacts and phones. This will yield a rectangle, as we saw in earlier posts.

Next, in the MongoDB command line interpreter (“CLI”), we set up the equivalent query. Notice that we can use “dotpath” syntax to address subfields within fields in the rich shape. It is also worth noting that the “equals” operator is so common that as a shortcut, MongoDB interprets name:value as “name = value”, without having to explicitly supply the $eq operator.

Third, we see the equivalent query in Java / JDBC. Note that although the “sentence” is similar, we start to bump into irritants like escaping quotes.

Lastly, we see the equivalent in MongoDB via the Java driver.

We can see that the overall semantics of queries in MongoDB and SQL are the same and follow the common pattern of query setup, issuance of query, and iteration over a cursor.

Let’s look at some more complicated queries now.

In this query, we’re looking for contacts who either have at least one work phone OR have been hired after a specific date. Again, we can see that the equivalent in MongoDB is pretty straightforward. Note the use of dollar signs in the operator names (`$or`, `$gt`) as syntactic sugar. Also note that in both examples it’s important to use an actual date in our comparison, not a string.

The equivalent query in Java / JDBC will look largely the same as before, with a few more escaped quotes.

However, in practice it isn’t as complicated as it appears -- and it actually offers more flexibility than SQL:
  1. First of all, it’s really the same two or three lines just repeated over and over again with different field:value pairs. This makes it easy to cut-and-paste these expressions as you build up your query.
  2. Second, it is simple to dynamically construct filters and queries without worrying about where we are in the predicate path. We don’t have to worry about white space, commas, or parentheses. We don’t have to worry about splicing in a `SORT` statement or dynamically adjusting the names of returned fields sandwiched between `SELECT` and `WHERE`. Parameter substitution is very straightforward and easily coded, especially when dynamic logical `AND` and `OR` statements come into the picture.

If we extrapolate beyond this small code example, it’s evident how easy it is to add more expressions into the $or statement, or to call out to another function that independently crafts a small filtering fragment that we can add to our overall query. As dynamic queries become more complex in SQL, however, the syntactic sugar that makes SQL “human readable” in the CLI begins to work against you in the programmatic construction of a query.

We’ve used the very basic Java query APIs to illustrate the operator/operand nature of the language. We also deliberately chose standard Java HashMap objects to further reduce coupling until the last moment - when we constructed the BasicDBObject to pass to the find() method. For greater convenience, a Builder pattern set of APIs exist as well, but in the end it is still building a cascade of operator/operand structures.

More MongoDB Query Capabilities

MongoDB offers capabilities you come to expect in a full-featured query language including:

  1. Arbitrary sorting on one or more fields, ascending or descending.
  2. Projection, i.e. retrieving only specified fields from each document in the cursor, not the entire document.
  3. Cursor `skip()` and `limit()` to easily implement pagination if desired.
  4. `explain()`, which returns a wealth of information including full details on query path analysis, document and index counts, and estimated vs. actual processing time.

Perhaps the most important feature is the Aggregation Framework (“agg”), MongoDB’s answer to SQL’s GROUP BY clause. Exploring the power of agg is an entire blog in its own right; please see Asya Kamsky’s posts to get an idea of the agg’s power and programmability. For now, here’s an example to get you thinking about it. Suppose we want to count all the different kinds of cell phones owned by our contacts hired before June 1, 2013. Let’s make it a bit more interesting - let’s capture the names of the people who have these phones and only emit those types where more than 1 person has it. Expressed in the MongoDB CLI, we’d try this:

x2 = db.contact.aggregate([
   { $match: { "hiredate": {"$lt": new ISODate("20130601") }}},
   { $unwind: "$phones"},
   { $group: { "_id": "$phones.type",
               "n": {$sum: 1},
               "who": {$push: “$name”},
             }},
   { $match: { “n”: {“$gt”: 1}} },
   { $sort: { "n": -1, "_id": 1} }
   ]);
x2.forEach(function(r) { printjson(r); });

This might yield:

{ "_id" : "mobile", "n" : 3, who: [ “buzz”, “sam”, “dan” ] }
{ "_id" : "work", "n" : 2, who: [ “sam”, “kay” ] }

The important concepts to grasp with agg are:

  1. Data is “flowed” through a pipeline of operations (e.g. `$match` and `$unwind`). Output from each stage is passed to the next. In the example above, we use the `$match` operator twice: once to filter the input set, then a second to filter the group set.
  2. The `$unwind` operator turns arrays of things into “virtual” documents, one for each element of the array, to simplify further processing.
  3. The `$group` operator is extremely powerful and can even create brand new fields based on numeric and string operations of other fields. In particular, as you group data and aggregate on a scalar (e.g. the count of types), you can use the `$push` operator to capture other information related to that aggregation. The output cursor contains very clear, usable rich shapes.
  4. The agg pipeline in the Java driver (in fact, most of the drivers) is simply a List of operators, very similar to what we saw earlier with `find()`. Thus, the same power and flexibility in terms of parameter substitution and subclause conditional inclusion applies to pipeline construction.

Of course, all this functionality is performed efficiently at the engine, not in the client, so millions (or billions) of documents do not have to be dragged across the network.

Next week, we’ll dive into RAD (Rapid Agile Development) and switch over to some Python examples.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 14 (Part 1)

Day 30 (RAD) >>


About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 14

Buzz Moschetti

Business

Welcome back to our blog series highlighting the differences between developing an app with MongoDB vs. with RDBMS/SQL. Last week, we added phone numbers to our application. This week, things are going to get a little more sophisticated.

Once again, let’s go over the framework we’ve been using to stage our discussion:

  • We are using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the date counts as we go through the examples, treat them as relative progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

SQL vs MongoDB: Day 14

Two weeks in and we have 2 new things that need persistence: a list of startup apps organized by region and, more interestingly, some data vended to us by an external entitlements service. The structure might look something like this:

n4.put(“geo”, “US-EAST”);
n4.put(“startupApps”, new String[] {“app1”, “app2”, “app3”});
list2.add(n4);
n5.put(“geo”, “EMEA”);
n5.put(“startupApps”, new String[] {“app6”});
n5.put(“useLocalNumberFormats”, false):
list2.add(n5);
m.put(“preferences”, list2)
n6.put(“optOut”, true);
n6.put(“assertDate”, someDate);
seclist.add(n6);
m.put(“attestations”, seclist)
m.put(“security”, mapOfDataCreatedByExternalSource);

It’s still pretty easy to add this data to the structure, but there are a couple of things that should come to your attention:

  1. Notice that we’re trying to accommodate 2 different geographies. Depending on where you are, you may have different applications at startup in your solution. We therefore model this easily as a list of structures where the “geo” field can be used as a key.
  2. Sometimes you’ll be presented with data over which you have absolutely no control. You don’t necessarily want to manipulate the data; you simply want to be able to persist it in the shape it is presented. When you need to pull the data back out, you’ll want to extract the very same shape you received and pass it to other code for processing without having to worry about (or process) the contents of the shape.

Now let’s look at the persistence code we’d need to write to get all of this in.

SQL Day 14

At this point, the amount of code to handle the existing Day 1-5 work plus our new startup apps list and the new tables that are required to persist it would span a page or more. Just look back to Day 3 and triple it. So we’re not going to dwell on those issues.

But now that we are some way into the development effort, two unpleasant design concessions may have cropped up as a result of fatigue or time-to-market issues:

  1. The development team chooses to not properly model a new table called startup apps with the necessary foreign keys, key management, and additional joins. Instead, acknowledging that regions are a small set and do not change often, two new columns “APPS-US-EAST” and “APPS-EMEA” were created. Each column contains a semi-colon delimited list of app names, eg. “app1;app2;app3”. This is an oft-repeated practice of “column overloading” where complex structure is formatted/encoding into otherwise simple text columns to avoid the tedium of setting up more tables and potentially changing many joins. This practice suborns the ability to do proper data governance and imposes additional data decoding logic on all consumers in all applications in all languages.
  2. Presented with a map of security data of arbitrary and potentially changing shape, the development team chooses to do one of the following:
    1. Pick out a subset of known fields and set them into new columns. The fidelity and transparency gained is completely offset by losing generic persistence ability.
    2. Converting the Map to a CLOB of data in some ASCII form. This keeps the persistence generic but the data cannot be well-queried in the database and forces all readers and writers in all languages to adhere to the same encoding/decoding logic.
    3. Converting the Map to a BLOB of serialized Java. This is the worst of all: the data cannot be queried, it is available ONLY to Java, and there is compile-time coupling between the serialized form in the database and the representation required upon deserialization.

MongoDB Day 14

If you’ve been following along with our blog series, you should know by now that the theme with MongoDB is that there’s no change. As we incrementally change the kinds of data we wish to persist, the database and the code around it can adjust simultaneously. We do not have to resort to extraordinary measures just to get data in and out of our database.

Let’s move onto something that I’m sure many of you are asking: What if we really do have to do a join?

So far we’ve only looked at examples where we’re reading and writing out of a single collection. In the SQL world, we did have to join contacts and phones but as a result of the information architecture in MongoDB, we were able to model everything we needed within a single document inside of our collection.

Let’s assume now that we want to add phone call transactions to our database. It’s clear that transactions are a zero-or-more relationship where “more” could mean a lot more. And it tends to keep growing over time without bound. In this data design use case, it is appropriate in MongoDB -- just like in RDBMS -- to model transactions in a new, separate collection.

We’ll use the calling number as the non-unique key in the transactions collection, and each document will contain the target number (the number being called) and duration of call in seconds:

We have previously explored and detailed some of the pain produced by zombies and outer joins and result set unwinding in the RDBMS space (Day 5), so for the purposes of making join logic clearer in MongoDB, let’s deliberately simplify the SQL:

In addition to linking together what is very clearly a 1 to N relationship between contacts and phones, we are now also going to be looking at 1 to N to M relationships that include the call targets. The challenge then becomes, “How do we turn the rectangle that’s coming back in my result into a usable list?’

And the answer is that this SQL will only work if we either ask for a single ID or all of the IDs:

  • If we ask for a single ID, for example by appending “and A.id = ‘G9’” to the SQL, then we can amass the calling number and target numbers in the data access layer and sort/assign them.
  • If we ask for all of the IDs, we can traverse the entire result set and build a map based on ID and employ the same logic in the bullet above.

This is a way that we might unwind that result set when we want to get all the IDs and their phone transactions :

We end up with a map of data to pass back into our data access layer that’s keyed first by the ID, second by the number, and finally by the list of call targets.

The problem, of course, is that very often the design calls for either the ability to return a partial result to the data access layer or the developer doesn’t want to deal with the complexity of the unwind code. The traditional solution? The tried-and-true ORDER BY:

The use of ORDER BY here isn’t primarily to drive final ordering of the data for presentation; it’s to set up the result set for easier unwinding. As we iterate the result set, when we detect that the ID changes from G10 to G9, we know that we’re done with all the G10 items and we can yield control back to our caller. The logic that we’re doing to build a map of lists, however, is largely the same as what we saw in the previous example.

And unless indexes are properly set up and/or joined sets are relatively small, liberal use of ORDER BY can have a major performance impact on a system.

When you look at the complete fetch logic including query and post-query logic, SQL is about disassembling things. We start with big queries. We tie together tables, business logic, all of our information and material and load it all into a big string at the top.

Then we throw it at the database engine, we cross our fingers, and later our result set comes back and then we have to disassemble it. The more joins that we have in our query, the more disassembly that is required. In the real world, we’re typically talking about three, sometimes four or more N-way joins in order to bring all of the information we want together. And for every additional table that you’re joining, you’re incurring more disassembly logic. And possible additional performance impact if ORDER BY is used to aid the process.

In MongoDB, the philosophy is different. MongoDB is about assembling things. In this first example, we will use an “N+1 select” approach just to keep things simple for the moment:

First of all there is no big SQL statement up on top and we also don’t have the problem of splitting the logic between the SQL statement itself and the unwind code. Keep in mind that as that SQL statement gets larger, more and more logic is going into the operation that we are trying to drive into the database engine; we’ll also have to manage more and more logic in a separate place for disassembling our result set.

This is even further complicated if we start delving into prepared statements, and other types of dynamic structures where that logic is separate from our actual select statement that we’re building, which is separate from our result.

With MongoDB, it’s simple. We just find what we want to find, we iterate through and when we need to go deeper, we simply ask for that information to get back rich shapes that we can populate as we go along.

This is what a join in MongoDB would look like:

There are a few things to note here. First off, this is not much more code than what we saw in our SQL example. It of course benefits from the fact that there is no SQL statement at all. What we see is the only piece of logic that’s necessary.

Furthermore, for anyone who comes in to debug or modify this later on, the way the code is constructed is now very logical and sequential -- and Java friendly. It makes sense how we can iterate over a list of maps and extract the phone number from them. Especially convenient and flexible is that we don’t have to pull out target and duration explicitly; instead, we can simply take the entire substructure, stick it into a list, and return it back to the parent map. In the end, we still end up with the same map of IDs at the bottom, but what we gain is a lot more flexibility and clarity.

We mentioned above that we used an “N+1 select” approach for simplicity. In many circumstances, the performance may be perfectly acceptable, especially if the relative sizes of each tier in the lookup cascade are 1 to many (like 10 or more), not 1:1. But as we learned on Day 1, MongoDB is about choice. With only a little more work, we can change the logic to fetch an entire tier’s worth of data, extracting the entire set of keys to pass to the lookup in the next tier. It is easy to programmatically construct and pass to MongoDB thousands of items in a so-called “in-list” for lookup. We can even mix and match these two approaches as our needs dictate. In the end, the complete fetch logic remains clear, flexible, and scalable as the complexity of the query expands over time.

Next week, we’ll cover rich querying capabilities in MongoDB and how they measure up vs SQL.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 3-5

Day 14 - Queries >>


About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 3-5

Buzz Moschetti

Business

When we last left off in our MongoDB vs SQL blog series, we covered Day 1 and Day 2 of building the same application using MongoDB vs using SQL with code comparisons. Before we jump into the next couple of days, let’s go over the ground rules again:

  • We’ll be using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the date counts as we go through the examples, just treat them as progress indicators and not the actual time needed to complete the specified task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

Now let’s jump into the differences between SQL and MongoDB for Day 3 through Day 5.

SQL vs MongoDB: Day 3

We have already covered saving and fetching data using a Java Map as the data carrier in the Data Access Layer, and adding a few simple fields. For day 3, we’re going to add some phone numbers to the structure.

The Task: Add A List of Phone Numbers

This is where we were:
m.put(“name”, “buzz”);
m.put(“id”, “K1”);
m.put(“title”, “Mr.”);
m.put(“hireDate”, new Date(2011, 11, 1));

Each phone number has associated with it a type, “home” or “work.” I also know that I may want to associate other data with the phone number in the near future like a “do not call” flag. A list of substructures is a great way to organize this data and gives me plenty of room to grow. It is very easy to add this to my map:

n1.put(“type”, “work”);
n1.put(“number”, “1-800-555-1212”));
n1.put(“doNotCall”, false);  // throw one in now just to test...
list.add(n1);
n2.put(“type”, “home”));
n2.put(“number”, “1-866-444-3131”));
list.add(n2);
m.put(“phones”, list);

The persistence code, however, is a different story.

SQL Day 3 - Option 1: Assume Only One Work and One Home Phone Number

With SQL:

This is just plain bad, but it’s worth noting here because we’ve seen this so many times, often far later than day 3 when there’s strong motivation to avoid creating a new table. With this code, we’re assuming that people only have one home and one work phone number. Let’s take the high road on day 3 and model this properly in relational form.

SQL Day 3: Option 2: Proper Approach with Multiple Phone Numbers

Here we’re doing it the right way. We’ve created a phones table and we’ve updated the way we interact with it using joins.

You can see that the incremental addition of a simple list of data is by no means trivial. We once again encounter the “alter table” problem because the SQL will fail unless it points at a database that has been converted to the new schema. The coding techniques used to save and fetch a contact are starting to diverge; the save side doesn’t “look” like the fetch side. And in particular, you’ll notice that fetching data is no longer as simple as building it into the map and passing it back. With joins, one or more (typically many more) of the columns are repeated over and over. Clearly, we don’t want to return such a redundant rectangular structure in the Data Access Layer and burden the application. We must “unwind” the SQL result set and carefully reconstruct the desired output, which is one name associated with a list of phone numbers and types.

This sort of unwinding work takes time and money. Many rely on ORMs like Hibernate to take care of this, but sooner rather than later, the ORM logic required to unwind a complex SQL query leads to unacceptable performance and or resource issues -- and you end up having to code a solution like what’s shown above anyway.

SQL Day 5: Zombies

With SQL, you’ll have to deal with zombies: (z)ero (o)r (m)ore (b)etween (e)ntities. We can’t forget that some people in our contact list do not have phones. Our earlier query, which is a simple join, produces a Cartesian product and will not return individuals without at least one phone.

To address this, we have to go back and change the query to do an outer join. But much more importantly, it also means changing the unwind logic because we don’t want to add blank phone numbers in our list. This takes even more time and money.

As an aside, even though the SQL based logic is burdening us, at least we’ve confined the impact to just the Data Access Layer. Imagine the impact if we had no Data Access Layer and applications were themselves constructing SQL and unwind logic. Just adding a list of phone numbers would have been a major undertaking.

MongoDB Day 3

Now let’s take a look at doing what we just went over, this time with MongoDB:

With MongoDB, there is no change. The list of phone numbers, which is actually a list of structures with numbers and types, flows into MongoDB and is natively stored as a list of structures. Just like on day 2, it is our choice to go back and backfill phone information for those entries already in the database. Gone are the burdens of having to set up another table, another set of foreign keys, managing those keys, and adding yet another join into what will ultimately become a very complex SQL expression. We also don’t have to immediately commit to a one-or-more vs. zero-or-more design. The time and effort saved with richly shaped MongoDB documents is significant.

Next week, we’ll dive even deeper as we add externally sourced material to our contact structure and expose the compromises development teams make in SQL / RDBMS in later-stage development.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 1-2

Day 14 >>

 

*About the Author - Buzz Moschetti*

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 1-2

Buzz Moschetti

Business

This will be the first post in an ongoing series based on our popular webinar about the differences in building an application using SQL versus building the same application using MongoDB.

First off - before we get into anything - what is it that we’re all trying to achieve with our data? Sure, there are necessary evils such as reading and writing data from databases, putting data on message buses, and working through open source integrations, but at the end of the day what we really want to do is take information, compute it, and get it in front of the right people to help them make better decisions.

Business demands often seem pretty simple - “I just want to save my trades” or “Can we make a product catalog that handles ¥, £, and $ seamlessly?” - but the way that the data is expressed in code or in the application is different from the way it’s framed in a business use case. When you get to how the use case is implemented in the database, the differences are even more pronounced.

And why is that? One reason is that innovation in business and in the code/application layer has far outpaced innovation in database technologies. RDBMS came onto the scene in 1974. Since then, business goals have changed, the pace of business has increased (time to market actually matters a lot now), and we’re using technologies we could not possibly have imagined would exist 40 years ago. Back then we had fairly simple languages that were well-mated to the ‘rectangular’ and ‘flat’ RDBMS world. Today, we have extremely powerful languages of all types with never-ending streams of updates coming from the open source ecosystems we’ve built. The only thing that’s constant is change.

In 1974... In 2014...
Business Data Goals Capture my company's transactions daily at 5:30PM EST, add them up on a nightly basis, and print a big stack of paper Capture my company's global transactions in real-time plus everything that is happening in the world (customers, competitors, business/regulatory/weather), produce any number of computed results, and pass this all in real-time to predictive analytics with model feedback. Then delivery results to 10s of thousands of mobile devices, multiple GUIs and b2b/b2c channels.
Release Schedule Semi-Annually Yesterday
Application/Code COBOL, Fortran, Algol, PL/1, assembler, proprietary tools C, C++, VB, C#, Java, Javascript, Groovy, Ruby, Perl, Python, Obj-C, SmallTalk, Clojure, ActionScript, Flex, DSLs, spring, AOP, CORBA, ORM, third party software ecosystem, the entire open source movement ... and COBOL and Fortran
Database I/VSAM, early RDBMS Mature RDBMS, legacy I/VSAM
Column & key/value stores, and ... MongoDB

That’s where NoSQL comes in, in particular MongoDB. What makes MongoDB special is that it stores data in rich structures (maps of maps of lists that eventually drill down to integers, float point numbers, dates, and strings). MongoDB was designed to not only fluidly store these objects, but also to present them in APIs and with a query language that knows how to understand all the types of data you’re storing. This is in stark contrast to the legacy technologies designed and built in the programming environments of the 1970s.

In MongoDB, your data is the schema and there is symmetry between the way data goes into the database and the way it comes out. With traditional technologies, the differences between what it means to put data in and take data out increase as applications get more complex. The examples we cover in this series will demonstrate these concepts.

And finally, no MongoDB primer would be complete without the following diagram:

The image on the left illustrates storing customer data in ‘rectangular’ tables and the schema diagram needed to make sense of it all. The image on the right illustrates how data is stored in MongoDB. Suffice it to say that the diagram on the left is more complicated than the one on the right. Now, to be fair, the diagram on the left does contain more entities than just a customer and his phones, and it in all likelihood, it didn’t start out looking like that. It was probably relatively simple in the beginning. Then someone needed something that wasn’t a scalar, and someone else needed a few other things, and before they knew what happened, what was once manageable exploded into what you currently see.

Now let’s get into the actual differences between SQL and MongoDB and how we transition our thinking using code.

Some ground rules:

  • We’ll be using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the date counts as we go through the examples, just treat them as progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

SQL vs MongoDB: Day 1

We’re going to start with a map, which will let us move data in and out of our data access layer. Maps are rich shapes so we can stick a lot of things into them, and perhaps more importantly, there is no compile-time dependency.

The Task: Saving and Fetching Contact Data

We’ll start with this simple, flat shape in the data access layer:

Map m = new HashMap();
m.put(“name”, “buzz”);
m.put(“id”, “K1”);

We’ll save it in this way:

save(Map m)

And assume we can fetch it by primary key in this way:

Map m = fetch(String id)

In our initial case, we just have two very simple things, “save” and “fetch”. Rich queries will come later on. This is what this might look like in the code for both SQL and MongoDB:

In MongoDB, “data is the schema” and we are not required to create a table. If you look closer at the fetch functions, you’ll notice that they are largely the same. An important takeaway from this is that in MongoDB, your basic way of addressing the database is actually very similar to how you would do it in an RDBMS. You construct the query, you pass it in, you get back the cursor, and you iterate over the cursor. The fidelity of the data moving out will change as we progress through these examples, but for now lets assume that we have parity.

SQL vs MongoDB: Day 2

Let’s add two fields: a title and a date.

The Task: Adding simple fields

m.put(“name”, “buzz”);
m.put(“id”, “K1”);
m.put(“title”, “Mr.”);
m.put(“hireDate”, new Date(2011, 11, 1));

Notice that we’re putting an actual Date object into the “hireDate” slot, not a string representation like “2011-11-01”. In a well-designed data access layer, we always want to use the highest fidelity objects we can. Dates in particular need to be treated this way to avoid YYYYMMDD vs. YYYYDDMM confusion.

This is what the access layer implementation might look like to use SQL:

The first thing to notice is the alter table problem. Before we can touch any code that’s going to interact with the new database, we have to first change the table definition; otherwise, the select statement that’s going after the new fields is simply not going to work. This is nothing new. It’s something that everyone has become numb to over the past 40 years using RDBMS. There are a few other things that developers might also need to consider like case sensitivity, etc.

Now let’s look at what this looks like in MongoDB:

What did we have to change in MongoDB? Nothing.

We put title and hire date into the map, and we simply inserted the whole map into MongoDB. The previously inserted items remain unchanged, without title and hire date fields. If backfilling older items that do not have those fields is important, we can easily write a 4-line javascript program that iterates over the collection and sets default values for title and hire date.

Next week, we’ll cover adding lists - where the real fun begins.


For more information on migration, explore these resources:
Migration Guide

MongoDB vs SQL: Day 3-5 >>

*About the Author - Buzz Moschetti*

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.