MongoDB vs SQL: Day 14

Buzz Moschetti


Welcome back to our blog series highlighting the differences between developing an app with MongoDB vs. with RDBMS/SQL. Last week, we added phone numbers to our application. This week, things are going to get a little more sophisticated.

Once again, let’s go over the framework we’ve been using to stage our discussion:

  • We are using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the date counts as we go through the examples, treat them as relative progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

SQL vs MongoDB: Day 14

Two weeks in and we have 2 new things that need persistence: a list of startup apps organized by region and, more interestingly, some data vended to us by an external entitlements service. The structure might look something like this:

n4.put(“geo”, “US-EAST”);
n4.put(“startupApps”, new String[] {“app1”, “app2”, “app3”});
n5.put(“geo”, “EMEA”);
n5.put(“startupApps”, new String[] {“app6”});
n5.put(“useLocalNumberFormats”, false):
m.put(“preferences”, list2)
n6.put(“optOut”, true);
n6.put(“assertDate”, someDate);
m.put(“attestations”, seclist)
m.put(“security”, mapOfDataCreatedByExternalSource);

It’s still pretty easy to add this data to the structure, but there are a couple of things that should come to your attention:

  1. Notice that we’re trying to accommodate 2 different geographies. Depending on where you are, you may have different applications at startup in your solution. We therefore model this easily as a list of structures where the “geo” field can be used as a key.
  2. Sometimes you’ll be presented with data over which you have absolutely no control. You don’t necessarily want to manipulate the data; you simply want to be able to persist it in the shape it is presented. When you need to pull the data back out, you’ll want to extract the very same shape you received and pass it to other code for processing without having to worry about (or process) the contents of the shape.

Now let’s look at the persistence code we’d need to write to get all of this in.

SQL Day 14

At this point, the amount of code to handle the existing Day 1-5 work plus our new startup apps list and the new tables that are required to persist it would span a page or more. Just look back to Day 3 and triple it. So we’re not going to dwell on those issues.

But now that we are some way into the development effort, two unpleasant design concessions may have cropped up as a result of fatigue or time-to-market issues:

  1. The development team chooses to not properly model a new table called startup apps with the necessary foreign keys, key management, and additional joins. Instead, acknowledging that regions are a small set and do not change often, two new columns “APPS-US-EAST” and “APPS-EMEA” were created. Each column contains a semi-colon delimited list of app names, eg. “app1;app2;app3”. This is an oft-repeated practice of “column overloading” where complex structure is formatted/encoding into otherwise simple text columns to avoid the tedium of setting up more tables and potentially changing many joins. This practice suborns the ability to do proper data governance and imposes additional data decoding logic on all consumers in all applications in all languages.
  2. Presented with a map of security data of arbitrary and potentially changing shape, the development team chooses to do one of the following:
    1. Pick out a subset of known fields and set them into new columns. The fidelity and transparency gained is completely offset by losing generic persistence ability.
    2. Converting the Map to a CLOB of data in some ASCII form. This keeps the persistence generic but the data cannot be well-queried in the database and forces all readers and writers in all languages to adhere to the same encoding/decoding logic.
    3. Converting the Map to a BLOB of serialized Java. This is the worst of all: the data cannot be queried, it is available ONLY to Java, and there is compile-time coupling between the serialized form in the database and the representation required upon deserialization.

MongoDB Day 14

If you’ve been following along with our blog series, you should know by now that the theme with MongoDB is that there’s no change. As we incrementally change the kinds of data we wish to persist, the database and the code around it can adjust simultaneously. We do not have to resort to extraordinary measures just to get data in and out of our database.

Let’s move onto something that I’m sure many of you are asking: What if we really do have to do a join?

So far we’ve only looked at examples where we’re reading and writing out of a single collection. In the SQL world, we did have to join contacts and phones but as a result of the information architecture in MongoDB, we were able to model everything we needed within a single document inside of our collection.

Let’s assume now that we want to add phone call transactions to our database. It’s clear that transactions are a zero-or-more relationship where “more” could mean a lot more. And it tends to keep growing over time without bound. In this data design use case, it is appropriate in MongoDB -- just like in RDBMS -- to model transactions in a new, separate collection.

We’ll use the calling number as the non-unique key in the transactions collection, and each document will contain the target number (the number being called) and duration of call in seconds:

We have previously explored and detailed some of the pain produced by zombies and outer joins and result set unwinding in the RDBMS space (Day 5), so for the purposes of making join logic clearer in MongoDB, let’s deliberately simplify the SQL:

In addition to linking together what is very clearly a 1 to N relationship between contacts and phones, we are now also going to be looking at 1 to N to M relationships that include the call targets. The challenge then becomes, “How do we turn the rectangle that’s coming back in my result into a usable list?’

And the answer is that this SQL will only work if we either ask for a single ID or all of the IDs:

  • If we ask for a single ID, for example by appending “and = ‘G9’” to the SQL, then we can amass the calling number and target numbers in the data access layer and sort/assign them.
  • If we ask for all of the IDs, we can traverse the entire result set and build a map based on ID and employ the same logic in the bullet above.

This is a way that we might unwind that result set when we want to get all the IDs and their phone transactions :

We end up with a map of data to pass back into our data access layer that’s keyed first by the ID, second by the number, and finally by the list of call targets.

The problem, of course, is that very often the design calls for either the ability to return a partial result to the data access layer or the developer doesn’t want to deal with the complexity of the unwind code. The traditional solution? The tried-and-true ORDER BY:

The use of ORDER BY here isn’t primarily to drive final ordering of the data for presentation; it’s to set up the result set for easier unwinding. As we iterate the result set, when we detect that the ID changes from G10 to G9, we know that we’re done with all the G10 items and we can yield control back to our caller. The logic that we’re doing to build a map of lists, however, is largely the same as what we saw in the previous example.

And unless indexes are properly set up and/or joined sets are relatively small, liberal use of ORDER BY can have a major performance impact on a system.

When you look at the complete fetch logic including query and post-query logic, SQL is about disassembling things. We start with big queries. We tie together tables, business logic, all of our information and material and load it all into a big string at the top.

Then we throw it at the database engine, we cross our fingers, and later our result set comes back and then we have to disassemble it. The more joins that we have in our query, the more disassembly that is required. In the real world, we’re typically talking about three, sometimes four or more N-way joins in order to bring all of the information we want together. And for every additional table that you’re joining, you’re incurring more disassembly logic. And possible additional performance impact if ORDER BY is used to aid the process.

In MongoDB, the philosophy is different. MongoDB is about assembling things. In this first example, we will use an “N+1 select” approach just to keep things simple for the moment:

First of all there is no big SQL statement up on top and we also don’t have the problem of splitting the logic between the SQL statement itself and the unwind code. Keep in mind that as that SQL statement gets larger, more and more logic is going into the operation that we are trying to drive into the database engine; we’ll also have to manage more and more logic in a separate place for disassembling our result set.

This is even further complicated if we start delving into prepared statements, and other types of dynamic structures where that logic is separate from our actual select statement that we’re building, which is separate from our result.

With MongoDB, it’s simple. We just find what we want to find, we iterate through and when we need to go deeper, we simply ask for that information to get back rich shapes that we can populate as we go along.

This is what a join in MongoDB would look like:

There are a few things to note here. First off, this is not much more code than what we saw in our SQL example. It of course benefits from the fact that there is no SQL statement at all. What we see is the only piece of logic that’s necessary.

Furthermore, for anyone who comes in to debug or modify this later on, the way the code is constructed is now very logical and sequential -- and Java friendly. It makes sense how we can iterate over a list of maps and extract the phone number from them. Especially convenient and flexible is that we don’t have to pull out target and duration explicitly; instead, we can simply take the entire substructure, stick it into a list, and return it back to the parent map. In the end, we still end up with the same map of IDs at the bottom, but what we gain is a lot more flexibility and clarity.

We mentioned above that we used an “N+1 select” approach for simplicity. In many circumstances, the performance may be perfectly acceptable, especially if the relative sizes of each tier in the lookup cascade are 1 to many (like 10 or more), not 1:1. But as we learned on Day 1, MongoDB is about choice. With only a little more work, we can change the logic to fetch an entire tier’s worth of data, extracting the entire set of keys to pass to the lookup in the next tier. It is easy to programmatically construct and pass to MongoDB thousands of items in a so-called “in-list” for lookup. We can even mix and match these two approaches as our needs dictate. In the end, the complete fetch logic remains clear, flexible, and scalable as the complexity of the query expands over time.

Next week, we’ll cover rich querying capabilities in MongoDB and how they measure up vs SQL.

For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 3-5

Day 14 - Queries >>

About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.