MongoDB vs SQL: Day 30

Buzz Moschetti


Welcome back to our blog series comparing building an application with MongoDB versus building the same application with SQL. Today, we’re going to switch gears a little bit and talk about rapid application development (RAD), specifically using Python as an example.

First of all, why would we want to do RAD?

It should be obvious by now that assuming that we’ll only ever address our data with a single language is not a good idea. If you’ll recall from the first post in this series, there are now a plethora of languages we can use, and an even larger ecosystem of frameworks that we can utilize to manipulate our data. We have to assume at this point that the rate of change of technology, drivers, and languages is only going to increase. We want to bring together the power of all of these languages but we want to do it in a way that makes sense for each language. Let’s begin by reintroducing the Contacts theme we have been exploring for a few weeks.

When you look at the code above, you’ll notice that when we construct this piece of data to save into MongoDB with Python, much of the overhead of explicitly constructing Maps (name/value pairs) and Lists goes away. Instead, we can just use the syntax of Python to describe it and save it.

If we wanted to go back and add titles and hire dates to existing documents, the backfill logic would be about ten lines worth of code. Of particular interest is the simplicity of the predicate expression at the bottom of the code compared to when we tried to do the same thing using Java; again, with Python we no longer have to be burdened with the explicit construction of Maps and Lists. Also, the functions in the output formatting (upper(), sorted()) are not MongoDB, they’re native Python. MongoDB’s drivers expose all of this rich data in a form most convenient to the host language. This gives us the capability to use all of the tools, tips and tricks from third parties and the open source community to operate on our data.

Probably the most important thing that ties all of this together is polymorphism, which is our fancy term for being able to store more than one kind of a shape inside of one collection. This capability is available in any language but it’s easily visualized inside of a scripting language like Python.

What we have above is an information architecture where name, ID and personalData are well known. There’s probably going to be an index on name and/or ID. The field name of personalData is well known within our information architecture, but across different documents, the contents, and the shape of personalData are different. For Bob, we have preferred airports and travel time, while Steve is more interested in the last account visited and his favorite number, a floating point approximation of pi. What makes RAD really powerful in MongoDB is that we can let “the data do the talking.” Through index optimization we can quickly navigate to a small set of documents and then, for each document retrieved, have the host code ask “what is your type?” and react dynamically to the response. In the case of Maps, we can recursively ‘walk’ the map until we get to name-scalar value pairs that can be easily formatted to appear on a GUI. Obviously, we can also build specific, visually optimized GUIs that ask for very specific parts of the content.

This is a very powerful feature and it’s one of the reasons why the ER diagram versus the simplified MongoDB representation isn’t so far off from reality; capabilities like this allow us to rethink how we want to structure our data. We don’t need 22 extra tables to hang on to different shapes of things. We can place everything inside one collection and let the data do the talking, so to speak.

We don’t really have to spend much time on the code snippet above because the current convention to deal with polymorphic data in SQL is to just BLOB it, either via serialized Java, XML, JSON as a single string, third party representations (like Avro), or something home-grown. With all of these approaches we will lose query capability on our data, and with most we will also lose cross-language type fidelity information on it, meaning if we store a floating point number, we have to make sure our post-API representation is consistent across all versions of Java, C#, JavaScript, Python, Perl, etc. These are very important considerations if we want to focus on working with our data and not waste time trying to build adapters and conformance layers to ensure that the data is properly serialized and de-serialized.

At a fundamental level, what has changed?

We have to look back and once again understand why we started this series with a little bit of a history lesson. In the old days, when RDBMS systems were conceived, CPU was slow. Disk was slow. Memory in particular was very expensive. There was no malloc() in the old days; we couldn’t code “new MyObject()”, never mind assess the performance optimization around making the call in the first place. Everything was compile-time bound and locked into a small set of datatype primitives that could be efficiently handled by the CPU and the operating system. In the year 2014, we have a lot more flexibility. We have a lot more power at our fingertips. We can afford the few extra cycles to let the data tell the code what it is, in exchange for a much more versatile and adaptable data access layer.

More broadly, this power allows us to construct software that operates on generalized sets of problems, independent of the specific business domain (finance, manufacturing, retail, etc.). Consider the examples below:

Everybody has suffered through the pain of satisfying a requirement like “How do I do reconciliation or find the version delta?” Whether it’s a trade or a product or a recipe or a catalog entry; it doesn’t matter. Unless you’re in single table world - and that’s essentially never - you have a problem. You can do one of two things: You can build a piece of software that will hydrate your RDBMS world into an object then hydrate another object and do an object-to-object compare. If you’re lucky the people doing all that will have implemented Comparable (in the case of Java), and maybe you can iterate through the results that way. But it’s still quite a bit of work. If you do it at the RDBMS level by dumping tables and getting rows of things and processing CSVs, you are setting yourself up for a world of pain of brittle feeds, vaguely typed formatted data, and lots of commas. We all live this pain every single day. We’ve just grown accustomed to it, but it’s not the way you’d want to do it.

The example at the top of the image above shows how you would do it in MongoDB combined with generic software. We can generically ask for an entire set of data, walk the structure, and accurately capture value and datatype differences. There is an investment here in the MapDiff.difference() function but once completed, it can be reused across any information architecture.

Another use case that comes up very often is ‘how do you pull together sets of data and continually add “layers” of overrides’? You’ve got your baseline, let’s say preferences for a community, and then when a new user is added there are defaults at a company level, then defaults at a group level, and finally the user’s preferences. You want to be able to overlay these things. Traditionally overlaying anything in the RDBMS world is really tough, and largely what we’ve done in the past is hydrate things into bespoke objects and do the layering logic there. Unfortunately, this also introduces a whole set of compile-time dependences.

With MongoDB it’s very easy just to iteratively extract shapes expressed as a map, “stack” the maps, and at the very end produce a “top-down look” of the stack. It is easy to add a feature that allows us to ask “Which Map in the stack produces the value we see in the top-down look?” You get all that kind of flexibility because you’re in the rich map ecosystem. You’re not just dealing in the flat ResultSet world of ints, doubles, dates, and strings in a code framework that’s heavily geared towards the database. With MongoDB, you’re now geared toward the structures and objects that are natively and fluidly manipulated inside the host language.

Lastly, let’s spend just a couple of minutes on the command-line interface (CLI). What’s important (and exciting!) to know is that in MongoDB, our CLI is actually just the V8 JavaScript engine (the same as in Chrome) that loads the JavaScript driver for MongoDB plus a few extra syntactic add-ons. In other words, with our CLI, you can program JavaScript in it all day long and never even touch MongoDB. This is a very different paradigm from iSQL and PSQL and other SQL CLIs that largely only manipulate SQL, not programming elements like variables, functions, and branch control.

Because the MongoDB CLI is “Javascript first, MongoDB second,” there’s a whole host of powerful native applications or utilities that you can write in JavaScript that fluidly manipulate all kinds of things in addition to manipulating MongoDB. An example appears in the image below. We’re asking for contacts with sequence numbers greater than 10,000. Then we invoke the explain method to get back information from the engine on how long it took, the number of items it scanned, etc.

I can easily put that into a loop, and then push it on to a array “v” to produce an array of explain() output. Because this is just JavaScript in our CLI, I can grab the jStat package off the Internet and I can run standard deviations or anything else I want. Lastly, I can not only capture these things as they are admitted as rich shapes, but also turn them right around and insert them back in. For example, I could adapt this to have a while loop around it and it will continuously run a little script that sees what the timings are for a particular kind of query against my database, take the results, and place them back into the database, creating a performance “log.” And this log itself is rich shapes, available for full fidelity querying and manipulation! Simple. Symmetric. Efficient.

In conclusion, it’s our belief that once you move beyond trivial use cases, it’s actually easier to use MongoDB to interact with your data than RDBMS for some of your bigger problems. MongoDB harmonizes much better with modern programming languages and ecosystems than RDBMS. When we take that and layer in some of the things that we didn’t cover in this series like robust indexing, horizontal scaling, and isomorphic high availability and disaster recovery, MongoDB becomes the modern database you’re better off with for your modern solutions.

For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Click back to MongoDB vs SQL: Day 14 (Part 2)

About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.