To watch the full presentation, click here to view the recording.

Transcript
Hello, this is Buzz Moschetti at MongoDB and welcome to today’s webinar entitled Transitioning from SQL to MongoDB. If your travel plans today do not include exploring topics in this space then please exit the aircraft immediately and see an agent at the gate. Otherwise, welcome aboard. Our flight time today is going to be about one hour. Some quick logistics as we start here. This webinar is being recorded and the audio and the video will be available in about 24 hours. You can use the chat window at any time for technical assistance and to ask questions. Now there is a MongoDB team that’s watching that chat window.
They’ll try to answer quick questions as they come up in real time. As we start to see patterns of common questions, those are going to be flagged and I’ll address those kind of in the priority order if you will at the very end of the webinar. We’ll leave about 10 minutes of the presentation to answer those common questions. Very briefly a little bit about myself, I didn’t make up that name for today’s webinar. That actually is my real name. I spent a long time on the other side of the customer-vendor fence and here now at MongoDB I work with customers large and small to help them design and develop Full Stack Solutions.
And at the risk of starting an editor flame war so early in the presentation, yes I still use Emacs. So now let’s get into the thick of it. Here’s really the theme for today, and that is what are people doing and what’s the time they’re taking? And why is it sometimes difficult to get things done? So if you are not taking numbers and strings and dates and applying logic to them and presenting that to the business community in a way where they can make a decision and actually add business value then you’re not adding any value.
So the degree to which you actually are reading and writing data from databases, putting it on message buses, actually working through open source integrations, let’s just call those “necessary evils.” They’re obviously very necessary evils but you actually want to minimize the amount of time you’re spending engaging in those activities. You want those things to be as friction free as possible so that you can get on with the real value work, which is taking the information and computing it and displaying it to the business community.
And along those lines, we often run into this sort of a frustrating dialogue with our users, with our customers. And that’s because of the way that they express things, like, “Well I just want to save a bunch of trades” or “Can’t you make the product catalogue handle Yens and Pounds and Dollars seamlessly?” But the way the information is expressed at the business level is often pretty different, for a variety of understandable technical reasons, than at the code level. But it’s traditionally been very, very different at the database level.
And why is that? Well the problem isn’t new, and we’ve been at this game with RDBMS for about 40 years. And I’m not going to read through the slide here. But it’s clear that in the past 40 years our business data goals have changed. That’s due to both the increase in the pace of business, globalization, and frankly just better and more exciting technologies that are out there. We do want to release things more quickly than our competitors, sometimes violating causality in the process. But for me, especially as an up Stack Solutions kind of guy it’s this third line about the applications in the code ecosystem that’s really important.
Because in the old days you had fairly simple languages that were kind of well mated to the rectangular RDBMS world, and frankly the flat file world. But you fast forward to now and you’ve got all kinds of languages and environments in the whole open source ecosystem. And these are really powerful software environments and we haven’t to date really exploited the full power of those environments, particularly with respect to integrating with persistence.
And then lastly, 40 years ago we had early RDBMS, surprise today, we still have mature RDBMS, but in the past few years we’ve seen an increase in the rise of so called NoSQL, and in particular MongoDB. Now how does Mongo make things different? Well, at the core, very important to understand that we are a rich shaped database; we call it a document database as well. Document is not to be confused with PDFs or Microsoft Word, it’s not about that. It’s about rich structures. It’s about maps of maps of lists of maps that eventually have ints, doubles, dates, and strings at the leaves.
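To make "maps of maps of lists of maps" concrete, here is a minimal sketch in Python. The field names and values are illustrative assumptions, not taken from the slides; the point is only that the leaves end up being ints, doubles, dates, and strings.

```python
from datetime import datetime

# A hypothetical contact document: a map that contains another map and a
# list of maps, with ints, doubles, dates, and strings at the leaves.
contact = {
    "name": "buzz",                                   # string
    "visits": 27,                                     # int
    "rating": 4.5,                                    # double
    "hiredate": datetime(2011, 11, 1),                # real date, not a string
    "address": {"city": "New York", "zip": "10036"},  # map inside a map
    "phones": [                                       # list of maps
        {"type": "work", "number": "1-800-555-1212"},
        {"type": "home", "number": "1-866-444-3131"},
    ],
}


def leaf_types(value):
    """Recursively collect the type names found at the leaves of a shape."""
    if isinstance(value, dict):
        return set().union(*(leaf_types(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(leaf_types(v) for v in value))
    return {type(value).__name__}


types_found = leaf_types(contact)
```

Walking the shape this way is the same kind of traversal an application would do to react to whatever structure it finds.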
And because the product is designed from its core not just to store these things fluidly but to present them in the APIs and have a query language that understands these types, it becomes extremely powerful. Legacy environments were designed with very different languages and very different kinds of programming environments in mind. And in particular, the whole scripting world that we find ourselves in, and it’s a powerful world, is not well served by some of the legacy technologies.
Second, in Mongo the data is the schema, and we’re going to see a lot more examples of that as we get into the code examples later on. And something that’s important for me as well is this last bullet: there is a symmetry between the way data goes into Mongo and the way it comes out. And we’re going to see in the examples coming up how, as we go from a simple persistence environment to a successively more complicated one, in traditional technologies the divergence between what it means to put stuff in and take stuff out grows and grows. In MongoDB it remains the same.
Lastly, no Mongo presentation is complete without our requisite ER diagram horror story. Truth in lending, the diagram on the left is more complicated than the one on the right. That’s to be fair. However, I will tell you this, the diagram on the left probably started out with four or five basic entities and all was well. And then somebody needed something that wasn’t a scalar. And then somebody needed something where two or more kinds of things, like two kinds of products for example, had to coexist in the same sort of a shopping cart or some sort of a portfolio, a container collection. And then before you know it those five or six entities exploded into this thing that you see on the left. And frankly it obscures the fact that hidden somewhere in there are the five or six good entities that we want. So let’s actually get into the differences here and how we transition our thinking from SQL to Mongo using actual code. Very quickly some ground rules on this, just because MongoDB is a contemporary and exciting software product, it does not mean that we throw out good rules of engineering. So we’ll have a data access layer in between our applications and MongoDB.
Number two, and this is a little controversial perhaps, but in these examples we’re not going to talk about ORMs. I’m not going to talk about Hibernate. There’s actually a very elegant annotation-based product for Java we have called Morphia. But I don’t want to get into any of those things because all those frameworks carry with them their own special set of issues and that’s an entirely different webinar to get into, what the pros and cons of those issues are in the SDLC in the dependency management and all that.
So we’re just going to… and by the way footnote, most of those frameworks end up under the covers doing exactly the same sorts of things in code that you’re going to see today. We’re not going to look at errors and exceptions, please don’t cut and paste this stuff and expect it all to work flawlessly. And lastly, in terms of our day counts as we progress through our design, it’s really just a proxy for progress. Don’t say… I’ve had people come back to me and say, “I can code a lot more than that in three days.” It’s really just meant to be granular steps as we evolve the platform.
So here’s how we’re going to begin. We’re going to start with a map. We’re going to use a map as the way for us to move data in and out of our data access layer. Now why do we like maps? A number of reasons; first of all, they’re rich shapes where I can stick a lot of things into them. All the types I’d want to put in including other maps and lists. But perhaps most important is there’s no compile time dependency on a map. So as your persister changes and as your applications on top of the data access layer change, you don’t have to worry about constantly recoding and recompiling and deploying your data access layer.
This is very important stuff. And we’re going to only have in our initial case here just two very simple things. We’re going to have “save” and “fetch”. We’re going to get into rich queries later on. So with that as the setup, brace yourself because here is some code. Now in our initial efforts on both sides, this is kind of what it would look like. And to the degree that yes in MongoDB we are schema free and the data is the schema, I’m not required to create a table. Initially when I put some of these examples together, folks said no one would ever really have a prepared statement that would be built over and over so that should be handled in the initialization phase.
That’s A-okay. But largely what’s important here is that if you look at the fetch routines, the fetch is largely the same. And what’s important as a concept here is to understand that in Mongo your basic way of addressing the database is actually similar to an RDBMS. You construct the query, you pass it in, you get back the cursor, you iterate over the cursor. It’s going to be the fidelity of the data moving out that’s going to change as we go further on in the presentation. But for the moment let’s just assume that it’s apples to apples, we have parity. Now we’re going to go to day two.
Day two in our development we’ve decided that we need to add these two fields; a title and a date. Now you notice the first thing we’re going to do is that we’re not putting a string into the date. There’s not… we don’t want to have strings in our data access layer. Strings are only good, frankly, on the GUI. You want to keep it in the highest fidelity object you can. So we’re going to stick in the date, and we want to move this now in and out of our data access layer to our persister. Now we have to obviously change the way that the data access layer is going to talk to the DB.
And it took me longer frankly to describe what’s going on here than really what it would mean to change things. So bracing yourself again, now this is what our code’s going to look like in the SQL side. The very first thing that’s important is the alter table problem. So before you can touch any code that’s going to go after your new database you have to alter it because the select statement itself that’s going after those new fields will not work. So you are already starting to leg into… and by the way this is not new.
This is just something we’ve become numb to over 40 years of always being in lock step… of being one step behind the evolution of the schema in order to have the code against it work properly. Now, there’s other things, again, as a software developer I’ve got a little bit of a focus on things like case sensitivity in terms of transfer and such. But you… in general have to make mods in a number of places in order to get this data in and out of the database. But let’s say that’s not really too hard. But what do we have on MongoDB on day two? Well nothing changed at all.
That title and the date that we put into the map, we simply place into the persister. I didn’t spend any time thinking about it, I just persisted, there’s really no technical debt. And the issue about backfill, this is an important thing. Just because I added the fields doesn’t mean I can’t go back to my old records and add title and date but it becomes your choice to do. If you want to add it, terrific. Write a little script… as we’ll see later on that iterates over the things that do not have the title and date and puts in a default value. But the choice is yours. Let’s move on to day three.
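The day-two alter table problem on the SQL side can be reproduced in a few lines. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in RDBMS and a hypothetical `contact` table; the slides themselves use Java and are not reproduced in this transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO contact (id, name) VALUES (?, ?)", ("K1", "buzz"))

# The day-two data access layer wants the two new fields as well.
new_select = "SELECT id, name, title, hiredate FROM contact WHERE id = ?"

# Before the schema is altered, the new SELECT cannot even be prepared.
try:
    conn.execute(new_select, ("K1",))
    failed_before_alter = False
except sqlite3.OperationalError:
    failed_before_alter = True

# Only after ALTER TABLE does the code that touches the new fields work.
conn.execute("ALTER TABLE contact ADD COLUMN title TEXT")
conn.execute("ALTER TABLE contact ADD COLUMN hiredate TEXT")

# Old rows come back with NULLs in the new columns until they are backfilled.
row = conn.execute(new_select, ("K1",)).fetchone()
```

The schema change has to land before any code that mentions the new columns, which is the lock step described above.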
Now we’re getting really into the heart of it. Now we’re going to add some lists. We’re going to add some phone numbers. And again this is going to be a list of structures because each phone number has got both a type and the actual number associated with it. And I want to stick this into my map in my data access layer and I’m going to persist it. Again, pretty easy to add. Now really brace yourself. I’m only going to spend a little bit of time on this slide because it’s just plain bad. But I’ve seen it so many times that it warrants the 45 seconds, and that is, oftentimes, especially if this were day 90 instead of day three, there is a very strong motivation to not do the right thing and create a new table.
Instead we’ll just assume people are going to have, let’s say three phones. And so you end up with your phone one, phone two. In this case we just assume that there’s one number. But this is not what you should be doing on day three, right? We’re still early on in our design. Let’s explore what really should be happening in the SQL world. Now we’ve done the proper approach here and we’ve created a phones table. We’ve updated the way that we talk to it via the joins here.
And let’s… again just assume those are modest changes. What is not a modest change is the code that’s required to both insert and then take the data back out. In particular, on the fetch side you’ll see the yellow warning triangle: no longer, as I fetch data out, is it just a straightforward build it into the map and pass it back. And we’ll see a lot more of this in just a few slides. But I now have to unwind that result set because I’m not going to return a big rectangle back to my users… back to the application. I actually want to pass back a map that says, “One person, Buzz has a set of phones and each one of these phones has a set of properties.”
Not just a big rectangle that I have to manage through myself. Make no mistake, by the time you get into this sort of work it takes time and money. And pretty much if people are thinking, “Well, Hibernate, for example, takes care of all this for me,” it does until your fourth join triggers some sort of horrible behavior in your performance and you end up coding things like these anyway. So at some point you will end up doing the same SQL unwind. No sooner had we gotten through that little exercise of adding things than the zombies appear.
They’re currently very popular in both literature and film, but they’ve been with us in information architecture for a long time. Zombies: zero or more between entities. What we forgot is that some people in our contact list do not have phones. And my other query which had the Cartesian product did not return those people without phones. So we go back and we change the query in order to do an outer join. But as a result we now have to go back to the unwind logic, and in particular, this is actually where the real work is now, in trying to deal with the unwind, because additional material is going to be coming in.
And this took time and money. Hopefully everything we’re talking about here is still in the data access layer, right? Because if your applications were talking directly to your RDBMS you’d have a lot of impact analysis ahead of you. MongoDB day three there’s no change. That list of phone numbers… actually a list of structures with number and type flows into MongoDB. It’s stored natively as a list of structures. And this second item here is so important to understand because you no longer have to fear naturally occurring lists in your data structures.
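The unwind described here, including the zero-or-more case, might look like the following sketch. It assumes Python's sqlite3 module as a stand-in RDBMS and hypothetical `contact` and `phones` tables; it is an illustration of the idea, not the code from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE phones (id TEXT, type TEXT, number TEXT);
    INSERT INTO contact VALUES ('K1', 'buzz');
    INSERT INTO contact VALUES ('K2', 'steve');
    INSERT INTO phones VALUES ('K1', 'work', '1-800-555-1212');
    INSERT INTO phones VALUES ('K1', 'home', '1-866-444-3131');
""")


def fetch_all(conn):
    """Unwind the joined rectangle into one map per contact with a phones list."""
    rows = conn.execute("""
        SELECT c.id, c.name, p.type, p.number
        FROM contact c LEFT OUTER JOIN phones p ON c.id = p.id
    """)
    contacts = {}
    for cid, name, ptype, number in rows:
        # The contact columns repeat on every row; keep only the first copy.
        person = contacts.setdefault(cid, {"id": cid, "name": name, "phones": []})
        # The outer join yields NULL phone columns for contacts with no phones.
        if number is not None:
            person["phones"].append({"type": ptype, "number": number})
    return contacts


result = fetch_all(conn)
```

With an inner join, `K2` would silently disappear from the result; with the outer join it comes back, but the unwind now has to recognize and skip the NULL phone columns.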
And in addition of course this kind of approach keeps you safe from various other undead distractions as you develop your solution. A couple of weeks later, now it’s getting a little more complicated still. We’ve got things like startup apps which is a string of different startups. You notice how we’re doing it for two different geographies, so depending on whether you’re in the US or EMEA you may have different applications at startup in your solution. At the bottom, that very descriptively named data, map of data created by external source, sometimes you’ll be presented with data that you don’t have control over.
And it’s just… it’s vended to you, you want to capture it. You don’t necessarily want to figure… in fact you don’t have to figure anything out, just whatever’s in that shape you want to save. And now later on as part of pulling all these back out you will hydrate that shape and vend it back to somebody else largely without you having to get in the middle of any of that. Knowing what we’ve known so far in the past six slides about what our journey has looked like what do you think it’s going to look like to add this in?
Well truly I stopped trying to come up with the compile time correct examples and I just gave up because it would be many, many more lines of code and I didn’t want to work with it. But here is the important thing, it’s extremely likely that by the time we got to this people decided that startup apps wasn’t worthy of its own table and they just did a semi-colon delimited string of things. And so there you’ll… what’s really in your ER diagram is not only a complex set of tables, but you’ve probably got encodings of lists floating around in there simply because people ran out of time, money, patience or all of the above.
And for security, in the absence of a way to actually store the whole thing, a subset of items in there were taken out and stored as columns A, B, C and D. If you’ve been following along, the theme here in MongoDB, there’s no change. As we incrementally change the kinds of things we wish to persist and truly we’re trying to adopt, if you will, agile development practice, the database and the code around it can move simultaneously, easily and without having to resort to extraordinary measures just to get the data on and off the database. So, what if we do have to do a join?
So far we’ve just seen our examples where we’re reading and writing out of a single collection. The SQL world did have to join contacts and phones but as a result of the information architecture we were lucky, frankly, that we were able to well model this in Mongo as a single shape inside of a collection. So for argument’s sake let’s assume that we want to have transactions and although I will say as a footnote, for the highest possible performance if this was a caching or a web-facing, web-scale kind of app, you might actually want to bury your phone transactions inside the Mongo shape for each phone number. But for the purposes of discussion today, let’s say it’s going to be separate. And the link here is going to be the phone number in my customer table. It’s going to link to one or more transactions where that number is calling a target. So in the SQL world, we’re going to end up with something like this.
And again the point of this exercise is not to figure out how I’m going to optimize this; that is another thing. But the point is now in addition to linking together what was very clearly a one to N relationship between the contacts and the phones, now we have a one to N to M kind of relationship where I’m now pulling in the targets. And the challenge here becomes how do I turn that rectangle that’s coming back in my result set actually into a real list? Now, those of you who are watching this carefully will realize that in general this kind of approach is only going to work for two things: either I ask for a single ID or I ask for all of the IDs.
The reason is if I ask for one ID I will get back all of the G9s and that will be fine. If I ask for all of them, I’ll be able to traverse the entire result set and find the full population of G9s. But if I were to try to iterate through this one line at a time, I’ve got data essentially distributed by insert order, which is to say randomly distributed throughout my result set. So I would have to get a little fancy. But let’s leave that caveat off to the side. We will go line by line through this example but this is a way in which you might unwind that result set. When we get back to this rectangle we’re going to quickly locate… by the way this isn’t the case where we’re saying, “Just get me all the IDs and their phone numbers not for a particular one.”
We see if the ID has been set up, we create a map; if there are any target numbers then we create that map. Finally towards the bottom here, that’s the most frequently changing datum. The target and duration columns are the ones that are actually changing over and over again; we do put those into a map and we stick it into a list. By the time we are all done we would end up with a structure, this IDMAP that would be passed back in our data access layer that looks like this; keyed first by my ID, second by the phone number and then a list of the targets that I’m going to.
Okay, not bad. Now, in general people won’t do that for some of the reasons I described before and that is you can’t bail out of the cursor early so we’ll do a little ORDER BY here. Now ORDER BY immediately starts to bring performance considerations into the picture. And you had better be sure that you’re well indexed across all of your ordering. Now in our example here probably an index is entirely appropriate on ID but sometimes it is not. And when you’re putting in ORDER BY simply to drive the logic in your result set unwinding, that’s where you can unnecessarily impose a performance requirement on your database… a performance problem on your database.
The point is, in this kind of a set up you can imagine I can iterate through things and I’ll jump out. When I see that ID changes from G10 to G9, I know that I’m done with all the G10 and I can yield control back to my caller. But still the logic that I’m doing to build a map of lists to return to the user is largely the same as what we saw on the other side. In short, what does this mean? In SQL, it’s about disassembling things. We start with the big query up front, many columns. We tie together all the tables, business logic, all kinds of material and information is all loaded in to this big string up on top.
Then we throw it at the database engine, we cross our fingers somewhere between two and 2000 or 20,000 or forever milliseconds later our result set comes back and then we disassemble it. And the more joins that you have in your query, the more disassembly that you are going to have to work through. This is really the crux of it. Examples where I say select * from employees; that’s not the real world. Even two table joins isn’t really the real world. Most of the time you’re talking about three, sometimes four, sometimes more way joins in order to bring this information together.
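The slide's unwind code is not part of this transcript, but the shape of it can be sketched in Python. The rows below are hypothetical stand-ins for the joined result set (ID, phone number, target, duration), arriving in no particular order.

```python
# Hypothetical rows from the three-way join: (id, number, target, duration).
rows = [
    ("G10", "1-999-444-9999", "1-222-707-7070", 23),
    ("G9", "1-999-444-1234", "1-555-000-1111", 42),
    ("G10", "1-999-444-9999", "1-222-907-1212", 5),
    ("G9", "1-999-444-1234", "1-444-222-3333", 17),
]


def unwind(rows):
    """Turn the rectangle into a map keyed by ID, then number, holding a list of calls."""
    idmap = {}
    for cid, number, target, duration in rows:
        by_number = idmap.setdefault(cid, {})    # create the ID map if not set up yet
        calls = by_number.setdefault(number, [])  # create the list for this number
        # target and duration are the columns that change on every row.
        calls.append({"target": target, "duration": duration})
    return idmap


idmap = unwind(rows)
```

Because rows for the same ID can appear anywhere in the result set, this version can only hand back the map once the whole cursor has been consumed.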
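With the rows ordered by ID, the unwind can hand back one ID at a time instead of waiting for the full result set. A minimal sketch, assuming the rows have already been sorted by an `ORDER BY id` and using hypothetical data:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, already sorted by ID as ORDER BY would deliver them.
ordered_rows = [
    ("G10", "1-999-444-9999", "1-222-707-7070", 23),
    ("G10", "1-999-444-9999", "1-222-907-1212", 5),
    ("G9", "1-999-444-1234", "1-555-000-1111", 42),
]


def stream_by_id(rows):
    """Yield (id, {number: [calls]}) as soon as the ID changes; rows must be sorted by ID."""
    for cid, group in groupby(rows, key=itemgetter(0)):
        by_number = {}
        for _, number, target, duration in group:
            by_number.setdefault(number, []).append(
                {"target": target, "duration": duration})
        yield cid, by_number


stream = stream_by_id(ordered_rows)
first_id, first_phones = next(stream)  # the caller can stop here without reading G9
```

The early exit only works because of the ordering, which is exactly the requirement that pushes cost onto the database.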
And for every additional table that you’re joining, you’re incurring more disassembly logic. MongoDB the philosophy is different. It’s the opposite in fact. MongoDB it’s about assembly where you put things together bit by bit. First of all there is no big SQL statement up on top and you also don’t have the problem of splitting the logic. Make no mistake as that SQL statement gets larger more and more logic is going into the operation that you are trying to drive into the database engine. And there is going to be more and more logic in a separate place in disassembling that result set.
By the way that’s further complicated if you start getting into prepared statements and other sorts of dynamic structures where that logic is separate from your actual select statement that you’re building which is separate from your result. In MongoDB it’s sort of simple. You just find what you want to find, you iterate through and then when you find that you need to go deeper if you will, you simply ask for that information and you get back these rich shapes and you continue to populate as we go along. The way this looks in a join for MongoDB, it would look something like this.
There are two important things. First of all, this is not much more code than what we saw before for the SQL example in terms of unwinding. Of course it benefits from the fact that there is no SQL statement at all here, right? This is the only piece of logic that’s necessary. But for somebody who’s trying to come in and then debug this or do any sorts of mods later on, the way the code is constructed is now very logically sequential. It makes sense how I can move from maps and just get lists out of the phones and build them. Particularly the second red highlighted item here, target and duration, I don’t have to pull explicitly as columns.
I can simply take the entire structure and stick it into a list and return it back to the parent map. So in the end I still end up with the same IDMAP on the bottom but I get a lot more flexibility and a lot more clarity in terms of what it is I’m trying to put together in this kind of approach. So far we’ve seen broad queries. Get everything or get one thing, all right. But as we know the hallmark of SQL is its rich querying capability. Well Mongo’s got that as well but the big difference is, in Mongo we don’t have a string. It’s not a grammar with whitespaces and commas and things in it.
Again the same tips and tricks are used uniformly across the entire development space. Let’s look at some compare and contrast examples. This is any one of your popular command line interpreter to SQL is going to get some contacts and phones. That largely will yield that rectangle we saw before. In our CLI of course in the phone’s example we didn’t have to do the join so this is how we dig through the structure and say, “Look for the phone’s list and any structure inside where the type equals work return that item.” It’s nice and compact.
If I was actually to code this in Java… well in SQL it does pretty much look the same. I will leave to another time the subtleties of what it actually means to dynamically create SQL and make sure that all of your single quotes are escaped properly as you do a string buffer append and all that, because this is actually the happy part. This is the easy case. Dynamically constructing rich SQL sometimes can be fraught with peril. But at any rate as we can see the difference between SQL in Java and MongoDB is relatively straight forward. Again the overall architecture is the same: query, cursor, iterate over cursor.
Let’s get a little more complicated here. We find the contacts we want from [Indiscernible] [0:29:42] somebody’s who’s hired. This is… I’ll leave it as an exercise to the viewers to see which RDBMS I was going after that supports this syntax because as you can imagine this syntax with the date expression, the date cast doesn’t work on all databases. The equivalent in Mongo is pretty straight forward again where I have an $or operator. We use dollar signs in front of some of our operators for syntactic sugar, and I’m going to evaluate these two expressions.
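The slide is not reproduced in this transcript. As a rough sketch of the assembly idea in Python, the function below only assumes that each collection offers a `find(query)` method, as a pymongo collection does. `ListCollection` is a hypothetical in-memory stand-in so the example runs without a server, and all field names and data are assumptions.

```python
class ListCollection:
    """Tiny in-memory stand-in for a collection; supports equality filters only."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query=None):
        query = query or {}
        return (d for d in self.docs
                if all(d.get(k) == v for k, v in query.items()))


def build_idmap(contacts, transactions):
    """Assemble the ID map bit by bit instead of disassembling one big result set."""
    idmap = {}
    for contact in contacts.find():
        by_number = idmap.setdefault(contact["id"], {})
        for phone in contact.get("phones", []):
            # Go deeper only when needed: ask for this number's transactions
            # and keep the whole structures rather than pulling out columns.
            by_number[phone["number"]] = list(
                transactions.find({"number": phone["number"]}))
    return idmap


contacts = ListCollection([
    {"id": "G9", "phones": [{"type": "work", "number": "1-999-444-1234"}]},
    {"id": "G10", "phones": [{"type": "work", "number": "1-999-444-9999"}]},
])
transactions = ListCollection([
    {"number": "1-999-444-9999", "target": "1-222-707-7070", "duration": 23},
    {"number": "1-999-444-1234", "target": "1-555-000-1111", "duration": 42},
    {"number": "1-999-444-9999", "target": "1-222-907-1212", "duration": 5},
])
idmap = build_idmap(contacts, transactions)
```

The loop reads top to bottom in the same order as the structure it builds, which is the readability point being made about this approach.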
What’s important to note in our CLI is that when I ask for a date for example, I use a real date object. It’s not looking for the string. And in fact if I now say, what would this look like in Java itself? Now at first this looks a little complicated, right? This looks a little more complicated than SQL, but it really isn’t for the following reasons.
First of all, it’s the same kind of… it’s really the same two or three lines just repeated over and over again as I construct a thing. Second of all, it’s actually very powerful here to be able to dynamically construct filters and queries, because I don’t have to worry about where I am in the predicate path. I don’t have to worry about am I in the first or the second item and do I have to put in a white space and a comma? Do I have to do a to_date on anything? Do I have to put parentheses into an expression?
Although all these things are good for when you’re typing SQL at the command line, those very same syntactic features which are designed for humans work against you in programmatic construction of the query. And in MongoDB, if you could just see beyond this little code example here, you can see how very easy it is to add more expressions into the $or. How I can call out to another function that independently can craft a small filtering fragment that I can add in to my overall query. And obviously with the dates and some of the other structures, I have full fidelity of the types as well.
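Here is a small sketch of what programmatic query construction can look like when the query is just maps and lists. It is written in Python rather than the Java shown on the slide; the helper names and field names are hypothetical, while `$or`, `$lt`, and `$regex` are standard MongoDB query operators.

```python
from datetime import datetime


def hired_before(cutoff):
    """Filter fragment: hire date earlier than a real date object."""
    return {"hiredate": {"$lt": cutoff}}


def name_starts_with(prefix):
    """Filter fragment: name begins with the given prefix."""
    return {"name": {"$regex": "^" + prefix}}


def any_of(*fragments):
    """Combine independently built fragments under $or; no string splicing needed."""
    return {"$or": list(fragments)}


query = any_of(name_starts_with("B"), hired_before(datetime(2013, 1, 1)))

# Adding a third condition later is a list append, not a string edit.
query["$or"].append({"title": "architect"})

# With a pymongo collection this dict would be passed straight to find(),
# for example: for doc in collection.find(query): ...
```

Because the query is ordinary data, it can be printed, compared, and unit tested like any other map.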
All right, and I will repeat what I said before about all of this construction of the query: the same tools that you use to work the maps and debug them and print them out and all that, the same things that you’ve been doing with your data in maps and lists, you can do with the queries as well. That’s really powerful. And lastly, yes, we support the standard sort, skip, and limit operators around cursors in MongoDB. You don’t have to get the entire data shape back.
If you just wanted first name and last name out of that contact table, post predicate, you could ask for just those fields. And we do support aggregation, and in fact it’s for another webinar, but the power of the aggregation framework with the release of the 2.6 product is actually pretty profound. It works as a pipeline, you have a lot of flexibility in terms of manipulating data that’s coming in, grouping it based on essentially an expression not just by field names, right. There’s a functional expression that you can use to construct the keys. You can have multiple stages of grouping.
All sorts of really exciting features and it’s all done at the engine level, optimized at the engine level. We’re going to switch gears a little here, we’re about halfway through the hour. We’re going to talk about some RAD, switching to Python. And the reason we’re doing that is because as we start to explore some of these other environments, it’s good to see what the concepts are in MongoDB as opposed to some of the core syntactic detail. I’m a detail oriented guy which is why I dove into the Java first.
Sometimes the scripting environments hide things but in this case as we move into this part of the webinar, I think it actually exposes more. So it would… why would we want to do RAD? Well because, among other things, just assuming that you will only ever address your data with a single language is not a good idea. Coming back to the third slide in today’s presentation, there’s a plethora of languages. There is an even bigger, there’s a miasma of frameworks that can manipulate this stuff.
You have to assume now that the rate of change of technology and drivers and languages, is only going to increase and that you want to, frankly bring the power of all these languages together. But what you want to do is use the power of the language in a way that makes sense for the language. So, in looking at this here, one of the first things we see is that when I construct this piece of data that I’m going to save into MongoDB, because it’s Python, I don’t have to worry about making maps and making lists and inserting and adding, all that goes away. I just use the syntax of Python just to describe it and blast it in.
So you can imagine, coming back to what I was saying before, you added title and hire date, you wanted to backfill it. The script to do it is about five lines worth of code, all fields native. Toward the bottom here, and not to be underestimated because it’s at the bottom of the page, we’ll see that there’s an expression. Now on that other page we had the somewhat complicated Java construction of that $or expression for Mongo inside of Java. Here inside of Python it’s a lot simpler because I don’t, again, have to worry about maps and lists and construction.
Once I pass it in to the call to find, these functions like upper and sorted and c.keys, that’s not MongoDB. That’s Python. So Mongo’s driver for Python exposes all of this rich data in a form most convenient to the language that it’s running in. In other words, the idioms that are most important for that language are well expressed inside of our driver. So all of the tools and the tips and the tricks and the third-party and open source community that has functions and capabilities that operate on all these data are immediately available for you to use here in conjunction with MongoDB.
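The backfill script mentioned here might look roughly like this. It is a sketch, assuming `collection` behaves like a pymongo collection (which provides `update_many(filter, update)`); the default values and the field names `title` and `hiredate` are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical default for documents saved before hiredate existed.
UNKNOWN_HIREDATE = datetime(1970, 1, 1)


def backfill(collection, default_title="unknown", default_hiredate=UNKNOWN_HIREDATE):
    """Give title and hiredate defaults to documents that lack those fields."""
    # {"$exists": False} matches only documents missing the field entirely,
    # so documents that already carry a value are left untouched.
    collection.update_many({"title": {"$exists": False}},
                           {"$set": {"title": default_title}})
    collection.update_many({"hiredate": {"$exists": False}},
                           {"$set": {"hiredate": default_hiredate}})
```

The filters and updates are plain Python dicts, and the date goes in as a real `datetime` rather than a string.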
In other words it just works well with the ecosystem instead of just vending back a rectangle, which, you can have a rectangle in Python too, but that’s not nearly as exciting or as easy to work with. Probably the most important thing that ties all this together is polymorphism, which is our fancy term for being able to store more than one kind of shape inside of one collection. And again in the scripting world, this capability clearly is available in any language but it’s easily visualized inside of a scripting language like Python.
So here, what we’ve done is we’ve got an information architecture where name, ID and personal data are well known. There’s probably going to be an index on name or ID, something like that. And the field name of personal data is well known within our information architecture, but across different documents, the contents, the shape of personal data are different. At the top it’s preferred airports and travel time. At the bottom, Steve is more interested in the last account visited and his favorite number, which is a floating point.
Now what makes this really useful is that you can build applications where you’re storing things inside of Mongo, millions, billions of things, but you know that with the power of indexing on a few fields, two, maybe three, like ID and date, all of a sudden you can go, index-optimized, from billions to, let’s say, hundreds. You can then bring that data back into your application and dynamically react to the content that you see inside. So you can build GUIs that say, inside of personal data I see preferred airports, which is a list of strings.
Therefore I’m going to construct a little GUI widget that says, “Preferred airports” and has got maybe, for example, a comma-separated list of strings. For this last account visited in K9, it will ask, “What is your type?” It’ll say, “You are a map.” So I’ll walk the map, right? I’ll recursively walk the map and build other widgets that do names and values. You’re ultimately producing, probably, a popular tree kind of idiom. Obviously you can build a specific GUI that asks for very specific parts of this content, but that’s again your choice. You don’t have to do that. You can let the data drive the direction here.
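The "ask the value what type it is, then walk it" idea can be sketched in a few lines of Python. This is an illustration only: the two documents below are hypothetical stand-ins for the ones on the slide, and text lines stand in for GUI widgets.

```python
def describe(value, label, depth=0):
    """Return indented text lines for value, recursing into nested maps."""
    pad = "  " * depth
    if isinstance(value, dict):
        # "You are a map": emit a heading, then walk the children.
        lines = [f"{pad}{label}:"]
        for key, child in value.items():
            lines.extend(describe(child, key, depth + 1))
        return lines
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        # A list of strings becomes a comma-separated display.
        return [f"{pad}{label}: {', '.join(value)}"]
    # Anything else is shown as a simple name/value pair.
    return [f"{pad}{label}: {value!r}"]


# Hypothetical documents: same well-known top-level fields, but the
# shape of personalData differs from one document to the next.
jane = {
    "name": "jane",
    "id": 1,
    "personalData": {
        "preferredAirports": ["LGA", "JFK"],
        "travelTimeThreshold": {"value": 3, "units": "HRS"},
    },
}
steve = {
    "name": "steve",
    "id": 2,
    "personalData": {
        "lastAccountVisited": {"name": "Acme", "visits": 12},
        "favoriteNumber": 3.14159,
    },
}

for person in (jane, steve):
    print("\n".join(describe(person["personalData"], "personalData")))
```

The walker never names `preferredAirports` or `favoriteNumber`; it reacts only to the types it finds, which is what lets one piece of display code serve both shapes.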
That’s a very, very kind of powerful thing and it’s one of the reasons why I showed you the earlier diagram with the somewhat inflated ER diagram and then the simplified version in MongoDB. That’s why it’s not so far off, because capabilities like these allow you to rethink how you want to structure your data. You don’t need 22 extra tables to hang on to different shapes of things, where really, eight out of the… there are eight common fields and a small number of varying fields, or frankly, even a large number of varying fields. You can place it all inside of one collection and let the data do the talking.
That when you pull that number back out you get the same floating point number. When you put a date into your blob you get back a date time accurate to millis in all of your languages. All right, these are very important considerations if you want to focus on working with your data and not wasting time trying to build adapters and conformance layers to ensure that the data is properly serialized and de-serialized in this blob area.
So in summary what changed? You’ve got to really look back, and this is why we started a bit with the history here. In the old days, CPU and disk was not what it is today. Memory in particular was very expensive. And there’s a lot in MongoDB vis-a-vis performance that is enhanced by its ability to aggressively use memory. We’re a distributed database, we’ll touch upon this in just a little bit. And so we may have to worry about that, this will ruin our distributed things. And again I’m an upstart kind of a guy, so the languages and the types and the things that you could do. There was no malloc in the old days, you just didn’t say “new object”. Everything was compile time bound.
In the year 2014, 14 years into this century, we have a lot more flexibility. We have a lot more power at our fingertips. I can’t really blame the environments of old for doing what they had to do because that’s kind of how they got by, but in MongoDB, when the data is the schema, there’s a lot of power that is brought to the table. Now what does that mean a little more broadly?
You can now start to do things like this. So everybody has suffered through the pain of saying “How do I do reconciliation or maybe like version delta on a thing?” Whether it be a trade or a product or a leg or a catalog entry, or a recipe it doesn’t matter, farm implements. Unless you’re in single table world, and that’s essentially never, then you’ve got a problem. You can do one of two things. You can build a piece of software that will hydrate your RDBMS world into an object then hydrate another object and do an object to object compare.
If you’re lucky the people doing all that stuff will have implemented Comparable, properly I guess, and maybe you can iterate through the results that way. But it’s still… it’s a bit of work. If you do it at the database level, by dumping tables and getting rows of things and emitting CSVs, you are setting yourself up for just a world of pain. We all live this pain every single day. We’ve just grown accustomed to it, but it’s not necessarily the way you’d want to do it. This example on top, this is how you would do it in MongoDB.
If you substitute a collection for some of the things that we saw in Python with Steve and Jane, I think it was, where we had preferred airports and last account visited, I can generically ask for an entire set of data, walk that data and then say, “This record has got this field, this one has another field.” I can say, “This record has got a date which is of type date, somehow this guy has a date of type integer”, and I can flag that. The work here goes into the map [Indiscernible] [0:43:48]. Which by the way is not… there’s also… you can Google, Globe has got one of those as well. We happen to have one here as well, but that’s all you have to do. All of a sudden this problem of understanding differences in your data becomes very easy because now you’re letting the data do the talking.
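A generic comparison of that kind can be sketched in plain Python. This is not the utility shown in the webinar, just a minimal illustration under the assumption that documents arrive as nested dicts; the function name `diff_maps` and the sample records are made up.

```python
from datetime import datetime


def diff_maps(a, b, path=""):
    """Return (path, message) pairs describing how two documents differ."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else key
        if key not in a:
            diffs.append((here, "only in second"))
        elif key not in b:
            diffs.append((here, "only in first"))
        elif type(a[key]) is not type(b[key]):
            # Same field name, different type: flag it rather than compare.
            kinds = f"{type(a[key]).__name__} vs {type(b[key]).__name__}"
            diffs.append((here, f"type {kinds}"))
        elif isinstance(a[key], dict):
            # Both sides are maps: recurse and keep a dotted path.
            diffs.extend(diff_maps(a[key], b[key], here))
        elif a[key] != b[key]:
            diffs.append((here, f"value {a[key]!r} vs {b[key]!r}"))
    return diffs


# Hypothetical records: one stores the date as a date, one as an integer.
first = {
    "name": "jane",
    "date": datetime(2014, 1, 1),
    "personalData": {"favoriteNumber": 3.14},
}
second = {
    "name": "jane",
    "date": 20140101,
    "personalData": {"favoriteNumber": 2.71, "extra": True},
}

for where, what in diff_maps(first, second):
    print(where, "->", what)
```

Because the comparison works on maps rather than on bespoke classes, the same function applies to a trade, a product or a recipe without any per-type code.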
Another use case that comes up very often is how do you pull together things, sets of data, and continually add overrides? So you’ve got your baseline, let’s say preferences for a community, and then when somebody logs in there’s a company level, and then a group level and then the personal user’s preferences. You want to be able to overlay these things. And traditionally overlaying anything in the RDBMS world is really tough, and largely you just pop things out into bespoke objects and try and do it that way.
With MongoDB it’s very easy just to grab the shape out now that it’s expressed as a map, continue to stack it, and at the very end say, “If I do a top-down look of my data, what does the full map look like, overrides and all?” I could even ask at any point, what call to what record, document, precipitated a change in a particular value? You get all that kind of flexibility because you’re in the rich map ecosystem. You’re not just dealing in the result set world of ints, doubles, dates, and strings in a code framework that’s heavily geared towards the database, not geared toward the structures and objects that are natively and fluidly manipulated inside the host language.
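Stacking overrides and remembering which layer set each value can be sketched as follows. The layer names follow the community, company, group and user levels described above; the preference fields and the `overlay` helper are hypothetical. The sketch assumes a given key holds the same kind of value (map or scalar) in every layer.

```python
def overlay(layers):
    """Stack (label, map) layers; later layers override earlier ones.

    Returns (merged, origin), where origin maps each dotted path to the
    label of the layer that last set it.
    """
    merged, origin = {}, {}
    for label, layer in layers:
        _apply(merged, origin, layer, label, "")
    return merged, origin


def _apply(target, origin, layer, label, prefix):
    for key, value in layer.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested maps so siblings are not wiped out.
            child = target.get(key)
            if not isinstance(child, dict):
                child = target[key] = {}
            _apply(child, origin, value, label, path)
        else:
            target[key] = value
            origin[path] = label


# Hypothetical preference documents, most general first.
layers = [
    ("community", {"theme": "light", "locale": "en_US",
                   "alerts": {"email": True, "sms": False}}),
    ("company", {"theme": "dark"}),
    ("group", {"alerts": {"sms": True}}),
    ("user", {"locale": "fr_FR"}),
]

merged, origin = overlay(layers)
print(merged)
print(origin)
```

The `origin` map answers the "which document precipitated this value" question for any path in the final result.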
In summary, so what does this all add up to? I believe that once you get out of the trivial cases it’s actually easier to use MongoDB to interact with your data than RDBMS for some of your bigger problems. I’m not saying that MongoDB solves every problem but beyond some of just the trivial cases it harmonizes much better with modern programming languages and ecosystems. When you take that and then you add in a lot of the things that we didn’t talk about today because today was more about SQL and the software interface stack and not sort of the infrastructure.
Having a full suite of robust indexing capabilities, plus the horizontal scale story and actually what is a very exciting integrated and really an isomorphic HA and DR kind of strategy. It really adds up to MongoDB being a modern database for some modern solutions. And with that we’re now going to go into the Q&A portion of the webinar and you’ll just go to cluster seven.
Doing one question… okay. Okay so there are about seven or eight, I’ll say common questions, I’m going to try and address them in order. The first is “Don’t you still have to marshal the data from the app layer into the document before you can write it to the DB?” The answer is yes, of course. In our examples about a third of the way in, when I was constructing in my day two and day three examples that map of data where I was adding title and hire date, that’s what the application would be writing. The application might have bespoke objects, class contact, class organization, these sort of things, with rich getters and setters.
But at the point where they’re going to go persist, some utility in between is going to say “I’m going to go after that class, extract these things and load the data-only portion, not the compile-time-bound portion of the data, into the map and pass that into the data access layer for persistence.” And that is similar to… there is a second question which is a follow-on to that… the question is “There’s no need to alter tables but marshaling and unmarshaling is similar.” And the answer is from the application layer into a map it is similar, but from the map into the database… first of all the richness of expression means that you don’t have to worry about creating auxiliary tables for one-to-N relationships.
But also you’re not bound to always do alter tables before you make those changes. Into these changes we’ve had. Okay good… great question here, “Assuming you accumulate data in version one, and after this change more fields are mandatory, how do you migrate the data in version structure one to version structure two?” So what’s great is there’s an approach to this called soft versioning. And I’ll just say, imagine that some of those shapes that we saw going in and out of Mongo had a version, an integer, just V, doesn’t matter. The version number is not about like whether it was canceled, corrected or that sort of thing.
The version number is an explicit piece of information architecture that defines not only the shape but the business intent of the shape. And if you start with an information architecture that on day one says, “I’m calling this thing V1,” then your data access layer can interrogate it. And based on all the things that are flowing back and forth it can say, “Hey, you know I used to have hire date like as… or termination date as a single date. But now there’s a list of them, and my applications need to respond to a list.” You can clearly identify the origin of that data by saying “The ones that had it as scalar are V1.
The ones that have it as a list are V2.” Then it’s easy, in your data access layer, simply to say, “What is the version of the thing?” If you’re V2, I’m going to ask a different question of that map and then project it up to the application in a different way. It’s more… in summary it’s more of an information architecture issue and a coding issue rather than a database issue. The next question we have here is… okay so why is this more… a good question, “Why is this more advantageous than doing a select followed by another select rather than by doing a join?”
The answer is, well, you could do it that way but you’re going to run into a couple of problems. First of all you have to do those sort of nested selects for everything that ends up being non-scalar. So in our example where we saw contacts and phone numbers, there’s always going to be a very high percentage of information architectural models where it is truly one-to-N where the N are bound solely to the parent, right? There are some counterexamples, like accounts and transactions, which you’d probably keep as two separate collections and then you’d always want to join them.
But for those things that you don’t have to or don’t want to join, in MongoDB you don’t have to. So again it’s all about options. Let’s see… so there’s another question, “Do I have to do a collection drop and then an insert to alter any document?” No, to alter a document, to update it if you will, you can update in place. So I can write a loop that says find things based on an update predicate, then for everything that’s found I can either replace or overlay specific fields inside that document. So you do not have to, you certainly don’t have to drop, nor is it an insert or even an upsert kind of capability. It is an in-place update that can take place.
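As a rough sketch of what "update in place" looks like from Python: PyMongo collections expose `update_many(filter, update)` and `replace_one(filter, replacement)`. The helper names `backfill` and `replace_by_id`, the field names, and the small recording stand-in class below are assumptions added so the snippet runs without a server; with a real deployment you would pass an actual PyMongo collection instead.

```python
class RecordingCollection:
    """Stand-in for a PyMongo collection so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def update_many(self, predicate, change):
        self.calls.append(("update_many", predicate, change))

    def replace_one(self, predicate, replacement):
        self.calls.append(("replace_one", predicate, replacement))


def backfill(collection, field, value):
    """Overlay one field on every document that lacks it, in place."""
    predicate = {field: {"$exists": False}}  # which documents to touch
    change = {"$set": {field: value}}        # set one field, keep the rest
    return collection.update_many(predicate, change)


def replace_by_id(collection, doc_id, new_doc):
    """Swap the whole document with the given _id for new_doc."""
    return collection.replace_one({"_id": doc_id}, new_doc)


people = RecordingCollection()
backfill(people, "title", "Unknown")
replace_by_id(people, 42, {"_id": 42, "name": "steve", "title": "Dr."})
for call in people.calls:
    print(call)
```

Neither call drops the collection or re-inserts documents; `$set` overlays specific fields and `replace_one` swaps the body of a matched document.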
Next question is “I just delivered an application to an agency with no impedance between MongoDB backing, Clojure middle tier and job”… okay that’s good, “any change on the”… this is not a question, it is a statement, but it’s worth repeating because it’s complimentary. So thank you Mr.… actually I don’t have… I only have initials. The statement was any… now it’s been clicked off. Okay so we’ll move on to our next question. “In the example, if the history and transaction was this big it would be a problem since 16 case limits for single doc”… okay. So two things, the limit for any particular document in MongoDB is 16 megabytes.
And perhaps I went through that part of the example a little too quickly. The phone transactions list is… it’s a collection with separate phone transactions. It’s not one document with a lot of transactions in it. There’s millions or billions of those number, target and duration tuples sitting in there. So it’s not subject to that limit at all. Next question is “Can you extend Mongo by adding new types?” So currently… that’s a great question especially with the geo stuff. Some people have asked us to expose more of the geospatial indexing and the geohash algorithms so they can do more than two-dimensional indexing on numerics.
Currently we’re looking at providing a facility for injecting new kinds of types into MongoDB, looking at it on the road map. The challenge in this is not so much the [Indiscernable 56:03] if you will, just the byte serialization or the bit serialization of these types. It’s the way they work with the predicate and the aggregation engine. At the end of the day you want all your types to have some sort of a reasonable ability to deal with beyond just equals and not equals, to be able to do greater than and less than and all sorts of things. You need sometimes ways to promote or demote from one type to the next.
This is a little more obvious for example, like the big decimal to double to integer realm, but the same extends to other types as well. And I guess we’ve got time for one more question. Which is… okay that’s a long question. All right, well I’m going to go for the shorter question because we’re just about coming up on the top of the hour and that is “In MongoDB can you store byte array data?” And the answer is yes. Byte bracket, bracket is absolutely a supported type used very popularly in a lot of our use cases to store both what I consider to be unstructured data, and ironically PDFs and Microsoft Word documents or anything.
Images, videos, whatever you like in byte array, and then using as its peers other kinds of rich shapes and structures to define the meta data around it. Which you can then index and then you can bring the byte array into your application. All right with that we’re really coming up on the top of the hour. So with that I’ll close by saying thank you very much for your attendance today and for the good questions. If you have any further questions you can e-mail me at the address you see there, or you can reach out to somebody else at MongoDB.
I hope the material you’ve seen here today has provided a little bit of insight and maybe jogged your thoughts on what it means to move from your traditional world into MongoDB and how it’s really less scary. And I think a lot more capable than you might have otherwise envisioned. And so with that be well, code well and thank you very much.