GIANT Stories at MongoDB

Building your business is hard. Scaling your business data should not be.

Kelly Stirman

Business

Building your business is hard. Scaling your business data should not be.

That's the message Sailthru CTO and co-founder Ian White relayed recently in New York. Over the course of a half-hour, White explained how Sailthru first did application-level sharding of its data out of necessity, but later moved to MongoDB's auto-sharding to massively simplify development.

Success in the Billions

Sailthru makes it easy for ecommerce and media brands to personalize content across a variety of channels, including email, onsite, mobile, social and more. As the company's customer base has swelled to over a billion users, 125 million content documents (e.g., URLs and products relevant to particular users) and 5 billion messages per month, Sailthru has come to store over 40 terabytes of data in MongoDB across 120 nodes on mostly physical infrastructure.

As White suggests, "You can’t store this volume of data on just one node. We had to shard."

Application-level Sharding at Sailthru

When Sailthru first started, it didn't need sharding. But within two years Sailthru's customer count and data volumes were high enough that the company needed to partition its data. The question was: How?

While some applications are either read heavy (online media site) or write heavy (logging and clickstream), Sailthru is both. As White explains, "We have to be able to read data and write personalized recommendations in real-time. MongoDB is a great database for this."

Sailthru adopted MongoDB in the early days -- over four years ago. Prior to MongoDB 1.6, Sailthru partitioned much of its infrastructure using in-app sharding logic, as MongoDB didn't yet support auto-sharding. Sailthru partitioned data by client. Their application would examine each query, and dispatch to the appropriate replica set and collections based on a mapping configuration. This approach worked fine for a time at Sailthru.

However, as Sailthru’s data grew, application-level sharding introduced significant code complexity and administration overhead. Application-level sharding also contributed to uneven load distribution, something Sailthru was able to Band-Aid by scaling up with more expensive servers. But the database team still had to manually rebalance and reallocate resources – every time Sailthru onboarded a sizable client that required a new shard, the database team would have to go in and add another line to the config file and redeploy. It was painful and demanding.

Enter Auto-Sharding

With the introduction of automatic sharding in 2010’s 1.6 release, the database itself manages the effort of distributing and balancing data across shards automatically. Sharding is transparent to applications – for 1 or 100 shards, the application code is the same.

Setting up a sharded cluster involves making a critical decision: choosing a shard key. The shard key is the value the database uses to determine placement of each document within the shards. The Sailthru team considered several options, including sharding on client ID, MongoDB ID, or email. MongoDB supports multiple sharding strategies, and each is appropriate for different use cases. Ultimately, they opted to use hash-based sharding and MongoDB’s ObjectId as the shard key. With this approach, MongoDB does the work of ensuring a uniform distribution of reads and writes by randomizing the placement of documents across shards.
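
The mechanics are simple once the key is chosen. As a rough sketch only (the mongos host, database, and collection names here are hypothetical, not Sailthru's), enabling hashed sharding on _id comes down to two admin commands, shown here via the MongoDB Java driver:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class EnableHashedSharding {
    public static void main(String[] args) {
        // Hypothetical mongos host and namespace; Sailthru's real cluster details aren't public.
        MongoClient client = new MongoClient("mongos-host", 27017);
        DB admin = client.getDB("admin");

        // Enable sharding on the database, then shard the collection on a hashed _id
        // so the ObjectId values spread documents evenly across shards.
        admin.command(new BasicDBObject("enableSharding", "crm"));
        admin.command(new BasicDBObject("shardCollection", "crm.contacts")
                .append("key", new BasicDBObject("_id", "hashed")));

        client.close();
    }
}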

To make the actual migration from application-level sharding to auto-sharding, the team used an open source tool created by MongoDB called Mongo Connector. In the process of the migration, Sailthru forked the project, making significant contributions specific to their use case.

With this change, it’s now possible for Sailthru to add shards without making any change to the code base. This meant that during a critical ramp-up time of tight resources and tight cash, Sailthru was able to focus their engineering efforts on improving their service and building new features, ensuring their phenomenal success.

Build the Next Big Thing on MongoDB

Thousands of organizations use MongoDB to build high-performance systems at scale.
If you're interested in reading up on your own, download our Operations Best Practices white paper for additional information on operating and deploying a MongoDB system:
Ops Best Practices

About Kelly Stirman

Kelly Stirman is Director of Products at MongoDB. Kelly works closely with customers, partners and the open-source community to articulate how MongoDB is quickly becoming the world's most popular database. For over 15 years he has worked at the forefront of database technologies. Prior to MongoDB, Kelly served in executive and leadership roles at Hadapt, MarkLogic, Oracle, GE, and PricewaterhouseCoopers.

The Salamander: Using Open Source Solutions to Visualise and Improve a Bank’s Critical Internal Operations

Understanding internal operations is crucial in financial services. Are public interfaces running smoothly? Are the back-end business systems as productive as they could be? Are infrastructure resources being allocated correctly based on business need? These are exactly the types of questions organizations should be able to answer but, surprisingly, struggle with.

However, gaining a clear view into IT operations is often easier said than done, and it cannot be achieved with numerical or log reports alone. For real insight into an organization's operations and systems, we need to cut through the mountain of raw log data and turn it into visualizations and dashboards that make relationships clear to the human eye.

The Salamander

Spanish bank BBVA needed a better understanding of its technical operations. The bank reached out to one of its innovation management divisions, BEEVA, to create a solution that would manage a wide variety of data and produce clear visual reports that could be quickly acted upon.

Called The Salamander, the tool has provided the bank with an unparalleled ability to optimise and simplify business IT processes, which ultimately saves costs and leads to an improved customer experience. Built on open source software, the project has also given the bank significant savings by avoiding the expensive licensing costs of traditional software and increasing the pace of development.

BBVA executes thousands of batch jobs daily, structured in job chains with complex dependencies and a strict execution sequence. To better understand these chains, BBVA required a workflow tool to visualise the dependencies so it could identify process enhancements and potentially reduce the number of jobs. The task was big and complex and required integrating a number of next-generation technologies.

The Salamander team designed a solution running on a cloud computing architecture with MongoDB as the primary repository. MongoDB was chosen because it can scale large unstructured datasets across commodity servers in the cloud. Apache Pig and Hive execute data processing and transformation of raw data from the mainframe scheduler. MongoDB, which stores the processed data, works with a graph database to identify job relationships for faster graph-based queries, such as finding the minimum path between two given jobs. Finally, the application provides a graphical interface to browse the job catalogues. These generate data visualisations that clearly demonstrate the relationships between operations. The front and back ends of the application communicate via RESTful APIs, and Node.js-based servers provide elasticity when accessing the stored data.

The tool is already helping the bank’s staff understand the complex network of batch jobs and processes that connect all of its services. The visualisations that Salamander creates are answering those crucial questions and helping to give BBVA an edge in the global banking industry.

Further reading: Find out more about quantifying business advantage in this whitepaper on The Value of Database Selection.

MongoDB vs SQL: Day 30

Buzz Moschetti

Technical

Welcome back to our blog series comparing building an application with MongoDB versus building the same application with SQL. Today, we’re going to switch gears a little bit and talk about rapid application development (RAD), specifically using Python as an example.

First of all, why would we want to do RAD?

It should be obvious by now that assuming that we’ll only ever address our data with a single language is not a good idea. If you’ll recall from the first post in this series, there are now a plethora of languages we can use, and an even larger ecosystem of frameworks that we can utilize to manipulate our data. We have to assume at this point that the rate of change of technology, drivers, and languages is only going to increase. We want to bring together the power of all of these languages but we want to do it in a way that makes sense for each language. Let’s begin by reintroducing the Contacts theme we have been exploring for a few weeks.

When you look at the code above, you’ll notice that when we construct this piece of data to save into MongoDB with Python, much of the overhead of explicitly constructing Maps (name/value pairs) and Lists goes away. Instead, we can just use the syntax of Python to describe it and save it.

If we wanted to go back and add titles and hire dates to existing documents, the backfill logic would be about ten lines worth of code. Of particular interest is the simplicity of the predicate expression at the bottom of the code compared to when we tried to do the same thing using Java; again, with Python we no longer have to be burdened with the explicit construction of Maps and Lists. Also, the functions in the output formatting (upper(), sorted()) are not MongoDB, they’re native Python. MongoDB’s drivers expose all of this rich data in a form most convenient to the host language. This gives us the capability to use all of the tools, tips and tricks from third parties and the open source community to operate on our data.

Probably the most important thing that ties all of this together is polymorphism, which is our fancy term for being able to store more than one kind of a shape inside of one collection. This capability is available in any language but it’s easily visualized inside of a scripting language like Python.

What we have above is an information architecture where name, ID and personalData are well known. There’s probably going to be an index on name and/or ID. The field name of personalData is well known within our information architecture, but across different documents, the contents, and the shape of personalData are different. For Bob, we have preferred airports and travel time, while Steve is more interested in the last account visited and his favorite number, a floating point approximation of pi. What makes RAD really powerful in MongoDB is that we can let “the data do the talking.” Through index optimization we can quickly navigate to a small set of documents and then, for each document retrieved, have the host code ask “what is your type?” and react dynamically to the response. In the case of Maps, we can recursively ‘walk’ the map until we get to name-scalar value pairs that can be easily formatted to appear on a GUI. Obviously, we can also build specific, visually optimized GUIs that ask for very specific parts of the content.
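
The "let the data do the talking" walk is not limited to scripting languages. As a rough sketch in Java (the shapes and field names below are assumptions echoing the Bob example, not code from the original post), a generic recursive walk might look like this:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShapeWalker {
    // Recursively descend through maps and lists until we reach scalar leaves,
    // printing each one with a dotted-path prefix suitable for a generic GUI.
    static void walk(String path, Object value) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                String child = path.isEmpty() ? e.getKey().toString() : path + "." + e.getKey();
                walk(child, e.getValue());
            }
        } else if (value instanceof List) {
            List<?> items = (List<?>) value;
            for (int i = 0; i < items.size(); i++) {
                walk(path + "[" + i + "]", items.get(i));
            }
        } else {
            System.out.println(path + " = " + value);  // scalar leaf: ready for display
        }
    }

    public static void main(String[] args) {
        // Hypothetical polymorphic document, echoing the "Bob" example above.
        Map<String, Object> personalData = new HashMap<>();
        personalData.put("preferredAirports", Arrays.asList("JFK", "LGA"));
        personalData.put("maxTravelTime", 3);
        Map<String, Object> bob = new HashMap<>();
        bob.put("name", "bob");
        bob.put("personalData", personalData);
        walk("", bob);
    }
}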

This is a very powerful feature and it’s one of the reasons why the ER diagram versus the simplified MongoDB representation isn’t so far off from reality; capabilities like this allow us to rethink how we want to structure our data. We don’t need 22 extra tables to hang on to different shapes of things. We can place everything inside one collection and let the data do the talking, so to speak.

We don’t really have to spend much time on the code snippet above because the current convention to deal with polymorphic data in SQL is to just BLOB it, either via serialized Java, XML, JSON as a single string, third party representations (like Avro), or something home-grown. With all of these approaches we will lose query capability on our data, and with most we will also lose cross-language type fidelity information on it, meaning if we store a floating point number, we have to make sure our post-API representation is consistent across all versions of Java, C#, JavaScript, Python, Perl, etc. These are very important considerations if we want to focus on working with our data and not waste time trying to build adapters and conformance layers to ensure that the data is properly serialized and de-serialized.

At a fundamental level, what has changed?

We have to look back and once again understand why we started this series with a little bit of a history lesson. In the old days, when RDBMS systems were conceived, CPU was slow. Disk was slow. Memory in particular was very expensive. There was no malloc() in the old days; we couldn’t code “new MyObject()”, never mind assess the performance optimization around making the call in the first place. Everything was compile-time bound and locked into a small set of datatype primitives that could be efficiently handled by the CPU and the operating system. In the year 2014, we have a lot more flexibility. We have a lot more power at our fingertips. We can afford the few extra cycles to let the data tell the code what it is, in exchange for a much more versatile and adaptable data access layer.

More broadly, this power allows us to construct software that operates on generalized sets of problems, independent of the specific business domain (finance, manufacturing, retail, etc.). Consider the examples below:

Everybody has suffered through the pain of satisfying a requirement like “How do I do reconciliation or find the version delta?” Whether it’s a trade or a product or a recipe or a catalog entry; it doesn’t matter. Unless you’re in single table world - and that’s essentially never - you have a problem. You can do one of two things: You can build a piece of software that will hydrate your RDBMS world into an object then hydrate another object and do an object-to-object compare. If you’re lucky the people doing all that will have implemented Comparable (in the case of Java), and maybe you can iterate through the results that way. But it’s still quite a bit of work. If you do it at the RDBMS level by dumping tables and getting rows of things and processing CSVs, you are setting yourself up for a world of pain of brittle feeds, vaguely typed formatted data, and lots of commas. We all live this pain every single day. We’ve just grown accustomed to it, but it’s not the way you’d want to do it.

The example at the top of the image above shows how you would do it in MongoDB combined with generic software. We can generically ask for an entire set of data, walk the structure, and accurately capture value and datatype differences. There is an investment here in the MapDiff.difference() function but once completed, it can be reused across any information architecture.
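
Purely as an illustrative sketch of what such a generic, reusable difference function might do (simplified, and not the actual MapDiff implementation), consider:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MapDiffSketch {
    // Return dotted path -> description for every field whose value or type differs,
    // or that exists on only one side. Lists could be handled similarly if needed.
    static Map<String, String> difference(Map<?, ?> a, Map<?, ?> b, String path) {
        Map<String, String> diffs = new HashMap<>();
        Set<Object> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (Object k : keys) {
            String p = path.isEmpty() ? k.toString() : path + "." + k;
            Object va = a.get(k);
            Object vb = b.get(k);
            if (va == null || vb == null) {
                diffs.put(p, "present on one side only");
            } else if (va instanceof Map && vb instanceof Map) {
                diffs.putAll(difference((Map<?, ?>) va, (Map<?, ?>) vb, p)); // recurse into substructures
            } else if (!va.getClass().equals(vb.getClass())) {
                diffs.put(p, "type differs: " + va.getClass().getSimpleName()
                        + " vs " + vb.getClass().getSimpleName());
            } else if (!va.equals(vb)) {
                diffs.put(p, "value differs: " + va + " vs " + vb);
            }
        }
        return diffs;
    }
}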

Another use case that comes up very often is ‘how do you pull together sets of data and continually add “layers” of overrides’? You’ve got your baseline, let’s say preferences for a community, and then when a new user is added there are defaults at a company level, then defaults at a group level, and finally the user’s preferences. You want to be able to overlay these things. Traditionally overlaying anything in the RDBMS world is really tough, and largely what we’ve done in the past is hydrate things into bespoke objects and do the layering logic there. Unfortunately, this also introduces a whole set of compile-time dependences.

With MongoDB it’s very easy just to iteratively extract shapes expressed as a map, “stack” the maps, and at the very end produce a “top-down look” of the stack. It is easy to add a feature that allows us to ask “Which Map in the stack produces the value we see in the top-down look?” You get all that kind of flexibility because you’re in the rich map ecosystem. You’re not just dealing in the flat ResultSet world of ints, doubles, dates, and strings in a code framework that’s heavily geared towards the database. With MongoDB, you’re now geared toward the structures and objects that are natively and fluidly manipulated inside the host language.
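
As a hedged sketch of the stacking idea (the layer names and preference fields are made up for illustration), the "top-down look" is just an ordered merge of maps:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PreferenceStack {
    // Merge preference layers in order; later layers override earlier ones.
    // A fuller version might recurse into sub-maps and record which layer "won" each field.
    static Map<String, Object> topDownLook(List<Map<String, Object>> layers) {
        Map<String, Object> merged = new HashMap<>();
        for (Map<String, Object> layer : layers) {
            merged.putAll(layer);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> company = new HashMap<>();
        company.put("theme", "light");
        company.put("locale", "en_US");
        Map<String, Object> group = new HashMap<>();
        group.put("theme", "dark");
        Map<String, Object> user = new HashMap<>();
        user.put("locale", "en_GB");

        // Baseline first, most specific last.
        System.out.println(topDownLook(Arrays.asList(company, group, user)));
        // -> {locale=en_GB, theme=dark}
    }
}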

Lastly, let’s spend just a couple of minutes on the command-line interface (CLI). What’s important (and exciting!) to know is that in MongoDB, our CLI is actually just the V8 JavaScript engine (the same as in Chrome) that loads the JavaScript driver for MongoDB plus a few extra syntactic add-ons. In other words, with our CLI, you can program JavaScript in it all day long and never even touch MongoDB. This is a very different paradigm from iSQL and PSQL and other SQL CLIs that largely only manipulate SQL, not programming elements like variables, functions, and branch control.

Because the MongoDB CLI is “Javascript first, MongoDB second,” there’s a whole host of powerful native applications or utilities that you can write in JavaScript that fluidly manipulate all kinds of things in addition to manipulating MongoDB. An example appears in the image below. We’re asking for contacts with sequence numbers greater than 10,000. Then we invoke the explain method to get back information from the engine on how long it took, the number of items it scanned, etc.

I can easily put that into a loop, and then push it onto an array “v” to produce an array of explain() output. Because this is just JavaScript in our CLI, I can grab the jStat package off the Internet and I can run standard deviations or anything else I want. Lastly, I can not only capture these things as they are emitted as rich shapes, but also turn them right around and insert them back in. For example, I could adapt this to have a while loop around it and it will continuously run a little script that sees what the timings are for a particular kind of query against my database, take the results, and place them back into the database, creating a performance “log.” And this log itself is rich shapes, available for full fidelity querying and manipulation! Simple. Symmetric. Efficient.

In conclusion, it’s our belief that once you move beyond trivial use cases, it’s actually easier to use MongoDB to interact with your data than RDBMS for some of your bigger problems. MongoDB harmonizes much better with modern programming languages and ecosystems than RDBMS. When we take that and layer in some of the things that we didn’t cover in this series like robust indexing, horizontal scaling, and isomorphic high availability and disaster recovery, MongoDB becomes the modern database you’re better off with for your modern solutions.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Click back to MongoDB vs SQL: Day 14 (Part 2)


About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

Leaf in the Wild: Scaling China’s Largest Car Service App with MongoDB

Mat Keep

Business

Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.

Kuaidi uses MongoDB at the heart of its taxi hailing service, connecting drivers with passengers up to 6 million times a day, and managing nearly half a billion orders. Kuaidi has scaled MongoDB across 4 geographic regions, serving thousands of reads and writes every second.

Following his presentation at last month’s MongoDB Day in Beijing, I sat down with Ouyang Kang, Chief Architect at Kuaidi, to learn more about how China’s leading taxi booking application is using MongoDB, and his recommendations for those getting started with the database.

Smartphone based taxi-calling and ride-sharing services are growing at an astounding rate – attracting significant investment (and huge company valuations). They are also intensely competitive. The choice of technology will ultimately drive success or failure in the market. In the world’s most populous country – and one suffering the most severe traffic congestion – the importance of using agile and scalable technology for transportation services is magnified.

Please start by telling us a little bit about your company.

Kuaidi was founded in 2012 and has grown to become Greater China’s largest car service application1, attracting investment from Alibaba and Matrix Partners. In just 2 years, we have attracted 100 million users who place up to 6 million ride requests every day via our smartphone app, connecting them to 3 million drivers in more than 300 cities across China. And we are continuing to grow fast.

The goal of Kuaidi Group is to improve the efficiency of urban transportation and the population’s quality of life. We currently operate 2 branded services – Kuaidi Taxi and Kuaidi ONE – which provide taxi and chauffeured limousine services respectively. Our long term plan is to offer services for every facet of passenger transportation combining location-based mobile technologies, data mining of our huge user base and intelligent routing algorithms.

Tell us how you use MongoDB.

At the heart of our taxi booking application is the location based service, and we rely on MongoDB for this. Using MongoDB’s geospatial indexes and queries we can track the location of our drivers in real time, using it to connect users with their closest taxi, and displaying updates directly to the customer’s app. The location data is constantly being updated and queried.
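
As an illustrative sketch only (Kuaidi's actual schema and field names are not described in the interview), a nearest-driver lookup of this kind with the Java driver and a 2dsphere index might look like:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;
import java.util.Arrays;

public class NearestDrivers {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection drivers = client.getDB("taxi").getCollection("drivers");

        // One-time setup: geospatial index on the drivers' current location.
        drivers.createIndex(new BasicDBObject("loc", "2dsphere"));

        // Find drivers within ~2km of the passenger, nearest first.
        BasicDBObject point = new BasicDBObject("type", "Point")
                .append("coordinates", Arrays.asList(121.4737, 31.2304)); // lon, lat
        BasicDBObject near = new BasicDBObject("$geometry", point).append("$maxDistance", 2000);
        DBCursor cursor = drivers.find(new BasicDBObject("loc", new BasicDBObject("$near", near)));
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        client.close();
    }
}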

We also use MongoDB as an active archive of our order data. Each time a customer requests a taxi, the journey’s start and end points, the driver identity and fare are stored in a single record. We initially built our archive on top of MySQL, but once our order volume exceeded 100 million records, we hit scaling limits. We knew MongoDB scaled, so we migrated the archive to get the cost and performance benefits of horizontal scale out.

What other databases do you use?

We use Redis for caching and MySQL to store operational customer and order data. We also replicate data from MongoDB and MySQL into Hadoop for data mining and analytics.

Did you consider other databases for your app? What made you select MongoDB?

We considered three options for our location based service:

  • Relational solutions based on MySQL and Postgres
  • SOLR (for the search element of the application)
  • MongoDB

We evaluated each on multiple criteria, including

  • Performance. We measure performance on multiple dimensions: latency, which is critical for good user experience on mobile apps; and speed of real time updates, so we are always working from the freshest data
  • Scalability. We were confident that the service would quickly gain traction, so knowing we could scale our database on demand was paramount
  • Ease-of-Use. We needed to achieve our performance and scalability goals without burdening our developer and operations team with complexity

We evaluated all of the options against these criteria, and found MongoDB to be the best choice for us. It met the performance objectives. We found it easy to develop against. What was really important was that it proved easy to deploy and easy to run at scale.

Please describe your MongoDB deployment

Our MongoDB database is sharded across four geographic regions. A 7-node replica set is deployed in each region (6 data-bearing nodes and an arbiter). This deployment enables us to place data physically closer to local users for low latency access, as well as provide the scalability and resilience our application needs. We cannot tolerate downtime at all. We use Nagios for monitoring the application and database.

Geo-Distributed MongoDB Deployment at Kuaidi

We are running MongoDB 2.6 with the Java driver.

Are there any metrics you can share?

Yes.

  • MongoDB is serving 50,000 operations per second (split 80:20 between reads and writes)
  • Our database has grown to just under half a billion documents and continues to scale

Do you have plans to use MongoDB for other applications?

Our marketing team stores all of its promotions and messaging in MySQL, but is starting to hit scaling limits. As a result, it is not keeping pace with their demands. We are evaluating migrating this to MongoDB as well.

What feature of the forthcoming MongoDB 3.0 release are you most looking forward to?

It has to be document level concurrency control. As our service continues to grow, we need to scale to keep pace – especially writes. This is something we believe MongoDB 3.0 with its new WiredTiger storage engine will allow us to do.

What advice would you give someone who is considering using MongoDB for their next project?

Don’t just follow the crowd. Don’t just choose the same technology you have always chosen. There is so much innovation happening today, and the databases of the last decade are not always the right choice.

Once you have a short-list of potential technologies, test them with your app, your queries and your data. It is the only way to be sure you are choosing the right technology going forward.

Ouyang, thank you for your time, and sharing your experiences with the MongoDB community.


Thinking about migrating from a relational database? Read the MongoDB white paper to get started:
Migrating from RDBMS to MongoDB

1Based on market share and transaction volume

About the Author - Mat Keep

Mat is part of the MongoDB product marketing team, responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.

<< Read About Our William Zola Award for Community Excellence


MongoDB vs SQL: Day 14 - Queries

Buzz Moschetti

Business

Welcome back to our blog series highlighting the differences between developing an app with MongoDB vs. with RDBMS/SQL. Last week, we began to cover Day 14 and added a list of startup apps organized by region. We also added data that was provided to us by an external entitlements service. This week we’re going to continue our discussion of Day 14 by diving into queries.

Before we begin, let's refresh our memories with the framework we’ve been using to stage our discussion.

  • We are using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the day counts as we go through the examples, treat them as relative progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

MongoDB vs SQL: Day 14 - Queries

At this point, we’ve covered how to address broad queries that enable us to get everything or one thing from a collection. But we know the hallmark of SQL is its rich querying capability. Lucky for us, MongoDB has rich querying capabilities as well.

The big difference between SQL and MongoDB’s query language is that the latter is not a single string “sentence.” It doesn’t require white spaces in between words, or commas, or parentheses, or quoted characters. Instead, MongoDB uses a “cascade” of operator/operand structures, typically in name:value pairs. In other words, the operand in one structure can be another operator/operand structure.

This makes things very exciting for developers because the same techniques we use in our code to manipulate rich shapes of data going into and coming out of MongoDB - whether it’s Java, Python, JavaScript, etc. - can be used to construct, manipulate, and “parse” our query expressions. In fact, no parser is necessary. It is trivially easy to walk the cascade with standard programming techniques and find fields, values, etc. Because the query expression is a structured cascade and not a single string, this also means that it is easy to incrementally add subexpressions to the query without first having to break it apart into component pieces.
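
For example (a sketch using the legacy BasicDBObject API referenced later in this post; the field names are assumptions), a query is just a nested structure that code can keep extending before it is sent to the server:

import com.mongodb.BasicDBObject;
import java.util.Date;

public class QueryCascade {
    public static void main(String[] args) {
        // Start with a simple predicate: contacts having at least one work phone.
        BasicDBObject query = new BasicDBObject("phones.type", "work");

        // Later, other code can graft on another condition -- no string parsing,
        // just another operator/operand pair in the cascade.
        query.append("hiredate", new BasicDBObject("$gte", new Date(113, 0, 1))); // Jan 1, 2013

        // The expression is an ordinary structure, so it can also be inspected.
        for (String field : query.keySet()) {
            System.out.println(field + " -> " + query.get(field));
        }
        // query would then be handed to collection.find(query)
    }
}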

MongoDB vs SQL: Query Examples

Now let’s look at some compare and contrast, side-by-side code examples for queries.

First, we see one of the popular command line interpreters for SQL which is going to fetch us some contacts and phones. This will yield a rectangle, as we saw in earlier posts.

Next, in the MongoDB command line interpreter (“CLI”), we set up the equivalent query. Notice that we can use “dotpath” syntax to address subfields within fields in the rich shape. It is also worth noting that the “equals” operator is so common that as a shortcut, MongoDB interprets name:value as “name = value”, without having to explicitly supply the $eq operator.

Third, we see the equivalent query in Java / JDBC. Note that although the “sentence” is similar, we start to bump into irritants like escaping quotes.

Lastly, we see the equivalent in MongoDB via the Java driver.

We can see that the overall semantics of queries in MongoDB and SQL are the same and follow the common pattern of query setup, issuance of query, and iteration over a cursor.

Let’s look at some more complicated queries now.

In this query, we’re looking for contacts who either have at least one work phone OR have been hired after a specific date. Again, we can see that the equivalent in MongoDB is pretty straightforward. Note the use of dollar signs in the operator names (`$or`, `$gt`) as syntactic sugar. Also note that in both examples it’s important to use an actual date in our comparison, not a string.

The equivalent query in Java / JDBC will look largely the same as before, with a few more escaped quotes.

However, in practice it isn’t as complicated as it appears -- and it actually offers more flexibility than SQL:
  1. First of all, it’s really the same two or three lines just repeated over and over again with different field:value pairs. This makes it easy to cut-and-paste these expressions as you build up your query.
  2. Second, it is simple to dynamically construct filters and queries without worrying about where we are in the predicate path. We don’t have to worry about white space, commas, or parentheses. We don’t have to worry about splicing in a `SORT` statement or dynamically adjusting the names of returned fields sandwiched between `SELECT` and `WHERE`. Parameter substitution is very straightforward and easily coded, especially when dynamic logical `AND` and `OR` statements come into the picture.

If we extrapolate beyond this small code example, it’s evident how easy it is to add more expressions into the $or statement, or to call out to another function that independently crafts a small filtering fragment that we can add to our overall query. As dynamic queries become more complex in SQL, however, the syntactic sugar that makes SQL “human readable” in the CLI begins to work against you in the programmatic construction of a query.
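
A sketch of that kind of programmatic construction, building the filter from plain HashMaps and converting to a BasicDBObject at the last moment (field names assumed from the earlier examples):

import com.mongodb.BasicDBObject;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DynamicOrQuery {
    public static void main(String[] args) {
        List<Map<String, Object>> orTerms = new ArrayList<>();

        // Clause 1: at least one work phone.
        Map<String, Object> workPhone = new HashMap<>();
        workPhone.put("phones.type", "work");
        orTerms.add(workPhone);

        // Clause 2: hired after a given date -- note we compare against a real Date, not a string.
        Map<String, Object> gt = new HashMap<>();
        gt.put("$gt", new Date(114, 0, 1));          // Jan 1, 2014
        Map<String, Object> hiredAfter = new HashMap<>();
        hiredAfter.put("hiredate", gt);
        orTerms.add(hiredAfter);

        // Other functions could append further clauses to orTerms here,
        // with no string splicing or parenthesis balancing.

        Map<String, Object> predicate = new HashMap<>();
        predicate.put("$or", orTerms);
        BasicDBObject query = new BasicDBObject(predicate); // handed to collection.find(query)
        System.out.println(query);
    }
}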

We’ve used the very basic Java query APIs to illustrate the operator/operand nature of the language. We also deliberately chose standard Java HashMap objects to further reduce coupling until the last moment - when we constructed the BasicDBObject to pass to the find() method. For greater convenience, a Builder pattern set of APIs exist as well, but in the end it is still building a cascade of operator/operand structures.

More MongoDB Query Capabilities

MongoDB offers capabilities you come to expect in a full-featured query language including:

  1. Arbitrary sorting on one or more fields, ascending or descending.
  2. Projection, i.e. retrieving only specified fields from each document in the cursor, not the entire document.
  3. Cursor `skip()` and `limit()` to easily implement pagination if desired.
  4. `explain()`, which returns a wealth of information including full details on query path analysis, document and index counts, and estimated vs. actual processing time.
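
Chained together on a cursor with the legacy Java API, those capabilities look roughly like this (a sketch; the collection and field names are assumptions):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;

public class CursorFeatures {
    static void page(DBCollection contacts, int pageNum, int pageSize) {
        BasicDBObject query = new BasicDBObject("phones.type", "work");
        BasicDBObject fields = new BasicDBObject("name", 1).append("phones", 1); // projection

        // explain() on an equivalent cursor returns the query plan, index use, and timings.
        System.out.println(contacts.find(query, fields).explain());

        DBCursor cursor = contacts.find(query, fields)
                .sort(new BasicDBObject("name", 1))   // ascending by name
                .skip(pageNum * pageSize)             // pagination
                .limit(pageSize);

        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
    }
}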

Perhaps the most important feature is the Aggregation Framework (“agg”), MongoDB’s answer to SQL’s GROUP BY clause. Exploring the power of agg is an entire blog in its own right; please see Asya Kamsky’s posts to get an idea of the agg’s power and programmability. For now, here’s an example to get you thinking about it. Suppose we want to count all the different kinds of cell phones owned by our contacts hired before June 1, 2013. Let’s make it a bit more interesting - let’s capture the names of the people who have these phones and only emit those types where more than 1 person has it. Expressed in the MongoDB CLI, we’d try this:

x2 = db.contact.aggregate([
   { $match: { "hiredate": {"$lt": new ISODate("20130601") }}},
   { $unwind: "$phones"},
   { $group: { "_id": "$phones.type",
               "n": {$sum: 1},
               "who": {$push: "$name"},
             }},
   { $match: { "n": {"$gt": 1}} },
   { $sort: { "n": -1, "_id": 1} }
   ]);
x2.forEach(function(r) { printjson(r); });

This might yield:

{ "_id" : "mobile", "n" : 3, who: [ “buzz”, “sam”, “dan” ] }
{ "_id" : "work", "n" : 2, who: [ “sam”, “kay” ] }

The important concepts to grasp with agg are:

  1. Data is “flowed” through a pipeline of operations (e.g. `$match` and `$unwind`). Output from each stage is passed to the next. In the example above, we use the `$match` operator twice: once to filter the input set, and a second time to filter the grouped results.
  2. The `$unwind` operator turns arrays of things into “virtual” documents, one for each element of the array, to simplify further processing.
  3. The `$group` operator is extremely powerful and can even create brand new fields based on numeric and string operations of other fields. In particular, as you group data and aggregate on a scalar (e.g. the count of types), you can use the `$push` operator to capture other information related to that aggregation. The output cursor contains very clear, usable rich shapes.
  4. The agg pipeline in the Java driver (in fact, most of the drivers) is simply a List of operators, very similar to what we saw earlier with `find()`. Thus, the same power and flexibility in terms of parameter substitution and subclause conditional inclusion applies to pipeline construction.
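
As a sketch of point 4, the same pipeline built with the legacy Java API is literally a List of stage documents, assembled the same way as the shell example above:

import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

public class PhoneTypeCounts {
    static void run(DBCollection contacts) {
        List<DBObject> pipeline = Arrays.<DBObject>asList(
            new BasicDBObject("$match", new BasicDBObject("hiredate",
                    new BasicDBObject("$lt", new Date(113, 5, 1)))),          // before June 1, 2013
            new BasicDBObject("$unwind", "$phones"),
            new BasicDBObject("$group", new BasicDBObject("_id", "$phones.type")
                    .append("n", new BasicDBObject("$sum", 1))
                    .append("who", new BasicDBObject("$push", "$name"))),
            new BasicDBObject("$match", new BasicDBObject("n", new BasicDBObject("$gt", 1))),
            new BasicDBObject("$sort", new BasicDBObject("n", -1).append("_id", 1))
        );

        AggregationOutput out = contacts.aggregate(pipeline);                  // runs in the engine
        for (DBObject result : out.results()) {
            System.out.println(result);
        }
    }
}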

Of course, all this functionality is performed efficiently at the engine, not in the client, so millions (or billions) of documents do not have to be dragged across the network.

Next week, we’ll dive into RAD (rapid application development) and switch over to some Python examples.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 14 (Part 1)

Day 30 (RAD) >>


About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 14

Buzz Moschetti

Business

Welcome back to our blog series highlighting the differences between developing an app with MongoDB vs. with RDBMS/SQL. Last week, we added phone numbers to our application. This week, things are going to get a little more sophisticated.

Once again, let’s go over the framework we’ve been using to stage our discussion:

  • We are using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the day counts as we go through the examples, treat them as relative progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

SQL vs MongoDB: Day 14

Two weeks in and we have 2 new things that need persistence: a list of startup apps organized by region and, more interestingly, some data vended to us by an external entitlements service. The structure might look something like this:

List<Map<String, Object>> list2 = new ArrayList<>();

Map<String, Object> n4 = new HashMap<>();
n4.put("geo", "US-EAST");
n4.put("startupApps", new String[] {"app1", "app2", "app3"});
list2.add(n4);

Map<String, Object> n5 = new HashMap<>();
n5.put("geo", "EMEA");
n5.put("startupApps", new String[] {"app6"});
n5.put("useLocalNumberFormats", false);
list2.add(n5);
m.put("preferences", list2);

List<Map<String, Object>> seclist = new ArrayList<>();
Map<String, Object> n6 = new HashMap<>();
n6.put("optOut", true);
n6.put("assertDate", someDate);                       // someDate supplied elsewhere
seclist.add(n6);
m.put("attestations", seclist);

m.put("security", mapOfDataCreatedByExternalSource);  // shape controlled by the external service

It’s still pretty easy to add this data to the structure, but there are a couple of things that should come to your attention:

  1. Notice that we’re trying to accommodate 2 different geographies. Depending on where you are, you may have different applications at startup in your solution. We therefore model this easily as a list of structures where the “geo” field can be used as a key.
  2. Sometimes you’ll be presented with data over which you have absolutely no control. You don’t necessarily want to manipulate the data; you simply want to be able to persist it in the shape it is presented. When you need to pull the data back out, you’ll want to extract the very same shape you received and pass it to other code for processing without having to worry about (or process) the contents of the shape.

Now let’s look at the persistence code we’d need to write to get all of this in.

SQL Day 14

At this point, the amount of code to handle the existing Day 1-5 work plus our new startup apps list and the new tables that are required to persist it would span a page or more. Just look back to Day 3 and triple it. So we’re not going to dwell on those issues.

But now that we are some way into the development effort, two unpleasant design concessions may have cropped up as a result of fatigue or time-to-market issues:

  1. The development team chooses not to properly model a new startup-apps table with the necessary foreign keys, key management, and additional joins. Instead, acknowledging that regions are a small set and do not change often, two new columns “APPS-US-EAST” and “APPS-EMEA” are created. Each column contains a semicolon-delimited list of app names, e.g. “app1;app2;app3”. This is the oft-repeated practice of “column overloading,” where complex structure is formatted/encoded into otherwise simple text columns to avoid the tedium of setting up more tables and potentially changing many joins. This practice subverts proper data governance and imposes additional data-decoding logic on all consumers in all applications in all languages.
  2. Presented with a map of security data of arbitrary and potentially changing shape, the development team chooses to do one of the following:
    1. Pick out a subset of known fields and set them into new columns. The fidelity and transparency gained is completely offset by losing generic persistence ability.
    2. Convert the Map to a CLOB of data in some ASCII form. This keeps the persistence generic, but the data cannot be well-queried in the database, and it forces all readers and writers in all languages to adhere to the same encoding/decoding logic.
    3. Convert the Map to a BLOB of serialized Java. This is the worst of all: the data cannot be queried, it is available ONLY to Java, and there is compile-time coupling between the serialized form in the database and the representation required upon deserialization.

MongoDB Day 14

If you’ve been following along with our blog series, you should know by now that the theme with MongoDB is that there’s no change. As we incrementally change the kinds of data we wish to persist, the database and the code around it can adjust simultaneously. We do not have to resort to extraordinary measures just to get data in and out of our database.

Let’s move onto something that I’m sure many of you are asking: What if we really do have to do a join?

So far we’ve only looked at examples where we’re reading and writing out of a single collection. In the SQL world, we did have to join contacts and phones but as a result of the information architecture in MongoDB, we were able to model everything we needed within a single document inside of our collection.

Let’s assume now that we want to add phone call transactions to our database. It’s clear that transactions are a zero-or-more relationship where “more” could mean a lot more. And it tends to keep growing over time without bound. In this data design use case, it is appropriate in MongoDB -- just like in RDBMS -- to model transactions in a new, separate collection.

We’ll use the calling number as the non-unique key in the transactions collection, and each document will contain the target number (the number being called) and duration of call in seconds:
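
As a hedged sketch (field names are assumptions consistent with the description above), one such transaction document inserted from Java might look like:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class CallTransactions {
    static void record(DBCollection transactions) {
        // Non-unique calling number, the number called, and the call duration in seconds.
        transactions.insert(new BasicDBObject("number", "1-800-555-1212")
                .append("target", "1-866-444-3131")
                .append("duration", 214));

        // An index on the calling number keeps lookups by a contact's phone fast.
        transactions.createIndex(new BasicDBObject("number", 1));
    }
}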

We have previously explored and detailed some of the pain produced by zombies and outer joins and result set unwinding in the RDBMS space (Day 5), so for the purposes of making join logic clearer in MongoDB, let’s deliberately simplify the SQL:

In addition to linking together what is very clearly a 1 to N relationship between contacts and phones, we are now also going to be looking at 1 to N to M relationships that include the call targets. The challenge then becomes, “How do we turn the rectangle that’s coming back in my result into a usable list?”

And the answer is that this SQL will only work if we either ask for a single ID or all of the IDs:

  • If we ask for a single ID, for example by appending “and A.id = ‘G9’” to the SQL, then we can amass the calling number and target numbers in the data access layer and sort/assign them.
  • If we ask for all of the IDs, we can traverse the entire result set and build a map based on ID and employ the same logic in the bullet above.

This is a way that we might unwind that result set when we want to get all the IDs and their phone transactions:

We end up with a map of data to pass back into our data access layer that’s keyed first by the ID, second by the number, and finally by the list of call targets.

The problem, of course, is that very often the design calls for either the ability to return a partial result to the data access layer or the developer doesn’t want to deal with the complexity of the unwind code. The traditional solution? The tried-and-true ORDER BY:

The use of ORDER BY here isn’t primarily to drive final ordering of the data for presentation; it’s to set up the result set for easier unwinding. As we iterate the result set, when we detect that the ID changes from G10 to G9, we know that we’re done with all the G10 items and we can yield control back to our caller. The logic that we’re doing to build a map of lists, however, is largely the same as what we saw in the previous example.

And unless indexes are properly set up and/or joined sets are relatively small, liberal use of ORDER BY can have a major performance impact on a system.

When you look at the complete fetch logic including query and post-query logic, SQL is about disassembling things. We start with big queries. We tie together tables, business logic, all of our information and material and load it all into a big string at the top.

Then we throw it at the database engine, we cross our fingers, and later our result set comes back and then we have to disassemble it. The more joins we have in our query, the more disassembly is required. In the real world, we’re typically talking about three, sometimes four or more N-way joins in order to bring all of the information we want together. And for every additional table that you’re joining, you’re incurring more disassembly logic, and possibly an additional performance impact if ORDER BY is used to aid the process.

In MongoDB, the philosophy is different. MongoDB is about assembling things. In this first example, we will use an “N+1 select” approach just to keep things simple for the moment:

First of all, there is no big SQL statement up top, and we also don’t have the problem of splitting the logic between the SQL statement itself and the unwind code. Keep in mind that as that SQL statement gets larger, more and more logic goes into the operation we are trying to drive into the database engine; we also have to manage more and more logic in a separate place for disassembling our result set.

This is even further complicated if we start delving into prepared statements, and other types of dynamic structures where that logic is separate from our actual select statement that we’re building, which is separate from our result.

With MongoDB, it’s simple. We just find what we want to find, we iterate through and when we need to go deeper, we simply ask for that information to get back rich shapes that we can populate as we go along.

This is what a join in MongoDB would look like:
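
A sketch of the same assembly logic, consistent with the description that follows (collection and field names assumed), might be:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContactCallFetch {
    // Build: contact id -> (phone number -> list of call-transaction documents)
    static Map<String, Map<String, List<DBObject>>> fetch(DBCollection contacts, DBCollection transactions) {
        Map<String, Map<String, List<DBObject>>> result = new HashMap<>();

        DBCursor contactCursor = contacts.find();
        while (contactCursor.hasNext()) {
            DBObject contact = contactCursor.next();
            Map<String, List<DBObject>> byNumber = new HashMap<>();

            List<DBObject> phones = (List<DBObject>) contact.get("phones");
            if (phones != null) {
                for (DBObject phone : phones) {
                    String number = (String) phone.get("number");
                    List<DBObject> calls = new ArrayList<>();
                    // N+1 lookup: fetch this number's call transactions as whole substructures.
                    DBCursor callCursor = transactions.find(new BasicDBObject("number", number));
                    while (callCursor.hasNext()) {
                        calls.add(callCursor.next());
                    }
                    byNumber.put(number, calls);
                }
            }
            result.put((String) contact.get("id"), byNumber);
        }
        return result;
    }
}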

There are a few things to note here. First off, this is not much more code than what we saw in our SQL example. It of course benefits from the fact that there is no SQL statement at all. What we see is the only piece of logic that’s necessary.

Furthermore, for anyone who comes in to debug or modify this later on, the way the code is constructed is now very logical and sequential -- and Java friendly. It makes sense how we can iterate over a list of maps and extract the phone number from them. Especially convenient and flexible is that we don’t have to pull out target and duration explicitly; instead, we can simply take the entire substructure, stick it into a list, and return it back to the parent map. In the end, we still end up with the same map of IDs at the bottom, but what we gain is a lot more flexibility and clarity.

We mentioned above that we used an “N+1 select” approach for simplicity. In many circumstances, the performance may be perfectly acceptable, especially if the relative sizes of each tier in the lookup cascade are 1 to many (like 10 or more), not 1:1. But as we learned on Day 1, MongoDB is about choice. With only a little more work, we can change the logic to fetch an entire tier’s worth of data, extracting the entire set of keys to pass to the lookup in the next tier. It is easy to programmatically construct and pass to MongoDB thousands of items in a so-called “in-list” for lookup. We can even mix and match these two approaches as our needs dictate. In the end, the complete fetch logic remains clear, flexible, and scalable as the complexity of the query expands over time.

Next week, we’ll cover rich querying capabilities in MongoDB and how they measure up vs SQL.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 3-5

Day 14 - Queries >>


About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 3-5

Buzz Moschetti

Business

When we last left off in our MongoDB vs SQL blog series, we covered Day 1 and Day 2 of building the same application using MongoDB vs using SQL with code comparisons. Before we jump into the next couple of days, let’s go over the ground rules again:

  • We’ll be using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the day counts as we go through the examples, just treat them as progress indicators and not the actual time needed to complete the specified task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

Now let’s jump into the differences between SQL and MongoDB for Day 3 through Day 5.

SQL vs MongoDB: Day 3

We have already covered saving and fetching data using a Java Map as the data carrier in the Data Access Layer, and adding a few simple fields. For day 3, we’re going to add some phone numbers to the structure.

The Task: Add A List of Phone Numbers

This is where we were:
m.put("name", "buzz");
m.put("id", "K1");
m.put("title", "Mr.");
m.put("hireDate", new Date(2011, 11, 1));

Each phone number has associated with it a type, “home” or “work.” I also know that I may want to associate other data with the phone number in the near future like a “do not call” flag. A list of substructures is a great way to organize this data and gives me plenty of room to grow. It is very easy to add this to my map:

List<Map<String, Object>> list = new ArrayList<>();

Map<String, Object> n1 = new HashMap<>();
n1.put("type", "work");
n1.put("number", "1-800-555-1212");
n1.put("doNotCall", false);  // throw one in now just to test...
list.add(n1);

Map<String, Object> n2 = new HashMap<>();
n2.put("type", "home");
n2.put("number", "1-866-444-3131");
list.add(n2);

m.put("phones", list);

The persistence code, however, is a different story.

SQL Day 3 - Option 1: Assume Only One Work and One Home Phone Number

With SQL:

This is just plain bad, but it’s worth noting here because we’ve seen this so many times, often far later than day 3 when there’s strong motivation to avoid creating a new table. With this code, we’re assuming that people only have one home and one work phone number. Let’s take the high road on day 3 and model this properly in relational form.

SQL Day 3 - Option 2: Proper Approach with Multiple Phone Numbers

Here we’re doing it the right way. We’ve created a phones table and we’ve updated the way we interact with it using joins.

You can see that the incremental addition of a simple list of data is by no means trivial. We once again encounter the “alter table” problem because the SQL will fail unless it points at a database that has been converted to the new schema. The coding techniques used to save and fetch a contact are starting to diverge; the save side doesn’t “look” like the fetch side. And in particular, you’ll notice that fetching data is no longer as simple as building it into the map and passing it back. With joins, one or more (typically many more) of the columns are repeated over and over. Clearly, we don’t want to return such a redundant rectangular structure in the Data Access Layer and burden the application. We must “unwind” the SQL result set and carefully reconstruct the desired output, which is one name associated with a list of phone numbers and types.

This sort of unwinding work takes time and money. Many rely on ORMs like Hibernate to take care of this, but sooner rather than later, the ORM logic required to unwind a complex SQL query leads to unacceptable performance and/or resource issues -- and you end up having to code a solution like what’s shown above anyway.

SQL Day 5: Zombies

With SQL, you’ll have to deal with zombies: (z)ero (o)r (m)ore (b)etween (e)ntities. We can’t forget that some people in our contact list do not have phones. Our earlier query, which is a simple join, produces a Cartesian product and will not return individuals without at least one phone.

To address this, we have to go back and change the query to do an outer join. But much more importantly, it also means changing the unwind logic because we don’t want to add blank phone numbers in our list. This takes even more time and money.

As an aside, even though the SQL based logic is burdening us, at least we’ve confined the impact to just the Data Access Layer. Imagine the impact if we had no Data Access Layer and applications were themselves constructing SQL and unwind logic. Just adding a list of phone numbers would have been a major undertaking.

MongoDB Day 3

Now let’s take a look at doing what we just went over, this time with MongoDB:
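
As a minimal sketch (collection name assumed), the save call that the data access layer wraps is essentially one line:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import java.util.Map;

public class ContactPersistor {
    // The same map "m" built above -- phones list and all -- goes in unchanged.
    static void saveContact(DBCollection contacts, Map<String, Object> m) {
        contacts.insert(new BasicDBObject(m));
    }
}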

With MongoDB, there is no change. The list of phone numbers, which is actually a list of structures with numbers and types, flows into MongoDB and is natively stored as a list of structures. Just like on day 2, it is our choice to go back and backfill phone information for those entries already in the database. Gone are the burdens of having to set up another table, another set of foreign keys, managing those keys, and adding yet another join into what will ultimately become a very complex SQL expression. We also don’t have to immediately commit to a one-or-more vs. zero-or-more design. The time and effort saved with richly shaped MongoDB documents is significant.

Next week, we’ll dive even deeper as we add externally sourced material to our contact structure and expose the compromises development teams make in SQL / RDBMS in later-stage development.


For more information on migration, read our migration best practices white paper.
Read the Migration Guide

<< Day 1-2

Day 14 >>

 

About the Author - Buzz Moschetti

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

MongoDB vs SQL: Day 1-2

Buzz Moschetti

Business

This will be the first post in an ongoing series based on our popular webinar about the differences in building an application using SQL versus building the same application using MongoDB.

First off - before we get into anything - what is it that we’re all trying to achieve with our data? Sure, there are necessary evils such as reading and writing data from databases, putting data on message buses, and working through open source integrations, but at the end of the day what we really want to do is take information, compute it, and get it in front of the right people to help them make better decisions.

Business demands often seem pretty simple - “I just want to save my trades” or “Can we make a product catalog that handles ¥, £, and $ seamlessly?” - but the way that the data is expressed in code or in the application is different from the way it’s framed in a business use case. When you get to how the use case is implemented in the database, the differences are even more pronounced.

And why is that? One reason is that innovation in business and in the code/application layer has far outpaced innovation in database technologies. RDBMS came onto the scene in 1974. Since then, business goals have changed, the pace of business has increased (time to market actually matters a lot now), and we’re using technologies we could not possibly have imagined would exist 40 years ago. Back then we had fairly simple languages that were well-mated to the ‘rectangular’ and ‘flat’ RDBMS world. Today, we have extremely powerful languages of all types with never-ending streams of updates coming from the open source ecosystems we’ve built. The only thing that’s constant is change.

In 1974... vs. in 2014...

  • Business Data Goals -- 1974: Capture my company's transactions daily at 5:30PM EST, add them up on a nightly basis, and print a big stack of paper. 2014: Capture my company's global transactions in real-time, plus everything that is happening in the world (customers, competitors, business/regulatory/weather); produce any number of computed results; and pass this all in real-time to predictive analytics with model feedback. Then deliver results to tens of thousands of mobile devices, multiple GUIs, and b2b/b2c channels.
  • Release Schedule -- 1974: Semi-annually. 2014: Yesterday.
  • Application/Code -- 1974: COBOL, Fortran, Algol, PL/1, assembler, proprietary tools. 2014: C, C++, VB, C#, Java, Javascript, Groovy, Ruby, Perl, Python, Obj-C, SmallTalk, Clojure, ActionScript, Flex, DSLs, Spring, AOP, CORBA, ORM, the third-party software ecosystem, the entire open source movement ... and COBOL and Fortran.
  • Database -- 1974: I/VSAM, early RDBMS. 2014: Mature RDBMS, legacy I/VSAM, column & key/value stores, and ... MongoDB.

That’s where NoSQL comes in, in particular MongoDB. What makes MongoDB special is that it stores data in rich structures (maps of maps of lists that eventually drill down to integers, floating-point numbers, dates, and strings). MongoDB was designed to not only fluidly store these objects, but also to present them in APIs and with a query language that knows how to understand all the types of data you’re storing. This is in stark contrast to the legacy technologies designed and built in the programming environments of the 1970s.

In MongoDB, your data is the schema and there is symmetry between the way data goes into the database and the way it comes out. With traditional technologies, the differences between what it means to put data in and take data out increase as applications get more complex. The examples we cover in this series will demonstrate these concepts.

And finally, no MongoDB primer would be complete without the following diagram:

The image on the left illustrates storing customer data in ‘rectangular’ tables and the schema diagram needed to make sense of it all. The image on the right illustrates how data is stored in MongoDB. Suffice it to say that the diagram on the left is more complicated than the one on the right. Now, to be fair, the diagram on the left does contain more entities than just a customer and his phones, and, in all likelihood, it didn’t start out looking like that. It was probably relatively simple in the beginning. Then someone needed something that wasn’t a scalar, and someone else needed a few other things, and before they knew what happened, what was once manageable exploded into what you currently see.

Now let’s get into the actual differences between SQL and MongoDB and how we transition our thinking using code.

Some ground rules:

  • We’ll be using Java
  • Assume we have a data access layer in between our application and MongoDB
  • In terms of the day counts as we go through the examples, just treat them as progress indicators and not the actual time needed to complete the task.
  • We won’t get into exception or error-handling. We won’t muddy the code with boilerplate or persistor logic that does not change from day to day. We won’t get into the database connection or other setup resources. The primary focus will be the core data-handling code.

SQL vs MongoDB: Day 1

We’re going to start with a map, which will let us move data in and out of our data access layer. Maps are rich shapes so we can stick a lot of things into them, and perhaps more importantly, there is no compile-time dependency.

The Task: Saving and Fetching Contact Data

We’ll start with this simple, flat shape in the data access layer:

Map m = new HashMap();
m.put("name", "buzz");
m.put("id", "K1");

We’ll save it in this way:

save(Map m)

And assume we can fetch it by primary key in this way:

Map m = fetch(String id)

In our initial case, we just have two very simple things, “save” and “fetch”. Rich queries will come later on. This is what this might look like in the code for both SQL and MongoDB:
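As a sketch, assume a JDBC Connection named conn on the SQL side, a DBCollection named coll (from the Java driver of that era) on the MongoDB side, and a contact table that has already been created for SQL; all names are illustrative:

// SQL: the table must exist before this code can run.
void saveSQL(Map m) throws SQLException {
    PreparedStatement ps =
        conn.prepareStatement("INSERT INTO contact (id, name) VALUES (?, ?)");
    ps.setString(1, (String) m.get("id"));
    ps.setString(2, (String) m.get("name"));
    ps.executeUpdate();
}

Map fetchSQL(String id) throws SQLException {
    PreparedStatement ps =
        conn.prepareStatement("SELECT id, name FROM contact WHERE id = ?");
    ps.setString(1, id);
    ResultSet rs = ps.executeQuery();
    Map m = null;
    if (rs.next()) {
        m = new HashMap();
        m.put("id", rs.getString("id"));
        m.put("name", rs.getString("name"));
    }
    return m;
}

// MongoDB: no table to create; the map goes in as-is and comes out as a map.
void saveMongo(Map m) {
    coll.insert(new BasicDBObject(m));
}

Map fetchMongo(String id) {
    DBCursor c = coll.find(new BasicDBObject("id", id));
    return c.hasNext() ? c.next().toMap() : null;
}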

In MongoDB, “data is the schema” and we are not required to create a table. If you look closer at the fetch functions, you’ll notice that they are largely the same. An important takeaway from this is that in MongoDB, your basic way of addressing the database is actually very similar to how you would do it in an RDBMS. You construct the query, you pass it in, you get back the cursor, and you iterate over the cursor. The fidelity of the data moving out will change as we progress through these examples, but for now let’s assume that we have parity.

SQL vs MongoDB: Day 2

Let’s add two fields: a title and a date.

The Task: Adding simple fields

m.put("name", "buzz");
m.put("id", "K1");
m.put("title", "Mr.");
m.put("hireDate", new Date(2011, 11, 1));

Notice that we’re putting an actual Date object into the “hireDate” slot, not a string representation like “2011-11-01”. In a well-designed data access layer, we always want to use the highest fidelity objects we can. Dates in particular need to be treated this way to avoid YYYYMMDD vs. YYYYDDMM confusion.

This is what the access layer implementation might look like to use SQL:
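A sketch of what has to change on the SQL side, with illustrative DDL and column names (a Statement named stmt and the same Connection conn are assumed):

// Step 0: the schema itself must change before any of the new code can run.
stmt.executeUpdate("ALTER TABLE contact ADD COLUMN title VARCHAR(8)");
stmt.executeUpdate("ALTER TABLE contact ADD COLUMN hire_date DATE");

// Then both the insert and the select have to be touched.
PreparedStatement ps = conn.prepareStatement(
    "INSERT INTO contact (id, name, title, hire_date) VALUES (?, ?, ?, ?)");
ps.setString(1, (String) m.get("id"));
ps.setString(2, (String) m.get("name"));
ps.setString(3, (String) m.get("title"));
ps.setDate(4, new java.sql.Date(((java.util.Date) m.get("hireDate")).getTime()));
ps.executeUpdate();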

The first thing to notice is the alter table problem. Before we can touch any code that’s going to interact with the new database, we have to first change the table definition; otherwise, the select statement that’s going after the new fields is simply not going to work. This is nothing new. It’s something that everyone has become numb to over the past 40 years of using RDBMS. There are a few other things that developers might also need to consider, like case sensitivity.

Now let’s look at what this looks like in MongoDB:
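And a sketch of the MongoDB side, which is the same call we made on day 1:

// Same data access code as day 1 -- the two new fields simply ride along in the map.
void saveMongo(Map m) {
    coll.insert(new BasicDBObject(m));   // no ALTER TABLE, no migration step
}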

What did we have to change in MongoDB? Nothing.

We put title and hire date into the map, and we simply inserted the whole map into MongoDB. The previously inserted items remain unchanged, without title and hire date fields. If backfilling older items that do not have those fields is important, we can easily write a four-line JavaScript program that iterates over the collection and sets default values for title and hire date.

Next week, we’ll cover adding lists - where the real fun begins.


For more information on migration, explore these resources:
Migration Guide

MongoDB vs SQL: Day 3-5 >>

*About the Author - Buzz Moschetti*

Buzz is a solutions architect at MongoDB. He was formerly the Chief Architecture Officer of Bear Stearns before joining the Investment Bank division of JPMorganChase as Global Head of Architecture. His areas of expertise include enterprise data design, systems integration, and multi-language tiered software leverage with C/C++, Java, Perl, Python, and Ruby. He holds a bachelor of science degree from the Massachusetts Institute of Technology.

Announcing MongoDB 2.8.0-rc0 Release Candidate and Bug Hunt

MongoDB

Releases

Edit: 2.8 is Now 3.0

We’re renaming our upcoming MongoDB release to 3.0. For more information, please see the blog post from MongoDB CTO and Co-Founder Eliot Horowitz.

Bug Hunt Extended!

There’s still time to submit your bugs! Along with the recent announcement about MongoDB’s acquisition of WiredTiger, we’ve extended the Bug Hunt. You can file issues until 2.8 is released!

Announcing MongoDB 2.8.0-rc0 Release Candidate

We’re truly excited to announce the availability of the first MongoDB 2.8 release candidate (rc0), headlined by improved concurrency (including document-level locking), compression, and pluggable storage engines.

We’ve put the release through extensive testing, and will be hard at work in the coming weeks optimizing and tuning some of the new features. Now it’s your turn to help ensure the quality of this important release. Over the next three weeks, we challenge you to test and uncover any lingering issues by participating in our MongoDB 2.8 Bug Hunt. Winners are entitled to some great prizes (details below).

MongoDB 2.8 RC0

In future posts we’ll share more information about all the features that make up the 2.8 release. We will begin today with our three headliners:

Pluggable Storage Engines

The new pluggable storage API allows external parties to build custom storage engines that seamlessly integrate with MongoDB. This opens the door for the MongoDB Community to develop a wide array of storage engines designed for specific workloads, hardware optimizations, or deployment architectures.

Pluggable storage engines are first-class players in the MongoDB ecosystem. MongoDB 2.8 ships with two storage engines, both of which use the pluggable storage API. Our original storage engine, now named “MMAPv1”, remains as the default. We are also introducing a new storage engine, WiredTiger, that fulfills our desire to make MongoDB burn through write-heavy workloads and be more resource efficient.

WiredTiger was created by the lead engineers of Berkeley DB and achieves high concurrency and low latency by taking full advantage of modern, multi-core servers with access to large amounts of RAM. To minimize on-disk overhead and I/O, WiredTiger uses compact file formats, and optionally, compression. WiredTiger is key to delivering the other two features we’re highlighting today.

Improved Concurrency

MongoDB 2.8 includes significant improvements to concurrency, resulting in greater utilization of available hardware resources, and vastly better throughput for write-heavy workloads, including those that mix reading and writing.

Prior to 2.8, MongoDB’s concurrency model supported database level locking. MongoDB 2.8 introduces document-level locking with the new WiredTiger storage engine, and brings collection-level locking to MMAPv1. As a result, concurrency will improve for all workloads with a simple version upgrade. For highly concurrent use cases, where writing makes up a significant portion of operations, migrating to the WiredTiger storage engine will dramatically improve throughput and performance.

The improved concurrency also means that MongoDB will more fully utilize available hardware resources. So whereas CPU usage in MongoDB has traditionally been fairly low, it will now correspond more directly to system throughput.

Compression

The WiredTiger storage engine in MongoDB 2.8 provides on-disk compression, reducing disk I/O and storage footprint by 30-80%. Compression is configured individually for each collection and index, so users can choose the compression algorithm most appropriate for their data. In 2.8, WiredTiger compression defaults to Snappy compression, which provides a good compromise between speed and compression rates. For greater compression, at the cost of additional CPU utilization, you can switch to zlib compression.
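As an illustration of that per-collection choice, here is a sketch using the legacy Java driver's createCollection with WiredTiger's documented storageEngine options, assuming a DB handle named db; the collection name is illustrative, and the 2.8 Release Notes are the authority on the exact option names in this release:

// Create a collection whose blocks are compressed with zlib rather than the
// default Snappy, trading extra CPU for a smaller on-disk footprint.
DBObject wt = new BasicDBObject("configString", "block_compressor=zlib");
DBObject options = new BasicDBObject("storageEngine", new BasicDBObject("wiredTiger", wt));
db.createCollection("archive", options);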

For more information, including how to seamlessly upgrade to the WiredTiger storage engine, please see the 2.8 Release Notes.

The Bug Hunt

The Bug Hunt rewards community members who contribute to improving MongoDB releases through testing. We’ve put the release through rigorous correctness, performance and usability testing. Now it’s your turn to test MongoDB against your development environment. We challenge you to test and uncover any remaining issues in MongoDB 2.8.0-rc0. Bug reports will be judged on three criteria: user impact, severity and prevalence.

All issues submitted against 2.8.0-rc0 will be candidates for the Bug Hunt. Winners will be announced on the MongoDB blog and user forum by December 9. There will be one first place winner, one second place winner and at least two honorable mentions. Awards are described below.

During the first Bug Hunt, for MongoDB 2.6, the community’s efforts were instrumental in improving the release and we’re hoping to get even more people involved in the Bug Hunt this time!

Bug Hunt Rewards

First Prize:

  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $1000 Amazon Gift Card
  • MongoDB Contributor T-shirt

Second Prize:

  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $500 Amazon Gift Card
  • MongoDB Contributor T-shirt

Honorable Mentions:

  • 1 ticket to MongoDB World — with a reserved front-row seat for keynote sessions
  • $250 Amazon Gift Card
  • MongoDB Contributor T-shirt

How to get started

  • Download MongoDB 2.8 RC0: You can download this release at MongoDB.org/downloads.
  • Deploy in your test environment: It is best to test software in a real environment with realistic data volumes and load. Help us see how 2.8 works with your code and data so that others can build and run applications on MongoDB 2.8 successfully.
  • Test new features and improvements: There is a lot of new functionality in MongoDB 2.8. See the 2.8 Release Notes for a full list.
  • Log a ticket: If you find an issue, create a report in Jira (Core Server project). See the documentation for a guide to submitting well-written bug reports, or discuss on the MongoDB Developers mailing list.

Don’t Hunt Alone

If you’re new to database testing, you don’t have to do it alone your first time. Join one of our MongoDB User Groups this month to try hacking on the release candidate with MongoDB Performance and QA engineers.

Want to run a Bug Hunt at your local user group or provide a space for the community to hunt? Get in touch with the MongoDB Community team to get started.

If you are interested in doing this work full time, consider applying to join our engineering teams in New York City, Palo Alto and Austin, Texas.

Happy hunting!

Eliot, Dan, Alvin and the MongoDB Team