GIANT Stories at MongoDB

Improving Driver Documentation: The MongoDB Meta Driver

MongoDB

Releases

This is a guest post, written by Mario Alvarez, a MongoDB intern for Summer 2013

This summer, I worked on developing the Meta Driver project, an effort to re-work the drivers documentation, creating an implementable specification of driver functionality that could be applied to different drivers in different languages.

The Problem

Part of MongoDB’s appeal to developers is its ease of use. One important way in which MongoDB provides a good user experience for developers is its ecosystem of drivers - bindings for interacting with the database in a variety of languages, with one for each (major) language. Whereas many other databases force developers to construct queries and database commands using a specific query language, MongoDB’s drivers allow for smooth, idiomatic interaction with the database. Drivers must balance conflicting goals: on the one hand, they should allow developers to write code naturally in the language of their choice; on the other, drivers should strive to provide a relatively consistent experience across languages, to minimize the difficulty of switching between MongoDB-oriented development in different languages.

Because of the language-dependence of providing a natural developer experience, as well as the drivers’ varying authorship (many originated as, or still are, community-supported projects) MongoDB’s drivers embody a wide variety of design decisions regarding these and other tradeoffs. This, coupled with the lack of a fully clear specification of driver functionality, makes it difficult to create or maintain drivers - in cases where the current spec falls short, it is hard to know which other driver to look to as an example.

The Meta Driver is part of a solution to this issue. It creates an implementable specification of driver functionality, as well as creating a reference driver implementation, following the principles of Behavior-Driven Development, to demonstrate a driver meeting the specification. While this project alone is not a complete answer to the problem of inconsistencies within the driver ecosystem, it provides a useful framework around which to begin the process of standardization, as well as the basis of a single, unified driver documentation. In order to achieve these goals, an emphasis on improving the quality and consistency of driver documentation and on more hands-on management of drivers by 10gen itself will be necessary, among other things.

Behavior-Driven Development (BDD)

Behavior-Driven Development is a development methodology that emphasizes the creation and maintenance of useful documentation. Its principles and techniques help make the Meta Driver possible.

BDD has its roots in the Agile community; particularly, in Test-Driven Development (TDD). Under TDD, a developer first writes tests describing the functionality she wants her code to implement, then writes the minimum amount of useful code that will make the tests pass. As a discipline, TDD can be a useful defense against over-architecting code at the beginning, and also leads to the creation of a comprehensive testing framework that (at least ideally) completely describes the code’s functionality, making the code much easier to maintain.

BDD builds on this methodology, taking it a step further. Rather than beginning with writing tests, the Behavior-Driven developer begins by writing a human-readable specification of the behaviors her code should implement. These specifications do not have arbitrary structure; they are written in a format (generally, a simple specification language called Gherkin) that computers can parse as well. Thus, the specifications form a common ground between human and machine understanding of the specified software. In fact, Gherkin’s syntax is so natural that it can be read by non-technical stakeholders in a project, enabling the creation of a common language shared by everyone involved, minimizing the possibility for misunderstanding or ambiguity. The specification is oriented around user-stories, describing various scenarios of interaction with the software, and the behavior the software should exhibit in each scenario.

Next, BDD translates the written specifications into tests run against the code. This is done by writing a series of step definitions that map each step of each scenario to code implementing the step. Once these are written, the developer proceeds just as with TDD, hacking until the tests pass. The difference is that, rather than just a series of tests, the developer has a human-readable specification for her code, and a guarantee that the specification is met. This helps to solve a difficult problem in software engineering: keeping documentation current as the documented codebase changes. If documentation has behavior specifications at its core, it will always be current. BDD serves to keep code and documentation “honest”: if the two do not match, the tests will not pass, and the developer will know where the problem is.

BDD in action: an example

The specifications for a project in BDD are contained in .feature files, which describe scenarios, each with a series of steps, as described above. Here is a simple example.

Feature: Addition
    As someone who works with numbers
    I want to be able to learn the total of my numbers 
    So I add my numbers together to get a result

    Scenario: Adding two numbers
        Given the number 5
        And the other number 42
        When I add the two numbers
        Then the result should be 47

These files are placed in the features directory (default location) at the root of the project being specified. The step definitions are placed in features/step_definitions. These are written in the language of the code the specifications are being run against (in this case, .rb files). Below are the step definitions implementing the steps that make up the feature excerpt shown above.

Given /^the number (-?\d+)$/ do |addend|
    @addend1 = addend.to_i
end

Given /^the other number (-?\d+)$/ do |addend|
    @addend2 = addend.to_i
end

When /^I add the two numbers$/ do
    @sum = @addend1 + @addend2
end

Then /^the result should be (-?\d+)$/ do |sum|
    @sum.should == sum.to_i
end
    

Cucumber matches the provided regular expressions against lines in the .feature file, using matching groups (surrounded by parentheses) to extract arguments that are then passed in to the provided code blocks. In this way, steps can be made flexible, able to accept a variety of different arguments and thus able to model a wider range of use-cases.

To run these steps, one must first install Cucumber (this can be done via RubyGems: gem install cucumber). The cucumber executable can take a variety of options; cucumber --help will give a fairly detailed description. Without any options, cucumber looks in the default features directory and attempts to run all the tests described by the features and their step definitions. For each step, Cucumber outputs whether the test succeeded or failed; and, if the step failed, Cucumber prints information about the specific assertion failure that caused the step to fail. For this example, the output should look like this:

Feature: Addition
    As someone who works with numbers
    I want to be able to learn the total of my numbers
    So I add my numbers together to get a result

    Scenario: Adding two numbers   # features/adding.feature:6
        Given the number 5           # features/step_definitions/step_definitions.rb:1
        And the other number 42      # features/step_definitions/step_definitions.rb:5
        When I add the two numbers   # features/step_definitions/step_definitions.rb:9
        Then the result should be 47 # features/step_definitions/step_definitions.rb:13

    1 scenario (1 passed)
    4 steps (4 passed)
    0m0.003s

BDD and Drivers: The Meta Driver

BDD can play a particularly useful role in helping to define and standardize driver behavior across languages. Since the Gherkin specifications of driver functionality are language-independent, they can serve as a common reference for all drivers, and be directly implemented as tests on each driver. In this way, drivers can be held to a common set of functionality.

Meta Driver Challenges

In practice, this approach can lead to some complications. The specifications must be written with an eye toward being independent of the special quirks and needs of each language. Otherwise, overly complex step definitions would be required in order to “hack around” the shortcomings of the specifications; this would obscure the direct relationship between the feature files and the tests to which feature scenarios correspond, limiting the specification’s usefulness. As the specifications are linked to an ever-widening array of drivers and languages, they will likely have to be revised to take into account any new language-dependencies we discover in the process. Slight differences between the specifications used in different languages may need to be introduced, if languages have differences that cannot be reconciled at the feature-spec level. We hope to keep these differences to a minimum, if they are needed at all; so far, they have not been.

Current Work

So far, I have specified a new BSON implementation for Ruby (the code for which - but not the Cucumber spec - can be found here). Building on this implementation, I have implemented and specified the MongoDB wire protocol, allowing the reference driver to communicate with a real MongoDB server. Finally, I implemented the beginnings of a CRUD API, enabling a user to issue insert, update, delete, and find commands to the database.

I have also begun work on writing step definitions in Python to attach the specification to the PyMongo driver, as a first test case for using the specification across programming-language lines, and on a pre-existing, production driver. So far I have implemented steps on PyMongo for BSON and the wire protocol. Adjusting for differences in the driver APIs and the tools available for Ruby and Python caused some minor hitches, but as we’d hoped, only small changes were needed in order to make the specifications work with PyMongo, and those changes can be back-ported to the Ruby driver, in order to keep the two consistent.

Driver Specification

The full specification can be found on Github. Here is an example of some features and step implementations for BSON, in the same vein as the above example. This excerpt contains specifications and step implementations for deserializing objects containing individual BSON values. Here is part of a feature:

@bson
Feature: Deserialize Elements
    As a user of MongoDB
    In order to retrieve data from the database
    The driver needs to deserialize BSON elements

    Scenario Outline: Deserialize singleton BSON objects
        Given an IO stream containing <hex_bytes>
        When I deserialize the stream
        Then the result should be the <type> value <value>

        Examples:
        | hex_bytes                                | type      | value                    |
        | 10000000016b0026e4839ecd2a094000         | double    | 3.1459                   |
        | 11000000026b0005000000746573740000       | string    | test                     |
        | 14000000076b0050d3409d82cb8a4fc700000100 | object_id | 50d3409d82cb8a4fc7000001 |

Here are the corresponding step definitions:

Given /^an IO stream containing ([0-9a-fA-F]+)$/ do |hex_bytes|
    @io = StringIO.new([hex_bytes].pack('H*'))
end

When /^I deserialize the stream$/ do
    @document = Hash.from_bson(@io)
end

Then /^the result should be the (\S+) value (\S+)$/ do |type, value|
    # Compare string representations so the same step covers doubles, strings, and ObjectIds
    @document['k'].to_s.should eq(value)
end

This example tests a few simple cases of the general problem of BSON deserialization: deserializing single objects, wrapped in a BSON document with one field (the key for the field is ‘k’). This excerpt is a good example of how feature files can be constructed, beginning with simple cases of functionality and building up toward larger, more advanced ones (later in the same file, deserialization of larger, more complex, heterogeneous BSON objects is tested). For more examples, here is the feature file the above code is taken from; and here are the step definitions.

Using the Specification with Production Drivers

The Meta Driver specification is ultimately designed to be integrated with a wide array of production drivers, in order to ensure that all conform to a common set of behaviors. In order to integrate the specifications with a new driver, step definitions in that driver’s language must be written that link the specification to calls against the driver code. While this is not an especially difficult task, it is nontrivial (differences between languages and between API designs for different drivers mean that the same steps must sometimes be implemented differently for different drivers). Ideally, the only difference between languages and drivers should be the step definitions; the feature files shared between them should be identical. In practice, some small differences may need to be tolerated, but they should be kept to an absolute minimum, since such differences reduce the usefulness of the Meta Driver by undermining its status as a single universal specification.

Binding to PyMongo

Though the vast majority of work in “porting” the Meta Driver specification to other, production drivers has yet to be done, I have begun work on binding the specification to PyMongo, the production MongoDB Python driver. I chose Python because it is largely similar to Ruby in terms of the paradigms it supports, but is different enough to create challenges and expose whatever subtle dependencies on Ruby or the Ruby implementation might exist in the Meta Driver features. Binding the specification to the production Ruby driver would be less likely to reveal such problems (though support for the production Ruby driver is an eventual goal).

So far, using the Behave implementation of Gherkin in Python, I have been able to get the BSON and wire-protocol features to run successfully on PyMongo, with only minimal changes to the specification (mostly, omitting some scenarios that cover functionality present in the reference driver’s API but not supported by PyMongo; some of these can be worked around, and most of the ones that cannot are not very important).

There are frameworks for many other programming languages available. The Cucumber wiki has a partial listing, although there are many others, including (for some languages) multiple implementations, with distinct strengths and weaknesses. For example, I chose Behave over Lettuce and Freshen (two other Python Gherkin implementations) because it was better maintained and appeared to have a cleaner API for defining steps. Choosing between different Gherkin/Cucumber implementations in the target language is another important part of the process of making that language’s driver work with the Meta Driver specification.

Conclusion

Behavior-Driven Development is a useful way to create documentation for programs, encouraging the creation of docs that are both useful to humans and directly linked to the code they specify. Because of the polyglot nature of MongoDB drivers, and the importance of providing a consistent user experience for developers working with the drivers, BDD was an extremely good fit for the Meta Driver project. However, BDD has many other applications, across different products and different languages. Any project important enough to require accurate and up-to-date docs should consider incorporating BDD into its development methodology. While it is far from a “silver bullet” for documenting and testing code, it can streamline the process of doing both well, allowing developers to extend and maintain their code with confidence.

We believe that there are many other opportunities to deploy BDD across the 10gen/MongoDB codebase, and hope that this is just the beginning of 10gen embracing this methodology.

Future Steps

A lot still remains to be done in achieving the goal of a general specification and testing across many drivers. The specification must be extended further, to more completely cover CRUD operations, as well as to support functionality for administrative commands, read preference, write concern, replica sets, and other more advanced but crucial features of MongoDB. At the same time, the specification will need to be applied to other drivers, in order to expand its reach and to learn more about the varying needs and characteristics of different drivers. The specifications will likely need to be revised during this process.

It will also be important to document the process of attaching the Meta Driver specification to a new driver. Since I have only done this once (and there is still more work to do on PyMongo), I won’t be able to document this process very completely by the time I leave; it will likely have to be put together over time. I will also try to compile some of the best-practices for writing Cucumber specifications I discovered while working on the Meta Driver itself, to help out non-driver projects that want to make use of BDD.

Resources

For information about the Cucumber platform, cukes.info, the Cucumber website, is an excellent resource, as is the Cucumber wiki. The wiki contains a listing of Cucumber/Gherkin implementations in various languages, though it is not a complete catalogue. The wiki also has a useful page on the Gherkin specification language.

This project is completely open-source, and can be found in MongoDB’s github, here. The readme file in that repository contains some useful practical information for getting started using the Meta Driver, not covered here.

Acknowledgements

I’d like to thank Mike Friedman, my mentor, as well as Gary Murakami, Tyler Brock, and the rest of the drivers team at 10gen for their help and support on this project. I’d also like to thank 10gen as a whole, for a wonderful summer internship!

The MEAN Stack: Mistakes You're Probably Making With MongooseJS, And How To Fix Them

MongoDB

Releases

This is a guest post from Valeri Karpov, a MongoDB Hacker and co-founder of the Ascot Project.

If you’re interested in learning about how to use MongoDB with Node.js, sign up for a free, introductory 7-week course on MongoDB and Node.js.

For more MEAN Stack wisdom, check out his blog at TheCodeBarbarian.com. Valeri originally coined the term MEAN Stack while writing for the MongoDB blog, and you can find that post here.

If you’re familiar with Ruby on Rails and are using MongoDB to build a NodeJS app, you might miss some slick ActiveRecord features, such as declarative validation. Diving into most of the basic tutorials out there, you’ll find that many basic web development tasks are more work than you’d like. For example, if we borrow the style of http://howtonode.org/express-mongodb, a route that pulls a document by its ID will look something like this:


app.get('/document/:id', function(req, res) {
  db.collection('documents', function(error, collection) {
    collection.findOne(
      { _id : collection.db.bson_serializer.ObjectID.createFromHexString(req.params.id) },
      function(error, document) {
        if (error || !document) {
          res.render('error', {});
        } else {
          res.render('document', { document : document });
        }
      });
  });
});

In my last guest post on the MongoDB blog I touched on MongooseJS, a schema and usability wrapper for MongoDB in NodeJS. MongooseJS was developed by LearnBoost, an education startup based in San Francisco, and is maintained by 10gen. MongooseJS lets us take advantage of MongoDB’s flexibility and performance benefits while using development paradigms similar to Ruby on Rails and ActiveRecord. In this post, I’ll go into more detail about how The Ascot Project uses Mongoose for our data, some best practices we’ve learned, and some pitfalls we’ve found that aren’t clearly documented.

Before we dive into the details of working with Mongoose, let’s take a second to define the primary objects that we will be using. Loosely speaking, Mongoose’s schema setup is defined by 3 types: Schema, Connection, and Model.

  • A Schema is an object that defines the structure of any documents that will be stored in your MongoDB collection; it enables you to define types and validators for all of your data items.

  • A Connection is a fairly standard wrapper around a database connection.

  • A Model is an object that gives you easy access to a named collection, allowing you to query the collection and use the Schema to validate any documents you save to that collection. It is created by combining a Schema, a Connection, and a collection name.

  • Finally, a Document is an instantiation of a Model that is tied to a specific document in your collection.

Okay, now we can jump into the dirty details of MongooseJS. Most MongooseJS apps will start something like this:


var Mongoose = require('mongoose');
var myConnection = Mongoose.createConnection('localhost', 'mydatabase');

var MySchema = new Mongoose.Schema({
  name : {
    type : String,
    default : 'Val',
    enum : ['Val', 'Valeri', 'Valeri Karpov']
  },
  created : {
    type : Date,
    default : Date.now
  }
});
var MyModel = myConnection.model('mycollection', MySchema);
var myDocument = new MyModel({});

What makes this code so magical? There are 4 primary advantages that Mongoose has over the default MongoDB wrapper:

1. MongoDB uses named collections of arbitrary objects, and a Mongoose JS Model abstracts away this layer. Because of this, we don’t have to deal with tasks such as asynchronously telling MongoDB to switch to that collection, or work with the annoying createFromHexString function. For example, in the above code, loading and displaying a document would look more like:


app.get('/document/:id', function(req, res) {
  Document.findOne({ _id : req.params.id }, function(error, document) {
    if (error || !document) {
      res.render('error', {});
    } else {
      res.render('document', { document : document });
    }
  });
});

2. Mongoose Models handle the grunt work of setting default values and validating data. In the above example myDocument.name defaults to ‘Val’, and if we try to save with a name that’s not in the provided enum, Mongoose will give us back a nice error. If you want to learn a bit more about the cool things you can do with Mongoose validation, you can check out my blog post on how to integrate Mongoose validation with [AngularJS](http://thecodebarbarian.wordpress.com/2013/05/12/how-to-easily-validate-any-form-ever-using-angularjs/).
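
As a rough sketch of what that looks like (building on the MyModel example above; the value ‘Bob’ is just an illustration), saving a document whose name is outside the enum surfaces the failure in the save callback:

var badDocument = new MyModel({ name : 'Bob' }); // 'Bob' is not in the enum

badDocument.save(function(error) {
  // Mongoose refuses to persist the document and reports which validator failed
  if (error) {
    console.log(error.errors['name'].message);
  }
});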

3. Mongoose lets us attach functions to our models:

MySchema.methods.greet = function() { return 'Hello, ' + this.name; }; 
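
Continuing the sketch, any method attached this way becomes available on every document built from the model:

myDocument.greet(); // Returns 'Hello, Val', since name defaulted to 'Val'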

4. Mongoose handles limited sub-document population using manual references (i.e. no MongoDB DBRefs), which gives us the ability to mimic a familiar SQL join. For example:


var UserGroupSchema = new Mongoose.Schema({
  users : [{ type : Mongoose.Schema.ObjectId, ref : 'mycollection' }]
});

var UserGroup = myConnection.model('usergroups', UserGroupSchema);
var group = new UserGroup({ users : [myDocument._id] });
group.save(function() {
  UserGroup.find().populate('users').exec(function(error, groups) {
    // groups contains every document in usergroups with the users field populated
    console.log(groups[0].users[0].name); // Prints 'Val'
  });
});

In the last few months, my team and I have learned a great deal about working with Mongoose and using it to open up the true power of MongoDB. Like most powerful tools, it can be used well and it can be used poorly, and unfortunately a lot of the examples you can find online fall into the latter. Through trial and error over the course of Ascot’s development, my team has settled on some key principles for using Mongoose the right way:

1 Schema = 1 file

A schema should never be declared in app.js, and you should never have multiple schemas in a single file (even if you intend to nest one schema in another). While it is often expedient to inline everything into app.js, not keeping schemas in separate files makes things more difficult in the long run. Keeping each schema in its own file lowers the barrier to entry for understanding your code base and makes tracking changes much easier.
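
As a minimal sketch of that layout (the path models/User.js is an assumption here, though it matches the require call used later in this post), a schema file might look like this:

// models/User.js - exactly one schema, exported so callers can build a model from it
var Mongoose = require('mongoose');

exports.UserSchema = new Mongoose.Schema({
  username : { type : String }
});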

Mongoose can’t handle multi-level population yet, and populated fields are not Documents. Nesting schemas is helpful but it’s an incomplete solution. Design your schemas accordingly.

Let’s say we have a few interconnected Models:


var ImageSchema = new Mongoose.Schema({
  url : { type : String},

  created : { type : Date, default : Date.now }
});

var Image = db.model('images', ImageSchema);

var UserSchema = new Mongoose.Schema({

  username : { type : String },

  image : { type : Mongoose.Schema.ObjectId, ref : 'images' }
});


UserSchema.methods.greet = function() {
  return 'Hello, ' + this.name;
};

var User = db.model('users', UserSchema);

var GroupSchema = new Mongoose.Schema({
  users : [{ type : Mongoose.Schema.ObjectId, ref : 'users' }]
});

var Group = db.model('groups', GroupSchema);

Our Group Model contains a list of Users, which in turn each have a reference to an Image. Can MongooseJS resolve these references for us? The answer, it turns out, is yes and no.


Group.
  find({}).
  populate('users').
  populate('users.image').
  exec(function(error, groups) {
    groups[0].users[0].username; // OK

    groups[0].users[0].greet(); // ERROR – greet is undefined

    groups[0].users[0].image; // Is still an ObjectId, doesn't get populated
    groups[0].users[0].image.created; // Undefined
  });

In other words, you can call ‘populate’ to easily resolve an ObjectID into the associated object, but you can’t call ‘populate’ to resolve an ObjectID that’s contained in that object. Furthermore, since the populated object is not technically a Document, you can’t call any functions you attached to the schema. Although this is definitely a severe limitation, it can often be avoided by the use of nested schemas. For example, we can define our UserSchema like this:


var UserSchema = new Mongoose.Schema({
  username : { type : String },

  image : [ImageSchema]
});

In this case, we don’t have to call ‘populate’ to resolve the image. Instead, we can do this:


Group.
  find({}).
  populate('users').
  exec(function(error, groups) {
    groups[0].users[0].image[0].created; // Date associated with the image
  });

However, nested schemas don’t solve all of our problems, because we still don’t have a good way to handle many-to-many relationships. Nested schemas are an excellent solution for cases where the nested schema can only exist when it belongs to exactly one instance of its parent schema. In the above example, we implicitly assume that a single image belongs to exactly one user – no other user can reference the exact same image object.

For instance, we shouldn’t have UserSchema as a nested schema of Group’s schema, because a User can be a part of multiple Groups, and thus we’d have to store separate copies of a single User object in multiple Groups. Furthermore, a User ought to be able to exist in our database without being part of any groups.

Declare your models exactly once and use dependency injection; never declare them in a routes file.

This is best expressed in an example:


// GOOD

exports.listUsers = function(User) {
  return function(req, res) {
    User.find({}, function(error, users) {
      res.render('list_users', { users : users });
    });
  }
};

// BAD

var Mongoose = require('mongoose');
var db = Mongoose.createConnection('localhost', 'database');
var Schema = require('../models/User.js').UserSchema;

var User = db.model('users', Schema);

exports.listUsers = function(req, res) {
  User.find({}, function(error, users) {
    res.render('list_users', { users : users });
  });
};

The biggest problem with the “bad” version of listUsers shown above is that if you declare your model at the top of this particular file, you have to define it in every file where you use the User model. This leads to a lot of error-prone find-and-replace work for you, the programmer, whenever you want to do something like rename the Schema or change the collection name that underlies the User model.

Early in Ascot’s development we made this mistake with a single file, and ended up with a particularly annoying bug when we changed our MongoDB password several months later. The proper way to do this is to declare your Models exactly once, include them in your app.js, and pass them to your routes as necessary.
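
For illustration, here is a minimal sketch of that wiring (the file names, route path, and port are assumptions, not taken from the original post):

// app.js - models are created exactly once here, then injected into the routes
var express = require('express');
var Mongoose = require('mongoose');

var app = express();
var db = Mongoose.createConnection('localhost', 'mydatabase');

var User = db.model('users', require('./models/User.js').UserSchema);

var userRoutes = require('./routes/user.js');
app.get('/users', userRoutes.listUsers(User));

app.listen(3000);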

In addition, note that the “bad” listUsers is impossible to unit test. Its User model is baked into the routes file rather than injected, so we can’t mock it out for testing. In the “good” example, we can write a test easily using Nodeunit:


var UserRoutes = require('./routes/user.js');

// A minimal stand-in for the User model: find() just hands back the canned collection
var mockUser = {
  find : function(criteria, callback) { callback(null, mockUser.collection); }
};

exports.testListUsers = function(test) {
  mockUser.collection = [{ name : 'Val' }];

  var fnToTest = UserRoutes.listUsers(mockUser);
  fnToTest({}, {
    render : function(view, params) {
      test.equals(mockUser.collection, params.users);
      test.done();
    }
  });
};

And speaking of Nodeunit:

Unit tests catch mistakes, encourage you to write modular code, and allow you to easily make sure your logic works. They are your friend.

I’ll be the first to say that writing unit tests can be very annoying. Some tests can seem trivial, they don’t necessarily catch all bugs, and often you write way more test code than actual production code. However, a good suite of tests can save you a lot of worry; you can make changes and then quickly verify that you haven’t broken any of your modules. Ascot Project currently uses Nodeunit for our backend unit tests; Nodeunit is simple, flexible, and works well for us.

And there you have it! Mongoose is an excellent library, and if you’re using MongoDB and NodeJS, you should definitely consider using it. It will save you from writing a lot of extra code, it’ll handle some basic population, and it’ll handle all your validation and object creation grunt work. This adds up to more time spent building awesome stuff, and less time trying to figure out how to get your database interface to work.

Have any questions about the code featured in this post? Want to suggest a better approach? Feel like telling me why the MEAN Stack is the worst thing that ever happened in the history of the world and how horrible I am? Go ahead and leave a comment below, or shoot me an email at valkar207@gmail.com and I’ll do my best to answer any questions you might have. You can also find me on github at https://github.com/vkarpov15. My current venture is called The Ascot Project, and you can find that over at www.AscotProject.com.

More Information



This post was updated in January 2015 to include additional resources and updated links.

10gen Opens New EMEA Headquarters in Dublin

Last Friday, 10gen announced the opening of our new EMEA Headquarters in Dublin City Center along with a new recruiting campaign aimed at bringing over 60 skilled jobs to Ireland in the next 3 years.

“Our Dublin expansion is a true testament to the increasing adoption of MongoDB, and the growing demand for cutting edge technical skills in the region,” said Max Schireson, 10gen CEO. “Our EMEA headquarters is very close to leading educational institutions which makes it an ideal location to recruit young, talented graduates who are serious about building a career in technology.”

Barry O’Dowd, Manager of Emerging Business, IDA Ireland, said “Big data is a key growth sector and 10gen is a global company offering an open source database platform to help companies derive value from their data. It is a welcome addition to the Irish tech community. Ireland’s reputation as a location for fast-growing business continues to go from strength to strength. IDA looks forward to supporting 10gen as it grows and develops its Irish operations.”

Check out the additional coverage on the newest 10gen office opening

Interested in a position at our new EMEA headquarters? See our open positions

New Geo Features in MongoDB 2.4

MongoDB

Releases

Motivation

Geometric processing as a field of study has many applications, and has produced a large body of research and powerful tools. Many modern web applications have location-based components and require a data storage engine capable of managing geometric information. Typically this means introducing an additional storage engine into your infrastructure, which can be a time-consuming and expensive undertaking.

MongoDB has a set of geometric storage and search features. The MongoDB 2.4 release brought several improvements to MongoDB’s existing geo capabilities and the introduction of the 2dsphere index.

The primary conceptual difference (though there are also many functional differences) between the 2d and 2dsphere indexes is the type of coordinate system that they consider. Planar coordinate systems are useful for certain applications, and can serve as a simplifying approximation of spherical coordinates. As you consider larger geometries, or geometries near the meridians and poles, however, the requirement to use proper spherical coordinates becomes important.

In addition to this major conceptual difference, there are also significant functional differences, which are outlined in some depth in the Geospatial Indexes and Queries section of the MongoDB documentation. This post will discuss the new features that have been added in the 2.4 release.

What’s New

Storing non-point geometries

Unlike the 2d index, which only allowed the storage of points, the 2dsphere index allows the storage and querying of points, lines, and polygons. To support the storage of different geometries, instead of introducing a proprietary format, MongoDB conforms to the GeoJSON standard. GeoJSON is a collaborative community project that produced a specification for encoding entities in JSON. It has garnered significant support, including from the OpenLayers project and PostGIS, and it has growing language support for Python and Ruby.

Here are a few simple examples of GeoJSON embedded documents:

A BSON Document with a GeoJSON Point embedded in the geo field:

    {
        geo: {
            type: "Point",
            coordinates: [100.0, 0.0]
        }
    }

A BSON Document with a GeoJSON LineString embedded in the geo field:

    {
        geo: {
            type: "LineString",
            coordinates: [ [100.0, 0.0], [101.0, 1.0] ]
        }
    }

A BSON Document with a GeoJSON Polygon embedded in the geo field:

{
    geo: {
        type: "Polygon",
        coordinates: [
            [ [100.0, 0.0], [101.0, 0.0],
              [101.0, 1.0], [100.0, 1.0],
              [100.0, 0.0] ]
        ]
    }
}

Note: A GeoJSON Polygon’s coordinates are an array of arrays of point specifications. Each array of point specifications should have the same starting and ending point, to form a closed loop. The first array of point specifications defines the polygon’s exterior geometry, and each subsequent array of point specifications defines a “hole” in the polygon. Polygons should be non-self-intersecting, and holes should be fully contained by the polygon.

Inclusion searches on a sphere

The new $geoWithin operator, which takes a Polygon geometry as a specifier, returns geometries of any type that are fully contained within the polygon. It works without any index, but it must then examine every document in the collection to do so.
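
For example, here is a sketch in the mongo shell (the collection and field names are illustrative): build a 2dsphere index, then ask for everything that falls entirely inside a query polygon.

db.places.ensureIndex({ geo: "2dsphere" })

db.places.find({
    geo: {
        $geoWithin: {
            $geometry: {
                type: "Polygon",
                coordinates: [ [ [100.0, 0.0], [101.0, 0.0],
                                 [101.0, 1.0], [100.0, 1.0],
                                 [100.0, 0.0] ] ]
            }
        }
    }
})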

Intersecting geometries on a sphere

The new $geoIntersects operator, which takes any geometry as a specifier, returns any geometries that have a non-empty intersection with the specifier. $geoIntersects also works without an index, but it must likewise examine every document in the collection.
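
Again as an illustrative sketch, the query shape is the same; here we look for anything that crosses a LineString:

db.places.find({
    geo: {
        $geoIntersects: {
            $geometry: {
                type: "LineString",
                coordinates: [ [100.0, 0.0], [101.0, 1.0] ]
            }
        }
    }
})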

Better support for compound indexes

The 2d index can only be used in a compound index if: 1. it is the first field; 2. there are exactly two fields in the compound index; and 3. the second field is not itself a 2d index. 2dsphere indexes aren’t limited in this way, which allows us to pre-filter based on a non-geo field - which is often more efficient.

Consider the following queries:

  • Find Hot Dog Stands in New York state, i.e. use a compound index on (business_type, location).

  • Find geometries in New York state that are Hot Dog Stands, i.e. use a compound index on (location, business_type).

The first query will be much more efficient than the second, because business_type is a simple text field, and greatly reduces the set of geometries to search.
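
A sketch of the first, more efficient form (the collection name, the business_type value, and the rough New York polygon are all stand-ins):

// Any GeoJSON Polygon approximating New York state would do here
var newYorkState = {
    type: "Polygon",
    coordinates: [ [ [-79.8, 40.5], [-71.8, 40.5],
                     [-71.8, 45.1], [-79.8, 45.1],
                     [-79.8, 40.5] ] ]
};

db.businesses.ensureIndex({ business_type: 1, location: "2dsphere" })

db.businesses.find({
    business_type: "Hot Dog Stand",
    location: { $geoWithin: { $geometry: newYorkState } }
})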

Additionally, we can have multiple 2dsphere indexes in the same compound index. This allows queries like: “Find routes with a start location within 50 miles from JFK, and an end location within 100 miles of YYC”.
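
A sketch of such a query (the airport coordinates are approximate, and the $centerSphere radii are expressed in radians, i.e. miles divided by the Earth’s radius of roughly 3959 miles):

db.routes.ensureIndex({ start: "2dsphere", end: "2dsphere" })

db.routes.find({
    start: { $geoWithin: { $centerSphere: [ [-73.78, 40.64], 50 / 3959 ] } },  // within ~50 miles of JFK
    end:   { $geoWithin: { $centerSphere: [ [-114.01, 51.11], 100 / 3959 ] } } // within ~100 miles of YYC
})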

How it Works

Everything starts when you insert a geometry into a 2dsphere index. We use the open source S2 C++ library from Google to select a minimal set of cells that fully cover a geometry. This set of grid cells is called a covering, and the size of the cells is dynamic (between 500m and 100km on a side) based upon the size of the polygon being covered.

fig 3 - A very low granularity covering of the entire United Kingdom

fig 4 - A fairly granular covering of the A4 around Trafalgar Square.

Each cell in these coverings is then added to a standard B-tree index, with a key that is easily calculable from the cell’s location on the surface of the sphere - more granular (smaller) cells will have the same key prefix as a larger cell that occupies the same area of the surface of the sphere.

Intersection & Within searches

Finding geometries that may be intersecting or within a search polygon becomes as easy as generating a covering for the search specifier and then, for each cell in that covering, querying the B-tree for any geometries that interact with those cells. Once the list of possibly interacting geometries has been retrieved from the index, each geometry is checked in turn to see if it should be included in the result set.

Near searches

The near search provided by the $near operator is implemented by doing $within searches on concentrically growing donuts (circular polygons with circular holes).

We encourage user feedback and testing on these new Geo features and are excited to see what the community builds.

Map images © OpenStreetMap contributors, licensed under the Creative Commons Attribution-ShareAlike 2.0 license (CC-BY-SA).

Map data © OpenStreetMap contributors, licensed under the Open Data Commons Open Database License (ODbL).

The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js

MongoDB

Releases

January 2017 - An updated post is available: Introducing the MEAN and MERN stacks.

By Valeri Karpov, Kernel Tools engineer at MongoDB and co-founder of the Ascot Project.

A few weeks ago, a friend of mine asked me for help with PostgreSQL. As someone who’s been blissfully SQL-free for a year, I was quite curious to find out why he wasn’t just using MongoDB instead. It turns out that he thinks MongoDB is too difficult to use for a quick weekend hack, and this couldn’t be farther from the truth. I just finished my second 24 hour hackathon using Mongo and NodeJS (the FinTech Hackathon co-sponsored by 10gen) and can confidently say that there is no reason to use anything else for your next hackathon or REST API hack.

First of all, there are huge advantages to using a uniform language throughout your stack. My team uses a set of tools that we affectionately call the MEAN stack: MongoDB, ExpressJS, AngularJS, and Node.js. By coding with Javascript throughout, we are able to realize performance gains in both the software itself and in the productivity of our developers. With MongoDB, we can store our documents in a JSON-like format, write JSON queries on our ExpressJS and NodeJS based server, and seamlessly pass JSON documents to our AngularJS frontend. Debugging and database administration become a lot easier when the objects stored in your database are essentially identical to the objects your client Javascript sees. Even better, somebody working on the client side can easily understand the server side code and database queries; using the same syntax and objects the whole way through frees you from having to consider multiple sets of language best practices and reduces the barrier to entry for understanding your codebase. This is especially important in a hackathon setting: the team may not have much experience working together, and with such little time to integrate all the pieces of your project, anything that makes the development process easier is gold.

Another big reason to go with MongoDB is that you can use it in the same way you would a MySQL database (at least at a high level). My team likes to describe MongoDB as a “gateway drug” for NoSQL databases because it is so easy to make the transition from SQL to MongoDB. I wish someone had told me this when I first started looking into NoSQL databases, because it would have saved me a lot of headaches. Like many people, I was under the impression that CouchDB would be easier to use, while the performance improvements from MongoDB were something I could take advantage of only once I had gotten my feet wet with CouchDB. Instead CouchDB ended up being much more difficult to work with than I anticipated, largely because it uses custom Map-Reduce functions to query data, rather than the more traditional SQL queries I was used to. When I finally switched I was surprised to find that with MongoDB I could still write queries and build indices; the only difference is that the queries are written in JSON and query a flexible NoSQL database.

As a NoSQL database, MongoDB also allows us to define our schema entirely on the code side. With an RDBMS you’re faced with the inescapable fact that the objects in your database are stored in a format that is unusable by your front-end and vice versa. This wastes precious time and mental energy when you inevitably run into a data issue or need to do some database administration. For example, if you change your ActiveRecord schema in Ruby on Rails, you have to run the “rake” command to make sure your SQL columns stay in sync with your schemas. This is a clear violation of the age-old programming principle D.R.Y. (Don’t Repeat Yourself). In contrast, MongoDB doesn’t care what format the documents in your collections take (for the most part anyway). This means that you spend a lot less time worrying about schema migrations, because adding or removing data items from your schema doesn’t really require you to do anything on the database side.

At this point I should note that to get the most out of MongoDB in your MEAN stack, you’re going to want to take advantage of MongooseJS. Mongoose is a schema and general usability tool for Node that lets you use MongoDB while being as lazy as you want. For example, with Mongoose we can define a schema in JSON:

var UserSchema = new Mongoose.Schema({
    username : { type : String, validate : /\S+/, index : { unique : true } },
    profile : {
        name : {
            first : { type : String, default : '' },
            last : { type : String, default : '' }
        }
    }
});

We can then create a model by mapping our schema to our MongoDB collection:

 var User = db.model('users', UserSchema);

For all of the Ruby on Rails + ActiveRecord fans out there, note that this User object we’ve created above now allows us easy access to the basic features you would normally associate with ActiveRecord. For example, we can do things like:

User.findOne({ username : 'vkarpov' }, function(error, user) { /* user is either undefined or a user with username vkarpov */ });


User.findOne({ _id : req.params.id }, function(error, user) { /* user with ID defined by the hex string in req.params.id */ });
User.find({ 'profile.name.first' : 'Valeri' }, function(error, users) { /* users is a list with users with first name Valeri */ });
var user = new User({ username : 'vkarpov' });
user.save(function(error, user) { /* Saves user with default values for profile.name.first and .last into 'users' collection */ });
var user2 = new User({ username : 'v karpov' });
user2.save(function(error, user) { /* Error – regular expression validation for username failed */ });

Another powerful tool that MongoDB and MongooseJS provide is the ability to nest schemas. You’ll notice that in the User schema above we have “profile” and “name” objects that are part of a nested schema. This is a simple and useful design choice that illustrates how powerful nested schemas can be. Suppose that we want to give our user the ability to edit their first and last name, but not their username. Assuming the website has a /profile route where our user can change their first and last names, the Javascript front-end will get a JSON object as the result of a call to User.findOne on the backend. After the user modifies their profile, the front-end makes a POST request to /profile.json with the user object in JSON as the body. Now on the backend (using ExpressJS syntax) we can simply use:

function(req, res) {
    user.profile = req.body.profile;
    user.save(function(error, user) {
        res.redirect('/profile');
    });
}

That’s it. Mongoose takes care of validating the profile information, so we don’t have to change the POST /profile.json route if we change the User schema, and there is no way the username field can be modified. We could do the same thing when using Ruby on Rails and ActiveRecord, but this would require having a separate Profile schema in a separate table, meaning that among other things we’d incur a performance penalty because of the extra underlying join operation.
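
Putting it together, the handler above might be mounted roughly like this (the route path and the hypothetical req.session.userId lookup are assumptions made for the sketch):

app.post('/profile.json', function(req, res) {
    User.findOne({ _id : req.session.userId }, function(error, user) {
        if (error || !user) {
            return res.redirect('/error');
        }
        user.profile = req.body.profile; // username is deliberately left untouched
        user.save(function(error, user) {
            res.redirect('/profile');
        });
    });
});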

MongoDB is the bedrock of our MEAN stack, and you should strongly consider using it for your next project. You can write your entire stack in one language, have schemas for ease of use on top of a flexible and performant NoSQL database, and nest schemas when you don’t truly need to have separate collections. All of this adds up to you spending more of your precious hackathon hours building the other cool aspects of your product and less time figuring out how your objects translate between different levels of the stack.

By the way, MongoDB and MEAN are useful well beyond hackathons - we use this approach for all of our commercial products, most recently The Ascot Project. Want to read more about the MEAN stack or how we use MongoDB? Check out my development blog.

Check out the Ascot Project at the next MongoDB User Group in New York City.

More Information

This post was updated in January 2015 to include additional resources and updated links.

Checking Disk Performance with the mongoperf Utility

MongoDB

Releases

Note: while this blog post uses some Linux commands in its examples, mongoperf runs and is useful on just about all operating systems.

mongoperf is a utility for checking disk i/o performance of a server independent of MongoDB. It performs simple timed random disk i/o’s.

mongoperf has a couple of modes: mmf:false and mmf:true

mmf:false mode is a completely generic random physical I/O test – there is effectively no MongoDB code involved.


With mmf:true mode, the test is a benchmark of memory-mapped file based I/O. The code is not the MongoDB code but the actions are analogous. Thus this is a good baseline test of a system including the operating system virtual memory manager’s behavior.

To build the mongoperf tool:

scons mongoperf

(Or, “scons mongoperf.exe” on Windows.)

Or grab a prebuilt binary here.

Then try it out:

mmf:false mode

Here’s an example of a test run with 32 threads performing random physical reads. Note that mongoperf gradually adds more threads so that you can see the difference in performance with more concurrency.

$ echo "{nThreads:32,fileSizeMB:1000,r:true}" | mongoperf 
mongoperf
use -h for help
parsed options:
{ nThreads: 32, fileSizeMB: 1000, r: true }
creating test file size:1000MB ...
testing...
options:{ nThreads: 32, fileSizeMB: 1000, r: true }
wthr 32
new thread, total running : 1
read:1 write:0
4759 ops/sec 18 MB/sec
4752 ops/sec 18 MB/sec
4760 ops/sec 18 MB/sec
4758 ops/sec 18 MB/sec
4752 ops/sec 18 MB/sec
4754 ops/sec 18 MB/sec
4758 ops/sec 18 MB/sec
4755 ops/sec 18 MB/sec
new thread, total running : 2
9048 ops/sec 35 MB/sec
9039 ops/sec 35 MB/sec
9056 ops/sec 35 MB/sec
9029 ops/sec 35 MB/sec
9047 ops/sec 35 MB/sec
9072 ops/sec 35 MB/sec
9040 ops/sec 35 MB/sec
9042 ops/sec 35 MB/sec
new thread, total running : 4
15116 ops/sec 59 MB/sec
15346 ops/sec 59 MB/sec
15401 ops/sec 60 MB/sec
15448 ops/sec 60 MB/sec
15450 ops/sec 60 MB/sec
15502 ops/sec 60 MB/sec
15474 ops/sec 60 MB/sec
15480 ops/sec 60 MB/sec
read:1 write:0
read:1 write:0
new thread, total running : 8
read:1 write:0
read:1 write:0
15999 ops/sec 62 MB/sec
21811 ops/sec 85 MB/sec
21888 ops/sec 85 MB/sec
21964 ops/sec 85 MB/sec
21876 ops/sec 85 MB/sec
22058 ops/sec 86 MB/sec
21966 ops/sec 85 MB/sec
21976 ops/sec 85 MB/sec
new thread, total running : 16
24316 ops/sec 94 MB/sec
24949 ops/sec 97 MB/sec
25239 ops/sec 98 MB/sec
25032 ops/sec 97 MB/sec
25020 ops/sec 97 MB/sec
25331 ops/sec 98 MB/sec
25175 ops/sec 98 MB/sec
25081 ops/sec 97 MB/sec
new thread, total running : 32
24314 ops/sec 94 MB/sec
24991 ops/sec 97 MB/sec
24779 ops/sec 96 MB/sec
24743 ops/sec 96 MB/sec
24932 ops/sec 97 MB/sec
24947 ops/sec 97 MB/sec
24831 ops/sec 96 MB/sec
24750 ops/sec 96 MB/sec
24843 ops/sec 97 MB/sec

The above test was run on an SSD volume on a 64 bit Red Hat Enterprise Linux server. Notice how the ops/second increase as we add more threads (to a point). It’s interesting to look at the output of iostat while this was running:

iostat -xm 2

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s   avgrq-sz avgqu-sz   await  svctm  %util
dm-0              0.00     0.00  1532.00  4104.00     5.98    16.03     8.00  2354.34  517.87   0.17  96.30
dm-0              0.00     0.00  4755.00     0.00    18.57     0.00     8.00     0.93    0.19   0.19  92.55
dm-0              0.00     0.00  4755.50     0.00    18.58     0.00     8.00     0.93    0.20   0.20  93.20
dm-0              0.00     0.00  4753.50     0.00    18.57     0.00     8.00     0.93    0.20   0.20  93.30
dm-0              0.00     0.00  6130.50     1.50    23.95     0.01     8.00     1.23    0.20   0.16  95.15
dm-0              0.00     0.00  9047.50     0.00    35.34     0.00     8.00     1.84    0.20   0.11 100.05
dm-0              0.00     0.00  9033.50     0.00    35.29     0.00     8.00     1.84    0.20   0.11  99.95
dm-0              0.00     0.00  9053.50     9.50    35.37     0.04     8.00     2.00    0.22   0.11 100.00
dm-0              0.00     0.00 10901.00     0.00    42.58     0.00     8.00     2.43    0.22   0.09 100.05
dm-0              0.00     0.00 15404.50     0.00    60.17     0.00     8.00     3.56    0.23   0.06 100.05
dm-0              0.00     0.00 15441.50     0.00    60.32     0.00     8.00     3.58    0.23   0.06 100.20
dm-0              0.00     0.00 15476.50     0.00    60.46     0.00     8.00     3.56    0.23   0.06 100.00
dm-0              0.00     0.00 15433.00     0.00    60.29     0.00     8.00     4.87    0.23   0.06 100.05
dm-0              0.00     0.00 21024.00     0.00    82.12     0.00     8.00     7.06    0.39   0.05 100.40
dm-0              0.00     0.00 21917.00     0.00    85.62     0.00     8.00     6.91    0.31   0.05 100.35
dm-0              0.00     0.00 21964.00     0.00    85.80     0.00     8.00     6.96    0.32   0.05 100.30
dm-0              0.00     0.00 22738.00     0.00    88.82     0.00     8.00     8.07    0.34   0.04 100.25
dm-0              0.00     0.00 24893.00     0.00    97.24     0.00     8.00    10.05    0.41   0.04 100.60
dm-0              0.00     0.00 25060.00     0.00    97.89     0.00     8.00    10.21    0.40   0.04 100.20
dm-0              0.00     0.00 25236.50     0.00    98.58     0.00     8.00    10.34    0.40   0.04 100.70
dm-0              0.00     0.00 24802.00     0.00    96.88     0.00     8.00    11.28    0.40   0.04 100.60
dm-0              0.00     0.00 24859.00     0.00    97.11     0.00     8.00    10.08    0.45   0.04 100.70
dm-0              0.00     0.00 24793.50     0.00    96.85     0.00     8.00     9.89    0.39   0.04 101.10
dm-0              0.00     0.00 24881.00     0.00    97.19     0.00     8.00     9.93    0.39   0.04 100.90
dm-0              0.00     0.00 24823.00     0.00    96.96     0.00     8.00     9.79    0.39   0.04 100.50
dm-0              0.00     0.00 24805.00     0.00    96.89     0.00     8.00     9.92    0.40   0.04 100.40
dm-0              0.00     0.00 24901.00     0.00    97.27     0.00     8.00     9.97    0.39   0.04 100.90

A few things stand out.

  • First, the read per second (“r/s”) numbers match our mongoperf results.
  • Second, it’s clear that the “%util” column is fairly meaningless in this particular case – we were able to increase r/s even after %util hit 100. I assume this is because %util is a modeled value, and the assumptions behind that model don’t hold for this device.
  • Third, note that if you multiply the r/s value by 4KB, you get the rMB/s value – so we are really doing 4KB reads in this case.

We can now try some writes:

$ echo "{nThreads:32,fileSizeMB:1000,w:true}" | mongoperf
new thread, total running : 1
549 ops/sec 2 MB/sec
439 ops/sec 1 MB/sec
270 ops/sec 1 MB/sec
295 ops/sec 1 MB/sec
281 ops/sec 1 MB/sec
371 ops/sec 1 MB/sec
235 ops/sec 0 MB/sec
379 ops/sec 1 MB/sec
new thread, total running : 2
243 ops/sec 0 MB/sec
354 ops/sec 1 MB/sec
310 ops/sec 1 MB/sec
2491 ops/sec 9 MB/sec
2293 ops/sec 8 MB/sec
2077 ops/sec 8 MB/sec
2559 ops/sec 9 MB/sec
1099 ops/sec 4 MB/sec
new thread, total running : 4
2676 ops/sec 10 MB/sec
2667 ops/sec 10 MB/sec
2536 ops/sec 9 MB/sec
2600 ops/sec 10 MB/sec
2612 ops/sec 10 MB/sec
2498 ops/sec 9 MB/sec
2506 ops/sec 9 MB/sec
2492 ops/sec 9 MB/sec
new thread, total running : 8
2463 ops/sec 9 MB/sec
2439 ops/sec 9 MB/sec
2445 ops/sec 9 MB/sec
2401 ops/sec 9 MB/sec
2271 ops/sec 8 MB/sec
2202 ops/sec 8 MB/sec
2206 ops/sec 8 MB/sec
2181 ops/sec 8 MB/sec
new thread, total running : 16
2105 ops/sec 8 MB/sec
2263 ops/sec 8 MB/sec
2305 ops/sec 9 MB/sec
2408 ops/sec 9 MB/sec
2324 ops/sec 9 MB/sec
2244 ops/sec 8 MB/sec
2013 ops/sec 7 MB/sec
2004 ops/sec 7 MB/sec
new thread, total running : 32
read:0 write:1
2088 ops/sec 8 MB/sec
2091 ops/sec 8 MB/sec
2365 ops/sec 9 MB/sec
2278 ops/sec 8 MB/sec
2322 ops/sec 9 MB/sec
2241 ops/sec 8 MB/sec
2105 ops/sec 8 MB/sec
2241 ops/sec 8 MB/sec
2040 ops/sec 7 MB/sec
1997 ops/sec 7 MB/sec
2062 ops/sec 8 MB/sec
2111 ops/sec 8 MB/sec
2150 ops/sec 8 MB/sec
2253 ops/sec 8 MB/sec
2246 ops/sec 8 MB/sec
2188 ops/sec 8 MB/sec

This relatively old SSD drive can only do about 2K random writes per second. It appears we need more than one thread to saturate it, too; we could run with nThreads:1 for a long time to verify that is true. Here are some mongoperf statistics from a test run on an Amazon EC2 machine with internal SSD storage:

iops, thousands

threads  read test  write test
-------  ---------  ----------
1            4          8
2            8          8
4           16          8
8           32          8
16          64          8
32          70          8

Here’s a read test on a RAID-10 volume comprised of four spinning disks (SATA):

parsed options:
{ nThreads: 32, fileSizeMB: 1000, r: true }
creating test file size:1000MB ...
new thread, total running : 1
150 ops/sec 0 MB/sec
174 ops/sec 0 MB/sec
169 ops/sec 0 MB/sec
new thread, total running : 2
351 ops/sec 1 MB/sec
333 ops/sec 1 MB/sec
347 ops/sec 1 MB/sec
new thread, total running : 4
652 ops/sec 2 MB/sec
578 ops/sec 2 MB/sec
715 ops/sec 2 MB/sec
new thread, total running : 16
719 ops/sec 2 MB/sec
722 ops/sec 2 MB/sec
493 ops/sec 1 MB/sec
new thread, total running : 32
990 ops/sec 3 MB/sec
955 ops/sec 3 MB/sec
842 ops/sec 3 MB/sec

Note that when testing a volume using spinning disks it is important to make your test file large – much larger than the 1GB test file in the examples above. Otherwise the test will only be hitting a few adjacent cylinders on the disk and report results that are faster than you would achieve if the disk is used in its entirety. Let’s try a larger file:

{ nThreads: 32, fileSizeMB: 20000, r: true }
new thread, total running : 1
86 ops/sec 0 MB/sec
98 ops/sec 0 MB/sec
91 ops/sec 0 MB/sec
new thread, total running : 2
187 ops/sec 0 MB/sec
188 ops/sec 0 MB/sec
192 ops/sec 0 MB/sec
new thread, total running : 4
295 ops/sec 1 MB/sec
296 ops/sec 1 MB/sec
233 ops/sec 0 MB/sec
new thread, total running : 8
307 ops/sec 1 MB/sec
429 ops/sec 1 MB/sec
414 ops/sec 1 MB/sec
new thread, total running : 16
554 ops/sec 2 MB/sec
501 ops/sec 1 MB/sec
455 ops/sec 1 MB/sec
new thread, total running : 32
893 ops/sec 3 MB/sec
603 ops/sec 2 MB/sec
814 ops/sec 3 MB/sec

Let’s now try a write test on the RAID-10 spinning disks:

parsed options:
{ nThreads: 32, fileSizeMB: 1000, w: true }
creating test file size:1000MB ...
new thread, total running : 1
113 ops/sec 0 MB/sec
117 ops/sec 0 MB/sec
113 ops/sec 0 MB/sec
new thread, total running : 2
120 ops/sec 0 MB/sec
113 ops/sec 0 MB/sec
115 ops/sec 0 MB/sec
new thread, total running : 4
115 ops/sec 0 MB/sec
115 ops/sec 0 MB/sec
112 ops/sec 0 MB/sec
new thread, total running : 8
111 ops/sec 0 MB/sec
110 ops/sec 0 MB/sec
111 ops/sec 0 MB/sec
new thread, total running : 16
116 ops/sec 0 MB/sec
110 ops/sec 0 MB/sec
105 ops/sec 0 MB/sec
new thread, total running : 32
115 ops/sec 0 MB/sec
111 ops/sec 0 MB/sec
114 ops/sec 0 MB/sec

The write result above seems slower than one would expect – this is an example where more investigation and analysis would be appropriate, and a case where running mongoperf might prove useful.

mmf:true mode

mongoperf has another test mode where, instead of using direct (physical) I/O, it tests random reads and writes via memory-mapped file regions. In this case caching comes into effect – you should see very high read speeds if the datafile is small, and speeds that approach physical random I/O speed as the datafile becomes larger than RAM. For example:
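
The command line for such a run (matching the parsed options shown below) would be along these lines:

$ echo "{recSizeKB:8,nThreads:8,fileSizeMB:1000,r:true,mmf:true}" | mongoperf
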
parsed options:
{ recSizeKB: 8, nThreads: 8, fileSizeMB: 1000, r: true, mmf: true }
creating test file size:1000MB ...
new thread, total running : 1
read:1 write:0
65 ops/sec
79 ops/sec
92 ops/sec
107 ops/sec
111 ops/sec
87 ops/sec
125 ops/sec
141 ops/sec
new thread, total running : 2
273 ops/sec
383 ops/sec
422 ops/sec
594 ops/sec
1220 ops/sec
2598 ops/sec
36578 ops/sec
489132 ops/sec
new thread, total running : 4
183926 ops/sec
171128 ops/sec
173286 ops/sec
172908 ops/sec
173187 ops/sec
173322 ops/sec
173961 ops/sec
175195 ops/sec
new thread, total running : 8
389256 ops/sec
396595 ops/sec
398382 ops/sec
402393 ops/sec
400701 ops/sec
404904 ops/sec
400571 ops/sec

The numbers start low because at the beginning of the run the test file is not in the file system cache (in the Linux version of mongoperf, anyway). Data faults into the cache quickly because the readahead for the volume is quite large. Once the entire file is in RAM, the number of accesses per second is very high.

We can look at the readahead settings for the device with "sudo blockdev --report". Note that the value reported by this utility in the "RA" field is the number of 512-byte sectors.

During the above test, if we look at iostat, we see large reads occurring because of the readahead setting in use (the avgrq-sz column reports the average request size in sectors):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc             148.33     0.00  116.00    0.00    22.30     0.00   393.63     1.68   14.48   7.19  83.40
sdd             130.67     0.00  113.00    0.00    20.38     0.00   369.35     1.54   13.58   7.19  81.23
sde             154.00     0.00  113.67    0.00    21.85     0.00   393.64     1.84   16.23   7.38  83.87
sdb             140.00     0.00  107.00    0.00    20.27     0.00   387.91     1.88   17.58   7.84  83.87
md0               0.00     0.00 1025.33    0.00    85.34     0.00   170.45     0.00    0.00   0.00   0.00

Thus we are reading ahead approximately 200KB from each spindle on each physical random read I/O.

Note that if your database is much larger than RAM and you expect cache misses on a regular basis, this readahead setting might be too large – if the object to be fetched from disk is only 8KB, another ~200KB in this case is being read ahead with it. This is good for cache preheating, but that readahead could evict other data from the file system cache; if the data read ahead is “cold” and unlikely to be used, that would be bad. In that situation, make the readahead setting for your volume smaller. 32KB might be a good setting, perhaps 16KB on a solid state disk. (It is likely never helpful to go below 8KB – 16 sectors – as MongoDB b-tree buckets are 8KB.)
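
As a sketch (assuming the volume of interest is the md0 RAID device from the iostat output above – substitute your own device), readahead can be lowered with blockdev, which takes the value in 512-byte sectors:

# set readahead to 32KB (64 x 512-byte sectors), then read it back
$ sudo blockdev --setra 64 /dev/md0
$ sudo blockdev --getra /dev/md0
64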

One trade-off with readahead is that cache preheating will take a long time if the readahead setting is tiny. Consider the following run, where there was no readahead (just 4KB reads on faults, with no readahead occurring):

parsed options:
{ nThreads: 32, fileSizeMB: 1000, r: true, mmf: true }
creating test file size:1000MB ...
testing...
new thread, total running : 1
67 ops/sec
110 ops/sec
184 ops/sec
167 ops/sec
174 ops/sec
159 ops/sec
189 ops/sec
190 ops/sec
new thread, total running : 2
362 ops/sec
393 ops/sec
371 ops/sec
354 ops/sec
374 ops/sec
388 ops/sec
384 ops/sec
394 ops/sec
new thread, total running : 4
486 ops/sec
400 ops/sec
570 ops/sec
589 ops/sec
567 ops/sec
545 ops/sec
576 ops/sec
412 ops/sec
new thread, total running : 8
666 ops/sec
601 ops/sec
499 ops/sec
731 ops/sec
618 ops/sec
448 ops/sec
508 ops/sec
547 ops/sec
new thread, total running : 16
815 ops/sec
802 ops/sec
917 ops/sec
580 ops/sec
955 ops/sec
1006 ops/sec
1048 ops/sec
938 ops/sec
new thread, total running : 32
1993 ops/sec
1186 ops/sec
1331 ops/sec
1317 ops/sec
1298 ops/sec
991 ops/sec
1431 ops/sec
1406 ops/sec
1395 ops/sec
1099 ops/sec
1265 ops/sec
1400 ops/sec
1484 ops/sec
1436 ops/sec
1352 ops/sec
1438 ops/sec
1380 ops/sec
1350 ops/sec
1565 ops/sec
1440 ops/sec
1015 ops/sec
1253 ops/sec
1414 ops/sec
1443 ops/sec
1478 ops/sec
1405 ops/sec
1305 ops/sec
1518 ops/sec
1217 ops/sec
1573 ops/sec
1605 ops/sec
1476 ops/sec
1130 ops/sec
1362 ops/sec
1463 ops/sec
1740 ops/sec
1682 ops/sec
1653 ops/sec
1135 ops/sec
1521 ops/sec
1821 ops/sec
1708 ops/sec
1701 ops/sec
1631 ops/sec
1195 ops/sec
1752 ops/sec
1701 ops/sec

... time passes ...

353038 ops/sec
353508 ops/sec
353159 ops/sec

Near the end of the run, the entire test file is in the file system cache:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28564 dwight    20   0 1292m 1.0g 1.0g S 609.1  3.3   0:47.11 mongoperf

Note, though, that if we fetch only 4KB at a time at 400 physical random reads per second, we will need up to 1GB / 4KB per page / 400 pages fetched per second = 655 seconds to heat up the cache. (And 1GB is a small cache – imagine a machine with 128GB of RAM and a database that large or larger.) There are ways to preheat a cache other than readahead; for more information see this post on preheating. On Linux, we suggest using a recSizeKB of 8 or larger with mmf:true – it seems that when only a single 4KB page is touched, certain kernel versions may not perform readahead (at least the way mongoperf is coded).
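
One such approach (a sketch only – this uses MongoDB’s touch command, available since 2.2, and the collection name here is hypothetical) is to ask mongod to load a collection’s data and indexes into memory from the mongo shell before taking traffic:

// page a collection's data extents and indexes into RAM
db.runCommand( { touch: "mycollection", data: true, index: true } )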

Writes with mmf:true

We can also do load testing and simulations of writes via memory-mapped files (analogous to what MongoDB does in its storage engine) with mongoperf. Use mmf:true and w:true for this.

MongoDB writes are written to the crash recovery log (journal) by mongod almost immediately; however, the datafile writes can be deferred for up to a minute. mongoperf simulates this behavior by fsync'ing its test datafile once a minute. Since writes are only allowed to be lazy by that amount, even if the data written fits in RAM, it will be written to disk fairly soon (within a minute) – thus you may see a good amount of random write I/O while mongoperf is running, even if the test datafile fits in RAM. This is one reason SSDs are often popular in MongoDB deployments.

For example, consider a scenario where we run the following:

$ echo "{recSizeKB:8,nThreads:32,fileSizeMB:1000,w:true,mmf:true}" | mongoperf

If our drive can write 1GB (the test datafile size) sequentially in less than a minute (not unusual), the test will likely report a very high sustained write rate, even after running for more than a minute. However, if we then make the file far larger than 1GB, we will likely see a significant slowdown in write speed, as the background flushing of data >= 1 minute old becomes a factor (at least on spinning disks).
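
To see that effect, the same test can be re-run with a much larger datafile – for example (the 20000MB size here is an arbitrary illustrative choice):

$ echo "{recSizeKB:8,nThreads:32,fileSizeMB:20000,w:true,mmf:true}" | mongoperf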

Mixed mode

Note that mongoperf has some other options; see the --help option for more information. In particular, you can run concurrent reads and writes in the same test, and you can specify read and write rates to explicitly simulate a particular scenario you would like to test.
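
A mixed read/write run might look like the following sketch (r and w are the same options used throughout this post; consult mongoperf --help for the rate-related options available in your build):

$ echo "{nThreads:16,fileSizeMB:1000,r:true,w:true}" | mongoperf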

Conclusions and Caveats

Note that mongoperf is not MongoDB. mmf:false mode tests physical disk I/O with no caching; because of caching, MongoDB will usually perform vastly better than that. Additionally, mmf:true is not a perfect simulation of MongoDB, and you might see better performance from MongoDB than mongoperf indicates.

P.S. The mongoperf utility is very simple (a couple hundred lines of code), so you may wish to take a look at its source code.

MongoDB Text Search: Experimental Feature in MongoDB 2.4

MongoDB

Releases

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need.

Setting up Text Search


You can configure text search in the mongo shell:

db.adminCommand( { setParameter : 1, textSearchEnabled : true } )


Or enable it with a command-line parameter when starting mongod:

mongod --setParameter textSearchEnabled=true

A Simple Example:


In this example, we will insert 3 documents into a collection, create a text index on the caption field, and then query for the word “Australian”.

// "t" is the collection we are working with; for example:
t = db.articles;

// inserting documents
t.save( { _id: 1, title: "Physics World", caption: "Physics World is the membership magazine of the Institute of Physics."} );

t.save( { _id: 2, title: "A break away!", caption: "'A break away!' is an 1891 painting by Australian artist Tom Roberts."} );

t.save( { _id: 3, title: "Mahim Bora", caption: "Mahim Bora (b.1926) is an Indian writer and educationist from Assam state."});

// this creates an index
t.ensureIndex( { caption: "text" } );

// this executes the search command. The “score” below indicates how many of the search terms appear in the indexed fields of the document and how frequently they appear

t.runCommand( "text", { search : "Australian" } );
=>
  {
  "queryDebugString" : "australian||||||",
  "language" : "english",
  "results" : [
      {
      "score" : 0.55,
      "obj" : {
        "_id" : 2,
        "title" : "A break away!",
        "text" : "'A break away!' is an 1891 painting by Australian artist Tom Roberts."
        }
      }
    ],
  "stats" : {
    "nscanned" : 1,
    "nscannedObjects" : 0,
    "n" : 1,
    "timeMicros" : 89
    },
  "ok" : 1
  }
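
The text command also accepts optional parameters. As a hedged sketch of the 2.4 interface, filter adds a regular query predicate to the text search and limit caps the number of results returned:

t.runCommand( "text", { search : "Australian", filter : { _id : { $gt : 1 } }, limit : 5 } );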

We’ll be organizing a series of project nights through the MongoDB User Group network for anyone interested to come and try out the feature and provide feedback. Here is a list of the upcoming testing sessions:


If you’re interested in organizing a project night for text search get in touch with the community team who can help you get set up. All you need is some computers and a few data sets and you’ll be ready to test.

3D Repo Runs MongoDB

MongoDB

Releases

If you’re an architectural or engineering firm, you’ve undoubtedly confronted the difficulty of managing and collaborating on 3D assets like CAD drawings. Sharing massive files is hard but feasible; what really complicates it is the inability to tell whether you’re working with the latest version. For the CAD-inclined, there’s hope. Jozef Dobos, a doctoral student at University College London (UCL), has applied the geospatial indexing capabilities of MongoDB to a version control system for 3D assets called 3D Repo. Sponsored by Arup Foresight, the built environment innovation division of Arup Group Limited, a global design and business consulting firm with offices in over 30 countries, 3D Repo leverages the flexibility of MongoDB’s data model, not to mention its geospatial capabilities, to make collaboration on 3D assets easy.

The Problem

Whether an architectural firm or a product design company, collaboration on large (500GB+) 3D assets is a critical but difficult task, one not solved simply by sending links around to these assets. The maintenance of assets in a large 3D visualisation can involve authors that are numerous, geographically dispersed, and diverse in their skillsets and tools.

This presents problems including, but not limited to, maintaining consistency of the models and dealing with concurrent edits in the same part of a 3D scene. For example, in the industry today, each user loads a particular 3D scene into a modeling tool, modifies it, and then re-saves the entire file, making version tracking and sharing unnecessarily murky. By the time a simulation is completed, the design may have progressed so far that the results are no longer applicable.

An SVN for 3D Assets

Looking for a better way to collaborate on 3D files, Dobos developed a unified and integrated framework, called 3D Repo, that supports collaborative editing and distribution of 3D assets. Think of it like Subversion (SVN), but for 3D assets, not code.

The 3D Repo framework tracks multiple revisions of 3D assets so that they can be integrated later on. It thus provides similar functionality to file-based revision control systems like SVN, but is built around MongoDB, thereby avoiding the constraints of a file-based system. The framework also supports distributed editing over the Internet and additional lightweight clients in web-browsers and mobile devices. Such an approach is expected to improve the engineering as well as public engagement and considerably reduce the costs of future industrial development.

Why MongoDB?

While there are other proprietary and open-source databases that support spatial data directly - e.g. Oracle Spatial and PostGIS for PostgreSQL - these alternatives focus on 2D geometry, not 3D models and their properties. They also do not support the kind of revision history that is commonly required in 3D asset management.

MongoDB, on the other hand, avoids rigid data models and is optimized for large read-write operations. It is therefore possible to store 3D models in a database and, due to MongoDB’s schema flexibility, also track other associated data such as semantic relationships and even individual revisions, as is the case in 3D Repo.

As Dobos explains:

Basically, relational databases can in no way cut it. Rigid table structures are not suitable for highly diverse and large 3D data. In our case, a single 3D model or, better said, a “3D scene” is represented as a scene graph, where each node can be, and most of the time is, something different. It can be a node for a mesh, hence a large binary array, or a transformation, where transformation matrix is simply 16 numbers, or animation, bone, material, texture etc. Each of these is represented and stored differently. So key-value pairs are a match made in heaven for this type of data.

In MongoDB, 3D Repo assigns two collections (tables) per 3D scene, one for all the scene graph constituents and one for all the documents that belong to a revision history. Hence, each of these collections stores a directed acyclic graph, making the access implementation reusable. Once the data is in a database, access is implicitly supported in a distributed manner via a dedicated query language.
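
As a purely illustrative sketch (the collection and field names below are hypothetical, not 3D Repo’s actual schema), heterogeneous scene-graph nodes of different kinds can sit side by side as documents in a single collection:

// hypothetical scene-graph collection: each document is a node of a different kind
db.scene_nodes.insert( { type : "transformation",
                         matrix : [ 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1 ],
                         parents : [] } );
db.scene_nodes.insert( { type : "mesh",
                         vertices : BinData(0, "AAAAAAAAAAA="),  // placeholder binary vertex buffer
                         parents : [ /* _id of the transformation node */ ] } );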

In order to offer a scalable visualization platform, 3D Repo streams decomposed scene graph components from MongoDB onto client devices, reconstructs their 3D representation and displays them for viewing.

This explains why MongoDB is superior to an RDBMS like Oracle for this kind of application, but it doesn’t address the possibility that Dobos could have turned to another NoSQL database – particularly a graph database, given that a scene graph is a directed acyclic graph. Surely that should be a fit for a graph database?

Not really, as Dobos articulates:

Most simply stated, we don’t need any complicated graph traversal operations. We opted for MongoDB because of its efficiency with binary storage, BSON rather than ASCII (hence a clear winner over CouchDB, too), and due to its flexibility, not to mention its web-ready approach. MongoDB is proven to scale well, has sharding built in, and there is a thriving community to support it. Most of the design inspiration in our system actually comes from webinars by Wordnik and their experience. We are really grateful for those!

In sum, MongoDB delivers the flexibility and 3D geospatial capabilities necessary to make 3D Repo a reality.

Results

The 3D Repo approach offers significant benefits over alternative techniques like geographic information systems (GIS), given the direct MongoDB database connection, the scalability of MongoDB, and its built-in support for geospatial indexing. In addition, MongoDB enables version control for 3D assets in a way that is not possible with proprietary or open-source RDBMSs. Geographic information systems such as Oracle Spatial or PostGIS may be acceptable for 2D assets such as lines, but they cannot match MongoDB for representing 3D data efficiently and carrying metadata on top. Importantly, 3D Repo’s MongoDB-based solution can be extended to support various data types, including building information modelling (BIM) and other representations.

For more information on the research Dobos is doing on 3D Repo, please see “Revision Control Framework for 3D Assets” (PDF) or review his related research.

Note: MongoDB does not currently support 3D geospatial indexing.

MongoQP: MongoDB Slow Query Profiler

MongoDB

Releases

Twice a year, 10gen’s Drivers and Innovations team gathers for a face-to-face meeting to work together and set goals for the upcoming six months. This year the group broke up into teams for an evening hackathon. MongoQP, a query profiler, was one of the hacks presented; it was built by Jeremy Mikola, PHP Evangelist at 10gen.

Logging slow queries is essential for any database application, and MongoDB makes doing so relatively painless with its database profiler. Unfortunately, making sense of the system.profile collection and tying its contents back to your application requires a bit more effort. The heart of mongoqp (Mongo Query Profiler) is a bit of map/reduce JS that aggregates those queries by their BSON skeleton (i.e. keys preserved, but values removed). With queries reduced to their bare structure, any of their statistics can be aggregated, such as average query time, index scans, counts, etc.
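
The general shape of that aggregation (a simplified sketch, not mongoqp’s actual code – it groups top-level keys only and sums just counts and execution time) looks roughly like this in the mongo shell:

db.system.profile.mapReduce(
    function () {
        if (!this.query) return;              // only consider entries that recorded a query
        var skeleton = {};                    // keep the keys, drop the values
        for (var k in this.query) { skeleton[k] = 1; }
        emit(skeleton, { count: 1, millis: this.millis });
    },
    function (key, values) {
        var totals = { count: 0, millis: 0 };
        values.forEach(function (v) { totals.count += v.count; totals.millis += v.millis; });
        return totals;
    },
    { out: { inline: 1 } }
);

From the inline results, an average query time per query shape is simply millis / count for each grouped key.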

As a fan of Genghis, a single-file MongoDB admin app, I originally intended to contribute a new UI with the profiler results, but one night was not enough time to wrap my head around Backbone.js and develop the query aggregation. Instead, I whipped up a quick frontend using the Silex PHP micro-framework. But with the hack day deadline no longer looming, there should be plenty of time to get this functionality ported over to Genghis. Additionally, the map/reduce JS may also show up in Tyler Brock’s mongo-hacker shell enhancement package.

While presenting mongoqp to my co-workers, I also learned about Dan Crosta’s professor, which already provides many of the features I hoped to implement, such as incremental data collection. I think there is still a benefit to developing the JS innards of mongoqp and getting its functionality ported over to other projects, but I would definitely encourage you to check out professor if you’d like a stand-alone query profile viewer.


Contributions welcome through Github.

Introducing MongoClient

MongoDB

Releases

Today we are releasing updated versions of most of the officially supported MongoDB drivers with new error checking and reporting defaults. See below for more information on these changes, and check your driver docs for specifics.

Over the past several years, it’s become evident that MongoDB’s previous default behavior (where write messages did not wait for a return code from the server by default) wasn’t intuitive and has caused confusion for MongoDB users. We want to rectify that with minimal disruption to the MongoDB apps already in production.

History

First, I thought it would be interesting to share the history behind the previous default behavior, and why and how we are changing it.

The old behavior goes back to the very beginning of 10gen, before Dwight and I imagined MongoDB as a stand-alone database. When we first started 10gen in the fall of 2007, we set out to build a full platform as a service stack with MongoDB as the data layer. This was a fully hosted system (still open source), that encompassed a load balancer, auto scaling application server and data tier. The application side was a full server side JavaScript environment.

Every request into 10gen was an HTTP request. So you can imagine that a controller doing some user analytics could look like this:

URL: http://foo.10gen.com/show?sect=sports&page=blog1

CODE:

db.sect_views.update(
  { _id : db.getParameter( "sect" ) },
  { $inc : { count : 1 } }, /*upsert*/true );
db.page_views.update(
  { _id : db.getParameter( "page" ) },
  { $inc : { count : 1 } }, true );

Writes in that system did not individually wait for a response from the database. However, the application server itself always checked the database for any errors that occurred during the entire page load (using getLastError and getPrevError) so that the user/system would be notified of any issues. Developers could of course also call getLastError whenever they wanted. This worked great in the platform, as we were able to control the whole access pattern.

In January of 2009, we decided for a variety of reasons to only focus on the data tier (MongoDB). At that time, a number of people had been using MongoDB in production for almost a year as part of the full stack, and a lot more were very interested in using it standalone.

Over the course of the next few months, we wrote the initial implementations of the Java, Python, Ruby and PHP drivers. All of those drivers used the same network protocol as the original application server, which has non-synchronous write operations. It seemed natural to us at the time given the background, but it is clear that this is not intuitive for new users of MongoDB who had never used the full stack.

New Behavior

Going forward, the default clearly has to be to wait for the database to acknowledge all writes; that is much more intuitive. Just flipping the default, however, would break backward compatibility for apps in production.

The change we’re going to make is to add a new top level connection class in each driver. For example, in Java, previously you would do:

Mongo mongo = new Mongo( "mongoserver" );

That class, Mongo, will maintain the old default and will eventually be deprecated.

For new code, you will do:

MongoClient mongo = new MongoClient( "mongoserver" );

which will default to a WriteConcern of 1. Everything else will be the same.

The old class will remain for quite a while (but not forever), so that we won’t break old code right now. As another benefit, every single driver will use MongoClient, so for the first time at least the top-level class will have the same name across the board. All documentation, tutorials, and sample code have been changed accordingly.

More Information

  • For driver downloads, docs, and tutorials, please visit the drivers page on the MongoDB wiki.
  • For bug reports and feature requests please visit jira.mongodb.org.
  • For any other feedback, please leave a comment on this post or in our user forum.

We appreciate your continued support and feedback on MongoDB.

- Eliot and the MongoDB Team