GIANT Stories at MongoDB

Living in the post-transactional database future

Given that we’ve spent decades building applications around relational databases, it’s not surprising that the first response to the introduction of NoSQL databases like MongoDB is sometimes “Why?” Developers aren’t usually the ones asking this question, because they love the approachability and flexibility MongoDB gives them. But DBAs who have built their careers on managing heavy RDBMS infrastructure? They’re harder to please.

10gen president Max Schireson estimates that 60 percent of the world’s databases are operational in nature, which is MongoDB’s market. Of those use cases, most of them are ripe for a non-relational approach.

The database market is rapidly changing, and very much up for grabs. Or as Redmonk analyst James Governor puts it,

“The idea that everything is relational? Those days are gone.”

As useful as relational databases are (and they’re very useful for a certain class of application), they are losing relevance in a world where complex transactions are more the exception than the rule. In fact, I’d argue that over time, the majority of application software that developers write will serve use cases that are better fits for MongoDB and other NoSQL technology than for legacy RDBMS.

That’s the future. What about now?

Arguably, many of the applications being built today are already post-transaction, ripe for MongoDB and poor fits for RDBMS. Consider:

  • Amazon: its systems that process order transactions (RDBMS) are largely “done” and “stable”. Amazon’s current development focuses on how to provide better search and recommendations, or how to adapt prices on the fly (NoSQL).
  • Netflix: the vast majority of its engineering is focused on recommending better movies to you (NoSQL), not processing your monthly bill (RDBMS).
  • Square: the easy part is processing the credit card (RDBMS). The hard part is making it location aware, so it knows where you are and what you’re buying (NoSQL).

It’s easy, but erroneous, to pigeon-hole these examples as representative of an anomalous minority of enterprises. Yes, these companies represent the cutting edge of both business and technology. But no, they are not alone in building these sorts of applications. For every early-adopter Netflix there’s a sizable, growing population of mainstream companies in media (e.g., The Guardian), finance (e.g., Intuit), or other verticals that are looking to turn technology into a revenue-driving asset, and not simply something that helps keep the lights on and payrolls running.

When what we built were websites, RDBMS worked great. But today, we’re building applications that are mobile, social, involve high volume data feeds, incorporate predictive analytics, etc. These modern applications? They don’t fit RDBMS. Andy Oliver lists 10 things never to do with a relational database, but the list is much longer, and growing.

MongoDB is empowering the next generation of applications: post-transactional applications that rely on bigger data sets that move much faster than an RDBMS can handle.

Yes, there will remain a relatively small sphere of applications unsuitable for MongoDB (including applications with a heavy emphasis on complex transactions), but the big needs going forward like search, log analysis, media repositories, recommendation engines, high-frequency trading, etc.? Those functions that really help a company innovate and grow revenue?

They’re best done with MongoDB.

Of course, given RDBMS’ multi-decade legacy, it’s natural for developers to try to force RDBMS to work for a given business problem. Take log analysis, for example. Oliver writes:

Log analysis: …[T]urn on the log analysis features of Hadoop or RHQ/JBossON for a small cluster of servers. Set the log level and log capture to anything other than ERROR. Do something more complex and life will be very bad. See, this kind of somewhat unstructured data analysis is exactly what MapReduce à la Hadoop and languages like PIG are for. It’s unfortunate that the major monitoring tools are RDBMS-specific — they really don’t need transactions, and low latency is job No. 1.

Forward-looking organizations already realize that MongoDB is an excellent fit for log management, which is why we see more and more enterprises turning to MongoDB for this purpose.

I expect this to continue. As MongoDB continues to enrich its functionality, the universe of applications for which it is not merely applicable, but also better, will continue to expand, even as the universe of applications for which RDBMS is optimal will decline. Indeed, we’re already living in a post-transactional world. Some people just don’t know it yet. (Or, as William Gibson would say, “The future is already here – it’s just not very evenly distributed.”)

Posted by Matt Asay, vice president of Corporate Strategy, with significant help from my inestimable colleague, Jared Rosoff.

Tagged with: NoSQL, MongoDB, RDBMS, relational, James Governor, Redmonk, log analysis, Andy Oliver, transactions, Netflix, Amazon, Square, operational database, DBA

Data Science on MongoDB...At Last!

Today, I’m excited to announce the launch of Precog for MongoDB, a release that bundles all of the really cool Precog technology into a free package that anyone can download and deploy on their existing MongoDB database.

Precog is a data science platform that lets developers and data scientists do advanced analytics and statistics using Quirrel, the “R for big data” language. You can analyze data programmatically with a REST API (or client library) or interactively with Labcoat, an easy-to-use HTML5 application built on the REST API. We provide a cloud-hosted version of Precog, but we’ve known for a long time that we were going to bring a standalone version of our data science platform to some NoSQL database.

MongoDB makes the perfect choice for many reasons:

  • MongoDB developers share our passion for creating software that developers love to use.
  • Quirrel is designed to analyze JSON, which is natively supported by MongoDB.
  • MongoDB has a basic query and aggregation framework, but to do more advanced analytics you have to write lots of custom code or export the data into an RDBMS, both of which are very painful.
  • We’re great friends of some of the 10gen developers and have released open source software for MongoDB.

Precog for MongoDB gives you the ability to analyze all the data in your MongoDB database, without forcing you to export data into another tool or write any custom code.

We’re really excited about the release and encourage you to download it from the official product page and start using it today.

In the remainder of this post, I’m going to quickly walk you through installation and configuration of the Precog for MongoDB release.

Step 1: Unpack the Download

The download is a ZIP file that contains the following files:

config.cfg
precog.jar
precog.sh
precog.bat

The file precog.jar is the Java JAR that bundles all of the Precog dependencies into a single (really big!) file. The files precog.sh and precog.bat are scripts that launch precog.jar.

The file config.cfg contains configuration information.

Step 2: Configure Precog

All the configuration settings for Precog are stored in the file config.cfg, with reasonable defaults.

There are two things you need to do at a minimum before you can launch Precog:

  1. Tell Precog where to find the MongoDB server.
  2. Tell Precog what the master account is.

To tell Precog where to find the MongoDB server, simply edit the following settings:

queryExecutor {
  mongo {
    server = "mongodb://localhost:27017"
  }
}

Change the “localhost:27017” portion to the host and port of your mongo server. For optimal performance, you should launch Precog on the same machine that is running the MongoDB server.

Precog will map the MongoDB databases and collections into the file system by placing the databases at the top level of the file system, and will nest the database collections under the databases (e.g. /mydb/mycollection/).

To tell Precog what the master account is, edit config.cfg and add the following settings:

security {
  masterAccount {
    apiKey = "12345678-1234-1234-1234-123456789abc"
  }
}

The API key for the master account can be anything you like, but you should treat it securely, because whoever has it has full access to all of your MongoDB data.

You may also want to tweak the ports that Precog uses for the web server that exposes the Precog REST API and for serving Labcoat:


server {
  port = 8888
}
...
labcoat {
  port = 8000
}


Step 3: Launch Precog

To run precog.jar, you will need to install JRE 6 or later (many systems already have Java installed). If you’re on an OS X or Linux machine, just run the precog script, which automatically launches Java:

precog.sh


If you’re on a Windows machine, you can launch Precog with the precog.bat script.

Once Precog has been launched, it will start a web server that exposes the REST API as well as labcoat.

Step 4: Try the API

Once Precog is running, you have full access to the Precog REST API. You can find a large number of open source client libraries available on Github, and the Precog developers site contains a bunch of documentation and tutorials for interacting with the API.


Step 5: Try Labcoat

Labcoat is an HTML5 application that comes bundled in the download. You don’t have to use Labcoat, of course, since Precog has a REST API, but Labcoat is the best way to interactively explore your data and develop Quirrel queries.

The precog.jar comes with a bundled web server for labcoat, so once it’s running just point your browser at http://localhost:8000/ (or whatever port you’ve configured it for) and you’ll have a new labcoat IDE pointing at your local Precog REST API.

Step 6: Analyze Data!

Once you’ve got Labcoat running, you’re all set! You should see your MongoDB collections in the file system explorer, and you can query data from the collections, develop queries to analyze the data, and export queries as code that runs against your Precog server.

Precog is a beta product, and Precog for MongoDB is hot off the press. You may encounter a few rough corners, and if so, we’d love to hear about them (just send an email to support@precog.com).

If you end up doing something cool with Precog for MongoDB, or if you just want to say hello, feel free to reach out to us via our website, or to me personally at john@precog.com.

Have fun analyzing!

John A. De Goes, CEO/Founder of Precog

MongoDB for the PHP Mind, Part 3

This is part 3 in a series, which will focus on the data modeling aspect of working with document databases. The previous parts are also available for reading: Part 1: Getting Started, and Part 2: Queries and Indexes.

The Usual Suspects

Although there are plenty of existing articles, presentations and webcasts about modeling your data to take advantage of a document database, this post is taking a slightly PHP-centric position as a part of this series. This information is useful to anyone though, regardless of their chosen programming language.

We’re going to use two different scenarios to look at data modeling in the document world, chosen as common examples to illustrate differences in implementation between relational and document databases:

  • Blog. Your garden variety of data, covering posts, comments and tags
  • Private Sale / E-Commerce. Taking a look at needs for orders, users and products

Scenario 1: Getting All Bloggy

I’m kicking off with the tried-and-true blog example, as it is a common frame of reference. One of the biggest challenges in assessing MongoDB is quickly understanding document data modeling, and stepping outside the constraints of relational data modeling. A common scenario helps illustrate that.

Your typical blog will have the following elements to consider:

  • Authors: In most cases this can point at existing user data
  • Posts: Where you store the actual data about each blog post
  • Comments: All blog posts need comments
  • Categories: How you choose to organize your posts by category grouping
  • Tags: Tagging helps people find posts, and helps you link your related posts together

A typical third normal form relational model will produce around six tables. For many-to-many relationships (such as with posts to tags) you need the reference table as well as the link table with the keys. For example, for tags you might have the posts table, the tags table, and a posts_tags table that simply links post_id with tag_id for each reference. This not only complicates your data model, but it shows a disconnect between relational theory and document structure.

Here’s an example of how your posts might hook into tags, using a relational model:

image

This approach can complicate your code, as you have to write more sophisticated queries to join multiple tables together. If your application is also expected to create new posts and tags, then that logic will be more complex as well. If you are using a framework, you might find yourself spending more time figuring out how to deal with the database inserts and updates, as opposed to actually writing code.

For instance, every time you create a new blog post, you’re going to have to do the following:

  • First check if that category exists, if not then create it
  • Check for tags, create them too
  • Create link between post and tag, for each tag used for this article
  • Link to author

Comments of course happen after the post is live, so inserting them is a little less effort:

  • Check to see if this comment is responding to another comment
  • Insert comment

With MongoDB you have a few approaches to solve this problem.

Equi-join and Embedded Lists

Taking our example with posts and tags, you can remove the table in the middle by storing a list of tag_ids in your post document. Here is how this might look with a document model:

    > db.posts.findOne();
    {
        "_id" : ObjectId("508d27069cc1ae293b36928d"),
        "title" : "This is the title",
        "body" : "This is the body text.",
        "tags" : [
            ObjectId("508d35349cc1ae293b369299"),
            ObjectId("508d35349cc1ae293b36929a"),
            ObjectId("508d35349cc1ae293b36929b"),
            ObjectId("508d35349cc1ae293b36929c")
        ],
        "created_date" : ISODate("2012-10-28T12:41:39.110Z"),
        "author_id" : ObjectId("508d280e9cc1ae293b36928e"),
        "category_id" : ObjectId("508d29709cc1ae293b369295"),
        "comments" : [
            ObjectId("508d359a9cc1ae293b3692a0"),
            ObjectId("508d359a9cc1ae293b3692a1"),
            ObjectId("508d359a9cc1ae293b3692a2")
        ]
    }

This approach presumes that you are storing your tags and comments in their own collections. Of course your users are in a separate collection as well. With MongoDB, an equi-join works just like in the relational model; however, you have to perform a separate query to fetch that data.
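
To make this concrete, here is a minimal mongo shell sketch of those extra round trips, reusing the _id values from the example document above (the tags and comments collection names are assumptions for illustration):

    > var post = db.posts.findOne({ _id : ObjectId("508d27069cc1ae293b36928d") });
    > // a second query fetches the referenced tag documents in one round trip
    > db.tags.find({ _id : { $in : post.tags } });
    > // and another fetches the referenced comments
    > db.comments.find({ _id : { $in : post.comments } });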

You might be asking yourself: why is running separate queries a better idea than one SQL statement with a few joins? Think about the query cache in a relational database. That one query hits multiple tables, and as soon as any one of them is updated, the query gets dropped from the cache.

MongoDB caches queries too; however, with separate queries for each collection, an update to one of those collections does not invalidate the cache for the others. For example, a user updating their password will not touch the cached post, comments, categories, or tags. That same event in a relational database will drop the query from the cache, even though the data being updated had nothing in particular to do with that query.

One last comment about the monster SQL statement approach: Many platforms and frameworks are breaking out the logic that pulls the content for a page, and separating that from the blocks that typically populate the sidebars and footer. For example, the comments are always rendered from a separate module. If you are using a complex, heavy platform that means you have to run separate queries for the post and comments anyway, as the comments module won’t have access to the post content object.

The simplest example is running a single query to fetch the content of your blog post and render it in the main body of your app, and then running a separate query to grab all the comments and display them in a separate comments module at the bottom of your content area. Although you can still enforce relational integrity, at this point you are getting minimal benefit from a relational engine, as you are displaying related data from separate queries. Some modern platforms will do this with everything, including separating queries for authors, categories, and tags, so you’re running separate queries in the end regardless of database platform.

Embedded Lists, No Join

You could also just embed all of your tags and comments in each post, dropping your count to just the posts collection, and no longer needing a separate collection for comments or tags.

    > db.posts.findOne();
    {
        "_id" : ObjectId("508d27069cc1ae293b36928d"),
        "title" : "This is the title",
        "body" : "This is the body text.",
        "tags" : [
            "chocolate",
            "spleen",
            "piano",
            "spatula"
        ],
        "created_date" : ISODate("2012-10-28T12:41:39.110Z"),
        "author_id" : ObjectId("508d280e9cc1ae293b36928e"),
        "category_id" : ObjectId("508d29709cc1ae293b369295"),
        "comments" : [
            {
                "subject" : "This is coment 1",
                "body" : "This is the body of comment 1.",
                "author_id" : ObjectId("508d345f9cc1ae293b369296"),
                "created_date" : ISODate("2012-10-28T13:34:23.929Z")
            },
            {
                "subject" : "This is coment 2",
                "body" : "This is the body of comment 2.",
                "author_id" : ObjectId("508d34739cc1ae293b369297"),
                "created_date" : ISODate("2012-10-28T13:34:43.192Z")
            },
            {
                "subject" : "This is coment 3",
                "body" : "This is the body of comment 3.",
                "author_id" : ObjectId("508d34839cc1ae293b369298"),
                "created_date" : ISODate("2012-10-28T13:34:59.336Z")
            }
        ]
    }

This greatly speeds the assembly of data for rendering that page, as one query returns the post, tags and comments.

So which approach is better?

With document design you need to consider two things: scale and search.

  • Scale: MongoDB documents have a limit of 16MB, which, although it sounds quite small, can accommodate thousands of embedded documents. However, if you are expecting 20,000 comments per post on a high-traffic website, or your comments are unlimited in size, then embedding might not work well for you.
  • Search: Depending on how you want to find your documents, you should consider their structure. MongoDB makes it dead simple to embed lists and even other documents, but if you find yourself constantly reaching deeper and deeper to find the data you need, then performance can become a problem as your data set grows.

You must weigh both of these to decide what approach makes the most sense for your application’s needs.
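
On the search point, it helps to see how embedded fields are queried. A quick mongo shell sketch against the embedded example above (field names reused from that document; the queries are only illustrative):

    > // find posts carrying a given tag; an index on an array field becomes a multikey index
    > db.posts.ensureIndex({ tags : 1 });
    > db.posts.find({ tags : "piano" });
    > // reach into embedded comments with dot notation
    > db.posts.find({ "comments.author_id" : ObjectId("508d345f9cc1ae293b369296") });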

Outro, or What’s Coming Next

The next article in this series covers Example Scenario 2: Private Sales. That article examines the issues of managing inventory and making atomic updates.

Password Authentication with Mongoose (Part 2): Account Locking

This post is Part 2 (of 2) on implementing secure username/password authentication for your Mongoose User models. In Part 1 we implemented one-way password encryption and verification using bcrypt. Here in Part 2 we’ll discuss how to prevent brute-force attacks by enforcing a maximum number of failed login attempts. This was originally posted on the DevSmash Blog.

Quick Review

If you haven’t done so already, I recommend you start with reading Part 1. However, if you’re like me and usually gloss over the paragraph text looking for code, here’s what our User model looked like when we left off:

var mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    bcrypt = require('bcrypt'),
    SALT_WORK_FACTOR = 10;

var UserSchema = new Schema({
    username: { type: String, required: true, index: { unique: true } },
    password: { type: String, required: true }
});

UserSchema.pre('save', function(next) {
    var user = this;

    // only hash the password if it has been modified (or is new)
    if (!user.isModified('password')) return next();

    // generate a salt
    bcrypt.genSalt(SALT_WORK_FACTOR, function(err, salt) {
        if (err) return next(err);

        // hash the password using our new salt
        bcrypt.hash(user.password, salt, function(err, hash) {
            if (err) return next(err);

            // override the cleartext password with the hashed one
            user.password = hash;
            next();
        });
    });
});

UserSchema.methods.comparePassword = function(candidatePassword, cb) {
    bcrypt.compare(candidatePassword, this.password, function(err, isMatch) {
        if (err) return cb(err);
        cb(null, isMatch);
    });
};

module.exports = mongoose.model('User', UserSchema);

As can be seen, there’s not much to it - we hash passwords before documents are saved to MongoDB, and we provide a basic convenience method for comparing passwords later on.

Why do we Need Account Locking?

While our code from Part 1 is functional, it can definitely be improved upon. Hashing passwords will save your bacon if a hacker gains access to your database, but it does nothing to prevent brute-force attacks against your site’s login form. This is where account locking comes in: after a specific number of failed login attempts, we simply ignore subsequent attempts, thereby putting the kibosh on the brute-force attack.

Unfortunately, this still isn’t perfect. As stated by OWASP:

Password lockout mechanisms have a logical weakness. An attacker that undertakes a large number of authentication attempts on known account names can produce a result that locks out entire blocks of application user accounts.

The prescribed solution, then, is to continue to lock accounts when a likely attack is encountered, but then unlock the account after some time has passed. Given that a sensible password policy puts the password search space into the hundreds of trillions (or better), we don’t need to be too worried about allowing another five guesses every couple of hours or so.

Requirements

In light of the above, let’s define our account locking requirements:

  1. A user’s account should be “locked” after some number of consecutive failed login attempts
  2. A user’s account should become unlocked once a sufficient amount of time has passed
  3. The User model should expose the reason for a failed login attempt to the application (though not necessarily to the end user)

Step 1: Keeping Track of Failed Login Attempts and Account Locks

In order to satisfy our first and second requirements, we’ll need a way to keep track of failed login attempts and, if necessary, how long an account is locked for. An easy solution for this is to add a couple properties to our User model:

var UserSchema = new Schema({
    // existing properties
    username: { type: String, required: true, index: { unique: true } },
    password: { type: String, required: true },
    // new properties
    loginAttempts: { type: Number, required: true, default: 0 },
    lockUntil: { type: Number }
});

loginAttempts will store how many consecutive failures we have seen, and lockUntil will store a timestamp indicating when we may stop ignoring login attempts.
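
For instance, after enough failed logins to trigger a lock, a stored user document might look something like this from the mongo shell (the values and ObjectId are purely illustrative; by default Mongoose stores the User model in the users collection):

> db.users.findOne({ username : "jmar777" }, { password : 0 })
{
    "_id" : ObjectId("5091c3d1e894ab16d1eb3e21"),
    "username" : "jmar777",
    "loginAttempts" : 5,
    "lockUntil" : 1351430000000
}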

Step 2: Defining Failed Login Reasons

In order to satisfy our third requirement, we’ll need some way to represent why a login attempt has failed. Our User model only has three reasons it needs to keep track of:

  1. The specified user was not found in the database
  2. The provided password was incorrect
  3. The maximum number of login attempts has been exceeded

Any other reason for a failed login will simply be an error scenario. To describe these reasons, we’re going to kick it old school with a faux-enum:

// expose enum on the model
UserSchema.statics.failedLogin = {
    NOT_FOUND: 0,
    PASSWORD_INCORRECT: 1,
    MAX_ATTEMPTS: 2
};

Please note that it is almost always a bad idea to tell the end user why a login has failed. It may be acceptable to communicate that the account has been locked due to reason 3, but you should consider doing this via email if at all possible.

Step 3: Encapsulating the Login Process

Lastly, let’s make life easier on the consuming code base by encapsulating the whole login process. Given that our security requirements have become much more sophisticated, we’ll allow external code to interact through a single User.getAuthenticated() static method. This method will operate as follows:

  1. User.getAuthenticated() accepts a username, a password, and a callback (cb)
  2. If the provided credentials are valid, then the matching user is passed to the callback
  3. If the provided credentials are invalid (or maximum login attempts has been reached), then null is returned instead of the user, along with an appropriate enum value
  4. If an error occurs anywhere in the process, we maintain the standard “errback” convention

We’ll also be adding a new helper method (user.incLoginAttempts()) and a virtual property (user.isLocked) to help us out internally.

Because our User model is starting to get somewhat large, I’m just going to jump straight to the end result with everything included:

var mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    bcrypt = require('bcrypt'),
    SALT_WORK_FACTOR = 10,
    // these values can be whatever you want - we're defaulting to a
    // max of 5 attempts, resulting in a 2 hour lock
    MAX_LOGIN_ATTEMPTS = 5,
    LOCK_TIME = 2 * 60 * 60 * 1000;

var UserSchema = new Schema({
    username: { type: String, required: true, index: { unique: true } },
    password: { type: String, required: true },
    loginAttempts: { type: Number, required: true, default: 0 },
    lockUntil: { type: Number }
});

UserSchema.virtual('isLocked').get(function() {
    // check for a future lockUntil timestamp
    return !!(this.lockUntil && this.lockUntil > Date.now());
});

UserSchema.pre('save', function(next) {
    var user = this;

    // only hash the password if it has been modified (or is new)
    if (!user.isModified('password')) return next();

    // generate a salt
    bcrypt.genSalt(SALT_WORK_FACTOR, function(err, salt) {
        if (err) return next(err);

        // hash the password using our new salt
        bcrypt.hash(user.password, salt, function (err, hash) {
            if (err) return next(err);

            // set the hashed password back on our user document
            user.password = hash;
            next();
        });
    });
});

UserSchema.methods.comparePassword = function(candidatePassword, cb) {
    bcrypt.compare(candidatePassword, this.password, function(err, isMatch) {
        if (err) return cb(err);
        cb(null, isMatch);
    });
};

UserSchema.methods.incLoginAttempts = function(cb) {
    // if we have a previous lock that has expired, restart at 1
    if (this.lockUntil && this.lockUntil < Date.now()) {
        return this.update({
            $set: { loginAttempts: 1 },
            $unset: { lockUntil: 1 }
        }, cb);
    }
    // otherwise we're incrementing
    var updates = { $inc: { loginAttempts: 1 } };
    // lock the account if we've reached max attempts and it's not locked already
    if (this.loginAttempts + 1 >= MAX_LOGIN_ATTEMPTS && !this.isLocked) {
        updates.$set = { lockUntil: Date.now() + LOCK_TIME };
    }
    return this.update(updates, cb);
};

// expose enum on the model, and provide an internal convenience reference 
var reasons = UserSchema.statics.failedLogin = {
    NOT_FOUND: 0,
    PASSWORD_INCORRECT: 1,
    MAX_ATTEMPTS: 2
};

UserSchema.statics.getAuthenticated = function(username, password, cb) {
    this.findOne({ username: username }, function(err, user) {
        if (err) return cb(err);

        // make sure the user exists
        if (!user) {
            return cb(null, null, reasons.NOT_FOUND);
        }

        // check if the account is currently locked
        if (user.isLocked) {
            // just increment login attempts if account is already locked
            return user.incLoginAttempts(function(err) {
                if (err) return cb(err);
                return cb(null, null, reasons.MAX_ATTEMPTS);
            });
        }

        // test for a matching password
        user.comparePassword(password, function(err, isMatch) {
            if (err) return cb(err);

            // check if the password was a match
            if (isMatch) {
                // if there's no lock or failed attempts, just return the user
                if (!user.loginAttempts && !user.lockUntil) return cb(null, user);
                // reset attempts and lock info
                var updates = {
                    $set: { loginAttempts: 0 },
                    $unset: { lockUntil: 1 }
                };
                return user.update(updates, function(err) {
                    if (err) return cb(err);
                    return cb(null, user);
                });
            }

            // password is incorrect, so increment login attempts before responding
            user.incLoginAttempts(function(err) {
                if (err) return cb(err);
                return cb(null, null, reasons.PASSWORD_INCORRECT);
            });
        });
    });
};

module.exports = mongoose.model('User', UserSchema);

Sample Usage

Assuming that you’ve saved the above code as user-model.js, here’s how you would go about using it:

var mongoose = require('mongoose'),
    User = require('./user-model');

var connStr = 'mongodb://localhost:27017/mongoose-bcrypt-test';
mongoose.connect(connStr, function(err) {
    if (err) throw err;
    console.log('Successfully connected to MongoDB');
});

// create a new user
var testUser = new User({
    username: 'jmar777',
    password: 'Password123'
});

// save user to database
testUser.save(function(err) {
    if (err) throw err;

    // attempt to authenticate user
    User.getAuthenticated('jmar777', 'Password123', function(err, user, reason) {
        if (err) throw err;

        // login was successful if we have a user
        if (user) {
            // handle login success
            console.log('login success');
            return;
        }

        // otherwise we can determine why we failed
        var reasons = User.failedLogin;
        switch (reason) {
            case reasons.NOT_FOUND:
            case reasons.PASSWORD_INCORRECT:
                // note: these cases are usually treated the same - don't tell
                // the user *why* the login failed, only that it did
                break;
            case reasons.MAX_ATTEMPTS:
                // send email or otherwise notify user that account is
                // temporarily locked
                break;
        }
    });
});

Thanks for reading!

Jeremy Martin is the creator of (recently launched) DevSmash.com, a software developer and Open Source Evangelist at his day job, a Node.js contributor, MongoDB fan boy, and husband to the greatest gal on the planet. Online he goes by @jmar777.

ReactiveMongo for Scala: Unleashing MongoDB Streaming capabilities for Realtime Web

This is a guest post by Stéphane Godbillon, software architect at Zenexity.

I am very excited to introduce ReactiveMongo, a brand new Scala driver for MongoDB. More than just yet another async driver, it’s a reactive driver that allows you to design very scalable applications, unleashing MongoDB capabilities like streaming infinite live collections and files for modern Realtime Web applications.

What does reactive mean?

I/O operations may be handled in the following ways:

  • synchronously: each time a request is sent, the running thread is blocked, waiting for the response to arrive. When the response is received, the execution flow resumes.
  • asynchronously: the code handling the response may not be run immediately (it is often in a closure). A thread may still be blocked, but it may not be the same thread that sent the request.
  • non-blocking: sending a request does not block any thread.

Scalability - asynchronous and non-blocking requests

Synchronous database drivers do not perform very well in heavily loaded applications - they spend a lot of time waiting for a response. When you have 2,000 or more clients each performing several non-trivial database requests, such a component becomes the main bottleneck of your application. Moreover, what’s the point of using a nifty, powerful, fully asynchronous/non-blocking web framework (like Play!) if all your database accesses are blocking?

The problem is, almost all MongoDB drivers perform synchronous I/O. In particular, there are no fully implemented non-blocking drivers on the JVM. A year ago, Brendan McAdams from 10gen did actually start an initiative to implement the Mongo protocol in a non-blocking manner, Hammersmith. This gave us the motivation to work on ReactiveMongo.

ReactiveMongo is designed to avoid any kind of blocking request. Every operation returns immediately, freeing the running thread and resuming execution when it is over. Accessing the database is not a bottleneck anymore.

Streaming capability - Iteratee/Enumerator to the rescue

More than enabling non-blocking database requests, ReactiveMongo is streaming capable. Everywhere it makes sense, you can provide streams of documents instead of one, and you can get streams of documents instead of a classic cursor or list. It is very powerful when you want to serve many clients at the same time, without filling up the memory and in an efficient way. This feature is crucial to build modern web applications. Indeed, the future of the web is in streaming data to a very large number of clients simultaneously. Twitter Stream API is a good example of this paradigm shift that is radically altering the way data is consumed all over the web.

ReactiveMongo enables you to build such a web application right now. It allows you to stream data both into and from your MongoDB servers. Using the well known Iteratee/Enumerator pattern, MongoDB can now be seen as a consumer (or producer) of streams of documents.

Here are some examples of what you can do with ReactiveMongo and Iteratees/Enumerators.

Stream a capped collection through a websocket

With ReactiveMongo, you can stream the content of a collection. That’s particularly useful when dealing with websockets. For example, you can get a tailable cursor over a capped collection, asynchronously fetch newly inserted documents and push them into a websocket. There is a good example of doing this here.

Perform bulk inserts from a streaming source

Imagine that your application continuously receives a lot of new documents to insert. Instead of making one database call for each, it is better to build a bulk of documents and then send it to the database. ReactiveMongo enables you to do this in a perfectly non-blocking way.

Store and read files with MongoDB, the streaming way

MongoDB provides a way to deal with files, GridFS. That’s great when you want to read and write files without directly using the filesystem, and to take advantage of MongoDB’s replication/sharding capabilities. But in order to be very scalable, a web application should be able to stream files without filling up the memory unnecessarily or blocking one thread per upload/download. ReactiveMongo is the perfect tool to do this. It is designed to be natively streaming capable.

ReactiveMongo’s design principles

All database operations must be non-blocking

All the database operations that ReactiveMongo performs are non-blocking. No thread may be blocked waiting for a response.

Implement Streams (Enumerators/Iteratees) everywhere it makes sense

Streaming capability is essential for building modern, reactive web applications. ReactiveMongo implements it with enumerators and iteratees (immutable Producer/Consumer pattern). The API is designed to provide this capability for every feature where it may be interesting: cursors, (bulk) insertion, GridFS…

Stay as close as possible to the Mongo Wire Protocol

Just as modern web frameworks try to conform to the HTTP protocol in their architecture (notably by being RESTful), ReactiveMongo is designed to stay as close as possible to the Mongo Wire Protocol.

Provide a well-designed API

ReactiveMongo’s API is designed to be very easy to play with, and to make it easier to build libraries on top of it.

Developer preview, roadmap and examples

ReactiveMongo is still under heavy development, but the following are already implemented:

  • ReplicaSet support
  • Authentication support
  • GridFS support (streaming capable)
  • Cursors (providing a stream of documents)
  • Bulk inserts
  • Database commands support
  • Indexing operations

You can get the code from the Github repository. The scaladoc is also available.

Note that it targets Scala 2.10, but can be tested with Scala 2.9.2 right now.

Roadmap

The first release is expected within the coming weeks. What remains to do:

  • Implement the missing database commands
  • Complete the API
  • Extensive testing

Examples

There is a tutorial on writing a web application with ReactiveMongo and Play Framework. There are also some good examples that cover ReactiveMongo’s capabilities:

Community

There is a ReactiveMongo Google Group. Don’t hesitate to post your questions, bug reports and comments. All kinds of contributions are very welcome.

How MongoDB's Journaling Works

This was originally posted to Kristina Chodorow’s blog, Snail in a Turtleneck

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.

image

Good idea, Patrick! So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:

image

When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory addresses 1,000,000-1,002,000. So, if you read the memory at memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)

image

This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds. However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.

image

Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk. Now, when you do a write, mongod writes this to the private view.

image

mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.

image

The journal appends each change description it gets.

image

At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet. The journal will then replay this change on the shared view.

image

Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.

image

The last step is that mongod remaps the shared view to the private view. This prevents the private view from getting too “dirty” (having too many changes from the shared view it was mapped from).

image

And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).

Password Authentication with Mongoose Part 1

This post is Part 1 (of 2) on implementing secure username/password authentication for your Mongoose User models, originally posted on Jeremy Martin’s DevSmash Blog. In this first installment, we will discuss how to implement one-way encryption of user passwords with bcrypt, and how to subsequently use the encrypted password for login verification.

Update: Password Authentication with Mongoose (Part 2): Account Locking is now live!

Cast of Characters

Mongoose

From the Mongoose GitHub repo: “Mongoose is a MongoDB object modeling tool designed to work in an asynchronous environment.”

In other words, Mongoose provides a model layer for interacting with your MongoDB collections from Node. This model layer provides a common location for implementing document validation, persistence indirection, and other logic that should be abstracted from the business layer.
Website: http://mongoosejs.com/

node.bcrypt.js

node.bcrypt.js is, well, bcrypt for Node. If you’re not familiar with bcrypt and why it’s a good thing, then I highly recommend Coda Hale’s excellent article on how to safely store a password.
Website: https://github.com/ncb000gt/node.bcrypt.js/

Objectives

Before we get into the code, let’s identify some objectives/requirements in our initial username/password authentication implementation:

  • The User model should fully encapsulate the password encryption and verification logic

  • The User model should ensure that the password is always encrypted before saving

  • The User model should be resistant to program logic errors, like double-encrypting the password on user updates

  • bcrypt interactions should be performed asynchronously to avoid blocking the event loop (bcrypt also exposes a synchronous API)

Step 1: The User Model

Even if you aren’t too familiar with Mongoose schemas and models, the code below should be fairly easy to follow. It wouldn’t be a bad idea to read through some of the basics though. To start things off, let’s create our bare bones representation of a user; for the sake of this article, all we need is a username and a password:

var mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    bcrypt = require('bcrypt'),
    SALT_WORK_FACTOR = 10;

var UserSchema = new Schema({
    username: { type: String, required: true, index: { unique: true } },
    password: { type: String, required: true }
});

module.exports = mongoose.model('User', UserSchema);

Step 2: Password Hashing Middleware

You might have noticed the unused bcrypt and SALT_WORK_FACTOR references above - we’ll be using them in this step. As a quick refresher, the purpose of the salt is to defeat rainbow table attacks and to resist brute-force attacks in the event that someone has gained access to your database. bcrypt in particular uses a key setup phase that is derived from Blowfish.

For the purposes of this article, all you need to know about that is that the key setup phase is very computationally expensive, which is actually a good thing when trying to thwart brute-force attacks. How expensive depends on how many rounds or iterations the key setup phase uses - this is where our SALT_WORK_FACTOR comes into play. The default used by node.bcrypt.js is 10, so I went ahead and made our explicit value the same.
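
If you are curious what that cost looks like on your own hardware, here is a small timing sketch using bcrypt's synchronous API (a hypothetical snippet for experimentation only, not something to run inside a request handler):

var bcrypt = require('bcrypt');

// hash the same password at increasing work factors and time each one
[8, 10, 12].forEach(function(rounds) {
    var start = Date.now();
    bcrypt.hashSync('Password123', bcrypt.genSaltSync(rounds));
    console.log('work factor ' + rounds + ': ' + (Date.now() - start) + 'ms');
});

Each increment of the work factor roughly doubles the time spent in the key setup phase, which is exactly what makes brute-forcing hashes painful for an attacker.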

Alright, let’s get back to the code.

The first thing we want to add to our User model is some Mongoose middleware that will automatically hash the password before it’s saved to the database. Here’s what that looks like:

UserSchema.pre('save', function(next) {
    var user = this;

    // only hash the password if it has been modified (or is new)
    if (!user.isModified('password')) return next();

    // generate a salt
    bcrypt.genSalt(SALT_WORK_FACTOR, function(err, salt) {
        if (err) return next(err);

        // hash the password along with our new salt
        bcrypt.hash(user.password, salt, function(err, hash) {
            if (err) return next(err);

            // override the cleartext password with the hashed one
            user.password = hash;
            next();
        });
    });
});

The above code will accomplish our goal of always hashing the password when a document is saved to the database. There are a couple of things to be aware of though:

  • Because passwords are not hashed until the document is saved, be careful if you’re interacting with documents that were not retrieved from the database, as any passwords will still be in cleartext.
  • Mongoose middleware is not invoked on update() operations, so you must use a save() if you want to update user passwords.

Step 3: Password Verification

Now that we have our User model and we’re hashing passwords, the only thing left is to implement password verification. Adding this to our model turns out to be just a few more lines of code:

UserSchema.methods.comparePassword = function(candidatePassword, cb) {
    bcrypt.compare(candidatePassword, this.password, function(err, isMatch) {
        if (err) return cb(err);
        cb(null, isMatch);
    });
};

Simple enough.

Altogether Now

Here’s what our User model looks like after adding our middleware and password verification method:

var mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    bcrypt = require('bcrypt'),
    SALT_WORK_FACTOR = 10;

var UserSchema = new Schema({
    username: { type: String, required: true, index: { unique: true } },
    password: { type: String, required: true }
});

UserSchema.pre('save', function(next) {
    var user = this;

    // only hash the password if it has been modified (or is new)
    if (!user.isModified('password')) return next();

    // generate a salt
    bcrypt.genSalt(SALT_WORK_FACTOR, function(err, salt) {
        if (err) return next(err);

        // hash the password using our new salt
        bcrypt.hash(user.password, salt, function(err, hash) {
            if (err) return next(err);

            // override the cleartext password with the hashed one
            user.password = hash;
            next();
        });
    });
});

UserSchema.methods.comparePassword = function(candidatePassword, cb) {
    bcrypt.compare(candidatePassword, this.password, function(err, isMatch) {
        if (err) return cb(err);
        cb(null, isMatch);
    });
};

module.exports = mongoose.model('User', UserSchema);

Sample Usage

Assuming that you’ve saved the above code as user-model.js, here’s how you would go about testing it:

var mongoose = require('mongoose'),
    User = require('./user-model');

var connStr = 'mongodb://localhost:27017/mongoose-bcrypt-test';
mongoose.connect(connStr, function(err) {
    if (err) throw err;
    console.log('Successfully connected to MongoDB');
});

// create a new user
var testUser = new User({
    username: 'jmar777',
    password: 'Password123'
});

// save user to database
testUser.save(function(err) {
    if (err) throw err;

    // fetch user and test password verification
    User.findOne({ username: 'jmar777' }, function(err, user) {
        if (err) throw err;

        // test a matching password
        user.comparePassword('Password123', function(err, isMatch) {
            if (err) throw err;
            console.log('Password123:', isMatch); // -> Password123: true
        });

        // test a failing password
        user.comparePassword('123Password', function(err, isMatch) {
            if (err) throw err;
            console.log('123Password:', isMatch); // -> 123Password: false
        });
    });
});

Next Steps

This post was just Part 1 on implementing secure username/password authentication for your Mongoose User models. Stay tuned for Part 2, in which we’ll discuss preventing brute-force attacks by enforcing a maximum number of failed login attempts. Thanks for reading!

Jeremy Martin is the creator of (recently launched) DevSmash.com, a software developer and Open Source Evangelist at his day job, a Node.js contributor, MongoDB fan boy, and husband to the greatest gal on the planet. Online he goes by @jmar777. Update: Password Authentication with Mongoose (Part 2): Account Locking is now live!

Fast datetimes in MongoDB

This was originally posted to Mike Friedman’s blog. Mike is a Perl Evangelist at 10gen, working on the Perl Driver for MongoDB

One of the most common complaints about the Perl MongoDB driver is that it tries to be a little too clever. In the current production release of MongoDB.pm (version 0.46.2 as of this writing), all datetime values retrieved by a query are automatically instantiated as DateTime objects. DateTime is a remarkable CPAN distribution. In fact, I would say that DateTime and its related distributions on CPAN comprise one of the best date and time manipulation libraries in any programming language. But that power comes with a cost. The DateTime codebase is large, and instantiating DateTime objects is expensive. The constructor performs a great deal of validation, and creates a large amount of metadata which is stored inside the object.

Upcoming changes to the Perl MongoDB driver solve this problem. Read more below. If you need to perform a series of complex arithmetic operations with dates, then the cost of DateTime is justified. But frequently, all you want is a simple read-only value that is sufficient for displaying to a user or saving elsewhere. If you are running queries involving a large number of documents, the automatic instantiation of thousands of complex objects becomes barrier to performance.

DateTime::Tiny is a lightweight alternative. As its name suggests, it is quite tiny indeed. It does no validation, simply shoving whatever you pass to its constructor into an object. It has a couple methods for outputting formatted dates, and a convenience method for promoting an object to a full DateTime object if required. If you know that your date information came from another process that has already done the validation, and you know that you don’t need to do any manipulation of the date data, then DateTime::Tiny is an excellent choice for speed. When I took over maintenance of the Perl MongoDB driver, I began a project to allow alternatives to the default DateTime method of handling datetime data in MongoDB documents. You can now set a per-connection dt_type attribute if you want DateTime::Tiny objects instead of the default.

But how much faster is it? I set out to find a useful data set with a lot of datetime values that I could import into MongoDB and use for testing. A co-worker suggested the GitHub Archive project, which aggregates voluminous data about various events in public GitHub repositories.

I downloaded 24 JSON files covering the day of April 11, 2012. (I chose that date for the sole reason that it was the date used in the GitHub Archive example queries.) The first step was to examine the JSON data to see what kind of date information was available. To my surprise, I found that the .json files returned by the GitHub Archive are not actually JSON, but streams of JSON objects concatenated together with no delimiter. After figuring out this unusual structure, I was able to coerce the data into a proper JSON array and then format it for human readability via the following bash one-liner.

for file in *.json; do
  perl -MFile::Slurp -MJSON::XS -E \
    '$new = "[" . join( "},{", split /}{/, read_file shift ) . "]";
     print JSON::XS->new->utf8(1)->pretty(1)->encode( decode_json $new )' $file \
  > $file.new;
done

Upon examining the data, I was pleased to see it consisted of many highly-structured documents with plenty of datetimes. In order to store them in MongoDB, the datetime strings inside the JSON objects would need to be parsed into DateTime objects for serialization. To my further surprise, I found that different types of things in the GitHub data had different kinds of datetime formats. Some looked like 2012/04/05 11:37:28 -0700, whereas others looked like 2012-04-11T11:01:37Z. Fortunately, both formats are easy to parse and turn into DateTime objects.

use strict;
use warnings;
use v5.16;

use JSON::XS;
use File::Slurp;
use DateTime;
use MongoDB;

my $file = shift;
my $json = read_file $file;

my $data = JSON::XS->new->utf8(1)->decode( $json );

my $conn = MongoDB::Connection->new;
my $db   = $conn->get_database( 'github' );
my $coll = $db->get_collection( 'events' );

sub traverse { 
    my $node = shift;

    return if not defined $node;

    if ( ref $node eq ref [ ] ) { 
        foreach my $item( @$node ) { 
            traverse( $item ) if ref $item;
        }
    } elsif ( ref $node eq ref { } ) { 
        foreach my $key( keys %$node ) { 
             my $val = $node->{$key};
             traverse( $val ) if ref $val;
             next if not defined $val;

             if ( $key =~ m{(pushed|created|closed|updated|merged)_at} ) { 
                   my $re = $val =~ m{/} 
                     ? qr{(?<year>\d{4}) / (?<month>\d{2}) / (?<day>\d{2}) \s
                          (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) \s
                          (?<time_zone>[+-]\d{4}) }x
                     : qr{(?<year>\d{4}) - (?<month>\d{2}) - (?<day>\d{2}) T
                          (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) Z}x;

                   $val =~ $re;
                   $node->{$key} = DateTime->new( time_zone => 'GMT', %+ );
             }
         }
    } 
}

traverse( $data );


$coll->insert( $_ ) for @$data;

This code traverses the nested structures in every event record, looking for things that look like dates. It then parses them and turns them into DateTimes, so the MongoDB driver will serialize them as such when importing into the database. The resulting documents are then stored in a collection called events. I didn’t bother to create an _id field for each document; I instead relied on the driver to create default ObjectID’s for me. The day’s worth of data results in 133,790 documents in the events collection, each with at least a couple datetimes.

How slow is DateTime compared to DateTime::Tiny or raw datetime strings when fetching thousands of documents? Let’s use Perl’s excellent Benchmark module to find out. The results speak for themselves.

Benchmark: timing 10 iterations of datetime, raw, tiny...
  datetime: 428 wallclock secs (426.24 usr +  1.22 sys = 427.46 CPU) @  0.02/s (n=10)
       raw: 44 wallclock secs (42.58 usr +  0.52 sys = 43.10 CPU) @  0.23/s (n=10)
      tiny: 66 wallclock secs (64.68 usr +  0.99 sys = 65.67 CPU) @  0.15/s (n=10)

Unsurprisingly, raw datetime strings perform the best, taking about one tenth the time of instantiating full DateTime objects. But the more useful DateTime::Tiny performs almost as well, taking only slightly longer than the raw option to construct the same number of date objects.

In conclusion, DateTime::Tiny is a worthy optimization for situations where you are grabbing large numbers of dates from MongoDB which require no manipulation. The ability to specify DateTime::Tiny and raw dt_types will be available in MongoDB’s Perl driver release 0.47.

Motor: Asynchronous Driver for MongoDB and Python

Tornado is a popular asynchronous Python web server. Alas, to connect to MongoDB from a Tornado app requires a tradeoff: You can either use PyMongo and give up the advantages of an async web server, or use AsyncMongo, which is non-blocking but lacks key features.

I decided to fill the gap by writing a new async driver called Motor (for “MOngo + TORnado”), and it’s reached the public alpha stage. Please try it out and tell me what you think. I’ll maintain a homepage for it here, including basic documentation.

Status

Motor is alpha. It is certainly buggy. Its implementation and possibly its API will change in the coming months. I hope you’ll help me by reporting bugs, requesting features, and pointing out how it could be better.

Advantages

Two good projects, AsyncMongo and APyMongo, took the straightforward approach to implementing an async MongoDB driver: they forked PyMongo and rewrote it to use callbacks. But this approach creates a maintenance headache: now every improvement to PyMongo must be manually ported over. Motor sidesteps the problem. It uses a Gevent-like technique to wrap PyMongo and run it asynchronously, while presenting a classic callback interface to Tornado applications. This wrapping means Motor reuses all of PyMongo’s code and, aside from GridFS support, Motor is already feature-complete. Motor can easily keep up with PyMongo development in the future.

Installation

Motor depends on greenlet and, of course, Tornado. It is compatible with CPython 2.5, 2.6, 2.7, and 3.2; and PyPy 1.9. You can get the code from my fork of the PyMongo repo, on the motor branch:

pip install tornado greenlet
pip install git+https://github.com/ajdavis/mongo-python-driver.git@motor

To keep up with development, watch my repo and do

pip install -U git+https://github.com/ajdavis/mongo-python-driver.git@motor

when you want to upgrade.

Example

Here’s an example of an application that can create and display short messages:

import tornado.web, tornado.ioloop
import motor

class NewMessageHandler(tornado.web.RequestHandler):
    def get(self):
        """Show a 'compose message' form"""
        self.write('''
        <form method="post">
            <input type="text" name="msg">
            <input type="submit">
        </form>''')

    # Method exits before the HTTP request completes, thus "asynchronous"
    @tornado.web.asynchronous
    def post(self):
        """Insert a message
        """
        msg = self.get_argument('msg')

        # Async insert; callback is executed when insert completes
        self.settings['db'].messages.insert(
            {'msg': msg},
            callback=self._on_response)

    def _on_response(self, result, error):
        if error:
            raise tornado.web.HTTPError(500, error)
        else:
            self.redirect('/')

class MessagesHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        """Display all messages
        """
        self.write('<a href="/compose">Compose a message</a><br>')
        self.write('<ul>')
        db = self.settings['db']
        db.messages.find().sort([('_id', -1)]).each(self._got_message)

    def _got_message(self, message, error):
        if error:
            raise tornado.web.HTTPError(500, error)
        elif message:
            self.write('<li>%s</li>' % message['msg'])
        else:
            # Iteration complete
            self.write('</ul>')
            self.finish()

db = motor.MotorConnection().open_sync().test

application = tornado.web.Application([
        (r'/compose', NewMessageHandler),
        (r'/', MessagesHandler)
    ], db=db
)

print 'Listening on http://localhost:8888'
application.listen(8888)
tornado.ioloop.IOLoop.instance().start()

Other examples are Chirp, a Twitter-like demo app, and Motor-Blog, which runs this site.

Support

For now, email me directly if you have any questions or feedback.

Roadmap

In the next week I’ll implement the PyMongo feature I’m missing, GridFS. Once the public alpha and beta stages have shaken out the bugs and revealed missing features, Motor will be included as a module in the official PyMongo distribution.

A. Jesse Jiryu Davis

Hadoop Streaming Support for MongoDB

MongoDB has some native data processing tools, such as the built-in JavaScript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistence and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.

Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers that are familiar with MongoDB and popular dynamic programming languages to leverage the power of Hadoop.

It works like this:

Once a developer has Java installed and Hadoop ready to rock, they download and build the adapter. With the adapter built, you compile the streaming assembly, load some data into Mongo, and get down to writing some MapReduce jobs.

The assembly streams data from MongoDB into Hadoop and back out again, running it through the mappers and reducers defined in a language you feel at home with. Cool right?

Ruby support was recently added and is particularly easy to get started with. Let’s take a look at an example where we analyze Twitter data.

Import some data into MongoDB from Twitter:

This script curls the Twitter status stream and pipes the JSON into MongoDB using mongoimport. The mongoimport binary has a couple of flags: “-d”, which specifies the database (“twitter”), and “-c”, which specifies the collection (“in”).

Next, write a Mapper and save it in a file called mapper.rb:

#!/usr/bin/env ruby
require 'mongo-hadoop'

MongoHadoop.map do |document|
  { :_id => document['user']['time_zone'], :count => 1 }
end

The mapper calls the MongoHadoop.map function and passes it a block. This block takes a single argument, document, and emits a hash containing the user’s time zone and a count of 1.

Now, write a Reducer and save it in a file called reducer.rb:

#!/usr/bin/env ruby
require 'mongo-hadoop'

MongoHadoop.reduce do |key, values|
  count = 0

  values.each do |value|
    count += value['count']
  end

  { :_id => key, :count => count }
end

The reducer calls the MongoHadoop.reduce function and passes it a block. This block takes two parameters, a key and an array of values for that key, reduces the values into a single aggregate, and emits a hash with the same key and the newly reduced value.

To run it all, create a shell script that executes hadoop with the streaming assembly jar and tells it how to find the mapper and reducer files as well as where to retrieve and store the data:

hadoop jar mongo-hadoop-streaming-assembly*.jar -mapper mapper.rb -reducer reducer.rb -inputURI mongodb://127.0.0.1/twitter.in -outputURI mongodb://127.0.0.1/twitter.out

Make them all executable by running chmod +x on all the scripts, then run twit.sh to have Hadoop process the job.
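
When the job completes, the results land in the twitter.out collection named in the -outputURI, so a quick way to sanity-check the output is from the mongo shell (a sketch; the count field matches the reducer above):

    > use twitter
    > db.out.find().sort({ count : -1 }).limit(10)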