GIANT Stories at MongoDB

What We're Reading

MongoDB

Company

Here is another roundup of news from the MongoDB community:

ComputerWorld: Four Ideas to Steal from IT Upstarts

Crain’s: These NY Startups are Ripe for IPOs in 2014

DevOps Angle: MongoDB Co-founder Talks Developer Productivity and Big Data

MongoDB Blog: Two-Factor Authentication for MMS Backup via Google Authenticator

MongoDB Blog: MongoDB Extends Its Lead As Industry’s “Best NoSQL Database”…Thanks to You

ScaleGrid: MongoDB as a Service in Your Amazon AWS account

ScaleGrid: Secure Your Mongo Clusters with SSL

MongoHQ: Openshift Quickstart for MongoDB and Ruby 1.9 Using MongoHQ

MongoLab: Tuning MongoDB Performance with MMS

OpenLife.CC: MongoDB Aggregation Framework Fun: Ordering Fabric for Flags

Silicon Angle: MongoDB Makes AWS Easier

The Code Barbarian: Introduction to the MEAN Stack, Part Two: Building And Testing a To-Do List

VentureBeat: Big Data Startups Pull In Big Money in 2013

Schema Design for Social Inboxes in MongoDB

MongoDB

Releases

Designing a schema is a critical part of any application. Like most databases, there are many options for modeling data in MongoDB, and it is important to incorporate the functional requirements and performance goals for your application when determining the best design. In this post, we’ll explore three approaches for using MongoDB when creating social inboxes or message timelines.

If you’re building a social network, like Twitter for example, you need to design a schema that is efficient for users viewing their inbox, as well as users sending messages to all their followers. The whole point of social media, after all, is that you can connect in real time.

There are several design considerations for this kind of application:

  • The application needs to support a potentially large volume of reads and writes.
  • Reads and writes are not uniformly distributed across users. Some users post much more frequently than others, and some users have many, many more followers than others.
  • The application must provide a user experience that is instantaneous.
  • Edit 11/6: The application will have little to no user deletions of data (a follow-up blog post will include information about user deletions and historical data)

Because we are designing an application that needs to support a large volume of reads and writes, we will use a sharded collection for the messages. All three designs include the concept of “fan out,” which refers to distributing the work across the shards in parallel:

  1. Fan out on Read
  2. Fan out on Write
  3. Fan out on Write with Buckets

Each approach presents trade-offs, and you should use the design that is best for your application’s requirements.

The first design you might consider is called Fan Out on Read. When a user sends a message, it is simply saved to the inbox collection. When any user views their own inbox, the application queries for all messages that include the user as a recipient. The messages are returned in descending date order so that users can see the most recent messages.

// Shard on "from"
db.shardCollection("mongodbdays.inbox", { from: 1 })

// Make sure we have an index to handle inbox reads
db.inbox.ensureIndex({ to: 1, sent: 1 })

msg = {
  from: "Joe",
  to: ["Bob", "Jane"],
  sent: new Date(),
  message: "Hi!"
}

// Send a message
db.inbox.save(msg)

// Read Bob's inbox
db.inbox.find({ to: "Bob" }).sort({ sent: -1 })

To implement this design, create a sharded collection called inbox, specifying the from field as the shard key, which represents the address sending the message. You can then add a compound index on the to field and the sent field. Once the document is saved into the inbox, the message is effectively sent to all the recipients. With this approach sending messages is very efficient.

Viewing an inbox, on the other hand, is less efficient. When a user views their inbox the application issues a find command based on the to field, sorted by sent. Because the inbox collection uses from as its shard key, messages are grouped by sender across the shards. In MongoDB, queries that are not based on the shard key are routed to all shards. Therefore, each inbox view is routed to all shards in the system. As the system scales and many users view their inboxes, all of those queries are routed to all shards. This design does not scale as well as one in which each query is routed to a single shard.

With the “Fan Out on Read” method, sending a message is very efficient, but viewing the inbox is less efficient.

Fan out on Read is very efficient for sending messages, but less efficient for reading messages. If the majority of your application consists of users sending messages, but very few go to read what anyone sends them – let’s call it an anti-social app – then this design might work well. However, for most social apps there are more requests by users to view their inbox than there are to send messages.

Fan out on Write takes a different approach that is optimized for viewing inboxes. This time, instead of sharding the inbox collection on the sender, we shard on the message recipient. In this way, when a user views their inbox the query can be routed to a single shard, which scales very well. Our message document is the same, but now we save a copy of the message for every recipient.

// Shard on "recipient" and "sent"
db.shardCollection("mongodbdays.inbox", { recipient: 1, sent: 1 })

msg = {
  from: "Joe",
  to: ["Bob", "Jane"],
  sent: new Date(),
  message: "Hi!"
}

// Send a message (remove _id between inserts so each recipient's copy gets its own)
for (recipient in msg.to) {
  msg.recipient = msg.to[recipient]
  delete msg._id
  db.inbox.insert(msg)
}

// Read Bob's inbox
db.inbox.find({ recipient: "Bob" }).sort({ sent: -1 })

With the “Fan Out on Write” method, viewing the inbox is efficient, but sending messages consumes more resources.

In practice we might implement the saving of messages asynchronously. Imagine two celebrities quickly exchange messages at a high-profile event - the system could quickly be saturated with millions of writes. By saving a first copy of their message, then using a pool of background workers to write copies to all followers, we can ensure the two celebrities can exchange messages quickly, and that followers will soon have their own copies. Furthermore, we could maintain a last-viewed date on the user document to ensure they have accessed the system recently - zombie accounts probably shouldn’t get a copy of the message, and for users that haven’t accessed their account recently we could always resort to our first design - Fan out on Read - to repopulate their inbox. Subsequent requests would then be fast again.
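
A minimal sketch of this asynchronous fan-out follows; the outbox collection, the fanned_out flag and the last_viewed cutoff are illustrative names I am assuming here, not part of the original design:

// Save the message once, quickly, into a queue collection
db.outbox.insert({ from: "Joe", to: ["Bob", "Jane"], sent: new Date(), message: "Hi!", fanned_out: false })

// A background worker claims one unsent message and writes per-recipient copies
var job = db.outbox.findAndModify({
  query: { fanned_out: false },
  update: { $set: { fanned_out: true } }
})
if (job !== null) {
  job.to.forEach(function(recipient) {
    // Skip zombie accounts; their inboxes can be rebuilt on demand with a
    // fan-out-on-read query the next time they log in
    var user = db.users.findOne({ user_name: recipient })
    if (user && user.last_viewed > new Date(Date.now() - 30 * 24 * 3600 * 1000)) {
      db.inbox.insert({ from: job.from, recipient: recipient, sent: job.sent, message: job.message })
    }
  })
}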

At this point we have improved the design for viewing inboxes by routing each inbox view to a single shard. However, each message in the user’s inbox will produce a random read operation. If each inbox view produces 50 random reads, then it only takes a relatively modest number of concurrent users to potentially saturate the disks. Fortunately we can take advantage of the document data model to further optimize this design to be even more efficient.

Fan out on Write with Buckets refines the Fan Out on Write design by “bucketing” messages together into documents of 50 messages ordered by time. When a user views their inbox the request can be fulfilled by reading just a few documents of 50 messages each instead of performing many random reads. Because read time is dominated by seek time, reducing the number of seeks can provide a major performance improvement to the application. Another advantage to this approach is that there are fewer index entries.

To implement this design we create two collections, an inbox collection and a user collection. The inbox collection uses two fields for the shard key, owner and sequence, which holds the owner’s user id and sequence number (i.e. the id of 50-message “bucket” documents in their inbox). The user collection contains simple user documents for tracking the total number of messages in their inbox. Since we will probably need to show the total number of messages for a user in a variety of places in our application, this is a nice place to maintain the count instead of calculating for each request. Our message document is the same as in the prior examples.

// Shard on "owner/sequence"
db.shardCollection("mongodbdays.inbox", { owner: 1, sequence: 1 })
db.shardCollection("mongodbdays.users", { user_name: 1 })

msg = {
  from: "Joe",
  to: ["Bob", "Jane"],
  sent: new Date(),
  message: "Hi!"
}

// Send a message
for (recipient in msg.to) {
  count = db.users.findAndModify({
    query: { user_name: msg.to[recipient] },
    update: { $inc: { msg_count: 1 } },
    upsert: true,
    new: true
  }).msg_count

  sequence = Math.floor(count / 50)

  db.inbox.update(
    { owner: msg.to[recipient], sequence: sequence },
    { $push: { messages: msg } },
    { upsert: true })
}

// Read Bob's inbox
db.inbox.find({ owner: "Bob" })
  .sort({ sequence: -1 }).limit(2)

To send a message we iterate through the list of recipients as we did in the Fan out on Write example, but we also take another step to increment the count of total messages in the inbox of the recipient, which is maintained on the user document. Once we know the count of messages, we know the “bucket” in which to add the latest message. As these messages reach the 50 item threshold, the sequence number increments and we begin to add messages to the next “bucket” document. The most recent messages will always be in the “bucket” document with the highest sequence number. Viewing the most recent 50 messages for a user’s inbox is at most two reads; viewing the most recent 100 messages is at most three reads.

Normally a user’s entire inbox will exist on a single shard. However, it is possible that a few user inboxes could be spread across two shards. Because our application will probably page through a user’s inbox, it is still likely that every query for these few users will be routed to a single shard.
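
For example, paging back to older buckets is just another targeted query on the same shard key; lastSequence here is simply a value the application remembers from the previous page, not a field from the original design:

// Fetch the next older 50-message bucket after the ones already displayed
db.inbox.find({ owner: "Bob", sequence: { $lt: lastSequence } })
  .sort({ sequence: -1 })
  .limit(1)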

Fan out on Write with Buckets is generally the most scalable of these three designs for social inbox applications. Every design presents different trade-offs. In this case viewing a user’s inbox is very efficient, but writes are somewhat more complex, and more disk space is consumed. For many applications these are the right trade-offs to make.

The table below summarizes the trade-offs of the three designs:

|                          | Fan out on Read                              | Fan out on Write                           | Fan out on Write with Buckets               |
| ------------------------ | -------------------------------------------- | ------------------------------------------ | ------------------------------------------- |
| Send Message Performance | Best: single write                           | Good: shard per recipient, multiple writes | Worst: shard per recipient, appends (grows) |
| Read Inbox Performance   | Worst: broadcast to all shards, random reads | Good: single shard, random reads           | Best: single shard, single read             |
| Data Size                | Best: message stored once                    | Worst: copy per recipient                  | Worst: copy per recipient                   |


Schema design is one of the most important optimizations you can make for your application. We have a number of additional resources available on schema design if you are interested in learning more:

Schema Design for Time Series Data in MongoDB

MongoDB

Releases
Before you read on: this article is from 2013, and a lot has happened with MongoDB and time series data in the intervening years. We have much more up-to-date and relevant articles available, especially our 2018 series, Time Series Data and MongoDB. We recommend you read those (or take in the webinar and current white paper) to stay up to date with time series data and MongoDB.

This is a post by Sandeep Parikh, Solutions Architect at MongoDB and Kelly Stirman, Director of Products at MongoDB.

Data as Ticker Tape

New York is famous for a lot of things, including ticker tape parades.

For decades the most popular way to track the price of stocks on Wall Street was through ticker tape, the earliest digital communication medium. Stocks and their values were transmitted via telegraph to a small device called a “ticker” that printed onto a thin roll of paper called “ticker tape.” Although ticker tape has been out of use for over 50 years, the idea of the ticker lives on in the scrolling electronic tickers on brokerage walls and at the bottom of most news networks, sometimes two, three and four levels deep.

Today there are many sources of data that, like ticker tape, represent observations ordered over time. For example:

  • Financial markets generate prices (we still call them “stock ticks”).
  • Sensors measure temperature, barometric pressure, humidity and other environmental variables.
  • Industrial fleets such as ships, aircraft and trucks produce location, velocity, and operational metrics.
  • Status updates on social networks.
  • Calls, SMS messages and other signals from mobile devices.
  • Systems themselves write information to logs.

This data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. It represents a history of what happened, and there are a number of use cases that involve analyzing this history to better predict what may happen in the future or to establish operational thresholds for the system.

Time Series Data and MongoDB

Time series data is a great fit for MongoDB. There are many examples of organizations using MongoDB to store and analyze time series data. Here are just a few:

  • Silver Spring Networks, the leading provider of smart grid infrastructure, analyzes utility meter data in MongoDB.
  • EnerNOC analyzes billions of energy data points per month to help utilities and private companies optimize their systems, ensure availability and reduce costs.
  • Square maintains a MongoDB-based open source tool called Cube for collecting timestamped events and deriving metrics.
  • Server Density uses MongoDB to collect server monitoring statistics.
  • Appboy, the leading platform for mobile relationship management, uses MongoDB to track and analyze billions of data points on user behavior.
  • Skyline Innovations, a solar energy company, stores and organizes meteorological data from commercial scale solar projects in MongoDB.
  • One of the world’s largest industrial equipment manufacturers stores sensor data from fleet vehicles to optimize fleet performance and minimize downtime.

In this post, we will take a closer look at how to model time series data in MongoDB by exploring the schema of a tool that has become very popular in the community: MongoDB Management Service (MMS). MMS helps users manage their MongoDB systems by providing monitoring, visualization and alerts on over 100 database metrics. Today the system monitors over 25k MongoDB servers across thousands of deployments. Every minute thousands of local MMS agents collect system metrics and ship the data back to MMS. The system processes over 5B events per day, and over 75,000 writes per second, all on less than 10 physical servers for the MongoDB tier.

Schema Design and Evolution

How do you store time series data in a database? In relational databases the answer is somewhat straightforward; you store each event as a row within a table. Let’s say you were monitoring the amount of system memory used per second. In that example you would have a table and rows that looked like the following:

| timestamp                | memory_used |
| ------------------------ | ----------- |
| 2013-10-10T23:06:37.000Z | 1000000     |
| 2013-10-10T23:06:38.000Z | 2000000     |


If we map that storage approach to MongoDB, we would end up with one document per event:

{
  timestamp: ISODate("2013-10-10T23:06:37.000Z"),
  type: "memory_used",
  value: 1000000
},
{
  timestamp: ISODate("2013-10-10T23:06:38.000Z"),
  type: "memory_used",
  value: 15000000
}

While this approach is valid in MongoDB, it doesn’t take advantage of the expressive nature of the document model. Let’s take a closer look at how we can refine the model to provide better performance for reads and to improve storage efficiency.

The Document-Oriented Design

A better approach looks like the following schema, which is not identical to the one MMS uses but will help illustrate the key concepts. Let’s call it the document-oriented design:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    ...
    37: 1000000,
    38: 1500000,
    ...
    59: 2000000
  }
}

We store multiple readings in a single document: one document per minute. To further improve the efficiency of the schema, we can isolate repeating data structures. In the `timestamp_minute` field we capture the minute that identifies the document, and for each memory reading we store a new value in the `values` sub-document. Because we are storing one value per second, we can simply represent each second as fields 0 through 59.

More Updates than Inserts

In any system there may be tradeoffs regarding the efficiency of different operations, such as inserts and updates. For example, in some systems updates are implemented as copies of the original record written out to a new location, which requires updating of indexes as well. One of MongoDB’s core capabilities is the in-place update mechanism: field-level updates are managed in place as long as the size of the document does not grow significantly. By avoiding rewriting the entire document and index entries unnecessarily, far less disk I/O is performed. Because field-level updates are efficient, we can design for this advantage in our application: with the document-oriented design there are many more updates (one per second) than inserts (one per minute).

For example, if you wanted to maintain a count in your application, MongoDB provides a handy operator that increments or decrements a field. Instead of reading a value into your application, incrementing, then writing the value back to the database, you can simply increase the field using $inc:

```{ $inc: { pageviews: 1 } }```

This approach has a number of advantages: first, the increment operation is atomic - multiple threads can safely increment a field concurrently using $inc. Furthermore, this approach is more efficient for disk operations, requires less data to be sent over the network, and avoids round trips by removing the need for any reads. Those are three big wins that result in a simpler, more efficient and more scalable system. The same advantages apply to the use of the $set operator.

The document-oriented design has several benefits for writing and reading. As previously stated, writes can be much faster as field-level updates: instead of writing a full document we send a much smaller delta update, which can be modeled like so:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  { $set: { "values.59": 2000000 } }
)

With the document-oriented design reads are also much faster. If you needed an hour’s worth of measurements using the first approach you would need to read 3600 documents, whereas with this approach you would only need to read 60 documents. Reading fewer documents means fewer disk seeks, and with any system fewer disk seeks usually results in significantly better performance.
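
As a small sketch against the metrics collection used in the update above (and assuming an index on { type: 1, timestamp_minute: 1 }), an hour of per-second data becomes a single range query over at most 60 per-minute documents:

// One hour of per-second readings: at most 60 per-minute documents
db.metrics.find({
  type: "memory_used",
  timestamp_minute: {
    $gte: ISODate("2013-10-10T23:00:00.000Z"),
    $lt:  ISODate("2013-10-11T00:00:00.000Z")
  }
})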

A natural extension to this approach would be to have documents that span an entire hour, while still keeping the data resolution per second:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    1: 1000000,
    ...,
    3598: 1500000,
    3599: 2000000
  }
}

One benefit to this approach is that we can now access an hour’s worth of data using a single read. However, there is one significant downside: to update the last second of any given hour MongoDB would have to walk the entire length of the “values” object, taking 3600 steps to reach the end. We can further refine the model a bit to make this operation more efficient:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: { 0: 999999, 1: 999999, ..., 59: 1000000 },
    1: { 0: 2000000, 1: 2000000, ..., 59: 1000000 },
    ...,
    58: { 0: 1600000, 1: 1200000, ..., 59: 1100000 },
    59: { 0: 1300000, 1: 1400000, ..., 59: 1500000 }
  }
}
db.metrics.update(
  {
    timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
    type: "memory_used"
  },
  { $set: { "values.59.59": 2000000 } }
)

MMS Implementation

In MMS users have flexibility to view monitoring data at varying levels of granularity. These controls appear at the top of the monitoring page:

These controls inform the schema design for MMS, and how the data needs to be displayed. In MMS, different resolutions have corresponding range requirements - for example, if you specify that you want to analyze monitoring data at the granularity of “1 hr” instead of “1 min” then the ranges also become less granular, changing from hours to days, weeks and months:

To satisfy this approach in a scalable manner and keep data retention easy to manage, MMS organizes monitoring data to be very efficient for reads by maintaining copies at varying degrees of granularity. The document model allows for efficient use of space, so the tradeoff is very reasonable, even for a system as large as MMS. As data ages out, collections that are associated with ranges of time are simply dropped, which is a very efficient operation. Collections are created to represent future ranges of time, and these will eventually be dropped as well. This cycle maintains a rolling window of history associated with the functionality provided by MMS.
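
A minimal sketch of that retention cycle is shown below; the per-month collection names are hypothetical and not taken from MMS itself:

// Pre-create the collection for an upcoming time range...
db.createCollection("metrics_per_minute_2013_11")

// ...and drop the collection whose range has aged out of the retention window
db.metrics_per_minute_2013_08.drop()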

In addition, to support the “avg/sec” display option the system also tracks the number of samples collected and the sum of all readings for each metric similar to the following example:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  num_samples: 58,
  total_samples: 108000000,
  type: "memory_used",
  values: {
    0: 999999,
    ...
    37: 1000000,
    38: 1500000,
    ...
    59: 1800000
  }
}

The fields “num_samples” and “total_samples” are updated as new readings are applied to the document:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  {
    $set: { "values.59": 2000000 },
    $inc: { num_samples: 1, total_samples: 2000000 }
  }
)

Computing the average/sec is straightforward and requires no counting or processing, just a single read to retrieve the data and a simple application-level operation to compute the average. Note that with this model we assume a consistent cadence of measurements - one per second - that we can simply aggregate at the top of the document to report a rolled-up average for the whole minute. Other models are possible that would support inconsistent measurements and flexible averages over different time frames.
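
For example, using the per-minute document above, computing the average is one read and one division (a sketch against the same metrics collection):

var doc = db.metrics.findOne({
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  type: "memory_used"
})
var avgPerSec = doc.total_samples / doc.num_samples  // 108000000 / 58, roughly 1862069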

Another optimization used in MMS is preallocating all documents for the upcoming time period; MMS never causes an existing document to grow or be moved on disk. A background task within the MMS application performs inserts of empty “shell” documents including the subdocument schema but with all zeroes for the upcoming time periods before they are recorded. With this approach fields are always incremented or set without ever growing the document in size, which eliminates the possibility of moving the document and the associated overhead. This is a major performance win and another example of ensuring in-place updates within the document-oriented design.
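
Such a preallocation insert for the per-minute schema might look like the sketch below; MMS’s real shell documents are not shown in this post, so treat the field layout as illustrative:

// Preallocate a zero-filled document for an upcoming minute so that later
// $set/$inc operations never grow or move it on disk (layout is illustrative)
var zeroes = {}
for (var s = 0; s < 60; s++) {
  zeroes[s] = 0
}
db.metrics.insert({
  timestamp_minute: ISODate("2013-10-10T23:07:00.000Z"),
  type: "memory_used",
  num_samples: 0,
  total_samples: 0,
  values: zeroes
})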

Conclusion

MongoDB offers many advantages for storing and analyzing time series data, whether it’s stock ticks, tweets or MongoDB metrics. If you are using MongoDB for time series data analysis, we want to hear about your use case. Please continue the conversation by commenting on this post with your story.

More Information

This post was updated in December 2014 to include additional resources and updated links.


What We're Reading

MongoDB

Company

Here are some great articles about MongoDB to read this weekend:

ScaleGrid: Should You Enable MongoDB Journaling?, 10/18

Business Insider: MongoDB co-founder Dwight Merriman and CEO Max Schireson were chosen for the Silicon Alley Top 100 in New York Tech roundup, October

InfoWorld: Use MongoDB to Make Your App Location-Aware, 10/24

comSysto: Getting Started with MongoSoup, 10/25

Digital Misinformation: Collections and Embedded Documents in MongoDB, 10/22

We’re happy to announce a new partnership with SolidFire. You can read more about this on Consumer Electronics and the MongoDB blog, 10/24

Introducing the MongoDB Community Kit

MongoDB

Releases

Open source projects thrive as a reflection of the participation and enthusiasm of their communities. MongoDB is a great example of a global community, and we have seen a number of people, like MongoDB MUG organizers and MongoDB Masters, create lasting impact for MongoDB through their work with the community. To encourage growth in the MongoDB community, we’ve taken what we’ve learned and turned it into a new resource: The MongoDB Community Kit.

Sometimes people want to get involved but aren’t sure how to get started. This tool can help you, as a user and a community member, contribute to the project in whatever way you like. Based on our experiences with thousands of developers over the past four years, the MongoDB community team has developed a number of techniques that will help you make a valuable impact. For instance, you can:

  • Give a talk on how you scaled your MongoDB infrastructure
  • Write a blog post with real advice on how to use MongoDB in production
  • Create an open source tool for MongoDB that enables other users to code faster and create better deployments.
  • Create a User Group in your local area that educates new users and brings a community together
  • Contribute to the community in whichever way you like, even if it isn’t listed in the Community Kit. The invitation is open.

The Kit is available on Github as an open source project. This makes it easy to access, fork and update. This package is released under the Creative Commons license CC BY-NC-SA 3.0. Just like any other project, this kit gets better when users contribute their knowledge, so we encourage you to submit a pull request and add your feedback.

Some suggestions:

  • Add some posters you created for a MUG
  • Post your “Intro to MongoDB Slides”
  • Fork the kit and translate it into your own language and share with your community

We’re looking forward to seeing all the great activity coming from the community. Keep the pull requests coming!

Managing the web nuggets with MongoDB and MongoKit

MongoDB

Releases

This is a guest post by Nicolas Clairon, maintainer of MongoKit and founder of Elkorado

MongoKit is a Python ODM for MongoDB. I created it in 2009 (when the ODM acronym wasn’t even used) for my startup project called Elkorado. Now that the service is live, I realize that I never wrote about MongoKit. I’d like to introduce it to you with this quick tutorial based on real use cases from Elkorado.

Elkorado: a place to store web nuggets

Elkorado is a collaborative, interest-based curation tool. It was born out of the frustration that there is no single place to find quality resources about a particular topic of interest. There are so many blogs, forums, videos and websites out there that it is very difficult to find our way through this massive wealth of information.

Elkorado aims to help people centralize quality content, so they can easily find it later and discover new material.

MongoDB to the rescue

Rapid prototyping is one of the most important things in the startup world, and it is an area where MongoDB shines.

The web is changing fast, and so are web resources and their metadata. MongoDB’s schemaless data model is a perfect fit for storing this kind of data. After losing hair trying to model polymorphism in SQL databases, I moved to MongoDB… and I fell in love with it.

While playing with the data, I needed a validation layer and wanted to add some methods to my documents. Back then, there was no ODM for Python, so I created MongoKit.

MongoKit: MongoDB ODM for Python

MongoKit is a thin layer on top of Pymongo. It brings field validations, inheritance, polymorphism and a bunch of other features. Let’s see how it is used in Elkorado.

Elkorado is a collection of quality web resources called nuggets. This is how we could fetch a nugget discovered by the user “namlook” with Pymongo:

>>> import pymongo
>>> con = pymongo.Connection()
>>> nugget = con.elkorado.nuggets.find_one({"discoverer": "namlook"})

nugget here is a regular Python dict.

Here’s a simple nugget definition with MongoKit:

import mongokit
connection = mongokit.Connection()

@connection.register
class Nugget(mongokit.Document):
    __database__ = "elkorado"
    __collection__ = "nuggets"
    structure = {
      "url": unicode,
      "discoverer": unicode,
      "topics": list,
      "popularity": int
    }
    default_values = {"popularity": 0}
    def is_popular(self):
        """ this is for the example purpose """
        return self.popularity > 1000

Fetching a nugget with MongoKit is pretty much the same:

nugget = connection.Nugget.find_one({"discoverer": "namlook"})

However, this time, nugget is a Nugget object and we can call the is_popular method on it:

>>> nugget.is_popular()
True

One of the main advantages of MongoKit is that all your models are registered and accessible via the connection instance. MongoKit looks at the __database__ and __collection__ attributes to know which database and which collection to use. This is useful because there is only one place to specify those variables.

Inheritance

MongoKit was first built to natively support inheritance:

from datetime import datetime

class Core(mongokit.Document):
    __database__ = "elkorado"
    use_dot_notation = True
    structure = {
        "created_at": datetime,
        "updated_at": datetime
    }
    default_values = {
        "created_at": datetime.utcnow,
        "updated_at": datetime.utcnow
    }

    def save(self, *args, **kwargs):
        self.updated_at = datetime.utcnow()
        super(Core, self).save(*args, **kwargs)

In this Core object, we are defining the database name and some fields that will be shared by other models.

If one wants a Nugget object to have date metadata, one just has to make it inherit from Core:

@connection.register
class Nugget(Core):
    __collection__ = "nuggets"
    structure = {
        "url": unicode,
        "topics": list,
        "discoverer": unicode,
        "popularity": int
    }
    default_values = {"popularity": 0}

It’s all about Pymongo

With MongoKit, you are still very close to Pymongo. In fact, MongoKit’s connection, database and collection are subclasses of Pymongo’s. If at some point in an algorithm you need raw performance, you can use Pymongo’s layer directly, which is blazing fast:

>>> nuggets = connection.Nugget.find() # nuggets is a cursor of Nugget objects
>>> nuggets = connection.elkorado.nuggets.collection.find() # nuggets is a cursor of regular Python dicts

Here, connection is a MongoKit connection, but it can be used like a Pymongo connection. Note that to keep the benefit of DRY, we can call the Pymongo layer from a MongoKit document:

>>> nuggets = connection.Nugget.collection.find() # fast!

A real life “simplified” example

Let’s see an example of CRUD done with MongoKit.

On Elkorado, each nugget is unique, but multiple users can share a nugget, each with different metadata. Each time a user picks up a nugget, a UserNugget is created with user-specific information. If this is the first time the nugget has been discovered, a Nugget object is created; otherwise, it is updated. Here is a simplified UserNugget structure:

from mongokit import ObjectId, Connection

connection = Connection()

@connection.register
class UserNugget(Core):
    __collection__ = "user_nuggets"
    structure = {
        "url": unicode,
        "topics": [unicode],
        "user_id": unicode
    }
    required_fields = ["url", "topics", "user_id"]

    def save(self, *args, **kwargs):
        super(UserNugget, self).save(*args, **kwargs)
        nugget = self.db.Nugget.find_one({"url": self.url})
        if not nugget:
            nugget = self.db.Nugget(url=self.url, discoverer=self.user_id)
            nugget.save()
        self.db.Nugget.collection.update(
            {"url": self.url},
            {"$addToSet": {"topics": {"$each": self.topics}}, "$inc": {"popularity": 1}})

This example illustrates well what can be done with MongoKit. Here, the save method has been overridden to check whether a nugget exists (remember, each nugget is unique by its URL). The nugget is created if it does not already exist, and then updated.

Updating data with MongoKit is similar to Pymongo. Use save on the object, or use the Pymongo layer directly to make atomic updates. Here, we use atomic updates to push new topics and increase the popularity:

self.db.Nugget.collection.update({"url": self.url}, {
    "$addToSet": {"topics": {"$each": self.topics}},
    "$inc": {"popularity": 1}
})

Getting live

Let’s play with our model:

>>> user_nugget = connection.UserNugget()
>>> user_nugget.url = u"http://www.example.org/blog/post123"
>>> user_nugget.user_id = u"namlook"
>>> user_nugget.topics = [u"example", u"fun"]
>>> user_nugget.save()

When calling the save method, the document is validated against the UserNugget’s structure. As expected, the fields created_at and updated_at have been added:

>>> user_nugget
{
    "_id": ObjectId("4f314163a1e5fa16fe000000"),
    "created_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "updated_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "url": u"http://www.example.org/blog/post123",
    "user_id": u"namlook",
    "topics": [u"example", u"fun"]
}

and the related nugget has been created:

>>> nugget = connection.Nugget.find_one({"url": "http://www.example.org/blog/post123"})
{
    "_id": ObjectId("4f314163a1e5fa16fe000001"),
    "created_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "updated_at": datetime.datetime(2013, 8, 4, 17, 22, 8, 3000),
    "url": u"http://www.example.org/blog/post123",
    "discoverer": u"namlook",
    "topics": [u"example", u"fun"],
    "popularity": 1
}

Conclusion

MongoKit is a central piece of Elkorado. It has been written to be small and minimalist but powerful. There is so much more to say about features like inherited queries, i18n and gridFS, so take a look at the wiki to read more about how this tool can help you.

Check the documentation for more information about MongoKit. And if you register on Elkorado, check out the nuggets about MongoDB. Don’t hesitate to share your nuggets as well; the more the merrier.

PageRank on Flights Dataset

MongoDB

Releases

By Sweet Song and Daniel Alabi, MongoDB Summer Interns for 2013

This is the second of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Having done some basic analysis on the Flights dataset (mostly using the MongoDB aggregation framework), we moved on to do some more advanced analysis on this dataset. We settled on computing the PageRank of all airports in the Flights dataset. The PageRank of nodes in a network is often computed iteratively. This process can easily be parallelized and often is. We can utilize MongoDB to compute the PageRank of nodes in a network in several ways. Here are the two options we considered:

  1. We can use the MongoDB MapReduce framework, which since version 2.4 uses the V8 JavaScript engine. Furthermore, since MongoDB is known for its robust sharding capabilities, we can increase the performance of query operations by setting up a MongoDB sharded cluster for our dataset. This is essential for really large working datasets.

  2. The Hadoop open-source framework is well-known for its robust distributed data processing features. MongoDB interfaces with Hadoop via the Mongo-Hadoop connector.

For this particular dataset, we opted for (1) since the Flights dataset has only 319 airports. Regardless, there were 4,601 total weighted edges among USA commercial airports. The weight of an edge between any two airports was calculated using the 6,155,752 recorded trips in the flights collection.

Making a Graph of Airports

The airports dataset is fairly connected: only one airport received flights in the past year without having any domestic departures. Most flights out of Puerto Rico are considered international flights; as a result, our dataset had no recorded domestic departures from that airport in that year. That airport would be a black hole for any PageRank that flows into it. A more thorough explanation can be found here. Therefore, we removed that single Puerto Rico airport from our airports graph.

From the previous analysis, we had put the Flights dataset in a flights collection in the flying database. An entry looks like this:

{
  "_id" : ObjectId("51bf..."),
  ...
  "origAirportId" : 12478,
  "origStateId" : "NY",
  ...
  "destAirportId" : 12892,
  "destStateId" : "CA",
  ...
}

For each document, we create (or modify) at least one node that keeps track of this “edge”:

{
  "_id" : "12478",
  "value" : {
    "pg" : (1 / NUM_OF_AIRPORTS_IN_DB),
    "prs" : {
      "12892" : (NUM_OF_FLIGHTS_FROM_12478_TO_12892 / NUM_OF_FLIGHTS_FROM_12478),
      ...
    }
  }
}

where NUM_OF_AIRPORTS_IN_DB is the total number of airports in the Flights dataset which corresponds to the number of nodes in the network. NUM_OF_FLIGHTS_FROM_12478 is the total number of flights leaving from airport with airportId=12478. NUM_OF_FLIGHTS_FROM_12478_TO_12892 is the number of flights that leave the airport with airportId=12478 and arrive at the airport with airportId=12892. pg is the current PageRank of an airport; prs is a Map of <aId, pr> where pr is the probability of a flight going from the airport specified by _id to an airport identified by aId. For example, NUM_OF_FLIGHTS_FROM_12478_TO_12892/NUM_OF_FLIGHTS_FROM_12478 is the probability of transitioning from airport with airportId=12478 to airport with airportId=12892.
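
As a sketch of how those counts could be obtained (the post’s actual graph construction is done by preformat.py, which is not shown here), a single aggregation over the flights collection yields the per-route counts; dividing each count by its origin’s total gives the prs probabilities:

// Count flights per (origin, destination) pair; these become the edge weights,
// and count / (total flights from origin) becomes the "prs" probability
db.flights.aggregate([
  { $group: {
      _id: { from: "$origAirportId", to: "$destAirportId" },
      flights: { $sum: 1 }
  } }
])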

We wrote preformat.py to create the graph that contains information about the probability of every node in the graph transitioning to another. The resulting graph was stored in an fpg_0 collection (Flights PageRank 0) with 318 nodes.

MongoDB MapReduce

Next, we wrote some JavaScript code to calculate PageRank on the graph stored in the database. The goal was to create a new collection fpg_i for every ith iteration of PageRank. Every iteration is a call to oneiteration() in iteration.js, which consists of a map and a reduce function. The PageRank algorithm will stop once the average percentage change of the PageRank values for all nodes drops below 0.1%. The map function looks like this:

var map = function() {
    // For each node that is reachable from this node, give it the 
    // appropriate portion of my pagerank
    for (var toNode in this["value"]["prs"]) {
        emit(toNode, {totalNodes : 0.0
                  , pg : this["value"]["prs"][toNode] * this["value"]["pg"]
                  , prs : {}
                  , diff : 0.0
                  , prevpg : 0.0});
    }

    // Pass the previous pagerank and the probability matrix to myself
    emit(this["_id"], {totalNodes: this["value"]["totalNodes"]
                   , pg: 0.0
                   , prs: this["value"]["prs"]
                   , diff: 0.0
                   , prevpg: this["value"]["pg"]});

    // Reduce won't be called on a key unless there's more than one value for that key
    // So this is just an extra emit to make sure that reduce is called
    emit(this["_id"], {totalNodes: 0.0
                   , pg : 0.0
                   , prs : {} 
                   , diff : 0.0
                   , prevpg : 0.0});
};

The map function considers every document (corresponding to an airport) in the current fpg_i. For each airport (call this x), it emits its airport ID (stored in _id) and passes the prs and prevpg (previous pg) information, for use in the next iteration of PageRank. Then, it passes a portion of x’s PageRank to every airport that x links to.

The reduce function looks like this:

var reduce = function(airportId, values) {
    var pg = 0
      , diff = 0
      , prs = {}
      , prevpg = 0
      , beta = 0.9
      , totalNodes = 0;

    for (var i in values) {
        // Retrieve the previous pagerank and the probability matrix
        prevPRS = values[i]["prs"]
        for (var key in prevPRS) {
            prs[key] = prevPRS[key];
        }
        prevpg += values[i]["prevpg"];

        // Summation of the pagerank
        pg += values[i]["pg"];
        totalNodes += values[i]["totalNodes"];
    }

    diff = Math.abs(prevpg - pg) / prevpg;
    return {"totalNodes" : totalNodes
        , "pg" : pg
        , "prs" : prs
        , "diff" : diff
        , "prevpg" : prevpg};
};

The reduce function has two duties:

  1. Collect the prs and prevpg information for each node;

  2. Accumulate the total PageRank score sent to each node.

Finally, db["fpg<em>"+i].mapReduce(map, reduce, {out: "fpg</em>"+(i+1)}); runs MapReduce on the fpg_i collection using the map and reduce functions defined below and stores the result (in the same format as fpg<em>i</em>) into fpg(i+1).

We keep applying the MapReduce operations until the PageRank of the nodes eventually converges. This happens when the average percentage change of pg for each node is less than a certain threshold (0.1% in our case). The execution of our implementation of the PageRank algorithm took 6.203 seconds, having converged after 20 iterations.
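
A sketch of the driver loop around these MapReduce calls, assuming the map and reduce functions above; the convergence test recomputes the average relative change from the diff field produced by reduce, and the actual bookkeeping in iteration.js may differ:

var i = 0;
var avgDiff = 1.0;
while (avgDiff > 0.001) {  // stop below a 0.1% average change
  db["fpg_" + i].mapReduce(map, reduce, { out: "fpg_" + (i + 1) });
  i++;

  // Average the per-node relative change recorded by the reduce function
  var sum = 0, count = 0;
  db["fpg_" + i].find().forEach(function(doc) {
    sum += doc.value.diff;
    count++;
  });
  avgDiff = sum / count;
}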

PageRank Result and Interpretation

The 10 airports with the most PageRank are:

1.{ pg: 0.06370586088275128,
    airportCode: "ATL",
    airportState: "Georgia",
    airportStateId: "GA",
    airportCity: "Atlanta, GA" },
2.{ pg: 0.04987817077679942,
    airportCode: "ORD",
    airportState: "Illinois",
    airportStateId: "IL",
    airportCity: "Chicago, IL" },
3.{ pg: 0.04484114423869301,
    airportCode: "DFW',
    airportState: "Texas",
    airportStateId: "TX",
    airportCity: "Dallas/Fort Worth, TX" },
4.{ pg: 0.0375874819401995,
    airportCode: "DEN",
    airportState: "Colorado",
    airportStateId: "CO",
    airportCity: "Denver, CO" },
5.{ pg: 0.035847669686020475,
    airportCode: "LAX",
    airportState: "California",
    airportStateId: "CA",
    airportCity: "Los Angeles, CA" },
6.{ pg: 0.029359141715724606,
    airportCode: "IAH",
    airportState: "Texas",
    airportStateId: "TX",
    airportCity: "Houston, TX" },
7.{ pg: 0.029269624393415964,
    airportCode: "PHX",
    airportState: "Arizona",
    airportStateId: "AZ",
    airportCity: "Phoenix, AZ" },
8.{ pg: 0.027586105077479,
    airportCode: "SFO",
    airportState: "California",
    airportStateId: "CA",
    airportCity: "San Francisco, CA" },
9.{ pg: 0.022826269022159618,
    airportCode: "LAS",
    airportState: "Nevada",
    airportStateId: "NV",
    airportCity: "Las Vegas, NV" },
10.{ pg: 0.022075486537264547,
    airportCode: "CLT",
    airportState: "North Carolina",
    airportStateId: "NC",
    airportCity: "Charlotte, NC" }

The outcome matches our intuition that the airports with the most flights would accumulate most of the PageRank. In general, the nodes in a weighted graph with the most PageRank will be the ones with a greater ratio of incoming weight to outgoing weight.
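
In symbols (a standard weighted PageRank formulation, not spelled out in the post), each MapReduce round applies the update

$$PR_{t+1}(i) = \sum_{j \to i} \frac{w_{ji}}{\sum_{k} w_{jk}} \, PR_{t}(j)$$

where $w_{ji}$ is the number of recorded flights from airport $j$ to airport $i$; the fraction is exactly the prs entry stored on node $j$. In the fully damped variant, this sum is multiplied by the beta factor declared in the reduce function and a $(1-\beta)/N$ term is added.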

Below is a map of the USA that illustrates the PageRank of all airports in the Flights dataset. Click on the image below to see the interactive map. The bigger the circle on the airport, the larger its PageRank. Hover around a circle to see the full name of an airport, its airport code, and the percentage of the total PageRank the airport accumulated.

Challenges/Lessons Learned

Insert vs. Update

Initially, we envisioned iteration.js merely updating the pg and prevpg fields of the PageRank collection instead of outputting to a new collection. However, updates were significantly slower than inserts into a new collection, even though we already had indexes on the pg and prevpg fields. We learned that, in general, updates in really large collections are significantly slower than insertions into a new collection. This preference for inserts over updates would be common in our other attempts.

Flights Dataset has no information on International Flights

Only domestic flights are present in our Flights dataset. Perhaps, if international flights were included, JFK, O'Hare, and San Francisco airports would have the most PageRank. Also, our map does not show the USA states and territories of Alaska, Hawaii, and Guam. If they were included, then the continental USA would have been too small to distinguish between individual airports.

Relatively small number of nodes in graph

Even though our initial Flights dataset contained 6,155,748 documents (corresponding to domestic flights), the resulting airports graph had only 318 documents (corresponding to airports/nodes). This is why the MongoDB MapReduce framework was very fast, converging after a few seconds and fewer than 20 iterations. It might take considerably longer to converge if run on a dataset with more nodes (more airports).

The next dataset we’ll use is the Twitter Memes dataset. This dataset will have at least 1 million nodes (after pre-processing) that correspond to web pages on the Internet. Performance analysis based on the PageRank algorithm is more easily done on datasets with more nodes.

The MongoDB Java Driver 3.0: What's Changing

MongoDB

Releases

By Trisha Gee, MongoDB Java Engineer and Evangelist

In the last post, we covered the design goals for the new MongoDB Java Driver. In this one, we’re going to go into a bit more detail on the changes you can expect to see, and how to start playing with an alpha version of the driver. Please note, however, that the driver is still a work in progress, and not ready for production.

New features

Other than the overall changes to design detailed above, the 3.0 driver has the following new features:

  • Pluggable Codecs: This means you can do simple changes to serialisation/deserialisation, like tell the driver to use Joda Time instead of java.util.Date, or you can take almost complete control of how to turn your Java objects into BSON. This should be particularly useful for ODMs or other libraries, as they can write their own codecs to convert Java objects to BSON bytes.
  • Predictable cluster management: We’ve done quite a lot of work around discovering the servers in your cluster and determining which ones to talk to. In particular, the driver doesn’t have to wait for all servers to become available before it can start using the ones that are definitely there - the design is event-based so as soon as a server notifies the driver of its state the driver can take appropriate action - use it if it’s active, or start ignoring it if it’s no longer available.
  • Additional Connection Pool features: We’ve added support for additional connection pool settings, and a number of other improvements around connection management. Here’s the full list.
  • Deprecated methods/classes will be removed: In the next 2.x release a number of methods and classes will be deprecated. These, along with existing deprecated methods, will be removed in the 3.0 driver. This should point you in the right direction to help you migrate from 2.x to 3.x.

Speaking of Migration…

We’ve worked hard to maintain backwards compatibility whilst moving forwards with the architecture of the Java driver for MongoDB. We want to make migration as painless as possible; in many cases it should be a simple drop-in replacement if you want to keep using the existing API. We hope to provide a step-by-step guide to migrating from 2.x to 3.0 in the very near future. For now, it’s worth mentioning that upgrading will be easiest if you update to 2.12 (to be released soon), migrate any code that uses deprecated features, and then move to the compatible mode of the new driver.

Awesome! Can I try it?

Yes you can! You can try out an alpha of the new driver right now, but as you’d expect there are CAVEATS: this is an alpha, it does not support all current features (notably aggregation); although it has been tested it is still in development and we can’t guarantee everything will work as you expect. Features which have been or will be deprecated in the 2.x driver are missing completely from the 3.0 driver. Please don’t use it in production. However, if you do want to play with it in a development environment, or want to run your existing test suite against it, please do send us any feedback you have.

If you want to use the compatible mode, with the old API (minus deprecations) and new architecture:

Maven

<project>
   ...

   <dependencies>
       <dependency>
           <groupId>org.mongodb</groupId>
           <artifactId>mongo-java-driver</artifactId>
           <version>3.0.0-SNAPSHOT</version>
       </dependency>
       ...
   </dependencies>

   <repositories>
       <repository>
           <id>sonatype-snapshot</id>
           <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
       </repository>
       ...
   </repositories>
</project>

Gradle

apply plugin: 'java'

repositories {
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots/"
    }
}

dependencies {
    compile 'org.mongodb:mongo-java-driver:3.0.0-SNAPSHOT'
}

You should be able to do a drop-in replacement with this dependency - use this instead of your existing MongoDB driver, run it in your test environment and see how ready you are to use the new driver.

If you want to play with the new, ever-changing, not-at-all-final API, then you can use the new driver with the new API. Because we wanted to be able to support both APIs and not have a big-bang switchover, there’s a subtle difference to the location of the driver with the updated API, see if you can spot it:

Maven

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.mongodb.test</groupId>
    <artifactId>3.0-test</artifactId>
    <version>1</version>

    <dependencies>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver</artifactId>
            <version>3.0.0-SNAPSHOT</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>sonatype-snapshot</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
        </repository>
    </repositories>
</project>

Gradle

apply plugin: 'java'

repositories {
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots/"
    }
}

dependencies {
    compile 'org.mongodb:mongodb-driver:3.0.0-SNAPSHOT'
}

Note that if you use the new API version, you don’t have access to the old compatible API.

Of course, the code is in GitHub

In Summary

For 3.0, we will deliver the updated, simplified architecture with the same API as the existing driver, as well as working towards a more fluent style of API. This means that although in future you have the option of using the new API, you should also be able to do a simple drop-in replacement of your driver jar file and have the application work as before.

A release date for the 3.0 driver has not been finalized, but keep your eyes open for it.

All Hail the new Java driver!

Faceted Search with MongoDB

MongoDB

Releases

By Jon Rangel, MongoDB Consulting Engineer

Introduction

Faceted search, or faceted navigation, is a way of browsing and searching for items in a set of data by applying filters on various properties (facets) of the items in the collection. It is increasingly seen as an important part of the UI for many search platforms, and indeed nowadays is pretty much expected in places such as e-commerce websites.

Faceted search makes it easy for users to navigate to the specific item or items they are interested in. It complements more free-form keyword search by facilitating exploration and discovery and is therefore useful when a user may not know the specific keywords they wish to search on.

Some core functionality that a faceted search feature should provide might include:

  • finding the items that match a particular value of a certain facet (e.g. colour:blue)
  • finding the items in the intersection of multiple facet values (e.g. colour:blue AND size:medium)
  • finding the items in the union of multiple facet values (e.g. colour:blue OR colour:red OR size:large)
  • for each possible facet filter combination, display to the user the possible facet values on which it is possible to filter further (“drill down”)
  • for each facet value on which it is possible to drill down, display to the user the count of items matching that filter.

In this article, we’ll look at implementing the above faceted search functionality using a pure MongoDB solution. We’ll examine a number of approaches to solving this problem, and discuss their relative performance characteristics and any other pros/cons. We will also introduce some third party tools that, alternatively, can integrate with MongoDB to provide faceted search functionality.

Navigating a Book Store

Suppose we want to build faceted search functionality for a product catalog for a book store. A typical document representing a publication in the catalog might look something like the following:

  {
        _id : 123,
        title : "MongoDB: The Definitive Guide",
        authors : [ "Kristina Chodorow" ],
        publication_date : ISODate("2013-05-23"),
        pages : 432,
        edition : 2,
        isbn_10 : "1449344682",
        isbn_13 : "978-1449344689",
        language : "English",
        publisher : {
            name: "O'Reilly Media",
            ...
        },
        last_updated : ISODate("2013-05-16"),
        ...
    }

First off, let’s state some reasonable assumptions about the facets for this (or indeed any other) catalog:

  • The total number of facets will be small.
  • The total number of possible facet values for each facet may be large, but will typically be small.
  • Each item in the catalog may have zero or more facet values (“tags”) for each facet (but typically one).
  • The facets are well-known up front, and change rarely if at all. The set of facet values may change frequently i.e. any time the product catalog is updated to add/remove items, or change the tags on existing items.
  • The application has knowledge of the facets being used, but not the set of all possible facet values that exist in the catalog for each of those facets.

For this example, let’s say we have three facets on which we wish to search – Subject, Publisher and Language – and consider how to search efficiently, and how to generate the faceted navigation meta-data to present to the user. We will test on some pre-generated test data based on a real-world product catalog.

Searching

The first part of the problem to solve is how to efficiently search for items in the product catalog. A few schema and indexing approaches are presented below.

Solution #1

One way to define the facet tags for a publication would be to store all facet types and values in subdocuments in an array, as follows:

  {
        _id: 123,
        ...
        facets1 : [
            {
                type : "subject",
                val : "MongoDB"
            },
            {
                type : "subject",
                val : "Databases"
            },
            {
                type : "publisher",
                val : "O'Reilly Media"
            },
            {
                type : "language",
                val : "English"
            }
        ]
    } 

A single ‘generic’ compound index can then be created containing all the facets and facet values:

 > db.books.ensureIndex({"facets1.type" : 1, "facets1.val" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 29891456,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536
        },
        "ok" : 1
    }

See this blog post for a good treatment on these kinds of generic indexes.

Let’s see how this performs for some faceted searches, using explain(). We’ll look at queries on a single facet tag to start with.

Find all books about databases:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 27,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Find all books by a specific publisher:

  > db.books.find(
    ...     { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 39960,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 133,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Both of these queries use the index optimally as the number of documents returned is the same as the number of documents scanned (nscanned is the same as n).

How about queries for documents matching the union or intersection of multiple facet values? For these “or” and “and” queries we use the $in and $all operators, respectively.

Find all books about databases OR published by O'Reilly Media:

  > db.books.find(
    ...     { "facets1" :
    ...         { "$in" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    Fri Aug 16 15:59:04.989 JavaScript execution failed: error: 
    { "$err" : "$elemMatch not allowed within $in", "code" : 15881 } at src/mongo/shell/query.js:L128

Oops! This type of search cannot be constructed with $in, because the $elemMatch operator is not allowed within an $in clause. The query can instead be constructed using the $or operator:

  > db.books.find(
    ...     { "$or" : [
    ...             { "facets1" : { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } } },
    ...             { "facets1" : { $elemMatch : { "type" : "subject", "val" : "Databases" } } }
    ...         ]
    ...     }
    ... ).explain()
    {
        "clauses" : [
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 40019,
                "nscannedObjects" : 40019,
                "nscanned" : 40019,
                "nscannedObjectsAllPlans" : 40019,
                "nscannedAllPlans" : 40019,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 118,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "publisher",
                            "publisher"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "O'Reilly Media",
                            "O'Reilly Media"
                        ]
                    ]
                }
            },
            {
                "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
                "isMultiKey" : true,
                "n" : 6640,
                "nscannedObjects" : 7374,
                "nscanned" : 7374,
                "nscannedObjectsAllPlans" : 7374,
                "nscannedAllPlans" : 7374,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 1,
                "nChunkSkips" : 0,
                "millis" : 123,
                "indexBounds" : {
                    "facets1.type" : [
                        [
                            "subject",
                            "subject"
                        ]
                    ],
                    "facets1.val" : [
                        [
                            "Databases",
                            "Databases"
                        ]
                    ]
                }
            }
        ],
        "n" : 46659,
        "nscannedObjects" : 47393,
        "nscanned" : 47393,
        "nscannedObjectsAllPlans" : 47393,
        "nscannedAllPlans" : 47393,
        "millis" : 242,
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index is used for both parts of the “or” statement.

Next, find all books about databases AND published by O'Reilly Media:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } },
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 39960,
        "nscanned" : 39960,
        "nscannedObjectsAllPlans" : 39960,
        "nscannedAllPlans" : 39960,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 118,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "publisher",
                    "publisher"
                ]
            ],
            "facets1.val" : [
                [
                    "O'Reilly Media",
                    "O'Reilly Media"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query uses the index, but is not optimal as many more documents are scanned than returned. Note that the number of documents scanned is the same as the number of books by this publisher (as seen from the previous query) – this is because at present $all only uses the index for the first element in the query array.

The performance of these kinds of queries will improve significantly once MongoDB supports index intersection, a feature that is coming soon (see SERVER-3071). With single-index intersection, queries like the one above will not need to scan more documents than are returned. In the meantime, to optimize these kinds of queries, put the most selective filter criterion as the first element of the $all array where possible to minimize scanning:

 > db.books.find(
    ...     { "facets1" :
    ...         { "$all" : [
    ...             { $elemMatch : { "type" : "subject", "val" : "Databases" } },
    ...             { $elemMatch : { "type" : "publisher", "val" : "O'Reilly Media" } }
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets1.type_1_facets1.val_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets1.type" : [
                [
                    "subject",
                    "subject"
                ]
            ],
            "facets1.val" : [
                [
                    "Databases",
                    "Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

Solution #2

Store all facet types and values in an array, but instead of each element of the array being a subdocument, concatenate the facet type name and value into a single string value:

 {
        _id: 123,
        ...
        facets2 : [
            "subject:MongoDB",
            "subject:Databases",
            "publisher:O'Reilly Media",
            "language:English"
        ]
    }

Create an index on the facets field:

  > db.books.ensureIndex({"facets2" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 55694912,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456
        },
        "ok" : 1
    }

Now let’s try some of the same queries as before. First, a simple query on a single facet value (all books about databases):

   > db.books.find(
    ...     { "facets2" : "subject"+":"+"Databases" }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 7315,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 28,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This works exactly as expected.

Now, let’s try an “or” query (all books about databases OR published by O'Reilly Media):

    > db.books.find(
    ...     { "facets2" :
    ...         { "$in" : [
    ...             "publisher"+":"+"O'Reilly Media",
    ...             "subject"+":"+"Databases"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1 multi",
        "isMultiKey" : true,
        "n" : 46600,
        "nscannedObjects" : 47275,
        "nscanned" : 47276,
        "nscannedObjectsAllPlans" : 47275,
        "nscannedAllPlans" : 47276,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 117,
        "indexBounds" : {
            "facets2" : [
                [
                    "publisher:O'Reilly Media",
                    "publisher:O'Reilly Media"
                ],
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

This query is pretty optimal: the number of documents scanned is only slightly more than the number returned, and the index bounds look sensible, showing that the index is used for both elements of the $in array. Note that $in may be used to construct this type of query since we don’t need to use the $elemMatch operator with this schema.

Finally, an “and” query (all books about databases that are published by O'Reilly Media):

  > db.books.find(
    ...     { "facets2" :
    ...         { "$all" : [
    ...             "subject"+":"+"Databases",
    ...             "publisher"+":"+"O'Reilly Media"
    ...         ]}
    ...     }
    ... ).explain()
    {
        "cursor" : "BtreeCursor facets2_1",
        "isMultiKey" : true,
        "n" : 675,
        "nscannedObjects" : 7315,
        "nscanned" : 7315,
        "nscannedObjectsAllPlans" : 7315,
        "nscannedAllPlans" : 7315,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 20,
        "indexBounds" : {
            "facets2" : [
                [
                    "subject:Databases",
                    "subject:Databases"
                ]
            ]
        },
        "server" : "rangel.lan:27017"
    }

If you’ve been following along so far, you won’t be too surprised to see that, unfortunately, this performs exactly the same as solution #1, for the same reasons described there. Index intersection is coming soon, though!

Solution #3

Consider the following schema, where each facet is a field in a subdocument, associated with an array of the tags for that facet:

 {
        _id: 123,
        ...
        facets3 : {
            subject : [ "MongoDB", "Databases" ],
            publisher : [ "O'Reilly Media" ],
            language : [ "English" ]
        }
    }

Add an index on each facet individually:

  > db.books.ensureIndex({"facets3.subject" : 1})
    > db.books.ensureIndex({"facets3.publisher" : 1})
    > db.books.ensureIndex({"facets3.language" : 1})
    > db.books.stats()
    {
        "ns" : "test.books",
        "count" : 105280,
        "size" : 109597152,
        "avgObjSize" : 1041.0063829787234,
        ...
        "totalIndexSize" : 75464480,
        "indexSizes" : {
            "_id_" : 3433920,
            "facets1.type_1_facets1.val_1" : 26457536,
            "facets2_1" : 25803456,
            "facets3.subject_1" : 12084128,
            "facets3.publisher_1" : 2321984,
            "facets3.language_1" : 5363456
        },
        "ok" : 1
    }

This solution has the same performance characteristics as the first two solutions, with the additional benefit that the total size of the indexes required is significantly smaller. This is because we are not storing the facet names in the index for each indexed value.

Once index intersection using multiple indexes is supported (which is also coming under SERVER-3071), this approach will also perform well for “and” queries.
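
For reference, here is a sketch of what the earlier queries look like against this schema (these examples are not part of the test run above; they simply assume the facets3 document shown earlier). An “or” query uses $or across the per-facet fields, while an “and” query is just one predicate per facet:

  //Find all books about databases
  > db.books.find({ "facets3.subject" : "Databases" })

  //Find all books about databases OR published by O'Reilly Media
  > db.books.find({ "$or" : [
  ...     { "facets3.subject" : "Databases" },
  ...     { "facets3.publisher" : "O'Reilly Media" }
  ... ]})

  //Find all books about databases AND published by O'Reilly Media
  //(until index intersection arrives, only one of the single-field indexes is used)
  > db.books.find({
  ...     "facets3.subject" : "Databases",
  ...     "facets3.publisher" : "O'Reilly Media"
  ... })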

Generating the Faceted Navigation Information

The other part of the faceted search problem is how to most efficiently generate and return the faceted search meta-data. One way to do this would be to use the Aggregation Framework to calculate this information on-the-fly.

For example, to get all the facet values for the collection and the count of documents associated with each one, we could perform the following aggregation query (assuming schema #2 as above):

 > db.books.aggregate([{ "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", count : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 39960
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

Then, as the user drills down using the facets, we need to add the filter predicates to the aggregation query. For instance, if the user clicks on the “Databases” subject facet, we can obtain the facet values and counts for documents matching this filter as follows:

 > db.books.aggregate([{ "$match" : { "facets2" : "subject"+":"+"Databases" } },
                          { "$unwind" : "$facets2" },
                          { "$group" : { "_id" : "$facets2", "count" : { "$sum" : 1 } } },
                          { "$sort" : { "_id" : 1 } }
                         ])
    {
        "result" : [
            ...
            {
                "_id" : "publisher:O'Reilly Media",
                "count" : 675
            },
            ...
            {
                "_id" : "subject:Databases",
                "count" : 7315
            },
            ...
        ],
        "ok" : 1
    }

The downside to this approach is that it incurs the overhead of an additional aggregation query each time the user queries the product catalog. Furthermore, for certain choices of schema (e.g. solution #3 above) we actually need to do one aggregation query per distinct facet.
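
For example, with solution #3 the publisher counts for books tagged with the “Databases” subject would come from a pipeline like the following sketch; similar pipelines unwinding $facets3.subject and $facets3.language would be needed to complete the navigation meta-data:

  > db.books.aggregate([{ "$match" : { "facets3.subject" : "Databases" } },
                        { "$unwind" : "$facets3.publisher" },
                        { "$group" : { "_id" : "$facets3.publisher", "count" : { "$sum" : 1 } } },
                        { "$sort" : { "_id" : 1 } }
                       ])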

It’s reasonable to assume that the product catalog will be updated much less frequently than it is queried, so it may well make sense to pre-compute the faceted navigation meta-data and store it in a separate collection. Consider the following schema for a collection of faceted navigation documents:

 {
        _id : "<facet_filter_string>",
        value : {
            count : 12,
            facets : {
                facet1_name : {
                    facet1_val1 : 8,
                    facet1_val2 : 12,
                    ...
                },
                facet2_name : {
                    facet2_val1 : 5,
                    ...
                },
                ...
            }
        }
    }

where <facet_filter_string> is either the empty string (for the document representing the root of the faceted navigation) or one or more of “|<facet_name>:<facet_filter_val>|” concatenated together.

Then, to find the faceted navigation information pertaining to all books about databases, the following simple query on _id will do the job:

 > db.facetnav.find({_id:"|subject:Databases|"}).pretty()
    {
        "_id" : "|subject:Databases|",
        "value" : {
            "count" : 7315,
            "facets" : {
                "publisher" : {
                    "O'Reilly Media" : 675,
                    "Pub2" : 3605,
                    "Pub3" : 185,
                    "Pub4" : 305,
                    "Pub5" : 2505,
                    "Pub6" : 15,
                    "Pub7" : 25
                },
                "language" : {
                    "English" : 7250,
                    "French" : 1095,
                    "German" : 1290
                }
            }
        }
    }

Note that it’s not necessary to generate a document like the above for every single permutation of facet filters, only for each unique combination of filters according to some predetermined canonical ordering of facets (e.g. Subject, Publisher, Language). We can then ensure that the application always builds the _id string with which to query using this canonical ordering.
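
A small helper along these lines could enforce that ordering when the application builds the _id string. This function is hypothetical (it is not part of the original example code); it simply follows the “|<facet_name>:<facet_filter_val>|” convention described above:

  //Canonical ordering of facets for building facetnav _id strings (hypothetical helper)
  var FACET_ORDER = [ "subject", "publisher", "language" ];

  //filters is an object such as { subject : "Databases" } or
  //{ subject : "Databases", publisher : "O'Reilly Media" }
  function buildFacetNavId(filters) {
      var id = "";
      FACET_ORDER.forEach(function(facet) {
          if (filters[facet] !== undefined) {
              id += "|" + facet + ":" + filters[facet] + "|";
          }
      });
      return id;   //returns "" for the root facetnav document
  }

  //e.g. buildFacetNavId({ publisher : "O'Reilly Media", subject : "Databases" })
  //always yields "|subject:Databases||publisher:O'Reilly Media|", regardless of
  //the order in which the user selected the filters
  > db.facetnav.find({ _id : buildFacetNavId({ subject : "Databases" }) }).pretty()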

The faceted navigation meta-data collection can be generated quite easily using a Map-Reduce job. For some example code that does this, take a look at my GitHub repo. With the map and reduce functions defined there, the facetnav info for the entire product catalog can be generated as follows:

 > db.books.mapReduce(mapFn, reduceFn, { "out" : "facetnav" })
    {
        "result" : "facetnav",
        "timeMillis" : 117529,
        "counts" : {
            "input" : 105280,
            "emit" : 2423080,
            "reduce" : 63850,
            "output" : 1599
        },
        "ok" : 1,
    }
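
As a rough illustration only, map and reduce functions that produce documents of the shape shown earlier might look something like the following minimal sketch, which assumes the solution #3 schema and emits only the root document (the actual functions in the repo also emit every canonical combination of facet filters):

  //Minimal sketch assuming the facets3 schema; not the repo's implementation
  var mapFn = function() {
      var value = { count : 1, facets : {} };
      for (var facet in this.facets3) {
          value.facets[facet] = {};
          this.facets3[facet].forEach(function(tag) {
              value.facets[facet][tag] = 1;
          });
      }
      emit("", value);   //"" is the _id of the root facetnav document
  };

  var reduceFn = function(key, values) {
      var result = { count : 0, facets : {} };
      values.forEach(function(v) {
          result.count += v.count;
          for (var facet in v.facets) {
              if (result.facets[facet] === undefined) {
                  result.facets[facet] = {};
              }
              for (var tag in v.facets[facet]) {
                  result.facets[facet][tag] = (result.facets[facet][tag] || 0) + v.facets[facet][tag];
              }
          }
      });
      return result;   //same shape as the emitted values, so re-reduces cleanly
  };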

Subsequently, whenever the product catalog is updated, the facetnav collection can be quickly updated by specifying that the map-reduce job operate only on the recently updated items and fold those changes into the existing facetnav collection. For example:

  > db.books.ensureIndex({"last_updated : 1"})
    > db.books.mapReduce(mapFn, reduceFn,
    ...                  { "query" : { "last_updated" : { "$gt" : new Date(2013,7,1) } },
    ...                    "out" : { "reduce" : "facetnav" } })
    {
        "result" : "facetnav",
        "timeMillis" : 724,
        "counts" : {
            "input" : 1000,
            "emit" : 13484,
            "reduce" : 198,
            "output" : 1599
        },
        "ok" : 1,
    }

Third-Party Tools

There are a number of search engine software packages that provide faceted search capabilities. These typically provide the core functionality we have described above, plus more advanced features such as more convenient searching on ranges of facet values (e.g. finding documents that fall within a certain date or price range) or auto-completion (i.e. displaying relevant suggestions, grouped by facet, as a user types in a search query).

The trade-offs with using an additional search engine are:

  • Extra complexity due to adding another ‘moving part’ to your deployment
  • Your application must deal with the fact that the system as a whole is now eventually consistent, with respect to the data stored in MongoDB versus the data stored in the external search engine. This may be undesirable, particularly for a product catalog that changes very frequently, for example.

Two of the most popular search engines are Solr and ElasticSearch, which, like MongoDB, are free and open-source products.

Solr and ElasticSearch can be easily integrated with MongoDB using Mongo Connector, which comes bundled with plugins for interfacing with each of them. Using the appropriate plugin, Mongo Connector can integrate data from MongoDB into the desired target system and keep the two systems in sync.

Conclusion

Faceted search functionality can be implemented in MongoDB, without requiring the use of external search engines. When index intersection arrives, all the types of queries we have examined here will perform optimally. Integrating with an external search engine to provide faceted search is also a good option, and something to consider depending on the specific requirements of your application.