GIANT Stories at MongoDB

Best Of Both Worlds: Genentech Accelerates Drug Research With MongoDB & Oracle

“Every day we can reduce the time it takes to introduce a new drug can have a big difference on our patients,” said Doug Garrett, Software Engineer at Genentech.

Genentech Research and Early Development (gRED) develops drugs for significant unmet medical needs. A critical component of this effort is the ability to provide investigators with new genetic strains of animals so as to understand the cause of diseases and to test new drugs.

As genetic testing has both increased and become more complex, Genentech has focused on redeveloping the Genetic Analysis Lab system to reduce the time needed to introduce new lab instruments.

MongoDB is at the heart of this initiative, which captures the variety of data generated by genetic tests and integrates it with Genentech's existing Oracle RDBMS environment. MongoDB’s flexible schema and ability to integrate easily with the existing Oracle RDBMS have helped Genentech reduce development from months to weeks or even days, significantly accelerating drug research. “Every day we can reduce the time it takes to introduce a new drug can have a big difference on our patients,” said Doug Garrett, Software Engineer at Genentech.

Previously, the Genentech team needed to change the schema every time they introduced a new lab instrument, which held up research by three to six months, and sometimes even longer. At the same time, the database was becoming more difficult to support and maintain.

The MongoDB redesign delivered immediate results. In just one example, adding a new genetic test instrument (a new loader) had zero impact on the database schema and allowed Genentech to continue with research after just three weeks, instead of the standard three to six-month delay.

MongoDB also makes it possible for Genentech to load more data than in the past, which fits well with the “collect now, analyze later” model that Garrett noted MongoDB co-founder Dwight Merriman has often suggested.

Said Garrett: “Even if we don’t know if we need the data, the cost is practically zero and we can do it without any programming changes, so why not collect as much as we can?”


To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.

What Do I Use This For? 270,000 Apps

Graham Neray

Company

“Would you rather run 10 different databases, or run one database in 10 different ways?” - Charity Majors, Parse/Facebook.

When introduced to a new technology, would-be users often want to know, ‘So, what should I use this for?’ At MongoDB, we like to say, ‘MongoDB is a general purpose database good for a wide variety of applications.’ While this may be true, it of course doesn’t help the would-be user with her original question: ‘So, what do I use this for?’ Fair enough.

In the evening keynote at MongoDB World, Charity Majors offered a response better than any MongoDB employee could have conjured up. Charity is the Production Engineer at Parse (now part of Facebook), a mobile backend-as-a-service that “allows you to build fully featured mobile apps without having to sully your pure mind with things like data models and indexes.” And it runs on MongoDB.

Parse runs and scales more apps than even the most sprawling enterprises. 270,000 apps. The number of developers using Parse is growing at 250% per year; the number of API requests is growing at 500% annually. They serve every kind of workload under the sun. “We don’t know what any app’s workload is going to look like, and neither do the developers,” says Charity. And as Parse goes, so goes MongoDB.

The diversity of workloads that Parse runs on MongoDB is testament to the canonical ‘general purpose’ argument. With 270,000 different apps running on MongoDB, it should be clear that you can use it for, at the very least, a lot of different use cases. But Charity’s rousing speech offered an implicit response to the use case question cited above. ‘What do I use this for?’ leads to a different but arguably more important question that Charity suggests users ask themselves: ‘What can I not use this for?’ That is, when choosing a technology -- especially a database -- users should be looking for the most reusable solution.

“Would you rather run 10 different databases, or run one database in 10 different ways?” Charity asked the audience. “There is no other database on the planet that can run this number of workloads.”

While most companies don’t need a database that works for 270,000 applications at once, every developer, sysadmin, DBA, startup, and Fortune 500 enterprise faces the same questions as Parse but on its own scale. Given a limited amount of time and money, how many databases do you want to learn how to use? How many databases do you want in production? How many vendor relationships do you want to manage? How many integration points do you want to build? Assuming one database works for most if not all use cases, what is that worth to you?

These are all flavors of what we’ll now call ‘The Parse’ question. To those in search of a solution, we suggest you take a look at MongoDB. And if you’re still unsure, we can try to put you in touch with Charity. But no promises -- she’s a bit of a celebrity these days.

========

Some other delectable quotes from Charity’s keynote because...we just couldn’t resist:

“Holy #%& there are so many people here.”

“I’ve been coming to MongoDB conferences for almost 2 years now, and they just keep getting better.”

“Speaking as a highly seasoned operations professional...I hate software for a living, and I’m pretty good at it.”

“Reliability, Flexibility, Automation.” (The 3 things she loves about MongoDB.)

“When I talk about reliability, I’m talking about reliability through resiliency...you should never have to care about the health of any individual nodes, you should only have to care about the health of the service.”

“Scalability is about more than just handling lots of requests really fast. It’s about building systems that don’t scale linearly in terms of the incremental cost to maintain them.”

“The story of operations is the story of dealing with failures. And this is why MongoDB is great, because it protects you from failures. And when your database lets you sleep through the night, how bad can it be?”

“May your nights be boring; may your pagers never ring.”


To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.

MongoDB Performance Optimization with MMS

MongoDB

Cloud

This is a guest post by Michael De Lorenzo, CTO at CMP.LY.

CMP.LY is a venture-funded startup offering social media monitoring, measurement, insight and compliance solutions. Our clients include Fortune 100 financial services, automotive, and consumer packaged goods companies, as well as leading brands and advertising, social and PR agencies.

Our patented monitoring, measurement and insight (MMI) tool, CommandPost provides real-time, actionable insights into social performance. Its structured methodology, unified cross-platform reporting and direct channel monitoring capabilities ensure marketers can capture and optimize the real value of their engagement, communities and brand advocates. All of CommandPost’s products have built-in compliance solutions including plain language disclosure URLs (such as rul.es, ter.ms, disclosur.es, paid-po.st, sponsored-po.st and many others).

MongoDB at CMP.LY

At CMP.LY, MongoDB provides the data backbone for all of CommandPost’s social media monitoring services. Our monitoring services collect details about social media content across multiple platforms and all engagements with that content, and build profiles around each user who engages. This amounts to thousands of writes hitting our MongoDB replica set every second across multiple collections. While our monitoring services are writing new and updated data to the database in the background, our clients are consuming the same data in real time via our dashboards, from those same collections.

More Insights Mean More Writes

With the launch of CommandPost, we expanded the number of data points our monitoring services collected and enhanced analysis of those we were already collecting. These changes saw our MongoDB deployment come under a heavier load than we had previously seen - especially in terms of the number of writes performed.

Increasing the number of data points collected also meant we had more interesting data for our clients to access. From a database perspective, this meant more reads for our system to handle. However, it appeared we had a problem - our API was slower than ever in returning the data clients requested.

We had been diligent about adding indexes and making sure the most frequent client-facing queries were covered, but reads were still terribly slow. We turned to our MongoDB Management Service dashboard for clues as to why.

MongoDB Management Service

By turning to MMS, we knew we would have a reliable source of insight into what our database was doing both before and after our updates to CommandPost. Most (if not all) of the stats and charts we typically pay attention to in MMS looked normal for our application and MongoDB deployment. As we worked our way through each metric, we finally came across one that had changed significantly: Lock Percentage.

Since releasing the latest updates to CommandPost, our deployment’s primary client-serving database saw its lock percentage jump from about 10% to a constant rate of 150-175%. This was a huge jump with a very negative impact on our application - API requests timed out, queries took minutes to complete and our client-facing applications became nearly unusable.

Why is Lock Percentage important?

A quick look at how MongoDB handles concurrency tells us exactly why Lock Percentage became so important for us.

MongoDB uses a readers-writer lock that allows concurrent reads access to a database but gives exclusive access to a single write operation. When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock.

Locks are “writer greedy,” which means writes have preference over reads. When both a read and write are waiting for a lock, MongoDB grants the lock to the write. As of version 2.2, MongoDB implements locks at a per-database granularity for most read and write operations.

The “greediness” of our writes was not only keeping our clients from being able to access data (in any of our collections), but also causing additional writes to be delayed.
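
Outside of MMS, the same lock pressure can be spot-checked from the shell or the command line. A minimal sketch (the exact field names vary somewhat between MongoDB versions):

// From the mongo shell: global lock counters and queued readers/writers
db.serverStatus().globalLock

// Per-database lock statistics (MongoDB 2.2+)
db.serverStatus().locks

// From the command line: the "locked db" column shows the busiest
// database and its lock percentage, sampled here every 5 seconds
mongostat 5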

Strategies to Reduce Lock Contention

Once we identified the collections most affected by the locking, we identified three possible remedies to the issue and worked to apply all of them.

Schema Changes

The collection that saw our greatest load (in terms of writes and reads) originally contained a few embedded documents and arrays that tended to make updating documents hard. We took steps to denormalize our schema and, in some cases, customized the _id attribute. Denormalization allowed us to model our data for atomic updates. Customizing the _id attribute allowed us to simplify our writes without additional queries or indexes by leveraging the existing index on the document’s _id attribute. Enabling atomic updates allowed us to simplify our application code and reduce the time spent holding the write lock.
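
As a rough illustration of the kind of restructuring described above (the collection and field names here are hypothetical, not CMP.LY’s actual schema), a denormalized document keyed by a custom _id can be updated atomically in a single statement:

// Hypothetical engagement-stats document whose custom _id (post id + day)
// doubles as the query key, so no extra index or lookup is needed
db.engagement_stats.update(
    { _id: "post123:2014-05-08" },
    {
        $inc: { likes: 1, total_engagements: 1 },   // atomic counter updates
        $set: { last_engaged_at: new Date() }
    },
    { upsert: true }                                // create the document on first write
);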

Use of Message Queues

To manage the flow of data, we refactored some writes to be managed using a Publish-Subscribe pattern. We chose to use Amazon’s SQS service to do this, but you could just as easily use Redis, Beanstalkd, IronMQ or any other message queue.

By implementing message queuing to control the flow of writes, we were able to spread the frequency of writes over a longer period of time. This became crucially important during times where our monitoring services came under higher-than-normal load.
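
As a rough sketch of the producer side of that pattern, using the AWS SDK for Node.js (the queue URL and message shape are illustrative, and CMP.LY’s actual workers may well be written in another language):

// Producer: push a tracking event onto SQS instead of writing to MongoDB directly
var AWS = require('aws-sdk');
var sqs = new AWS.SQS({ region: 'us-east-1' });

function enqueueEvent(event) {
    sqs.sendMessage({
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/metrics-events',  // hypothetical queue
        MessageBody: JSON.stringify(event)
    }, function (err, data) {
        if (err) console.error('failed to enqueue tracking event', err);
    });
}

// A separate worker drains the queue at its own pace and performs the MongoDB writes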

Multiple Databases

We also chose to take advantage of MongoDB’s per database locking by creating and moving write-heavy collections into separate databases. This allowed us to move non-client-facing collections into databases that didn’t need to be accessed by our API and client queries.

Splitting into multiple databases meant that only the database taking on an update needed to be locked, leaving all other databases to remain available to serve client requests.
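
For example (database and collection names here are hypothetical), write-heavy background collections can live in their own database so their write locks never block the client-facing one:

// Client-facing data stays in the primary application database
var appDb = db.getSiblingDB('commandpost');

// Write-heavy, non-client-facing collections get their own database;
// only 'raw_events' takes the write lock for this insert, so reads
// against 'commandpost' are unaffected
var rawEventsDb = db.getSiblingDB('raw_events');
rawEventsDb.engagements.insert({ post_id: "post123", type: "like", ts: new Date() });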

How did things change?

The aforementioned changes yielded immediate results. The results were so drastic that many of our users commented to us that the application seemed faster and performed better. It wasn’t their imaginations - as you can see from the “after” Lock Percentage chart below, we reduced the value to about 50% on our primary client-serving database.

What’s Next?

In working with MongoDB Technical Services, we also identified one more strategy we intend to implement to further reduce our Lock Percentage: sharding. Sharding will allow us to horizontally scale our write workload across multiple servers and easily add capacity to meet our performance targets.

We’re excited about the possibility of not just improving the performance of our MongoDB deployment, but offering our users faster access to their data and a better overall experience using CommandPost.

If you want to learn more about how to use MongoDB Management Service to identify potential issues with your MongoDB deployment, keep it healthy and keep your application running smoothly, attend my talk “Performance Tuning on the Fly at CMP.LY” at MongoDB World in New York City, on Tuesday, June 24th at 2:20pm in the New York East Ballroom.

6 Rules of Thumb for MongoDB Schema Design: Part 3

MongoDB

Technical

By William Zola, Lead Technical Support Engineer at MongoDB

This is our final stop in this tour of modeling One-to-N relationships in MongoDB. In the first post, I covered the three basic ways to model a One-to-N relationship. Last time, I covered some extensions to those basics: two-way referencing and denormalization.

Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.

Read part one and part two if you’ve missed them.

Whoa! Look at All These Choices!

So, to recap:

  • You can embed, reference from the “one” side, or reference from the “N” side, or combine a pair of these techniques
  • You can denormalize as many fields as you like into the “one” side or the “N” side

Denormalization, in particular, gives you a lot of choices: if there are 8 candidates for denormalization in a relationship, there are 2^8 (256) different ways to denormalize (including not denormalizing at all). Multiply that by the three different ways to do referencing, and you have over 750 different ways to model the relationship.

Guess what? You are now stuck in the “paradox of choice” – because you have so many potential ways to model a “one-to-N” relationship, your choice of how to model it just got harder. Lots harder.

Rules of Thumb: Your Guide Through the Rainbow

Here are some “rules of thumb” to guide you through these innumerable (but not infinite) choices:

  • One: favor embedding unless there is a compelling reason not to
  • Two: needing to access an object on its own is a compelling reason not to embed it
  • Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
  • Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.
  • Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
  • Six: As always with MongoDB, how you model your data depends – entirely – on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Your Guide To The Rainbow

When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:

  • What is the cardinality of the relationship: is it “one-to-few”, “one-to-many”, or “one-to-squillions”?
  • Do you need to access the object on the “N” side separately, or only in the context of the parent object?
  • What is the ratio of updates to reads for a particular field?

Your main choices for structuring the data are:

  • For “one-to-few”, you can use an array of embedded documents
  • For “one-to-many”, or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
  • For “one-to-squillions”, you should use a “parent-reference” in the document storing the “N” side.

Once you’ve decided on the overall structure of the data, then you can, if you choose, denormalize data across multiple documents, by either denormalizing data from the “One” side into the “N” side, or from the “N” side into the “One” side. You’d do this only for fields that are frequently read, get read much more often than they get updated, and where you don’t require strong consistency, since updating a denormalized value is slower, more expensive, and is not atomic.

Productivity and Flexibility

The upshot of all of this is that MongoDB gives you the ability to design your database schema to match the needs of your application. You can structure your data in MongoDB so that it adapts easily to change, and supports the queries and updates that you need to get the most out of your application.

How Buffer uses MongoDB to power its Growth Platform

MongoDB

Releases

By Sunil Sadasivin, CTO at Buffer

Buffer, powered by experiments and metrics

At Buffer, every product decision we make is driven by quantitative metrics. We have always sought to be lean in our decision making, and one of the core tenets of being lean is launching experimental features early and measuring their impact.

Buffer is a social media tool to help you schedule and space out your posts on social media networks like Twitter, Facebook, Google+ and LinkedIn. We started in late 2010 and, thanks to a keen focus on analytical data, we have now grown to over 1.5 million users and 155k unique active users per month. We’re now responsible for sharing 3 million social media posts a week.

When I started at Buffer in September 2012 we were using a mixture of Google Analytics, Kissmetrics and an internal tool to track our app usage and analytics. We struggled to move fast and effectively measure product and feature usage with these disconnected tools. We didn’t have an easy way to generate powerful reports like cohort analysis charts or measure things like activation segmented by signup sources over time. Third party tracking services were great for us early on, but as we started to dig deeper into our app insights, we realized there was no way around it—we needed to build our own custom metrics and event tracking.

We took the plunge in April 2013 to build our own metrics framework using MongoDB. While we’ve had some bumps and growing pains setting this up, it’s been one of the best decisions we’ve made. We are now in control of all metrics and event tracking and are able to understand what’s going on with our app at a deeper level. Here’s how we use MongoDB to power our metrics framework.

Why we chose MongoDB

At the time we were evaluating datastores, we had no idea what our data would look like. When I started designing our schema, I quickly found that we needed something that would let us change the metrics we track over time and on the fly. Today, I’ll want to measure our signup funnel based on referrals, tomorrow I might want to measure some custom event and associated data that is specific to some future experiment. I needed to plan for the future, and give our developers the power to track any arbitrary data. MongoDB and its dynamic schema made the most sense for us. MongoDB’s super powerful aggregation framework also seemed perfect for creating the right views with our tracking data.
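
To illustrate what that flexibility looks like in practice (the collection and field names here are made up for the example), two events of the same type can carry completely different attributes without any schema migration:

// Both documents live in the same collection even though their shapes differ
db.event.experiment.insert({
    user_id: ObjectId(),
    date: new Date(),
    referral_source: "twitter"          // field added for one experiment
});

db.event.experiment.insert({
    user_id: ObjectId(),
    date: new Date(),
    plan: "business",                    // entirely different fields for another experiment
    groups_created: 3
});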

Our Metrics Framework Architecture

In our app, we’ve set up an AWS SQS queue, and any data we want to track from the app goes immediately to this queue. We use SQS heavily in our app and have found it to be a great tool for managing messaging at high throughput levels. A simple Python worker picks messages off this queue and writes them to our metrics database.

The reason we’ve done this, instead of connecting and writing directly to our metrics MongoDB database, is that we wanted our metrics setup to have absolutely zero impact on application performance. Much like Google Analytics adds no overhead to an application, our event tracking had to do the same. The MongoDB database that stores our events is extremely write heavy, since we track anything we can think of, including every API request, page visited, and Buffer user/profile/post/email created. If, for whatever reason, our metrics db goes down or starts having write locking issues, our users shouldn’t be impacted. Using SQS as a middleman allows tracking data to queue up if any of these issues occur. SQS gives us enough time to figure out what the issue is, fix it, and then process the backlog. Quite a few times in the past year, Amazon’s robust SQS service has saved us from losing data during the maintenance or downtime that comes with building a high-throughput metrics framework from scratch.

We use MongoHQ to host our data. They’ve been super helpful with the challenges of scaling a db like ours. Since our setup is write heavy, we initially set up a 400GB SSD replica set. As of today (May 16) we have 90 collections and are storing over 500 million documents.

We wrote simple client libraries for tracking data in every language that we use (PHP, Python, Java, Node.js, JavaScript, Objective-C). In addition to bufferapp.com, our API, mobile apps and internal tools all plug into this framework.

Tracking events

Our event tracking is super simple. When a developer creates a new event message, our python worker creates a generic event collection (if it doesn’t exist) and stores event data that’s defined by the developer. It will store the user or visitor id, and the date that the event occurred. It’ll also store the user_joined_at date which is useful for cohort analysis.

Here are some examples of event tracking our metrics platform lets us do.

Visitor page views in the app.

Like many other apps, we want to track every visitor that hits our page. There is a bunch of data that we want to store to understand the context around the event. We’d like to know the IP address, the URI they viewed, the user agent they’re using among other data.

Here’s what the tracking would look like in our app written in PHP:

$visit_event = array(
    'visitor_id' => $visitor_id,
    'ip' => $ip_address,
    'uri' => $uri,
    'referrer' => $referrer,
    'user_agent' => $user_agent
);
// track(<metric name>, <metric data>, <operation type>)
$visitor->track('visits', $visit_event, 'event');

Here’s the corresponding result in our MongoDB metrics db:

> db.event.visits.findOne({date:{$gt:ISODate("2014-05-05")}})
{
        "_id" : ObjectId("5366d48148222c37e51a9f31"),
        "domain" : "blog.rafflecopter.com",
        "user_id" : null,
        "ip" : "50.27.200.15",
        "user_joined_at" : null,
        "visitor_id" : ObjectId("5366d48151823c7914450517"),
        "uri" : "",
        "agent" : {
                "platform" : "Windows 7",
                "version" : "34.0.1847.131",
                "browser" : "Chrome"
        },
        "referrer" : "blog.rafflecopter.com/",
        "date" : ISODate("2014-05-05T00:00:01.603Z"),
        "page" : "/"
}

Logging User API calls

We track every API call our clients make to the Buffer API. Essentially, what we’ve done here is create query-able logging for API requests. This has been way more effective than standard web server logs and has allowed us to dig deeper into API bugs and security issues and to understand the load on our API.

db.event.api.findOne()
{
        "_id" : ObjectId("536c1a7648222c105f807212"),
        "endpoint" : {
                "name" : "updates/create"
        },
        "user_id" : ObjectId("50367b2c6ffb36784c000048"),
        "params" : {
                "get" : {
                        "text" : "Sending a test update for the the blog post!",
                        "profile_ids" : [
                                "52f52d0a86b3e9211f000012"
                        ],
                        "media" : ""
                }
        },
        "client_id" : ObjectId("4e9680b8562f7e6b22000000"),
        "user_joined_at" : ISODate("2012-08-23T18:50:20.405Z"),
        "date" : ISODate("2014-05-08T23:59:50.419Z"),
        "ip_address" : "32.163.4.8",
        "response_time" : 414.95399475098
}

Experiment data

With this type of event tracking, our developers are able to track anything by writing a single line of code. This has been especially useful for measuring events specific to a feature experiment. This frictionless process helps keep us lean: we can measure feature usage as soon as a feature is launched. For example, we recently launched a group sharing feature for business customers so that they can group their Buffer social media accounts together. Our hypothesis was that people with several social media accounts prefer to share specific content to subsets of accounts. We wanted to quantifiably validate whether this is something many would use, or whether it’s a niche or power user feature. After a week of testing this out, we had our answer.

This example shows our tracking of our ‘group sharing’ experiment. We wanted to track each group that was created with this new feature. With this, we’re able to track the user, the groups created, the name of the group, and the date it was created.

> db.event.web.group_sharing.create_group.find().pretty()
{
        "_id" : ObjectId("536c07e148022c1069b4ff3d"),
        "via" : "web",
        "user_id" : ObjectId("536bfbea61bb78af76e2a94d"),
        "user_joined_at" : ISODate("2014-05-08T21:49:30Z"),
        "date" : ISODate("2014-05-08T22:40:33.880Z"),
        "group" : {
                "profile_ids" : [
                        "536c074d613b7d9924e1a90f",
                        "536c07c361bb7d732d198f1"
                ],
                "id" : "536c07e156a66a28563f14ec",
                "name" : "Dental"
        }
}

Making sense of the data

We store a lot of tracking data. While it’s great that we’re tracking all this data, there would be no point if we weren’t able to make sense of it. Our goal for tracking this data was to create our own growth dashboard so we can keep track of key metrics, and understand results of experiments. Making sense of the data was one of the most challenging parts of setting up our growth platform.

MongoDB Aggregation

We rely heavily on MongoDB’s aggregation framework. It has been super handy for things like gauging API client requests by hour, response times separated by API endpoint, number of visitors based on referrers, cohort analysis and so much more.

Here’s a simple example of how we use MongoDB aggregation to obtain our average API response times between May 8th and May 9th:

db.event.api.aggregate({
    $match: {
        date: {
            $gt: ISODate("2014-05-08T20:02:33.133Z"),
            $lt: ISODate("2014-05-09T20:02:33.133Z")
        }
    }
}, {
    $group: {
        _id: {
            endpoint: '$endpoint.name'
        },
        avgResponseTime: {
            $avg: '$response_time'
        },
        count: {
            $sum: 1
        }
    }
}, {
    $sort: {
        "count": -1
    }
})
Result:
{
        "result" : [
                {
                        "_id" : {
                                "endpoint" : "profiles/updates_pending"
                        },
                        "avgResponseTime" : 118.69420306241872,
                        "count" : 749800
                },
                {
                        "_id" : {
                                "endpoint" : "updates/create"
                        },
                        "avgResponseTime" : 1597.2882786981013,
                        "count" : 393282
                },
                {
                        "_id" : {
                                "endpoint" : "profiles/updates_sent"
                        },
                        "avgResponseTime" : 281.65717282199824,
                        "count" : 368860
                },
                {
                        "_id" : {
                                "endpoint" : "profiles/index"
                        },
                        "avgResponseTime" : 112.43379622794643,
                        "count" : 323844
                },
                {
                        "_id" : {
                                "endpoint" : "user/friends"
                        },
                        "avgResponseTime" : 559.7830099245549,
                        "count" : 122320
                },
                ...

With the aggregation framework, we have powerful insight into how clients are using our platform, which users are power users and a lot more. We previously created long running scripts to generate our cohort analysis reports. Now we can use MongoDB aggregation for much of this.

Running ETL jobs

We have several ETL jobs that run periodically to power our growth dashboard; this is the core of how we make sense of our data. Some of the more complex reports need this level of processing. For example, the way we measure product activation is whether someone has posted an update within a week of joining. With the way we’ve structured our data, this requires a join across two different collections. All of this processing is done in our ETL jobs. We then upload the results to a separate database, which powers the views in our growth dashboard for faster loading.
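
A simplified sketch of what one of those activation jobs might look like in the mongo shell (the collection and field names are illustrative, and it ignores the fact that in practice the collections live in different databases and the ETL code runs outside the shell):

// For each user who joined in the reporting window, check whether they
// posted an update within a week of joining ("activation")
var windowStart = ISODate("2014-05-01"), windowEnd = ISODate("2014-05-08");
var activated = 0, total = 0;

db.users.find({ joined_at: { $gte: windowStart, $lt: windowEnd } }).forEach(function (user) {
    total += 1;
    var oneWeekLater = new Date(user.joined_at.getTime() + 7 * 24 * 3600 * 1000);
    // Did this user create at least one update within a week of joining?
    if (db.event.update_created.findOne({ user_id: user._id, date: { $lt: oneWeekLater } })) {
        activated += 1;
    }
});

print("activation rate: " + (activated / total));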

Here are some reports on our growth dashboard that are powered by ETL jobs

Scaling Challenges and Pitfalls

We’ve faced a few different challenges and we’ve iterated to get to a point where we can make solid use out of our growth platform. Here are a few pitfalls and examples of challenges that we’ve faced in setting this up and scaling our platform.

Plan for high disk I/O and write throughput.

The DB server size and type play a key role in how quickly we can process and store events. In planning for the future, we knew we’d be tracking quite a lot of data at a fast pace, so a db with high disk write throughput was key for us. We ended up going with a large SSD replica set. This of course really depends on your application and use case. If you use an intermediate datastore like SQS, you can always start small and upgrade db servers when you need to, without any data loss.

We keep an eye on mongostat and SQS queue size almost daily to see how our writes are doing.

One of the good things about an SSD backed DB is that disk reads are much quicker compared to hard disk. This means it’s much more feasible to run ad hoc queries on un-indexed fields. We do this all the time whenever we have a hunch of something to dig into further.

Be mindful of the MongoDB document limit and how data is structured

Our first iteration of schema design was not scalable. True, MongoDB does not perform schema validation but that doesn’t mean it’s not important to think about how data is structured. Originally, we tracked all events in a single user_metrics and visitor_metrics collection. An event was stored as an embedded object in an array in a user document. Our hope was that we wouldn’t need to do any joins and we could effectively segment out tracking data super easily by user.

We had fields as arrays that were unbounded and could grow infinitely causing the document size to grow. For some highly active users (and bots), after a few months of tracking data in this way some documents in this collection would hit the 16MB document limit and fail to write any more. This created various performance issues in processing updates, and in our growth worker and ETL jobs because there were these huge documents transferred over the wire. When this happened we had to move quickly to restructure our data.

Moving to a single collection per event type has been the most scalable and flexible solution.

Reading from secondaries

Some of our ETL jobs read and process a lot of data. If you query documents that haven’t been read or written recently, it is very likely they are no longer in memory and must be fetched from disk. Bringing those results into memory means MongoDB will page out other documents that have been touched recently, which in turn makes subsequent writes to those paged-out documents slower. For this reason, we have set up our ETL and aggregation queries to read only from the secondaries in our replica set, even though they may not be fully consistent with the primary.
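
In the mongo shell, directing those heavy reads at a secondary looks roughly like this (the target collection and query are illustrative):

// Allow reads from secondary members for this shell connection
db.getMongo().setReadPref('secondary');

// Heavy reporting reads now avoid competing with writes on the primary;
// results may lag slightly behind the primary, which is fine for reporting
db.event.api.aggregate([
    { $match: { date: { $gte: ISODate("2014-05-01") } } },
    { $group: { _id: "$endpoint.name", count: { $sum: 1 } } }
]);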

Our secondaries have a high number of faults because of paging due to reading ‘stale’ data

Visualizing results

As I mentioned before, one of the more challenging parts of maintaining our own growth platform is extracting and visualizing the data in a way that makes sense. I can’t say that we’ve arrived at a great solution yet. We’ve put a lot of effort into building out and maintaining our growth dashboard, and creating visualizations is the bottleneck for us today. There is a lot of room to reduce the turnaround time. We have started to experiment with using Stripe’s MoSQL to map results from MongoDB to PostgreSQL and connect with something like Chart.io to make this a bit more seamless. If you’ve come across solid solutions for visualizing event tracking with MongoDB, I’d love to hear about them!

Event tracking for everyone!

We would love to open source our growth platform. It’s something we’re hoping to do later this year. We’ve learned a lot by setting up our own tracking platform. If you have any questions about any of this or would like to have more control of your own event tracking with MongoDB, just hit me up @sunils34

Want to help build out our growth platform? Buffer is looking to grow its growth team and reliability team!

Like what you see? Sign up for the MongoDB Newsletter and get MongoDB updates straight to your inbox

6 Rules of Thumb for MongoDB Schema Design: Part 2

MongoDB

Technical

By William Zola, Lead Technical Support Engineer at MongoDB

This is the second stop on our tour of modeling One-to-N relationships in MongoDB. Last time I covered the three basic schema designs: embedding, child-referencing, and parent-referencing. I also covered the two factors to consider when picking one of these designs:

  • Will the entities on the “N” side of the One-to-N ever need to stand alone?
  • What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?

With these basic techniques under our belt, I can move on to covering more sophisticated schema designs, involving two-way referencing and denormalization.

Intermediate: Two-Way Referencing

If you want to get a little bit fancier, you can combine two techniques and include both styles of reference in your schema, having both references from the “one” side to the “many” side and references from the “many” side to the “one” side.

For an example, let’s go back to that task-tracking system. There’s a “people” collection holding Person documents, a “tasks” collection holding Task documents, and a One-to-N relationship from Person -> Task. The application will need to track all of the Tasks owned by a Person, so we will need to reference Person -> Task.

With the array of references to Task documents, a single Person document might look like this:

db.person.findOne()
{
    _id: ObjectID("AAF1"),
    name: "Kate Monster",
    tasks: [     // array of references to Task documents
        ObjectID("ADF9"), 
        ObjectID("AE02"),
        ObjectID("AE73") 
        // etc
    ]
}

On the other hand, in some other contexts this application will display a list of Tasks (for example, all of the Tasks in a multi-person Project) and it will need to quickly find which Person is responsible for each Task. You can optimize this by putting an additional reference to the Person in the Task document.

db.tasks.findOne()
{
    _id: ObjectID("ADF9"), 
    description: "Write lesson plan",
    due_date:  ISODate("2014-04-01"),
    owner: ObjectID("AAF1")     // Reference to Person document
}

This design has all of the advantages and disadvantages of the “One-to-Many” schema, but with some additions. Putting the extra ‘owner’ reference into the Task document means that it’s quick and easy to find the Task’s owner, but it also means that if you need to reassign the task to another person, you need to perform two updates instead of just one. Specifically, you’ll have to update both the reference from the Person to the Task document, and the reference from the Task to the Person. (And to the relational gurus who are reading this – you’re right: using this schema design means that it is no longer possible to reassign a Task to a new Person with a single atomic update. This is OK for our task-tracking system: you need to consider whether this works for your particular use case.)

Intermediate: Denormalizing With “One-To-Many” Relationships

Beyond just modeling the various flavors of relationships, you can also add denormalization into your schema. This can eliminate the need to perform the application-level join for certain cases, at the price of some additional complexity when performing updates. An example will help make this clear.

Denormalizing from Many -> One

For the parts example, you could denormalize the name of the part into the ‘parts[]’ array. For reference, here’s the version of the Product document without denormalization.

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number: 1234,
    parts : [     // array of references to Part documents
        ObjectID('AAAA'),    // reference to the #4 grommet above
        ObjectID('F17C'),    // reference to a different Part
        ObjectID('D2AA'),
        // etc
    ]
}

Denormalizing would mean that you don’t have to perform the application-level join when displaying all of the part names for the product, but you would have to perform that join if you needed any other information about a part.

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number: 1234,
    parts : [
        { id : ObjectID('AAAA'), name : '#4 grommet' },         // Part name is denormalized
        { id: ObjectID('F17C'), name : 'fan blade assembly' },
        { id: ObjectID('D2AA'), name : 'power switch' },
        // etc
    ]
}

While making it easier to get the part names, this would add just a bit of client-side work to the application-level join:

// Fetch the product document
> product = db.products.findOne({catalog_number: 1234});  
  // Create an array of ObjectID()s containing *just* the part numbers
> part_ids = product.parts.map( function(doc) { return doc.id } );
  // Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id: { $in : part_ids } } ).toArray() ;

Denormalizing saves you a lookup of the denormalized data at the cost of a more expensive update: if you’ve denormalized the Part name into the Product document, then when you update the Part name you must also update every place it occurs in the ‘products’ collection.
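
Concretely, that multi-document update might look like the sketch below, following the document shapes shown above (the new part name is made up); the positional $ operator targets the matching entry in each product’s parts[] array:

// Rename the part itself
db.parts.update(
    { _id : ObjectID('AAAA') },
    { $set : { name : '#4 grommet (steel)' } }
);

// ...and then update every denormalized copy of the name in 'products'
db.products.update(
    { 'parts.id' : ObjectID('AAAA') },
    { $set : { 'parts.$.name' : '#4 grommet (steel)' } },
    { multi : true }
);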

Denormalizing only makes sense when there’s a high ratio of reads to updates. If you’ll be reading the denormalized data frequently, but updating it only rarely, it often makes sense to pay the price of slower updates – and more complex updates – in order to get more efficient queries. As updates become more frequent relative to queries, the savings from denormalization decrease.

For example: assume the part name changes infrequently, but the quantity on hand changes frequently. This means that while it makes sense to denormalize the part name into the Product document, it does not make sense to denormalize the quantity on hand.

Also note that if you denormalize a field, you lose the ability to perform atomic and isolated updates on that field. Just like with the two-way referencing example above, if you update the part name in the Part document, and then in the Product document, there will be a sub-second interval where the denormalized ‘name’ in the Product document will not reflect the new, updated value in the Part document.

Denormalizing from One -> Many

You can also denormalize fields from the “One” side into the “Many” side:

> db.parts.findOne()
{
    _id : ObjectID('AAAA'),
    partno : '123-aff-456',
    name : '#4 grommet',
    product_name : 'left-handed smoke shifter',   // Denormalized from the ‘Product’ document
    product_catalog_number: 1234,                     // Ditto
    qty: 94,
    cost: 0.94,
    price: 3.99
}

However, if you’ve denormalized the Product name into the Part document, then when you update the Product name you must also update every place it occurs in the ‘parts’ collection. This is likely to be a more expensive update, since you’re updating multiple Parts instead of a single Product. As such, it’s significantly more important to consider the read-to-write ratio when denormalizing in this way.

Intermediate: Denormalizing With “One-To-Squillions” Relationships

You can also denormalize the “one-to-squillions” example. This works in one of two ways: you can either put information about the “one” side (from the ‘hosts’ document) into the “squillions” side (the log entries), or you can put summary information from the “squillions” side into the “one” side.

Here’s an example of denormalizing into the “squillions” side. I’m going to add the IP address of the host (from the ‘one’ side) into the individual log message:

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',
    host: ObjectID('AAAB')
}

Your query for the most recent messages from a particular IP address just got easier: it’s now just one query instead of two.

> last_5k_msg = db.logmsg.find({ipaddr : '127.66.66.66'}).sort({time : -1}).limit(5000).toArray()

In fact, if there’s only a limited amount of information you want to store at the “one” side, you can denormalize it ALL into the “squillions” side and get rid of the “one” collection altogether:

> db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',
    hostname : 'goofy.example.com',
}

On the other hand, you can also denormalize into the “one” side. Let’s say you want to keep the last 1000 messages from a host in the ‘hosts’ document. You could use the $each / $slice functionality introduced in MongoDB 2.4 to keep that list sorted, and only retain the last 1000 messages:

The log messages get saved in the ‘logmsg’ collection as well as in the denormalized list in the ‘hosts’ document: that way the message isn’t lost when it ages out of the ‘hosts.logmsgs’ array.


 //  Get log message from monitoring system
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;
  // Get current timestamp
now = new Date()
  // Find the _id for the host I’m updating
host_doc = db.hosts.findOne({ipaddr : log_ip },{_id:1});  // Don’t return the whole document
host_id = host_doc._id;
  // Insert the log message, the parent reference, and the denormalized data into the ‘many’ side
db.logmsg.save({time : now, message : log_message_here, ipaddr : log_ip, host : host_id });
  // Push the denormalized log message onto the ‘one’ side
db.hosts.update( {_id: host_id }, 
        {$push : {logmsgs : { $each:  [ { time : now, message : log_message_here } ],
                           $sort:  { time : 1 },  // Only keep the latest ones 
                           $slice: -1000 }        // Only keep the latest 1000
         }} );

Note the use of the projection specification ( {_id:1} ) to prevent MongoDB from having to ship the entire ‘hosts’ document over the network. By telling MongoDB to only return the _id field, I reduce the network overhead down to just the few bytes that it takes to store that field (plus just a little bit more for the wire protocol overhead).

Just as with denormalizing in the “One-to-Many” case, you’ll want to consider the ratio of reads to updates. Denormalizing the log messages into the Host document makes sense only if log messages are infrequent relative to the number of times the application needs to look at all of the messages for a single host. This particular denormalization is a bad idea if you want to look at the data less frequently than you update it.

Recap

In this post, I’ve covered the additional choices that you have past the basics of embed, child-reference, or parent-reference.

  • You can use bi-directional referencing if it optimizes your schema, and if you are willing to pay the price of not having atomic updates
  • If you are referencing, you can denormalize data either from the “One” side into the “N” side, or from the “N” side into the “One” side

When deciding whether or not to denormalize, consider the following factors:

  • You cannot perform an atomic update on denormalized data
  • Denormalization only makes sense when you have a high read to write ratio

Next time, I’ll give you some guidelines to pick and choose among all of these options.

More Information

This post was updated in January 2015 to include additional resources and updated links.

Sign up for the MongoDB Newsletter to get MongoDB updates right to your inbox

Increasing MMS Security via Two-Factor Authentication

MongoDB

Cloud

As of May 28th, the MongoDB Management Service (MMS) requires Two Factor Authentication (2FA) for all MMS users. Two-factor authentication requires you to know your password and have a physical item that proves your identity. In our implementation, that second factor is your phone. So when you log in, after you enter your password correctly, MMS will prompt you for a code that proves you have your phone.

There are multiple ways to receive a 2FA code in real time:

  • Google Authenticator for Android or Apple iOS on your smartphone. Google Authenticator produces time-based codes that do not require a connection to the internet. You seed the Google Authenticator app by scanning a QR code shown to you by MMS during setup. Once seeded, the Google Authenticator app will show you the current code whenever it is running.
  • Text message to a cellphone number. When you set up your MMS account, you can provide a cell phone number to receive your 2FA codes. Whenever you need to log in, MMS will send you a code via SMS. SMS works well for most users; however, certain network providers and countries may impose delays on SMS messages. If you’re using text messaging, you’ll also need cell service whenever you want to log in to MMS. For example, you may want to log in on an airplane or when traveling internationally. In these cases, Google Authenticator is a good alternative since it does not require a network connection.
  • Voice call to a cell phone. This option is almost exactly like text messaging. When you try to log in, you will get an automated phone call that reads out the 2FA code required to log in.

As a backup, you can also generate recovery codes when setting up 2FA within MMS. These are longer codes that can be used in place of a 2FA code when you don’t have access to a phone or your Google Authenticator app. Each recovery code can be used exactly once, and you should save these codes in a secure place. Additionally, you can re-generate your recovery codes in your Two Factor Authentication link under Settings->Profile in MMS. When you generate new recovery codes, you invalidate previously generated ones.

MMS 2FA requires a little extra work but we believe that it provides a significantly improved level of security to MMS users. If you run into any problems setting up your 2FA, please reach out to the MMS Support team.

Efficient Indexing in MongoDB 2.6

MongoDB

Releases

By Osmar Olivo, Product Manager at MongoDB

One of the most powerful features of MongoDB is its rich indexing functionality. Users can specify secondary indexes on any field, compound indexes, geospatial, text, sparse, TTL, and others. Having extensive indexing functionality makes it easier for developers to build apps that provide rich functionality and low latency.

MongoDB 2.6 introduces a new query planner, including the ability to perform index intersection. Prior to 2.6 the query planner could only make use of a single index for most queries. That meant that if you wanted to query on multiple fields together, you needed to create a compound index. It also meant that if there were several different combinations of fields you wanted to query on, you might need several different compound indexes.

Each index adds overhead to your deployment - indexes consume space, on disk and in RAM, and indexes are maintained during updates, which adds disk IO. In other words, indexes improve the efficiency of many operations, but they also come at a cost. For many applications, index intersection will allow users to reduce the number of indexes they need while still providing rich features and low latency.

In the following sections we will take a deep dive into index intersection and how it can be applied to applications.

An Example - The Phone Book

Let’s take the example of a phone book with the following schema.

{
    FirstName
    LastName
    Phone_Number
    Address
}

If I were to search for “Smith, John” how would I index the following query to be as efficient as possible?

db.phonebook.find({ FirstName : "John", LastName : "Smith" })

I could use an individual index on FirstName and search for all of the “Johns”.

This would look something like ensureIndex( { FirstName : 1 } )

We run this query and we get back 200,000 John Smiths. Looking at the explain() output below however, we see that we scanned 1,000,000 “Johns” in the process of finding 200,000 “John Smiths”.

> db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 1000000,
    "nscanned" : 1000000,
    "nscannedObjectsAllPlans" : 1000101,
    "nscannedAllPlans" : 1000101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2,
    "nChunkSkips" : 0,
    "millis" : 2043,
    "indexBounds" : {
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

How about creating an individual index on LastName?

This would look something like ensureIndex( { LastName : 1 } )

Running this query we get back 200,000 “John Smiths” but our explain output says that we now scanned 400,000 “Smiths”. How can we make this better?

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 400000,
    "nscanned" : 400000,
    "nscannedObjectsAllPlans" : 400101,
    "nscannedAllPlans" : 400101,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 852,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

So we know that there are 1,000,000 “John” entries, 400,000 “Smith” entries, and 200,000 “John Smith” entries in our phonebook. Is there a way that we can scan just the 200,000 we need?

In the case of a phone book this is somewhat simple; since we know that we want it to be sorted by LastName, FirstName, we can create a compound index on them, like the one below.

ensureIndex( { LastName : 1, FirstName : 1 } )

db.phonebook.find({ FirstName : "John", LastName : "Smith"}).explain()
{
    "cursor" : "BtreeCursor LastName_1_FirstName_1",
    "isMultiKey" : false,
    "n" : 200000,
    "nscannedObjects" : 200000,
    "nscanned" : 200000,
    "nscannedObjectsAllPlans" : 200000,
    "nscannedAllPlans" : 200000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 370,
    "indexBounds" : {
        "LastName" : [
            [
                "Smith",
                "Smith"
            ]
        ],
        "FirstName" : [
            [
                "John",
                "John"
            ]
        ]
    },
    "server" : "Oz-Olivo-MacBook-Pro.local:27017"
}

Looking at the explain on this, we see that the index only scanned the 200,000 documents that matched, so we got a perfect hit.

Beyond Compound Indexes

The compound index is a great solution in the case of a phonebook in which we always know how we are going to be querying our data. Now what if we have an application in which users can arbitrarily query for different fields together? We can’t possibly create a compound index for every possible combination because of the overhead imposed by indexes, as we discussed above, and because MongoDB limits you to 64 indexes per collection. Index intersection can really help.

Imagine the case of a medical application which doctors use to filter through patients. At a high level, omitting several details, a basic schema may look something like the below.

{
      Fname
      LName
      SSN
      Age
      Blood_Type
      Conditions : [] 
      Medications : [ ]
      ...
      ...
}

Some sample searches that a doctor/nurse may run on this system would look something like the below.

Find me a Patient with Blood_Type = O under the age of 50

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

Find me all patients over the age of 60 on Medication X

db.patients.find( { Medications : "X", Age : { $gt : 60 } } )

Find me all Diabetic patients on medication Y

db.patients.find( { Conditions : "Diabetes", Medications : "Y" } )

With all of the unstructured data in modern applications, along with the desire to be able to search for things as needed in an ad-hoc way, it can become very difficult to predict usage patterns. Since we can’t possibly create compound indexes for every combination of fields, because we don’t necessarily know what those will be ahead of time, we can try indexing individual fields to try to salvage some performance. But as shown above in our phone book application, this can lead to performance issues in which we pull documents into memory that are not matches.

To avoid the paging of unnecessary data, the new index intersection feature in 2.6 increases the overall efficiency of these types of ad-hoc queries by processing the indexes involved individually and then intersecting the result set to find the matching documents. This means you only pull the final matching documents into memory and everything else is processed using the indexes. This processing will utilize more CPU, but should greatly reduce the amount of IO done for queries where all of the data is not in memory as well as allow you to utilize your memory more efficiently.

For example, looking at the earlier example:

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } )

It is inefficient to find all patients with BloodType: O (which could be millions) and then pull into memory each document to find the ones with age < 50 or vice versa.

Instead, the query planner finds all patients with Blood_Type O using the index on Blood_Type, and all patients with Age < 50 using the index on Age, and then pulls only the intersection of these two result sets into memory. The query planner only needs to fit the relevant subsets of the indexes in memory, instead of pulling in all of the documents. This in turn causes less paging and less thrashing of the contents of memory, which yields better overall performance.
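
If you want to confirm which plan the 2.6 query planner actually chose, you can run explain() on the query just as we did for the phonebook; when the planner chooses to intersect, the reported cursor differs from the single BtreeCursor you saw in the phonebook example:

db.patients.find( { Blood_Type : "O", Age : { $lt : 50 } } ).explain()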

Index intersection allows for much more efficient use of existing RAM, so less total memory will usually be required to fit the working set than previously. Also, if you had several compound indexes made up of different combinations of fields, you may now be able to reduce the total number of indexes on the system. This means storing fewer indexes in memory, as well as better insert/update performance, since fewer indexes must be updated.

As of version 2.6.0, you cannot intersect with geo or text indexes, and you can intersect at most 2 separate indexes with each other. These limitations are likely to change in a future release.

Optimizing Multi-key Indexes

It is also possible to intersect an index with itself in the case of multi-key indexes. Consider the below query:

Find me all patients with Diabetes & High Blood Pressure

db.patients.find( { Conditions : { $all : [ "Diabetes", "High Blood Pressure" ] } } )

In this case we find the result set of all patients with Diabetes and the result set of all patients with High Blood Pressure, and intersect the two to get all patients with both. Again, this requires less memory and IO for better overall performance. As of the 2.6.0 release, an index can intersect with itself up to 10 times.

Do We Still Need Compound Indexes?

To be clear, a compound index will ALWAYS be more performant IF you know what you are going to be querying on and can create the index ahead of time. Furthermore, if your working set fits entirely in memory, then you will not reap any of the benefits of index intersection, as it is primarily aimed at reducing IO. But in the more ad-hoc case, where you cannot predict the shape of the queries and the working set is much larger than available memory, index intersection will automatically take over and choose the most performant path.
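
As a quick sketch, if we knew ahead of time that the Blood_Type-plus-Age query was going to be a hot path, we would still create a compound index for it up front (field names taken from the example schema above):

// A dedicated compound index for a known, frequent query shape.
db.patients.ensureIndex( { Blood_Type : 1, Age : 1 } )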

MongoDB Security Part II: 10 mistakes that can compromise your database

MongoDB

Company

Update: Watch our webinar on Securing Your MongoDB Deployment for further information on this topic.

This is the second in our 2-part series on MongoDB Security by Andreas Nilsson, Lead Security Engineer at MongoDB

This post outlines 10 best practices for hardening your MongoDB deployment. Failure to follow these best practices can lead to the loss of sensitive data and disrupted operations, and it has the potential to put entire companies out of business. These recommendations are based on my experience working with MongoDB users, and building security systems for databases and financial services organizations. Items are ordered by a combination of severity and frequency.

#1 Enable Access Control and Enforce Authentication

Enable access control and specify the authentication mechanism. You can use the default MongoDB authentication mechanism or an existing external framework. Authentication requires that all clients and servers provide valid credentials before they can connect to the system. In clustered deployments, enable authentication for each MongoDB server. See Authentication and Enable Client Access Control.
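
As a minimal sketch (the user name and password below are placeholders): start mongod with authorization enabled, then connect from the same host and create the first user administrator under the localhost exception.

mongod --auth --dbpath /data/db

// From the mongo shell on the same host:
use admin
db.createUser( {
    user : "siteUserAdmin",                       // placeholder name
    pwd : "changeThisPassword",                   // placeholder password
    roles : [ { role : "userAdminAnyDatabase", db : "admin" } ]
} )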

#2 Configure Role-Based Access Control

Create a user administrator first, then create additional users. Create a unique MongoDB user for each person and application that accesses the system. Create roles that define the exact access a set of users needs. Follow the principle of least privilege. Then create users and assign them only the roles they need to perform their operations. A user can be a person or a client application. See Role-Based Access Control and Manage Users and Roles.
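
For illustration only (the database, role, and user names here are hypothetical), a least-privilege setup might look like this in the shell:

use admin
// A custom role limited to basic CRUD on one application database.
db.createRole( {
    role : "recordsReadWrite",
    privileges : [
        { resource : { db : "records", collection : "" },
          actions : [ "find", "insert", "update", "remove" ] }
    ],
    roles : [ ]
} )
// An application user that holds only that role.
db.createUser( {
    user : "recordsApp",
    pwd : "changeThisPassword",
    roles : [ { role : "recordsReadWrite", db : "admin" } ]
} )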

#3 Encrypt Communication

Configure MongoDB to use TLS/SSL for all incoming and outgoing connections. Use TLS/SSL to encrypt communication between the mongod and mongos components of a MongoDB deployment, as well as between all applications and MongoDB. See Configure mongod and mongos for TLS/SSL.
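
As a sketch (the certificate paths and host name are placeholders), requiring TLS/SSL for every connection looks like this on the command line:

# Server: refuse any connection that is not encrypted.
mongod --sslMode requireSSL --sslPEMKeyFile /etc/ssl/mongodb.pem

# Client: connect over TLS/SSL and validate the server certificate.
mongo --ssl --sslCAFile /etc/ssl/ca.pem --host db1.example.net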

#4 Limit Network Exposure

Ensure that MongoDB runs in a trusted network environment and limit the interfaces on which MongoDB instances listen for incoming connections. Allow only trusted clients to access the network interfaces and ports on which MongoDB instances are available. See Security Hardening and the bindIp setting.
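
For example (the addresses are placeholders), binding mongod to localhost and a single private interface rather than to all interfaces:

mongod --bind_ip 127.0.0.1,10.0.0.12 --port 27017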

#5 Audit System Activity

Track access and changes to database configurations and data. MongoDB Enterprise includes a system auditing facility that can record system events (e.g. user operations, connection events) on a MongoDB instance. These audit records permit forensic analysis and allow administrators to verify proper controls. See Auditing and Configure Auditing.
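
In MongoDB Enterprise, auditing is enabled at startup; a sketch with placeholder paths, writing audit events as JSON to a log file:

mongod --dbpath /data/db --auditDestination file --auditFormat JSON --auditPath /var/log/mongodb/audit.json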

#6 Encrypt and Protect Data

Encrypt MongoDB data at rest. Configure the encrypted storage engine, or use application-level or third-party storage encryption.
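
With the encrypted storage engine in MongoDB Enterprise (3.2 and later), encryption at rest can be enabled at startup; a sketch with a placeholder local keyfile (a KMIP key management server is the stronger option for production key management):

mongod --enableEncryption --encryptionKeyFile /etc/mongodb/encryption-keyfile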

#7 Run MongoDB with a Dedicated User

Run MongoDB processes with a dedicated operating system user account. Ensure that the account has permissions to access data but no unnecessary permissions. See Install MongoDB for more information on running MongoDB.
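
On Linux, for example (the paths are placeholders, and package installers typically set this up for you), that might look like:

# Create a locked-down system account, give it ownership of the data
# and log directories only, and run mongod under that account.
sudo useradd --system --shell /usr/sbin/nologin mongodb
sudo chown -R mongodb:mongodb /var/lib/mongodb /var/log/mongodb
sudo -u mongodb mongod --config /etc/mongod.conf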

#8 Run MongoDB with Secure Configuration Options

MongoDB supports the execution of JavaScript code for certain server-side operations: mapReduce, group, and $where. If you do not use these operations, disable server-side scripting by starting the server with the --noscripting option on the command line.

Use only the MongoDB wire protocol on production deployments. Do not enable the settings that turn on the web server interface: net.http.enabled, net.http.JSONPEnabled, and net.http.RESTInterfaceEnabled. Leave these disabled unless required for backwards compatibility; note that the HTTP interface is deprecated as of MongoDB 3.2.

Keep input validation enabled. MongoDB enables input validation by default through the wireObjectCheck setting, which ensures that all documents stored by the mongod instance are valid BSON. See Security Hardening for more information on hardening the MongoDB configuration.
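
Expressed in a mongod configuration file, this hardening might look like the following sketch (security.javascriptEnabled is the configuration-file counterpart of --noscripting, and the net.http block applies only to versions that still ship the HTTP interface):

security:
  javascriptEnabled: false        # disable server-side JavaScript
net:
  wireObjectCheck: true           # keep BSON input validation on (the default)
  http:
    enabled: false                # leave the web server interface off
    JSONPEnabled: false
    RESTInterfaceEnabled: false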

#9 Consider Security Standards Compliance

For applications requiring HIPAA or PCI-DSS compliance, please refer to the MongoDB Security Reference Architecture to learn more about how you can use the key security capabilities to build compliant application infrastructure.

#10 Don’t Ignore Security Best Practices

A guaranteed way to create an insecure system is to ignore the topic altogether, or to hope someone else thinks about it. Before deploying a MongoDB instance with sensitive data, please consult the MongoDB Security Manual and the MongoDB Security Tutorials, and stay conscious of potential threats to your application.

MongoDB Enterprise Advanced provides access to enterprise grade capabilities. It includes all the ease-of-use, broad driver support, and scalability features of MongoDB, while addressing the more demanding security and certification requirements of corporate and government information security environments. To try it out, download an evaluation version of MongoDB Enterprise.

The Leaf in the Wild: Wearable Sensors Connecting “Man’s Best Friend” - Tractive & MongoDB

MongoDB

IoT, Business

Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.

I had the opportunity to sit down with Michael Lettner, CTO of Hardware & Services and Bernhard Wolkerstorfer, Head of Web & Services at Tractive, to discuss how they use MongoDB at their Internet of Things startup.

Tell us a little bit about your company. What are you trying to accomplish? How do you see yourself growing in the next few years?
Tractive is a cool 18-month-old startup designed for pet owners. We extend the concept of the “quantified self” to the quantified pet, enabling owners to monitor their beloved companions through wearable sensor technology.

Our first service was the GPS Pet Tracking device that attaches to the pet’s collar and enables the owner to receive real-time, location-based tracking on their iOS or Android device. Users can also define a safe zone that acts as a virtual fence - whenever the pet leaves the safe zone, a notification is sent to the owner’s device. We have extended our products to include Tractive Motion, which tracks a pet’s activity; owners can compare how much exercise their pet is getting with other pets of the same breed. The Peterest image gallery enables owners to share images and activity with other members of their social network, and Pet Manager can be used to record veterinary appointments, allergies, vaccination schedules and more.

Tractive is currently available in over 70 countries, mainly across Europe and the Middle East, and is now rapidly extending worldwide with our first customers recently added in the USA, Asia, Australia and New Zealand.

Please describe your application using MongoDB.
MongoDB is our primary database - we use it to store all of the data we rely on to deliver our services - from sensor and geospatial data, to activity data, to user data and social sharing. Image data is stored in AWS S3 with its metadata managed by MongoDB.

We also use MongoDB to log all data from our infrastructure, ensuring our service is always available.

Why did you select MongoDB for Tractive? Did you consider other alternatives?

We initially came from a background of using relational databases, but we believed that these were not appropriate tools for managing the diversity of sensor data we would rely on for the Tractive services. In addition, we knew we would be rapidly evolving the functionality of our apps and were concerned the rigidity of the relational data model would constrain our creativity and time to market.

We knew the way forward was a non-relational database, and many of them would give us the flexible data model our app needed. Beyond a dynamic schema, we had additional criteria that guided our ultimate decision:

  • How easily would the database allow us to store and query geospatial data?
  • How well could the database handle time-series and event-based data?
  • What sort of query flexibility did the database offer to support analytics against the data?
  • How easily and quickly could the database scale as our customer base and data volumes grew?
  • Was the database open source?

There are a multitude of key-value, wide column and document databases we could have chosen. There were many that could ingest time-series data quickly, but they lacked the ability to run rich queries against the data in place – instead forcing us to replicate the data to external systems.

Only MongoDB met all of our key criteria: easy to develop against, simple to run in operations, and all without throwing away the type of query functionality we had come to expect from relational databases.

Please describe your MongoDB deployment
We run our MongoDB cluster across three shards with each shard configured as a three-node replica set. This architecture gives us the resilience we need to deliver always-on availability, and enables us to rapidly add shards as our service continues to grow.

The cluster is deployed in a colocation facility with an external service provider.

Our backend is primarily based on Ruby, and we are currently running MongoDB 2.2 in production. We are planning a move to MongoDB 2.6 to take advantage of some of its new capabilities.

Can you share best practices you learned while scaling MongoDB?
For best results, shard before you have to. Get a thorough understanding of your data structures and query patterns. This will help you select a shard key that best suits your applications. If you follow these simple rules, sharding in MongoDB is really simple. It’s automatic and transparent to the developer.
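
As a purely illustrative sketch (the database, collection, and shard key below are invented, not Tractive’s actual schema), sharding a collection of position reports on a device-plus-timestamp key would look like this from the mongo shell:

sh.enableSharding("tracking")                                         // hypothetical database
sh.shardCollection("tracking.positions", { deviceId : 1, ts : 1 })    // hypothetical shard key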

Scaling is of course much more than simply throwing hardware at the database cluster, so we got a lot of benefit from MongoDB’s tooling when optimizing our queries. During development, we used the MongoDB explain operator to ensure good index coverage. We also use the MongoDB Database Profiler to log all slow queries for further analysis and optimization.
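
For reference (the 100 ms threshold and the collection and field names are just examples), turning on the profiler for slow operations and checking a query’s index usage from the shell looks like:

db.setProfilingLevel(1, 100)                             // profile operations slower than 100 ms
db.positions.find( { deviceId : "abc123" } ).explain()   // hypothetical collection and field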

For our analytics queries, we initially used MongoDB’s inbuilt MapReduce, but have since moved to the aggregation framework, which is faster and simpler.
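
A hypothetical sketch of the kind of query that moves cleanly from MapReduce to the aggregation framework (the collection and field names are invented, not Tractive’s schema): total activity per day for one pet.

db.activity.aggregate( [
    { $match : { petId : "abc123" } },
    { $group : { _id : "$day", totalSteps : { $sum : "$steps" } } },
    { $sort : { _id : 1 } }
] )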

Are you using any tools to monitor, manage and back up your MongoDB deployment?
We rely heavily on the MongoDB Management Service application for proactive monitoring of our database cluster. Through MMS alerting we identified a potential issue with replication and were able to rectify it before it caused an outage.

For backups, we currently use mongodump, but are evaluating MMS Backup as this has the potential to extend our disaster recovery capabilities.
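
For completeness, a typical mongodump invocation (the host and output path are placeholders):

mongodump --host db1.example.net --out /backups/nightly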

For overall performance monitoring of our application stack, we use New Relic, which is integrated with the drivers we use.

What business advantage is MongoDB delivering?
As a startup, time to market is key. We could not have got to market as quickly with other databases. MongoDB’s flexible document model and dynamic schema have been essential not only in launching the original service, but now as we evolve our products. Requirements change quickly and we are always adding new features. MongoDB enables us to do that.

As we add more products and features, we add new customers. We need the ability to scale our infrastructure fast. Again MongoDB provides that scalability and operational simplicity we need to focus on the business, rather than the database.

What advice would you give someone who is considering using MongoDB for their next project?
We came from a relational database background and were surprised how easy it was for us in development and ops to transfer that knowledge to MongoDB. That helped us get up and running quickly.

MongoDB schema design is a new concept and requires a change in thinking: from a normalized model that packs data into rows and columns across multiple tables, to a document model that allows embedding of related data into a single object. Developers need to move on from focusing on how data is stored to how it is queried by the application. You need to identify your queries and build your schema from there.

The good news is that there is a wealth of documentation online. The MongoDB blog is a great resource to learn best practices from the community. An example is the awesome post on MongoDB schema design for time series data - this will help anyone managing this type of data in IoT applications.

MongoDB University provides free self-paced training for developers (in multiple languages), administrators and operations staff. There are also some really useful tutorials covering every step of MongoDB replication and sharding.

Our recommendation would be to perform due diligence during your research - ensure you understand your requirements, then download the software and get started in your evaluation.

Wrapping Up
Mike and Bernhard - I’d like to thank you for taking the time to share your experiences with us!