GIANT Stories at MongoDB

Enabling Extreme Agility At The Gap With MongoDB

The Gap's creative director insists that "Fashion is...about instinct and gut reaction." In the competitive world of retail, that "instinct" has been set to fast forward as Gap seeks to outpace fast-fashion retailers and other trends that constantly push Gap and other retailers to meet consumer needs, faster.

As boring as it may seem, Gap's purchase order management system really, really matters in ensuring the company can quickly evolve to meet consumer tastes. Unable to meet business agility requirements with traditional relational databases, Gap uses MongoDB for a wide range of supply chain systems, spanning master data management, inventory and logistics functions, including purchase order management.

Collecting Money From Happy Customers

This is no small feat given Gap's size. The Gap is a global specialty retailer offering clothing, accessories and personal care products for men, women, children and babies. With nearly 134,000 employees, almost 3,200 company-operated stores and an additional 400 franchise stores, The Gap puts itself within reach of fashion-conscious consumers around the world.

And they do, spending over $16 billion annually on Gap's latest track pant, indigo-washed jeans and racerback tanks.

That's both the good news and the bad news, as presented by Gap consultant Ryan Murray at MongoDB World.

Good, because it means Gap, more than anyone else, dresses America and, increasingly, the world.

Bad, because at its scale change can be hard.

Square Pegs, Round Holes And Purchase Orders

Even something simple like a purchase order can have a huge impact on a company like Gap. A purchase order is a rich business object that contains various pieces of information (item type, color, price, vendor information, shipping information, etc.). A purchase order at Gap can be an order to a vendor to produce a certain article of clothing.

The critical thing is that the business thinks about the order as a single entity, while Gap's RDBMS broke up the purchase order into a variety of rows, columns and tables, joined together.

Not very intuitive.
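In a document database, by contrast, the whole purchase order can live in a single document. A hypothetical sketch in the mongo shell (field names are illustrative, not Gap's actual schema):

// One purchase order, one document (illustrative fields only)
db.purchaseOrders.insert({
    poNumber: "PO-2014-001234",
    vendor: { name: "Acme Apparel Co.", contact: "orders@acme.example" },
    items: [
        { sku: "TRACK-PANT-M-BLK", color: "black", quantity: 5000, unitPrice: 7.25 },
        { sku: "RACERBACK-TANK-S-WHT", color: "white", quantity: 12000, unitPrice: 2.10 }
    ],
    shipping: { method: "sea", destination: "West Coast DC" },
    status: "registered",
    createdAt: new Date()
})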

While this may seem like a small thing, as Murray points out, the RDBMS "forced [developers] to shift away from the business concept -- what is a purchase order and what are the business rules and capabilities around it -- and shift gears into 'How do I make this technology work for me and help me solve a business problem?' [mode of thinking]. And that destroys flow."

Developers may be more technical than the rest of us, but Gap wanted its developers helping to build its business, not merely its technology.

Murray continues: "We don't want the developer having to work with the impedance mismatch between the business concept that they're trying to solve for and the technology they're using to solve it."

Enabling Supply Chain Agility By Improving Developer Productivity

As such, Gap realized it needed to evolve how it manages inventory and its vendors. It turned to MongoDB because MongoDB makes it easy to work with data that comes in different shapes and to store it quickly and transparently. MongoDB, in short, helped Gap become much more agile and, hence, far more competitive. One way Gap managed this was by moving from a monolithic application architecture to a microservices-based approach.

The traditional model for building applications has been the large monolith. In this case, that meant the PO system was one big code base that handled everything related to a PO, whether that was handling demand from the planning systems and creating purchase orders, or handling how the purchase orders integrate with other systems and get down to the vendors.

All of those things are actually fairly independent of each other, but the code base to manage it was monstrously big and monolithic.

Instead, Murray and team introduced the concept of the microservice: a service dedicated to one business capability. For example, a microservice could handle notifying vendors, via EDI or another technology, that a new purchase order has been registered. It turns out that MongoDB is perfect for such microservices because it's so simple and lightweight, Murray notes.

Gap uses MongoDB to power these single-capability services and to connect them together. Each service lines up with a business function. Developers can work on separate microservices without bumping into or waiting on each other, as is common in a monolithic architecture. This enables them to be far more productive and to work much faster.

MongoDB As An "Extreme Enabler Of Agile Development"

In this and other ways, Murray lauds MongoDB as “an extreme enabler of agile development”, or iterative development. Waxing rhapsodic, Murray continues:

MongoDB allow[s our developers] to essentially forget about the storage layer that's underneath and just get work done. As the business evolves, the concept of a purchase order as an aggregate concept will also change as they add fields to it. MongoDB gets out of the way. [Developers] drop a collection, start up new code over that database, and MongoDB accepts whatever they throw at it.

Again, developers don't have to stop, break the context of solving the business problem, and then get back to what they're doing. They simply get to focus on the business problem. And so as an agile enabler, as an enabler of developers to work fast and smart, MongoDB is extremely useful.

As just one example, Gap was able to develop this new MongoDB-based purchase order system in just 75 days, a record for the company. In true agile fashion, MongoDB enables Gap to continue to iterate on the system. Five months in, the business wanted to track the life of a purchase order in a dashboard style. With MongoDB, that business requirement turned out to require almost no development effort. Murray and team were able to add new types of purchase orders, have them easily coexist with old purchase orders in the same collection, and keep moving.

Not in months. Or weeks. But rather each day the development team was able to show the business what that feature might look like because of MongoDB's flexibility.
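To make "new purchase orders coexisting with old ones in the same collection" concrete, here is a small hypothetical illustration (again, illustrative fields, not Gap's schema):

// An older purchase order document
db.purchaseOrders.insert({
    poNumber: "PO-2013-000042",
    vendor: { name: "Acme Apparel Co." },
    status: "shipped"
})

// A newer purchase order adds lifecycle events for the dashboard view;
// the two shapes coexist in one collection with no migration or schema change.
db.purchaseOrders.insert({
    poNumber: "PO-2014-001240",
    vendor: { name: "Acme Apparel Co." },
    type: "direct-to-store",
    lifecycle: [ { state: "created", at: new Date() }, { state: "sent-to-vendor", at: new Date() } ],
    status: "in-production"
})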

All of which makes Murray and his team at Gap so happy to work with MongoDB. "Software is ultimately about people," he insists, and giving developers software like MongoDB that they love to use makes them happy and productive.

And agile.


**Sign up to receive videos and content from MongoDB World.**

The Leaf in the Wild: MongoDB at MachineShop

Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.

I had the chance to meet with John Cox, Senior Technology Director at MachineShop, who is running their Internet of Services platform on MongoDB. MachineShop is one of many startups using MongoDB to power the Internet of Things, changing the way developers and organizations engage data to garner insights and connect to their environments.

Tell us a little bit about your company. What are you trying to accomplish? How do you see yourself growing in the next few years?

MachineShop is an on-demand middleware service that simplifies the way organizations build applications, integrate systems and share data within an enterprise and its ecosystem.

MachineShop is uniquely architected to connect with Internet enabled devices, systems and databases and offers an API-oriented approach to aggregating and managing services that engage, enrich, expose and manage data and their underlying sources.

We offer Developers and Organizations access to rich tools, reports and analytics about their services and applications through the MachineShop Services Exchange – a customizable web-based portal that offers hundreds of discrete APIs and services based on the unique roles and permissions of users.

What problem were you trying to solve?

When aggregating disparate data sources to be processed by central business logic and served up through a standard RESTful API, we needed a database solution that could accommodate multi-structured data and give us high throughput. We also needed something that’s easy to scale out as we add customers and ramp up data inputs exponentially. MongoDB has it all in spades. The fact that it’s super easy to spit everything out to our API in JSON is a [very nice] bonus.

Was this a new project or did you migrate from a different database? What was it like to learn MongoDB?

Earlier iterations of MachineShop used a relational database, but the current product was built from the ground up on MongoDB. There was still a small learning curve for the team jumping into MongoDB. It was tiny, though. The prototype for the current product was built entirely in Ruby (Sinatra/Rails). The fact that we used the Mongoid ODM made the transition really easy to understand as a developer. There were a few things we had to get smart on quickly on the system administration side, but honestly it was fairly trivial. (Thank you!)

Did you consider other alternatives, like a relational database or non-relational database?

We considered a few alternatives. It became clear very quickly that we wanted to go with a NoSQL solution. Once we crossed that bridge, MongoDB was just an obvious choice. The barrier to entry was low – both in dollars and technical resources. There are a ton of folks working with it that made finding resources online and building relationships in the local community really easy. It’s really fun to work with great, new technology that’s constantly moving forward. It’s also nice to not be on an island trying to figure it out.

Please describe your MongoDB deployment

Right now we have a pretty small footprint – 3 replica sets and that’s it. It’s fine for the moment. The plan is to move very soon to many shards across a lot of small instances. The idea is that striping the data buys us speed and it’s easy to scale out. We run Ubuntu on AWS for everything. We’re currently using MongoDB 2.4.6 in production.

Are you using any tools to monitor, manage and backup your MongoDB deployment? If so what?

We’re using MMS primarily for monitoring. We also use MongoLab for hosting our production database. They have some pretty good value-add service offerings that we use. We also monitor indirectly through our apps using Scout.

Are you integrating MongoDB with other data analytics, BI or visualization tools like Hadoop? If so, can you share any details?

We have a proof of concept in place with Hadoop for analytics as well as Storm for real-time processing and aggregation. In production we do fairly basic on-the-fly aggregations and MapReduce jobs with data from devices as well as API request metering. The ultimate goal is to make sure that it’s easy to bolt on common BI tools to allow customers to slice and dice however they like.

How are you measuring the impact of MongoDB on your business?

We’ve never measured anything like cost savings directly. With MongoDB we picked a direction and just started running. Using MongoDB never felt like it cost us on any of the metrics you listed. It’s pretty much been smooth as silk. Had we not used MongoDB, I could definitely see where it would have cost us in terms of engineering solutions to problems that we never encountered.

What advice would you give someone who is considering using MongoDB for their next project?

Fear not! Dive in. When engineering solutions we need to make sure we’re using the right tool for any job that we do. MongoDB happens to be a great tool that can be the right one in a LOT of situations. It lets you move fast and treat your data as just data. It’s freaking fast, too. You don’t have to make so many decisions up front. You can experiment and move pieces around as needed.

A couple of things I would recommend specifically:

  • Make sure you have sufficient memory to store your working set (frequently accessed data and indexes). It’s just better. (Google “mongodb working set”; a rough way to check is sketched after this list.)
  • If you’re using some abstraction of data access, pay close attention to performance on aggregation. We ended up sidestepping some of the abstraction to gain performance in this area.
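As a rough sanity check on the working set advice above (a sketch only; the right numbers depend on your access patterns), you can compare data and index sizes against the RAM available to mongod:

// Rough working-set check from the mongo shell
var s = db.stats(1024 * 1024)           // report sizes in MB
print("data size (MB):  " + s.dataSize)
print("index size (MB): " + s.indexSize)
// Indexes plus frequently accessed documents should fit in RAM;
// compare these numbers against the memory available on the host.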

MongoDB's dynamic schema and object-oriented structure make it a great fit for the Internet of Things. See how companies like Enernoc and Bosch are building a more connected world with MongoDB.

eHarmony: 95% Faster Matches with MongoDB

In a captivating presentation, eHarmony CTO Thod Nguyen explained how the world's largest dating site delivered a better experience to its customers, speeding up compatibility matching by 95% and doubling subscriptions after migrating from its relational database to MongoDB.

Access the full recording and slides of Thod Nguyen's talk at MongoDB World.

eHarmony operates in North America, Australia and the United Kingdom. The company has an impressive track record: since launching in 2000, it can claim credit for 1.2 million marriages. Today eHarmony has 55 million users, a number set to grow considerably as the service rolls out to twenty additional countries.

eHarmony applies data science to identify potential partners on its site. At sign-up, each user completes a detailed questionnaire. Sophisticated compatibility models are then run to build a personality profile from the answers. These algorithmic calculations are complemented by additional work based on machine learning and predictive analytics to refine the matching.

Unlike the classic item or keyword search typically used on Google, the matching process that identifies potential partners is bidirectional: it cross-references and scores many attributes, including age, location, education, preferences, income and more.
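To give a flavor of what a multi-attribute candidate query can look like in MongoDB (a hypothetical sketch; the field names and scoring logic are not eHarmony's actual implementation):

// Pull candidate matches on several attributes at once (illustrative only)
db.users.find({
    age: { $gte: 28, $lte: 40 },
    location: "London",
    education: { $in: [ "bachelors", "masters", "phd" ] },
    income: { $gte: 30000 },
    "preferences.smoker": false
})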

In its initial architecture, eHarmony stored all user and match data in a single monolithic database, whose performance degraded as the service grew. eHarmony then separated the match data and stored it in a distributed Postgres database to regain capacity. But when the system reached 3 billion potential matches per day, generating 25 TB of data, a change became imperative. Running a full analysis of the user base took two weeks.

As the data models grew richer and more complex, adjusting the schema required a full dump and reload of the database, causing downtime and added complexity. These were drags on the business, which set out to find a new approach.

It was looking for a database that could meet three requirements:

  • Support for the complex multi-attribute queries at the heart of the matching system
  • A flexible data model able to handle new attributes seamlessly
  • The ability to run on commodity hardware, without adding to the operating costs of a team already managing more than 1,000 servers

eHarmony first evaluated Apache Solr before ruling it out: the matching system required bidirectional, not unidirectional, searches. Apache Cassandra was also considered, but mapping the API to the data model proved too complicated, and read and write performance were unbalanced.

After an extensive evaluation, eHarmony ultimately chose MongoDB. Beyond meeting the three requirements above, eHarmony was also able to lean heavily on the MongoDB community and benefit from the support included with MongoDB Enterprise Advanced.

Thod Nguyen shared some of the key lessons eHarmony learned from migrating to MongoDB:

  • Involve MongoDB engineers early, so the company benefits from their best practices in data modeling, sharding and deployment
  • During testing, use production data and queries. Killing nodes at random helps you understand how the system behaves in various failure scenarios
  • Running in "shadow" mode alongside the relational database lets you characterize performance at scale

Of course, MongoDB is only one component of eHarmony's data management infrastructure. The data science team chose to integrate MongoDB with Hadoop, as well as Apache Spark and R, for predictive analytics.

The ROI of the migration is compelling:

  • Partner matching is 95% faster: analyzing the entire user base now takes 12 hours, down from 2 weeks
  • 30% more communication between potential partners
  • 50% more paying subscribers
  • 60% more unique visits to the site

And that's not all. Beyond the planned rollout to 20 additional countries, eHarmony plans to extend the data science expertise it has built to a new market: job matching, pairing prospective hires with employers. As a first step, the company will add location-based services, taking advantage of MongoDB's support for geospatial indexes and queries. eHarmony's CTO is also looking forward to the pluggable storage engines coming in MongoDB 3.0. The ability to mix multiple storage engines within a single MongoDB cluster will help consolidate search, matching and user data. Whether you are looking for a new partner or a new job, it seems eHarmony has the science and the database to help you in your search.
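For reference, the geospatial support eHarmony plans to lean on looks roughly like this in MongoDB (a generic sketch, not eHarmony's implementation):

// Store a GeoJSON point per user and index it
db.users.ensureIndex({ location: "2dsphere" })

// Find users within roughly 10 km of a given point
db.users.find({
    location: {
        $near: {
            $geometry: { type: "Point", coordinates: [ -0.1278, 51.5074 ] },
            $maxDistance: 10000
        }
    }
})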


If you would like to learn more about migrating from a relational database management system (RDBMS) to MongoDB, we invite you to read the following white paper:


MongoDB Takes Center Stage at Ticketmaster

The world leader in selling tickets, Ticketmaster spent more than a decade developing apps extensively on Oracle and MySQL. The ticketing giant recently added MongoDB to the mix to complement existing database technologies with increased flexibility and performance, and decreased costs and time-to-market.

“Database performance and scale are a huge part of what we do, ensuring we can sell tickets 24/7,” said Ed Presz, VP of Database Services at Live Nation/Ticketmaster.

MongoDB currently plays a key role in TM+, Ticketmaster’s newest app covering the secondary, resale market. It will also be used in the future for a new app called Concerts, including venue view, B2B session recovery and client reports. “We’re moving to an agile devops environment and our developers love MongoDB’s ease of deployment and flexibility,” said Presz.

Presz also highly recommends MongoDB’s MMS and has also been pleased with MongoDB’s Enterprise Support. “We were new to MongoDB, about to go into production and we were a bit scared,” he said. “One of the things I was pushing hard for was enterprise support, so we’d have someone we could call. MongoDB’s enterprise support has been fantastic.”

Ticketmaster is a good example of how an organization can benefit both developmentally and operationally from MongoDB.


To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.

MongoDB: A Single Platform for All Financial Data at AHL

AHL, a part of Man Group plc, is a quantitative investment manager based in London and Hong Kong, with over $11.3 billion in assets under management. The company relies on technology like MongoDB to be more agile and therefore gain an edge in the systematic trading space. With MongoDB, AHL can better support its quantitative researchers – or “quants” – to research, construct and deploy new trading models in order to understand how markets behave.

Importantly, AHL didn't embrace MongoDB piecemeal. Once AHL determined that MongoDB could significantly improve its operations, the financial services firm embraced MongoDB across the firm for an array of applications. AHL replaced a range of traditional technologies like relational databases with a single platform built on MongoDB for every type and frequency of financial market data, and for every level of data SLA, including:

  1. Low Frequency Data – MongoDB was 100x faster in retrieving data and also delivered consistent retrieval times. Not only is this more efficient for cluster computation, but it also leads to a more fluid experience for quants, with data ready for them to easily interact with, run analytics on and plot. MongoDB also delivered cost savings by replacing a proprietary parallel file system with commodity SSDs.

  2. Multi-user, Versioned, Interactive Graph-based Computation – This includes 1 terabyte of data representing 10,000 stocks and 20 years of time-series data, so as to help quants come up with trading signals for stock equities. While not a huge quantity of data, MongoDB reduced time to recompute trading models from hours to minutes, accelerated quants’ ability for interactive research, and enabled read/write performance of 600MB of data in less than 1 second.

  3. Tick Data – Used to capture all market activity, such as price changes for a security, at up to 150,000 ticks per second and including 30 terabytes of historic data. MongoDB quickly scaled to 250 million ticks per second, a 25X improvement in tick throughput (with just two commodity machines!) that enabled quants to fit models 25X as fast. AHL also cut disk storage down to a mere 40% of their previous solution, and realized a 40X cost savings.

The result?

According to Gary Collier, AHL’s Technology Manager: “Happy developers. Happy accountants.”

See Gary's presentation at MongoDB World 2014 here.



If you want to learn more about the business benefits a real company realized with a single view built on MongoDB, you can download a white paper to read about MetLife’s single view of the customer.

READ ABOUT METLIFE NOW

Best Of Both Worlds: Genentech Accelerates Drug Research With MongoDB & Oracle

“Every day we can reduce the time it takes to introduce a new drug can have a big difference on our patients,” said Doug Garrett, Software Engineer at Genentech.

Genentech Research and Early Development (gRED) develops drugs for significant unmet medical needs. A critical component of this effort is the ability to provide investigators with new genetic strains of animals so as to understand the cause of diseases and to test new drugs.

As genetic testing has both increased and become more complex, Genentech has focused on redeveloping the Genetic Analysis Lab system to reduce the time needed to introduce new lab instruments.

MongoDB is at the heart of this initiative, which captures the variety of data generated by genetic tests and integrates it with Genentech's existing Oracle RDBMS environment. MongoDB’s flexible schema and ability to easily integrate with existing Oracle RDBMS has helped Genentech to reduce development from months to weeks or even days, significantly accelerating drug research. “Every day we can reduce the time it takes to introduce a new drug can have a big difference on our patients,” said Doug Garrett, Software Engineer at Genentech.

Previously, the Genentech team needed to change the schema every time they introduced a new lab instrument, which held up research by three to six months, and sometimes even longer. At the same time, the database was becoming more difficult to support and maintain.

The MongoDB redesign delivered immediate results. In just one example, adding a new genetic test instrument (a new loader) had zero impact on the database schema and allowed Genentech to continue with research after just three weeks, instead of the standard three to six-month delay.

MongoDB also makes it possible for Genentech to load more data than in the past, which fits in well with the “collect now, analyze later” model that Garrett noted MongoDB co-founder Dwight Merriman has often suggested.

Said Garrett: “Even if we don’t know if we need the data, the cost is practically zero and we can do it without any programming changes, so why not collect as much as we can?”


To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.

What Do I Use This For? 270,000 Apps

Graham Neray

Company

“Would you rather run 10 different databases, or run one database in 10 different ways?” - Charity Majors, Parse/Facebook.

When introduced to a new technology, would-be users often want to know, ‘So, what should I use this for?’ At MongoDB, we like to say, ‘MongoDB is a general purpose database good for a wide variety of applications.’ While this may be true, it of course doesn’t help the would-be user with her original question: ‘So, what do I use this for?’ Fair enough.

In the evening keynote at MongoDB World, Charity Majors offered a response better than any MongoDB employee could have conjured up. Charity is the Production Engineer at Parse (now part of Facebook), a mobile backend-as-a-service that “allows you to build fully featured mobile apps without having to sully your pure mind with things like data models and indexes.” And it runs on MongoDB.

Parse runs and scales more apps than even the most sprawling enterprises. 270,000 apps. The number of developers using Parse is growing at 250% per year; the number of API requests is growing at 500% annually. They serve every kind of workload under the sun. “We don’t know what any app’s workload is going to look like, and neither do the developers,” says Charity. And as Parse goes, so goes MongoDB.

The diversity of workloads that Parse runs on MongoDB is testament to the canonical ‘general purpose’ argument. With 270,000 different apps running on MongoDB, it should be clear that you can use it for, at the very least, a lot of different use cases. But Charity’s rousing speech offered an implicit response to the use case question cited above. ‘What do I use this for?’ begs a different but arguably more important question that Charity suggests users ask themselves: ‘What can I not use this for?’ That is, when choosing a technology -- especially a database -- users should be looking for the most reusable solution.

“Would you rather run 10 different databases, or run one database in 10 different ways?” Charity asked the audience. “There is no other database on the planet that can run this number of workloads.”

While most companies don’t need a database that works for 270,000 applications at once, every developer, sysadmin, DBA, startup, and Fortune 500 enterprise faces the same questions as Parse but on its own scale. Given a limited amount of time and money, how many databases do you want to learn how to use? How many databases do you want in production? How many vendor relationships do you want to manage? How many integration points do you want to build? Assuming one database works for most if not all use cases, what is that worth to you?

These are all flavors of what we’ll now call ‘The Parse’ question. To those in search of a solution, we suggest you take a look at MongoDB. And if you’re still unsure, we can try to put you in touch with Charity. But no promises -- she’s a bit of a celebrity these days.

========

Some other delectable quotes from Charity’s keynote because...we just couldn’t resist:

“Holy #%& there are so many people here.”

“I’ve been coming to MongoDB conferences for almost 2 years now, and they just keep getting better.”

“Speaking as a highly seasoned operations professional...I hate software for a living, and I’m pretty good at it.”

“Reliability, Flexibility, Automation.” (The 3 things she loves about MongoDB.)

“When I talk about reliability, I’m talking about reliability through resiliency...you should never have to care about the health of any individual nodes, you should only have to care about the health of the service.”

“Scalability is about more than just handling lots of requests really fast. It’s about building systems that don’t scale linearly in terms of the incremental cost to maintain them.”

“The story of operations is the story of dealing with failures. And this is why MongoDB is great, because it protects you from failures. And when your database lets you sleep through the night, how bad can it be?”

“May your nights be boring; may your pagers never ring.”


To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.

MongoDB Performance Optimization with MMS

MongoDB

Cloud

This is a guest post by Michael De Lorenzo, CTO at CMP.LY.

CMP.LY is a venture-funded startup offering social media monitoring, measurement, insight and compliance solutions. Our clients include Fortune 100 financial services, automotive, consumer packaged goods companies, as well as leading brands, advertising, social and PR agencies.

Our patented monitoring, measurement and insight (MMI) tool, CommandPost provides real-time, actionable insights into social performance. Its structured methodology, unified cross-platform reporting and direct channel monitoring capabilities ensure marketers can capture and optimize the real value of their engagement, communities and brand advocates. All of CommandPost’s products have built-in compliance solutions including plain language disclosure URLs (such as rul.es, ter.ms, disclosur.es, paid-po.st, sponsored-po.st and many others).

MongoDB at CMP.LY

At CMP.LY, MongoDB provides the data backbone for all of CommandPost’s social media monitoring services. Our monitoring services collect details about social media content across multiple platforms and all engagements with that content, and build profiles around each user that engages. This amounts to thousands of writes hitting our MongoDB replica set every second across multiple collections. While our monitoring services are writing new and updated data to the database in the background, our clients are consuming the same data in real time via our dashboards from those same collections.

More Insights Mean More Writes

With the launch of CommandPost, we expanded the number of data points our monitoring services collected and enhanced analysis of those we were already collecting. These changes saw our MongoDB deployment come under a heavier load than we had previously seen - especially in terms of the number of writes performed.

Increasing the number of data points collected also meant we had more interesting data for our clients to access. From a database perspective, this meant more reads for our system to handle. However, it appeared we had a problem - our API was slower than ever in returning the data clients requested.

We had been diligent about adding indexes and making sure the most frequent client-facing queries were covered, but reads were still terribly slow. We turned to our MongoDB Management Service dashboard for clues as to why.

MongoDB Management Service

By turning to MMS, we knew we would have a reliable source of insight into what our database was doing both before and after our updates to CommandPost. Most (if not all) of the stats and charts we typically pay attention to in MMS looked normal for our application and MongoDB deployment. As we worked our way through each metric, we finally came across one that had changed significantly: Lock Percentage.

Since releasing the latest updates to CommandPost, our deployment’s primary client-serving database saw its lock percentage jump from about 10% to a constant rate of 150-175%. This was a huge jump with a very negative impact on our application - API requests timed out, queries took minutes to complete and our client-facing applications became nearly unusable.

Why is Lock Percentage important?

A quick look at how MongoDB handles concurrency tells us exactly why Lock Percentage became so important for us.

MongoDB uses a readers-writer lock that allows concurrent reads access to a database but gives exclusive access to a single write operation. When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock.

Locks are “writer greedy,” which means writes have preference over reads. When both a read and write are waiting for a lock, MongoDB grants the lock to the write. As of version 2.2, MongoDB implements locks at a per-database granularity for most read and write operations.

The “greediness” of our writes was not only keeping our clients from being able to access data (in any of our collections), but causing additional writes to be delayed.
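For reference, a couple of ways to eyeball this kind of contention on a pre-3.0 deployment like ours (a sketch, not our exact tooling):

// Per-database lock statistics from the mongo shell (MongoDB 2.x)
db.serverStatus().locks    // cumulative lock time, broken out per database
db.currentOp()             // operations currently running or waiting on locks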

Strategies to Reduce Lock Contention

Once we identified the collections most affected by the locking, we identified three possible remedies to the issue and worked to apply all of them.

Schema Changes

The collection that saw our greatest load (in terms of writes and reads) originally contained a few embedded documents and arrays that tended to make updating documents hard. We took steps to denormalize our schema and, in some cases, customized the _id attribute. Denormalization allowed us to model our data for atomic updates. Customizing the _id attribute allowed us to simplify our writes without additional queries or indexes by leveraging the existing index on the document’s _id attribute. Enabling atomic updates allowed us to simplify our application code and reduce the time spent in the write lock.
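A hypothetical sketch of the kind of change this describes (field names are illustrative, not CMP.LY’s actual schema): a natural key as the _id plus atomic update operators, so a single write does all the work.

// Custom _id built from a natural key, so upserts need no extra query or index
db.engagement_stats.update(
    { _id: "twitter:post:123456789" },
    {
        $inc: { likes: 1, engagements: 1 },
        $set: { last_engaged_at: new Date() }
    },
    { upsert: true }
)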

Use of Message Queues

To manage the flow of data, we refactored some writes to be managed using a Publish-Subscribe pattern. We chose to use Amazon’s SQS service to do this, but you could just as easily use Redis, Beanstalkd, IronMQ or any other message queue.

By implementing message queuing to control the flow of writes, we were able to spread the frequency of writes over a longer period of time. This became crucially important during times where our monitoring services came under higher-than-normal load.

Multiple Databases

We also chose to take advantage of MongoDB’s per database locking by creating and moving write-heavy collections into separate databases. This allowed us to move non-client-facing collections into databases that didn’t need to be accessed by our API and client queries.

Splitting into multiple databases meant that only the database taking on an update needed to be locked, leaving all other databases to remain available to serve client requests.
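In the mongo shell, that separation looks roughly like this (database and collection names are hypothetical):

// Placeholders standing in for real ids
var someClientId = ObjectId()
var somePostId = ObjectId()

// Client-facing reads go to one database...
var appDb = db.getSiblingDB("cmply_app")
appDb.posts.find({ client_id: someClientId })

// ...while write-heavy, non-client-facing collections live in a separate
// database, so their write locks never block API reads.
var eventsDb = db.getSiblingDB("cmply_raw_events")
eventsDb.raw_engagements.insert({ post_id: somePostId, date: new Date() })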

How did things change?

The aforementioned changes yielded immediate results. The results were so drastic that many of our users commented to us that the application seemed faster and performed better. It wasn’t their imaginations - as you can see from the “after” Lock Percentage chart below, we reduced the value to about 50% on our primary client-serving database.

What’s Next?

In working with MongoDB Technical Services, we also identified one more strategy we intend to implement to further reduce our Lock Percentage: sharding. Sharding will allow us to horizontally scale our write workload across multiple servers and easily add additional capacity to meet our performance targets.
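A minimal sketch of what enabling sharding for one of these write-heavy collections might look like (database, collection and shard key are assumptions, not our final design):

// Run against a mongos once the sharded cluster is in place
sh.enableSharding("cmply_raw_events")
sh.shardCollection("cmply_raw_events.raw_engagements", { post_id: 1, date: 1 })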

We’re excited about the possibility of not just improving the performance of our MongoDB deployment, but offering our users faster access to their data and a better overall experience using CommandPost.

If you want to learn more about how to use MongoDB Management Service to identify potential issues with your MongoDB deployment, keep it healthy and keep your application running smoothly, attend my talk “Performance Tuning on the Fly at CMP.LY” at MongoDB World in New York City, on Tuesday, June 24th at 2:20pm in the New York East Ballroom.

6 Rules of Thumb for MongoDB Schema Design: Part 3

MongoDB

Technical

By William Zola, Lead Technical Support Engineer at MongoDB

This is our final stop in this tour of modeling One-to-N relationships in MongoDB. In the first post, I covered the three basic ways to model a One-to-N relationship. Last time, I covered some extensions to those basics: two-way referencing and denormalization.

Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.

Read part one and part two if you’ve missed them.

Whoa! Look at All These Choices!

So, to recap:

  • You can embed, reference from the “one” side, or reference from the “N” side, or combine a pair of these techniques
  • You can denormalize as many fields as you like into the “one” side or the “N” side

Denormalization, in particular, gives you a lot of choices: if there are 8 candidates for denormalization in a relationship, there are 2^8 (256) different ways to denormalize (including not denormalizing at all). Multiply that by the three different ways to do referencing, and you have over 750 different ways to model the relationship.

Guess what? You now are stuck in the “paradox of choice” – because you have so many potential ways to model a “one-to-N” relationship, your choice on how to model it just got harder. Lots harder.

Rules of Thumb: Your Guide Through the Rainbow

Here are some “rules of thumb” to guide you through these innumerable (but not infinite) choices:

  • One: favor embedding unless there is a compelling reason not to
  • Two: needing to access an object on its own is a compelling reason not to embed it
  • Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
  • Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.
  • Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
  • Six: As always with MongoDB, how you model your data depends – entirely – on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Your Guide To The Rainbow

When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:

  • What is the cardinality of the relationship: is it “one-to-few”, “one-to-many”, or “one-to-squillions”?
  • Do you need to access the object on the “N” side separately, or only in the context of the parent object?
  • What is the ratio of updates to reads for a particular field?

Your main choices for structuring the data are:

  • For “one-to-few”, you can use an array of embedded documents
  • For “one-to-many”, or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
  • For “one-to-squillions”, you should use a “parent-reference” in the document storing the “N” side. (All three structures are sketched just after this list.)
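As a quick illustration of those three structures, echoing the examples from parts one and two (hypothetical data, shown in the mongo shell):

// One-to-few: embed the addresses directly in the person document
db.persons.insert({
    name: "Kate Monster",
    addresses: [
        { street: "123 Sesame St", city: "Anytown" },
        { street: "123 Avenue Q", city: "New York" }
    ]
})

// One-to-many: the product references its parts by _id
db.parts.insert({ partno: "123-aff-456", name: "#4 grommet" })
db.products.insert({
    name: "left-handed smoke shifter",
    parts: [ db.parts.findOne({ partno: "123-aff-456" })._id ]
})

// One-to-squillions: each log message carries a parent-reference to its host
db.hosts.insert({ ipaddr: "127.66.66.66" })
db.logmsgs.insert({
    time: new Date(),
    message: "cpu is on fire!",
    host: db.hosts.findOne({ ipaddr: "127.66.66.66" })._id
})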

Once you’ve decided on the overall structure of the data, then you can, if you choose, denormalize data across multiple documents, by either denormalizing data from the “One” side into the “N” side, or from the “N” side into the “One” side. You’d do this only for fields that are frequently read, get read much more often than they get updated, and where you don’t require strong consistency, since updating a denormalized value is slower, more expensive, and is not atomic.

Productivity and Flexibility

The upshot of all of this is that MongoDB gives you the ability to design your database schema to match the needs of your application. You can structure your data in MongoDB so that it adapts easily to change, and supports the queries and updates that you need to get the most out of your application.

How Buffer uses MongoDB to power its Growth Platform

MongoDB

Releases

By Sunil Sadasivin, CTO at Buffer

Buffer, powered by experiments and metrics

At Buffer, every product decision we make is driven by quantitative metrics. We have always sought to be lean in our decision making, and one of the core tenets of being lean is launching experimental features early and measuring their impact.

Buffer is a social media tool to help you schedule and space out your posts on social media networks like Twitter, Facebook, Google+ and LinkedIn. We started in late 2010 and, thanks to a keen focus on analytical data, we have now grown to over 1.5 million users and 155k unique active users per month. We’re now responsible for sharing 3 million social media posts a week.

When I started at Buffer in September 2012 we were using a mixture of Google Analytics, Kissmetrics and an internal tool to track our app usage and analytics. We struggled to move fast and effectively measure product and feature usage with these disconnected tools. We didn’t have an easy way to generate powerful reports like cohort analysis charts or measure things like activation segmented by signup sources over time. Third party tracking services were great for us early on, but as we started to dig deeper into our app insights, we realized there was no way around it—we needed to build our own custom metrics and event tracking.

We took the plunge in April 2013 to build our own metrics framework using MongoDB. While we’ve had some bumps and growing pains setting this up, it’s been one of the best decisions we’ve made. We are now in control of all metrics and event tracking and are able to understand what’s going on with our app at a deeper level. Here’s how we use MongoDB to power our metrics framework.

Why we chose MongoDB

At the time we were evaluating datastores, we had no idea what our data would look like. When I started designing our schema, I quickly found that we needed something that would let us change the metrics we track over time and on the fly. Today, I’ll want to measure our signup funnel based on referrals, tomorrow I might want to measure some custom event and associated data that is specific to some future experiment. I needed to plan for the future, and give our developers the power to track any arbitrary data. MongoDB and its dynamic schema made the most sense for us. MongoDB’s super powerful aggregation framework also seemed perfect for creating the right views with our tracking data.

Our Metrics Framework Architecture

In our app, we’ve set up an AWS SQS queue and any data we want to track from the app goes immediately to this queue. We use SQS heavily in our app and have found it to be a great tool to manage messaging at high throughput levels. We have a simple python worker which picks off messages from this queue and writes them to our metrics database.

The reason why we’ve done this instead of connecting and writing directly to our metrics MongoDB database is because we wanted our metrics setup to have absolutely zero impact on application performance. Much like Google Analytics adds no overhead to an application, our event tracking had to do the same. The MongoDB database that would store our events would be extremely write heavy, since we would be tracking anything we could think of, including every API request, page visited, Buffer user/profile/post/email created, etc. If, for whatever reason, our metrics db goes down or starts having write locking issues, our users shouldn’t be impacted.

Using SQS as a middleman allows tracking data to queue up if any of these issues occur. SQS gives us enough time to figure out what the issue is, fix it and then process the backlog. We’ve had quite a few times in the past year where Amazon’s robust SQS service has saved us from losing any data during the maintenance or downtime that comes with building a high-throughput metrics framework from scratch.

We use MongoHQ to host our data. They’ve been super helpful with any challenges in scaling a db like ours. Since our setup is write heavy, we’ve initially set up a 400GB SSD replica set. As of today (May 16) we have 90 collections and are storing over 500 million documents.

We wrote simple client libraries for tracking data for every language that we use (PHP, Python, Java, NodeJS, Javascript, Objective-C). In addition to bufferapp.com, our API, mobile apps and internal tools all plug into this framework.

Tracking events

Our event tracking is super simple. When a developer creates a new event message, our python worker creates a generic event collection (if it doesn’t exist) and stores event data that’s defined by the developer. It will store the user or visitor id, and the date that the event occurred. It’ll also store the user_joined_at date which is useful for cohort analysis.

Here are some examples of event tracking our metrics platform lets us do.

Visitor page views in the app.

Like many other apps, we want to track every visitor that hits our page. There is a bunch of data that we want to store to understand the context around the event. We’d like to know the IP address, the URI they viewed, the user agent they’re using among other data.

Here’s what the tracking would look like in our app written in PHP:

$visit_event = array(
    'visitor_id' => $visitor_id,
    'ip' => $ip_address,
    'uri' => $uri,
    'referrer' => $referrer,
    'user_agent' => $user_agent
);
// track(<metric name>, <metric data>, <operation type>)
$visitor->track('visits', $visit_event, 'event');

Here’s the corresponding result in our MongoDB metrics db:

> db.event.visits.findOne({date: {$gt: ISODate("2014-05-05")}})
{
        "_id" : ObjectId("5366d48148222c37e51a9f31"),
        "domain" : "blog.rafflecopter.com",
        "user_id" : null,
        "ip" : "50.27.200.15",
        "user_joined_at" : null,
        "visitor_id" : ObjectId("5366d48151823c7914450517"),
        "uri" : "",
        "agent" : {
                "platform" : "Windows 7",
                "version" : "34.0.1847.131",
                "browser" : "Chrome"
        },
        "referrer" : "blog.rafflecopter.com/",
        "date" : ISODate("2014-05-05T00:00:01.603Z"),
        "page" : "/"
}

Logging User API calls

We track every API call our clients make to the Buffer API. Essentially what we’ve done here is create query-able logging for API requests. This has been way more effective than standard web server logs and has allowed us to dig deeper into API bugs, security issues and understanding the load on our API.

db.event.api.findOne()
{
        "_id" : ObjectId("536c1a7648222c105f807212"),
        "endpoint" : {
                "name" : "updates/create"
        },
        "user_id" : ObjectId("50367b2c6ffb36784c000048"),
        "params" : {
                "get" : {
                        "text" : "Sending a test update for the the blog post!",
                        "profile_ids" : [
                                "52f52d0a86b3e9211f000012"
                        ],
                        "media" : ""
                }
        },
        "client_id" : ObjectId("4e9680b8562f7e6b22000000"),
        "user_joined_at" : ISODate("2012-08-23T18:50:20.405Z"),
        "date" : ISODate("2014-05-08T23:59:50.419Z"),
        "ip_address" : "32.163.4.8",
        "response_time" : 414.95399475098
}

Experiment data

With this type of event tracking, our developers are able to track anything by writing a single line of code. This has been especially useful for measuring events specific to a feature experiment. This frictionless process helps keep us lean: we can measure feature usage as soon as a feature is launched. For example, we recently launched a group sharing feature for business customers so that they can group their Buffer social media accounts together. Our hypothesis was that people with several social media accounts prefer to share specific content to subsets of accounts. We wanted to quantifiably validate whether this is something many would use, or whether it’s a niche or power user feature. After a week of testing this out, we had our answer.

This example shows our tracking of our ‘group sharing’ experiment. We wanted to track each group that was created with this new feature. With this, we’re able to track the user, the groups created, the name of the group, and the date it was created.

> db.event.web.group_sharing.create_group.find().pretty()
{
        "_id" : ObjectId("536c07e148022c1069b4ff3d"),
        "via" : "web",
        "user_id" : ObjectId("536bfbea61bb78af76e2a94d"),
        "user_joined_at" : ISODate("2014-05-08T21:49:30Z"),
        "date" : ISODate("2014-05-08T22:40:33.880Z"),
        "group" : {
                "profile_ids" : [
                        "536c074d613b7d9924e1a90f",
                        "536c07c361bb7d732d198f1"
                ],
                "id" : "536c07e156a66a28563f14ec",
                "name" : "Dental"
        }
}

Making sense of the data

We store a lot of tracking data. While it’s great that we’re tracking all this data, there would be no point if we weren’t able to make sense of it. Our goal for tracking this data was to create our own growth dashboard so we can keep track of key metrics, and understand results of experiments. Making sense of the data was one of the most challenging parts of setting up our growth platform.

MongoDB Aggregation

We rely heavily on MongoDB’s aggregation framework. It has been super handy for things like gauging API client requests by hour, response times separated by API endpoint, number of visitors based on referrers, cohort analysis and so much more.

Here’s a simple example of how we use MongoDB aggregation to obtain our average API response times between May 8th and May 9th:

db.event.api.aggregate({
    $match: {
        date: {
            $gt: ISODate("2014-05-08T20:02:33.133Z"),
            $lt: ISODate("2014-05-09T20:02:33.133Z")
        }
    }
}, {
    $group: {
        _id: {
            endpoint: '$endpoint.name'
        },
        avgResponseTime: {
            $avg: '$response_time'
        },
        count: {
            $sum: 1
        }
    }
}, {
    $sort: {
        "count": -1
    }
})
Result:
{
        "result" : [
                {
                        "_id" : {
                                "endpoint" : "profiles/updates_pending"
                        },
                        "avgResponseTime" : 118.69420306241872,
                        "count" : 749800
                },
                {
                        "_id" : {
                                "endpoint" : "updates/create"
                        },
                        "avgResponseTime" : 1597.2882786981013,
                        "count" : 393282
                },
                {
                        "_id" : {
                                "endpoint" : "profiles/updates_sent"
                        },
                        "avgResponseTime" : 281.65717282199824,
                        "count" : 368860
                },
                {
                        "_id" : {
                                "endpoint" : "profiles/index"
                        },
                        "avgResponseTime" : 112.43379622794643,
                        "count" : 323844
                },
                {
                        "_id" : {
                                "endpoint" : "user/friends"
                        },
                        "avgResponseTime" : 559.7830099245549,
                        "count" : 122320
                },
                ...

With the aggregation framework, we have powerful insight into how clients are using our platform, which users are power users and a lot more. We previously created long running scripts to generate our cohort analysis reports. Now we can use MongoDB aggregation for much of this.
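As a sketch of what a cohort-style aggregation can look like on top of the event collections shown above (illustrative only; our real reports are more involved):

// Visits per signup-month cohort (illustrative only)
db.event.visits.aggregate({
    $match: { user_joined_at: { $ne: null } }
}, {
    $group: {
        _id: {
            year: { $year: '$user_joined_at' },
            month: { $month: '$user_joined_at' }
        },
        visits: { $sum: 1 }
    }
}, {
    $sort: { '_id.year': 1, '_id.month': 1 }
})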

Running ETL jobs

We have several ETL jobs that run periodically to power our growth dashboard. This is how we make sense of our core data. Some of the more complex reports need this level of processing. For example, the way we measure product activation is whether someone has posted an update within a week of joining. With the way we’ve structured our data, this requires a join query across two different collections. All of this processing is done in our ETL jobs. We upload the results to a different database, which is used to power the views in our growth dashboard for faster loading.
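A minimal sketch of what that activation check might look like as an application-level join in the mongo shell (collection and field names here are assumptions, not our actual schema):

// For each signup in a window, check whether an update was posted
// within 7 days of joining (illustrative only).
var start = ISODate("2014-05-01T00:00:00Z");
var end = ISODate("2014-05-08T00:00:00Z");
var activated = 0, total = 0;

db.event.signups.find({ date: { $gte: start, $lt: end } }).forEach(function (signup) {
    total += 1;
    var weekLater = new Date(signup.date.getTime() + 7 * 24 * 60 * 60 * 1000);
    var posted = db.event.updates.create.findOne({
        user_id: signup.user_id,
        date: { $gte: signup.date, $lt: weekLater }
    });
    if (posted) { activated += 1; }
});

print("Activation rate: " + (activated / total));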

Here are some reports on our growth dashboard that are powered by ETL jobs

Scaling Challenges and Pitfalls

We’ve faced a few different challenges and we’ve iterated to get to a point where we can make solid use out of our growth platform. Here are a few pitfalls and examples of challenges that we’ve faced in setting this up and scaling our platform.

Plan for high disk I/O and write throughput.

The DB server size and type play a key role in how quickly we can process and store events. In planning for the future we knew that we’d be tracking quite a lot of data at a fast pace, so a db with high disk write throughput was key for us. We ended up going for a large SSD replica set. This of course really depends on your application and use case. If you use an intermediate datastore like SQS, you can always start small, and upgrade db servers when you need to without any data loss.

We keep an eye on mongostat and SQS queue size almost daily to see how our writes are doing.

One of the good things about an SSD backed DB is that disk reads are much quicker compared to hard disk. This means it’s much more feasible to run ad hoc queries on un-indexed fields. We do this all the time whenever we have a hunch of something to dig into further.

Be mindful of the MongoDB document limit and how data is structured

Our first iteration of schema design was not scalable. True, MongoDB does not perform schema validation but that doesn’t mean it’s not important to think about how data is structured. Originally, we tracked all events in a single user_metrics and visitor_metrics collection. An event was stored as an embedded object in an array in a user document. Our hope was that we wouldn’t need to do any joins and we could effectively segment out tracking data super easily by user.

We had fields as arrays that were unbounded and could grow infinitely causing the document size to grow. For some highly active users (and bots), after a few months of tracking data in this way some documents in this collection would hit the 16MB document limit and fail to write any more. This created various performance issues in processing updates, and in our growth worker and ETL jobs because there were these huge documents transferred over the wire. When this happened we had to move quickly to restructure our data.

Moving to a single collection per event type has been the most scalable and flexible solution.
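A simplified before-and-after sketch of that restructuring (hypothetical field names; userId and userJoinedAt stand in for real values):

// Placeholders standing in for real values
var userId = ObjectId()
var userJoinedAt = new Date()

// Before (illustrative): events pushed into an unbounded array on a
// single user_metrics document; this is what ran into the 16MB limit.
db.user_metrics.update(
    { _id: userId },
    { $push: { events: { type: "visit", date: new Date() } } }
)

// After (illustrative): one collection per event type, one small document per event.
db.event.visits.insert({
    user_id: userId,
    date: new Date(),
    user_joined_at: userJoinedAt
})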

Reading from secondaries

Some of our ETL jobs read and process a lot of data. If you end up querying documents that haven’t been read or written to recently, it is very possible this may be stored out of memory and on disk. Querying this data means MongoDB will page out some documents that have been touched recently and bring query results into memory. This will then make writing to that paged out document slower. It’s for this reason that we have set up our ETL and aggregation queries to read only from our secondaries in our replica-set, even though they may not be consistent with the primary.

Our secondaries have a high number of faults because of paging due to reading ‘stale’ data
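One way to express that in the mongo shell (a sketch; our actual jobs set read preferences through the driver):

// Direct this session's reads at secondaries so heavy ETL scans
// don't churn the primary's memory.
db.getMongo().setReadPref("secondaryPreferred")

// Example: a full scan that is safe to run against a secondary
db.event.api.find({ date: { $gt: ISODate("2014-05-01T00:00:00Z") } }).count()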

Visualizing results

As I mentioned before, one of the more challenging parts about maintaining our own growth platform is extracting and visualizing the data in a way that makes a lot of sense. I can’t say that we’ve come to a great solution yet. We’ve put a lot of effort into building out and maintaining our growth dashboard, and creating visualizations is the bottleneck for us today. There is really a lot of room to reduce the turnaround time. We have started to experiment a bit with using Stripe’s MoSQL to map results from MongoDB to PostgreSQL and connect with something like Chart.io to make this a bit more seamless. If you’ve come across some solid solutions for visualizing event tracking with MongoDB, I’d love to hear about it!

Event tracking for everyone!

We would love to open source our growth platform. It’s something we’re hoping to do later this year. We’ve learned a lot by setting up our own tracking platform. If you have any questions about any of this or would like to have more control of your own event tracking with MongoDB, just hit me up @sunils34

Want to help build out our growth platform? Buffer is looking to grow its growth team and reliability team!

Like what you see? Sign up for the MongoDB Newsletter and get MongoDB updates straight to your inbox