A Mobile-First, Cloud-First Stack at Pearson
Pearson, the global online education leader, has a simple yet grand mission: to educate the world; to have 1 billion students around the globe touching their content on a regular basis.
They are growing quickly, especially in emerging markets where the primary way to consume content is via mobile phones. But to reach global users, they need to deploy in a multitude of private and public data centers around the globe. This demands a mobile-first, cloud-first platform, with the underlying goal of improving education efficacy.
In 2018, Pearson will be announcing to the public markets what percentage of revenue is associated with the company’s efficacy. There’s no question; that’s a bold move. As a result, apps have to be built in a way to measure how users are interacting with them.
Front and center in Pearson’s strategy is MongoDB.
With MongoDB, as Pearson CTO Aref Matin told the audience at MongoDB World, Pearson was able to replace a double-digit number of siloed, independent platforms with a consolidated platform that allows it to measure efficacy.
“A platform should be open, usable by all who want to access functionality and services. But it’s not a platform until you’ve opened up APIs to the external world to introduce new apps on top of it,” declared Matin.
A key part of Pearson’s redesigned technology stack, MongoDB proved to be a good fit for a multitude of reasons, including its agility and scalability, document model and ability to perform fast reads and ad hoc queries. Also important to Matin was the ability to capture the growing treasure trove of unstructured data, such as peer-to-peer and social interactions that are increasingly part of education.
So far, Pearson has leveraged MongoDB for use cases such as:
- Identity and access management for 120 million user accounts, with nearly 50 million per day at peak;
- Adaptive learning and analytics to detect, in near real-time, what content is most effective and identify areas for improvement; and
- The Pearson Activity Framework (akin to a “Google DoubleClick” according to Matin), which collects data on how users interact with apps and feeds the analytics engine.
All of this feeds into Matin’s personal vision of increasing the pace of learning.
“Increasing the pace of learning will be a disruptive force,” said Matin. “If you can reduce the length of time spent on educating yourself, you can learn a lot more and not spend as much on it. That will help us be able to really educate the world at a more rapid pace.”
**Sign up to receive videos and content from MongoDB World.**
Enabling Extreme Agility At The Gap With MongoDB
The Gap's creative director insists that "Fashion is...about instinct and gut reaction." In the competitive world of retail, that "instinct" has been set to fast forward as Gap seeks to outpace fast-fashion retailers and other trends that constantly push Gap and other retailers to meet consumer needs, faster.
As boring as it may seem, Gap's purchase order management system really, really matters in ensuring it can quickly evolve to meet consumer tastes. Unable to meet business agility requirements using traditional relational databases, Gap uses MongoDB for a wide range of supply chain systems, spanning master data management, inventory and logistics functions, including purchase order management.
Collecting Money From Happy Customers
This is no small feat given Gap's size. The Gap is a global specialty retailer offering clothing, accessories and personal care products for men, women, children and babies. With nearly 134,000 employees and almost 3,200 company-operated stores and an additional 400 franchise stores, fashion-conscious consumers can find The Gap around the world.
And they do, spending over $16 billion annually on Gap's latest track pant, indigo-washed jeans and racerback tanks.
That's both the good news and the bad news, as presented by Gap consultant Ryan Murray at MongoDB World.
Good, because it means Gap, more than anyone else, dresses America and, increasingly, the world.
Bad, because at its scale change can be hard.
Square Pegs, Round Holes And Purchase Orders
Even something simple like a purchase order can have a huge impact on a company like Gap. A purchase order is a rich business object that contains various pieces of information (item type, color, price, vendor information, shipping information, etc.). A purchase order at Gap can be an order to a vendor to produce a certain article of clothing.
The critical thing is that the business thinks about the order as a single entity, while Gap's RDBMS broke up the purchase order into a variety of rows, columns and tables, joined together.
Not very intuitive.
While this may seem like a small thing, as Murray points out, the RDBMS "forced [developers] to shift away from the business concept-- what is a purchase order and what are the business rules and capabilities around it-- and shift gears into 'How do I make this technology work for me and help me solve a business problem?' [mode of thinking]. And that destroys flow."
Developers may be more technical than the rest of us, but Gap wanted its developers helping to build its business, not merely its technology.
Murray continues: "We don't want the developer having to work with the impedance mismatch between the business concept that they're trying to solve for and the technology they're using to solve it."
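The mismatch Murray describes can be sketched in a few lines of Python. This is an illustration only, not Gap's actual schema: every field name, table name and value below is hypothetical. It holds one purchase order as a single nested document, then flattens the same order into the separate row sets a normalized relational design would typically spread it across.

```python
import json

# Hypothetical purchase order held as one document (all names and values are illustrative).
purchase_order = {
    "po_number": "PO-1001",
    "vendor": {"name": "Example Textiles", "country": "VN"},
    "shipping": {"method": "sea", "destination": "Oakland"},
    "items": [
        {"sku": "JEAN-IND-32", "color": "indigo", "qty": 500, "unit_price": 12.5},
        {"sku": "TANK-RB-M", "color": "white", "qty": 300, "unit_price": 4.0},
    ],
}


def to_relational_rows(po):
    """Split one purchase order document into rows for three hypothetical tables."""
    order_row = {
        "po_number": po["po_number"],
        "shipping_method": po["shipping"]["method"],
        "shipping_destination": po["shipping"]["destination"],
    }
    vendor_row = {"po_number": po["po_number"], **po["vendor"]}
    item_rows = [{"po_number": po["po_number"], **item} for item in po["items"]]
    return {"orders": [order_row], "vendors": [vendor_row], "order_items": item_rows}


def order_total(po):
    """Total cost computed directly from the document, with no joins."""
    return sum(item["qty"] * item["unit_price"] for item in po["items"])


rows = to_relational_rows(purchase_order)
print(json.dumps(purchase_order, indent=2))
print({table: len(table_rows) for table, table_rows in rows.items()})
print(order_total(purchase_order))
```

In the document form, the business object and the stored object have the same shape, so a question such as the order total is answered from one record. In the row form, the same question requires reassembling the order from several tables first.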
Enabling Supply Chain Agility By Improving Developer Productivity
As such, Gap realized it needed to evolve how it manages inventory and its vendors. It turned to MongoDB because it was able to easily make sense of data that comes in different shapes, which it needed to store quickly and transparently in Gap's database. MongoDB, in short, helped Gap become much more agile and, hence, far more competitive. One way Gap managed this was by moving from a monolithic application architecture to a microservices-based approach.
The traditional model has been to build applications as large monoliths. In this case, that meant the PO system was one big code base that handled everything related to a PO, whether that was handling demand from the planning systems and creating those purchase orders or simply handling how the purchase orders actually integrate with other systems and get down to the vendors.
All of those things are actually fairly independent of each other, but the code base to manage it was monstrously big and monolithic.
Instead Murray and team introduced the concept of the microservice, a service dedicated to one business capability. For example, a microservice could handle communicating out to the vendors by EDI or whatever technology that a new purchase order has been registered. It turns out that MongoDB is perfect for such microservices because it's so simple and lightweight, Murray notes.
Gap uses MongoDB to power these single services and to connect them together. Each of these services lines up with a business function. Developers can work on separate microservices without bumping into or waiting on each other, as is common in a monolithic architecture. This enables them to be far more productive and to work much faster.
MongoDB As An "Extreme Enabler Of Agile Development"
In this and other ways, Murray lauds MongoDB as “an extreme enabler of agile development”, or iterative development. Waxing rhapsodic, Murray continues:
MongoDB allow[s our developers] to essentially forget about the storage layer that's underneath and just get work done. As the business evolves, the concept of a purchase order as an aggregate concept will also change as they add fields to it. MongoDB gets out of the way. [Developers] drop a collection, start up new code over that database, and MongoDB accepts whatever they throw at it.
Again, developers don't have to stop, break the context of solving the business problem, and get back to what they're doing. They simply get to focus on the business problem. And so as an agile enabler, as an enabler of developers to work fast and smart, MongoDB is extremely useful.
As just one example, Gap was able to develop this new MongoDB-based purchase order system in just 75 days, a record for the company. In true agile fashion, MongoDB enables Gap to continue to iterate on the system. Five months in, the business wanted a dashboard-style view tracking the life of a purchase order. With MongoDB, that business requirement turned out to require almost no development effort. Murray and team were able to add new types of purchase orders and have them easily coexist with old purchase orders in the same collection and keep moving.
Not in months. Or weeks. But rather each day the development team was able to show the business what that feature might look like because of MongoDB's flexibility.
All of which makes Murray and his team at Gap so happy to work with MongoDB. "Software is ultimately about people," he insists, and giving developers software like MongoDB that they love to use makes them happy and productive.
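The coexistence of old and new purchase order shapes described above can be sketched without a database. In this minimal example a plain Python list stands in for a MongoDB collection, and the `status_history` field and all values are hypothetical, chosen only to show a later document shape living next to an earlier one.

```python
# A plain list stands in for a MongoDB collection; no database is needed to run this.
collection = []

# Original purchase order shape (hypothetical fields).
collection.append({"po_number": "PO-1", "vendor": "Example Textiles", "total": 7450.0})

# A later shape adds lifecycle tracking without touching the older document.
collection.append({
    "po_number": "PO-2",
    "vendor": "Example Denim",
    "total": 1200.0,
    "status_history": [
        {"status": "created", "day": 1},
        {"status": "sent_to_vendor", "day": 2},
    ],
})


def latest_status(po):
    """Return the most recent status, or 'unknown' for documents that predate the field."""
    history = po.get("status_history", [])
    return history[-1]["status"] if history else "unknown"


# A dashboard-style view over both old and new documents.
dashboard = {po["po_number"]: latest_status(po) for po in collection}
print(dashboard)
```

The design choice worth noting is that the reading code, not a schema migration, absorbs the difference between shapes: older documents simply lack the new field and get a default.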
The Leaf in the Wild: MongoDB at MachineShop
Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.
I had the chance to meet with John Cox, Senior Technology Director at MachineShop, who is running their Internet of Services platform on MongoDB. MachineShop is one of many startups that are using MongoDB to power the Internet of Things and are changing the way developers and organizations engage data to garner insights and connect to their environments.
Tell us a little bit about your company. What are you trying to accomplish? How do you see yourself growing in the next few years?
MachineShop is an on-demand middleware service that simplifies the way organizations build applications, integrate systems and share data within an enterprise and its ecosystem.
MachineShop is uniquely architected to connect with Internet enabled devices, systems and databases and offers an API-oriented approach to aggregating and managing services that engage, enrich, expose and manage data and their underlying sources.
We offer Developers and Organizations access to rich tools, reports and analytics about their services and applications through the MachineShop Services Exchange – a customizable web-based portal that offers hundreds of discrete APIs and services based on the unique roles and permissions of users.
What problem were you trying to solve?
When aggregating disparate data sources to be processed by central business logic and served up through a standard RESTful API, we needed a database solution that can accommodate multi-structured data and gives us high-throughput. We also need something that’s easy to scale out as we add customers and ramp up data inputs exponentially. MongoDB has it all in spades. The fact that it’s super easy to spit everything out to our API in JSON is a [very nice] bonus.
Was this a new project or did you migrate from a different database? What was it like to learn MongoDB?
Earlier iterations of MachineShop used a relational database, but the current product was built from the ground up on MongoDB. There was still a small learning curve for the team jumping into MongoDB. It was tiny, though. The prototype for the current product was built entirely in Ruby (Sinatra/Rails). The fact that we used the Mongoid ODM made the transition really easy to understand as a developer. There were a few things we had to get smart on quickly on system admin, but honestly it was fairly trivial. (Thank you!)
Did you consider other alternatives, like a relational database or non-relational database?
We considered a few alternatives. It became clear very quickly that we wanted to go with a NoSQL solution. Once we crossed that bridge, MongoDB was just an obvious choice. The barrier to entry was low – both in dollars and technical resources. There are a ton of folks working with it that made finding resources online and building relationships in the local community really easy. It’s really fun to work with great, new technology that’s constantly moving forward. It’s also nice to not be on an island trying to figure it out.
Please describe your MongoDB deployment.
Right now we’re a pretty small footprint – 3 replica sets and that’s it. It’s fine for the moment. The plan is to move very soon to many shards across a lot of small instances. The idea is that striping the data buys us speed and it’s easy to scale out. We run Ubuntu on AWS for everything. We’re currently using MongoDB 2.4.6 in production.
Are you using any tools to monitor, manage and backup your MongoDB deployment? If so what?
We’re using MMS primarily for monitoring. We also use MongoLab for hosting our production database. They have some pretty good value-add service offerings that we use. We also monitor indirectly through our apps using Scout.
Are you integrating MongoDB with other data analytics, BI or visualization tools like Hadoop? If so, can you share any details?
We have a proof of concept in place with Hadoop for analytics as well as Storm for real-time processing and aggregation. In production we do fairly basic on-the-fly aggregations and MapReduce jobs with data from devices as well as API request metering. The ultimate goal is to make sure that it’s easy to bolt on common BI tools to allow customers to slice and dice however they like.
How are you measuring the impact of MongoDB on your business?
We’ve never measured anything like cost savings directly. With MongoDB we picked a direction and just started running. Using MongoDB never felt like it cost us on any of the metrics you listed. It’s pretty much been smooth as silk. Had we not used MongoDB, I could definitely see where it would cost us in terms of engineering solutions to problems that we never encountered.
What advice would you give someone who is considering using MongoDB for their next project?
Fear not! Dive in. When engineering solutions we need to make sure we’re using the right tool for any job that we do. MongoDB happens to be a great tool that can be the right one in a LOT of situations. It lets you move fast and treat your data as just data. It’s freaking fast, too. You don’t have to make so many decisions up front. You can experiment and move pieces around as needed.
A couple of things I would recommend specifically:
- Make sure you have sufficient memory to store your working set (frequently accessed data and indexes). It’s just better. (Google “mongodb working set”)
- If you’re using some abstraction of data access, pay close attention to performance on aggregation. We ended up sidestepping some of the abstraction to gain performance in this area.
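The first recommendation can be turned into a rough back-of-the-envelope check. The sketch below is an assumption-laden illustration, not a MongoDB sizing formula: the function name, the 20% headroom reserved for everything else on the machine, and all the sizes are hypothetical. It only adds up frequently accessed data and indexes and compares the result with available RAM.

```python
GIB = 1024 ** 3


def fits_in_memory(hot_data_bytes, index_bytes, ram_bytes, headroom=0.2):
    """Return True if the estimated working set fits in RAM after reserving headroom."""
    working_set = hot_data_bytes + index_bytes
    usable = ram_bytes * (1 - headroom)  # headroom fraction is an assumed safety margin
    return working_set <= usable


# Hypothetical figures for illustration only.
print(fits_in_memory(hot_data_bytes=20 * GIB, index_bytes=4 * GIB, ram_bytes=32 * GIB))
print(fits_in_memory(hot_data_bytes=28 * GIB, index_bytes=4 * GIB, ram_bytes=32 * GIB))
```

Real sizes would come from measuring your own deployment; the point of the sketch is only that indexes count toward the working set alongside the hot data.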
MongoDB's dynamic schema and object-oriented structure make it a great fit for the Internet of Things. See how companies like Enernoc and Bosch are building a more connected world with MongoDB.
eHarmony: 95% Faster Matching with MongoDB
In a compelling presentation, eHarmony's CTO, Thod Nguyen, explained how the world's largest dating site delivered a better customer experience by processing matches 95% faster, and doubled its number of subscriptions, after migrating from its relational database to MongoDB.
Access the full recording and slides of Thod Nguyen's session at MongoDB World.
eHarmony operates in North America, Australia and the United Kingdom. The company has a strong track record: since its launch in 2000, it can claim credit for 1.2 million marriages. Today eHarmony has 55 million users, a number set to grow considerably once the service rolls out to twenty more countries.
eHarmony relies on data science to identify potential partners on its site. On signing up, each user is asked to complete a detailed questionnaire. Sophisticated compatibility models are then run to build a personality profile from the answers. These algorithmic calculations are supplemented by additional research, based on machine learning and predictive analytics, to refine the matching.
Unlike the conventional item or keyword search typically used on Google, the matching process that identifies potential partners is bidirectional: it cross-references and scores multiple attributes, including age, location, education level, preferences, income and more.
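To illustrate what "bidirectional" means here, the following minimal Python sketch contrasts a one-way check with a mutual one. It is not eHarmony's matching algorithm: the `Profile` fields, the two attributes used and the sample people are all hypothetical, and a real system would score many more attributes rather than return a yes or no.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    name: str
    age: int
    city: str
    min_age: int  # youngest partner age this user accepts
    max_age: int  # oldest partner age this user accepts
    preferred_cities: frozenset


def accepts(seeker, candidate):
    """One-directional check: does the candidate satisfy the seeker's preferences?"""
    return (
        seeker.min_age <= candidate.age <= seeker.max_age
        and candidate.city in seeker.preferred_cities
    )


def is_mutual_match(a, b):
    """Bidirectional check: each person must satisfy the other's preferences."""
    return accepts(a, b) and accepts(b, a)


alice = Profile("alice", 30, "Austin", 28, 40, frozenset({"Austin", "Dallas"}))
bob = Profile("bob", 35, "Dallas", 25, 32, frozenset({"Austin"}))
carol = Profile("carol", 45, "Austin", 25, 35, frozenset({"Austin"}))

print(is_mutual_match(alice, bob))    # both directions hold
print(accepts(carol, alice))          # carol's preferences admit alice...
print(is_mutual_match(alice, carol))  # ...but alice's age range excludes carol
```

A keyword-style search corresponds to calling `accepts` once; the matching problem requires both directions, which is why a query has to test the candidate's attributes and the candidate's preferences at the same time.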
In its initial architecture, eHarmony stored all user data and matches in a single monolithic database, whose performance degraded as the service grew. eHarmony then split out the match data and stored it in a distributed Postgres database to regain capacity. But when the system reached 3 billion potential matches per day, generating 25 TB of data, change became imperative. A complete analysis of the user base took two weeks.
As the data models grew richer and more complex, adjusting the schema meant fully dumping and reloading the database, which caused downtime and added complexity. These were brakes on the business, and the company went looking for a new approach.
It was looking for a database that could meet three requirements:
- Support for the complex, multi-attribute queries that underpin the matching system
- A flexible data model that can absorb new attributes transparently
- The ability to run on commodity hardware, without adding to the operational overhead of a team already managing more than 1,000 servers
eHarmony first evaluated Apache Solr, then ruled it out: the matching system needed bidirectional rather than unidirectional searches. Apache Cassandra was also considered, but mapping the API to the data model proved too complicated, and read and write performance was imbalanced.
After an extensive evaluation, eHarmony ultimately chose MongoDB. Beyond meeting the three requirements above, eHarmony was also able to draw heavily on the MongoDB community and benefit from the support included in MongoDB Enterprise Advanced.
Thod Nguyen shared some of the main lessons eHarmony learned from its migration to MongoDB:
- Engage MongoDB engineers early, so the company can benefit from their best practices in data modeling, sharding and deployment
- When testing, use production data and queries. Randomly killing nodes helps you understand how the system behaves under various failure conditions
- Running in "shadow" mode alongside the relational database makes it possible to characterize performance at scale.
Of course, MongoDB is only one component of eHarmony's data management infrastructure. The data science team chose to integrate MongoDB with Hadoop, as well as with Apache Spark and R for predictive analytics.
The ROI of the migration is compelling.
- Partner matching is 95% faster. Analyzing the entire user base now takes 12 hours, down from 2 weeks;
- 30% more communication between potential partners;
- 50% more paying subscribers;
- 60% more unique visits to the site.
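As a quick sanity check on the first figure, the two durations quoted in the article (2 weeks before, 12 hours after) can be converted into a percentage reduction and a speed-up factor. The short Python sketch below does only that arithmetic; the variable names are illustrative.

```python
HOURS_PER_WEEK = 7 * 24

before_hours = 2 * HOURS_PER_WEEK  # two weeks for a full analysis, per the article
after_hours = 12                   # twelve hours after the migration, per the article

reduction = 1 - after_hours / before_hours
speedup = before_hours / after_hours

print(f"time reduced by {reduction:.1%}, i.e. about {speedup:.0f}x faster")
```

The result, a reduction of roughly 96% in elapsed time, is in line with the "95% faster" headline.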
And that is not all. In addition to the planned rollout in 20 more countries, eHarmony plans to extend its data science expertise to a new market: job search, that is, matching new hires with potential employers. As a first step, the company will add location-based services, taking advantage of MongoDB's support for geospatial indexes and queries. eHarmony's CTO is also looking forward to the upcoming availability of pluggable storage engines in MongoDB 3.0. The ability to combine multiple storage engines within a MongoDB cluster will help consolidate search, matches and user data. Whether you are looking for a new partner or a new job, it seems eHarmony has the science and the database to help you in your search.
If you would like to learn more about migrating from a relational database management system (RDBMS) to MongoDB, we invite you to read the following white paper:
MongoDB Takes Center Stage at Ticketmaster
The world leader in selling tickets, Ticketmaster spent more than a decade developing apps extensively on Oracle and MySQL. The ticketing giant recently added MongoDB to the mix to complement existing database technologies with increased flexibility and performance, and decreased costs and time-to-market.
“Database performance and scale are a huge part of what we do, ensuring we can sell tickets 24/7,” said Ed Presz, VP of Database Services at Live Nation/Ticketmaster.
MongoDB currently plays a key role in TM+, Ticketmaster’s newest app covering the secondary, resale market. It will also be used in the future for a new app called Concerts, including venue view, B2B session recovery and client reports. “We’re moving to an agile devops environment and our developers love MongoDB’s ease of deployment and flexibility,” said Presz.
Presz also highly recommends MongoDB’s MMS and has also been pleased with MongoDB’s Enterprise Support. “We were new to MongoDB, about to go into production and we were a bit scared,” he said. “One of the things I was pushing hard for was enterprise support, so we’d have someone we could call. MongoDB’s enterprise support has been fantastic.”
Ticketmaster is a good example of how an organization can benefit both developmentally and operationally from MongoDB.
To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.
MongoDB: A Single Platform for All Financial Data at AHL
AHL, a part of Man Group plc, is a quantitative investment manager based in London and Hong Kong, with over $11.3 billion in assets under management. The company relies on technology like MongoDB to be more agile and therefore gain an edge in the systematic trading space. With MongoDB, AHL can better support its quantitative researchers – or “quants” – to research, construct and deploy new trading models in order to understand how markets behave.
Importantly, AHL didn't embrace MongoDB piecemeal. Once AHL determined that MongoDB could significantly improve its operations, the financial services firm embraced MongoDB across the firm for an array of applications. AHL replaced a range of traditional technologies like relational databases with a single platform built on MongoDB for every type and frequency of financial market data, and for every level of data SLA, including:
Low Frequency Data – MongoDB was 100x faster in retrieving data and also delivered consistent retrieval times. Not only is this more efficient for cluster computation, but it also leads to a more fluid experience for quants, with data ready for them to easily interact with, run analytics on and plot. MongoDB also delivered cost savings by replacing a proprietary parallel file system with commodity SSDs.
Multi-user, Versioned, Interactive Graph-based Computation – This includes 1 terabyte of data representing 10,000 stocks and 20 years of time-series data, so as to help quants come up with trading signals for stock equities. While not a huge quantity of data, MongoDB reduced time to recompute trading models from hours to minutes, accelerated quants’ ability for interactive research, and enabled read/write performance of 600MB of data in less than 1 second.
Tick Data – Used to capture all market activity, such as price changes for a security, up to 150,000 per second and including 30 terabytes of historic data. MongoDB quickly scaled to 250 million ticks per second, a 25X improvement in tick throughput (with just two commodity machines!) that enabled quants to fit models 25X as fast. AHL also cut disk storage down to a mere 40% of their previous solution, and realized a 40X cost savings.
According to Gary Collier, AHL’s Technology Manager: “Happy developers. Happy accountants.”
See Gary's presentation at MongoDB World 2014 here.
If you want to learn more about the business benefits a real company realized with a single view built on MongoDB, you can download a white paper to read about MetLife’s single view of the customer.
Best Of Both Worlds: Genentech Accelerates Drug Research With MongoDB & Oracle
“Every day we can reduce the time it takes to introduce a new drug can have a big difference on our patients,” said Doug Garrett, Software Engineer at Genentech.
Genentech Research and Early Development (gRED) develops drugs for significant unmet medical needs. A critical component of this effort is the ability to provide investigators with new genetic strains of animals so as to understand the cause of diseases and to test new drugs.
As genetic testing has both increased and become more complex, Genentech has focused on redeveloping the Genetic Analysis Lab system to reduce the time needed to introduce new lab instruments.
MongoDB is at the heart of this initiative, which captures the variety of data generated by genetic tests and integrates it with Genentech's existing Oracle RDBMS environment. MongoDB’s flexible schema and ability to easily integrate with the existing Oracle RDBMS have helped Genentech to reduce development from months to weeks or even days, significantly accelerating drug research.
Previously, the Genentech team needed to change the schema every time they introduced a new lab instrument, which held up research by three to six months, and sometimes even longer. At the same time, the database was becoming more difficult to support and maintain.
The MongoDB redesign delivered immediate results. In just one example, adding a new genetic test instrument (a new loader) had zero impact on the database schema and allowed Genentech to continue with research after just three weeks, instead of the standard three to six-month delay.
MongoDB also makes it possible for Genentech to load more data than in the past, which fits in well with the “collect now, analyze later” model, something Garrett noted that MongoDB co-founder Dwight Merriman has often suggested.
Said Garrett: “Even if we don’t know if we need the data, the cost is practically zero and we can do it without any programming changes, so why not collect as much as we can?”
What Do I Use This For? 270,000 Apps
“Would you rather run 10 different databases, or run one database in 10 different ways?” - Charity Majors, Parse/Facebook.
When introduced to a new technology, would-be users often want to know, ‘So, what should I use this for?’ At MongoDB, we like to say, ‘MongoDB is a general purpose database good for a wide variety of applications.’ While this may be true, it of course doesn’t help the would-be user with her original question: ‘So, what do I use this for?’ Fair enough.
In the evening keynote at MongoDB World, Charity Majors offered a response better than any MongoDB employee could have conjured up. Charity is the Production Engineer at Parse (now part of Facebook), a mobile backend-as-a-service that “allows you to build fully featured mobile apps without having to sully your pure mind with things like data models and indexes.” And it runs on MongoDB.
Parse runs and scales more apps than even the most sprawling enterprises. 270,000 apps. The number of developers using Parse is growing at 250% per year; the number of API requests is growing at 500% annually. They serve every kind of workload under the sun. “We don’t know what any app’s workload is going to look like, and neither do the developers,” says Charity. And as Parse goes, so goes MongoDB.
The diversity of workloads that Parse runs on MongoDB is testament to the canonical ‘general purpose’ argument. With 270,000 different apps running on MongoDB, it should be clear that you can use it for, at the very least, a lot of different use cases. But Charity’s rousing speech offered an implicit response to the use case question cited above. ‘What do I use this for?’ raises a different but arguably more important question that Charity suggests users ask themselves: ‘What can I not use this for?’ That is, when choosing a technology -- especially a database -- users should be looking for the most reusable solution.
“Would you rather run 10 different databases, or run one database in 10 different ways?” Charity asked the audience. “There is no other database on the planet that can run this number of workloads.”
While most companies don’t need a database that works for 270,000 applications at once, every developer, sysadmin, DBA, startup, and Fortune 500 enterprise faces the same questions as Parse but on its own scale. Given a limited amount of time and money, how many databases do you want to learn how to use? How many databases do you want in production? How many vendor relationships do you want to manage? How many integration points do you want to build? Assuming one database works for most if not all use cases, what is that worth to you?
These are all flavors of what we’ll now call ‘The Parse’ question. To those in search of a solution, we suggest you take a look at MongoDB. And if you’re still unsure, we can try to put you in touch with Charity. But no promises -- she’s a bit of a celebrity these days.
Some other delectable quotes from Charity’s keynote because...we just couldn’t resist:
“Holy #%& there are so many people here.”
“I’ve been coming to MongoDB conferences for almost 2 years now, and they just keep getting better.”
“Speaking as a highly seasoned operations professional...I hate software for a living, and I’m pretty good at it.”
“Reliability, Flexibility, Automation.” (The 3 things she loves about MongoDB.)
“When I talk about reliability, I’m talking about reliability through resiliency...you should never have to care about the health of any individual nodes, you should only have to care about the health of the service.”
“Scalability is about more than just handling lots of requests really fast. It’s about building systems that don’t scale linearly in terms of the incremental cost to maintain them.”
“The story of operations is the story of dealing with failures. And this is why MongoDB is great, because it protects you from failures. And when your database lets you sleep through the night, how bad can it be?”
“May your nights be boring; may your pagers never ring.”
MongoDB Performance Optimization with MMS
This is a guest post by Michael De Lorenzo, CTO at CMP.LY.
CMP.LY is a venture-funded startup offering social media monitoring, measurement, insight and compliance solutions. Our clients include Fortune 100 financial services, automotive and consumer packaged goods companies, as well as leading brands, advertising, social and PR agencies.
Our patented monitoring, measurement and insight (MMI) tool, CommandPost, provides real-time, actionable insights into social performance. Its structured methodology, unified cross-platform reporting and direct channel monitoring capabilities ensure marketers can capture and optimize the real value of their engagement, communities and brand advocates. All of CommandPost’s products have built-in compliance solutions including plain language disclosure URLs (such as rul.es, ter.ms, disclosur.es, paid-po.st, sponsored-po.st and many others).
MongoDB at CMP.LY
At CMP.LY, MongoDB provides the data backbone for all of CommandPost’s social media monitoring services. Our monitoring services collect details about social media content across multiple platforms and all engagements with that content, and build profiles around each user that engages. This amounts to thousands of writes hitting our MongoDB replica set every second across multiple collections. While our monitoring services are writing new and updated data to the database in the background, our clients are consuming the same data in real-time via our dashboards from those same collections.
More Insights Mean More Writes
With the launch of CommandPost, we expanded the number of data points our monitoring services collected and enhanced analysis of those we were already collecting. These changes saw our MongoDB deployment come under a heavier load than we had previously seen - especially in terms of the number of writes performed.
Increasing the number of data points collected also meant we had more interesting data for our clients to access. From a database perspective, this meant more reads for our system to handle. However, it appeared we had a problem - our API was slower than ever in returning the data clients requested.
We had been diligent about adding indexes and making sure the most frequent client-facing queries were covered, but reads were still terribly slow. We turned to our MongoDB Management Service dashboard for clues as to why.
MongoDB Management Service
By turning to MMS, we knew we would have a reliable source to provide insight into what our database was doing both before and after our updates to CommandPost. Most (if not all) of the stats and charts we typically pay attention to in MMS looked normal for our application and MongoDB deployment. As we worked our way through each metric, we finally came across one that had changed significantly: Lock Percentage.
Since releasing the latest updates to CommandPost, our deployment’s primary client-serving database saw its lock percentage jump from about 10% to a constant rate of 150-175%. This was a huge jump with a very negative impact on our application - API requests timed out, queries took minutes to complete and our client-facing applications became nearly unusable.
Why is Lock Percentage important?
A quick look at how MongoDB handles concurrency tells us exactly why Lock Percentage became so important for us.
MongoDB uses a readers-writer lock that allows concurrent read access to a database but gives exclusive access to a single write operation. When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock.
Locks are “writer greedy,” which means writes have preference over reads. When both a read and write are waiting for a lock, MongoDB grants the lock to the write. As of version 2.2, MongoDB implements locks at a per-database granularity for most read and write operations.
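To make the "writer greedy" behavior concrete, here is a minimal sketch of a writer-preferring readers-writer lock in Python. It is a toy model of the idea described above, not MongoDB's actual implementation; the class and method names are invented for illustration.

```python
import threading


class WriterPreferringRWLock:
    """Toy readers-writer lock in which waiting writers block new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    def acquire_read(self):
        with self._cond:
            # "Writer greedy": a reader waits while a writer holds the lock
            # OR while any writer is queued for it.
            while self._writer_active or self._waiting_writers > 0:
                self._cond.wait()
            self._active_readers += 1

    def release_read(self):
        with self._cond:
            self._active_readers -= 1
            if self._active_readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            # A writer needs exclusive access: no readers, no other writer.
            while self._writer_active or self._active_readers > 0:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()

    def can_read_now(self):
        """Report whether a new reader would be admitted immediately."""
        with self._cond:
            return not self._writer_active and self._waiting_writers == 0
```

With a steady stream of writers, `can_read_now()` is rarely true, which is the situation a high lock percentage points to: reads queue up behind writes on the same database.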
The “greediness” of our writes was not only keeping our clients from being able to access data (in any of our collections), but causing additional writes to be delayed.
Strategies to Reduce Lock Contention
Once we identified the collections most affected by the locking, we settled on three possible remedies to the issue and worked to apply all of them.
The collection that saw our greatest load (in terms of writes and reads) originally contained a few embedded documents and arrays that tended to make updating documents hard. We took steps to denormalize our schema and, in some cases, customized the _id attribute. Denormalization allowed us to model our data for atomic updates. Customizing the _id attribute allowed us to simplify our writes without additional queries or indexes by leveraging the existing index on the document’s _id attribute. Enabling atomic updates allowed us to simplify our application code and reduce the time our application spent holding the write lock.
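As a rough illustration of the idea, the sketch below builds the filter and update documents for a single atomic upsert keyed on a custom _id. The key scheme, field names and function name are assumptions made up for this example, not CommandPost's real schema; `$inc` and `$setOnInsert` are standard MongoDB update operators.

```python
def build_engagement_upsert(platform, post_id, likes_delta, shares_delta):
    """Return (filter, update) for one atomic upsert keyed on a custom _id."""
    # Hypothetical key scheme: the _id encodes the natural key, so the
    # default _id index finds the document with no extra query or index.
    doc_id = f"{platform}:{post_id}"
    update = {
        # Counters change in place on the server, in a single operation.
        "$inc": {"likes": likes_delta, "shares": shares_delta},
        # Descriptive fields are written only when the document is created.
        "$setOnInsert": {"platform": platform, "post_id": post_id},
    }
    return {"_id": doc_id}, update


# With PyMongo this pair would be passed as something like
#   collection.update_one(filter_doc, update_doc, upsert=True)
# where `collection` is a hypothetical engagement-counts collection.
filter_doc, update_doc = build_engagement_upsert("twitter", "123", 2, 1)
```

Because the write is one operation against one document, there is no read-modify-write round trip in application code.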
Use of Message Queues
To manage the flow of data, we refactored some writes to be managed using a Publish-Subscribe pattern. We chose to use Amazon’s SQS service to do this, but you could just as easily use Redis, Beanstalkd, IronMQ or any other message queue.
By implementing message queuing to control the flow of writes, we were able to spread the frequency of writes over a longer period of time. This became crucially important during times when our monitoring services came under higher-than-normal load.
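The flow-control idea can be sketched with Python's standard library queue standing in for SQS. This is an in-process simplification under stated assumptions: `drain_in_batches` and `write_batch` are hypothetical names, and a real consumer would poll the queue service and pause between batches to pace the writes.

```python
import queue


def drain_in_batches(pending, write_batch, batch_size):
    """Pull queued writes and apply them in bounded batches."""
    batches = 0
    while True:
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(pending.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return batches
        write_batch(batch)  # e.g. one bulk write to the database
        batches += 1


# Producers (the monitoring services) enqueue writes as fast as they like.
pending_writes = queue.Queue()
for i in range(7):
    pending_writes.put({"post_id": i})

# The consumer decides how much reaches the database at a time.
applied = []
batch_count = drain_in_batches(pending_writes, applied.append, batch_size=3)
```

A burst of seven writes reaches the database as three bounded batches instead of seven immediate operations; the queue absorbs the spike.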
We also chose to take advantage of MongoDB’s per-database locking by creating and moving write-heavy collections into separate databases. This allowed us to move non-client-facing collections into databases that didn’t need to be accessed by our API and client queries.
Splitting into multiple databases meant that only the database taking on an update needed to be locked, leaving all other databases to remain available to serve client requests.
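A small sketch of the routing this implies, assuming per-database locking as described above. The collection and database names are invented for illustration and are not our real layout.

```python
# Hypothetical collection and database names, for illustration only.
DATABASE_FOR_COLLECTION = {
    "posts": "client_facing",
    "profiles": "client_facing",
    "raw_engagements": "ingest",
    "crawl_state": "ingest",
}


def database_for(collection_name):
    """Return the database that holds a collection."""
    return DATABASE_FOR_COLLECTION[collection_name]


def blocks_client_reads(written_collection, client_db="client_facing"):
    """Under per-database locking, a write contends with client reads only
    when it targets the database those reads use."""
    return database_for(written_collection) == client_db
```

Writes to `raw_engagements` lock only the `ingest` database, so dashboard queries against `client_facing` are not queued behind them.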
How did things change?
The aforementioned changes yielded immediate results. The results were so drastic that many of our users commented to us that the application seemed faster and performed better. It wasn’t their imaginations - as you can see from the “after” Lock Percentage chart below, we reduced the value to about 50% on our primary client-serving database.
In working with MongoDB Technical Services, we also identified one more strategy we intend to implement to further reduce our Lock Percentage: sharding. Sharding will allow us to horizontally scale our write workload across multiple servers and easily add additional capacity to meet our performance targets.
We’re excited about the possibility of not just improving the performance of our MongoDB deployment, but offering our users faster access to their data and a better overall experience using CommandPost.
If you want to learn more about how to use MongoDB Management Service to identify potential issues with your MongoDB deployment, keep it healthy and keep your application running smoothly, attend my talk “Performance Tuning on the Fly at CMP.LY” at MongoDB World in New York City, on Tuesday, June 24th at 2:20pm in the New York East Ballroom.
6 Rules of Thumb for MongoDB Schema Design: Part 3
By William Zola, Lead Technical Support Engineer at MongoDB
This is our final stop in this tour of modeling One-to-N relationships in MongoDB. In the first post, I covered the three basic ways to model a One-to-N relationship. Last time, I covered some extensions to those basics: two-way referencing and denormalization.
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
Whoa! Look at All These Choices!
So, to recap:
- You can embed, reference from the “one” side, or reference from the “N” side, or combine a pair of these techniques
- You can denormalize as many fields as you like into the “one” side or the “N” side
Denormalization, in particular, gives you a lot of choices: if there are 10 candidates for denormalization in a relationship, there are 2^10 (1,024) different ways to denormalize (including not denormalizing at all). Multiply that by the three different ways to do referencing, and you have over 3,000 different ways to model the relationship.
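The counting argument can be checked directly: each candidate field is either denormalized or not, so k candidates give 2^k combinations, and ten candidates give 1,024. The short sketch below enumerates them; the field names are placeholders.

```python
from itertools import combinations


def denormalization_choices(fields):
    """Enumerate every subset of candidate fields, including the empty one."""
    return [
        subset
        for size in range(len(fields) + 1)
        for subset in combinations(fields, size)
    ]


candidates = [f"field_{i}" for i in range(10)]  # hypothetical field names
subsets = denormalization_choices(candidates)
total_models = 3 * len(subsets)  # three referencing styles per subset
```

Here `len(subsets)` is 1,024 and `total_models` is 3,072, which is where "over 3,000" comes from.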
Guess what? You now are stuck in the “paradox of choice” – because you have so many potential ways to model a “one-to-N” relationship, your choice on how to model it just got harder. Lots harder.
Rules of Thumb: Your Guide Through the Rainbow
Here are some “rules of thumb” to guide you through these innumerable (but not infinite) choices:
- One: favor embedding unless there is a compelling reason not to
- Two: needing to access an object on its own is a compelling reason not to embed it
- Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
- Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.
- Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
- Six: As always with MongoDB, how you model your data depends – entirely – on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.
Your Guide To The Rainbow
When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:
- What is the cardinality of the relationship: is it “one-to-few”, “one-to-many”, or “one-to-squillions”?
- Do you need to access the object on the “N” side separately, or only in the context of the parent object?
- What is the ratio of updates to reads for a particular field?
Your main choices for structuring the data are:
- For “one-to-few”, you can use an array of embedded documents
- For “one-to-many”, or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
- For “one-to-squillions”, you should use a “parent-reference” in the document storing the “N” side.
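The three structures above can be sketched with plain Python dictionaries standing in for MongoDB documents. All names and values are made-up sample data, and the two helper functions are hypothetical; they show what an application-level join looks like when only one field is needed.

```python
# One-to-few: embed the "N" side in the parent document.
person = {
    "name": "Kate Monster",
    "addresses": [{"city": "New York"}, {"city": "Boston"}],
}

# One-to-many: the parent holds an array of references to child _ids.
parts = {
    1: {"_id": 1, "name": "#4 grommet", "price": 3.99},
    2: {"_id": 2, "name": "#6 screw", "price": 0.25},
}
product = {"name": "left-handed smoke shifter", "parts": [1, 2]}

# One-to-squillions: each child stores a reference to its parent.
host = {"_id": "host-1", "name": "goofy.example.com"}
log_messages = [
    {"host": "host-1", "message": "cpu is on fire"},
    {"host": "host-1", "message": "disk is full"},
]


def part_names(product_doc, parts_by_id):
    """Application-level join: resolve references, keep only the name field."""
    return [parts_by_id[part_id]["name"] for part_id in product_doc["parts"]]


def messages_for(host_doc, messages):
    """Follow parent references from the "N" side back to one host."""
    return [m["message"] for m in messages if m["host"] == host_doc["_id"]]
```

In the embedded case one read returns everything; in the other two, the application issues a second lookup, which stays cheap when the referenced field is indexed and only the needed fields are returned.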
Once you’ve decided on the overall structure of the data, then you can, if you choose, denormalize data across multiple documents, by either denormalizing data from the “One” side into the “N” side, or from the “N” side into the “One” side. You’d do this only for fields that are frequently read, get read much more often than they get updated, and where you don’t require strong consistency, since updating a denormalized value is slower, more expensive, and is not atomic.
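To see why a denormalized update is more expensive and not atomic, consider this in-memory sketch. The documents and the `rename_part` helper are hypothetical; each assignment stands in for a separate write to the database.

```python
parts = {1: {"_id": 1, "name": "#4 grommet"}}
products = [
    {"name": "smoke shifter", "parts": [{"id": 1, "name": "#4 grommet"}]},
    {"name": "sky hook", "parts": [{"id": 1, "name": "#4 grommet"}]},
]


def rename_part(part_id, new_name):
    """Update the source document, then every denormalized copy."""
    parts[part_id]["name"] = new_name  # write 1: the authoritative copy
    touched = 0
    for product in products:  # further writes: one per document with a copy
        for entry in product["parts"]:
            if entry["id"] == part_id:
                entry["name"] = new_name
                touched += 1
    return touched


copies_updated = rename_part(1, "#4 grommet (zinc)")
```

Reading a product's part names needs no join, but a single rename turns into several separate writes, and a reader that arrives between them can still see the old name on some products.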
Productivity and Flexibility
The upshot of all of this is that MongoDB gives you the ability to design your database schema to match the needs of your application. You can structure your data in MongoDB so that it adapts easily to change, and supports the queries and updates that you need to get the most out of your application.