Beyond NoSQL: A Modern Database Manifesto
There is no such thing as NoSQL. Not as we tend to think of it, anyway. While NoSQL was born as a movement away from rigid relational data models so web giants could embrace Big Data with scale-out architectures, the term has come to categorize a set of databases that are more different than they are the same. This broad categorization doesn’t work. It’s not helpful. While we at MongoDB still sometimes refer to NoSQL, we try to do it sparingly, given its propensity to confuse rather than enlighten.

Deconstructing NoSQL

Today the NoSQL category includes a cacophony of over 100 document, key-value, wide-column and graph databases. Each of these database types comes with its own strengths and limits. Each differs markedly from the others, with disparate models and capabilities relative to data storage, querying, consistency, scalability and high availability. Comparing a document database to a key-value store, for example, is like comparing a smartphone to a beeper. A beeper is exceptionally useful for getting a simple message from Point A to Point B. It’s fast. It’s reliable. But it’s nowhere near as functional as a smartphone, which can quickly and reliably transmit messages, but can also do so much more. Both are useful, but the smartphone fits a far broader range of applications than the more limited beeper. As such, organizations searching for a database to tackle Gartner’s three V’s of Big Data -- volume, velocity and variety -- won’t find an immediate answer in “NoSQL.” Instead, they need to probe deeper for a modern database that can handle all of their Big Data application requirements.

Modern Databases For Modern Data

One of these requirements is, of course, the ability to handle large volumes of data, the original impetus behind the NoSQL movement. But the ability to handle volume, or scale, is something all databases categorized as “NoSQL” share.
MongoDB, for example, counts among its users those who regularly store petabytes of data, perform over 1,000,000 operations per second and run clusters that exceed 1,000 nodes. A modern database, however, must do more than scale. Scalability is table stakes. It also must enable agility to accelerate development and time to market. It must allow organizations to iterate as they embrace new business requirements. And a modern database must, above all, enable enterprises to take advantage of rapidly growing data variety. Indeed the “greatest challenge and opportunity” for enterprises, as Forrester notes, is managing a “variety of data sources,” including data types and sources that may not even exist today. In general, all so-called NoSQL databases are much more helpful than relational databases at storing a wide variety of data types and sources, including mobile device, geospatial, social and sensor data. But the hallmark of a modern database is its ability to allow organizations to do useful things with their data.

Defining The Modern Database

To count as a modern database, then, a database must meet three requirements. While relational databases are able to manage some of these requirements, and newer so-called “NoSQL” key-value or wide-column data stores meet others, only MongoDB meets all three.

The database MUST scale. As data volume and velocity grow, the database must grow too. It should scale horizontally and elegantly, without doing unnatural things to your application, in the cloud or on commodity hardware. Meeting the base requirements -- like having enough capacity to serve your customers -- should be a given.

The database MUST adapt to change. The speed of business accelerates and your database must keep pace, enabling iteration. This means you must be able to process and mine new data sources and data types without the database breaking a sweat (or you breaking your back or budget).
Your schema must flow from your application requirements, rather than forcing your application to fit a predefined, rigid schema.

The database MUST unleash your data. Just storing data isn’t enough. You must be able to exploit the data, which particularly means you must be able to ask significant questions of your data. In part this means that the database must support rich queries, indexing, aggregation and search across multi-structured, rapidly changing data sets in real time. But it also means that it must support data for modern use cases including mobile, social, Internet of Things and other systems of engagement.

Some relational databases can handle a few of these requirements, yet fail in the essential need to deliver scale and adaptability. Some newer databases, including so-called “NoSQL” key-value or wide-column data stores, meet still other requirements, yet don’t give organizations the latitude to unleash their data. In fact, they constrain you to look up data by the key with which it was written unless you integrate external search engines and analytics nodes, which can create other problems.

MongoDB: A Modern Database For Today's Business Needs

But only one database today can deliver on each of these critical components of a modern database. Only one database offers orders of magnitude more productivity for developers and operations teams alike, while still delivering petabyte scale and lightning-fast performance. Only MongoDB, the modern database that tens of thousands of organizations depend upon to build and run today’s most demanding applications. To learn more about how MongoDB has enabled some of the world’s largest and most innovative companies to deliver applications and outcomes that were previously impossible, download our new whitepaper.
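To make the "unleash your data" requirement concrete, here is a minimal sketch in plain Python (no database required) of what the document model and a MongoDB-style query look like. The field names and the tiny query evaluator are purely illustrative assumptions, not MongoDB's actual implementation; a real application would pass the same query dict to a driver's find() call.

```python
# Documents are self-describing, so records with different shapes can
# live side by side, and queries can reach into any field.
orders = [
    {"_id": 1, "customer": "Acme", "total": 120.0,
     "items": [{"sku": "A1", "qty": 2}]},
    {"_id": 2, "customer": "Globex", "total": 80.0,
     "geo": {"lat": 40.7, "lon": -74.0}},   # extra field: no schema change
]

def matches(doc, query):
    """Evaluate a tiny subset of MongoDB-style queries: exact match
    and {"$gt": value} comparisons on top-level fields."""
    for field, cond in query.items():
        if isinstance(cond, dict) and "$gt" in cond:
            if not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:
            return False
    return True

# The same query shape you would pass to find() in a real driver:
query = {"total": {"$gt": 100}}
hits = [d["_id"] for d in orders if matches(d, query)]
print(hits)  # [1]
```

The point of the sketch is that the query language addresses the contents of documents directly, which is what separates a document database from a key-value store that can only fetch by key.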
Looking beyond labels like relational and NoSQL
According to a new Dice.com salary survey, MongoDB ranks as one of the top 10 most highly compensated technology skills. Indeed.com rates MongoDB as the second-hottest job trend. And according to DB-Engines.com, which ranks over 200 databases on their relative popularity, MongoDB is now the fifth-most popular database in the world, this month surpassing IBM's DB2. All great, right? Maybe. Buried in the Dice.com data, as well as the Indeed.com data, is evidence of real confusion. For example, among the top 10 most highly compensated skills in Dice.com's survey is "NoSQL." NoSQL is not a technology. It's not really something a developer can "know" in any real sense. NoSQL is a movement that describes a different way of modeling data but, as Basho founder Justin Sheehy correctly noted, there are as many differences among so-called NoSQL databases as there are similarities. As such, knowing Basho's Riak won't really help you understand MongoDB. Perhaps at a high, conceptual level, but expertise in one doesn't really translate into familiarity with another. They are different databases with different approaches. Employers looking for generic NoSQL skills need to think more deeply about what their application requirements are. Looking beyond relational databases for modern application requirements is a good start, but looking to generic "NoSQL" is not sufficient. Organizations should be looking for a modern database that dramatically improves developer productivity, encourages application iteration and enables a new wave of transformational applications in areas like Big Data, Internet of Things, mobile and more. That database is MongoDB. Is MongoDB "NoSQL"? Sure. But it's much bigger than that (based on what people search for on Google, many organizations already seem to understand this).
MongoDB is the fastest-growing database in the world, not because it fits the NoSQL category, but because it significantly improves the productivity of developers and the organizations for which they work. So if you're looking to hire technology talent, you're far more likely to be successful hiring an experienced MongoDB engineer than a "NoSQL engineer." MongoDB, after all, is an actual database. NoSQL simply describes an important movement.
MongoDB's $150 Million Funding Round: It's about the Customer Experience
Today MongoDB announced that we raised $150 million from a variety of investors both new (Salesforce.com, T. Rowe Price, EMC and others) and old (Sequoia, Red Hat, NEA, Flybridge, etc.). It's a great day for MongoDB, both the company and the project. But mostly it's a great day for our customers and the MongoDB community in which they participate.

Hip With The Hackers

Over the last few years MongoDB has solidified its position as the industry's leading NoSQL database and the fastest-growing Big Data community. With this funding round, MongoDB is also the best-funded Big Data technology. As enterprises invest in Big Data, they turn to the two dominant Big Data technologies, MongoDB and Hadoop, as Wikibon analysis has shown. Importantly, as can be seen in an analysis of LinkedIn profiles by 451 Research, very often enterprises discover that they already have MongoDB expertise within their organizations. Much of this success derives from MongoDB giving developers a better way to create applications. Rather than commoditizing the legacy relational database (RDBMS) market, similar to what other open-source RDBMSs have done, MongoDB significantly increases developer productivity by offering them a flexible data model. MongoDB is a significant part of what Cowen & Co. analyst Peter Goldmacher calls a "fundamental shift in the technology landscape away from legacy systems towards a new breed of better products at a lower cost for Data Management, Apps and in other areas." In other words, MongoDB is empowering the next generation of applications: post-transactional applications that rely on bigger data sets that move much faster than an RDBMS can handle. Developers have responded, voting with their apps, a considerable number of which are backed by MongoDB.

A Means, Not An End

Given the opportunity ahead of us, MongoDB would be irresponsible to raise less.
While most of our funding comes from rapidly growing revenues, the MongoDB board of directors determined that it would be advantageous to the project and, hence, to our customers, to accelerate growth. After all, our relational database competitors have a 30-year head start. As Max Schireson, MongoDB's CEO, articulated on his blog:

We are in a market dominated by technologies with over 30 years of engineering in them. Their designs may not be as well suited to modern applications, but they are very mature, very feature rich, and have huge partner ecosystems and big companies that understand the needs of their enterprise customers behind them. They have way more tooling – and decades of refinement of operational tools.

This is why we are raising $150 million. We know that it will take a large and sustained effort to build the maturity that many users expect in this market. Building out our management suite and enhancing the core product will be a ton of work. We have made great progress on security, management, stability, and scalability but we still have so much to do. For next-generation workloads in the cloud, MongoDB is already taking a lead, as Amazon Web Services data from Stackdriver seems to suggest. But MongoDB isn't intended to be a cloud-only database. It's a general-purpose database, designed to be a great fit for the vast majority of workloads. We want to make it easy to run on a single node or at massive scale, in the cloud or on premises. Whatever the customer needs. This funding will help.

Helping Ops Fall In Love With MongoDB

Some of that work will be done by MongoDB's exceptional community of developers and business partners. Among other things, the MongoDB community has contributed over 20 drivers, tripling the language compatibility of MongoDB and making it much more approachable for developers, whatever their preferred programming language. But some of it will necessarily be done by MongoDB, Inc.
From Linux to JBoss to Drupal, much of the best tooling has had to be developed by a focused, highly incentivized company. MongoDB is no different. We believe we have built the world's best database for developers. Now we need to make sure it is also the world's best database for Operations professionals. That means an improved and expanded management suite. We recently added Backup, but there are other areas that will help Operations professionals more easily manage MongoDB at the scale at which we increasingly see enterprises run the database. Outside of tooling, we also recognize that we need to continue to make improvements to MongoDB's concurrency, further optimize performance and more. We don't by any stretch think we're done.

The Path Forward

But we're making excellent progress. In the year since I joined MongoDB I've seen the company double its headcount and dramatically expand sales. This funding not only lets us make significant investments in improving MongoDB for both developers and Operations, but it also helps us fund geographic expansion. We're already growing 300% or more in Europe year-over-year, and expect much of the same in Asia-Pacific. We need to support our customers wherever they may be. Given the historic opportunity before MongoDB, it's time to step on the accelerator. Hard.

--

If you're interested, please find more coverage of the funding at BusinessWeek, GigaOm, TechCrunch, VentureBeat, and ZDNet.
Pearson National Transcript Center runs MongoDB
High school students only have to worry about one transcript: their own. But for Pearson, a multi-billion-dollar learning company that operates in over 70 countries and employs some 36,000 people, the transcript management problem is much bigger. Pearson Education manages the transcripts for over 14 million students from more than 25,000 institutions, and allows NTC member institutions to securely send records and transcripts to any of over 137,000 academic institutions, not to mention employers, licensure agencies, and scholarship organizations. To manage this big data problem, Pearson turned to MongoDB as the underlying database for its National Transcript Center.

Pearson’s National Transcript Center isn’t merely a data store for student transcripts. Pearson stores student data and also transforms it from one standard format to another, including PESC High School Transcript XML, PESC College Transcript XML, SPEEDE EDI, SIF Student Record Exchange, and others. Pearson also generates PDF copies of a student’s records, and provides print copies when electronic delivery is not available.

The impetus to use MongoDB was a request to archive student data at the end of each year, rather than deleting it. If a student had graduated, why keep her records around? As it turned out, there were plenty of reasons, including the potential need to transfer records between higher educational institutions or on to employers. But how best to store and manage this student data? Pearson had been using an open-source relational database (RDBMS) to store the student records. However, Pearson ran into performance problems with this RDBMS, problems that would compound each year. The idea of taking a year’s worth of student records and sticking it in a separate table, then sharding over and over as the years passed, was going to make performance even worse. So Pearson turned to a key-value NoSQL database. Unfortunately, this, too, posed problems.
Pearson had no idea what a student record would look like in the future and so needed a dynamic schema. The company did not want to keep creating new tables as fields changed. Another problem with this key-value data store was that its filtering mechanism was hard to work with, as Pearson employs very complicated queries in which the company searches different fields at the same time. It proved too difficult to get all that query data marshaled with a key-value database. At this point, Pearson decided to give MongoDB a try.

Pearson’s development team immediately appreciated the ease of working with MongoDB’s flexible and dynamic data model. But it was perhaps MongoDB’s query mechanism that sold the team on using the NoSQL database. Pearson had Hibernate criteria calls, which allowed the team to avoid building SQL queries by hand, and these mapped directly to MongoDB queries, saving Pearson time and trouble.

Other benefits became apparent over time. With Pearson’s original RDBMS approach, Pearson would have been forced to search gigantic tables when querying the student records. But with MongoDB, if Pearson starts putting too much data in a namespace, it can easily shard that namespace, enabling search by district rather than of an entire state, for example. And instead of storing student data in a blob, as happened with the RDBMS, Pearson is able to use MongoDB’s GridFS, enabling Pearson to keep files and metadata automatically synced and deployed across a number of systems and facilities.

For students looking to get into a good college or a good job, their transcript is their passport. By using MongoDB, Pearson has been able to boost performance for its end users, all while improving ease of use and productivity for its developers.

Tagged with: Pearson, education, National Transcript Center, GridFS, RDBMS, case study, MongoDB, NoSQL
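The dynamic-schema point in the Pearson story is easy to sketch in plain Python (standing in for a MongoDB collection). The record fields below are invented for illustration, not Pearson's actual schema; the idea is that records whose shape changes year over year can share one collection without schema migrations.

```python
# One collection holds records of different shapes side by side.
transcripts = []

# An early-era record
transcripts.append({
    "student_id": "S-100", "year": 2010,
    "courses": [{"name": "Algebra", "grade": "A"}],
})

# A later record gains new fields -- no ALTER TABLE, no new table
transcripts.append({
    "student_id": "S-200", "year": 2013,
    "courses": [{"name": "Biology", "grade": "B", "credits": 4}],
    "format": "PESC High School Transcript XML",   # new field
})

# Queries can still filter across both shapes at once
recent = [t["student_id"] for t in transcripts if t["year"] >= 2013]
print(recent)  # ['S-200']
```

In a relational model, the second record would have forced either a schema migration or a new table per year, which is exactly the compounding problem described above.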
Guest post: Nokta.com runs Turkey's Internet on MongoDB
This is a guest post by Emrah Ozcelebi, CEO of SPP42, a leading NoSQL consultancy in Turkey.

Nokta, one of the largest Internet companies in Turkey, knows what it means to operate at scale. The Internet leader reaches over 84% of all Turkish Internet users, and its video platform, Izlesene.com, delivers more than 2.7 million videos with over 2 billion page views and significant video views. As a Facebook Timeline launch partner, Nokta’s Izlesene.com also enables significant video sharing on Facebook. Finally, Nokta also operates Turkey’s leading photo sharing site, Foto Kritik, as well as a blogging site, Blogcu, that welcomes more than 13 million unique monthly users. At the heart of all this data is MongoDB.

But Nokta got off to a rough start with MongoDB, due primarily to poor configuration and an inappropriate use case. Working together, 10gen and SPP42 were able to turn things around. First we got in touch with Nokta’s game department. Its Facebook implementation of a local board game, OkeyHane, was built on PHP, Java and Flash technologies with an open-source RDBMS as the database back end. We were able to replace this relational database with MongoDB and significantly improve performance. It didn’t take long for Nokta’s software developers to realize that the flexibility of BSON gives extreme agility to the development team. Soon the MongoDB replica set behind OkeyHane proved itself to be highly stable in production, in addition to being far easier to maintain and administer than the RDBMS alternatives. After MongoDB proved itself stable in the midst of a difficult “war zone,” Nokta decided to extend its adoption by also using MongoDB in its flagship product, Izlesene.com. Nokta also elected to employ MongoDB in its homegrown advertisement platform, which feeds all its sites and delivers ads to 15,000 to 40,000 concurrent users.
In order to meet the real-time requirements of the advertisement system, we helped to stabilize the MongoDB installations. The middleware is built with the Akka concurrent programming framework and the Scala language, with Spray used as the REST API layer. We worked with great guys from Nokta.com like Erdem Agaoglu (@agaoglu) and Hakan Kocakulak, who are also highly skilled in Hadoop and HBase. After the proven success of the battle-hardened MongoDB installations in the ad-serving application, the Izlesene.com developers became more eager to use MongoDB for storing metadata about users and videos. Nokta is now planning to replace all of its open-source RDBMS implementations with MongoDB. Of course, at that level of traffic, there is no single silver bullet to solve all problems. The skilled development team is aware of that and willing to try new technologies. SPP42 and Nokta are working together to deliver better services to Nokta’s users by combining different NoSQL solutions such as Hadoop and Neo4j. With help from 10gen, we are able to offer better, integrated solutions to meet Nokta’s demands. There is a great wind filling NoSQL’s sails in Turkey. Although adoption is still at a very early stage, we are finding great success (and plenty of MongoDB interest) as a 10gen partner in Turkey. Companies like Nokta are able to achieve serious scale and improved developer productivity with MongoDB, helped by working with an experienced local partner like SPP42.

About SPP42

SPP42 is a Turkey-based consulting and training company specializing in decision support systems and business intelligence. Since its founding, SPP42 has delivered top-level open source consultancy and training services - mainly Java, Pentaho, Jasper and Python solutions over OpenStack, OpenShift, MongoDB and other NoSQL solutions. SPP42’s services include end-to-end integration solutions, from development and architecture to implementation.
SPP42 works with Turkey’s leading companies and helps them stay on the bleeding edge of technological innovation. We help them plan the migration from their existing technologies to newer ones so that our customers are always competitive globally.

Tagged with: guest post, scalability, Scala, RDBMS, Turkey, SPP42, partner, ease of use, developer productivity
The 'middle class' of Big Data
So much is written about Big Data that we tend to overlook a simple fact: most data isn’t big at all. As Bruno Aziza writes in Forbes, “it isn’t so” that “you have to be Big to be in the Big Data game,” echoing a similar sentiment from ReadWrite’s Brian Proffitt. Large enterprise adoption of Big Data technologies may steal the headlines, but it’s the “middle class” of enterprise data where the vast majority of data, and money, is. There’s a lot of talk about zettabytes and petabytes of data, but as EMA Research highlights in a new study, “Big Data’s sweet spot starts at 110GB and the most common customer data situation is between 10 to 30TB.” Small? Not exactly. But big? No, not really.

Couple this with the fact that most businesses fall into the 20-500-employee range, as Intuit CEO Brad Smith points out, and it’s clear that the biggest market opportunity for Big Data is within the big pool of relatively small enterprises with relatively small data sets. Call it the vast middle class of enterprise Big Data. Call it whatever you want. But it’s where most enterprise data sits. The trick is to first gather that data, and then to put it to work. A new breed of “data-science-as-a-service” companies like Metamarkets and Infochimps has arisen to lower the bar to culling insights from one’s data. While these tools can be used by enterprises of any size, I suspect they’ll be particularly appetizing to small-to-medium-sized enterprises, those that don’t have the budget or inclination to hire a data scientist. (This might be the right way to go, anyway, as Gartner highlights: “Organizations already have people who know their own data better than mystical data scientists.” What they really need is access to the data and tools to process it.) Intriguingly, here at 10gen we’ve seen a wide range of companies, large and small, adopt MongoDB as they build out data-centric applications, but not always with Big Data in mind.
In fact, while MongoDB and Hadoop are top-of-mind for data scientists and other IT professionals, as Wikibon has illustrated, many of 10gen’s smaller customers and users aren’t thinking about Big Data at all. Such users are looking for an easy-to-use, highly flexible data store for their applications. The fact that MongoDB also has their scalability needs covered is a bonus, one that many will unlock later in their deployment, when they discover they’ve been storing data that could be put to use. In the RDBMS world, scale is a burden, not least in terms of cost (bigger scale = bigger hardware = bigger license fees). Today, with NoSQL, scale is a given, allowing NoSQL vendors like 10gen to accentuate scalability with other benefits. It’s a remarkable turn of events for technology that emerged from the needs of the web giants to manage distributed systems at scale. We’re all the beneficiaries. Including SMBs.

We don’t normally think about small-to-medium-sized businesses when we think of Big Data, but we should. SMBs are the workhorses of the world’s economies, and they’re quietly, collectively storing massive quantities of data. The race is on to help these companies put their comparatively small quantities of data to big use. It’s a race that NoSQL technologies like MongoDB are very well-positioned to win.

Tagged with: MongoDB, big data, SMB, Hadoop, rdbms, Infochimps, Metamarkets, Gartner, Wikibon, data scientist
Considerations before moving from RDBMS to MongoDB
There are a variety of reasons for moving from a relational database (RDBMS) to MongoDB. Perhaps, like FamilySearch, the family history division of The Church of Jesus Christ of Latter-day Saints, a company wants to improve response times from 3 seconds (RDBMS) to under 15 milliseconds (MongoDB). Or perhaps, like Apollo Group (PDF), the private education giant behind University of Phoenix, an enterprise is hoping to store unstructured data and scale to support anticipated growth in the number of users and volume of content. Whatever the reason for moving off a relational database for MongoDB, it’s important to plan appropriately.

In my role I get to work directly with MongoDB users like Telefonica, nearly all of which come from a relational database background. Sometimes when people are fed up with using SQL, or they see MongoDB as a way to scale, they decide to migrate an application designed for a relational database directly to MongoDB... without rethinking the data model and architecture of their application. There are good ways to map SQL executables to MongoDB, but this isn’t one of them. Another error-prone migration “strategy” is driven by the blind usage of Object Document Mappers (ODMs) and Object Relational Mappers (ORMs), which shield a lot of the complexity of manipulating a database, but can also contribute to poor data model design. So when considering a direct migration from RDBMS to MongoDB, it’s important to be attentive to some issues:

Too many collections (10+) - This will lead to poor performance and questions like “How can I do a join in Mongo?” (Answer: you can’t, but there are ways to accomplish the same thing in MongoDB.)

Many indexes - Indexing all the fields in a document is a bad approach and leads to bad insert performance. I've seen cases where, for a given database, customers end up with more space occupied by indexes than data. This is not a good practice anywhere.
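The data-model rethink the post recommends can be sketched in plain Python. Instead of porting normalized tables one-to-one (and then asking "how do I join?"), related rows are embedded in the parent document so one read answers the access pattern. The table and field names here are illustrative, not from any real migration.

```python
# Relational shape: two tables linked by a foreign key
authors_table = [{"id": 1, "name": "Ada"}]
posts_table = [
    {"id": 10, "author_id": 1, "title": "Hello"},
    {"id": 11, "author_id": 1, "title": "World"},
]

# Document shape: embed the posts, matching the query "show an author
# with their posts" -- the join is done once, at write time
def to_document(author, posts):
    return {
        "_id": author["id"],
        "name": author["name"],
        "posts": [{"title": p["title"]} for p in posts
                  if p["author_id"] == author["id"]],
    }

doc = to_document(authors_table[0], posts_table)
print([p["title"] for p in doc["posts"]])  # ['Hello', 'World']
```

Designing the document around the access pattern, rather than around the old table layout, is also what keeps the collection count and index count from ballooning.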
The first question to ask, then, when moving from a relational database to MongoDB is, ‘How will this data be accessed?’ Other important questions include: What is the access pattern? What are you hoping to show to your customers/users? How are you going to write this data? These should be the first questions people ask themselves before migrating data from an RDBMS into MongoDB. Indeed, these should be the main questions anyone asks before adopting any persistence layer. As stated, there are a lot of great reasons to use MongoDB instead of a relational database, but careful planning is required to pull off a successful migration.

Posted by Norberto Leite, Senior Solutions Architect for EMEA, 10gen.

Tagged with: rdbms, MongoDB, ORM, ODM, migration
Making sense of increased database choice
Gartner estimates that by 2015, 25 percent of new databases deployed will be of technologies supporting alternative data types and non-traditional data structures. This is great news, as these new database choices, many of them NoSQL, are generally better tuned to modern application requirements. The downside to this end of the “30-year-old freeze,” to quote RedMonk analyst James Governor, is that all these new options risk complicating a hitherto somewhat simple choice: which database to use? DB-Engines, after all, lists and ranks 92 different database systems, which doesn’t even include all of the NoSQL variants. Good luck to the CIO who tries to deploy all of those within her enterprise.

The key, then, is to figure out how to standardize on a core of database technologies. Most companies will want to retain their legacy relational database for applications tuned to an RDBMS, or that require complex transactions. But for most new applications, NoSQL databases like MongoDB will be the optimal solution. But which one? There are currently at least 150 different NoSQL databases, split into different camps: document, columnar, key-value, graph, and others. One of my favorite guides for differentiating between these options is Pramod Sadalage and Martin Fowler’s NoSQL Distilled. It does a great job of making NoSQL approachable, and also offers some guidance on which type of database to apply to specific types of problems. This is critical: which database is best largely depends on the particular use case. There is no shortage of guidance as to whether an enterprise should use NoSQL or stick with an RDBMS or, if NoSQL, which to use (here’s just one of many sites offering guidance). Unfortunately, this still doesn’t cut down on the number of choices presented to a developer interested in selecting a database for her application.
I’m sure much of the advice is good, but it could end up solving a point problem (which database to use for a particular application) while exacerbating the meta-problem (which databases to standardize on throughout the enterprise). This should be top-of-mind for every CIO, as shadow IT is already bringing NoSQL databases into the enterprise, a trend that is only going to accelerate, as InfoWorld’s Bob Lewis notes. The reasons NoSQL technologies are being adopted in the enterprise are somewhat similar to the reasons shadow IT is embracing the public cloud: speed of development, ease of development, and suitability for modern applications, as a recent Forrester survey found. Hence, savvy CIOs will select a few, broadly applicable databases that can tackle the vast majority of enterprise needs, while simultaneously satiating developers’ needs for databases that help them get their work done. But, again, which ones?

Most enterprises already have RDBMS preferences, standardizing on two and possibly three SQL databases. Part of the reason that these databases have served so many for so long is that they are general-purpose databases. They might not be the absolute perfect solution to a particular application requirement, but they do the job well enough and help the enterprise focus its resources. When choosing a NoSQL database, and every enterprise is going to need to do this, it’s important to opt for NoSQL databases that solve a wide variety of problems, rather than addressing niche requirements with a narrowly applicable database. Document data stores like MongoDB tend to be the most broadly applicable, able to tackle a wide array of workloads. But there are other NoSQL databases that, while not as generally useful, do a few things really well and should be considered.
Other things to consider in settling on database standards are political and cultural issues, compatibility with existing applications or applications on the near- and long-term roadmap, and the momentum behind a particular NoSQL database. With 150-plus NoSQL databases to choose from, picking a fashionable but ephemeral database is a recipe for frustration and failure. As I’ve written, MongoDB’s community size and momentum, among other things, suggest it will be around for a long, long time. But there are other NoSQL communities that also demonstrate staying power.

No enterprise wants to be managing dozens of databases, or even 10. Ideally, enterprises will settle on a few. Perhaps five, at most. In so doing, they should look to augment their RDBMS standards with NoSQL databases that are general purpose in nature, and broadly adopted. Considered in this light, NoSQL database standardization becomes much more manageable.

—

Posted by Matt Asay, vice president of Corporate Strategy.

Tagged with: MongoDB, NoSQL, RDBMS, choice, database, relational database, Forrester, standardization, InfoWorld, shadow IT, Matt Asay
Case Study: The New York Times Runs MongoDB
Perhaps your business has settled on the exact right operating model, one that will remain static for years, if not decades. But for the 99.999 percent of the rest of the world’s enterprises, your market is in a constant state of flux, demanding constant iteration on how you do business. As the Research & Development group of The New York Times Company (NYT) has found, a key way to confront this constant flux is to build upon a flexible data infrastructure like MongoDB.

The story behind The New York Times Company’s use of MongoDB isn’t new. Data scientist and then-NYT employee Jake Porway spoke in June 2011 about how the media giant uses MongoDB in Project Cascade, a visualization tool that uses MongoDB to store and manage data about social sharing activity related to NYT content. But what is perhaps new is the more recent realization of just how critical it is to build upon flexible data infrastructure like MongoDB in our ever-changing business climate.

Project Cascade visualizes the conversations happening around NYT content on Twitter, giving insight into which content is hot and who is fanning the flames. Joab Jackson, writing for PCWorld, has a great write-up, and you can also see an online demo. For the NYT, as Porway explains, “[Project Cascade] allows us to [answer] questions that are really big, like what is the best time of day to tweet? What kinds of tweets get people involved? Is it more important for our automated feeds to tweet, or for our journalists?”

Imagine, however, that the Times editors determine they actually need to be collecting different data. With a relational database, this would involve a fair amount of bother, but for the NYT’s R&D team, it’s simply a matter of tweaking MongoDB’s data model.
As Porway notes, “We can't bother futzing with RDBMS schemas when we're constantly changing what we want to look at.” The NYT started Project Cascade with just two weeks of data on a single MongoDB instance with no replication. Even in this limited snapshot of the roughly 600 pieces of posted content and 25,000 Twitter links each day, Project Cascade was generating 100 GB of MongoDB storage each month. Fast forward to late 2011, and Project Cascade is in serious production, processing 100,000 tweets (and far more clicks) daily, all in real time. This necessitated moving up to a four-node MongoDB replica set, but it didn’t involve adding the complexity of joins or other characteristics of a relational database.

As Deep Kapadia, Technical Program Manager at The New York Times Company, says, “MongoDB allows us to prototype things very quickly.” This is important for any enterprise application, as it allows companies to iterate around their data. Most won’t know exactly what their data model should look like right from the start. The NYT certainly didn’t. As Kapadia explains, the NYT didn’t have to do any schema design upfront to determine which fields to capture from Twitter or Bit.ly, but could simply dump all the data into MongoDB and figure out how to process it later.

That flexibility is powerful. Granted, not all businesses will want to change as often as the NYT’s research group, but in a world of accelerating change, it’s increasingly critical that companies don’t hard-code rigid schemas into their data infrastructure. It’s also important that enterprises look to the future. However small a project starts, Big Data looms. As Porway explains, “Even if we're not dealing with big data when we start a project, the data demands can rise significantly.” An RDBMS scale-up strategy quickly becomes expensive and constrictive; a NoSQL scale-out architecture is much more forgiving.
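To make the “dump the data first, decide the schema later” idea concrete, here is a minimal sketch in plain Python, using dicts to stand in for MongoDB documents so it runs without a live database. With pymongo, the same inserts would be `collection.insert_one(...)` calls and the lookup a `collection.find(...)`. The field names below (`user`, `text`, `clicks`) are hypothetical, not taken from the NYT project.

```python
# Plain Python dicts standing in for MongoDB documents: no upfront
# schema is declared, and documents in one collection may carry
# different sets of fields.

def ingest(collection, document):
    """Store a document as-is; nothing forces a fixed set of fields."""
    collection.append(document)

def find(collection, **criteria):
    """Return documents whose fields match all given criteria.
    Documents that lack a queried field simply don't match."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

tweets = []

# Early documents captured only a couple of fields...
ingest(tweets, {"user": "nytimes", "text": "Breaking news..."})

# ...later documents added a new field (say, a click count from a
# link shortener) with no migration step for the older documents.
ingest(tweets, {"user": "nytimes", "text": "Follow-up", "clicks": 1200})

print(len(find(tweets, user="nytimes")))   # both documents match
print(find(tweets, clicks=1200)[0]["text"])  # only the newer one
```

The point of the sketch is the absence of a migration: adding the `clicks` field required touching only the new documents, which is the property Kapadia credits for the team’s fast prototyping.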
MongoDB is particularly useful because it runs as well on a single node as it does on hundreds of nodes, and scale almost always starts with one node, as Foursquare and others have found. While Web companies like Google and Twitter ran into the constraints of RDBMS technology first, mainstream enterprises are hitting them now. The New York Times has been publishing continuously since 1851, yet the nature of its business has changed significantly since the advent of the Internet. The same is true for most businesses. Like the NYT, most mainstream enterprises today will find themselves collecting, filtering, and analyzing real-time data feeds from a variety of sources to better understand how customers and prospects interact with their products and services. MongoDB fits perfectly in this kind of ever-changing world.

Not surprisingly, the publishing and media world is grappling with the need for flexible data models in a very public way. Like the NYT, UK-based news publisher The Guardian also uses MongoDB to help it adapt to digital publishing and the business models it enables. In order to iterate flexibly on different user engagement models, The Guardian had to drop old-school relational database technology and move to MongoDB.

Not that MongoDB is perfect. As Kapadia highlighted roughly a year after Porway’s original presentation, there is definitely a science to deploying MongoDB effectively. It’s very easy to get started with MongoDB, but it requires the same level of care that any critical data infrastructure does.

If Tim O’Reilly is right and “Data is the new Intel Inside,” then it’s important to build applications on a flexible database that not only can scale to collect increasing quantities of data, but also affords the agility to change one’s data model as business needs change. Data offer real competitive advantage to the companies prepared to leverage them. Just ask The New York Times.
Tagged with: case study, The New York Times, The Guardian, flexibility, agility, publishing, media, MongoDB, RDBMS, relational database, Jake Porway