Why Open Source Is Essential To Big Data

Gartner analyst Merv Adrian recently highlighted some of the recent movements in Hadoop Land, with several companies introducing products "intended to improve Hadoop speed." This seems odd, as speed wouldn't be my top pick for how to improve Hadoop or, really, most of the Big Data technologies out there. By many accounts, the biggest need in Hadoop is improved ease of use, not improved performance, something Adrian himself confirms: Hadoop already delivers exceptional performance on commodity hardware compared to its stodgy proprietary competition. Where it's still lacking is in ease of use. Not that Hadoop is alone in this. As Mare Lucas asserts:

Today, despite the information deluge, enterprise decision makers are often unable to access the data in a useful way. The tools are designed for those who speak the language of algorithms and statistical analysis. It’s simply too hard for the everyday user to “ask” the data any questions – from the routine to the insightful. The end result? The speed of big data moves at a slower pace … and the power is locked in the hands of the few.

Lucas goes on to argue that the solution to the data scientist shortage is to take the science out of data science; that is, to consumerize Big Data technology so that non-PhD-wielding business people can query their data and get back meaningful results.

The Value Of Open Source To Deciphering Big Data

Perhaps. But there's actually an intermediate step before we reach the Promised Land of full Big Data consumerization. It's called open source. Even with technology like Hadoop that is open source yet still too complex, the benefits of using Hadoop far outweigh the costs (financial and productivity-wise) associated with licensing an expensive data warehousing or analytics platform. As Alex Popescu writes, Hadoop "allows experimenting and trying out new ideas, while continuing to accumulate and storing your data. It removes the pressure from the developers. That’s agility."
But these benefits aren't unique to Hadoop. They're inherent in any open-source project. Now imagine we could get open-source software that fits our Big Data needs, is exceptionally easy to use, and is almost certainly already being used within our enterprises. That is the promise of MongoDB, consistently cited as one of the industry's top two Big Data technologies. MongoDB makes it easy to get started with a Big Data project.

Using MongoDB To Innovate

Consider the City of Chicago. The Economist wrote recently about the City of Chicago's predictive analytics platform, WindyGrid. What The Economist didn't mention is that WindyGrid started as a pet project on chief data officer Brett Goldstein's laptop. Goldstein started with a single MongoDB node and iterated from there, turning it into one of the most exciting data-driven applications in the industry today. Given that we often don't know exactly which data to query, or how to query it, or how to put data to work in our applications, this is precisely how a Big Data project should work: start small, then iterate toward something big. This kind of tinkering is difficult to impossible with a relational database, as The Economist's Kenneth Cukier points out in his book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:

Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them - and only them - efficiently.

But with a flexible document database like MongoDB, it suddenly becomes much easier to iterate toward Big Data insights. We don't need to go out and hire data scientists.
Rather, we simply need to apply existing, open-source technology like MongoDB to our Big Data problems, which jibes perfectly with Gartner analyst Svetlana Sicular's mantra that it's easier to train existing employees on Big Data technologies than it is to train data scientists on one's business. Except, in the case of MongoDB, odds are that enterprises are already filled with people who understand MongoDB, as 451 Research's LinkedIn analysis suggests. In sum, Big Data needn't be daunting or difficult. It's a download away.

May 2, 2013

AOL's targeted advertising business: Powered by MongoDB

While AOL may evoke thoughts of dial-up Internet access for some, the company today drives over $2 billion in annual revenues connecting advertisers to consumers of its premium content, including Huffington Post, Moviefone, Engadget, TechCrunch, Patch, and Stylelist. MongoDB provides the data infrastructure for a significant portion of AOL’s business, on both the content and advertising sides. In the words of Jonathan Reed, formerly a senior software engineer at AOL, “AOL uses MongoDB a lot throughout our business,” and for very different use cases. As of June 2012, AOL had over 30 MongoDB projects running internally across more than 500 servers. One of the most important of these projects is advertising, as detailed in the video above. AOL’s Advertising.com platform helps advertisers reach highly targeted audiences at scale, and MongoDB plays an essential role in storing Advertising.com’s user profiles. AOL turned to MongoDB for its flexible data model, as user profiles come in various sizes and shapes, with different kinds of information stored for different users. One key feature MongoDB offers is geospatial indexing, which enables AOL to target services based on a user’s location (e.g., showing airfare pricing based on the airport nearest to the user, even if all they’ve expressed is interest in flying to Paris). Importantly, all of this must be done in under five milliseconds, which means that AOL simply can’t afford to hit disk and must keep everything in RAM. MongoDB handles this easily, processing 12,000 transactions per second, or several billion each month. MongoDB’s performance was so good that, as Reed describes, the company needed a special set-up to manage network traffic, which couldn’t keep up with MongoDB. While this sounds complex, Reed suggests that MongoDB is “surprisingly simple” to set up and run.
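MongoDB serves this kind of query through its geospatial indexes; conceptually, the nearest-airport lookup reduces to a great-circle distance search over stored locations. A pure-Python sketch of that idea (the airport coordinates and function names are illustrative assumptions, not AOL's actual code or MongoDB's API):

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical airport coordinates as (longitude, latitude) pairs.
AIRPORTS = {
    "JFK": (-73.7781, 40.6413),
    "LAX": (-118.4085, 33.9416),
    "ORD": (-87.9073, 41.9742),
}

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points, in kilometers."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_airport(user_lon, user_lat):
    """The question a geospatial index answers for the ad server: which
    stored location is closest to this user?"""
    return min(AIRPORTS, key=lambda code: haversine_km(user_lon, user_lat, *AIRPORTS[code]))
```

A geospatial index avoids the brute-force scan shown here by organizing locations so the nearest candidates can be found without computing every distance, which is what makes the five-millisecond budget feasible.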
For this project, Advertising.com runs MongoDB in a single cluster spanning three data centers, two in the U.S. and one in Europe. Indeed, ease of use was one of the top four reasons AOL chose MongoDB to power Advertising.com:

- Easy to learn and set up
- Easy to scale
- Great community
- Support contract available (“really good value for money”)

None of this would matter, however, if MongoDB couldn’t handle AOL’s core requirement: a dynamic data schema. AOL’s Advertising.com must constantly tweak the kind of user information it collects and stores, and has to be able to do so with super-high performance at scale. MongoDB ticks each of these boxes, and makes it easy to do so, leading Reed to conclude that hitting AOL’s scale requirements “would have been much harder with other technology.”

Tagged with: AOL, Advertising, case study, use cases, flexibility, dynamic schema, high performance, scalability

March 26, 2013

Learn how leading telcos differentiate with MongoDB

MongoDB is the world’s most popular NoSQL database, and nowhere is this more true than in the telecommunications (Telco) market. It would be hard to name a major Telco that isn’t already using MongoDB, many in serious production. From Orange to Telefonica to O2, MongoDB is the default NoSQL database for the Telco market. The Telco market was among the first to embrace MongoDB, in large part because legacy relational database technology (RDBMS) locked these companies into rigid data schemas that made it hard to adapt to new business requirements. Paramount among those requirements are flexibility and performance. Today, Telcos confront competition not only from companies offering similar technologies (e.g., two wireless operators) but from companies offering the same applications over different technologies (e.g., a landline and wireless operator), wholesale operators with different cost structures and nimble startups offering competitive applications over the Internet. Opportunities for new subscriber growth are limited, with mobile penetration in most rich countries exceeding 100%, fixed-line subscriptions falling and pay-TV subscriptions flat in many countries. Telcos are therefore turning to their existing subscriber bases for revenue growth, considering new revenue streams like targeted advertising and additional value-added services, like over-the-top video and consumer cloud storage, to increase their revenue per user. To make such projects successful, many Telcos are turning to MongoDB to create a flexible data infrastructure, as explained in 10gen’s white paper: “Agility in the Age of Apps: How the Next Generation of Databases Can Create New Opportunities for Telecoms.” This white paper offers clear guidance on how to turn MongoDB to competitive advantage.
Representatives from Telcos and other industries should consider attending our upcoming webinar, “How Telcos Use MongoDB,” which will identify how operators are increasingly leveraging MongoDB to develop new applications quickly and secure new revenue streams. The webinar will be held April 2, 2013. Please register today. No Telco need resign itself to being reduced to a “dumb pipe,” incapable of product differentiation and consequent profits. A winning MongoDB-based infrastructure can help. Please download 10gen’s white paper and/or attend the April 2 webinar to learn how to put NoSQL to use in driving innovation within your telecommunications company. Tagged with: Telecommunications, Telco, telecom, flexibility, MongoDB, webinar, white paper

March 25, 2013

Guest post: My MongoDB conversion

Michael Calabrese is a senior developer with Lunar Logic, a custom software development firm that specializes in custom websites and web-based technologies for businesses, educational institutions, and non-profit organizations. He is also, as noted below, a convert from the relational database world.

I have a confession: I have crossed over. It has finally happened: I no longer like to use SQL on products. The joy of simplifying needlessly complex queries is gone. I no longer see problems as tables broken down in third normal form. I am now one of them. I have accepted the world of document databases and, in particular, that of MongoDB.

Growing Up SQL

Like many developers that finished college in the 90s (or before), I grew up with relational databases. I created structures, followed the rules, and removed duplication. I was careful to implement foreign keys and set domains on my data. Data integrity was of the utmost importance. My databases blossomed with tables, adding order types, product inventory types, and inventory locations. Tables grew like flowers, popping up with each new type of data that I had to track. I would feel the rush of building massive SQL queries to mine the data to build reports, adding join after join.

Cracks in the SQL Faith

But then reality started to set in. It became: “Just add another table, another join, and it will work out.” The user experience dragged as data grew. We went from thousands of rows to millions. As data requirements changed at management’s whim, domains on the data would not hold. As the data kept growing and becoming more complex, I had data requirements that didn’t necessarily match historical data. I could no longer depend on the fast inner join. I suffered from table creep, and now I needed transactions to manage the updates across tables. Tables grew needlessly, with near-duplicate products that only changed names.
I was increasingly building meta systems, like a CMS where I would not know the shape of the data being stored. Choices to model the data became complex and unwieldy. More and more, data processing was moving into the application code or stored procedures. It became standard procedure to break the rules of SQL, in a world where the rules are paramount.

The NoSQL Solution

About this time I changed employers, and was introduced to NoSQL. In the beginning, I was confused by the simplicity of just storing a document. What is a document anyway? Over time I came to realize documents are very similar to the serialized objects in my settings tables: an arbitrary structure that can be stored, but still accessed simply, as part of a row/document. I could easily create and access multi-part fields, without having to mess with any sort of serialization. This was revolutionary. It could solve many of the problems in which I needed arbitrary data structures. In the new meta systems, I could easily store user-created data structures. On top of all that, I could even create indexes for the sub-fields/sub-documents. From there, document databases and NoSQL just started to make sense. My designs became cleaner. De-normalization became the standard. I was designing schemas for how the data was going to be used, but realized that I was doing that anyway in SQL. With the MongoDB document database, how the data was used became more important than the decomposition of the data. All necessary information could be stored in a single document, rather than across tables and joins. All similar documents could be housed in a collection, so, for example, an order document included all of the relevant order information: customer information, order detail, shipping. Everything is neatly packaged, to be sent to my applications without much processing. Easy.

Living in a Post-SQL World

I thought I would miss transactions, but I don’t.
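The self-contained order document described above, the reason single-document atomicity covers most consistency needs, might look like this in miniature (field names and values are invented for illustration, with a plain Python dict standing in for a MongoDB document):

```python
# A self-contained order document: customer, line items, and shipping
# live together instead of being scattered across joined tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada Lovelace", "email": "ada@example.com"},
    "items": [
        {"sku": "WIDGET-1", "qty": 2, "price": 9.99},
        {"sku": "GADGET-7", "qty": 1, "price": 24.50},
    ],
    "shipping": {"method": "ground", "address": {"city": "Portland", "state": "OR"}},
}

# Sub-fields are addressable directly -- the same dotted paths MongoDB
# lets you index (e.g. an index on "shipping.address.state").
total = sum(item["qty"] * item["price"] for item in order["items"])
state = order["shipping"]["address"]["state"]
```

Because the whole order is written and read as one unit, updating it never requires coordinating changes across several tables.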
Generally, transactions are needed because in SQL we were forced to divide data between tables to remove duplication. We had to keep data consistent between those tables. Now, I just keep data that must be kept consistent in the same document. This is better anyway, as in most cases, the data that must be consistent is part of a conceptual whole. My designs are now cleaner and more flexible. This flexibility inspired cleaner, more flexible code. All in all, I am more productive in design, development and maintenance of my projects. After a year of developing in MongoDB, I no longer look back. Tagged with: NoSQL, document-oriented, LunarLogic, case study, flexibility

March 20, 2013

MongoDB powers Mappy Health's tweet-based disease tracking

Twitter has come a long way from being the place to read what your friends ate for dinner last night (though it still has that). Now it’s also a place where researchers can track the ebb and flow of diseases, and take appropriate action. In early 2012, the U.S. Department of Health and Human Services challenged developers to design applications that use the free Twitter API to track health trends in real time. With $21,000 in prize money at stake, Charles Boicey, Chief Innovation Officer of Social Health Insights, and his team got started on the Trending Now Challenge, and ultimately won with their MongoDB-powered solution, Mappy Health. Not bad, especially since the small team had only three weeks to put together a solution.

Choosing a Database

MongoDB was critical to getting the application done well, and on time. As Boicey tells it, “MongoDB is just a wonderful environment in which to work. What used to take weeks with relational database technology is a matter of days or hours with MongoDB.” Fortunately, Boicey had a running start. Having used MongoDB previously in a healthcare environment, and having seen how well it ingested health information exchange data in XML format, Boicey felt sure MongoDB could manage incoming Twitter data. Plus, Mappy Health needed MongoDB’s geospatial capabilities to track diseases by location. Finally, while the team evaluated other NoSQL options, “MongoDB was the easiest to stand up” and is “extremely fast.” To make the development process even more efficient, Mappy Health runs the service on Amazon EC2.
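Tracking diseases by location ultimately means reducing a raw tweet stream to counts per term and region. A miniature pure-Python sketch of that map/reduce pattern (the tweets, terms, and field names here are invented for illustration; Mappy Health's actual jobs run inside MongoDB):

```python
from collections import defaultdict

# Toy tweet stream standing in for what the Twitter API delivers.
tweets = [
    {"text": "flu is going around", "region": "CA"},
    {"text": "caught the flu again", "region": "CA"},
    {"text": "allergies acting up", "region": "NY"},
]

def map_phase(tweet, terms=("flu", "allergies")):
    """Emit ((region, term), 1) pairs -- the role of a map function."""
    for term in terms:
        if term in tweet["text"]:
            yield (tweet["region"], term), 1

def reduce_phase(pairs):
    """Sum the counts per key -- the role of the reduce function."""
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return dict(counts)

counts = reduce_phase(p for t in tweets for p in map_phase(t))
```

The same two-phase shape, emit key/value pairs, then fold them per key, is what the team writes in JavaScript and runs inside the database, close to the data.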
Processing the Data

While UCI has a Hadoop ecosystem Mappy Health could have used, the team found that its real-time algorithms and MapReduce jobs run much faster on MongoDB, and so it runs MapReduce within MongoDB. As Boicey notes, “Writing MapReduce jobs in Javascript has been fairly simple and allows us to cache collections/hashes of data frequently displayed on the site easily using a Memcached middleman between the MongoDB server and the Heroku-served front-end web app.” This jibes well with Mappy Health’s overall rationale for choosing MongoDB: MongoDB doesn’t require a lot of work upfront (e.g., schema design: “doing the same thing in a relational database would require a lot of advance planning and then ongoing maintenance work like updating tables”), and MongoDB works really well and scales beautifully. Since winning the Trending Now Challenge, Mappy Health has been working with a number of other organizations. We look forward to even bigger and better things from this team. Imagine what they could do if given a whole four weeks to build an application!

Tagged with: Mappy Health, case study, disease tracking, US Department of Health and Human Services, flexibility, ease of use, Amazon, EC2, dynamic schema

March 18, 2013

Pearson / OpenClass Uses MongoDB for Social Learning Platform

We recently spoke with Brian Carpio of Pearson about OpenClass, a new project from Pearson with deep Google integration.

What is OpenClass?

OpenClass is a dynamic, scalable, fully cloud-based learning environment that goes beyond the LMS. OpenClass stimulates social learning and the exchange of content, coursework, and ideas, all from one integrated platform. OpenClass has all the LMS functionality needed to manage courses, but that's just the beginning.

Why did you decide to adopt MongoDB for OpenClass?

OpenClass leverages MongoDB as one of its primary databases because it offers serious scalability and improved productivity for our developers. With MongoDB, our developers can start working on applications immediately, rather than slogging through the upfront planning and DBA time that relational database systems require. Also, given that a big part of the OpenClass story will be how we integrate with both public and private cloud technologies, MongoDB's support for scale-out, commodity hardware is a better fit than traditional scale-up relational database systems that generally must run on big-iron hardware.

Can you tell us about how you’ve deployed MongoDB?

Currently we deploy MongoDB in our world-class datacenters and in Amazon's EC2 cloud environment, with future plans to move to private cloud technologies such as OpenStack. We leverage both Puppet and Fabric for deployment automation and rolling upgrades. We also leverage Zabbix and the mikoomi plugin for monitoring our MongoDB production servers. Currently each OpenClass feature / application leverages its own MongoDB replica set, and we expect to need MongoDB’s sharding features given the expected growth trajectory for OpenClass.

What recommendations would you give to other operations teams deploying MongoDB for the first time?

Automate everything!
Also, work closely with your development teams as they begin to design an application that leverages MongoDB, which is good advice for any new application that will be rolled into production. I would also say to look at Zabbix, as it has some amazing features related to monitoring MongoDB in a single replica set or in a sharded configuration that can help you easily identify bottlenecks and recognize when it’s time to scale out your MongoDB deployment. Finally, I would suggest subscribing to the #mongodb IRC channel, as well as the MongoDB Google Group, and don't be afraid to ask questions. I personally ask a lot of questions in the MongoDB Google Group and receive great answers not only from 10gen CTO Eliot Horowitz, although he does seem to answer a lot of my questions, but from many other 10gen folks.

What is in store for the future with MongoDB at Pearson?

Our MongoDB footprint is only going to continue to grow. More and more development teams are playing with MongoDB as the foundation of their new application or OpenClass feature. We are working on migrating functionality out of both Oracle and Microsoft SQL Server to MongoDB where it makes sense, to relieve the current stress on those incumbent database technologies.

Thanks to Brian for telling us about OpenClass! Brian also blogs at www.briancarpio.com — be sure to check out his posts on MongoDB here and here and here and here and here. Tagged with: case study, Pearson, OpenClass, scalability, flexibility, ease of use

February 28, 2013

Post+Beam's MongoDB-powered innovation factory

When your business is innovation, throttling creativity with rigid, upfront schema design is a recipe for frustration. It’s therefore not surprising that Post+Beam, an innovation and communications “factory,” turned to MongoDB to enable rapid development. Part startup incubator, part branding and communication agency, part development firm, Post+Beam takes ideas and turns them into products. Post+Beam’s first MongoDB-based product is Linea, a cross-platform photo-browsing application that extends from web to mobile and enables users to create and share stories through photos, focusing on the photos and the collaboration around them, not photo storage. In talking with lead engineer Jeff Chao, he mentioned MongoDB’s dynamic schema as a primary reason for using the NoSQL database:

The most important reason for using MongoDB from the start is rapid development. We wanted to spend just enough development time in spec’ing out a schema so we could get started on writing the application. We were then able to incrementally adjust the schema depending on various technical and non-technical requirements. Another reason for choosing MongoDB is because of its default data representation. We were able to build out an API to allow iOS clients to interact with our web service via JSON.

This is particularly interesting given that Post+Beam’s development team has extensive relational database experience. According to Chao, MongoDB’s “documentation and community support” made it easy to get up to speed. The initial set-up consists of a three-node replica set (for automatic fail-over), all running in one cluster on Amazon EC2. While the team continues to use Postgres for some transactional components of the Linea app, it needed MongoDB’s flexible data model to support its business model, which demands continuous iteration. Which, of course, is how innovation happens.
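Chao's point about MongoDB's default data representation is that documents round-trip naturally to the JSON an iOS client consumes, with no object-relational mapping layer in between. A minimal sketch (the field names are invented, not Linea's actual schema):

```python
import json

# A hypothetical photo document, nested comments and all.
photo = {
    "_id": "photo-42",
    "caption": "Sunrise over the bridge",
    "tags": ["morning", "city"],
    "comments": [{"user": "jeff", "text": "Great light!"}],
}

payload = json.dumps(photo)     # what the API hands to the iOS client
restored = json.loads(payload)  # what the client decodes on its end
```

Because the stored shape and the wire shape are the same, adding a field to the document is all it takes to expose it through the API.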
Chao noted that Post+Beam plans to expand its use of MongoDB, particularly for those applications that “require a relatively short delivery time combined with requirements that might not be fully matured at the time of the [client] request.” This sounds like most applications, most of the time, in most enterprises. Indeed, this is one of the primary reasons we see for MongoDB’s mass adoption. As our friends at MongoLab say, “It’s a data model thing.” Tagged with: data model, Post+Beam, case study, Linea, innovation, flexibility, replica sets, ease of use

February 19, 2013

Technology Adoption and the Power of Convenience

Just as the ink was drying on my ReadWrite piece on how the convenience of public cloud computing is steamrolling over concerns about security and control, Redmonk über-analyst Stephen O’Grady posts an exceptional review of why we should “not underestimate the power of convenience.” As he writes:

One of the biggest challenges for vendors built around traditional procurement patterns is their tendency to undervalue convenience. Developers, in general, respond to very different incentives than do their executive purchasing counterparts. Where organizational buyers tend to be less price sensitive and more focused on issues relating to reliability and manageability, as one example, individual developers tend to be more concerned with cost and availability - convenience, in other words. Because you are who you build for, then, enterprise IT products tend to be more secure and compliant and less convenient than developer-oriented alternatives.

None of which would be a problem for old-guard IT vendors if developers, not to mention line-of-business executives, didn’t have increased control over what gets used in the enterprise. From open source to SaaS, legacy procurement processes are fracturing in the face of developers, in particular, building what they want when they want. Because of the cloud. Because of open source. Because of convenience. O’Grady points to a variety of technologies, including MongoDB, Linux, Chef/Puppet, Git, and dynamic programming languages, that have taken off because they’re so easy to use compared to legacy (and often proprietary) incumbents. Most are open source but, as I point out in my ReadWrite article, “open” isn’t always required. Microsoft SharePoint and Salesforce.com, for example, are both proprietary but also easier to adopt than the crufty ECM and on-premise CRM systems they displaced. The key, again, is convenience. It’s one of the things that drew me to 10gen.
MongoDB isn’t perfect, but its data model makes life so easy on developers that its adoption has been impressive. That flexibility and ease of use is why MTV and others have embraced MongoDB. With convenience comes adoption, and with adoption comes time to resolve the issues any product will have. Most recently, this has resulted in 10gen removing MongoDB’s global write-lock in MongoDB version 2.2, as well as changing the default write behavior with MongoClient. All while growing community and revenues at a torrid pace. Back to O’Grady. As he concludes, “with developers increasingly taking an active hand in procurement, convenience is a dangerous feature to ignore.” I couldn’t agree more. - Posted by Matt Asay, vice president of Corporate Strategy. Tagged with: Stephen O'Grady, Redmonk, convenience, ease of use, flexibility, MTV, global write-lock, developers, Linux, ReadWrite

December 20, 2012

Case Study: The New York Times Runs MongoDB

Perhaps your business has settled on the exact right operating model, one that will remain static for years, if not decades. But for the 99.999 percent of the rest of the world’s enterprises, your market is in a constant state of flux, demanding constant iterations on how you do business. As the Research & Development group of The New York Times Company (NYT) has found, a key way to confront the constant flux of today’s businesses is to build upon a flexible data infrastructure like MongoDB. The story behind The New York Times Company’s use of MongoDB isn’t new. Data scientist and then-NYT employee Jake Porway spoke in June 2011 about how the media giant uses MongoDB in Project Cascade, a visualization tool that uses MongoDB to store and manage data about social sharing activity related to NYT content. But what is perhaps new is the more recent realization of just how critical it is to build upon flexible data infrastructure like MongoDB in our ever-changing business climate. Project Cascade visualizes the conversations happening around NYT content on Twitter, giving insight into which content is hot and who is fanning the flames. Joab Jackson, writing for PCWorld, has a great write-up, and you can also see an online demo. For the NYT, as Porway explains, “[Project Cascade] allows us to [answer] questions that are really big, like what is the best time of day to tweet? What kinds of tweets get people involved? Is it more important for our automated feeds to tweet, or for our journalists?” Imagine, however, that the Times editors determine they actually need to be collecting different data. With a relational database, this would involve a fair amount of bother, but for the NYT’s R&D team, it’s simply a matter of tweaking MongoDB’s data model.
As Porway notes, “We can't bother futzing with RDBMS schemas when we're constantly changing what we want to look at.” The NYT started Project Cascade with just two weeks of data on a single MongoDB instance with no replication. Even in this limited snapshot of the roughly 600 pieces of posted content and 25,000 Twitter links each day, Project Cascade was generating 100 GB of MongoDB storage each month. Fast forward to late 2011, and Project Cascade is in serious production, processing 100,000 tweets (and far more clicks) daily, all in real time. This necessitated moving up to a four-node MongoDB replica set, but it didn’t involve adding the complexity of joins or other characteristics of a relational database. As Deep Kapadia, Technical Program Manager at The New York Times Company, says, “MongoDB allows us to prototype things very quickly.” This is important for any enterprise application, as it allows companies to iterate around their data. Most won’t know exactly what their data model should look like right from the start. The NYT certainly didn’t. As Kapadia explains, the NYT didn’t have to do any schema design upfront to determine which fields to capture from Twitter or Bit.ly; it could simply dump all the data into MongoDB and figure out how to process it later. That flexibility is powerful. Granted, not all businesses will want to change as often as the NYT’s research group, but in a world of accelerating change, it’s increasingly critical that companies don’t hard-code rigid schemas into their data infrastructure. It’s also important that enterprises look to the future. However small a project starts, Big Data looms. Porway explains, “Even if we're not dealing with big data when we start a project, the data demands can rise significantly.” An RDBMS scale-up strategy quickly becomes expensive and constrictive. A NoSQL scale-out architecture is much more forgiving.
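The "dump it all in, decide how to process it later" workflow Kapadia describes looks like this in miniature (plain Python dicts standing in for MongoDB documents; the fields and values are invented for illustration):

```python
# Two captures from different days: the payload changed shape between
# them, but both land in the same collection untouched -- no migration.
events = [
    {"tweet_id": 1, "user": "reader1", "url": "nyti.ms/a"},
    {"tweet_id": 2, "user": "reader2", "url": "nyti.ms/b", "retweets": 17, "client": "web"},
]

# Processing code probes for fields at read time, so a new field in
# tomorrow's data needs no upfront schema change today.
total_retweets = sum(e.get("retweets", 0) for e in events)
urls = [e["url"] for e in events]
```

A relational store would have forced a column definition (and a backfill decision) before the first `retweets` value could even be saved; a document store defers that choice until the analysis actually needs it.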
MongoDB is particularly useful because it runs as well on a single node as it does on hundreds of nodes. Scale almost always starts with one node, as Foursquare and others have found. While Web companies like Google and Twitter ran into the constraints of RDBMS technology first, mainstream enterprises are hitting them now. The New York Times has been publishing continuously since 1851, yet the nature of its business has changed significantly since the advent of the Internet. The same is true for most businesses. Like the NYT, most mainstream enterprises today will find themselves collecting, filtering, and analyzing real-time data feeds from a variety of sources to better understand how customers and prospects interact with their products and services. MongoDB fits perfectly in this kind of ever-changing world. Not surprisingly, the publishing and media world is grappling with the need for flexible data models in a very public way. Like the NYT, UK-based news publisher The Guardian also uses MongoDB to help it adapt to digital and the business models it enables. In order to flexibly iterate on different user-engagement models, The Guardian had to drop old-school relational database technology and move to MongoDB. Not that MongoDB is perfect. As Kapadia highlighted roughly a year after Porway’s original presentation, there is definitely a science to deploying MongoDB effectively. It’s very easy to get started with MongoDB, but it requires the same level of care as any critical data infrastructure. If Tim O’Reilly is right and “Data is the new Intel Inside,” then it’s important to build applications on a flexible database that not only can scale to collect increasing quantities of data, but also affords the agility to change one’s data model as business needs change. Data offers real competitive advantage to the companies prepared to leverage it. Just ask The New York Times.
Tagged with: case study, The New York Times, The Guardian, flexibility, agility, publishing, media, MongoDB, RDBMS, relational database, Jake Porway

December 17, 2012