Why I Wrote the New MongoDB Aggregations Book
In early May 2021, I published my book, Practical MongoDB Aggregations, which I released electronically and free for anyone to read. I love the MongoDB database and the uniqueness and power of its aggregation framework to analyse and manipulate massive amounts of data intuitively and efficiently. The opportunity to share this passion with others spurred me to write the book, with which I aim to support developers, architects, data analysts, data engineers, and data scientists to better understand how to maximise their productivity and effectiveness when building aggregation pipelines, as well as how to optimise these pipelines.
Like many people over the past year during the pandemic, I’ve struggled to keep myself occupied when not busy doing my day job. Hence, my book was born not just from a desire to improve people’s knowledge but as my pandemic project, written over many weekends, to stave off the boredom.
I believe aggregation pipelines provide a powerful domain-specific language for data processing in a way I’ve not seen before in other data-oriented tools, languages, or standards. SQL is a good data query language that caters to some analytical use cases via “group-by/having” statements. However, it typically has to be paired with a procedural language (e.g., Oracle’s PL/SQL) to encompass an ordered set of complex data transformation rules. In the big data world of Hadoop, I find the MapReduce approach is too complex to develop with efficiently. Higher-level tools like Spark help alleviate some of this. However, by the necessity of still having to be general-purpose and versatile, the amount of Spark code required to process data sitting in any type of database is still too high for my liking. Many ETL tools provide proprietary data transformation capabilities, but these have to cater to the lowest common denominator capabilities across all the different types of databases they interact with.
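To make the comparison with SQL's “group-by/having” concrete, here is a minimal sketch; it is an illustration, not an excerpt from the book. The sample documents, the field names (customer, status, amount) and the threshold of 100 are all hypothetical. The pipeline is an ordered list of stages, and the plain-Python function mirrors each stage only to show what that stage does.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
ORDERS = [
    {"customer": "ann", "status": "complete", "amount": 80},
    {"customer": "ann", "status": "complete", "amount": 50},
    {"customer": "bob", "status": "complete", "amount": 40},
    {"customer": "bob", "status": "cancelled", "amount": 500},
    {"customer": "cat", "status": "complete", "amount": 200},
]

# An aggregation pipeline is an ordered list of stages. Roughly the SQL:
#   SELECT customer, SUM(amount) AS total FROM orders
#   WHERE status = 'complete'
#   GROUP BY customer HAVING SUM(amount) > 100
#   ORDER BY total DESC
PIPELINE = [
    {"$match": {"status": "complete"}},                              # WHERE
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},  # GROUP BY
    {"$match": {"total": {"$gt": 100}}},                             # HAVING
    {"$sort": {"total": -1}},                                        # ORDER BY
]


def emulate(docs):
    """Plain-Python stand-in for PIPELINE, one step per stage."""
    # Stage 1 ($match): keep completed orders only.
    completed = [d for d in docs if d["status"] == "complete"]
    # Stage 2 ($group): sum the amounts per customer.
    totals = defaultdict(int)
    for d in completed:
        totals[d["customer"]] += d["amount"]
    grouped = [{"_id": name, "total": total} for name, total in totals.items()]
    # Stage 3 ($match): keep groups whose total exceeds 100.
    large = [g for g in grouped if g["total"] > 100]
    # Stage 4 ($sort): order by total, descending.
    return sorted(large, key=lambda g: g["total"], reverse=True)


if __name__ == "__main__":
    print(emulate(ORDERS))
```

Assuming the PyMongo driver and a running server, the same list would be passed unchanged to a collection's aggregate() method, for example db.orders.aggregate(PIPELINE), where the orders collection name is again an assumption. Each stage consumes the output of the previous one, which is the "ordered set of transformation rules" that plain SQL tends to need a procedural language for.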
For these reasons and from experience, I consider MongoDB Aggregations to be the best tool for processing large data sets because it combines performance with productivity. Nevertheless, I sense the aggregation framework is shrouded in mystery for many people, hence my desire to demystify it with this book.
I believe I identified a knowledge gap that many users wanted to be filled. MongoDB Inc. provides excellent reference documentation about aggregations in the MongoDB Manual, and MongoDB University provides a tremendous free online training course on aggregations. What I felt was still to be addressed was an opinionated yet informed perspective on how best to assemble aggregation pipelines from the well-documented parts: something that points the way to achieve optimal productivity and performance, accompanied by fully formed example pipelines to help put these approaches into practice.
I hope readers of my book will learn some new things of value and enjoy reading it. A good test of the relevance of my book, in time, will be if people come back to it repeatedly as they continue with their journey of developing aggregations. Read the book for free now!
Intern Spotlight: Russell Kaplan
This year, MongoDB welcomed 33 university students to our intern program in Engineering, Marketing and Education. In this series, we'll introduce you to the talented students who are helping us transform development and operations for how we run applications today. We had the chance to sit down with intern Russell Kaplan, who is working on the C++ Driver team.
Where do you go to school, what is your major, and what year are you in?
I go to Stanford, where I am a computer science major and a rising sophomore.
What is your role at MongoDB?
I work on the C++ driver team, building a geospatial API.
How did you find out about the internship program at MongoDB? Why did you choose to come to MongoDB?
I met MongoDB at PennApps. The app I made there won the prize for best use in the MongoDB category. It was called screenshades, and was a Chrome extension that figured out what TV shows you watch and hides spoilers for them from your Twitter stream. It worked with machine learning, so we needed a lot of training data, which we scraped from Twitter and Reddit for spoiler hashtags and built a dataset off of. We then used that as a classifier. I chose to come to MongoDB because I already had a lot of experience with front-end development and building web-apps and wanted to learn more about the back-end of development.
What’s your hometown?
My hometown is NYC. Best city in the world!
Did you have previous experience using MongoDB before you arrived? If so, how are things different now that you work at MongoDB? If not, how did you learn MongoDB and how was the education process?
I used it at hackathons before. But I only really used its basic features. I learned a lot more about it after getting here. It’s really simple to use for quickly getting started with web applications.
Bike or public transportation to work?
Subway.
What’s a typical day (or week) for you?
I get into the office by 10am. Eat some breakfast in the café, catch up on emails for a bit and then get to coding. I code until lunch, have some Seamless, play a game of ping pong and then code for the rest of the day.
What do you love most about MongoDB?
I love the people I get to work with. It’s a lot of really smart high-energy people that I have so much to learn from.
What’s the most challenging aspect of your job?
Because it’s a database and an open source company, the code really has to be production quality in a way that class work doesn’t. It’s a much more rigorous standard of development. That’s something that’s really cool to learn but challenging at times.
What do you hope to accomplish while you’re here?
I hope to have my code integrated into the rest of the MongoDB code base. I hope that the people who use the C++ driver appreciate the work I’ve done.
What’s your favorite Seamless lunch order?
Chop’t steak salad.
Name one secret skill you have, unrelated to work.
I can beatbox. A little bit, I’m an amateur.
Who’s your favorite tennis player?
Djokovic, he’s incredible. He also has a hilarious sense of humor and isn’t afraid to make jokes about himself and other players.
Kindle or book? What’s your favorite book?
Books. I’m old school. My favorite book is probably 1984.
Describe your perfect weekend.
Oh man. Sleep in late Saturday morning and then go play some tennis with some friends. Discover some obscure yet delicious restaurant for dinner, and then go see a Death Cab for Cutie concert. All while getting to hang out with friends and family.
Want to help build the next revolution in database technology? MongoDB offers summer internships and new graduate opportunities to foster computer science talent across the country. Learn more about the MongoDB University Relations program.
Beyond NoSQL: A Modern Database Manifesto
There is no such thing as NoSQL. Not as we tend to think of it, anyway. While NoSQL was born as a movement away from rigid relational data models so web giants could embrace Big Data with scale-out architectures, the term has come to categorize a set of databases that are more different than they are the same. This broad categorization doesn’t work. It’s not helpful. While we at MongoDB still sometimes refer to NoSQL, we try to do it sparingly, given its propensity to confuse rather than enlighten.
Deconstructing NoSQL
Today the NoSQL category includes a cacophony of over 100 document, key-value, wide-column and graph databases. Each of these database types comes with its own strengths and limits. Each differs markedly from the others, with disparate models and capabilities relative to data storage, querying, consistency, scalability and high availability. Comparing a document database to a key-value store, for example, is like comparing a smartphone to a beeper. A beeper is exceptionally useful for getting a simple message from Point A to Point B. It’s fast. It’s reliable. But it’s nowhere near as functional as a smartphone, which can quickly and reliably transmit messages, but can also do so much more. Both are useful, but the smartphone fits a far broader range of applications than the more limited beeper. As such, organizations searching for a database to tackle Gartner’s three V’s of Big Data -- volume, velocity and variety -- won’t find an immediate answer in “NoSQL.” Instead, they need to probe deeper for a modern database that can handle all of their Big Data application requirements.
Modern Databases For Modern Data
One of these requirements is, of course, the ability to handle large volumes of data, the original impetus behind the NoSQL movement. But the ability to handle volume, or scale, is something all databases categorized as “NoSQL” share. MongoDB, for example, counts among its users those who regularly store petabytes of data, perform over 1,000,000 operations per second and run clusters that exceed 1,000 nodes. A modern database, however, must do more than scale. Scalability is table stakes. It also must enable agility to accelerate development and time to market. It must allow organizations to iterate as they embrace new business requirements. And a modern database must, above all, enable enterprises to take advantage of rapidly growing data variety. Indeed the “greatest challenge and opportunity” for enterprises, as Forrester notes, is managing a “variety of data sources,” including data types and sources that may not even exist today. In general, all so-called NoSQL databases are much more helpful than relational databases at storing a wide variety of data types and sources, including mobile device, geospatial, social and sensor data. But the hallmark of a modern database is its ability to allow organizations to do useful things with their data.
Defining The Modern Database
To count as a modern database, then, a database must meet three requirements. While relational databases are able to manage some of these requirements, and newer so-called “NoSQL” key-value or wide column data stores meet others, only MongoDB meets all three requirements.
The database MUST scale. As data volume and velocity grow, so the database must grow too. It should scale horizontally and elegantly, without doing unnatural things to your application, in the cloud or on commodity hardware. Meeting the base requirements -- like having enough capacity to serve your customers -- should be a given.
The database MUST adapt to change. The speed of business accelerates and your database must keep pace, enabling iteration. This means you must be able to process and mine new data sources and data types without the database breaking a sweat (or you breaking your back or budget). Your schema must flow from your application requirements, rather than forcing your application to fit a predefined, rigid schema.
The database MUST unleash your data. Just storing data isn’t enough. You must be able to exploit the data, which particularly means you must be able to ask significant questions of your data. In part this means that the database must support rich queries, indexing, aggregation and search across multi-structured, rapidly changing data sets in real time. But it also means that it must support data for modern use cases including mobile, social, Internet of Things and other systems of engagement.
Some relational databases can handle a few of these requirements, yet fail in the essential need to deliver scale and adaptability. Some newer databases, including so-called “NoSQL” key-value or wide column data stores, meet still other requirements, yet don’t give organizations the latitude to unleash their data. In fact, they constrain you to look up data by the key with which it was written unless you integrate external search engines and analytics nodes, which can create other problems.
MongoDB: A Modern Database For Today's Business Needs
But only one database today can deliver on each of these critical components of a modern database. Only one database offers orders of magnitude more productivity for developers and operations teams alike, while still delivering petabyte scale and lightning-fast performance. Only MongoDB, the modern database that tens of thousands of organizations depend upon to build and run today’s most demanding applications. To learn more about how MongoDB has enabled some of the world’s largest and most innovative companies to deliver applications and outcomes that were previously impossible, download our new whitepaper.
Announcing the Second Annual MongoDB March Madness
March Madness is a college basketball tournament, but it is also a month where we engage our MongoDB User Group network in a global activity. Last year we had the groups compete in a World-wide Hackathon. This year, we are sending MongoDB engineers to 33 MongoDB User Groups (MUGs) around the globe! Our goal is to give our incredible MUG Leaders what they always ask for: A MongoDB Engineer to share their expertise with the community! Thanks to MUG Leaders Jason Ford, Bev Corwin, Flavio Percoco, Tobias Trelle, Ivan Hristov, John Puddifoot, Mário Cordeiro, Adrian Wolny, Stefan Rudnitzki, Brad Urani, Scott Shellabarger, Mario Koppen, Bertin Nono, Ben McCann, Sig Navarez and Victoria Malaya for making this event happen. We'll be launching March Madness at the Stockholm MUG and are looking forward to sharing the great stories, slides and photos with the community. Find a March Madness event happening near you or find a MUG near you. Don't see a MUG in your area? Start a MUG for your local tech community.
Amsterdam, Barcelona, Berlin, Cambridge, Casablanca, Cincinnati, Copenhagen, Dnipropetrovsk, Dublin, Dusseldorf, Edinburgh, Geneva, Gent, Hamburg, Krakow, Lisbon, Little Rock, London, Madrid, Moscow, Milan, Munich, Nashville, Orange County, Oporto, Paris, Richmond, Rome, Sevilla, Stockholm, Tel Aviv, Vienna
Looking beyond labels like relational and NoSQL
According to a new Dice.com salary survey, MongoDB ranks as one of the top-10 most highly compensated technology skills. Indeed.com rates MongoDB as the second hottest job trend. And according to DB-Engines.com, which ranks over 200 databases on their relative popularity, MongoDB is now the fifth-most popular database in the world, this month surpassing IBM's DB2. All great, right? Maybe. Buried in the Dice.com data, as well as the Indeed.com data, is evidence of real confusion. For example, one of the top-10 most highly compensated skills in Dice.com's survey is "NoSQL." NoSQL is not a technology. It's not really something a developer can "know" in any real sense. NoSQL is a movement that describes a different way of modeling data but, as Basho founder Justin Sheehy correctly noted, there are as many differences among so-called NoSQL databases as there are similarities. As such, knowing Basho's Riak won't really help you understand MongoDB. Perhaps at a high, conceptual level, but expertise in one doesn't really translate into familiarity with another. They are different databases with different approaches. Employers looking for generic NoSQL skills need to think more deeply about what their application requirements are. Looking beyond relational databases for modern application requirements is a good start, but looking to generic "NoSQL" is not sufficient. Organizations should be looking for a modern database that dramatically improves developer productivity, encourages application iteration and enables a new wave of transformational applications in areas like Big Data, Internet of Things, mobile and more. That database is MongoDB. Is MongoDB "NoSQL"? Sure. But it's much bigger than that (based on what people search for on Google, many organizations already seem to understand this).
MongoDB is the fastest-growing database in the world, not because it fits the NoSQL category, but because it significantly improves the productivity of developers and the organizations for which they work. So if you're looking to hire technology talent, you're far more likely to be successful hiring an experienced MongoDB engineer than a "NoSQL engineer." MongoDB, after all, is an actual database. NoSQL simply describes an important movement.
MongoDB Named InfoWorld 2014 Technology of the Year: It's A Matter Of Innovation
When it rains, it pours. Right on the heels of being named DB-Engines' 2013 Database of the Year and Linux Journal's Best NoSQL Database, InfoWorld has given MongoDB its 2014 Technology of the Year award, alongside Amazon Web Services and GitHub, among others. More than just point solutions to finite business problems, InfoWorld's list includes technologies that "point the way to the data centers, clouds, and applications of tomorrow. They’re the innovations that are changing the way we work and do business," as Doug Dineley, executive editor of InfoWorld’s Test Center, declares. Sometimes innovation is about lower costs. For example, one of the biggest advantages Hadoop brings is enabling data analytics on commodity hardware, as opposed to the expensive, proprietary solutions of yesterday. The real value of Linux, in its early years, was arguably less about product innovation and more a matter of helping enterprises transition away from expensive UNIX servers. MongoDB enables a different type of innovation. Yes, MongoDB is dramatically less expensive than licensing and running a proprietary relational database. But that's not what has made it the fastest-growing, most popular non-relational database (by a wide, wide margin). Instead, MongoDB is popular because it reinvents data management, enabling developers to write a new breed of application that is impossible, or exceptionally difficult, with a relational database. Part of this is a matter of simplifying data schema. And part of it is allowing the developer to focus on her application, and not the unnecessary overhead of object relational mapping and upkeep on a rigid, relational schema. But the overall value is about enabling and ennobling developers, giving them power to get work done for the line of business tasked with new marketing initiatives, optimizing business processes and more.
Ultimately, then, MongoDB has won InfoWorld's 2014 Technology of the Year award because it brings innovation back to the data management market, something that has been sorely lacking for a long time.
Hudl: Getting Athletes to the Top with MongoDB
Football is a resource-intensive sport. The strategy and people power that help bring a team into top shape are enormous. Playbooks look like phone books and the hours of game and practice footage are difficult to distribute to teams and coaching staff. Many teams, however, have gotten an edge by using Hudl, a platform that offers secure access to video analysis tools from any computer or mobile device. The MongoDB-based platform makes it easy to upload, sort, analyze and share video to help coaches learn about their teams, scout opponents and win. After facing bottlenecks with SQL, Hudl turned to MongoDB to support its video metadata storage. MongoDB delivers a flexible data model, ensuring coaches are not restricted when defining variable data, such as football formations, camera angles, and custom notes used for post-game analysis. With MongoDB, Hudl can create a single collection with high-speed querying, while easily and cost-effectively sharding to scale linearly. “Rather than partitioning SQL, we decided to invest in horizontal scale for the long term,” said Brian Kaiser, CTO at Hudl. “MongoDB makes it so easy to add shards that we don’t require a large capital expenditure to upgrade, which is great from a predictability point of view. Together with Amazon’s Provisioned IOPS, MongoDB delivers remarkably stable query.” MongoDB has increased developer productivity by facilitating Hudl’s A/B testing and enabling the incremental, easy rollout of new features. In addition, Hudl relies on MongoDB Management Service (MMS) as a crucial asset to monitor MongoDB clusters and proactively address deployment issues. “MongoDB is painless for developers and has proven to be battle-tested for the web-based video analysis that Hudl requires,” said Kaiser.
“We appreciate having a strong company that backs the product with great support and a high level of innovation.” Since 2001, over 1.6 million recruiting packages have been sent through Hudl, and over 162,000 college coaches have watched recruiting films through Hudl. We hope to see more from the Hudl team as they change the way athletes, coaches and recruiters build talent.
Mapping the Industry's Tectonic Shift in Data Management
We are clearly in the early stages of a "tectonic shift" in the database market, as eWeek terms it. Not because any particular database vendor decided that the world was ripe for a change, but because the nature of data we're generating and processing has changed. Dramatically. In a recent research note, Cowen & Co. analyst Peter Goldmacher clearly articulates this shift: It is well understood that the current database giants have written superb products to solve primarily one problem (automating standard business processes), but we no longer live in a one problem world. The proliferation of mobile devices is forcing an immense structural change as we increasingly overlay a digital existence on top of our analog existence. "If we can measure it, we can manage it" has transcended the world of business process automation and now has meaning in everything we do, as everything we do generates data. Driving, tweeting, gaming, friending, browsing, walking...it all generates data. We can capture, analyze and derive tremendous value from that data, but only if we can use low cost, high-quality data management products. This is the challenge MongoDB is laying down, and it is the challenge all other data management players must rise to meet if Big Data is going to realize its potential. I've called out before that NoSQL and Hadoop are the new normal in data management. This is why. And it's why as much as the RDBMS establishment may wish it otherwise, the industry looks bright for NoSQL technologies like MongoDB.
The Changing Of The Technology Guard: NoSQL + Hadoop
Big Data truly is prompting a changing of the technology guard. In an excellent article today, The Wall Street Journal notes that Hadoop is "challenging tech heavyweights like Oracle and Teradata [whose] core database technology is too expensive and ill-suited for typical big data tasks." This follows my own observations that repeated earnings misses across the legacy technology vendor landscape indicate that real, tectonic shifts in the technology landscape are underway. In other words, NoSQL and Hadoop are the new normal. What the Journal missed, however, was the right emphasis. As fantastic as Hadoop is, it's only one part of the Big Data story. And not necessarily the most significant part. For example, the Journal writes:
Traditional databases organize easy-to-categorize information. Customer records or ATM transactions, for example, arrive in a predefined format that is easy to process and analyze. These so-called relational databases are the kind offered by Oracle and Teradata among others, and the market for them runs to an estimated $30 billion a year, according to IDC estimates. The Internet, though, is messy. Companies now also have to make sense of and store the mass of data being generated from tweets, Web-surfing logs and Internet-connected machines. Hadoop is a cheap technology to make that possible, and it was born of Google technologies detailed in academic papers.
The article is dead-on in most respects, except for the market that Hadoop truly tackles. Of the $30 billion database market, Hadoop addresses just a quarter of it: the OLAP market. The much larger market is the traditional OLTP market, and this is the home of NoSQL databases like MongoDB. Perhaps unsurprisingly, then, MongoDB has the fastest growing Big Data community, and the second hottest job trend after only HTML5. Big Data, after all, isn't merely about analytics. It's primarily about operational databases that can help enterprises put their data to work in real time.