Why, you may ask, is MongoDB profiled in a research report dedicated to evaluating key trends and vendors in the data warehousing market? After all, MongoDB is designed to serve operational use-cases, including Internet of Things applications, customer data management, catalog and content management, mobile services and more. In fact, Gartner placed MongoDB as a Leader in its most recent Magic Quadrant for Operational Database Management Systems in recognition of its completeness of vision and ability to execute against requirements in the operational database market.
While MongoDB is not a data warehouse, we believe its inclusion in Gartner’s latest DW/DMSA Magic Quadrant [available at no cost to eligible Gartner clients] reflects the growing demand from business users to accelerate speed-to-insight and turn analytics into real-time action. Whether that is to detect fraud during transaction processing, present relevant recommendations to shoppers as they browse an eCommerce store, or alert operators to the impending failure of a critical piece of manufacturing equipment, creating fast, actionable insight is accomplished by embedding real-time analytics into operational processes. Gartner calls this trend Hybrid Transactional/Analytical Processing (HTAP), while others use the term translytics. It is this specific capability, highlighted by users surveyed in Gartner’s research, that has driven MongoDB’s inclusion in the Magic Quadrant. Not only is this placement a first for MongoDB, it is also a first for Gartner. No other open-source, non-relational database has ever been included in the DW/DMSA Magic Quadrant.
Augmenting the Data Warehouse: Unlocking Real-Time Analytics
Using traditional data warehousing platforms, the flow of data – starting with its acquisition from source systems through to transformation, consolidation, analysis, and reporting – follows a well-defined sequential process, as illustrated in Figure 1.
Figure 1: Data Flow in Traditional Analytics Processes
Operational data from multiple source systems is integrated into a centralized Enterprise Data Warehouse (EDW) and local data-marts using Extract Transform Load (ETL) processes. Reports and visualizations of the data are then generated by BI tools. This workflow is predicated on a number of assumptions:
Predictable Frequency. Data is extracted from source systems at regular intervals – typically measured in days, months and quarters.
Static Sources. Data is sourced from controlled, internal systems supporting established and well-defined back-office processes.
Fixed Models. Data structures are known and modeled in advance of analysis. This enables the development of a single schema to accommodate data from all of the source systems, but adds significant time to the upfront design.
Defined Queries. Questions to be asked of the data (i.e., the analytical queries) are pre-defined. If not all of the query requirements are known upfront, or requirements change, then the schema is modified to accommodate changes.
Slow-changing requirements. Rigorous change control is enforced before the introduction of new data sources or reporting requirements.
Limited users. The consumers of BI reports and analytics are typically business managers and senior executives.
Technology Foundations for Real-Time Analytics
This workflow remains incredibly valuable, enabling businesses to run deep, historical analysis to monitor performance and inform business strategy. But it presents a significant “impedance mismatch” with the requirements of real-time analytics:
Eliminate latency. The frequency of data acquisition, processing and analysis must increase from days to seconds or less. Source data needs to be analyzed as it is generated by operational applications in order to provide the speed-to-insight demanded by the business. Moving data through an ETL pipeline to the data warehouse will not work for real-time use-cases.
Uncontrolled sources. Organizations need to harness data that is generated outside of their own firewalls – from location data, to web clicks, to sensors, to social media. The analytics team has no control over these data sources.
Dynamic structures. Much of this data is rapidly changing with polymorphic, semi-structured or unstructured formats that do not map neatly to the fixed schema of traditional relational databases powering most data warehouses.
Changing query patterns. It is impossible to predict the types of questions that will be asked of the data. Search, aggregations, geospatial analytics, and machine learning are just some of the tools now available to analysts as they explore new data sets and discover previously undetected trends.
“Big” volume. Data arrives faster, and in quantities that overwhelm traditional data management technologies. It means scaling out databases and analytics across commodity hardware, rather than the scale-up approach typical of most data warehouses.
Wide consumption. Analytics now extends well beyond the management suite. Permeating through every part of the organization, analytics now need to be accessible to staff on the shopfloor, and consumed by operational applications to control real-time behavior.
MongoDB augments the data warehouse by addressing the challenges above, enabling users to run analytics in real-time directly against their data:
- Rich data structures with complex attributes comprising text, geospatial data, media, arrays, embedded elements, and other complex types can be easily mapped to MongoDB’s JSON-based document data model.
- A dynamic schema means that each document (record) does not need to have the same set of fields. Users can adapt the structure of documents just by adding new fields or deleting existing ones, making it very simple to extend and evolve applications by adding new attributes for analysis and reporting.
- An expressive query language and secondary indexes allow fast and rich access to data, enabling complex analytics and search to be performed in place, without having to move the data to dedicated analytics infrastructure.
- Auto-sharding allows MongoDB to partition and distribute large data sets across clusters of commodity servers in the data center or in the cloud.
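To make the points above more concrete, here is a minimal sketch in Python. The sensor readings, field names and the `hot_devices_pipeline` and `find_hot_devices` helpers are hypothetical, invented purely for illustration. The `collection` argument is assumed to be a PyMongo collection, whose `create_index` and `aggregate` methods are used as documented. The two sample documents show the dynamic schema at work: the second carries geospatial and embedded fields that the first lacks. The pipeline then runs an aggregation in place, supported by a secondary index on the timestamp.

```python
from datetime import datetime, timezone

# Two hypothetical readings destined for the same collection. The second
# document carries extra fields (a GeoJSON point and an embedded sub-document
# with an array) that the first one lacks; no schema change is required.
READINGS = [
    {
        "device_id": "pump-1",
        "ts": datetime(2016, 3, 1, 12, 0, tzinfo=timezone.utc),
        "temp_c": 71.5,
    },
    {
        "device_id": "pump-2",
        "ts": datetime(2016, 3, 1, 12, 0, tzinfo=timezone.utc),
        "temp_c": 88.0,
        "location": {"type": "Point", "coordinates": [-87.63, 41.88]},
        "firmware": {"version": "2.1", "flags": ["beta"]},
    },
]


def hot_devices_pipeline(since, threshold_c):
    """Build an aggregation pipeline: average temperature per device
    for readings taken at or after `since`, keeping only devices whose
    average is at or above `threshold_c`, hottest first."""
    return [
        {"$match": {"ts": {"$gte": since}}},
        {"$group": {
            "_id": "$device_id",
            "avg_temp_c": {"$avg": "$temp_c"},
            "samples": {"$sum": 1},
        }},
        {"$match": {"avg_temp_c": {"$gte": threshold_c}}},
        {"$sort": {"avg_temp_c": -1}},
    ]


def find_hot_devices(collection, since, threshold_c=80.0):
    """Run the pipeline against a PyMongo-style collection (assumed)."""
    # Secondary index on the timestamp, so the first $match stage
    # does not have to scan every reading.
    collection.create_index([("ts", 1)])
    return list(collection.aggregate(hot_devices_pipeline(since, threshold_c)))
```

With a live deployment, something like `find_hot_devices(db.readings, since)` would return one summary document per overheating device, computed directly against the operational data rather than a copy in a separate analytics store.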
The latest MongoDB 3.2 release builds on these capabilities with advanced feature sets to enhance analytics:
- The MongoDB Connector for BI allows analysts, data scientists, and business users to seamlessly explore and visualize multi-structured data stored in MongoDB with industry-standard SQL-based BI and analytics platforms such as Tableau, Business Objects, and more.
- MongoDB Compass presents a simple-to-use, sophisticated GUI that allows any user to visualize and explore data with ad-hoc queries in just a few clicks – all with zero knowledge of the MongoDB query language.
- For data governance, document validation allows you to enforce checks on document structure, data types, data ranges, and the presence of mandatory fields.
- Dynamic lookup, new math operators and enhanced search allow richer analytics to be run against live, operational data.
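As a rough illustration of the last two items, the sketch below defines a validator and a pipeline that joins two collections. The `orders` and `customers` collections, their fields, and the helper names are all hypothetical. The validator is written with ordinary query operators, which is how document validation is expressed in MongoDB 3.2. The pipeline uses the `$lookup` stage together with the `$stdDevPop` accumulator, both of which arrived in that release. The `db` argument is assumed to be a PyMongo database object.

```python
# Hypothetical validation rules for an "orders" collection, expressed as a
# query document: type check, range check, and a mandatory field.
ORDER_VALIDATOR = {
    "$and": [
        {"customer_id": {"$type": "string"}},   # data type
        {"amount": {"$gte": 0}},                # data range
        {"currency": {"$exists": True}},        # mandatory field
    ]
}


def create_orders_collection(db):
    """Create the collection with validation enabled.

    Assumes `db` is a PyMongo database; `validator` and `validationAction`
    are passed through to the server's create command."""
    return db.create_collection(
        "orders",
        validator=ORDER_VALIDATOR,
        validationAction="error",  # reject documents that fail the checks
    )


def spend_by_segment_pipeline():
    """Join each order to its customer, then summarize spend per segment."""
    return [
        # Dynamic lookup: a left outer join from orders to customers.
        {"$lookup": {
            "from": "customers",
            "localField": "customer_id",
            "foreignField": "_id",
            "as": "customer",
        }},
        # $lookup yields an array; flatten it to one customer per order.
        {"$unwind": "$customer"},
        {"$group": {
            "_id": "$customer.segment",
            "avg_amount": {"$avg": "$amount"},
            "stddev_amount": {"$stdDevPop": "$amount"},
        }},
    ]
```

Running `db.orders.aggregate(spend_by_segment_pipeline())` against live data would then give the average order value and its spread for each customer segment, without first exporting either collection.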
Putting Real-Time Analytics to Work
Some of the world’s largest and most innovative organizations are putting real-time analytics to work, creating operational efficiencies and building competitive advantage:
Bosch uses MongoDB at the heart of its IoT Suite. Ingesting real-time telemetry data from millions of vehicles enables auto-manufacturers to deliver predictive maintenance schedules to their customers, and improve product design.
The City of Chicago uses MongoDB to pull together millions of data points across its most crucial departments, providing real-time data analysis to city managers so they can better predict and allocate resources, respond quickly to emergencies, regulate traffic flow and uncover trends that would have otherwise been invisible.
Media company BuzzFeed uses MongoDB to pinpoint when content is viewed, where it’s shared, and how it’s being consumed by its 400 million monthly website visitors. The system enables BuzzFeed’s employees to analyse, track, and display these metrics to writers and editors.
The website of OTTO, Germany’s largest online retailer, generates some 10,000 events per second. Every click and hover of every mouse is stored in MongoDB, and real-time data analytics is used to provide unique and personalised web experiences to individual visitors.
Hadoop and Spark: Building the Complete Data Analytics Platform
Of course, it's not just real-time analytics that is driving innovation in the data warehouse world – Apache Hadoop has emerged as a key part of the data management landscape. Some assumed Hadoop would replace the enterprise data warehouse, but that prediction was wrong. In fact, Hadoop is augmenting the data warehouse, in many cases offloading data and specific data transformation workloads from existing data warehouses to less-expensive commodity hardware in scale-out environments.
Many organizations are harnessing Hadoop and MongoDB together using the MongoDB Connector for Hadoop, providing the ability to use MongoDB as an input source and an output destination for MapReduce, Spark, Hive and Pig jobs. With this combination, users can create complete analytics and data management platforms:
- MongoDB powers the online, real-time operational application, serving business processes and end-users.
- Hadoop consumes data from MongoDB, blending it with data from other operational systems to fuel sophisticated analytics and machine learning. Results are loaded back to MongoDB to serve smarter operational processes.
For example, eBay handles user data and metadata management for its product catalog in MongoDB, and uses Hadoop for user analysis to provide personalized search and recommendations. Orbitz uses MongoDB for the management of hotel data and pricing, with Hadoop powering hotel segmentation to support building search facets. Pearson manages student identity and access control along with content management of course materials in MongoDB, and uses Hadoop for student analytics to create adaptive learning programs.
The Rise of Spark
No analytics discussion is complete without reference to Apache Spark – it has become one of the fastest growing Apache Software Foundation projects. With its memory-oriented architecture, flexible processing systems, and easy-to-use APIs, Apache Spark has emerged as a leading framework for real-time analytics, supporting streaming, machine learning, SQL processing and more.
Unlike Hadoop, which has to move all data into HDFS, Spark can work directly against data stored in any database, file system, or message queue. The MongoDB Connector for Hadoop provides a Spark plug-in, allowing Spark jobs to use MongoDB as both a source and a sink. A range of community-developed connectors are also available for MongoDB and Spark integration.
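As a hedged sketch of what "source and sink" might look like from PySpark, the helpers below wrap Spark's `newAPIHadoopRDD` and `saveAsNewAPIHadoopFile` calls. The class names and the `mongo.input.uri`, `mongo.input.query` and `mongo.output.uri` settings are given as we understand them from the connector's documentation, so verify them against the connector version you deploy. The helper names and example URIs are hypothetical, and the connector JAR is assumed to be on the Spark classpath.

```python
import json

# Class names as we understand them from the MongoDB Connector for Hadoop;
# treat them as assumptions and check them against your connector version.
INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"
OUTPUT_FORMAT = "com.mongodb.hadoop.MongoOutputFormat"
KEY_CLASS = "org.apache.hadoop.io.Text"
VALUE_CLASS = "org.apache.hadoop.io.MapWritable"


def input_conf(uri, query=None):
    """Build the Hadoop configuration for reading from MongoDB.

    `uri` names the source as mongodb://host:port/database.collection.
    An optional `query` is sent to MongoDB as JSON so that only the
    matching documents are handed to Spark."""
    conf = {"mongo.input.uri": uri}
    if query is not None:
        conf["mongo.input.query"] = json.dumps(query)
    return conf


def read_from_mongodb(sc, uri, query=None):
    """Use MongoDB as a source: returns an RDD of (key, document) pairs.

    `sc` is assumed to be a pyspark.SparkContext."""
    return sc.newAPIHadoopRDD(
        inputFormatClass=INPUT_FORMAT,
        keyClass=KEY_CLASS,
        valueClass=VALUE_CLASS,
        conf=input_conf(uri, query),
    )


def write_to_mongodb(rdd, uri):
    """Use MongoDB as a sink for an RDD of (key, document) pairs."""
    rdd.saveAsNewAPIHadoopFile(
        path="file:///unused",  # required by the API; the output URI decides the target
        outputFormatClass=OUTPUT_FORMAT,
        keyClass=KEY_CLASS,
        valueClass=VALUE_CLASS,
        conf={"mongo.output.uri": uri},
    )
```

For example, `read_from_mongodb(sc, "mongodb://localhost:27017/logs.events", {"status": {"$gte": 500}})` would ask MongoDB for server-error events only, leaving the filtering to the database rather than to Spark.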
Figure 2: Modernized data architecture: MongoDB, Spark, and Hadoop
Many organizations are already combining MongoDB and Spark to build new analytics-rich applications. A global manufacturing company has built a pilot project to estimate warranty returns by analyzing material samples from production lines. The collected data enables them to build predictive failure models using Spark Machine Learning and MongoDB. A video sharing website is using Spark with MongoDB to place relevant advertisements in front of users as they browse, view and share videos.
A multinational banking group operating in 31 countries with 51 million clients implemented a unified real-time monitoring application, running Apache Spark and MongoDB. The bank wanted to ensure a high quality of service across its online channels, and needed to continuously monitor client activity to check service response times and identify potential issues. All log data is collected by Apache Flume before being persisted to MongoDB, where Spark jobs then analyze that data to power real-time visualizations and alerts of system health.
MongoDB was selected for its high scalability, a dynamic schema that can ingest and manage quickly changing log data, and a rich array of secondary indexes, allowing Spark jobs to efficiently filter and access only the slices of data that are needed to drive the analytics. This approach results in lower latency and higher analytical throughput.
Putting it all Together
If anyone ever tells you the data warehouse market is slow and boring, dominated by just a few mega-vendors, tell them they are wrong. With the adoption of modern technologies such as MongoDB, Hadoop and Spark, organizations are creating new classes of applications and analytics that offer the promise of unlocking new efficiencies, creating new business models and out-pacing competitors. And with MongoDB serving both operational and analytical use-cases, you can build those applications faster, with lower cost, complexity and risk.
To learn more about real-time analytics with MongoDB, Spark and Hadoop, read our white paper.
Gartner Magic Quadrant for Operational Database Management Systems, Donald Feinberg, Merv Adrian, Nick Heudecker, Adam M. Ronthal, and Terilyn Palanca, October 12, 2015.
Gartner Magic Quadrant for Data Warehouse and Data Management Solutions for Analytics, Roxane Edjlali and Mark A. Beyer, February 25, 2016.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.