The big data problem
It is probably apparent to everyone reading that data is growing at enormous rates. There is extremely valuable insight that can be found in this data if harnessed effectively, and traditional technologies, many initially designed 40 years ago like RDBMSs, are not sufficient for creating the business value promised by the “Big Data” hype. A common example in using Big Data technology is for “Single View of the Customer” – aggregating everything you know about a customer in order to optimize your engagement and revenue with them, e.g. determining exactly what promotions to send them via which channel and when.
The big question though is what architectures and solutions will enable us to actualize this potential. The high-level key functional requirements include at minimum the ability to:
- Aggregate cross-silo, external, and streaming data
- Govern access to data
- Transform data as needed
- Aggregate data
- Provide tools to analyze data
- Report on the data
- Incorporate insight into operational processes
- Do all of this with minimal TCO and response times
Data Lake vision as an answer
Many enterprises are looking at an architecture some call the Data Lake, a flexible data platform for aggregating cross-silo [streaming and persisted] data in a single [logical] location, to be able to mine and derive insight from the data across the enterprise and from 3rd parties. There is considerable momentum towards using Hadoop (including Spark) as the Data Lake for various reasons. It leverages low-TCO commodity hardware to scale horizontally, allows schema-on-read (for accepting a high variety of data), is open source, and includes distributed processing layers with SQL and common languages. Moreover, webscale companies like Yahoo and Google were early references who used it to great success for problems they encountered in indexing the web.
Data persistence options in Hadoop
With that, it seems like a reasonable place to start to assess solutions for the Data Lake vision. When you start to understand what Hadoop is at a deeper level, you find it is really a wide range of projects that cover different aspects of data processing. When we explore storing data in the Data Lake with Hadoop, there are two primary options: HDFS and HBase. With HDFS you decide how to encode your data in append-only files, from JSON, to CSV, to Avro, and others, and it’s up to you because HDFS is just a file system. In contrast, HBase is a database and has a specific way of encoding data that is optimized for writing records quickly and is relatively fast for reading only when querying by primary key.
This is where the attractive vision of a Data Lake with Hadoop-alone meets the reality of an implementation so we can begin to assess the capabilities of Hadoop against the high-level requirements listed above. You can still leverage the distributed processing layers in the Hadoop ecosystem, like Spark and Hive, without using HDFS or HBase so you can choose your persistence layer separately from your distributed processing layer. As an example, you can see my previous blog post on using Spark DataFrames with the data read from and written to MongoDB. Similarly, another previous blog post demonstrates MongoDB as just another Hive table from/to which to read or write.
Indexes still matter
Most technologists familiar with RDBMSs realize there is enormous value from expressive querying capabilities and secondary indexes to make the querying fast (even if the fixed schema, high TCO, and limited horizontal scaling of RDBMSs make it difficult to use as a Data Lake). If we only use HDFS and HBase for our Data Lake persistence, we don’t get the benefit of ad hoc indexing that we have come to expect from databases, and notably run into a few limitations:
- Ad hoc slices - How would we efficiently analyze slices of data identified by more than a primary key without secondary indexes, e.g. running analytics on our best customers, those that have spent more than X amount with us? We would be stuck scanning that huge amount of data to find our best customers.
- Low-latency reporting - How are we going to provide reporting with the sub-second response times that users demand on all this valuable data, if we don’t have flexible indexing? Again we could only quickly report using a customer’s account number or other primary key, and not by using a customer’s name, phone number, zip code, spend, etc. Important to mention, MongoDB just released a BI Connector for any SQL-based reporting tool to work with MongoDB.
- Operationalization - Similarly, how are we going to incorporate the valuable insight into the operational applications, where we impact the company and customers the most, and monetize data without flexible indexing? Imagine a customer service representative (CSR) telling a customer he/she must give only an account number to look up all of their information, because the Data Lake only supports a primary key, or it will take 10 minutes to scan all the data for their information.
Of course there are workarounds to address some of these but they introduce higher TCO, more development or operations effort, and/or higher latency. For example, you could use a search engine or materialized views to query by other than primary key but then you have to go back to make another round trip to the main table in the database with the complete record to get all the data you want. Beyond having double the latency, it requires more management, development effort, and/or infrastructure to use a separate search engine and maintain materialized views, plus there are unnecessary consistency issues writing data to extra places. Wouldn’t it be nice to just have normal flexible indexes we are used to while still maintaining our design principles?
MongoDB is an integral part of an effective Data Lake
We started this discussion exploring whether Hadoop alone would satisfy the requirements for a Data Lake, and found at least 3 gaps. Can we add another persistence layer into our architecture that would fill those gaps and be consistent with our design principles of using low TCO commodity hardware and open source models, schema-on-read, and Hadoop’s distributed processing layers?
I chose to write on this topic because MongoDB is the optimal database to fill the gaps in a Hadoop-only Data Lake. If you use another open source NoSQL database, you will find they almost never have secondary indexes (and if they do, they are not consistent (i.e. in-sync) with the data), nor do they have group-by and aggregations in the database. You could write data to the Data Lake with some of those databases, but if you also want to read concurrently in flexible ways with secondary indexes as the business demands, it would fall short of your requirements. If you use an open source RDBMS in a Data Lake, we already mentioned that their fixed schema and expensive vertical scaling model goes against our design principles for a Data Lake.
Consequently, the below diagram is a recommended architecture for a Data Lake.
MongoDB Integral to a Data Lake
This architecture adds MongoDB as a persistence layer for any dataset for which you need expressive querying, related to the 3 reasons listed above for which you want indexes. I recommend a governance function to decide, based on requirements of the data from consumers, whether to publish the data to HDFS and/or MongoDB. No matter whether you store it on HDFS or MongoDB, you can run your distributed processing jobs, e.g. Hive and Spark. However, if the data is on MongoDB, you can efficiently run your analytics on ad hoc slices of data, because the filter criteria is pushed down to the database and not scanned across files like it would be in HDFS. Likewise, the data in MongoDB is available for real-time, low-latency reporting, and serves as the operational data platform for all those systems of engagement and digital apps you are likely building.
What I find some firms are doing today is copying their data into Hadoop, transforming it, but then copying it elsewhere to do anything valuable with it. Why not bring the most value you can directly from the Data Lake? With MongoDB, you can multiply the value many times.
The Data Lake vision is worthwhile and feasible if you look at the requirements you have in the short and long-term and ensure you fulfill those requirements with the best tools available in the core Hadoop distribution but also those in the ecosystem like MongoDB. I have seen a few enterprises start with a Data Lake by simply spending a year cleansing all their data and writing it to HDFS in the hopes of getting value from it in the future. Then the business is frustrated at seeing no value and in fact yet another batch layer is between them and the customer.
You can ensure the success of your Data Lake by combining Hadoop with MongoDB, for enabling a low TCO, flexible data platform for the optimal response times for all users, including data scientists and analysts, business users, and customers themselves. With that Data Lake, your company and you can deliver on the promise of a Data Lake to gain unique insight, engage customers effectively, monetize your data, and outmaneuver your competition.
Watch Matt's presentation on Enterprise Data Management to learn more about common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like.
About the Author - Matt Kalan
Matt is a Sr. Solution Architect at MongoDB with extensive experience helping more than 500 customers across industries solve business problems with technology. His particular focus these days is on guiding enterprises in maximizing the business value of enterprise data management (EDM) through all the noise in a fast-moving market. Before MongoDB, Matt grew Progress Software’s Apama Algorithmic Trading and Complex Event Processing (CEP) Platform business in North America and later sold broader operational intelligence solutions. He previously worked for Caplin Systems selling solutions to stream real-time market data over the web to FX and FI portals, and for Sapient providing consulting services to Global 2000 clients.