Reducing the Need for ETL with MongoDB Charts
Databases as we know them have been around for over 40 years. When they first came about businesses would often keep data in separate systems and separate formats. There were a variety of reasons for these decisions. One of the side effects of these separate data stores is the need to combine together to be able to perform data analysis. This led to the long-standing practice of ETL, or Extract, Transform, Load.
ETL is a process to extract data from a starting data source, transform the data in some fashion, then load it into another data store. Sounds simple enough, but in fact, there is a lot of work going on under the covers and a lot of steps and decisions to navigate. These additional steps reduce the speed at which we can get meaningful insights from our data. Further, they rely on many assumptions about transforming data into what is assumed to be the correct format for later consumption - without knowing very much about the business questions to be asked of this data down the road.
From Data Warehouses to the Cloud
Traditionally, enterprise applications have relied on performing ETL operations to move data into an enterprise data warehouse (EDW).
Creating a successful data warehouse can be a long, complicated, and expensive process. One of the technologies that have been created to help with the process is Apache Hadoop. Hadoop allows for the processing of massive amounts of data on commodity hardware with open source technologies. However, instead of simplification, the ETL and data warehousing landscape has only become more complex and cumbersome and the proliferation of tools combined with maturity and adoption issues have only increased the cost. Further, according to Gartner analyst Nick Heudecker, 85% of big data projects fail. Mostly due to the complexity of the process itself.
With the transition to the cloud many organizations are undertaking, ETL becomes even more complicated from a meaningful and timely data analytics standpoint. Moving data from one source to another takes time. Now there is hidden data transfer and compute costs and latencies to navigate. While some meaningful analytics can be performed on stale data, most modern analytics need to be as close to real-time as possible.
Issues With ETL
A few of the problems that we are faced with when setting up ETL processes are:
- Latency & Downtime - There is an inherent cost of moving data from point A to point B. Forty years ago, when ETL started, we were working with megabytes of data and not needing “instant” access. Today we’re dealing with terra or petabytes of data and needing real-time insight from that data.
Moving data across the network isn’t free. On a 100 BaseT network, transferring one gigabyte of data takes 100 seconds. A terabyte takes 10,000 seconds or over two and a half hours. All assuming that it’s on a dedicated network that isn’t used by other applications. At ETL demands grow, data could easily be stale by many hours.
We used to be able to schedule these transfers during “downtime” at midnight. However, in today’s global world, users are always online somewhere demanding instant access and insight. Downtime is simply no longer acceptable and latency has become the new downtime. Should suppliers on one side of the world suffer from poor performance just so executives on the other side of the world have up to date dashboards in the morning?
- Storage is cheap, labor is expensive - Data warehouses started at a point in time in which storage was expensive. In 1981, one gigabyte of data storage cost about $290,000. Today that cost is under $0.10. It was, therefore, important to transform and compress as much data as possible when storing to save costs.
As storage costs have decreased, labor costs have gone the opposite direction. Having a good database administrator to design, manage, and maintain your data warehouse and ETL path is expensive. Storing raw data is frequently seen as a more economically viable choice.
- ETL is hard - ETL takes planning. Lots of it. And not just for your current load of data, but for what might happen to the load down the road. Additionally, ETL scripts can get long and complex.
Bringing in data from a variety of sources, looping over them, adding logging, error handling, configurations are just the start. Determining how the data needs to be transformed can be complex, and fragile. What happens if data stored today as a string gets changed down the road? The process breaks and adjustments need to be made.
Do you ever wonder why the first answer out of a DBA’s mouth is an emphatic “No!” when asked if something can be changed? One “simple” change can mean changing dozens or hundreds of lines of code. For these reasons and more, ETL requires planning for current and future data needs, loads, and shape.
- Are developers the right people to build the ETL pipeline? - Developers are great at many things, however, knowing about data storage and ETL pipelines aren’t often one of them. ETL design and implementation are typically best done by data engineers. While a developer may be able to get data through an ETL pipeline and into a data warehouse, generally speaking, it often isn’t done in the most efficient manner. Specialized data engineers should be responsible for these tasks. If you don’t have them on your team, this is another cost of ETL.
- Maintenance headaches - As the size and complexity of data, applications, and analytics requirements grow, so does ETL maintenance. Maintaining changes in data velocity, formats, connections, and features takes time. Many of these challenges may not be thought of at the start of a project, but lead to long-term maintenance needs.
Use MongoDB Charts to Avoid the Headache of ETL
Companies today still have data in a variety of systems. In certain instances, ETL is the only option to be able to perform visualization and analysis of your data. Or, perhaps, you’ve explored ETL but haven’t taken the steps needed to get your data ready for analysis because it’s overwhelming.
If you’ve leveraged MongoDB as your database, the need for ETL procedures has been dramatically reduced with the introduction of MongoDB Charts, now in beta. MongoDB Charts natively understands the MongoDB Document Model allowing for the rapid creation of data visualizations over your data.
With MongoDB Charts you can connect to your MongoDB server, assign user authorization policies to your reports, and easily generate visualization dashboards. With over a dozen different chart variations to choose from, stunning visualizations are just a few clicks away.
MongoDB Charts allows for data to be visualized without performing ETL operations, saving valuable time and resources. You don’t need to write any code or rely on third-party tools. Further, you still get to leverage the richness of the Document Model.
For those situations that you want to quickly access your MongoDB Data, MongoDB Charts is a terrific option. If you’re in a situation that requires multiple data sources to be analyzed, we offer the MongoDB Connector for Business Intelligence. If you are doing advanced analytics with Apache Spark, we have an option for that as well with the MongoDB Connector for Apache Spark.
For many roles in an organization, MongoDB Charts is a great tool for analyzing your data. There’s no need to go through the pain of the ETL process. It is the fastest way to build visualizations over your MongoDB Data, wherever it’s stored. On-premise or in the cloud hosted by MongoDB Atlas. Give it a try today!
Visualizing Your Data With MongoDB Charts
Having data stored in a database is practically a given for today’s businesses. Customer information, order history, product pricing, IoT sensor data, and much more is being recorded for future use. However, just having the data stored isn’t enough to form a competitive market advantage. We must be able to analyze the data as well. There are many options to do so and in a variety of ways. If you have data that needs to be visually analyzed in MongoDB, MongoDB Charts is a terrific option.
Prior to MongoDB Charts, there were really three ways to visualize your MongoDB Data.
- Leverage the MongoDB Business Intelligence (BI) Connector in conjunction with third-party BI tools,
- Perform Extract-Transform-Load (ETL) operations and leverage third-party tools, or
- Write custom code and use charting libraries such as D3.js or Bokeh.
MongoDB Charts, currently in Beta, provides an easy way to visualize your data living in MongoDB. You don’t need to move your data to a different repository, write your own code, or purchase third-party tools. MongoDB Charts knows and understands the richness of the Document Data Model and allows for easy data visualization.
Further, MongoDB Charts allows for a secure way to create and share visualization dashboards with everyone, or just targeted team members. Similarly, the data source being used behind the scenes can be shared securely as well. For example, data for the Sales Department doesn’t have to be made available to Marketing unless needed. Very powerful and follows MongoDB’s design of security being a top priority.
After downloading the MongoDB Charts Docker image and following the installation instructions, we’re able to connect to a data source stored in MongoDB Atlas and start making visualization dashboards. Once connected to the MongoDB Charts server, there are three steps we need to take:
- Add a data source
- Create a dashboard
- Create our charts
Analyzing Airbnb Data with MongoDB Charts
I have set up a database with some Airbnb data from various cities. We’ll be exploring the dataset from Seattle, WA here, but feel free to explore others on your own. We need to get the connection string from the Atlas Cluster that has our data and connect to it in Charts.
Add a Data Source
With our MongoDB Charts server running on
localhost:80, we can log in and head to the Data Sources tab. We use the URI from Atlas (
mongodb+srv://airbnbdemo:email@example.com/test?retryWrites=true) and select Connect. We’re next asked which data source we want to use from that cluster, I’ll select the seattleListingAndReviews from the
airbnb database for this example. For permissions, I just want to keep everything private so I’ll accept the defaults and select Publish Data Source. Once published I can add an alias to the data source. I’ll call it
Create a Dashboard
Next up is to create an actual dashboard to house our visualizations. In the Dashboards section choose New Dashboard and give it a name and description, like Ken’s Airbnb Dashboard. This will take me to where I can add charts to my dashboard.
Create a Chart
After clicking on the Add Chart button we can start building our visualization. We’ll want to choose the
Airbnb Seattle data source from the drop-down. MongoDB Charts automatically determines which fields are available for exploration. For this exercise, I’d like to see which neighborhoods in Seattle have the most Airbnb properties and split them by property type. We’ll use the Stacked Bar chart for the type.
- For the X-Axis then, we’ll want the
idfield, aggregated by count.
- Along the Y-Axis we’ll look at the address and the suburb. Notice that
addressis a subdocument here and that MongoDB Charts natively knows how to handle this type of data. I’d like to sort the
suburbby aggregated value, in descending order, and limit our results to the top 20 suburbs.
- Let’s add the
property_typefield as our series
Now we can name our chart, Properties by Location and save it. We’re then taken back to our dashboard where we can add other visualizations for further exploration.
Have a look at this short video to see some other visualizations being created from this same data source.
MongoDB Charts is an excellent new tool to visually explore your data. It has some great features for specific use cases, such as:
- Ad hoc analysis of your data
- Natively understands the benefits of the Document Data Model
- Collaboration on projects is easy with user-based sharing and permissions
- It’s intuitive enough for non-developers to use allowing for self-service data analysis
MongoDB Charts is the fastest way to build visualizations over your MongoDB data. I’d encourage you to download it and try it out today. Let me know what visualizations you come up with from the Airbnb dataset. I always enjoy seeing how people explore their data.
MongoDB On The Road - Node+JS Interactive
October 10-12, 2018 brought 1,000 developers to Vancouver, BC, Canada for the Node+JS Interactive 2018 conference. Put on by The Linux Foundation, the conference provided talks for two days followed by a day of workshops. MongoDB was a proud Bronze Sponsor of the event. This allowed us to have a booth in the Sponsor Showcase Hall along with having a presence at the Career Fair event.
MongoDB had a great presence at Node+JS Interactive 2018. Aydrian Howard and I from the Developer Advocacy team were on hand to answer questions. Thomas Cirri was there from our Recruiting team. By the way, we’re hiring! Dan Aprahamian from our Node.js Driver team was there along with Gregg Brewster from MongoDB University.
The Sponsor Showcase Hall was filled most of the day with people learning about all aspects of the Node.js ecosystem. The MongoDB booth was busy handing out swag and answering questions about MongoDB Atlas, MongoDB Stitch, MongoDB Charts, along with many other subjects and topics.
Node+JS Interactive 2018 Sessions
The schedule of session talks brought a wide variety of topics and speakers to Vancouver. Irina Shestak from MongoDB gave a great talk on HTTP/2 walking through the connection process one frame at a time and giving special attention to how Node.js implements this protocol.
Jenna Zeigen’s talk From Parentheses to Perception: How Your Code Becomes Someone Else's Reality provided some wonderful information on the path from an idea in a developer’s mind, to pixels on the screen.
There were many other talks from great speakers such as Tierney Cyren from NodeSource, Joe Karlsson from Best Buy, and Adam Baldwin from npm, just to name a few.
Node+JS Interactive 2018 Venue
Node+JS Interactive was hosted by the Vancouver Convention Center - West. Located in the West End area of Vancouver, it overlooks Vancouver Harbor and sits adjacent to the Olympic Cauldron at Jack Poole Plaza.
Vancouver Harbour is not only a busy cargo port bringing in goods for Western Canada, but it is also a heavily trafficked float plane area with seaplanes taking off and landing throughout the day. It was quite a site to be in a conference center and looking out over the harbor’s spectacular scenery and seeing the seaplanes land, taxi, and take off in the crisp and clear fall air.
MongoDB’s BI Connector the Smart Connector for Business Intelligence
September 25, 2018
In today's world, data is being produced and stored all around us. Businesses leverage this data to provide insights into what users and devices are doing. MongoDB is a great way to store your data. From the flexible data model and dynamic schema, it allows for data to be stored in rich, multi-dimensional documents. But, most Business Intelligence tools, such as Tableau, Qlik, and Microsoft Excel, need things in a tabular format. This is where MongoDB's Connector for BI (BI Connector) shines.
MongoDB BI Connector
The BI Connector allows for the use of MongoDB as a data source for SQL based business intelligence and analytics platforms. These tools allow for the creation of dashboards and data visualization reports on your data. Leveraging them allows you to extract hidden insights in your data. This allows for more insights into how your customers are using your products.
The MongoDB Connector for BI is a tool for your data toolbox which acts as a translation layer between the database and the reporting tool. The BI Connector itself stores no data. It serves as a bridge between your MongoDB data and business intelligence tools.
The BI Connector bridges the tooling gap from local, on-premise, or hosted instances of MongoDB. If you are using MongoDB Atlas and are on an M10 or above cluster, there's an integrated built-in option.
Why Use The BI Connector
Without the BI Connector you often need to perform an Extract, Transform, and Load (ETL) process on your data. Moving it from the "source of truth" in your database to a data lake. With MongoDB and the BI Connector, this costly step can be avoided. Performing analysis on your most current data is possible. In real-time.
There are four components of a business intelligence system. The database itself, the BI Connector, an Open Database Connectivity (ODBC) data source name (DSN), and finally, the business intelligence tool itself. Let's take a look at how to connect all these pieces.
I'll be doing this example in Mac OS X, but other systems should be similar. Before I dive in, there are some system requirements you'll need:
- A MongoDB Atlas account
- Administrative access to your system
- ODBC Manager, and
- The MongoDB ODBC Driver for DSN
Instructions for loading the dataset used in the video in your Atlas cluster can be found here.
Feel free to leave a comment below if you have questions.
MongoDB On The Road - Seattle CodeCamp
September 20, 2018
Seattle CodeCamp was held in the Pigott Building on the beautiful Seattle University campus. With the scenic Puget Sound just a few blocks to the west down Madison St and Lake Washington to the east down Cherry St, Seattle CodeCamp was situated in a magnificent venue.
This year, on Saturday, September 15, 2018, 450 developers attended the event. The sponsorship hall had representatives from a few of the conference sponsors including GitHub, Flatiron School, and the College of Science and Engineering from Seattle University. There were plenty of stickers and sponsor information up for grabs along with some great representatives from the companies to talk with.
The conference sessions included over 65 sessions. One of the things I really enjoy about the CodeCamp events I’ve attended is the wide variety of speakers and session topics available. Everything from front-end to back-end topics is open game and available to learn.
And that’s just a small sample of the topics covered at this year’s Seattle Codecamp. I presented a talk on MongoDB & Node.js to a room of about 25 people. I brought with me a supply of MongoDB socks to give session attendees some swag which went over well. A large percentage of people in the room were unfamiliar with MongoDB in general and the MEAN/MERN stack specifically.
As a result, I tailored my talk to discuss the technologies themselves before showing how building an API is done with Node.js, Express.js, and MongoDB. I built an API that served up restaurants indexed by location. After building a functioning API I showed some of the features of MongoDB Compass to explore the data, perform CRUD operations, and leverage the geo-spatial data that was being stored inside MongoDB.
There were several MongoDB specific questions brought up during the session about some of the differences between the way legacy, relational databases store information and how a next generation database, such as MongoDB handles similar schema design and queries. It was a great discussion and provided a great opportunity to educate developers on the flexibility of MongoDB’s document model and the increase in development speed. You can find the project code on GitHub along with the talk slides here.
MongoDB is the easiest and fastest way to work with data. Download MongoDB Compass today and start making smarter decisions about document structure, querying, indexing, and more.
New to MongoDB Atlas — Data Explorer Now Available for All Cluster Sizes
At the recent MongoDB .local Chicago event, MongoDB CTO and Co-Founder, Eliot Horowitz made an exciting announcement about the Data Explorer feature of MongoDB Atlas. It is now available for all Atlas cluster sizes, including the free tier.
The easiest way to explore your data
What is the Data Explorer? This powerful feature allows you to query, explore, and take action on your data residing inside MongoDB Atlas (with full CRUD functionality) right from your web browser. Of course, we've thought about security; Data Explorer access and whether or not a user can modify documents is tied to her role within the Atlas Project. Actions performed via the Data Explorer are also logged in the Atlas alerting window.
Bringing this feature to the "shared" Atlas cluster sizes — the free M0s, M2s, and M5s — allows for even faster development. You can now perform actions on your data while developing your application, which is where these shared cluster sizes really shine.
Check out this short video to see the Data Explorer in action.