Delivering a Near Real-Time Single View into Customers with a Federated Database
With MongoDB Data Federation, you can seamlessly query, transform, and aggregate your data from one or more locations, such as within a MongoDB database, AWS S3 buckets, and even HTTP API endpoints. In other words, with Data Federation, you can use the MongoDB Query API to work with your data even if it doesn't exist within MongoDB.
What's a scenario where this might make sense?
Let's say you're in the automotive or supply chain industries. You have customer data that might exist within MongoDB, but your parts vendors run their own businesses external to yours. However, there's a need to pair the parts data with transactions for any particular customer. In this scenario, you might want to be able to create queries or views that bring each of these pieces together.
In this tutorial, we're going to see how quick and easy it is to work with MongoDB Data Federation to create custom views that might aid your sales and marketing teams.
To be successful with this tutorial, you should have, or at least understand, the following:
- An external data source, accessible within an AWS S3 bucket or an HTTP endpoint.
- Node.js 18+.
While you could have data ready to go for this tutorial, we're going to assume you need a little bit of help. With Node.js, we can get a package that will allow us to generate fake data. This fake data will act as our customer data within MongoDB Atlas. The external data source will contain our vendor data, something we need to access, but ultimately don't own.
To get down to specifics, we'll be referencing Carvana data because it is available as a dataset on AWS. If you want to follow along exactly, load that dataset into your AWS S3 bucket. You can either expose the S3 bucket to the public or configure access specifically for MongoDB. For this example, we'll just expose the bucket to the public so we can use HTTP.
Since this example is supposed to get you started, much of the data isn't too important to us, but the theme is. The most important data to us will be the vehicle_id because it should be a unique representation for any particular vehicle. The vehicle_id will be how we connect a customer to a particular vehicle.
With the Carvana data in mind, we can continue towards generating fake customer data.
While we could connect the Carvana data to a MongoDB federated database and perform queries, the example isn't particularly exciting until we add a different data source.
If you don't already have it installed, execute the following from a command prompt:
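The original package isn't named here, but one commonly used option for this kind of task is the mgeneratejs npm package, which generates JSON documents from a template:

```shell
# Install the mgeneratejs fake-data generator globally (assumes Node.js 18+).
npm install -g mgeneratejs
```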
With the generator installed, we're going to need to draft a template of how the data should look. You can do this directly in the command line, but it might be easier just to create a shell file for it.
Create a generate_data.sh file and include the following:
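A sketch of what that shell file might contain, assuming the mgeneratejs package. The field names and the sandboxed value lists are illustrative, not a required schema:

```shell
#!/bin/bash
# Generate 50 fake customer documents from a template (field names are illustrative).
# The vehicle_id values below are placeholders; swap in real ids from the Carvana data.
mgeneratejs '{
    "_id": "$oid",
    "name": "$name",
    "city": { "$choose": { "from": ["Tracy", "Palo Alto", "San Francisco"] } },
    "transaction_history": {
        "$pickset": {
            "from": ["PENDING", "COMPLETE", "CANCELLED"],
            "quantity": { "$integer": { "min": 1, "max": 3 } }
        }
    },
    "vehicle_id": {
        "$pickset": {
            "from": ["2270123", "2298228", "2463331"],
            "quantity": { "$integer": { "min": 1, "max": 3 } }
        }
    }
}' -n 50 > customers.json
```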
So what's happening in the above template?
It might be easier to have a look at a completed document based on the above template:
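A generated document might look something like this, with all values randomized per document:

```json
{
    "_id": { "$oid": "64a1b2c3d4e5f60718293a4b" },
    "name": "Jane Doe",
    "city": "Tracy",
    "transaction_history": ["COMPLETE"],
    "vehicle_id": ["2270123", "2298228"]
}
```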
The script will create 50 documents. Many of the fields will be randomly generated, with the exception of the `transaction_history` fields. While these fields will be somewhat random, we're sandboxing them to a particular set of options.
Customers need to be linked to actual vehicles found in the Carvana data. The script adds one to three actual `vehicle_id` values to each document. To narrow the scope, we'll imagine that the customers are locked to certain regions.
Import the output into MongoDB. You might consider creating a carvana database and a customers collection within MongoDB for this data to live.
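One way to handle the import, assuming the generated customers.json file and a placeholder connection string, is with the mongoimport tool:

```shell
# Import the generated documents into a "customers" collection in a "carvana" database.
# Replace the URI with your own Atlas connection string.
mongoimport --uri "mongodb+srv://user:password@cluster0.example.mongodb.net/carvana" \
    --collection customers \
    --file customers.json
```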
It's time for the fun part! We need to create a federated database to combine both customer data that already lives within MongoDB and the Carvana data that lives on AWS S3.
Within MongoDB Atlas, click the Data Federation Tab.
Click "Set up manually" in the "Create New Federated Database" dropdown in the top-right corner of the UI.
Then, add your data sources. Whether the Carvana data source comes directly from an AWS S3 integration or a public HTTP endpoint, it is up to you. The end result will be the same.
With the data sources available, create a database within your federated instance. Since the theme of this example is Carvana, it might make sense to create a carvana database and give each data source a proper collection name. The data living on AWS S3 might be called sales or transactions and the customer data might have a customers name.
What you name everything is up to you. When connecting to this federated instance, you'll only ever see the federated database name and federated collection names. Looking in, you won't notice any difference from connecting to any other MongoDB instance.
You can connect to your federated instance using the connection string it provides. It will look similar to a standard MongoDB Atlas connection string.
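For reference, a federated instance connection string generally points at a query.mongodb.net host rather than your cluster's host. The hostname below is a placeholder; use the string Atlas gives you:

```
mongodb://<username>:<password>@federateddatabaseinstance0-xxxxx.a.query.mongodb.net/?ssl=true&authSource=admin
```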
Having all the data sources accessible from one location with Data Federation is great, but we can do better by providing users a single view that might make sense for their reporting needs.
A little imagination will need to be used for this example, but let's say we want a report that shows the number of each vehicle type sold in every city. For this, we're going to need data from both the customers collection as well as the carvana collection.
Let's take a look at the following aggregation pipeline:
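A sketch of such a pipeline follows. The collection and field names ("sales", "vehicle_id", "city", "vehicles.make") are assumptions based on the naming used in this tutorial; adjust them to match your own schema:

```javascript
// Run against the "customers" collection of the federated "carvana" database.
const pipeline = [
    {
        // Join each customer's vehicle_id values to full documents in the
        // federated "sales" collection (the Carvana data living on S3).
        $lookup: {
            from: "sales",
            localField: "vehicle_id",
            foreignField: "vehicle_id",
            as: "vehicles"
        }
    },
    {
        // Flatten the joined array so each result pairs one customer with one vehicle.
        $unwind: "$vehicles"
    },
    {
        // Count how many transactions occurred per city and vehicle type.
        $group: {
            _id: { city: "$city", vehicle: "$vehicles.make" },
            count: { $sum: 1 }
        }
    },
    {
        // Reshape the grouped data into friendlier top-level fields.
        $project: {
            _id: 0,
            city: "$_id.city",
            vehicle: "$_id.vehicle",
            count: 1
        }
    }
];
```

In mongosh, this could be run with `db.customers.aggregate(pipeline)`.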
There are four stages in the above pipeline.
In the first stage, we want to expand the vehicle id values that are found in customers documents. Reference values are not particularly useful to us standalone, so we do a join operation between collections using the `$lookup` operator. This leaves us with all the details for every vehicle alongside the customer information.
The next stage flattens the array of vehicle information using the `$unwind` operation. By the end of this, all results are flat and we're no longer working with arrays.
In the third stage we group the data. In this example, we are grouping the data based on the city and vehicle type and counting how many of those transactions occurred. By the end of this stage, the results might look like the following:
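For example, a grouped result might look like this (the city and make values are illustrative):

```json
{ "_id": { "city": "Tracy", "vehicle": "Toyota" }, "count": 4 }
```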
In the final stage, we format the data into something a little more attractive using a `$project` operation. This leaves us with data that looks like the following:
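Something along these lines, with the grouping fields promoted to the top level:

```json
{ "city": "Tracy", "vehicle": "Toyota", "count": 4 }
```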
The data can be manipulated any way we want, but for someone running a report of what city sells the most of a certain type of vehicle, this might be useful.
The aggregation pipeline above can be used in MongoDB Compass and would be nearly identical with several of the MongoDB drivers, such as Node.js and Python. To get an idea of what it would look like in another language, here is an example in Java:
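A sketch using the MongoDB Java driver's aggregation builders. The connection string is a placeholder, and the collection and field names are assumptions mirroring the pipeline described above:

```java
import static com.mongodb.client.model.Aggregates.*;
import static com.mongodb.client.model.Accumulators.sum;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;

public class CarvanaReport {
    public static void main(String[] args) {
        // Replace with the connection string of your federated instance.
        try (var client = MongoClients.create("<FEDERATED_INSTANCE_CONNECTION_STRING>")) {
            MongoCollection<Document> customers =
                client.getDatabase("carvana").getCollection("customers");

            customers.aggregate(Arrays.asList(
                // Join customers to the federated "sales" collection on vehicle_id.
                lookup("sales", "vehicle_id", "vehicle_id", "vehicles"),
                // Flatten the joined vehicle array.
                unwind("$vehicles"),
                // Count transactions per city and vehicle type.
                group(new Document("city", "$city").append("vehicle", "$vehicles.make"),
                    sum("count", 1)),
                // Reshape into friendlier top-level fields.
                project(new Document("_id", 0)
                    .append("city", "$_id.city")
                    .append("vehicle", "$_id.vehicle")
                    .append("count", 1))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```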
When using MongoDB Compass, aggregation pipelines can be output automatically to any supported driver language you want.
The person generating the report probably won't want to deal with aggregation pipelines or application code. Instead, they'll want to look at a view that is always up to date in near real-time.
Within the MongoDB Atlas dashboard, go back to the configuration area for your federated instance. You'll want to create a view, similar to how you created a federated database and federated collection.
Give the view a name and paste the aggregation pipeline into the box when prompted.
Refresh MongoDB Compass, or whatever tool you're using, and you should see the view. When you load the view, it shows your data as if you had run the pipeline, only this time without running anything yourself.
In other words, you interact with the view like you would any other collection, with no queries or aggregations to constantly run or keep track of.
The view is automatically kept up to date behind the scenes using the pipeline you used to create it.
With MongoDB Data Federation, you can combine data from numerous data sources and interact with it using standard MongoDB queries and aggregation pipelines. This allows you to create views and run reports in near real-time, regardless of where your data lives.