How to Use the Union All Aggregation Pipeline Stage in MongoDB 4.4
Rate this tutorial
Here's how you'd use it:
Simplified syntax, with no additional processing on the specified collection
Extended syntax, using optional pipeline field
⚠ If you use the pipeline field to process your collection before combining, keep in mind that stages that write data, like
$merge, can't be used!
Your resulting documents will merge your current collection's (or pipeline's) stream of documents with the documents from the collection/pipeline you specify. Keep in mind that this can include duplicates!
If you've used the
UNION ALLoperation in SQL before, the
$unionWithstage's functionality may sound familiar to you, and you wouldn't be wrong! Both combine the result sets from multiple queries and return the merged rows, some of which may be duplicates. However, that's where the similarities end. Unlike MongoDB's
$unionWithstage, you have to follow
in order to run a valid
UNION ALLoperation in SQL:
- Make sure your two queries have the same number of columns
- Make sure the order of columns are the same
- Make sure the matching columns are compatible data types.
It'd look something like this in SQL:
$unionWithstage in MongoDB, you don't have to worry about these stringent constraints.
The most convenient difference between the
$unionWithstage and other UNION operations is that there's no matching schema restriction. This flexible schema support means you can combine documents that may not have the same type or number of fields. This is common in certain scenarios, where the data we need to use comes from different sources:
- TimeSeries data that's stored by month/quarter/some other unit of time
- IoT device data, per fleet or version
- Archival and Recent data, stored in a Data Lake
- Regional data
$unionWithstage, combining these data sources is possible.
Next, based on your situation, you may already have a few prerequisites setup or need to start from scratch. Either way, choose your scenario to configure the things you need so that you can follow the rest of this tutorial!
Choose your scenario:
I don't have an Atlas cluster set up yet:
- (no credit card needed!). Be sure to select MongoDB 4.4 (may be Beta, which is OK) as your version in Additional Settings!💡 If you don't see the prompt to create a cluster: You may be prompted to create a project first before you see the prompt to create your first cluster. In this case, go ahead and create a project first (leaving all the default settings). Then continue with the instructions to deploy your first free cluster!
- Continue with the steps in Connecting to your cluster.
I have an Atlas cluster set up:
Great! You can skip ahead to Connecting to your cluster.
Connecting to your cluster
To connect to your cluster, we'll use the MongoDB for Visual Studio Code extension (VS Code for short 😊). You can view your data directly, interact with your collections, and much more with this helpful extension! Using this also consolidates our workspace into a single window, removing the need for us to jump back and forth between our code and MongoDB Atlas!
- To connect to your cluster, you'll need a connection string. You can get this connection string from your cluster connection settings. Go to your cluster and select the "Connect" option:
- Skip to the second step and copy the connection string (don't worry about the other settings, you won't need them):
- Switch back to VS Code. Press
P(on Windows) or
P(on Mac) to bring up the command palette. This shows a list of all VS Code commands.
- Start typing "MongoDB" until you see the MongoDB extension's list of available commands. Select the "MongoDB: Connect with Connection String" option.
- Paste in your copied connection string. 💡 Don't forget! You have to replace the placeholder password with your actual password!
- Press enter to connect! You'll know the connection was successful if you see a confirmation message on the bottom right. You'll also see your cluster listed when you expand the MongoDB extension pane.
With the MongoDB extension installed and your cluster connected, you can now use MongoDB Playgrounds to test out the
$unionWithexamples! MongoDB Playgrounds give us a nice sandbox to easily write and test Mongo queries. I love using it when prototying or trying something new because it has query auto-completion and syntax highlighting, something that you don't get in most terminals.
Let's finally dive into some examples!
💡 If you create your own playground, remember to change the database name and delete the default template's code first!
Right at the top, specify the database you'll be using. In this example, I'm using a database also called
💡 I haven't actually created a database called
union-walkthroughin Atlas yet, but that's no problem! When the playground runs, it will see that it does not yet exist and create a database of the specified name!
Next, we need data! Particularly about some planets. And particularly about planets in a certain movie series. 😉
Any planets that appear in at least 2 or more films are considered popular. Otherwise, we'll add them into the
This separation is indicative of how our data may be grouped. Despite the separation, we can use the
$unionWithstage to combine these two collections if we ever needed to analyze them as a single result set!
Let's say that we needed to find out the total population of planets, grouped by climate. Additionally, we'd like to leave out any planets that don't have population data from our calculation. We can do this using an aggregation:
If you've followed along in your own MongoDB playground and have copied the code so far, try running the aggregation!
And if you're using the provided MongoDB playground I created, highlight lines 264 - 290 and then run the selected code.
💡 You'll notice in the code snippet above that I've added another
use('union-walkthrough');method right above the aggregation code. I do this to make the selection of relevant code within the playground easier. It's also required so that the aggregation code can run against the correct database. However, the same thing can be achieved by selecting multiple lines, namely the original
use('union-walkthrough')line at the top and whatever additional example you'd like to run!
You should see the results like so:
Unsurprisingly, planets with "temperate" climates seem to have more inhabitants. Something about that cool 75 F / 23.8 C, I guess 🌞
Let's break down this aggregation:
The next object (and next stage) in our aggregation is our
$unionWithstage. Here, we specifiy what collection we'd like to perform a union with (including any duplicates). We also make use of the pipeline field to similarly filter out any documents in our
popular_planetscollection that have an unknown population:
Since we want to know the total population per climate type, we first specify
_idto be the
$climatefield from our combined result set. Then, we calculate a new field called
totalPopulationby using a
operator to add each matching document's population values together. You'll also notice that based on the data we have, we needed to use a
operator to first convert our
$populationfield into a calculable value!
Now, if you don't need to run some additional processing on the collection you're combining with, you don't have to! The
pipelinefield is optional and is only there if you need it.
So, if you just need to work with the planet data as a unified set, you can do that too:
Copy this aggregation into your own playground and run it! Alternatively, select and run lines 293 - 297 if using the provided MongoDB playground!
Tada! Now you can use this unified dataset for analysis or further processing.
Combining the same schemas is great, but we can do that in regular SQL too! The real convenience of the
$unionWithpipeline stage is that it can also combine collections with different schemas. Let's take a look!
As before, we'll specifiy the database we want to use:
This time, we'll use some acquired information about certain starships and vehicles that are used in this same movie series. Let's add them to their respective collections:
You may be thinking (as I first did), what's the difference between starships and vehicles? You'll be pleased to know that starships are defined as any "single transport craft that has hyperdrive capability". Any other single transport craft that does not have hyperdrive capability is considered a vehicle. The more you know! 😮
If you look at the two collections, you'll see that they have two key differences:
max_atmosphering_speedfield is present in both collections, but is a
starshipscollection and an
starshipscollection has two fields (
MGLT) that are not present in the
vehiclescollection, as it only relates to starships.
But you know what? That's not a problem for the
$unionWithstage! You can combine them just as before:
Try running the aggregation in your playground! Or if you're following along in the MongoDB playground I've provided, select and run lines 185 - 189! You should get the following combined result set as your output:
Can you imagine doing that in SQL? Hint: You can't! That kind of schema restriction is something you don't need to worry about with MongoDB, though!
So we can combine different schemas no problem. What if we need to do a little extra work on our collection before combining it? That's where the
pipelinefield comes in!
Let's say that there's some classified information in our data about the vehicles. Namely, any vehicles manufactured by Kuat Drive Yards (AKA a division of the Imperial Department of Military Research).
By direct orders, you are instructed not to give out this information under any circumstances. In fact, you need to intercept any requests for vehicle information and remove these classified vehicles from the list!
We can do that like so:
In this example, we're combining the
vehiclescollections as before, using the
$unionWithpipeline stage. We also process the
vehicledata a bit more, using the
In our case, we are evaluating whether or not the
manufacturerfield holds a value of "Kuat Drive Yards, Imperial Department of Military Research". If it does (uh oh, that's classified!), we use a system variable called
, which lets us exclude all fields at the current document/embedded document level. If it doesn't, we use another system variable called
, which will return all fields at the current document level, except for any embedded documents.
This works perfectly for our use case. Try running the aggregation (lines 192 - 211, if using the provided MongoDB Playground). You should see a combined result set, minus any Imperial manufactured vehicles:
We did our part to restrict classified information! 🎶 Hums Imperial March 🎶
Now that we know how the
$unionWithstage works, it's important to discuss its limits and restrictions.
We've mentioned it already, but it's important to reiterate: using the
$unionWithstage will give you a combined result set which may include duplicates! This is equivalent to how the
UNION ALLoperator works in
SQLas well. As a workaround, using a
$groupstage at the end of your pipeline to remove duplicates is advised, but only when possible and if the resulting data does not get inaccurately skewed.
There are plans to add similar fuctionality to
UNION(which combines result sets but removes duplicates), but that may be in a future release.
questionable_planets(located within the
$unionWithstage) cannot be sharded. This is enforced to prevent a significant decrease in performance due to the shuffling of data around the cluster as it determines the best execution plan.
Aggregation pipelines can't use the
$unionWithstage inside transactions because a rare, but possible 3-thread deadlock can occur in very niche scenarios. Additionally, in MongoDB 4.4, there is a first-time definition of a view that would restrict its reading from within a transaction.
stages cannot be used in a
$unionWithpipeline. Since both
$mergeare stages that write data to a collection, they need to be the last stage in a pipeline. This conflicts with the usage of the
$unionWithstage as it outputs its combined result set onto the next stage, which can be used at any point in an aggregation pipeline.
However, if your aggregation doesn't include a collation, it will use the collation for the top-level collection/view on which the aggregation is run:
- If the
$unionWithcoll is a collection, its collation is ignored.
- If the
$unionWithcoll is a view, then its collation must match that of the top-level collection/view. Otherwise, the operation errors.
We've discussed what the
$unionWithpipeline stage is and how you can use it in your aggregations to combine data from multiple collections. Though similar to SQL's
UNION ALLoperation, MongoDB's
$unionWithstage distinguishes itself through some convenient and much-needed characteristics. Most notable is the ability to combine collections with different schemas! And as a much needed improvement, using a
$unionWithstage eliminates the need to write additional code, code that was required because we had no other way to combine our data!