Joins and Other Aggregation Enhancements Coming in MongoDB 3.2 (Part 1 of 3) – Introduction

Andrew Morgan
October 30, 2015 | Updated: December 15, 2015

Series:

Part 1 – Introduction
Part 2 – Worked Examples
Part 3 – Adding Some Code Glue and Geolocation

This is the first of a three-part blog series looking at the aggregation enhancements being introduced in MongoDB 3.2 – most notably $lookup, which implements left-outer equi-joins in the MongoDB Aggregation Framework.

This post starts with an introduction to analyzing data with MongoDB. We then explain why joins are sometimes useful for MongoDB – in spite of the strengths of the document model – and how developers have been working without them.

The second post in the series works through examples of building aggregation pipelines – including using the operators added in MongoDB 3.2.

The third and final post shows how geolocation data can be included as well as what to do when you reach the limit of what can be done using a single pipeline – including adding wrapper code. That post also summarizes some of the limitations of the Aggregation Framework and reasons why you might supplement it with a full visualization solution such as Tableau together with MongoDB's Connector for BI (Business Intelligence) – also new in MongoDB 3.2.

Disclaimer

MongoDB's product plans are for informational purposes only. MongoDB's plans may change and you should not rely on them for delivery of a specific feature at a specific time.

Real-Time Analytics and Search

With the emergence of new data sources such as social media, mobile applications and sensor-equipped “Internet of Things” networks, organizations can extend analytics to deliver real-time insight and discovery into such areas as operational performance, customer satisfaction, and competitor behavior.

Time to value is everything. For example, having access to real-time customer sentiment or fleet tracking is of little benefit unless the data can also be analyzed and reported in real time.

MongoDB 3.2 aims to extend the options for performing analytics on the live, operational database – ensuring that answers are delivered quickly, and reflect current data. Work that would previously have needed to be done on the client side can now be performed by the database – freeing the developer to focus on new features.

The Case for Joins

MongoDB’s document data model is flexible and gives developers many options for modeling their data. Most of the time, all the data for a record is located in a single document. For the operational application, this approach makes accessing data simple, fast, and easy to scale.

When it comes to analytics and reporting, however, the data you need to access may span multiple collections. This is illustrated in Figure 1, where the _id field of multiple documents from the products collection is included in a document from the orders collection. For a query to analyze orders and details about their associated products, it must fetch the order document from the orders collection and then use the embedded references to read multiple documents from the products collection. Prior to MongoDB 3.2, this work had to be implemented in application code, which adds complexity to the application and requires multiple round trips to the database, which can impact performance.

Figure 1: Application-layer simulation of joins between documents
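The application-layer approach can be sketched in plain JavaScript. This is only an illustration: the arrays stand in for the orders and products collections, the document fields (customer, productIds, name, price) are assumptions, and findOrder/findProduct are hypothetical helpers that each represent one request to the database.

```javascript
// Hypothetical in-memory stand-ins for the orders and products collections.
const products = [
  { _id: 1, name: "widget", price: 5 },
  { _id: 2, name: "gadget", price: 12 },
  { _id: 3, name: "gizmo", price: 20 },
];
const orders = [{ _id: 100, customer: "alice", productIds: [1, 3] }];

let roundTrips = 0; // counts simulated requests to the database

// Each helper stands in for one query sent to the database.
function findOrder(orderId) {
  roundTrips += 1;
  return orders.find((o) => o._id === orderId) || null;
}

function findProduct(productId) {
  roundTrips += 1;
  return products.find((p) => p._id === productId) || null;
}

// Application-layer "join": read the order, then follow each reference.
function orderWithProducts(orderId) {
  const order = findOrder(orderId);
  if (order === null) return null;
  const productDocs = order.productIds
    .map((id) => findProduct(id))
    .filter((p) => p !== null);
  return { ...order, products: productDocs };
}

const joined = orderWithProducts(100);
console.log(joined.products.map((p) => p.name), roundTrips);
```

In this sketch, one order that references two products costs three simulated round trips; the count grows with the number of references, which is the overhead described above.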

MongoDB 3.2 introduces the $lookup operator, which can be included as a stage in an aggregation pipeline. With this approach, the work of combining data from the orders and products collections is implemented within the database, as part of a broader aggregation pipeline that performs other processing in a single query. As a result, there is less work to code in the application, and fewer round trips to the database. You can think of $lookup as equivalent to a left outer equi-join.

Aside - What is a Left Outer Equi-Join?

A left outer equi-join produces a result set that contains data for all documents from the left table (collection), together with data from the right table (collection) for those documents that match a document from the left table (collection). The “equi” part means that documents are matched by testing the join keys for equality. This is illustrated in Figure 2.

Figure 2: Left-outer join between collections
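To make the semantics concrete, here is a minimal plain-JavaScript sketch of a left outer equi-join over two arrays of documents. The function leftOuterEquiJoin, its parameters, and the sample documents are hypothetical; it is not how MongoDB implements $lookup, and it does not model how the database treats missing or null keys.

```javascript
// Plain-JavaScript sketch of left outer equi-join semantics.
// leftOuterEquiJoin and its parameter names are hypothetical.
function leftOuterEquiJoin(left, right, leftKey, rightKey, as) {
  // Index the right-hand documents by the value of their join key.
  const index = new Map();
  for (const doc of right) {
    const key = doc[rightKey];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  // Every left document is kept; its matches (if any) are attached as an array.
  return left.map((doc) => ({ ...doc, [as]: index.get(doc[leftKey]) || [] }));
}

const left = [
  { _id: 1, color: "blue" },
  { _id: 2, color: "red" },
];
const right = [
  { _id: "a", color: "blue", info: "match" },
  { _id: "b", color: "green", info: "never joined" },
];

const result = leftOuterEquiJoin(left, right, "color", "color", "matches");
console.log(JSON.stringify(result, null, 2));
```

Both left documents appear in the result: the blue one carries its matching right document, the red one carries an empty array, and the green right document is dropped because no left document refers to it.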

MongoDB's Aggregation Framework

The Aggregation Framework is a pipeline for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. The pipeline consists of stages; each stage transforms the documents as they pass through.

In general, each successive stage reduces the volume of data, removing information that isn't needed and combining other data to produce summarized results.

Figure 3 shows a conceptual model for the Aggregation Framework pipeline. This is what's happening at each stage:

On the left-hand side/start of the pipeline is the original collection contents – each record (document) containing a number of shapes (keys), each with a particular color (value)
The $match stage filters out any documents that don't contain a red diamond
The $project stage adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes
The $lookup stage (new in 3.2 - more details later) performs a left-outer join with another collection, with the star being the comparison key. This creates new documents which contain everything from the previous stage but augmented with data from any document from the second collection containing a matching colored star (i.e., the blue and yellow stars had matching “lookup” values, whereas the red star had none).
Finally, the $group stage groups the data by the color of the square and produces statistics (sum, average and standard deviation) for each group.

Figure 3: MongoDB Aggregation Framework pipeline
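The stages described above might be written as a pipeline along the following lines. This is a sketch under stated assumptions: the field names (diamond, snowflake, triangle, star, value), the collection names (shapes, otherShapes), and the way the "square" value is computed are all hypothetical, chosen only to mirror the figure.

```javascript
// Hypothetical field and collection names, chosen to mirror the figure.
const pipeline = [
  // Keep only documents that contain a red diamond.
  { $match: { diamond: "red" } },
  // Add a computed "square" key and keep the keys needed by later stages.
  {
    $project: {
      star: 1,
      value: 1,
      square: { $concat: ["$snowflake", "-", "$triangle"] },
    },
  },
  // Left outer join against a second collection, comparing the star keys.
  {
    $lookup: {
      from: "otherShapes",
      localField: "star",
      foreignField: "star",
      as: "matchedStars",
    },
  },
  // Group by the color of the square and compute statistics per group.
  {
    $group: {
      _id: "$square",
      total: { $sum: "$value" },
      average: { $avg: "$value" },
      stdDev: { $stdDevPop: "$value" },
    },
  },
];

// In the mongo shell this would be run as: db.shapes.aggregate(pipeline)
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
```

Defining the pipeline as an ordinary array makes the stage ordering explicit and lets the same definition be passed to aggregate() from the shell or from a driver.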

This is the full set of aggregation stages:

$match – Filter documents
$geoNear – Sort documents based on geographic proximity
$project – Reshape documents (remove or rename keys or add new data based on calculations on the existing data)
$lookup – Coming in 3.2 – Left-outer joins
$unwind – Expand documents (for example create multiple documents where each contains one element from an array from the original document)
$group – Summarize documents
$sample – Coming in 3.2 – Randomly select a subset of documents
$sort – Order documents
$skip – Jump over a number of documents
$limit – Limit number of documents
$redact – Restrict sensitive content from documents
$out – Store the results in a new collection (available since MongoDB 2.6)

The details can be found in the documentation.

New Aggregation Operators in MongoDB 3.2

Operators are used within each stage, and MongoDB 3.2 extends this set to include:

Array operations

$slice, $arrayElemAt, $concatArrays, $isArray, $filter, $min, $max, $avg and $sum (some of these were previously available in a $group stage but not in $project)

Standard Deviations

$stdDevSamp (based on a sample) and $stdDevPop (based on the complete population)

Square Root

$sqrt

Absolute (make +ve) value

$abs

Rounding numbers

$trunc, $ceil, $floor

Logarithms

$log, $log10, $ln

Raise to power

$pow

Natural Exponent

$exp

Further details on these new operators can be found in the MongoDB 3.2 Release Notes.
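The difference between the two standard deviation operators is easiest to see with a small worked calculation. The plain-JavaScript functions below are stand-ins that borrow the operator names for readability; they are an illustrative sketch of the two formulas and do not model how MongoDB handles non-numeric values or other edge cases.

```javascript
// Population standard deviation: divide the squared deviations by n.
function stdDevPop(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const squared = values.reduce((sum, v) => sum + (v - mean) ** 2, 0);
  return Math.sqrt(squared / values.length);
}

// Sample standard deviation: divide by n - 1 (needs at least two values).
function stdDevSamp(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const squared = values.reduce((sum, v) => sum + (v - mean) ** 2, 0);
  return Math.sqrt(squared / (values.length - 1));
}

// Mean is 5 and the squared deviations sum to 32.
const data = [2, 4, 4, 4, 5, 5, 7, 9];
console.log(stdDevPop(data)); // sqrt(32 / 8) = 2
console.log(stdDevSamp(data)); // sqrt(32 / 7), slightly larger
```

The sample version always gives the larger result for the same input, because it divides by n - 1 to compensate for the data being only a subset of the population.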

$lookup – Left Outer Equi-Joins

Figure 4 illustrates the syntax for performing the join:

leftCollection is the collection that the aggregation is being performed on and is the left collection in the join
from identifies the collection that it will be joined with – the right collection (rightCollection in this case)
localField specifies the key from the original/left collection – leftVal
foreignField specifies the key from the right collection – rightVal
as indicates that the data from the right collection should be embedded within the resulting documents as an array called embeddedData

Figure 4: $lookup – Left-Outer Joins for MongoDB
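Written out as text, the syntax described in the list above looks roughly like this. The collection and field names are the placeholders used in that list; wrapping the stage in a pipeline variable is just a convenience for this sketch.

```javascript
// A single $lookup stage using the placeholder names from the list above.
const pipeline = [
  {
    $lookup: {
      from: "rightCollection", // the right collection in the join
      localField: "leftVal", // key from the left (original) collection
      foreignField: "rightVal", // key from the right collection
      as: "embeddedData", // array in which matching documents are embedded
    },
  },
];

// In the mongo shell this would be run as: db.leftCollection.aggregate(pipeline)
console.log(JSON.stringify(pipeline[0].$lookup));
```

Each resulting document is a document from leftCollection with an added embeddedData array, which is empty when no document in rightCollection has a matching rightVal.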

In the follow-on posts in this series, you’ll see how the data from a home sales collection (containing details of each home sale, including the property’s postal code) is joined with data from a postal code collection (containing postal codes and their geographical location). This produces documents that contain the original home sale information augmented with the coordinates of the property. In this case, “homesales” is the left collection and “postcodes” is the right collection; the “postcode” field is used as both the localField (from “homesales”) and the foreignField (from “postcodes”).

Next Steps

To learn more about what's coming up in MongoDB 3.2, register for the What's new in MongoDB 3.2 webinar and review the MongoDB 3.2 development documentation.

To get the best understanding of the new features, you should experiment with the software, which is available in the MongoDB 3.2 release. It will be available in both the MongoDB Enterprise Advanced and Community Editions (GA coming soon).

The reason MongoDB releases development releases is to give the community a chance to try out the new software – and we hope that you'll give us feedback, whether it be by joining the MongoDB 3.2 bug hunt or commenting on this post.

To learn more about joins and other aggregation enhancements in MongoDB 3.2, watch Andrew's on-demand webinar.

About the Author - Andrew Morgan

Andrew is a Principal Product Marketing Manager working for MongoDB. He joined at the start of this summer from Oracle where he’d spent 6+ years in product management, focussed on High Availability. He can be contacted @andrewmorgan or through comments on his blog (clusterdb.com).
