Slow query for $group, $lookup and $match

Hi Team,

We are facing a slow query performance issue and suspect it might be because our design approach was wrong in the first place, but we have had no other way to validate that. We also have close to a million records in total, so we would like to first try fixing this through query optimisation rather than updating the records. The situation is quite straightforward and is explained below:

We have an inventory collection with a location property whose value is a customer’s Unique ID / Code. We have created an index on the location and item / product ID columns. This collection holds close to half a million records.

{
    '_id' : ObjectId("635b92a0615739001cfc002d"),
    'location' : 'IND001',
    'item_id' : '10001',
    'quantity' : 20
}

We have our Customers collection that holds metadata about each customer, such as their name, address, area, region, zone, etc. We have created an index on the Code column. This collection holds close to 5-10k documents.

{
    'Code' : 'IND001',
    'Name' : 'Shreyas Gombi',
    'Address' : 'Bengaluru, Karnataka, India',
    'Area' : 'Bengaluru',
    'Region' : 'Karnataka',
    'Zone' : 'South'
}

We also have our product information in our Items collection with product categorisation data. We have created an index on the Code column in the Items collection. This collection holds another 60-70k documents.

{
    'Code' : '10001',
    'ProductName' : 'PRODUCT-PART-XA123456',
    'Category' : 'CATEGORY_1',
    'SubCategory' : 'SUB_CATEGORY_3',
    'Price' : '100',
    'Currency' : 'INR'
}

We are building a front-end interface that provides a view of our inventory, where users can also filter the data by the customer’s region / area / zone or by name. To achieve this we are using $group and $lookup aggregation stages (the pipeline is shown below). We have two APIs: one for displaying the inventory data, which uses pagination ($limit and $skip) and only pulls 60 records at a time, and another that returns a count ($count) to indicate the number of records expected for that filtered data. But the performance of the count API is very slow; results are returned only after 14-20 seconds.

The problem here is that our Area, Region and Zone data currently sits in the Customers collection, and the Category and SubCategory data of products sits in the Items collection, so both need to be pulled in with a $lookup. We also use a $group stage to group multiple lines of inventory data by location and item_id.

Here’s the aggregation:

db.collection('_inventory').aggregate([
    {
        $match : {'quantity' : {'$gt' : 0}}
    },
    {
        $group : {
            _id : {item_id : '$item_id', location : '$location'},
            location : {'$addToSet' : '$location'},
            quantity : {'$addToSet' : '$quantity'}
        }
    },
    {
        $lookup : {
            from : '_customers',
            localField : '_id.location',
            foreignField : 'Code',
            as : 'customer_details'
        }
    },
    {
        $match : {'$or' : [{'customer_details.Area' : 'Karnataka'}, {'customer_details.Area' : 'Maharashtra'}]}
    },
    {
        $lookup : {
            from : '_items',
            localField : '_id.item_id',
            foreignField : 'Code',
            as : 'item_details'
        }
    },
    {
        '$unwind' : {'path' : '$item_details', 'preserveNullAndEmptyArrays' : true}
    },
    {
        '$match' : search_query // ..... Search could be done by customer name or product name ...
    },
    {
        '$count' : 'count'
    }
], {'allowDiskUse' : true})

search_query: We build a search query against the lookup fields, such as item_details.ProductName or customer_details.Name. We also apply any filters selected on the front-end; for a filter by Area, for example, we add a condition such as {'customer_details.Area' : 'Karnataka'}.
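
To illustrate, a minimal sketch of how we assemble search_query on the Node.js side (the $regex-based matching and the variable names are illustrative, not our exact code):

// Illustrative sketch only: combine the free-text search against the
// $lookup aliases with any filters selected on the front-end.
const searchTerm = 'PRODUCT-PART';   // whatever the user typed
const search_query = {
    '$and' : [
        {
            '$or' : [
                {'customer_details.Name' : {'$regex' : searchTerm, '$options' : 'i'}},
                {'item_details.ProductName' : {'$regex' : searchTerm, '$options' : 'i'}}
            ]
        },
        {'customer_details.Area' : 'Karnataka'}   // example front-end filter
    ]
};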

We are faced with a few challenges here,

  1. Customers expect that users should be able to browse the inventory without having to apply any filters (we might be able to convince them otherwise, but that’s Plan B).
  2. Customers can search by Product Name / Customer Name, where our indexes cannot be used, which could slow down the overall query performance.

We have a few possible solutions in mind,

  1. Update all our inventory records and include some of the product’s and customer’s data, such as the Area, Region, Zone, Category and SubCategory columns, in the inventory data itself so that filtering can be done on the inventory collection with added indexes (a sketch of such a document is shown after this list). This of course requires us to update the entire inventory data and introduces redundancy.
  2. Introduce some filters by default: users would need to apply these filters before viewing the data, which reduces the scan window. As said earlier, we are not sure our clients will agree, but we are open to this option if it’s the only way to go.
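
As an illustration of option 1, a denormalised inventory document might look like the following (the extra fields are simply copied from the sample customer and item documents above; this is a sketch, not a final schema):

{
    '_id' : ObjectId("635b92a0615739001cfc002d"),
    'location' : 'IND001',
    'item_id' : '10001',
    'quantity' : 20,
    // copied from the customer document with Code 'IND001'
    'Area' : 'Bengaluru',
    'Region' : 'Karnataka',
    'Zone' : 'South',
    // copied from the item document with Code '10001'
    'Category' : 'CATEGORY_1',
    'SubCategory' : 'SUB_CATEGORY_3'
}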

I remember from one of our earlier sessions on MongoDB that redundancy complements MongoDB’s architecture and we should not treat it like an RDBMS. So we think this might be a design problem rather than a query issue. But we wanted to know if there’s any other suggestion, or a better way to achieve the results without having to update the records.

Hi @Shreyas_Gombi and welcome to the MongoDB community forums!

Firstly, thank you for the detailed post with all the relevant information.

Based on the sample schema provided, I created sample documents for all the collections and ran the above aggregation.
Based on my understanding, the first part of the query,

{
    $match : {'quantity' : {'$gt' : 0}}
},
{
    $group : {
        _id : {item_id : '$item_id', location : '$location'},
        location : {'$addToSet' : '$location'},
        quantity : {'$addToSet' : '$quantity'}
    }
},
{
    $lookup : {
        from : '_customers',
        localField : '_id.location',
        foreignField : 'Code',
        as : 'customer_details'
    }
}

could be converted into a materialised view, and the later stages of the aggregation pipeline could then run against that view.

The caveat with materialised views is that you need to keep their documents updated with every update to the main collection.
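
As an illustration, an on-demand materialised view could be refreshed by running those same stages with a $merge at the end (the _inventory_by_location collection name and the refresh schedule are placeholders you would choose yourself):

db.collection('_inventory').aggregate([
    { $match : {'quantity' : {'$gt' : 0}} },
    { $group : {
        _id : {item_id : '$item_id', location : '$location'},
        location : {'$addToSet' : '$location'},
        quantity : {'$addToSet' : '$quantity'}
    } },
    { $lookup : {
        from : '_customers',
        localField : '_id.location',
        foreignField : 'Code',
        as : 'customer_details'
    } },
    // write the result into a separate collection that acts as the
    // materialised view; documents with a matching _id are replaced on each refresh
    { $merge : { into : '_inventory_by_location', on : '_id',
                 whenMatched : 'replace', whenNotMatched : 'insert' } }
], {'allowDiskUse' : true})

The later $match, $lookup on _items, search and $count stages can then run against _inventory_by_location, which can carry its own indexes on fields such as customer_details.Area or customer_details.Name.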

In addition to the above, here are a few recommendations regarding the aggregation pipeline.

  1. The collections mentioned seem to be very normalised. Is this the design that was used in a previous application, or is this a new application?
  2. In MongoDB, denormalising is one possible way to help with performance issues. However, it depends; sometimes it just can’t be done. In such cases, consider the Extended Reference pattern.
  3. A $match stage used at the beginning of the aggregation pipeline can take advantage of indexes and improves query performance. A $match used in the middle of the pipeline, however, cannot benefit the stages that come before it.
  4. Finally, using $limit and $skip is not the recommended method for pagination. This is because if the query is e.g. skip 1000 and limit 10, it still needs to process 1010 results and throw away 1000 of them. There are other methods such as Paging with the Bucket Pattern - Part 1 | MongoDB Blog.
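
As an illustration of the last point, a simple range-based alternative keeps the last _id seen on the previous page and resumes after it instead of skipping (a sketch only; the bucket pattern in the linked post goes further):

// First page: sort by _id and take the page size, no skip needed
db.collection('_inventory')
    .find({'quantity' : {'$gt' : 0}})
    .sort({_id : 1})
    .limit(60)

// Subsequent pages: pass the last _id of the previous page, so the server
// seeks directly to the next range instead of walking and discarding documents
db.collection('_inventory')
    .find({'quantity' : {'$gt' : 0}, '_id' : {'$gt' : lastSeenId}})
    .sort({_id : 1})
    .limit(60)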

Please see the Building with Patterns: A Summary | MongoDB Blog post for further understanding.

Let us know if you have further questions.

Regards
Aasawari

Hi @Aasawari ,

Thanks for a detailed response. It’s very insightful and helped us open new doors.

1. The collections mentioned seem to be very normalised. Is this the design that was used in a previous application, or is this a new application?

This is how it was used in the previous application, and that’s why we are unable to modify the records currently. We are now looking into denormalising the data so that the filters can be run on the same collection.

We now understand that $skip is not recommended; we suspect this could also be one of the reasons why our exports are not working well. What would be your recommendation for exporting data with a NodeJS + AngularJS + MongoDB tech stack? We are currently facing some JS stack trace and memory issues when exporting the data in batches.

Regards,
Shreyas

Hi @Shreyas_Gombi

For the above requirement, you could possibly use mongoimport and mongoexport from the official database tools to perform the operations.
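
For example, a mongoexport invocation for the inventory collection could look like the following (the connection string, database name and output path are placeholders):

mongoexport --uri="mongodb://localhost:27017/yourDatabase" \
            --collection=_inventory \
            --query='{ "quantity": { "$gt": 0 } }' \
            --type=json \
            --out=inventory_export.json

mongoexport streams documents to the output file, so it avoids building the whole result set in your Node.js process’s memory.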

Furthermore, you mentioned encountering JS Stacktrace issues and memory limitations when exporting data in batches. I would appreciate it if you could provide further details about the specific issues you’re facing.

Additionally, please clarify the batch size you are working with for the data import and export processes. Understanding these details will allow me to provide you with more accurate assistance.

Regards
Aasawari