Hi Team,
We are facing a slow query and suspect that our design approach may have been wrong in the first place, but we have no other way to validate that. We also have close to a million records, so we would like to first try fixing this through query optimisation rather than updating the records. The situation is fairly straightforward and is explained below:
We have an inventory collection with a location property whose value is a customer's unique ID / code. We have created a compound index on the location and item / product ID fields. This collection holds close to half a million documents.
{
  '_id' : ObjectId("635b92a0615739001cfc002d"),
  'location' : 'IND001',
  'item_id' : '10001',
  'quantity' : 20
}
We also have a Customers collection that holds metadata about each customer, such as name, address, area, region and zone. We have created an index on the Code field. This collection holds close to 5-10k documents.
{
  'Code' : 'IND001',
  'Name' : 'Shreyas Gombi',
  'Address' : 'Bengaluru, Karnataka, India',
  'Area' : 'Bengaluru',
  'Region' : 'Karnataka',
  'Zone' : 'South'
}
We also have our product information in an Items collection, including product categorisation data. We have created an index on the Code field in the Items collection. This collection holds another 60-70k documents.
{
  'Code' : '10001',
  'ProductName' : 'PRODUCT-PART-XA123456',
  'Category' : 'CATEGORY_1',
  'SubCategory' : 'SUB_CATEGORY_3',
  'Price' : '100',
  'Currency' : 'INR'
}
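For reference, the indexes described above would look something like the following in the mongo shell (collection and field names assumed from the samples above; adjust to your actual schema):

```javascript
// Assumed index definitions for the three collections described above
db._inventory.createIndex({ location: 1, item_id: 1 }); // compound index
db._customers.createIndex({ Code: 1 });
db._items.createIndex({ Code: 1 });
```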
We are building a front-end interface that provides a view of our inventory and lets users filter data by customer region / area / zone or by name. To achieve this we use $group and $lookup aggregation stages (the pipeline is shown below). We have two APIs: one displays the inventory with pagination ($limit and $skip, pulling only 60 records at a time), and the other returns a count ($count) indicating the number of records expected for the filtered data. The performance of the count API is very slow: results are returned after 14-20 seconds.
The problem here is that our Area, Region and Zone data currently sits in the Customers collection, and the Category and SubCategory data for products sits in the Items collection, so both need to be pulled in with a $lookup. We also use a $group stage to collapse multiple lines of inventory data by location and item_id.
Here's the aggregation:
db.collection('_inventory').aggregate([
  {
    $match : { 'quantity' : { '$gt' : 0 } }
  },
  {
    $group : {
      _id : { item_id : '$item_id', location : '$location' },
      location : { '$addToSet' : '$location' },
      quantity : { '$addToSet' : '$quantity' }
    }
  },
  {
    $lookup : {
      from : '_customers',
      localField : '_id.location',
      foreignField : 'Code',
      as : 'customer_details'
    }
  },
  {
    $match : { '$or' : [ { 'customer_details.Area' : 'Karnataka' }, { 'customer_details.Area' : 'Maharashtra' } ] }
  },
  {
    $lookup : {
      from : '_items',
      localField : '_id.item_id',
      foreignField : 'Code',
      as : 'item_details'
    }
  },
  {
    '$unwind' : { 'path' : '$item_details', 'preserveNullAndEmptyArrays' : true }
  },
  {
    '$match' : search_query // search could be by customer name or product name, see below
  },
  {
    '$count' : 'count'
  }
], { 'allowDiskUse' : true })
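One optimisation worth trying before any data migration: because $group and $lookup run before any selective filter, the pipeline scans every in-stock inventory row. Since the Customers collection is small (5-10k), the region / area / zone filter can be resolved first, and the resulting codes fed into an indexed $match on inventory. A minimal sketch, assuming the collection and field names above:

```javascript
// Sketch: resolve the customer filter first (small collection), then
// match inventory by the resulting codes so the { location: 1, item_id: 1 }
// index can be used before $group. Names assumed from the post above.
var codes = db._customers
  .find({ Area: { $in: ['Karnataka', 'Maharashtra'] } }, { Code: 1 })
  .toArray()
  .map(function (c) { return c.Code; });

db._inventory.aggregate([
  // Indexed $match first: the planner can use the compound index here.
  { $match: { location: { $in: codes }, quantity: { $gt: 0 } } },
  { $group: { _id: { item_id: '$item_id', location: '$location' } } },
  { $count: 'count' }
]);
```

This keeps both $lookup stages out of the count path entirely when the search is only on customer attributes; a search on product name would still need the Items lookup (or the denormalisation discussed below).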
search_query: we build a search query against the lookup fields, e.g. item_details.ProductName or customer_details.Name. We also apply the filters selected on the front-end; for a filter by Area, for example, we add {'customer_details.Area' : 'Karnataka'} here.
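For clarity, this is roughly how search_query is assembled (variable names here are illustrative, not from our actual code):

```javascript
// Illustrative assembly of search_query from front-end inputs.
// searchText and areaFilter are hypothetical request parameters.
var search_query = {};
if (searchText) {
  search_query.$or = [
    { 'item_details.ProductName': { $regex: searchText, $options: 'i' } },
    { 'customer_details.Name':    { $regex: searchText, $options: 'i' } }
  ];
}
if (areaFilter) {
  search_query['customer_details.Area'] = areaFilter;
}
```

Note that a case-insensitive $regex on a computed lookup field can never use an index, which is part of why the count is slow.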
We are faced with a few challenges here:
- Customers expect users to be able to browse the inventory without applying any filters (we might be able to convince them otherwise, but that's Plan B).
- Users can search by product name / customer name, so the indexes do not get applied and the overall query performance suffers.
We have a few possible solutions in mind:
- Update all our inventory records to include some of the product and customer data, such as Area, Region, Zone, Category and SubCategory fields, so that filtering can be done on the inventory data itself with added indexes. This will of course require us to update the entire inventory collection, and the data will be redundant.
- Introduce some default filters that users must apply before proceeding to view the data, so that the scan window is reduced. As said earlier, we are not sure our clients will agree, but we are open to considering this option if that's the only way to go.
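If we did go the denormalisation route, the one-off migration could be done server-side with $lookup + $merge (MongoDB 4.2+) rather than updating documents one by one. A sketch, assuming the names above:

```javascript
// Sketch: copy Area/Region/Zone from _customers into each _inventory
// document in a single server-side pass. One-off migration; the copied
// fields must be kept in sync whenever a customer record changes.
db._inventory.aggregate([
  { $lookup: {
      from: '_customers',
      localField: 'location',
      foreignField: 'Code',
      as: 'c'
  } },
  { $unwind: '$c' },
  { $set: { Area: '$c.Area', Region: '$c.Region', Zone: '$c.Zone' } },
  { $unset: 'c' },
  // Write the enriched documents back onto themselves.
  { $merge: { into: '_inventory', on: '_id', whenMatched: 'merge' } }
]);
```

The same pattern would embed Category / SubCategory from _items; after that, a filter + count runs against a single indexed collection.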
I remember from one of our earlier sessions on MongoDB that redundancy complements MongoDB's architecture and that we should not treat it like an RDBMS. So we think this might be a design problem rather than a query issue. But we wanted to know if there is any other suggestion, or a better way to achieve these results without having to update the records.