Dear MongoDB Developer Community,
I’d like to ask for advice on how to get the most performance out of the Java sync driver for MongoDB. I’m not talking about having proper indexes on the DB itself - that’s already handled. My use case is fairly simple: I’ve got a lot of documents (ca. 200k), I filter them and return them as-is - no object mapping involved, only projected JSON. My current approach is:
List<Document> offers = new ArrayList<>();
// try-with-resources so the cursor is always closed, even if iteration fails
try (MongoCursor<Document> cursor = offersCollection
        .find(filters)
        .projection(SEARCH_PROJECTION)
        .sort(sort)
        .iterator()) {
    cursor.forEachRemaining(offers::add);
}
return offers;
Is iterator + ‘forEachRemaining’ the best approach here? I imagined something like a “bulk read”, but I couldn’t find a method for that.
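For context, this is the alternative I’ve been considering (a sketch only, with made-up method and variable names; the batch size of 1000 is just a guess, not a tuned value): `into()` drains and closes the cursor internally, and `batchSize()` controls how many documents each `getMore` round trip fetches.

```java
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.List;

class OfferSearchSketch {
    static List<Document> findOffers(MongoCollection<Document> offers,
                                     Bson filters, Bson projection, Bson sort) {
        // into() iterates the whole result set and closes the cursor for us,
        // so no explicit try-with-resources is needed here.
        return offers.find(filters)
                .projection(projection)
                .sort(sort)
                .batchSize(1000) // fewer getMore round trips for large result sets
                .into(new ArrayList<>());
    }
}
```

Is that effectively the same as the iterator loop, or does one of them have an edge for a result set of this size?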
The second question (it’s on the verge of data modeling and the Java driver) is how to arrange my filters to get the most performance out of the DB itself. I know the ESR rule, and I’ve created a SearchRequest model that lets me optimize the filters (i.e. if a request contains a one-element array, it’s modeled as an EqualityFilter, and as an InRangeFilter otherwise), which leads to filters like these (this is the BSON passed as ‘filters’ to the find method above):
filters (ranges): And Filter{filters=[Filter{fieldName='gearbox', value=AT}, Filter{fieldName='registered', value=true}, Filter{fieldName='engineCapacityRange', value=1500_2000}, Operator Filter{fieldName='price', operator='$gte', value=40000}, Operator Filter{fieldName='price', operator='$lte', value=60000}, Operator Filter{fieldName='year', operator='$gte', value=2015}, Operator Filter{fieldName='make', operator='$in', value=[Volkswagen, Toyota]}, Operator Filter{fieldName='bodyType', operator='$in', value=[SUV, SEDAN, COMBI]}, Operator Filter{fieldName='fuelType', operator='$in', value=[PETROL, HYBRID]}]}
filters (one element ranges transformed into equality filters): And Filter{filters=[Filter{fieldName='make', value=Toyota}, Filter{fieldName='bodyType', value=SUV}, Filter{fieldName='fuelType', value=HYBRID}, Filter{fieldName='gearbox', value=AT}, Filter{fieldName='registered', value=true}, Filter{fieldName='engineCapacityRange', value=1500_2000}, Operator Filter{fieldName='price', operator='$gte', value=40000}, Operator Filter{fieldName='price', operator='$lte', value=100000}, Operator Filter{fieldName='year', operator='$gte', value=2015}]}
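The transformation itself is roughly this (a simplified, driver-free sketch for illustration; the real code builds Bson via Filters.eq / Filters.in, and the class and method names here are made up):

```java
import java.util.List;
import java.util.Map;

class FilterShape {
    // One-element list collapses to an equality filter;
    // anything longer stays an $in filter.
    static Map<String, Object> toFilter(String field, List<?> values) {
        return values.size() == 1
                ? Map.of(field, values.get(0))
                : Map.of(field, Map.of("$in", values));
    }
}
```

So `toFilter("make", List.of("Toyota"))` yields an equality filter on make, while a two-element list keeps the $in shape.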
As you can see, the ‘in range’ filters are moved to the last positions of the filter list so the DB can use the indexes optimally. I’ve got about 12 indexes, starting from:
db.offers.createIndex({ make: 1, model: 1, generation: 1, bodyType: 1, fuelType: 1, price: -1, publishedDate: 1 })
and ending with:
db.offers.createIndex({ make: 1, model: 1, generation: 1, bodyType: 1, fuelType: 1, gearboxType: 1, registered: 1, damaged: 1, mileageRange: 1, engineCapacityRange: 1, price: -1, publishedDate: 1, year: 1 })
and variations between them.
I’m not sure whether the way I arrange the filter order for ‘in range’ filters is right. For example, make is the first field in the index and works superbly for ‘eq’, but is it fine to move it to the end of the filters when it’s ‘in’? I think so - then it should still comply with the ESR rule - but maybe I’m wrong, and perhaps you’ve got other ideas or see improvements here?
Currently, I’m getting end-to-end times (request to response in Postman) under 1 s for a remote DB (Atlas M0 free tier) filled with ca. 140k documents for the likely query scenarios. What made me wonder, though, is the small difference between limiting to 50 docs and returning all the docs that match the filters (ca. 750 docs): it’s roughly 500 ms vs. roughly 800 ms.
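I haven’t dug into this yet, but I assume comparing explain output with and without the limit would show where that difference comes from. A sketch of what I have in mind (driver 4.2+ exposes explain() on FindIterable; the class and method names here are made up):

```java
import com.mongodb.ExplainVerbosity;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.conversions.Bson;

class ExplainSketch {
    // Compare executionStats (totalKeysExamined, totalDocsExamined,
    // executionTimeMillis) for the limited vs. unlimited query.
    static Document explainLimited(MongoCollection<Document> offers,
                                   Bson filters, Bson projection, Bson sort) {
        return offers.find(filters)
                .projection(projection)
                .sort(sort)
                .limit(50)
                .explain(ExplainVerbosity.EXECUTION_STATS);
    }
}
```

Does that sound like the right way to investigate it, or is the gap simply network transfer of the extra ~700 documents?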
If you see any other potential performance improvements beyond the questions I’ve stated, please share your thoughts with me!
Best regards,
Przemek, Poland