Best performance for java sync for getting pure Documents

przemek · July 26, 2023, 1:50pm

Dear MongoDB Developer Community,

I’d like to ask you for an advice on how to get the most (in terms of performance) out of the java sync driver for MongoDB. I’m not talking about having proper indexes for the DB itself - it’s already handled. My use-case is fairly simple: I’ve got a lot of Documents (c.a. 200k), I filter them and return as-is - no object mapping included, only projected JSON. My current approach is:

            List<Document> offers = new ArrayList<>();
            offersCollection
                    .find(filters)
                    .projection(SEARCH_PROJECTION)
                    .sort(sort)
                    .iterator()
                    .forEachRemaining(offers::add);
            return offers;

Is iterator + ‘forEachRemaining’ the best approach here? I imagine something like “bulk read”, but couldn’t find a method for that.

The 2nd question (it’s on the verge of data modeling and java driver) that I’d like to ask you is how to arrange my filters to get the most performance out of the DB itself? I know the ESR rule and I’ve created a model for SearchRequest that allows me to optimize the filters (i.e. if there is a Request with 1-element array, it’s modeled as EqualityFilter, InRangeFilter otherwise), which leads to filters like these (this is the BSON that’s passed to ‘find’ method above as ‘filters’):

filters (ranges): And Filter{filters=[Filter{fieldName='gearbox', value=AT}, Filter{fieldName='registered', value=true}, Filter{fieldName='engineCapacityRange', value=1500_2000}, Operator Filter{fieldName='price', operator='$gte', value=40000}, Operator Filter{fieldName='price', operator='$lte', value=60000}, Operator Filter{fieldName='year', operator='$gte', value=2015}, Operator Filter{fieldName='make', operator='$in', value=[Volkswagen, Toyota]}, Operator Filter{fieldName='bodyType', operator='$in', value=[SUV, SEDAN, COMBI]}, Operator Filter{fieldName='fuelType', operator='$in', value=[PETROL, HYBRID]}]}
filters (one element ranges transformed into equality filters): And Filter{filters=[Filter{fieldName='make', value=Toyota}, Filter{fieldName='bodyType', value=SUV}, Filter{fieldName='fuelType', value=HYBRID}, Filter{fieldName='gearbox', value=AT}, Filter{fieldName='registered', value=true}, Filter{fieldName='engineCapacityRange', value=1500_2000}, Operator Filter{fieldName='price', operator='$gte', value=40000}, Operator Filter{fieldName='price', operator='$lte', value=100000}, Operator Filter{fieldName='year', operator='$gte', value=2015}]}

As you can see, ‘in range’ filters are moved to the last positions of ‘filters’ so the DB can use the indexes optimally. I’ve got c.a. 12 indexes, starting from:

db.offers.createIndex({ make: 1, model: 1, generation: 1, bodyType: 1, fuelType: 1, price: -1, publishedDate: 1 })

and ending with:

db.offers.createIndex({ make: 1, model: 1, generation: 1, bodyType: 1, fuelType: 1, gearboxType: 1, registered: 1, damaged: 1, mileageRange: 1, engineCapacityRange: 1, price: -1, publishedDate: 1, year: 1 })

and variations between them.
I’m not sure if the way I arrange filters order in case of ‘in range filters’ is fine (e.g. make is the 1st field in index and works superb for ‘eq’ but is it fine that I move it to the end of the filters when it’s ‘in’? I think so, then it should be fine with ESR rule, but maybe I’m wrong and you’ve got some other ideas, or you see some improvements here?

Currently, I’m getting end-to-end (req to resp in Postman) times under 1s for remote DB (M0 Atlas cloud) filled with c.a. 140k documents for probable query scenarios. What made me wonder though is the small difference between limiting to and returning 50 docs and getting all of the docs that fit the filters (c.a. 750 docs) → it’s c.a. 500 ms vs c.a. 800 ms.

If you can see any improvement points in terms of performance here besides the questions that I stated, please share your thoughts with me!

Best regards,
Przemek, Poland

Jeffrey_Yemin · July 26, 2023, 6:01pm

Is iterator + ‘forEachRemaining’ the best approach here? I imagine something like “bulk read”, but couldn’t find a method for that.

Performance-wise it’s fine, but note that the iterator method returns a MongoCursor, which implements Closeable, and your code doesn’t ensure that the cursor is closed in the face of an exception during iteration (unlikely to happen, but it’s good practice. So you could use try-with-resources to ensure that the cursor is closed. Alternatively, this is a simpler way of doing the same thing:

List<Document> offers =
            offersCollection
                    .find(filters)
                    .projection(SEARCH_PROJECTION)
                    .sort(sort)
                    .into(new ArrayList<>());

Updated: Added https://jira.mongodb.org/browse/JAVA-5085 to address the cursor closing issue with Iterator#forEachRemaining.

przemek · July 27, 2023, 8:28am

Thanks @Jeffrey_Yemin ! I’ve applied ‘into’ to my code.
Is it possible to use JSON directly as an argument to some method to perform MongoDB query? I mean something similar to mongosh → db.coll.find(JSON) → but how this doc should look like to involve sorting and skip/limit?

Are you able to share your knowledge in regards to my 2nd question? To sum up:

let’s assume we have such index:
make_engineCapacityRange_gearbox
Let’s assume we want to filter results like this:
make=[BMW, Toyota], engineCapacityRange=2, gearbox=AT
How should we arrange filters in filter to get the max out of index? My current approach is: engineCapacityRange, gearbox, make, because ‘in’ is range filter and if used, we couldn’t use the index for latter fields. But I’m not sure if I’m right.
why execution time difference is so small while returning 50 and 750 documents? Does MongoDB do all the filtering and later apply limit so the difference is only about sending more data over the wire? Does it always act like this, or only while sorting is applied?

Thanks!

Jeffrey_Yemin · July 27, 2023, 11:51am

There’s a class called org.bson.json.JsonObject that is a bridge between the driver API and a JSON string. You can use it any place you would use an org.bson.Document.

I don’t have quick answers for your other questions (It’s generally better to not ask a bunch of unrelated questions in a single forum question. Makes it easier for different people to help you and to make the answers easier to find for future readers).

Regards,
Jeff

przemek · July 28, 2023, 7:50am

Sure @Jeffrey_Yemin , I’ll create a new thread for the 2nd question.
If I’m only interested in returning the found Docs, shall I return the collection of JsonObject instead of Document? Will it help with performance?
On the query side, I’ll probably stick to what I have (req → model → filters from driver), as these are only one object per req and gives me more flexibility.

Manoj_Singh_rawat · July 28, 2023, 8:12am

All monodb member connect with me in unique app that is facebook with monodb account. Than send me message of any question. In separate message box of my Facebook ID. Thank you.

Jeffrey_Yemin · July 28, 2023, 2:37pm

If I’m only interested in returning the found Docs, shall I return the collection of JsonObject instead of Document? Will it help with performance?

It really depends on your use case, in particular what you need to do with the returned documents.

Have a look at https://www.mongodb.com/docs/drivers/java/sync/current/fundamentals/data-formats/documents/#documents for an overview of the options, and make the best decision for your application.

Regards,
Jeff

przemek · July 28, 2023, 2:54pm

Great, thanks for the link. The docs are really comprahensive. From what I can see, as my use case is simply query for data and return it via controller without any modifications, JsonObject will help performance because conversion to Map won’t be performed.