Applying Projection before Aggregation similar to Find

Hello all,
I want to apply a projection to limit the size of each document before I start my aggregation pipeline.

My document format is roughly:

sibilingField:""
metadata:{
    importantField:"...",
   extremlyLargeField:Binary(...)
}

As I scaled up my DB, I noticed that my aggregations became incredibly slow. After investigating, I learned that the size of each document is the source of this latency. Using explain(), I saw that the query executed at the start of each aggregation causes 95% of the latency. That is my current understanding of the problem.

Here is my understanding of the path to a solution.
The query that the aggregation executes at the start is essentially a “select *”, i.e. the same as a find({}). Given this, I want to omit the field that is extremely large and not used in my aggregation. With find() I can do find({}, {"metadata.extremlyLargeField": 0}), which works and slashes my times! My goal is therefore to accomplish the same thing in my aggregation.
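To make that concrete, this is roughly the find() that is fast for me; myCollection is just a placeholder for my real collection name:

```js
// Exclude the huge binary field so it is never returned to the client.
db.myCollection.find({}, { "metadata.extremlyLargeField": 0 })
```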

However, that option does not exist for aggregate(). The obvious answer is to use the $project or $unset aggregation stage (which are equivalent when only omitting fields), but that does not work either: the initial query time is still 95% of my latency. The documentation states that unused fields will be automatically omitted, but my large field very clearly is not.
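What I tried looks roughly like the sketch below; myCollection and the commented-out stages stand in for my real collection and pipeline:

```js
// Attempt 1: exclude the field with $project as the very first stage.
db.myCollection.aggregate([
  { $project: { "metadata.extremlyLargeField": 0 } },
  // ... the rest of my pipeline ...
])

// Attempt 2: the same idea with $unset.
db.myCollection.aggregate([
  { $unset: "metadata.extremlyLargeField" },
  // ... the rest of my pipeline ...
])
```

In both cases the initial document fetch still dominates the runtime.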

There is a clear similarity between find() and the query that runs at the start of every aggregation, but there is just as clear a difference between the projection on find() and a projection stage in the aggregation pipeline.

Given this preface, here is my question: How do I apply the same pre-fetch projection to the start of my aggregation?

Can you share the explain plan?

What are the first stages of your aggregation?

What indexes do you have?
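If you are not sure how to capture those, something like this in mongosh should do it (the collection name and the pipeline are placeholders for yours):

```js
// Run the same aggregation under explain to see where the time goes.
db.myCollection.explain("executionStats").aggregate([
  // ... your pipeline stages ...
])

// List the indexes defined on the collection.
db.myCollection.getIndexes()
```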

@Paolo_Bartolucci, please follow up on your thread.

One thing that comes to mind: if a lot of your use cases $project out extremlyLargeField, then perhaps that is an indication that the field really belongs in a secondary collection.

This way, you are not penalized for the use cases that do not need it, at the cost of having to do a $lookup when you really do need it.
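Roughly, the split could look like the sketch below; the collection and field names (metadataLarge, parentId) are only placeholders for illustration:

```js
// Most use cases read only the small fields from the main collection.
db.myCollection.aggregate([
  { $match: { /* your usual filters */ } },
  // ... the rest of your pipeline, which never touches the large binary ...
])

// The rare use case that needs the large binary joins it in from the
// secondary collection that now holds it.
db.myCollection.aggregate([
  { $match: { /* your usual filters */ } },
  { $lookup: {
      from: "metadataLarge",   // hypothetical secondary collection
      localField: "_id",       // assumes each large doc stores its parent's _id
      foreignField: "parentId",
      as: "largeField"
  } }
])
```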

Considering that documents are completely re-written to a new block when updated, having large binary fields or large arrays is also detrimental to your write performance. So if your documents with extremlyLargeField have other fields that are dynamically updated, then moving the large field out would be doubly beneficial.