$project in $lookup pipeline does not avoid loading the whole document

Cedric_Louyot · April 24, 2024, 7:49am

Hello,

I’m working with an aggregate similar to this one on a source collection:

[
  {
    $match: {
      isDeleted: false
    },
  },
  {
    $lookup: {
      from: "joined",
      localField: "_id",
      foreignField: "sourceId",
      pipeline: [{
        $project: { veryHeavyField: 0 }
      }]
      as: "joined",
    },
  }
]

As the name suggest, the veryHeavyField is heavy field (around a mega-octet) in the joined collection.

This aggregate is very slow. This seems to be because veryHeavyField seems to be loaded when I run the aggregate even though I use a $project stage to exclude this field. To be clear, veryHeavyField seems to be loaded but, due to the $project stage, it’s not returned so it’s not useful to load it and it leads to a huge performance drop.

This behaviour seems to be different when I use an equivalent projection in a find . In the following query, veryHeavyField is not loaded and it’s very fast:
db.source.find({}, {veryHeavyField: 0})

Questions are:

Is the difference of behaviour described above between $project in a $lookup pipeline and the equivalent projection in a find normal ?
Is there a way to avoid loading veryHeavyField in the above $lookup pipeline ?

Akshat_Gupta3 · April 25, 2024, 5:41am

Hi Cedric,

The syntax to use lookup with pipeline is slightly different as mentioned in the docs - Lookup

I doubt the time consumption is due to the mentioned veryHeavyField but still i think the below modified query may solve your purpose as it in the beginning itself removes the veryHeavyField.

[
  {
    $match: {
      isDeleted: false
    },
  },
  {
    $lookup: {
      from: "joined",
      let: {
        id: "$_id"
      }
      pipeline: [
        {
          $project: { veryHeavyField: 0 }
        },
        {
          $match: {
            $expr: {
              $eq: ["$$id", "$sourceId"]
            }
          }
        }
      ]
      as: "joined",
    },
  }
]

Hope this helps!!

steevej · April 25, 2024, 1:44pm

I do not think that this would prevent a fetch.

To prevent a fetch an index must cover the query. With a $project:{field:0} I do not think it is possible to avoid a fetch because the system does not know which fields exist in the document before the fetch. To avoid a fetch the $project must be inclusive.

You may simply lack an index on sourceId in the joined collection.

If veryHeavyField is projected out in most of your use-cases, you whole system may gain by moving the field in another collection and $lookup for it only when needed. Especially, since update one field means the whole document is written back.

Peter_Hubbard · April 25, 2024, 2:04pm

Documents are read atomically, so there is no way to avoid reading a certain field. The Project stage only affects what data is transmitted back to the client. As such, it should only be the final stage in an aggregation pipeline.

Make sure that you have an index on "sourceId" in the “joined” collection.

steevej · April 25, 2024, 2:22pm

I need some clarification:

My understanding is that documents will not be read/fetch if all projected fields are part of the index use by the query. Right? For documents, like { _id:1, a:2, veryHeavyField:[…]} and an index {a:1,_id:1} and I project {a:1,_id:1} no reading from disk should occur. Reference:

Peter_Hubbard · April 25, 2024, 2:58pm

Yes, this special case is called a Covered Query.

OP mentioned however that the following query was fast:
db.source.find({}, {veryHeavyField: 0})
This query does not even use an index. The performance problem in the aggregation has nothing to do with the presence of the large field in the source Documents. Use the .explain() and then .explain(“executionStats”) functions to see what the query planner is doing.

Cedric_Louyot · April 26, 2024, 12:26am

Thank you @steevej for your answer.

In fact, I have an index on the sourceId field of the joined collection. And the query is very slow only when it fetches documents with a veryHeavyField from the joined collection.

The collection is already existing and in use with a real lot of documents in it, so I cannot move veryHeavyField in another collection anymore.

Cedric_Louyot · April 26, 2024, 12:26am

Thanks a lot for your answer @Akshat_Gupta3.

I think both syntaxes are equivalent. The one from my initial post is described in the section Correlated Subqueries Using Concise Syntax. And I use a particular case where I don’t need any variable to be defined in let since I’m only projecting on a field of the joined collection in the $lookup pipeline.

So the query duration is the same with the query you’ve proposed (I’ve tested it) and my initial query.

Peter_Hubbard · April 26, 2024, 7:09am

You can add a $project as a final stage to the aggregation pipeline, to exclude the veryHeavyField from the response sent back to the client.

Cedric_Louyot · April 26, 2024, 1:43pm

Thank you @Peter_Hubbard for your helpful answers.

I’ve just dug a little more into the subject and my issue comes down to the following comparison.

When I run explain on very basic find queries on the activities collection (which is the joined collection of my OP with the veryHeavyField), the executionTimeMillis depends a lot on the _id range I query (for about the same number of documents returned).

For example, this query:

db.getCollection("activities").explain("executionStats")
  .find({_id: {
    $gte: ObjectId('577000000000000000000000'), 
    $lte: ObjectId('578000000000000000000000')
  }});

is very slow: “executionTimeMillis” : 8751 for “nReturned” : 81951.

detailed explain here

{
    "explainVersion" : "1",
    "queryPlanner" : {
        "namespace" : "activities",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "_id" : {
                        "$lte" : ObjectId("578000000000000000000000")
                    }
                },
                {
                    "_id" : {
                        "$gte" : ObjectId("577000000000000000000000")
                    }
                }
            ]
        },
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "_id" : 1.0
                },
                "indexName" : "_id_",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "_id" : [

                    ]
                },
                "isUnique" : true,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2.0,
                "direction" : "forward",
                "indexBounds" : {
                    "_id" : [
                        "[ObjectId('577000000000000000000000'), ObjectId('578000000000000000000000')]"
                    ]
                }
            }
        },
        "rejectedPlans" : [

        ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 81951.0,
        "executionTimeMillis" : 8751.0,
        "totalKeysExamined" : 81951.0,
        "totalDocsExamined" : 81951.0,
        "executionStages" : {
            "stage" : "FETCH",
            "nReturned" : 81951.0,
            "executionTimeMillisEstimate" : 8603.0,
            "works" : 81952.0,
            "advanced" : 81951.0,
            "needTime" : 0.0,
            "needYield" : 0.0,
            "saveState" : 527.0,
            "restoreState" : 527.0,
            "isEOF" : 1.0,
            "docsExamined" : 81951.0,
            "alreadyHasObj" : 0.0,
            "inputStage" : {
                "stage" : "IXSCAN",
                "nReturned" : 81951.0,
                "executionTimeMillisEstimate" : 132.0,
                "works" : 81952.0,
                "advanced" : 81951.0,
                "needTime" : 0.0,
                "needYield" : 0.0,
                "saveState" : 527.0,
                "restoreState" : 527.0,
                "isEOF" : 1.0,
                "keyPattern" : {
                    "_id" : 1.0
                },
                "indexName" : "_id_",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "_id" : [

                    ]
                },
                "isUnique" : true,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2.0,
                "direction" : "forward",
                "indexBounds" : {
                    "_id" : [
                        "[ObjectId('577000000000000000000000'), ObjectId('578000000000000000000000')]"
                    ]
                },
                "keysExamined" : 81951.0,
                "seeks" : 1.0,
                "dupsTested" : 0.0,
                "dupsDropped" : 0.0
            }
        }
    },
    "command" : {
        "find" : "activities",
        "filter" : {
            "_id" : {
                "$gte" : ObjectId("577000000000000000000000"),
                "$lte" : ObjectId("578000000000000000000000")
            }
        },
    }
}

Whereas this query (same query with a different range of _id):

db.getCollection("activities").explain("executionStats")
  .find({_id: {
    $gte: ObjectId('579000000000000000000000'), 
    $lte: ObjectId('600000000000000000000000')
  }});

which returns more documents is more than 10 times faster: “executionTimeMillis” : 654 for “nReturned” : 105448.

Detailed explain here

{
    "explainVersion" : "1",
    "queryPlanner" : {
        "namespace" : "activities",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "_id" : {
                        "$lte" : ObjectId("600000000000000000000000")
                    }
                },
                {
                    "_id" : {
                        "$gte" : ObjectId("579000000000000000000000")
                    }
                }
            ]
        },
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "_id" : 1.0
                },
                "indexName" : "_id_",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "_id" : [

                    ]
                },
                "isUnique" : true,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2.0,
                "direction" : "forward",
                "indexBounds" : {
                    "_id" : [
                        "[ObjectId('579000000000000000000000'), ObjectId('600000000000000000000000')]"
                    ]
                }
            }
        },
        "rejectedPlans" : [

        ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 105448.0,
        "executionTimeMillis" : 654.0,
        "totalKeysExamined" : 105448.0,
        "totalDocsExamined" : 105448.0,
        "executionStages" : {
            "stage" : "FETCH",
            "nReturned" : 105448.0,
            "executionTimeMillisEstimate" : 361.0,
            "works" : 105449.0,
            "advanced" : 105448.0,
            "needTime" : 0.0,
            "needYield" : 0.0,
            "saveState" : 108.0,
            "restoreState" : 108.0,
            "isEOF" : 1.0,
            "docsExamined" : 105448.0,
            "alreadyHasObj" : 0.0,
            "inputStage" : {
                "stage" : "IXSCAN",
                "nReturned" : 105448.0,
                "executionTimeMillisEstimate" : 37.0,
                "works" : 105449.0,
                "advanced" : 105448.0,
                "needTime" : 0.0,
                "needYield" : 0.0,
                "saveState" : 108.0,
                "restoreState" : 108.0,
                "isEOF" : 1.0,
                "keyPattern" : {
                    "_id" : 1.0
                },
                "indexName" : "_id_",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "_id" : [

                    ]
                },
                "isUnique" : true,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2.0,
                "direction" : "forward",
                "indexBounds" : {
                    "_id" : [
                        "[ObjectId('579000000000000000000000'), ObjectId('600000000000000000000000')]"
                    ]
                },
                "keysExamined" : 105448.0,
                "seeks" : 1.0,
                "dupsTested" : 0.0,
                "dupsDropped" : 0.0
            }
        }
    },
    "command" : {
        "find" : "activities",
        "filter" : {
            "_id" : {
                "$gte" : ObjectId("579000000000000000000000"),
                "$lte" : ObjectId("600000000000000000000000")
            }
        },
    },
}

Since the range of _id from the 1st query contains a lot of veryHeavyField whereas this field is mostly empty for the range of _id of the 2nd query, I thought this was the reason why the 1st query spent so much time on the FETCH stage, where the 2nd query was really fast for this FETCH stage.

Is this the reason why FETCH stage durations are so much different ?
Or could there be another reason (like cache since the 2nd query queries more recent documents or anything else) ?

Peter_Hubbard · April 30, 2024, 3:59pm

Assuming all the documents are the same size (approximately) there isn’t enough information in the explain plan to explain why the performance is so different. You might want to look at the Collection Statistics for more information on what is in the cache: https://www.mongodb.com/docs/manual/reference/method/db.collection.stats/

steevej · April 30, 2024, 4:29pm

Sorry to barge in Peter.

They are not as mentioned

Bigger documents have a bigger influence on the cache and on disk I/O.

There are 2 ways to avoid fetching a veryHeavyField.

move the veryHeavyField in another collection and use $lookup when an only when you need it
use a $project that only $project fields that are part of the index, projecting out the veryHeavyField does not prevent the fetch

Note that by moving out veryHeavyField out, write performance are also improved since when a document is written the whole document is written back. With veryHeavyField embedded, modifies a verySmallField implies reading the veryHeavyField and writing back veryHeavyField.

system · May 5, 2024, 4:29pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.