Aggregation $lookup optimization to match _id for top-1 selection

Hi guys,

I am fairly new to MongoDB so there might be obvious mistakes here. I have attemted an aggregation pipeline which takes several seconds for a mere collection size of ~2000 documents. I have isolated the bottleneck: the initial $lookup step. Hints are welcome, you are also welcome to send me your contact details if you do consulting around these kind of issues on a regular basis.

I am trying to do something similar to the SQL statement

SELECT t.* FROM ModelEvaluation t
  WHERE t.id = (SELECT t1.id FROM ModelEvaluation t1
                WHERE t1.customer_id = t.customer_id 
                ORDER BY t1.modified DESC LIMIT 1)

My Mongo aggregation (the part that takes +90% of the time)

db.ModelEvaluation.aggregate([                                                                                                                                                                                       
    {                                                                                                                                                                                                                
        $lookup: {                                                                                                                                                                                                   
            from: "ModelEvaluation",                                                                                                                                                                                 
            let: {f_customer_id: "$customer_id"},                                                                                                                                                                    
            pipeline: [                                                                                                                                                                                              
                {                                                                                                                                                                                                    
                    /* This looks complex but it has to be this way.                                                                                                                                                 
                     *                                                                                                                                                                                               
                     * https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#lookup-join-let */                                                                                                     
                    $match: {                                                                                                                                                                                        
                        $expr: {                                                                                                                                                                                     
                            $eq: ["$customer_id", "$$f_customer_id"],                                                                                                                                                
                        }                                                                                                                                                                                            
                    }                                                                                                                                                                                                
                },                                                                                                                                                                                                   
                {$sort: {modified: -1}},                                                                                                                                                                                 
                {$limit: 1},                                                                                                                                                                                         
                {$project: {_id: true}}                                                                                                                                                                              
            ],                                                                                                                                                                                                       
            as: "c1",                                                                                                                                                                                                
        }                                                                                                                                                                                                            
    },                                                                                                                                                                                                               
    {$out: "foo"},                                                                                                                                                                                                   
])   

Hi @Henrik_Holst - welcome to the community!

By their nature $lookup operations are slower than queries that don’t need to join data from more than one collection.

If you’ll be needing to join the data together frequently, you may want to consider restructuring your data. The rule of thumb when modeling data in MongoDB is data that is accessed together should be stored together.

If you’re not able to restructure the data, can you post some sample documents from each collection as well as the output you’re trying to generate, so we can see if there is a way to optimize your $lookup?

I agree that the $lookup seems to be designed to allow fetching data from other collections and this is not what I am doing. We have seen various ideas how to optimize this here but in the end of the day I found another way to solve this using $group and $first.

The complete pipeline also became 50% as long:

db.ModelEvaluation.aggregate([
        {
                $sort: {
                        modified: -1
                }
        },
        {
                $group: {
                        _id: "$customer_id",
                        values: {
                                $first: "$values"
                        },
                        dataset_transform: {
                                $first: "$dataset_transform"
                        }
                }
        },
        {$unwind: "$values"},
        {
                $match: {
                        "values.target": "accoding"
                }
        },
        {$unwind: "$dataset_transform"},
        {       
                $match: {
                        "dataset_transform.tail": {$exists: true}
                }
        },
        {
                $project: {
                        customer: "$_id",
                        accuracy: "$values.value",
                        size: "$dataset_transform.tail",
                }
        },
        {$out: "foo"},
])

Hello

If you want only the top1 for each customer based on $modified, $max is simple way to go.

{
  "aggregate": "modelevaluation",
  "pipeline": [
    {
      "$group": {
        "_id": "$customerid",
        "top1Customer": {
          "$max": {
            "modified": "$modified",      //put first on document the sorting creteria,here the modified value
            "customerid": "$customerid",  //or "$values" on your example
            "modelid": "$modelid"         //or "$dataset_transform" on your example
          }
        }
      }
    },
    {
      "$addFields": {
        "customerid": "$_id"
      }
    },
    {
      "$project": {
        "_id": 0
      }
    }
  ],
  "maxTimeMS": 0,
  "cursor": {}
}

You need this i believe,mongoDB allows document comparison

Each time the $max will keep the document with the top value on $modified.

Its fast and don’t need index,that would need memory.