Fill missing values after group

I'm using the MongoDB aggregation framework. I have a collection with documents like this:

{
  'step': 1,
  'name': 'house',
  'score': 2
}
{
  'step': 1,
  'name': 'car',
  'score': 3
}
{
  'step': 2,
  'name': 'house',
  'score': 4
}

I'm grouping the documents with the same 'step' and pushing 'name' and 'score' into an array of objects. What I get is:

{
  'step': 1,
  'scores': 
  [
      {'name':'house','score':2},
      {'name':'car','score':3}
  ]
}
{
  'step': 2,
  'scores': 
  [
      {'name':'house','score':4}
  ]
}

For each 'step', I need to copy the value from the previous 'step' whenever a 'name' does not exist in the current one. I should get something like this:

{
  'step': 1,
  'scores': 
  [
      {'name':'house','score':2},
      {'name':'car','score':3}
  ]
}
{
  'step': 2,
  'scores': 
  [
      {'name':'house','score':4},
      {'name':'car','score':3}
  ]
}

In the second document, the element {'name':'car','score':3} has been copied from the previous document, because at step 2 there are no documents with a 'score' for 'car'.

If step 1 does not have a car record, then step 2 should not have a car record either.

I'll try to explain the goal better:
For each step, the two fields (house and car) should be inspected. If no value is available for one of them, the missing value should be filled with the last value provided at a previous step. If no previous step has a value for the field, there is nothing to copy to the current step.

I'm not able to figure out how to do this with MongoDB aggregation. Any help would be much appreciated.
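To make the expected behaviour concrete, here is a minimal plain-JavaScript sketch of the fill-forward rule described above. It is not an aggregation pipeline, and `fillForward` is a hypothetical helper name used only for illustration; it assumes the documents fit in memory and that `step` is numeric.

```javascript
// Plain-JavaScript sketch of the requested result (not an aggregation pipeline).
// `fillForward` is a hypothetical helper name used only for illustration.
function fillForward(docs) {
  // Collect the distinct step numbers in ascending order.
  const steps = [...new Set(docs.map((d) => d.step))].sort((a, b) => a - b);
  const latest = new Map(); // name -> most recent score seen so far
  return steps.map((step) => {
    // Scores recorded at this step overwrite the carried-over ones.
    for (const d of docs) {
      if (d.step === step) latest.set(d.name, d.score);
    }
    // Snapshot every name seen at this step or at any earlier one.
    const scores = [...latest].map(([name, score]) => ({ name, score }));
    return { step, scores };
  });
}

const docs = [
  { step: 1, name: 'house', score: 2 },
  { step: 1, name: 'car', score: 3 },
  { step: 2, name: 'house', score: 4 },
];
console.log(JSON.stringify(fillForward(docs), null, 2));
```

A name that never appeared at an earlier step is simply absent from `latest`, so nothing is copied for it.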

Hello, @Merce_Bruned_Lacoma! Welcome to the community!

OK, so we need to build some kind of relationship between two separate documents. That is achievable only if we combine all the documents into a temporary list for calculation purposes.

Please note that with this approach, depending on the number of documents in your aggregation and the size of each document that goes into the temporary list, you may hit the memory limits of an aggregation pipeline stage, which may hurt aggregation performance.

I will extend your example dataset so that we can run the aggregation against a longer chain of documents:

db.test1.insertMany([
  {
    step: 1,
    name: 'house',
    score: 18,
  },
  {
    step: 1,
    name: 'car',
    score: 5,
  },
  {
    step: 2,
    name: 'house',
    score: 20,
  },
  {
    step: 2,
    name: 'boat',
    score: 15,
  },
  {
    step: 2,
    name: 'yacht',
    score: 20,
  },
  {
    step: 3,
    name: 'plane',
    score: 50,
  },
]);

This aggregation should provide you with the desired result:


db.test1.aggregate([
  {
    $group: {
      _id: '$step',
      scores: {
        $push: {
          name: '$name',
          score: '$score',
        },
      },
    },
  },
  {
    // we need this, because $group stage
    // does not guarantee consistency
    // in the order in groups
    $sort: {
      _id: 1,
    },
  },
  {
    $group: {
      _id: null,
      list: {
        // collect all docs into one list
        // to be able to compare current and previous doc
        $push: '$$CURRENT',
      },
    },
  },
  {
    $project: {
      listWithChainedScores: {
        $reduce: {
          input: '$list',
          initialValue: null,
          in: {
            $cond: {
              if: {
                $eq: ['$$value', null],
              },
              then: {
                prev: '$$this',
                calculated: ['$$this'],
              },
              else: {
                prev: '$$this',
                calculated: {
                  // concat modified current doc with
                  // the general list of modified docs
                  $concatArrays: ['$$value.calculated', [{
                    // keep the current doc id
                    _id: '$$this._id',
                    scores: {
                      // combine scores of current and previous doc
                      $setUnion: ['$$this.scores', '$$value.prev.scores'],
                    },
                  }]],
                },
              },
            },
          },
        },
      },
    },
  },
  // $unwind + $replaceWith will make a new document
  // per each item in the $listWithChainedScores.calculated array
  {
    $unwind: '$listWithChainedScores.calculated',
  },
  {
    $replaceWith: '$listWithChainedScores.calculated',
  },
]).pretty();

Well, almost. Here is the output:

[
  {
    "_id": 1,
    "scores": [
      { "name": "house", "score": 18 },
      { "name": "car", "score": 5 }
    ],
  },
  {
    "_id": 2,
    "scores": [
      { "name": "boat", "score": 15 },
      { "name": "car", "score": 5 },
      { "name": "house", "score": 18 }, // duplicate
      { "name": "house", "score": 20 }, // duplicate
      { "name": "yacht", "score": 20 }
    ],
  },
  {
    "_id": 3,
    "scores": [
      { "name": "boat",  "score": 15 },
      { "name": "house", "score": 20 },
      { "name": "plane", "score": 50 },
      { "name": "yacht", "score": 20 }
    ],
  },
]

Notice that there are duplicates in the 'scores' array.
Resolving those duplicates is tricky here. However, nobody said it is not possible.

Add the following stages to the end of the aggregation:

[
  {
    // $unwind to be able to sort 
    $unwind: '$scores',
  },
  {
    $sort: {
      // order by score from bigger to smaller
      // $sort is needed so in the $group stage we picked
      // the first score object (that will have the bigger score value)
      // change the direction to '1' if you need the opposite
      'scores.score': -1,
    },
  },
  {
    $group: {
      // at this stage we get rid of duplicates
      _id: {
        docId: '$_id',
        scoreName: '$scores.name',
      },
      scores: {
        $first: '$scores',
      },
    },
  },
  {
    // at this stage we restore the original documents structure
    $group: {
      _id: '$_id.docId',
      scores: {
        $push: '$scores',
      },
    },
  },
]

All done.
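As a rough illustration of what those extra stages do to one 'scores' array, here is a minimal plain-JavaScript sketch. `dedupeByName` is a hypothetical helper, not part of the pipeline. Like the $sort on 'scores.score': -1 followed by $first, it keeps the entry with the highest score for each name, which matches the current step's value only when a score never decreases from one step to the next.

```javascript
// Hypothetical helper: mirrors $unwind + $sort (score descending) + $group with $first,
// applied to the 'scores' array of a single step.
function dedupeByName(scores) {
  const best = new Map(); // name -> entry with the highest score seen so far
  for (const entry of scores) {
    const seen = best.get(entry.name);
    if (seen === undefined || entry.score > seen.score) {
      best.set(entry.name, entry);
    }
  }
  return [...best.values()];
}

// The 'scores' array of step 2 from the output above, duplicates included.
const step2 = [
  { name: 'boat', score: 15 },
  { name: 'car', score: 5 },
  { name: 'house', score: 18 },
  { name: 'house', score: 20 },
  { name: 'yacht', score: 20 },
];
console.log(dedupeByName(step2));
```

If the newer value must always win regardless of size, the comparison would have to be on the step an entry came from rather than on the score.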

This is an interesting problem, and it actually has a much simpler solution: one that does not rely on pushing all of the documents into a single document. That approach can fail to scale when the collection is large, and it is unnecessarily complex.

My solution needs only one piece of information: the highest (that is, last) step number. If that is not known when the aggregation runs, there are two ways to get it: run a query first to fetch the number, or add a $lookup stage to fetch it. Luckily, expressive $lookup is smart enough to run a non-correlated subquery only once. I'll show the $lookup version:

db.steps.aggregate([
{$sort:{name:1, step:1}}, 
{$group:{_id:"$name", steps:{$push:{step:"$step", score:"$score"}}}}, 
{$lookup:{from:"steps", pipeline:[ {$sort:{step:-1}},{$limit:1}], as:"lastStep"}},
{$unwind:"$lastStep"}, 
{$set:{steps: {$reduce:{
    input:{$range:[{$add:[1,{$max:"$steps.step"}]}, {$add:[1,"$lastStep.step"]}]},
    initialValue:"$steps", 
    in: {$concatArrays:[ 
        "$$value", 
        [{$mergeObjects:[
            {$last:"$steps"}, 
            {step:"$$this"}
        ]}]
    ]}
}}}}, 
{$unwind:"$steps"}, 
{$group:{_id:"$steps.step", scores:{$push:{name:"$_id", score:"$steps.score"}}}}, 
{$sort:{_id:1}}
])

On the original example documents, the result is:

{ "_id" : 1, "scores" : [ { "name" : "house", "score" : 2 }, { "name" : "car", "score" : 3 } ] }
{ "_id" : 2, "scores" : [ { "name" : "house", "score" : 4 }, { "name" : "car", "score" : 3 } ] }

On @slava’s example the result is:

{ "_id" : 1, "scores" : [ { "name" : "car", "score" : 5 }, { "name" : "house", "score" : 18 } ] }
{ "_id" : 2, "scores" : [ { "name" : "car", "score" : 5 }, { "name" : "boat", "score" : 15 }, { "name" : "house", "score" : 20 }, { "name" : "yacht", "score" : 20 } ] }
{ "_id" : 3, "scores" : [ { "name" : "car", "score" : 5 }, { "name" : "boat", "score" : 15 }, { "name" : "house", "score" : 20 }, { "name" : "yacht", "score" : 20 }, { "name" : "plane", "score" : 50 } ] }
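For readers who find the $range/$reduce stage dense, the same idea can be sketched in plain JavaScript. This is only an illustration under the assumption that all documents fit in memory; `extendToLastStep` is a hypothetical name. Like the pipeline, it repeats each name's last entry for every step after that name's final recorded step, so it does not fill a gap between two recorded steps of the same name.

```javascript
// Hypothetical helper mirroring the pipeline: group by name, extend each name's
// history up to the last step, then regroup by step.
function extendToLastStep(docs) {
  const lastStep = Math.max(...docs.map((d) => d.step));
  const byName = new Map(); // name -> [{ step, score }] in ascending step order
  for (const d of [...docs].sort((a, b) => a.step - b.step)) {
    if (!byName.has(d.name)) byName.set(d.name, []);
    byName.get(d.name).push({ step: d.step, score: d.score });
  }
  const byStep = new Map(); // step -> [{ name, score }]
  for (const [name, history] of byName) {
    const last = history[history.length - 1];
    const filled = [...history];
    // Like the $range/$reduce stage: repeat the last entry for every later step.
    for (let s = last.step + 1; s <= lastStep; s++) {
      filled.push({ step: s, score: last.score });
    }
    for (const { step, score } of filled) {
      if (!byStep.has(step)) byStep.set(step, []);
      byStep.get(step).push({ name, score });
    }
  }
  return [...byStep]
    .sort(([a], [b]) => a - b)
    .map(([_id, scores]) => ({ _id, scores }));
}

const sample = [
  { step: 1, name: 'house', score: 18 },
  { step: 1, name: 'car', score: 5 },
  { step: 2, name: 'house', score: 20 },
  { step: 2, name: 'boat', score: 15 },
  { step: 2, name: 'yacht', score: 20 },
  { step: 3, name: 'plane', score: 50 },
];
console.log(JSON.stringify(extendToLastStep(sample)));
```

The order of names inside each 'scores' array may differ from the aggregation output, since $group does not guarantee the order in which it emits groups.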

I'd be happy to explain anything that's not clear here. I do want to point out that I'm using $last, an expression new in 4.4.0 that returns the last element of an array. If you want to run this on an earlier version of MongoDB, replace $last with a much longer and more unwieldy expression:

{'$arrayElemAt':[ <array>, {'$subtract':[{'$size':<array>}, 1]} ]}

which is a much more complicated way to grab the last element of an array, wouldn’t you agree?
