How to group an array and get the recent value without using unwind?

Consider the below sample document, which I am trying to get the recent values.

{
    "_id" : ObjectId("5ff32c8b6cff64b8582a7c12"),
    "Transaction" : [ 
        {
            "StatusCode" : "1",
            "Amount" : NumberDecimal("300"),
            "CreatedDateTime" : ISODate("2021-01-01T10:27:41.746Z")
        }, 
        {
            "StatusCode" : "2",
            "Amount" : NumberDecimal("-750"),
            "CreatedDateTime" : ISODate("2021-01-02T10:27:41.746Z")
        }, 
        {
            "StatusCode" : "1",
            "Amount" : NumberDecimal("1500"),
            "Date" : ISODate("2021-01-03T10:27:41.746Z")
        }
    ]
}

The Aggregate query that I tried

[{$unwind: 
{
  path: '$Transaction'
}}, {$group: {
  _id: "$Transaction.StatusCode",
   Payments : {
      $push: '$Transaction'
   }
}}, {$group: {
    _id: null,
    Paid: {
    $push: {
      $arrayElemAt: [
        '$Payments.Amount',
        {
          $indexOfArray: [
            '$Payments',
            {
              $max: '$Payments.CreatedDateTime'
            }
          ]
        }
      ]
    }
  } 
}}]

Results which I got

{
    "_id" : null,
    "Paid" : [ 
        NumberDecimal("1500"), 
        NumberDecimal("-750")
    ]
}

Can this query be possible to optimize without using unwind or any other way to bring this result?

$unwind is required but you can optimize the query as below:

  • $unwind deconstruct Transaction array
  • $group by StatusCode and get a maximum amount from Amount
  • $group by null and construct the array of Amount in Paid
db.collection.aggregate([
  { $unwind: { path: "$Transaction" } },
  {
    $group: {
      _id: "$Transaction.StatusCode",
      Payments: { $max: "$Transaction.Amount" }
    }
  },
  {
    $group: {
      _id: null,
      Paid: { $push: "$Payments" }
    }
  }
])

Hi @Sudhesh_Gnanasekaran,

any other way to bring this result?

Based on the example document, Transaction field is an array that looks like going to keep growing in size as more transactions occur. Even before a document reaches the size limitation, you are likely to encounter challenges at querying data from a document.

Depending on your use case, I’d suggest to re-consider your data model. You could try to store a transaction event as a single document instead. See also Building With Patterns: A summary for more data modelling information.

Another observation based on the example document, there are two fields that contains date information. I’d suggest to use a consistent schema for the sub-documents within the Transaction field.

Regards,
Wan.

2 Likes

I addition to what @wan said about the schema, I’d like to add that $unwind may not be necessary here. It depends on what exactly the aggregation is supposed to do. There are multiple ways of processing arrays without having to unwind them, but what I’m not clear on is why you are trying to avoid an unwind.

I can see trying to avoid $group but $unwind is an efficient streaming stage so it’s nothing to fear…

@Sudhesh_Gnanasekaran can you clarify what your aggregation is trying to do? What I see now is grouping all the transactions of the same status code and then taking the amount from the highest date value - is that what you are trying to do? The problem with your aggregation is there is no way to know which final array element belongs to which status code (because the order is not guaranteed to be by group key value) and you have some unfortunate typos or mistakes which are not uncovered by having a very trivial test document.

In general, it’s easier to help if you provide a description of what you want the output of your aggregation to be both in plain English and as a sample document. Otherwise you end up with people giving you advice how to construct an aggregation that does something different than what you want it to!

Asya

2 Likes

Hi @Asya_Kamsky

I want to take the recent amount value by “CreatedDateTime” with groupby “StatusCode.”

If the status code =1, the Amount value 1500 is the recent transaction by date.
If the status code =2, the Amount value 700 is the recent transaction by date.

Below is the expected result

{
    "_id" : null,
    "Paid" : [ 
        NumberDecimal("1500"), 
        NumberDecimal("-750")
    ]
}

So little concern about the $unwind operation will cause any performance issue because we have more than 10 million documents in that collection. Then each transaction array having least 3 to 5 sub document.

In this case my final output will be sum of each transaction of recent paid .

{
  PaidAmount : 750
}

Create several more test documents and see that the result will be an array with as many entries as you have different status codes but they will not be in order so there is no way to know which amount belongs to which statusCode!

Do you want one result document for each input document? Or do you want one result document for your entire collection? How are transactions grouped into documents in your collection? All these things influence how you should write and optimize the aggregation.

If your sample documents had these arrays of transactions, what should the output be?

 { Transactions: [ { status: 1, date: 1, amount: 10}, { status: 2, date: 2, amount: 20} ]}
 { Transactions: [ { status: 1, date: 3, amount: 15}, { status: 2, date: 1, amount: 30} ]}
 { Transactions: [ { status: 3, date: 1, amount: 5} ]}

Is it the single latest date value for each status that you want? Then first you say you want the result to be _id:null with array of paid amounts (in some order?) but then you say you want to sum each transaction of recent paid. So do you not care about the order because you will be summing them? You just want the latest value for each status code?

Simplest way to do that would be:

 [
     {$unwind:"$Transactions"},
     {$sort:{"Transaction.StatusCode":1,"Transaction.CreatedDateTime":-1}},
     {$group:{_id:"$Transaction.StatusCode", latestPayment:{$first:"$Transaction.Amount"}}},
     {$group:{_id:null, PaidAmount:{$sum:"$latestPayment"}}}
 ]

The problem with this is that the $sort cannot use an index (since you are sorting items that are not stored in the collection but unwound version of them), so you will need to use allowDiskUse:true and this aggregation will be slow unless you can limit it to a subset of your original documents with $match first.

So in fact, avoiding $unwind here would be key to being able to use an index or being able to keep the result small enough not to spill to disk. There’s a way to do it I think but it’s a bit ugly. I would really suggest reconsidering your schema if at all possible.

Asya

1 Like