Divide a MongoDB document into multiple documents based on Date Field using aggregation

I have a document in MongoDB that contains flight data records. The data goes back 4 years and is mixed together in the flight_records array shown in the sample below. I want to split this document into separate documents based on the year of each record. My sample document is as follows:

[
  {
    "flight_number": "DL1637",
    "airlines_code": "DL",
    "flight_records": [
      {
        "status": "Landed 22:09",
        "origin": "DCA",
        "destination": "DTW",
        "date_of_journey": "2021-11-18",
        "scheduled": {
          "dep": 1668735240,
          "arr": 1668741240
        },
        "real": {
          "dep": 1668736446,
          "arr": 1668740944
        }
      },
      {
        "status": "Landed 19:21",
        "origin": "DTW",
        "destination": "DCA",
        "date_of_journey": "2022-11-17",
        "scheduled": {
          "dep": 1668726000,
          "arr": 1668731640
        },
        "real": {
          "dep": 1668727074,
          "arr": 1668730907
        }
      },
      {
        "status": "Landed 21:53",
        "origin": "DCA",
        "destination": "DTW",
        "date_of_journey": "2023-11-17",
        "scheduled": {
          "dep": 1668648840,
          "arr": 1668654840
        },
        "real": {
          "dep": 1668649558,
          "arr": 1668653594
        }
      },
      {
        "status": "Landed 19:20",
        "origin": "DTW",
        "destination": "DCA",
        "date_of_journey": "2021-11-16",
        "scheduled": {
          "dep": 1668639600,
          "arr": 1668645240
        },
        "real": {
          "dep": 1668640813,
          "arr": 1668644421
        }
      },
      {
        "status": "Landed 22:13",
        "origin": "DCA",
        "destination": "DTW",
        "date_of_journey": "2022-11-16",
        "scheduled": {
          "dep": 1668562440,
          "arr": 1668568440
        },
        "real": {
          "dep": 1668564614,
          "arr": 1668568424
        }
      },
      {
        "status": "Landed 19:17",
        "origin": "DTW",
        "destination": "DCA",
        "date_of_journey": "2023-11-15",
        "scheduled": {
          "dep": 1668553200,
          "arr": 1668558840
        },
        "real": {
          "dep": 1668553656,
          "arr": 1668557859
        }
      },
      {
        "status": "Landed 21:59",
        "origin": "DCA",
        "destination": "DTW",
        "date_of_journey": "2022-11-15",
        "scheduled": {
          "dep": 1668476040,
          "arr": 1668482040
        },
        "real": {
          "dep": 1668477413,
          "arr": 1668481169
        }
      },
      {
        "status": "Landed 19:24",
        "origin": "DTW",
        "destination": "DCA",
        "date_of_journey": "2021-11-14",
        "scheduled": {
          "dep": 1668466800,
          "arr": 1668472440
        },
        "real": {
          "dep": 1668468110,
          "arr": 1668471887
        }
      },
      {
        "status": "Landed 22:07",
        "origin": "DCA",
        "destination": "DTW",
        "date_of_journey": "2023-11-14",
        "scheduled": {
          "dep": 1668389640,
          "arr": 1668395640
        },
        "real": {
          "dep": 1668391115,
          "arr": 1668395275
        }
      }
    ]
  }
]

I am using "$unwind": "$flight_records" first to unwind the flight records array, and then I am thinking about grouping the data by the year of the record, but I am not able to get a proper result. Can someone guide me as to which aggregation stages and operators I should be using here? I am thinking along these lines:

db.collection.aggregate({
  "$unwind": "$flight_records",
  "$group": {
    "_id": "flight_records",
    "date_of_journey": {
      "year": {
        "$year": new Date("$flight_records.date_of_journey")
      }
    }
  }
})

But I am getting this error:

error in query:
  fail to parse content of query: cannot parse date: "$flight_records.date_of_journey" [parsing time "$flight_records.date_of_journey" as "2006-01-02T15:04:05.999Z07:00": cannot parse "$flight_records.date_of_journey" as "2006", parsing time "$flight_records.date_of_journey" as "2006-01-02": cannot parse "$flight_records.date_of_journey" as "2006"]

I am not able to figure out what I am doing wrong. Any help is appreciated!

You have several issues.

The _id field in $group needs to be the value you want to group on, so change "date_of_journey" to "_id" and remove the "_id" you have.

You aren't supplying any accumulator in the $group to say what is computed for each group, so this will just output the unique dates. If you want to rearrange the data into one record per date with an array, use something like

recordsforday : { $push : "$flight_records" }

Be aware that you cannot output these records if they are >16MB, though they can be larger inside the pipeline.

The new Date issue is because you are using JavaScript syntax in an aggregation. MongoDB Server doesn't know anything about JavaScript (let's not argue about that just now), so to convert a string to a date you need MongoDB aggregation syntax: { $toDate : "$flight_records.date_of_journey" }

"$group" : { 
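  _id : { $toDate : "$flight_records.date_of_journey" },
  recsForDate : { $push : "$flight_records" },
  // real.arr and real.dep are epoch seconds, so this sum is in seconds; divide by 3600 for hours
  totalFlightHours : { $sum : { $subtract : [ "$flight_records.real.arr", "$flight_records.real.dep" ] } }
}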
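To illustrate the new Date point, here is a minimal sketch in plain JavaScript, assuming a JavaScript-based client such as mongosh or Node (the variable names are made up for illustration). The client evaluates new Date(...) while it is still building the pipeline object, so the constructor only ever sees the literal text of the field path. An aggregation expression, by contrast, is plain data that the server evaluates for each document.

```javascript
// new Date(...) runs on the client before anything is sent to the server,
// so it receives the literal string, not the value of the field.
const clientSide = new Date("$flight_records.date_of_journey");

// A date that could not be parsed has NaN as its timestamp.
const isInvalid = Number.isNaN(clientSide.getTime());

// The aggregation form is just a nested object; the server resolves the
// "$flight_records.date_of_journey" path per document when the stage runs.
const serverSide = { $year: { $toDate: "$flight_records.date_of_journey" } };

console.log(isInvalid, JSON.stringify(serverSide));
```

Other tools may reject the string earlier or with a different message, as in the error quoted in the question, but the cause is the same: the field path is never resolved on the client.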

Thank you so much, John, for the prompt response.

Actually, what I am trying to do is group the records of the flight_records array by year. For this particular sample, I am expecting the result of the aggregation to be three different documents, where each document only contains records from a particular year, i.e. 2021, 2022 and 2023.

The logic I am trying to implement is: unwind the flight records array, group the records by year, and then output three documents, each containing only the records for its year.

Based on your suggestion, I have modified my aggregation a little, and this is what I have used:

db.collection.aggregate([
  {
    "$unwind": "$flight_records"
  },
  {
    "$group": {
      _id: {
        $year: {
          $toDate: "$flight_records.date_of_journey"
        }
      },
      recordsForYear: {
        $push: "$flight_records"
      }
    }
  }
])

The result I am getting is shown below, and I think this should suffice. Is there a more efficient way to do this? I would love to learn more about it.

[
  {
    "_id": 2021,
    "recordsForYear": [
      {
        "date_of_journey": "2021-11-18",
        "destination": "DTW",
        "origin": "DCA",
        "real": {
          "arr": 1.668740944e+09,
          "dep": 1.668736446e+09
        },
        "scheduled": {
          "arr": 1.66874124e+09,
          "dep": 1.66873524e+09
        },
        "status": "Landed 22:09"
      },
      {
        "date_of_journey": "2021-11-16",
        "destination": "DCA",
        "origin": "DTW",
        "real": {
          "arr": 1.668644421e+09,
          "dep": 1.668640813e+09
        },
        "scheduled": {
          "arr": 1.66864524e+09,
          "dep": 1.6686396e+09
        },
        "status": "Landed 19:20"
      },
      {
        "date_of_journey": "2021-11-14",
        "destination": "DCA",
        "origin": "DTW",
        "real": {
          "arr": 1.668471887e+09,
          "dep": 1.66846811e+09
        },
        "scheduled": {
          "arr": 1.66847244e+09,
          "dep": 1.6684668e+09
        },
        "status": "Landed 19:24"
      }
    ]
  },
  {
    "_id": 2023,
    "recordsForYear": [
      {
        "date_of_journey": "2023-11-17",
        "destination": "DTW",
        "origin": "DCA",
        "real": {
          "arr": 1.668653594e+09,
          "dep": 1.668649558e+09
        },
        "scheduled": {
          "arr": 1.66865484e+09,
          "dep": 1.66864884e+09
        },
        "status": "Landed 21:53"
      },
      {
        "date_of_journey": "2023-11-15",
        "destination": "DCA",
        "origin": "DTW",
        "real": {
          "arr": 1.668557859e+09,
          "dep": 1.668553656e+09
        },
        "scheduled": {
          "arr": 1.66855884e+09,
          "dep": 1.6685532e+09
        },
        "status": "Landed 19:17"
      },
      {
        "date_of_journey": "2023-11-14",
        "destination": "DTW",
        "origin": "DCA",
        "real": {
          "arr": 1.668395275e+09,
          "dep": 1.668391115e+09
        },
        "scheduled": {
          "arr": 1.66839564e+09,
          "dep": 1.66838964e+09
        },
        "status": "Landed 22:07"
      }
    ]
  },
  {
    "_id": 2022,
    "recordsForYear": [
      {
        "date_of_journey": "2022-11-17",
        "destination": "DCA",
        "origin": "DTW",
        "real": {
          "arr": 1.668730907e+09,
          "dep": 1.668727074e+09
        },
        "scheduled": {
          "arr": 1.66873164e+09,
          "dep": 1.668726e+09
        },
        "status": "Landed 19:21"
      },
      {
        "date_of_journey": "2022-11-16",
        "destination": "DTW",
        "origin": "DCA",
        "real": {
          "arr": 1.668568424e+09,
          "dep": 1.668564614e+09
        },
        "scheduled": {
          "arr": 1.66856844e+09,
          "dep": 1.66856244e+09
        },
        "status": "Landed 22:13"
      },
      {
        "date_of_journey": "2022-11-15",
        "destination": "DTW",
        "origin": "DCA",
        "real": {
          "arr": 1.668481169e+09,
          "dep": 1.668477413e+09
        },
        "scheduled": {
          "arr": 1.66848204e+09,
          "dep": 1.66847604e+09
        },
        "status": "Landed 21:59"
      }
    ]
  }
]

Is there a more efficient way to do it? Maybe, but it depends on why you are doing it. If the requirement is just to create this output from an API, then there is no need to aggregate on the server: just find, sort by date, and aggregate in code (as it's sorted, this is trivial). That will be a lot more efficient in terms of memory and, as you are returning the same quantity of data overall, faster.
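As a rough sketch of what "aggregate in code" could look like, assuming the document has already been fetched with a plain find and its flight_records array is available in application code. The helper name groupRecordsByYear is hypothetical, and the sample below is a trimmed subset of the records from the question. The sort is done on the client here, which works because "YYYY-MM-DD" strings order chronologically when compared as text.

```javascript
// Hypothetical helper: groups flight records that are already sorted by
// date_of_journey into one entry per year, in a single pass.
function groupRecordsByYear(sortedRecords) {
  const groups = [];
  let current = null;
  for (const rec of sortedRecords) {
    // date_of_journey is "YYYY-MM-DD", so the first four characters are the year.
    const year = Number(rec.date_of_journey.slice(0, 4));
    // Because the input is sorted, a change of year always starts a new group.
    if (current === null || current._id !== year) {
      current = { _id: year, recordsForYear: [] };
      groups.push(current);
    }
    current.recordsForYear.push(rec);
  }
  return groups;
}

// Trimmed subset of the flight_records array from the question.
const flightRecords = [
  { status: "Landed 22:09", date_of_journey: "2021-11-18" },
  { status: "Landed 19:21", date_of_journey: "2022-11-17" },
  { status: "Landed 21:53", date_of_journey: "2023-11-17" },
  { status: "Landed 19:20", date_of_journey: "2021-11-16" },
];

// Sort a copy by the ISO date string, then group in one pass.
const sorted = [...flightRecords].sort((a, b) =>
  a.date_of_journey.localeCompare(b.date_of_journey)
);
const byYear = groupRecordsByYear(sorted);

console.log(JSON.stringify(byYear, null, 2));
```

The output has the same shape as the $group result above (_id as the year, recordsForYear as the array), so the rest of the application should not need to care which side did the grouping.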

Thanks, John, will try that out.