Hi,
I am trying to write an aggregation pipeline that aggregates values in an array. Our data model looks something like this:
{
  "_id": {
    "$oid": "64e89dd304c2da1a0022b0d6"
  },
  "size": 2, // number of documents in data
  "version": "abc_Unaggregated",
  "data": [
    {
      "request": {
        "a": "x",
        "b": "y",
        "c": "z"
      }
    },
    {
      "request": {
        "a": "x",
        "b": "y",
        "c": "z"
      }
    }
  ]
}
The size of the data array is limited to 100 entries per document. We have about 1.5 billion of those entries, of which only about 10% are actually distinct. The idea is therefore to collapse duplicate requests into a single entry: in the example above, instead of the 2 requests only one would be left.
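Just for illustration, after the aggregation the example document from above should end up looking roughly like this (the _id is omitted and size would have to be recalculated):
{
  "size": 1,
  "version": "abc",
  "data": [
    {
      "request": {
        "a": "x",
        "b": "y",
        "c": "z"
      }
    }
  ]
}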
The general approach I took is an aggregation pipeline with the following steps:
- match documents for the correct version (“abc_Unaggregated”)
- unwind data
- group by {data.request.a , data.request.b, data.request.c, version}
- change version to the aggregated name (“abc”), i.e. strip the “_Unaggregated” suffix
- projection to get the right data structure
- group with push to new data array
[
  {
    // keep only the documents that have not been aggregated yet
    $match: {
      version: "abc_Unaggregated",
    },
  },
  {
    // one document per entry of the data array
    $unwind: {
      path: "$data",
    },
  },
  {
    // deduplicate: one group per distinct request (per version)
    $group: {
      _id: {
        a: "$data.request.a",
        b: "$data.request.b",
        c: "$data.request.c",
        version: "$version",
      },
    },
  },
  {
    // strip the "_Unaggregated" suffix: "abc_Unaggregated" -> "abc"
    $set: {
      "_id.version": {
        $replaceOne: {
          input: "$_id.version",
          find: "_Unaggregated",
          replacement: "",
        },
      },
    },
  },
  {
    // restore the original request structure under a placeholder field
    $project: {
      "placeholder.request": {
        a: "$_id.a",
        b: "$_id.b",
        c: "$_id.c",
      },
      version: "$_id.version",
    },
  },
  {
    // push all deduplicated requests into a new data array
    $group: {
      _id: {
        // version is needed to set the new package
        version: "$version",
      },
      data: {
        $push: "$placeholder",
      },
    },
  },
]
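With this pipeline the final $group produces a single document per version, roughly like this:
{
  "_id": { "version": "abc" },
  "data": [
    { "request": { "a": "x", "b": "y", "c": "z" } }
    // ... every other deduplicated request ...
  ]
}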
The problem is that the resulting data array would be 1.5 million or more entries large, which is far too much for a single document. Is there a way to distribute the $push across multiple documents, so that the end result is multiple documents, each containing a data array with 100 entries?
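To make the desired end result more concrete, the output should instead be many documents shaped roughly like this (version and size set accordingly), each holding up to 100 of the deduplicated requests:
{
  "version": "abc",
  "size": 100,
  "data": [
    { "request": { "a": "...", "b": "...", "c": "..." } }
    // ... up to 100 deduplicated requests per document ...
  ]
}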
I tried $slice, $limit and $bucket, but none of them seem to be applicable to my problem:
- $limit only cuts documents away instead of distributing them.
- $slice seems to do the same: I can reduce the number of elements, but not distribute them.
- With $bucket I would have to provide boundaries that miraculously produce buckets of exactly 100 entries, which would require variable boundaries, so it is not really applicable either (see the sketch below).
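For reference, this is roughly the shape a $bucket stage would take; the groupBy field and the boundaries here are made-up placeholders. Because the boundaries have to be fixed values known in advance, there is no way to guarantee that every bucket ends up with exactly 100 elements:
{
  $bucket: {
    groupBy: "$_id.a",                 // made-up grouping field
    boundaries: ["a", "f", "k", "z"],  // made-up fixed boundaries
    default: "zzz_other",              // catch-all bucket for values outside the boundaries
    output: {
      data: { $push: "$placeholder" },
    },
  },
}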
There are some more steps, e.g. for merging back into the collection, but they are not relevant to my question.
We are using MongoDB Atlas, version 5.0.19.