Grouping and counting in mongodb

Gouri_Sankar · October 11, 2023, 4:47am

Hello,

So to explain i have a collection of feedback submission which contains data in the following structure.

{
  "_id": {
    "$oid": "6524ea72a85a4164cf29f849"
  },
  "form_id": {
    "$oid": "6523d74ecf7337be2640b59b"
  },
  "user_id": 94584,
  "resource_id": 31258,
  "resource_type": 1,
  "responses": [
    {
      "field_id": {
        "$oid": "6523d74ecf7337be2640b596"
      },
      "field_type": 2,
      "question": "What is your feedback of the session?",
      "options": [
        {
          "_id": {
            "$oid": "6523d74ecf7337be2640b597"
          },
          "label": "Very informative"
        },
        {
          "_id": {
            "$oid": "6523d74ecf7337be2640b598"
          },
          "label": "Content needs to improve"
        },
        {
          "_id": {
            "$oid": "6523d74ecf7337be2640b599"
          },
          "label": "Lecture was good"
        }
      ],
      "answer": {
        "$oid": "6523d74ecf7337be2640b597"
      }
    },
    {
      "field_id": {
        "$oid": "6523d74ecf7337be2640b59a"
      },
      "field_type": 3,
      "question": "How would you rate the session",
      "numbers": 5,
      "start_label": "Poor",
      "end_label": "Great",
      "answer": 5
    },
    {
      "field_id": {
        "$oid": "6524d92af40cb9f6de2966c8"
      },
      "field_type": 4,
      "question": "Long answer Type",
      "answer": "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum".
    }
  ],
  "createdAt": {
    "$date": "2023-10-10T06:08:50.333Z"
  },
  "updatedAt": {
    "$date": "2023-10-10T06:08:50.333Z"
  }
}

so this is a example of one submission and there can be possibly more than 3 million submissions for each
form_id .

For stats (like google forms ) lets say only for form fields like : Single choice or multiple choice or ratings and similar i need to bring the count for all the options .

and i have made this aggregate command

{
    $match:
      /**
       * query: The query in MQL.
       */
      {
        form_id: ObjectId(
          "6523d74ecf7337be2640b59b"
        ),
        "responses.field_type": {
          $in: [2, 3],
        },
      },
  },
  {
    $project:
      /**
       * specifications: The fields to
       *   include or exclude.
       */
      {
        "responses.field_id": 1,
        "responses.field_type": 1,
        "responses.answer": 1,
      },
  },
  {
    $unwind:
      /**
       * path: Path to the array field.
       * includeArrayIndex: Optional name for index.
       * preserveNullAndEmptyArrays: Optional
       *   toggle to unwind null and empty values.
       */
      {
        path: "$responses",
      },
  },
  {
    $match:
      /**
       * query: The query in MQL.
       */
      {
        "responses.field_type": {
          $in: [2, 3],
        },
      },
  },
  {
    $group:
      /**
       * _id: The id of the group.
       * fieldN: The first field name.
       */
      {
        _id: {
          field_id: "$responses.field_id",
          answer: "$responses.answer",
        },
        answer: {
          $first: "$responses.answer",
        },
        count: {
          $sum: 1,
        },
      },
  },
  {
    $group:
      /**
       * _id: The id of the group.
       * fieldN: The first field name.
       */
      {
        _id: "$_id.field_id",
        submissions: {
          $push: {
            k: {
              $toString: "$_id.answer",
            },
            v: "$count",
          },
        },
      },
  },
  {
    $project: {
      _id: 1,
      submissions: {
        $arrayToObject: "$submissions",
      },
    },
  },

while checking it is working for 60k records ( submissions) in 500 ms . I doubt this is a slow query and is there any better way to write this .

Indesxes are on
form_id , responses.field_id
The above is giving response like this which is fine

{
  "_id": {
    "$oid": "6523d74ecf7337be2640b596"
  },
  "submissions": {
    "6523d74ecf7337be2640b597": 20311,
    "6523d74ecf7337be2640b599": 19922,
    "6523d74ecf7337be2640b598": 19769
  }
},
{
  "_id": {
    "$oid": "6523d74ecf7337be2640b59a"
  },
  "submissions": {
    "1": 12068,
    "2": 12056,
    "3": 11919,
    "4": 12086,
    "5": 11873
  }
}

.

Also for fields like short answer and long answer lets say i need to collect only latest 20 response grouped by responses.field_id ( yet to make aggregate command) , ( if anyone can help )

steevej · October 11, 2023, 10:26pm

A few things.

1 - Since you do $in on responses.field_type in your first $match, your index should include responses.field_type between form_id and responses.field_id

2 - Since you do $project on responses.answer it could help to add it as the last field of the index.

3 - You could $project _id:0

4 - Rather that $unwind, then $match on field_type, you could $filter in the preceding $project. This way you would $unwind only the appropriate elements, then the $match would be unnecessary.

5 - In the first $group, using $first on responses.answer seems redundant since it is also part of the group _id.

Gouri_Sankar · October 18, 2023, 8:51am

Hey steevej ,

Thank you for helping here , have updated the query based on the above mentioned data ,
technically filter looks good rather than match but still i can see similar number

adding query and stats

"stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "namespace": "634eb9d68eb58a3508562fac_mongo_vla.feedback_submissions",
          "indexFilterSet": false,
          "parsedQuery": {
            "form_id": {
              "$eq": "6523d74ecf7337be2640b59b"
            }
          },
          "queryHash": "BEF482EC",
          "planCacheKey": "44FEDEFF",
          "maxIndexedOrSolutionsReached": false,
          "maxIndexedAndSolutionsReached": false,
          "maxScansToExplodeReached": false,
          "winningPlan": {
            "stage": "PROJECTION_DEFAULT",
            "transformBy": {
              "_id": true,
              "responses": {
                "$filter": {
                  "input": "$responses",
                  "as": "response",
                  "cond": {
                    "$in": [
                      "$$response.field_type",
                      { "$const": [2, 3] }
                    ]
                  }
                }
              }
            },
            "inputStage": {
              "stage": "FETCH",
              "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": { "form_id": 1 },
                "indexName": "form_id_1",
                "isMultiKey": false,
                "multiKeyPaths": {
                  "form_id": []
                },
                "isUnique": false,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": 2,
                "direction": "forward",
                "indexBounds": {
                  "form_id": [
                    "[ObjectId('6523d74ecf7337be2640b59b'), ObjectId('6523d74ecf7337be2640b59b')]"
                  ]
                }
              }
            }
          },
          "rejectedPlans": []
        },
        "executionStats": {
          "executionSuccess": true,
          "nReturned": 110007,
          "executionTimeMillis": 935,
          "totalKeysExamined": 110007,
          "totalDocsExamined": 110007,
          "executionStages": {
            "stage": "PROJECTION_DEFAULT",
            "nReturned": 110007,
            "executionTimeMillisEstimate": 370,
            "works": 110008,
            "advanced": 110007,
            "needTime": 0,
            "needYield": 0,
            "saveState": 157,
            "restoreState": 157,
            "isEOF": 1,
            "transformBy": {
              "_id": true,
              "responses": {
                "$filter": {
                  "input": "$responses",
                  "as": "response",
                  "cond": {
                    "$in": [
                      "$$response.field_type",
                      { "$const": [2, 3] }
                    ]
                  }
                }
              }
            },
            "inputStage": {
              "stage": "FETCH",
              "nReturned": 110007,
              "executionTimeMillisEstimate": 143,
              "works": 110008,
              "advanced": 110007,
              "needTime": 0,
              "needYield": 0,
              "saveState": 157,
              "restoreState": 157,
              "isEOF": 1,
              "docsExamined": 110007,
              "alreadyHasObj": 0,
              "inputStage": {
                "stage": "IXSCAN",
                "nReturned": 110007,
                "executionTimeMillisEstimate": 46,
                "works": 110008,
                "advanced": 110007,
                "needTime": 0,
                "needYield": 0,
                "saveState": 157,
                "restoreState": 157,
                "isEOF": 1,
                "keyPattern": { "form_id": 1 },
                "indexName": "form_id_1",
                "isMultiKey": false,
                "multiKeyPaths": {
                  "form_id": []
                },
                "isUnique": false,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": 2,
                "direction": "forward",
                "indexBounds": {
                  "form_id": [
                    "[ObjectId('6523d74ecf7337be2640b59b'), ObjectId('6523d74ecf7337be2640b59b')]"
                  ]
                },
                "keysExamined": 110007,
                "seeks": 1,
                "dupsTested": 0,
                "dupsDropped": 0
              }
            }
          },
          "allPlansExecution": []
        }
      },
      "nReturned": 110007,
      "executionTimeMillisEstimate": 655
    },
    {
      "$unwind": { "path": "$responses" },
      "nReturned": 220004,
      "executionTimeMillisEstimate": 768
    },
    {
      "$group": {
        "_id": {
          "field_id": "$responses.field_id",
          "answer": "$responses.answer"
        },
        "count": { "$sum": { "$const": 1 } }
      },
      "maxAccumulatorMemoryUsageBytes": {
        "count": 640
      },
      "totalOutputDataSizeBytes": 3920,
      "usedDisk": false,
      "spills": 0,
      "nReturned": 8,
      "executionTimeMillisEstimate": 935
    },
    {
      "$group": {
        "_id": "$_id.field_id",
        "submissions": {
          "$push": {
            "k": {
              "$convert": {
                "input": "$_id.answer",
                "to": { "$const": "string" }
              }
            },
            "v": "$count"
          }
        }
      },
      "maxAccumulatorMemoryUsageBytes": {
        "submissions": 2336
      },
      "totalOutputDataSizeBytes": 2778,
      "usedDisk": false,
      "spills": 0,
      "nReturned": 2,
      "executionTimeMillisEstimate": 935
    },
    {
      "$project": {
        "_id": true,
        "submissions": {
          "$arrayToObject": ["$submissions"]
        }
      },
      "nReturned": 2,
      "executionTimeMillisEstimate": 935
    }
  ],

query / Command

"command": {
    "aggregate": "feedback_submissions",
    "pipeline": [
      {
        "$match": {
          "form_id": "6523d74ecf7337be2640b59b"
        }
      },
      {
        "$project": {
          "responses": {
            "$filter": {
              "input": "$responses",
              "as": "response",
              "cond": {
                "$in": [
                  "$$response.field_type",
                  [2, 3]
                ]
              }
            }
          }
        }
      },
      { "$unwind": { "path": "$responses" } },
      {
        "$group": {
          "_id": {
            "field_id": "$responses.field_id",
            "answer": "$responses.answer"
          },
          "count": { "$sum": 1 }
        }
      },
      {
        "$group": {
          "_id": "$_id.field_id",
          "submissions": {
            "$push": {
              "k": { "$toString": "$_id.answer" },
              "v": "$count"
            }
          }
        }
      },
      {
        "$project": {
          "_id": 1,
          "submissions": {
            "$arrayToObject": "$submissions"
          }
        }
      }
    ],
    "cursor": {},
    "maxTimeMS": 60000,
    "$db": "mongo_vla"
  },

Also as this is a testing cluster its free and shared ( will it affect time , if yes how much).

As i am designing the collection we can change the collection architecture also if you have any idea which can make it faster

Gouri_Sankar · October 18, 2023, 12:37pm

Hello , Can you check below comment

steevej · October 18, 2023, 10:01pm

Removing

from the first $match stage was not something I mentioned? Why did you removed it?

Of course it will. It is free and shared. Shared means many other might connect on other free and shared clusters running on the same machine. So it is clear it will affect time and it is impossible to say how much.

Gouri_Sankar · October 22, 2023, 7:04am

So you mean I should not be removed from the first match right? If I keep it, it is 150ms slower but let’s say in the longer run obviously it will help as will pick only submissions having those field types which makes sense here .

steevej · October 22, 2023, 2:22pm

Yes, in principle, you should keep it. Otherwise, you fetch, $filter and $unwind documents that are not needed.

However, if all your top documents $match all the time, it is kind of useless and might be the reason why it is slower with it.

But doing performance analysis on a shared cluster is kind of useless since

I have notice on the explain plan that you did not

Also the index used is form_id_1 so I suspect that the following was also ignored

Gouri_Sankar · October 22, 2023, 3:00pm

Yes it makes no much sense to have a specific caluclation on shared cluster but thats what will se what can be done and will test with indepndent also.

Yes because in response i need that _id at the last one if you are talking about but if about the _id of submission , than making_id : 0 didn’t really changes much it was a change if 10ms , i had done but wasn’t there in copied explain plan.

I am unsure exactly what you are talking about but if adding index to response.answer i feel it will increase the index size.

steevej · October 22, 2023, 6:13pm

I am talking about the _id of the original documents, not the _id out of the $group stage. The _id of the original documents are not needed for the computation. So they can be projected out in the $project stage where you $filter.

Oh it will definitively make the index bigger. But, may be, just may be, the aggregation can be covered by the index, hence avoiding fetching the documents on disk.