Aggregate query very slow for 2 million documents, also causing 503 server error

As a successor to this topic: Group by on a string doesn't work, the aggregate query is very slow and not returning results anymore (no results is likely because of a timeout of 15 seconds that I added).

The query matches almost 2 million documents.

This is my document structure:

account_id (integer)
survey_id (integer)
answers (array)
0 (object)
answer
q
qid

1 (object)
answer
q
qid
etc.

This is the query:

        $cursor = $collection->aggregate([
                                    ['$match' => ['account_id' => intval($values['account_id']), 'survey_id' => 
        intval($values['survey_id'])]],
                                    ['$project' => ['answers.q' => 1, 'answers.qid' => 1]],
                                    ['$unwind' => '$answers'],
                                    ['$group' => ['_id' => '$answers.qid']]
                                ]);

The goal of this query is to get all unique questions of the documents grouped by question id (qid).
Is there anyway this query can be faster? account_id, survey_id and qid have an index.

Hi @Ralph_van_der_Sanden,

Can you please submit an example document in a Markdown code block?

{
   "something": "like this",
   "would be": "awesome"
}

Thanks,
Maxime.

Something like this (this is a full document example)?

{
"_id": {
    "$oid": "61223d3fcf43de31200c70e1"
},
"survey_id": 205244,
"account_id": 1005,
"feedback_id": "fideaedcc9da80af7",
"date_reg": "2021-08-12",
"time_reg": "12:04:15",
"source": 2,
"trigg": 0,
"read": 0,
"ip": "",
"ent": {
    "person": "Steve",
    "date": "2021-04-22"
},
"topics": ["Jolijn", "dame", "nederlands", "tevreden", "klantenservice"],
"sentiment": 1,
"country": "DE",
"star": 1,
"ref": "https://...",
"page": "https://...",
"meta": {
    "brow": 2,
    "os": 1,
    "dev": 0,
    "utm_source": "marketing_camp_1"
},
"answers": [{
    "answer": "......",
    "original_answer": ".....",
    "type": 8,
    "qid": 2030,
    "datatype": 9,
    "so": 0,
    "q": "Heeft u nog feedback?"
}, {
    "answer": "...",
    "type": 12,
    "qid": 2033,
    "datatype": 5,
    "so": 1,
    "q": "E-mail"
}, {
    "answer": 2,
    "type": 13,
    "qid": 2035,
    "datatype": 9,
    "so": 2,
    "q": "Hoe zag de showroom er uit?"
}, {
    "answer": 4,
    "type": 4,
    "qid": 2034,
    "datatype": 9,
    "so": 3,
    "q": "Hoe groot is de kans dat u ons aanbeveeelt bij vrienden en/of familie?"
}, {
    "answer": "Slecht",
    "type": 1,
    "qid": 2036,
    "datatype": 9,
    "so": 4,
    "q": "Hoe was de bestelervaring?"
}, {
    "answer": ".....",
    "type": 0,
    "qid": 2039,
    "datatype": 2,
    "so": 5,
    "q": "Voornaam"
}],
"tags": [12, 3]
}

Hi @Ralph_van_der_Sanden,

Can you try to create the index {account_id: 1, survey_id: 1} and see what performance you get with the query:

db.coll.distinct("answers.qid", {account_id: 1005, survey_id: 205244})

Looks like we can’t get a covered query by adding answers.qid because answers is an array and makes the index multikey.

Cheers,
Maxime.

Hi, sorry for the big delay. This is the PHP converted query:

$manager = new MongoDB\Driver\Manager("mongodb://.....");

$query = ['account_id' => 1005, 'survey_id' => 205244]; 

$cmd = new MongoDB\Driver\Command([
'distinct' => 'tenants',
'key' => 'answers.qid',
'query' => $query
]);

$cursor = $manager->executeCommand('sj_tenants_eu1', $cmd);

// to get distinct values as array
$instances = current($cursor->toArray())->values;

var_dump($instances);

This gives an empty array: array(0) { }

Fixed it, wrong db name.

Still took around 14 seconds.
It is only returning the question id, I also need the answers.q value to be returned next to the qid. Any ideas how to do that?

This is what I get back:
stdClass Object ( [values] => Array ( [0] => 2030 [1] => 2033 [2] => 2034 [3] => 2035 [4] => 2036 [5] => 2039 ) [ok] => 1 )

Hi @Ralph_van_der_Sanden,

This query was the equivalent of the distinct operation that I proposed you earlier. The last step is only returning the $answers.qid, just like the distinct operation I gave you.
I guess what you need then is an extra questionsArray: {$push: "$answers.q"}.

Also, drop the $project stage because these optimizations are already done automatically and you might just prevent them by doing this.

Please share an explain plan with exec stats and the actual pipeline so I can see if there is anything I can do to optimize. Make sure it’s using an index that fits in RAM and that you have enough RAM to support this query.

Cheers,
Maxime.

Hi Maxime,
I’m not using that, I use the distinct as you suggested.
But that distinct is only returning the qid.

Will try if questionsArray: {$push: “$answers.q”} works. Thanks!