Aggregate query very slow for 2 million documents, also causing 503 server error

Ralph_van_der_Sanden · October 4, 2021, 1:09pm

As a successor to this topic: Group by on a string doesn't work, the aggregate query is very slow and not returning results anymore (no results is likely because of a timeout of 15 seconds that I added).

The query matches almost 2 million documents.

This is my document structure:

account_id (integer)
survey_id (integer)
answers (array)
0 (object)
answer
q
qid

1 (object)
answer
q
qid
etc.

This is the query:

        $cursor = $collection->aggregate([
                                    ['$match' => ['account_id' => intval($values['account_id']), 'survey_id' => 
        intval($values['survey_id'])]],
                                    ['$project' => ['answers.q' => 1, 'answers.qid' => 1]],
                                    ['$unwind' => '$answers'],
                                    ['$group' => ['_id' => '$answers.qid']]
                                ]);

The goal of this query is to get all unique questions of the documents grouped by question id (qid).
Is there anyway this query can be faster? account_id, survey_id and qid have an index.

MaBeuLux88_xxx · October 4, 2021, 6:07pm

Hi @Ralph_van_der_Sanden,

Can you please submit an example document in a Markdown code block?

{
   "something": "like this",
   "would be": "awesome"
}

Thanks,
Maxime.

Ralph_van_der_Sanden · October 7, 2021, 7:08am

Something like this (this is a full document example)?

{
"_id": {
    "$oid": "61223d3fcf43de31200c70e1"
},
"survey_id": 205244,
"account_id": 1005,
"feedback_id": "fideaedcc9da80af7",
"date_reg": "2021-08-12",
"time_reg": "12:04:15",
"source": 2,
"trigg": 0,
"read": 0,
"ip": "",
"ent": {
    "person": "Steve",
    "date": "2021-04-22"
},
"topics": ["Jolijn", "dame", "nederlands", "tevreden", "klantenservice"],
"sentiment": 1,
"country": "DE",
"star": 1,
"ref": "https://...",
"page": "https://...",
"meta": {
    "brow": 2,
    "os": 1,
    "dev": 0,
    "utm_source": "marketing_camp_1"
},
"answers": [{
    "answer": "......",
    "original_answer": ".....",
    "type": 8,
    "qid": 2030,
    "datatype": 9,
    "so": 0,
    "q": "Heeft u nog feedback?"
}, {
    "answer": "...",
    "type": 12,
    "qid": 2033,
    "datatype": 5,
    "so": 1,
    "q": "E-mail"
}, {
    "answer": 2,
    "type": 13,
    "qid": 2035,
    "datatype": 9,
    "so": 2,
    "q": "Hoe zag de showroom er uit?"
}, {
    "answer": 4,
    "type": 4,
    "qid": 2034,
    "datatype": 9,
    "so": 3,
    "q": "Hoe groot is de kans dat u ons aanbeveeelt bij vrienden en/of familie?"
}, {
    "answer": "Slecht",
    "type": 1,
    "qid": 2036,
    "datatype": 9,
    "so": 4,
    "q": "Hoe was de bestelervaring?"
}, {
    "answer": ".....",
    "type": 0,
    "qid": 2039,
    "datatype": 2,
    "so": 5,
    "q": "Voornaam"
}],
"tags": [12, 3]
}

MaBeuLux88_xxx · October 7, 2021, 3:24pm

Hi @Ralph_van_der_Sanden,

Can you try to create the index {account_id: 1, survey_id: 1} and see what performance you get with the query:

db.coll.distinct("answers.qid", {account_id: 1005, survey_id: 205244})

Looks like we can’t get a covered query by adding answers.qid because answers is an array and makes the index multikey.

Cheers,
Maxime.

Ralph_van_der_Sanden · November 14, 2021, 3:22pm

Hi, sorry for the big delay. This is the PHP converted query:

$manager = new MongoDB\Driver\Manager("mongodb://.....");

$query = ['account_id' => 1005, 'survey_id' => 205244]; 

$cmd = new MongoDB\Driver\Command([
'distinct' => 'tenants',
'key' => 'answers.qid',
'query' => $query
]);

$cursor = $manager->executeCommand('sj_tenants_eu1', $cmd);

// to get distinct values as array
$instances = current($cursor->toArray())->values;

var_dump($instances);

This gives an empty array: array(0) { }

Ralph_van_der_Sanden · November 14, 2021, 3:33pm

Fixed it, wrong db name.

Still took around 14 seconds.
It is only returning the question id, I also need the answers.q value to be returned next to the qid. Any ideas how to do that?

This is what I get back:
stdClass Object ( [values] => Array ( [0] => 2030 [1] => 2033 [2] => 2034 [3] => 2035 [4] => 2036 [5] => 2039 ) [ok] => 1 )

MaBeuLux88_xxx · November 15, 2021, 5:07am

Hi @Ralph_van_der_Sanden,

Ralph_van_der_Sanden:

$cursor = $collection->aggregate([
[’$match’ => [‘account_id’ => intval($values[‘account_id’]), ‘survey_id’ =>
intval($values[‘survey_id’])]],
[’$project’ => [‘answers.q’ => 1, ‘answers.qid’ => 1]],
[’$unwind’ => ‘$answers’],
[’$group’ => [’_id’ => ‘$answers.qid’]]
]);`

This query was the equivalent of the distinct operation that I proposed you earlier. The last step is only returning the $answers.qid, just like the distinct operation I gave you.
I guess what you need then is an extra questionsArray: {$push: "$answers.q"}.

Also, drop the $project stage because these optimizations are already done automatically and you might just prevent them by doing this.

Please share an explain plan with exec stats and the actual pipeline so I can see if there is anything I can do to optimize. Make sure it’s using an index that fits in RAM and that you have enough RAM to support this query.

Cheers,
Maxime.

Ralph_van_der_Sanden · November 15, 2021, 6:37am

Hi Maxime,
I’m not using that, I use the distinct as you suggested.
But that distinct is only returning the qid.

Will try if questionsArray: {$push: “$answers.q”} works. Thanks!