Array of objects unwind, group and counting per document

Masuma_Bibi · March 3, 2023, 7:24am

The collection has the following structure:

[
    {
        "id": "173420",
        "dataset_name": "173420 - gene expression in treatment"
		"file_details": [
            {
                "size": 31983,
                "data_type": "Methylation"
            },
            {
                "size": 110193,
                "data_type": "Methylation"
            },
            {
                "size": 254763,
                "data_type": "Methylation"
            },
            {
                "size": 1632726,
                "data_type": "Clinical",
            }
        ]
	},
    {
        "id": "GSE88",
        "dataset_name": "GSE88 tumour recurrence prediction",
        "file_details": [
            {
                "size": 7402964,
                "data_type": "Expression"
            },
            {
                "size": 7368643,
                "data_type": "Expression"
            },
            {
                "size": 7540211,
                "data_type": "Clinical"

            },
            {
                "size": 7426688,
                "data_type": "Clinical"
            }
	]
	}
]

The requirement is to count the distinct file_details.data_type and count the number of documents that have the data_type in it.
I have used the following unwind-group-count query for it:

db.dataset.aggregate([ 
{ $unwind: "$file_details" },
{"$group" : {_id:"$file_details.data_type", count:{$sum:1}}},
{$sort : {count:-1}}
])

It returns the following result:
_id: “Methylation”
count: “3”
-id:“Expression”
count: “2”
_id: “Clinical”
count: “3”

But the expected result needs to count the occurrence of the data_type per document:
expected output:

_id: “Methylation”
count: “1”
-id:“Expression”
count: “1”
_id: “Clinical”
count: “2”

could you please help me improve the query or any new approch.
Thank you in advance.

steevej · March 4, 2023, 3:27pm

If I understand correctly you want to count Methylation once because it is within the same document.

If that is the case, you must eliminate duplicate within documents. A simple $project that uses $setUnion will do that for you:

{ "$project" : {
    unique_data_types : { $setUnion : [ "$file_details.data_type" ] }
} }

You then $unwind the new array unique_data_types and your $group.