$group + $addToSet don't work with nested fields

Examples of documents:

{
    "_id": {
        "$oid": "6175f6e5f2363e1cc9aa5835"
    },
    "#CHROM": 1,
    "POS": 286747,
    "ID": "rs369556846",
    "REF": "A",
    "ALT": "G",
    "QUAL": ".",
    "FILTER": ".",
    "INFO": [{
            "RS": 369556846,
            "RSPOS": 286747,
            "dbSNPBuildID": 138,
            "SSR": 0,
            "SAO": 0,
            "VP": "0x050100000005150026000100",
            "WGT": 1,
            "VC": "SNV",
            "CAF": [{
                "$numberDecimal": "0.9381"
            }, {
                "$numberDecimal": "0.0619"
            }],
            "COMMON": 1,
            "TOPMED": [{
                "$numberDecimal": "0.88411856523955147"
            }, {
                "$numberDecimal": "0.11588143476044852"
            }]
        },
        ["SLO", "ASP", "VLD", "G5", "KGPhase3"]
    ]
}
{
    "_id": {
        "$oid": "6175f6e5f2363e1cc9aa583b"
    },
    "#CHROM": 1,
    "POS": 911220,
    "ID": "rs35331099",
    "REF": "CT",
    "ALT": "C",
    "QUAL": ".",
    "FILTER": ".",
    "INFO": [{
            "RS": 35331099,
            "RSPOS": 911221,
            "dbSNPBuildID": 130,
            "SSR": 0,
            "SAO": 0,
            "VP": "0x05010008000517013e000200",
            "GENEINFO": "LOC284600:284600",
            "WGT": 1,
            "VC": "DIV",
            "CAF": [{
                "$numberDecimal": "0.2492"
            }, {
                "$numberDecimal": "0.7508"
            }],
            "COMMON": 1,
            "TOPMED": [{
                "$numberDecimal": "0.21621750764525993"
            }, {
                "$numberDecimal": "0.78378249235474006"
            }]
        },
        ["RV", "SLO", "INT", "ASP", "VLD", "G5A", "G5", "GNO", "KGPhase1", "KGPhase3"]
    ]
}

Task:
get the values of the INFO.0.VC field in a unique form.

PyMongo query:
sep_vals = [doc for doc in src_coll_obj.aggregate([{'$group': {'_id': 'null', 'spl_field': {'$addToSet': '$INFO.0.VC'}}}])][0]['spl_field']

Expected result:
['SNV', 'DIV']

The result obtained:
empty two-dimensional array.

Question:
Is there an error in the query? Or is it a MongoDB bug?

You could replace $INFO.0.VC with $INFO.VC. You will get

[ [ 'SNV' ] , [ 'DIV' ] ]

which is almost what you want and as easy to use.

To get exactly what you want, you could use:

  • $arrayElemAt - to get element 0
  • $unwind - to get the only element as an object
  • $addToSet $INFO.VC

Personally, I prefer the simpler pipeline at the cost of a little bit of extra work at the application level in order to leave as much cycle as possible to the server. I think it scales better.

This is only proper syntax if you are matching first element, in aggregation expression this means field named "0" of subdocument INFO. If you want to fetch it the way you do, you have to use $arrayElemAt expression. In your case it would be:

{$addToSet: {$getField: {field: "VC", input: {$arrayElemAt: ["$INFO", 0]}}}}

$getField is new in version 5.0 - it’s possible to do this in earlier version but a little less readable. Simplest would be to have another $addField stage before $group that sets INFO0 to first element of $INFO and then $addToSet can use INFO0.VC.

Asya

1 Like

@steevej @Asya_Kamsky thanks for the answers!

The documentation describes the only way to access subdocuments in arrays. But it turns out that $addToSet violates it. Surprises like this complicate application development.

Personally, I prefer the simpler pipeline

Sometimes simplicity is about uniformity. Example. In my case, the application receives the field path from user, validates it, and uses it to retrieve all uniquified values of a given field. Validation is performed in all possible paths. Paths are gathered automatically strictly according to query rules. If referring to the first post, a valid path is INFO.0.VC. If the user types INFO.VC, the program will throw an exception. Of course it is possible to rewrite the validator allowing to specify paths without array indexes. But this is a complication, not a simplification.

The problem with all of this is that “Field.0” is ambiguous since MongoDB schema allows both:

{ "F": [ 1, 2, 3] }   /* "F.0" is the first element of array "F" */
{ "F": { "0": 5 } }   /* "F.0" is the field named "0" in subdocument "F" */

This makes for complexities when parsing some queries. What would “F.0.0” mean if “F” was an array of subdocuments and one of them had field named “0”? Anyway, these are internal issues that the user shouldn’t worry about, but unfortunately early choices of syntax means sometimes your code isn’t as simple as it could be. We do have a query team that’s thinking forward about how we might be able to simplify the language and make it more consistent (but without breaking any of the existing applications).

Asya
P.S. the page you linked to talks about query expressions - aggregation expressions are different and have different ways to access parts of document.

It seems to me that naming fields as numbers is extremely irrational (are there any examples that refute this?). So in a similar conflict, 0 as an array index should have priority.

I don’t think the question is whether it’s rational - it’s allowed and has been allowed since the beginning of time so we cannot just change the behavior of the database without it breaking applications which have been relying on this.

And frankly I don’t think it’s irrational if you think about keys in subdocuments being named after account numbers or such. It won’t likely be “0” it might be “148235” but that’s still a number.

Asya

You could replace $INFO.0.VC with $INFO.VC .

By the way, this does not work for documents nested in a two-dimensional array. Toy example:

{
    "_id": {
        "$oid": "61fd0980b477c135a9fc6284"
    },
    "q": 1,
    "w": {},
    "e": {
        "a": 10,
        "b": 11,
        "c": {
            "d": -11,
            "e": -10,
            "f": -9
        }
    },
    "r": ".",
    "t": [],
    "y": [100, 101, {
            "l": 0.01,
            "m": 0.02,
            "n": 0.03
        },
        [{
            "h": 0,
            "z": 0
        }, {
            "m": 0,
            "z": 0
        }]
    ]
}

The subdocument is nested in a one-dimensional array. Query without specifying indexes works.
{'_id': 'null', 'out': {'$addToSet': '$y.l'}}

_id: "null"
out: Array
    0: Array
        0: 0.01

The subdocument is nested in a two-dimensional array. A query without specifying the subdocument path does not output a result.
{'_id': 'null', 'out': {'$addToSet': '$y.h'}}

_id: "null"
out: Array
    0: Array

Perhaps this is yet another argument for the fact that explicitly specifying field path is always better than hidden one.

In general, $addToSet expects a scalar so if the fields you want to add to the set are in an array you would have to unwind the array first.

I’d say that it’s an accurate generalization that the more complex your schema/documents can be, the more complicated your queries/pipelines will end up being.

Asya

I propose to implement in future versions at least one of the solutions for this conflict:

Interpret numbers as field names:
’f1.`0`.f2.`0`'
{’$str’: ‘$f1.0.f2.0’}

Interpret numbers as array element indexes:
’f1.0.f2.0’
{’$idx’: ‘$f1.0.f2.0’}