Multiple elemMatch and index

My data is very dynamic, with the ability to add custom attributes and events.

This is what my dataset looks like:

{
    "_id" : ObjectId("5eccf898ac7ff694845f1ccf"),
    "attributes" : [ 
        {
            "k" : "first_name",
            "v" : "John"
        }, 
        {
            "k" : "last_name",
            "v" : "Doe"
        }, 
        {
            "k" : "email",
            "v" : "john.doe@example.net"
        }, 
        {
            "k" : "gender",
            "v" : "Male"
        }
    ],
    "events" : {
        "event" : "add_to_cart",
        "event_data" : [ 
            {
                "k" : "product_name",
                "v" : "T-Shirt"
            }, 
            {
                "k" : "price",
                "v" : 25
            }, 
            {
                "k" : "variants",
                "v" : [ 
                    {
                        "k" : "color",
                        "v" : "red"
                    }, 
                    {
                        "k" : "size",
                        "v" : "xl"
                    }, 
                    {
                        "k" : "matherials",
                        "v" : [ 
                            [ 
                                {
                                    "k" : "name",
                                    "v" : "Cotton"
                                }
                            ], 
                            [ 
                                {
                                    "k" : "name",
                                    "v" : "Wool"
                                }
                            ]
                        ]
                    }
                ]
            }
        ]
    },
    "created_at" : "2020-05-25 16:12:58",
    "updated_at" : "2020-05-25 16:12:58"
}

I can definitely create the following index:

db.clients.createIndex({"events.event_data.k" : 1, "events.event_data.v" : 1 })

and it performs great. However, if you take a look at my dataset, you’ll notice that values can be deeply nested (e.g. materials).

This query works great:

db.clients.find({
    "events.event_data": {
        "$elemMatch": {
            "k": "product_name",
            "v": "T-Shirt"
        }
    }
})

However, when I need to query a deeper level, it scans the whole collection. Here is how I query for clients who bought a cotton T-Shirt:

db.clients.find({
    "events.event_data.v.v": 
        { 
            "$elemMatch": { 
                "$elemMatch" : {
                    "k": "name", "v": "Cotton"
                }
             } 
        }
})

However, in this case it does a COLLSCAN, which is obviously something I would like to avoid.
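
For reference, this is roughly how I checked the plan (a sketch; the exact explain output shape can vary between versions):

db.clients.find({
    "events.event_data.v.v": {
        "$elemMatch": {
            "$elemMatch": { "k": "name", "v": "Cotton" }
        }
    }
}).explain("executionStats")

// winningPlan reports a COLLSCAN stage here instead of an IXSCAN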

Thank you.

Hi @jellyx,

Great use of the attribute pattern! You have a few options here and I’ll briefly discuss my top two, both of which will require small tweaks to your schema. Unfortunately**, to prevent the collection scan you must have an index on the exact field path being queried (i.e. events.event_data.v.k), even if it is a multikey index.

(1) MongoDB 4.2 offers wildcard indexes. They de-emphasize the need for the attribute pattern by allowing you to index everything within a document (not recommended) or within a sub-object (very usable with your schema). Instead of having events.event_data hold objects containing k and v, you use the current key’s value as the actual key. For example: { k: "name", v: "wool" } becomes { name: "wool" }. The wildcard index will ensure the key and value are both properly indexed.
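
A rough sketch of option (1), assuming the restructured shape described above (the document and index below are illustrative only):

{
    "events" : {
        "event" : "add_to_cart",
        "event_data" : {
            "product_name" : "T-Shirt",
            "price" : 25,
            "variants" : { "color" : "red", "size" : "xl" }
        }
    }
}

// Wildcard index covering every path under events.event_data (MongoDB 4.2+)
db.clients.createIndex({ "events.event_data.$**" : 1 })

// Queries on any nested key can now use the index
db.clients.find({ "events.event_data.variants.color" : "red" })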

(2) Alternatively, I suggest flattening your attribute array so it’s always one-dimensional. Instead of letting v be an array, always use a scalar value. You can cleverly manipulate your key names to make this possible. For example: { k: "materials", v: [ { k: "name", v: "Cotton" } ] } becomes { k: "materials.name", v: "Cotton" }.
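
A rough sketch of option (2), reusing your existing compound index (the flattened key names are just an illustration):

// Flattened, one-dimensional attribute array
"event_data" : [
    { "k" : "product_name", "v" : "T-Shirt" },
    { "k" : "price", "v" : 25 },
    { "k" : "variants.color", "v" : "red" },
    { "k" : "variants.size", "v" : "xl" },
    { "k" : "variants.materials.name", "v" : "Cotton" }
]

// A single $elemMatch now reaches the deepest attribute and can use the
// existing { "events.event_data.k" : 1, "events.event_data.v" : 1 } index
db.clients.find({
    "events.event_data": {
        "$elemMatch": { "k": "variants.materials.name", "v": "Cotton" }
    }
})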

** You can use wildcard indexes to make this work with your current schema design, but I would recommend one of the previous two options instead.

Hopefully this will spur some ideas for a schema that avoids the dreaded collection scan!

Thanks,

Justin

Hi @Justin,

Thank you so much for your great response.

Yes, we’re using the attribute pattern, but also the outlier pattern, since there could be a lot of events for some clients (outliers, as the name suggests). :smile:

I read a bit about wildcard indexes, but I also came across the following:

You cannot shard a collection using a wildcard index. Create a non-wildcard index on the field or fields you want to shard on. For more information on shard key selection, see Shard Keys.

Since I’m pretty new to MongoDB, and haven’t researched sharding much, I just decided to go with the attribute pattern.

My question is (if you don’t mind): would a wildcard index cause any issues with sharding for me? How would I be able to horizontally scale the number of clients across multiple machines?

And yes, I executed the following:

db.clients.createIndex({"events.event_data.v.k" : 1})

which doesn’t seem to make any difference. I already have this one:

db.clients.createIndex({"events.event_data.k" : 1, "events.event_data.v" : 1 })

Not sure if I should have both of them…

Your idea of flattening the array is brilliant. If I don’t go with wildcard indexes, I’ll go with flattening the array.

Many thanks!

Great! I’m glad my suggestion provided some insight!

Quickly answering your question about wildcard indexes:

This is correct, you cannot shard a collection using a wildcard index. But whenever you index an array, you create a multikey index, which also cannot be used as a shard key. If you aren’t having issues with the attribute pattern now (which uses a multikey index), then you definitely won’t have problems using a wildcard index!

Thanks,

Justin

Thanks again! It has definitely provided some insight!

I feel a bit uncomfortable asking too many questions, but I still have some open questions:

a) Isn’t this a compound index:

{
    "events.event_data.k" : 1.0,
    "events.event_data.v" : 1.0
}

b) If it is a multikey index, and wildcard indexes don’t allow sharding either, then how am I supposed to shard my collection when that time comes? Maybe I’m worrying too early… I guess…

Many thanks,

Don’t worry about asking too many questions, we’re here to help! I may not be online much longer to answer but the community is great at covering a wide range of topics.

a) Yes, that is a compound index. It’s also a multikey index when events.event_data is an array. It can’t be used for sharding.

b) You’ll need to pick a different shard key since both multikey and wildcard indexes can’t be used as a shard key. This makes a lot of sense if you dive into it. A document may only live on one shard. What would happen if a shard key on an array contained values requiring a single document to reside on more than one shard? It’d be problematic, to say the least.
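
When that time comes you’d shard on scalar (or embedded document) fields backed by a regular index. A minimal sketch, assuming a database named shop and using a hashed _id purely as an illustration (your real key should match your query patterns):

sh.enableSharding("shop")

// _id is a scalar field, so a hashed index on it is a valid shard key
sh.shardCollection("shop.clients", { "_id" : "hashed" })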

Picking a shard key is a big topic, so some further reading on shard key selection may be helpful.

Thanks,

Justin
