Multiple elemMatch and index

My data is very dynamic, with the ability to add custom attributes and events.

This is what my dataset looks like:

{
    "_id" : ObjectId("5eccf898ac7ff694845f1ccf"),
    "attributes" : [ 
        {
            "k" : "first_name",
            "v" : "John"
        }, 
        {
            "k" : "last_name",
            "v" : "Doe"
        }, 
        {
            "k" : "email",
            "v" : "john.doe@example.net"
        }, 
        {
            "k" : "gender",
            "v" : "Male"
        }
    ],
    "events" : {
        "event" : "add_to_cart",
        "event_data" : [ 
            {
                "k" : "product_name",
                "v" : "T-Shirt"
            }, 
            {
                "k" : "price",
                "v" : 25
            }, 
            {
                "k" : "variants",
                "v" : [ 
                    {
                        "k" : "color",
                        "v" : "red"
                    }, 
                    {
                        "k" : "size",
                        "v" : "xl"
                    }, 
                    {
                        "k" : "matherials",
                        "v" : [ 
                            [ 
                                {
                                    "k" : "name",
                                    "v" : "Cotton"
                                }
                            ], 
                            [ 
                                {
                                    "k" : "name",
                                    "v" : "Wool"
                                }
                            ]
                        ]
                    }
                ]
            }
        ]
    },
    "created_at" : "2020-05-25 16:12:58",
    "updated_at" : "2020-05-25 16:12:58"
}

I can definitely create the following index:

db.clients.createIndex({"events.event_data.k" : 1, "events.event_data.v" : 1 })

and it performs great. However, if you take a look at my dataset, you’ll notice that values can be deeply nested (e.g. materials).

This query works great:

db.clients.find({
    "events.event_data": {
        "$elemMatch": {
            "k": "product_name",
            "v": "T-Shirt"
        }
    }
})

However, when I need to query a deeper level, it scans the whole collection. Here is how I query for clients who bought a cotton T-Shirt:

db.clients.find({
    "events.event_data.v.v": 
        { 
            "$elemMatch": { 
                "$elemMatch" : {
                    "k": "name", "v": "Cotton"
                }
             } 
        }
})

However, in this case it does a COLLSCAN, which is obviously something I would like to avoid.
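
For reference, this is roughly how I checked the plan (a sketch; the exact explain output shape can vary between versions):

db.clients.find({
    "events.event_data.v.v": {
        "$elemMatch": {
            "$elemMatch": { "k": "name", "v": "Cotton" }
        }
    }
}).explain("executionStats")

// winningPlan reports a COLLSCAN stage here instead of an IXSCAN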

Thank you.

Hi @jellyx,

Great use of the attribute pattern! You have a few options here and I’ll briefly discuss my top two, both of which will require small tweaks to your schema. Unfortunately**, to prevent the collection scan you must have an index on the exact field path being queried (i.e. events.event_data.v.k), even if it is a multikey index.

(1) MongoDB 4.2 offers wildcard indexes. They de-emphasize the need for the attribute pattern by allowing you to index everything within a document (not recommended) or within a sub-object (very usable with your schema). Instead of having events.event_data hold objects containing k and v, you use the current key’s value as the actual key. For example: { k: "name", v: "wool" } becomes { name: "wool" }. The wildcard index will ensure the key and value are both properly indexed.
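
A rough sketch of option (1), assuming the restructured shape described above (the document and index below are illustrative only):

{
    "events" : {
        "event" : "add_to_cart",
        "event_data" : {
            "product_name" : "T-Shirt",
            "price" : 25,
            "variants" : { "color" : "red", "size" : "xl" }
        }
    }
}

// Wildcard index covering every path under events.event_data (MongoDB 4.2+)
db.clients.createIndex({ "events.event_data.$**" : 1 })

// Queries on any nested key can now use the index
db.clients.find({ "events.event_data.variants.color" : "red" })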

(2) Alternatively, I suggest flattening your attribute array so it’s always one-dimensional. Instead of letting v be an array, always use a scalar value. You can cleverly manipulate your key names to make this possible. For example: { k: "materials", v: [ { k: "name", v: "Cotton" } ] } becomes { k: "materials.name", v: "Cotton" }.
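
A rough sketch of option (2), reusing your existing compound index (the flattened key names are just an illustration):

// Flattened, one-dimensional attribute array
"event_data" : [
    { "k" : "product_name", "v" : "T-Shirt" },
    { "k" : "price", "v" : 25 },
    { "k" : "variants.color", "v" : "red" },
    { "k" : "variants.size", "v" : "xl" },
    { "k" : "variants.materials.name", "v" : "Cotton" }
]

// A single $elemMatch now reaches the deepest attribute and can use the
// existing { "events.event_data.k" : 1, "events.event_data.v" : 1 } index
db.clients.find({
    "events.event_data": {
        "$elemMatch": { "k": "variants.materials.name", "v": "Cotton" }
    }
})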

** You can use wildcard indexes to make this work with your current schema design, but I would recommend one of the previous two options instead.

Hopefully this will spur some ideas for a schema that avoids the dreaded collection scan!

Thanks,

Justin

Hi @Justin,

Thank you so much for your great response.

Yes, we’re using the attribute pattern, but also the outlier pattern, since there could be a lot of events for some clients (outliers, as the name suggests). :smile:

I read a bit about wildcard indexes, but I also came across the following:

You cannot shard a collection using a wildcard index. Create a non-wildcard index on the field or fields you want to shard on. For more information on shard key selection, see Shard Keys.

Since I’m pretty new to MongoDB, and haven’t researched sharding much, I just decided to go with the attribute pattern.

My question is (if you don’t mind): would a wildcard index cause any issues with sharding for me? How would I be able to horizontally scale the number of clients across multiple machines?

And yes, I executed the following:

db.clients.createIndex({"events.event_data.v.k" : 1})

which doesn’t seem to make any difference. I already have this one:

db.clients.createIndex({"events.event_data.k" : 1, "events.event_data.v" : 1 })

Not sure if I should have both of them…

Your idea of flattening the array is brilliant. If I don’t go with wildcard indexes, I’ll go with flattening the array.

Many thanks!

Great! I’m glad my suggestion provided some insight!

Quickly answering your question about wildcard indexes:

This is correct, you cannot shard a collection using a wildcard index. But whenever you index an array, you create a multikey index, which also cannot be used as a shard key. If you aren’t having issues with the attribute pattern now (which uses a multikey index), then you definitely won’t have problems using a wildcard index!

Thanks,

Justin

Thanks again! It has definitely provided some insight!

I feel a bit uncomfortable asking too many questions, but I still have some open questions:

a) Isn’t this a compound index:

{
    "events.event_data.k" : 1.0,
    "events.event_data.v" : 1.0
}

b) If it is a multikey index, and wildcard indexes don’t allow sharding either, then how am I supposed to shard my collection when that time comes? Maybe I’m worrying too early… I guess…

Many thanks,

Don’t worry about asking too many questions, we’re here to help! I may not be online much longer to answer but the community is great at covering a wide range of topics.

a) Yes, that is a compound index. It’s also a multikey index when events.event_data is an array. It can’t be used for sharding.

b) You’ll need to pick a different shard key since both multikey and wildcard indexes can’t be used as a shard key. This makes a lot of sense if you dive into it. A document may only live on one shard. What would happen if a shard key on an array contained values requiring a single document to reside on more than one shard? It’d be problematic, to say the least.
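
When that time comes you’d shard on scalar (or embedded document) fields backed by a regular index. A minimal sketch, assuming a database named shop and using a hashed _id purely as an illustration (your real key should match your query patterns):

sh.enableSharding("shop")

// _id is a scalar field, so a hashed index on it is a valid shard key
sh.shardCollection("shop.clients", { "_id" : "hashed" })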

Picking a shard key is a big topic, so some further reading on shard key selection may be helpful.

Thanks,

Justin
