Dynamic attributes, sharding and embedding vs separate collection

jellyx · March 9, 2020, 10:18pm

Hi everyone,

My users are able to insert clients which have different attributes/properties. Basically, it’s totally unpredictable.

Additionally, users are able to create multiple workspaces so they have some clients in one workspace, some in another (in case they have multiple projects, for example).

My document looks like this:

{
    "workspace_id" : 1,
    "attributes" : {
        "first_name" : "John",
        "last_name" : "Doe",
        "email" : "john.doe@example.net",
        "phone_number" : "+1234567890",
        "gender" : "Male",
        "location" : "London"
    }
}

The first question is how to structure the indexes.

If I create:

a wildcard index
a single-field index on workspace_id

then it seems to be slow when I query something like this:

db.clients.find({ "workspace_id": 1, "attributes.first_name": "John", "attributes.gender": "Male" })

However, it’s fast to query something like this:

db.clients.find({ "attributes.first_name": "John", "attributes.gender": "Male" })

As you can see, the query gets slow when we combine workspace_id and attributes. Note that a wildcard index cannot be part of a compound index.

Here is what could be an alternative:

{
    "workspace_id" : 1,
    "attributes" : [
        { k: "first_name", v: "John" },
        { k: "last_name", v: "Doe" },
        { k: "email", v: "john.doe@example.net" },
        { k: "phone", v: "+1234567890" },
        { k: "gender", v: "Male" },
        { k: "location", v: "London" }
    ]
}

My tests show that this is much faster when I combine workspace_id and attributes (which is always the case), because I added a compound index on workspace_id, attributes.k and attributes.v.
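A minimal sketch of how a document in the first shape could be converted to the k/v array shape, assuming plain JavaScript objects. The helper name toKeyValueAttributes is hypothetical and not part of any MongoDB API.

```javascript
// Hypothetical helper: turn { first_name: "John", ... } into
// [ { k: "first_name", v: "John" }, ... ] for the k/v array shape.
function toKeyValueAttributes(attributes) {
  return Object.entries(attributes).map(([k, v]) => ({ k, v }));
}

// Example document built from the object-shaped attributes.
const client = {
  workspace_id: 1,
  attributes: toKeyValueAttributes({ first_name: "John", gender: "Male" }),
};

console.log(JSON.stringify(client));
```

The resulting document can be inserted as-is; the order of the array elements follows the key order of the input object.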

The FIRST question is: Did we choose the right index?

Next, I would like to know what should we take for sharding?

I mean, we’re not ready to shard yet (nor do we need to), but since we’re designing the schema, it could be useful to consider it for the future.

The SECOND question is which sharding key should we choose for our data?

Next, we want to push different events to clients. We can create a separate collection (events) or we can embed them (which risks hitting the 16 MB document size limit).

We would like to go with events collection and then use lookup in clients. How about performance if we are talking about millions of events and millions of clients.

The THIRD question is what is your experience with lookup?

Thank you and sorry if there are so many questions asked.

Prasad_Saya · March 11, 2020, 4:10am

Some thoughts.

The FIRST question is: Did we choose the right index?

db.clients.find({ "workspace_id": 1, "attributes.first_name": "John", "attributes.gender": "Male" })

For the above query to use an index on both workspace_id and the attribute fields effectively, you need a compound index on workspace_id and those attribute fields.

The attribute fields can be indexed individually, or through the k and v fields of the attributes array (as per the Attribute Design Pattern).

It looks like the compound index on workspace_id, k and v is the right choice.
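As a rough sketch of what the equivalent query could look like against the k/v shape: each requested attribute becomes an $elemMatch clause so that k and v are matched within the same array element, and $all requires every clause to match. The helper name buildClientFilter is hypothetical; how much of the compound index the planner actually uses is worth confirming with explain().

```javascript
// Hypothetical helper: build a find() filter for the k/v attribute shape.
function buildClientFilter(workspaceId, wanted) {
  const clauses = Object.entries(wanted).map(([k, v]) => ({
    $elemMatch: { k, v }, // k and v must match in the same array element
  }));
  const filter = { workspace_id: workspaceId };
  // An empty $all matches nothing, so only add it when there are clauses.
  if (clauses.length > 0) {
    filter.attributes = { $all: clauses };
  }
  return filter;
}

const filter = buildClientFilter(1, { first_name: "John", gender: "Male" });
console.log(JSON.stringify(filter, null, 2));

// In the mongo shell, assuming the collection from the thread:
//   db.clients.createIndex({ workspace_id: 1, "attributes.k": 1, "attributes.v": 1 })
//   db.clients.find(filter)
```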

NOTE: Wildcard indexes have the following restrictions: (1) you cannot shard a collection using a wildcard index, and (2) you cannot create a compound wildcard index.

Next, I would like to know what should we take for sharding?

Sharding is about distributing data evenly across multiple servers (shards); it is horizontal scaling. Whether and how to shard depends on your application requirements: mainly the volume of data in the collection, how it is distributed, and the performance you need when accessing it. The queries that run most often and that must be fast are among the main considerations, and the shard key is an important factor in all of these.

We would like to go with events collection and then use lookup in clients. How about performance if we are talking about millions of events and millions of clients.

It looks like your data has clients with multiple events - a one-to-many relationship. How many events per client? What kind of queries happen with this data? These are the factors to make a decision about embedding vs referencing.

The post 6 Rules of Thumb for MongoDB Schema Design: Part 1 has useful discussion on “How do I model a one-to-N relationship?”.

Thank you and sorry if there are so many questions asked.

Indeed, quite a few questions, covering aspects of both design and development. Interesting.

jellyx · March 11, 2020, 3:46pm

First of all, thanks for the answer!

Well yes, I understand what sharding, nodes, clusters, etc. are. But I do wonder what I should set as a shard key, given my data structure.

I understand that data needs to be evenly distributed. It makes no sense to set “gender” as a sharding key, for example.
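To illustrate the point about even distribution, here is a small sketch that measures how a candidate key spreads a sample of documents: the number of distinct values and the share of documents held by the most common value. The helper keyDistribution and the sample data are made up for illustration; this is only a rough check, not a way to pick a shard key.

```javascript
// Hypothetical helper: summarize how a candidate key distributes documents.
function keyDistribution(docs, keyOf) {
  const counts = new Map();
  for (const doc of docs) {
    const key = String(keyOf(doc));
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const largest = Math.max(0, ...counts.values());
  return {
    distinct: counts.size, // number of different key values
    largestShare: docs.length ? largest / docs.length : 0, // share of the biggest group
  };
}

// Made-up sample in the object-shaped layout.
const sample = [
  { workspace_id: 1, attributes: { gender: "Male" } },
  { workspace_id: 2, attributes: { gender: "Male" } },
  { workspace_id: 3, attributes: { gender: "Female" } },
  { workspace_id: 4, attributes: { gender: "Male" } },
];

const byGender = keyDistribution(sample, (d) => d.attributes.gender);
const byWorkspace = keyDistribution(sample, (d) => d.workspace_id);
console.log(byGender, byWorkspace);
```

A key with few distinct values and a large biggest share, like gender here, cannot spread data across many shards.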

Since my data has dynamic attributes, I’m not sure which shard key I should set. That’s for later, but I’m still curious.

Yes, I read that post. It’s useful. However, I figured out:

Option 1:

Embed events into clients and then I have a limit of 16 MB which might be exceeded for some customers.

Option 2:

If I use separate collections, I can store unlimited events, but then I have a problem with $lookup because it doesn’t support a sharded “from” collection.

That means I need to do two queries (clients, events) and then I should intersect two results.
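A minimal sketch of that application-side intersection, assuming each event document carries a client_id field that references the client’s _id (the field name is an assumption, as is the helper name clientsWithEvents). The two arrays stand in for the results of the two separate queries.

```javascript
// Hypothetical helper: keep only the clients that appear in the event results.
function clientsWithEvents(clients, events) {
  // Compare ids as strings so ObjectId-like values and plain values both work.
  const idsWithEvents = new Set(events.map((e) => String(e.client_id)));
  return clients.filter((c) => idsWithEvents.has(String(c._id)));
}

// Stand-ins for the results of the clients query and the events query.
const matchedClients = [{ _id: "c1" }, { _id: "c2" }, { _id: "c3" }];
const matchedEvents = [
  { client_id: "c2", type: "login" },
  { client_id: "c3", type: "purchase" },
  { client_id: "c9", type: "login" },
];

const result = clientsWithEvents(matchedClients, matchedEvents);
console.log(result.map((c) => c._id));
```

Depending on result sizes, the second query could instead be restricted with $in on the ids returned by the first one, which avoids pulling unrelated events into the application.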

Note that I also have a problem with grouping events by client, because $group doesn’t use indexes and is slower.

Since there are a lot of computations, all of this could lead to Apache Spark. There is a MongoDB Connector for Apache Spark.

What do you think?

Cheers!