Hi everyone,
My users are able to insert clients which have different attributes/properties. Basically, it’s totally unpredictable.
Additionally, users are able to create multiple workspaces so they have some clients in one workspace, some in another (in case they have multiple projects, for example).
My document looks like this:
{
"workspace_id" : 1
"attributes" : {
"first_name" : "John",
"last_name" : "Doe",
"email" : "john.doe@example.net",
"phone_number" : "+1234567890",
"gender" : "Male",
"location" : "London"
}
}
The first question how to structure indexes.
If I create:
- wildcard index
- single index to workspace_id
then it seems to be slow when I query something like this:
db.clients.find({ "workspace_id": 1, "attributes.first_name": "John", "attributes.gender": "Male" })
However, it’s fast to query something like this:
db.clients.find({ "attributes.first_name": "John", "attributes.gender": "Male" })
As you can see, it’s getting slow when we mix workspace_id and attributes. Note that wildcard index and compound index don’t go together.
Here is what could be an alternative:
{
"workspace_id" : 1,
"attributes" : [
{ k: "first_name", v: "John" }
{ k: "last_name", v: "Doe" }
{ k: "email", v: "john.doe@example.net" }
{ k: "phone", v: "+1234567890" }
{ k: "gender", v: "Male" }
{ k: "location", v: "London" }
]
}
My tests shows that this is much faster when I mix workspace_id and attributes (which is always) because I added compound index to workspace_id, k and v.
The FIRST question is: Did we choose the right index?
Next, I would like to know what should we take for sharding?
I mean, we’re not ready yet to do sharding (nor we need it), but since we’re designing a scheme, then it could be useful to consider it for the future.
The SECOND question is which sharding key should we choose for our data?
Next, we want to push different events to clients. We can create a seperate collection (events) or we can embed (which is risky for reaching out 16 MB).
We would like to go with events collection and then use lookup in clients. How about performance if we are talking about millions of events and millions of clients.
The THIRD question is what is your experience with lookup?
Thank you and sorry if there are so many questions asked.