Is there a performance penalty for querying in an array of subdocuments vs normalized schemas?

Schema for multitenant entities with shared/public vs. tenant-specific/private properties

Example Background

  • Dog entity with user-independent properties (e.g. “breed”)
  • User entity
  • Multiple users (<100) can own a dog, a person can own multiple dogs (>10 000)
  • Each owner has private properties in relation to a dog (e.g. “walkCount”)
  • An owner may only read their own dogs
  • Public attributes are updated from an external source

Question

If the most common queries are
a) get all dogs by ownerId
b) update dog by dogId+ownerId

Is a nested schema for dog with dog+owner relationship viable performance-wise or is it advisable to separate the collections (e.g. data duplication / joins)?

Alternative Schemas

Nested Schema

// Dog entity with owners
{
  _id: ObjectId("abc"),
  breed: "Pitbull",
  owners: [
    { ownerId: ObjectId("123"), walkCount: 5  },
    { ownerId: ObjectId("456"), walkCount: 2  },
  ]
}

db.dogs.index({ _id: 1, "owners.ownerId": 1 }, { unique: true })

// Sample query
db.dogs.find({ _id: ObjectId("abc"), "owners.ownerId": ObjectId("123") })

Duplicated Data

Public attributes (e.g. “breed”) need to be synced between documents.

// Dog entity (tenant-specific)
{
  dogId: "abc",
  breed: "Pitbull",
  ownerId: ObjectId("123")
  walkCount: 5,
},
{
  dogId: "abc",
  breed: "Pitbull",
  ownerId: ObjectId("456")
  walkCount: 2,
}

db.dogs.index({ dogId: 1, ownerId: 1 }, { unique: true })

// Sample query
db.dogs.find({ _id: ObjectId("abc"), ownerId: ObjectId("123") })

Hello @Kim_Kern ,

Welcome to The MongoDB Community Forums! :wave:

The design of your schema depends on your requirements and how comfortable are you to handle a complex system. As you have discussed some scenarios, I would recommend you to check the performance of your specific use-cases as per the requirements. You are right on the separate collection for both as well, I have seen many use-cases where people try to avoid duplication in their data. These are mostly for applications where you have a document that will never change, for example if we have an address collection and another collection of people, multiple people can stay at the same address so instead of writing the address every time some new documents of people are populated, they generally try to refer to the id of the address from another collection. Though it might increase the overhead as you need to use $lookup in your query but if you think that this will help you in some way in your use-case, you can design your application schema like this. Overall it totally depends on your requirements and your understanding of your application.

Below are some links that I think will be helpful for you with regards to your schema design. Please take a look at these.

One way to check the performance of your system is to check the output of your query with explain in executionStats mode (e.g. `db.collection.explain(‘executionStats’).aggregate(…)). Let’s check this with an example

This is the document and index I added as first example and the execution stats of query db.collection1.find({ _id: ObjectId("643ce9212c3e1d50dd8c493c"), "owners.ownerId": 123 }).explain("executionStats")

--> Document
{
  "_id": {
    "$oid": "643ce9212c3e1d50dd8c493c"
  },
  "breed": "Pitbull",
  "owners": [
    {
      "ownerId": 123,
      "walkCount": 5
    },
    {
      "ownerId": 456,
      "walkCount": 2
    }
  ]
}

-->Index

[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  {
    v: 2,
    key: { _id: 1, 'owners.ownerId': 1 },
    name: '_id_1_owners.ownerId_1',
    unique: true
  }
]

--> Output

    executionSuccess: true,
    nReturned: 1,
    executionTimeMillis: 0,
    totalKeysExamined: 1,
    totalDocsExamined: 1,

Below are the documents and index I added as second example and the execution stats of query db.collection2.find({ "dogId": ("abc"), "ownerId": (123) }).explain("executionStats")

--> Documents
{
  "_id": {
    "$oid": "643cebfd2c3e1d50dd8c493f"
  },
  "dogId": "abc",
  "breed": "Pitbull",
  "ownerId": 123,
  "walkCount": 5
},
{
  "_id": {
    "$oid": "643cec642c3e1d50dd8c4940"
  },
  "dogId": "abc",
  "breed": "Pitbull",
  "ownerId": 456,
  "walkCount": 2
}
 
-->Index

[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  {
    v: 2,
    key: { ownerId: 1, dogId: 1 },
    name: 'ownerId_1_dogId_1',
    unique: true
  }
]

--> Output

    executionSuccess: true,
    nReturned: 1,
    executionTimeMillis: 0,
    totalKeysExamined: 1,
    totalDocsExamined: 1,

As you can see in both the cases the execution time is pretty quick as our query was using the respective index and hence it was able to provide results without any overhead. We can also deduce that on small databases, we can have any type of schema and it will hardly matter, it will matter when your database will grow in size, whether your application will be write intensive/read intensive, resources available(RAM, Memory etc…), Indexes used (Efficient indexes improves the performance significantly) and others.

Note: This is just an example and users should test and decide their application schemas and performance based on their own requirements/testing. For professional help, you can Get in touch with our team and they could help you with any related questions.

Additionally, in case you put ObjectId as ObjectId(“abc”) or ObjectId(“123”), you will receive below error

BSONError: Argument passed in must be a string of 12 bytes or a string of 24 hex characters or an integer.
For more information regarding this, please refer ObjectId documentation

To learn more about performance, please check below resource

Regards,
Tarun

2 Likes