Storing an array vs binary key value pairs in MongoDB

Big_Cat_Public_Safety_Act · November 14, 2022, 5:48am

{
  field_1: ["A", "B", "C"]
}

vs

{
  A: true,
  B: true,
  C: true,
}

{
  field_1: ["A", "B"]
}

vs

{
  A: true,
  B: true,
}

1: Which method requires more storage space.
2. Which method is better for query performance, assuming indexes.

steevej · November 15, 2022, 3:24am

Easy to test, with 3 collections with one document for each possible cases.

mongosh > db.big_cat_1.find()
{ _id: ObjectId("6372fea50b3dad06b2ca1c8e"), flags: [] }
{ _id: ObjectId("6372feaa0b3dad06b2ca1c8f"), flags: [ 'A' ] }
{ _id: ObjectId("6372feae0b3dad06b2ca1c90"), flags: [ 'B' ] }
{ _id: ObjectId("6372feb80b3dad06b2ca1c92"), flags: [ 'C' ] }
{ _id: ObjectId("6372fed30b3dad06b2ca1c93"),
  flags: [ 'A', 'B' ] }
{ _id: ObjectId("6372feda0b3dad06b2ca1c94"),
  flags: [ 'A', 'C' ] }
{ _id: ObjectId("6372fedf0b3dad06b2ca1c95"),
  flags: [ 'B', 'C' ] }
{ _id: ObjectId("6372feed0b3dad06b2ca1c96"),
  flags: [ 'A', 'B', 'C' ] }
mongosh > db.big_cat_2.find()
{ _id: ObjectId("6372ff6d0b3dad06b2ca1c97") }
{ _id: ObjectId("6372ff730b3dad06b2ca1c98"), A: true }
{ _id: ObjectId("6372ff820b3dad06b2ca1c99"), B: true }
{ _id: ObjectId("6372ff850b3dad06b2ca1c9a"), C: true }
{ _id: ObjectId("6372ff8a0b3dad06b2ca1c9b"), A: true, B: true }
{ _id: ObjectId("6372ff8d0b3dad06b2ca1c9c"), A: true, C: true }
{ _id: ObjectId("6372ff910b3dad06b2ca1c9d"), B: true, C: true }
{ _id: ObjectId("6372ff980b3dad06b2ca1c9e"),
  A: true,
  B: true,
  C: true }

/* same as big_cat_2 but with better field names compared to A , B , C */
mongosh > db.big_cat_3.find()
{ _id: ObjectId("6373011f0b3dad06b2ca1c9f") }
{ _id: ObjectId("637301240b3dad06b2ca1ca0"), flag_A: true }
{ _id: ObjectId("637301280b3dad06b2ca1ca1"), flag_B: true }
{ _id: ObjectId("6373012b0b3dad06b2ca1ca2"), flag_C: true }
{ _id: ObjectId("637301330b3dad06b2ca1ca3"),
  flag_A: true,
  flag_B: true }
{ _id: ObjectId("6373013b0b3dad06b2ca1ca4"),
  flag_A: true,
  flag_C: true }
{ _id: ObjectId("637301400b3dad06b2ca1ca5"),
  flag_B: true,
  flag_C: true }
{ _id: ObjectId("637301480b3dad06b2ca1ca6"),
  flag_A: true,
  flag_B: true,
  flag_C: true }

And now let see some stats:

mongosh > db.big_cat_1.stats().size
380
mongosh > db.big_cat_2.stats().size
224
mongosh > db.big_cat_3.stats().size
284

Even with good names the A:true version seems to be a winner compared to the array at this point.

But

With the array version you need 1 and only 1 index, that is { “flags” : 1 }. With the Boolean fields, it depends of your use cases, but you need at least 3 indexes {A:1},{B:1},{C:1} for the most basic queries where you only query 1 of the flags. If you to find({A:true,C:true}) often then you might need more indexes. So the array version would be the winner.

But other questions you must ask are:

Q3 - Which model provides easier updates? How do you update when you want to turn a flag on? How do you update to turn a flag off?

Q4 - Which model is easier to migrate? For example, how do you update the model when you need a new flag? How do you update the model when a flag is deprecated? What happen to the indexes?

Big_Cat_Public_Safety_Act · November 15, 2022, 4:16am

Regarding the index on the array, even though on the surface it’s only one index, under the hood mongodb creates a multi key index on each of the element of the array? This would be similar to the multiple indexes required for the flags?

steevej · November 15, 2022, 11:07pm

Yes it

But I am not too sure if it

I hope it has some optimization. For the small set of documents it looks like there is some:

mongosh > db.big_cat_2.stats().indexSizes
{ _id_: 36864, A_1: 20480, B_1: 20480, C_1: 20480 }
mongosh > db.big_cat_1.stats().indexSizes
{ _id_: 36864, flags_1: 20480 }

And each index is an open file, so less resources taken by the array.

I would probably go with the array because it is a more flexible model, easier to migrate. (See Q4)