Handling Several Booleans in a Collection in MongoDB

Jason_Tulloch · September 13, 2021, 1:01pm

** I asked this question at StackOverflow the other day but am not getting any useful responses. Was hoping someone in this community may be better suited to help! **

Goal: Build a MongoDB schema (in terms of performance and scalability) that can account for several similar booleans. All booleans would be in the same collection, which will have other data as well.

Example (not real, just for discussion): App will have boolean values in collection based on cities someone has lived in over their lifetime. Currently, the app only has 5 booleans (isNYC, isChicago, isSF, isSeattle, and isAustin). As the app increases its users and markets, it will eventually be adding several more cities (let’s say this is capped at 100 different cities in the US). While users will be able to toggle between true and false, the expectation is that the boolean value will not change too frequently.

Ideas Explored:

Every city boolean is its own object in the collection. As new cities are added, the collection is updated to include a new boolean for that city.
Attribute pattern with a key value pair for the city booleans. Key is the city and value is true/false. I can have all the cities as T/F or just keys for the cities someone has lived in to keep the size down.
Array of the cities the user has lived in (e.g. [“NYC”, “SF”, “Seattle”]). As new cities are available the user can add the new city to the array. ** I know this isn’t a boolean, but just wanted to be thorough.
Something else?!?!

Reactions to ideas Explored:

Booleans take up very little space in the collection which can enhance performance. This setup is very simple and straight forward on the backend. Initial concern is that on the frontend I may need an if statement to account for every city in case I want to display different information for each city. Once I have 100 cities I feel like my code will get a little messy / jumbled.
Attribute pattern is great for indexing. I can easily query for the specific city I want but this may be slower than just running a query on idea #1. I like how organized this pattern is but have not seen any use cases of booleans in the attribute pattern but also don’t see why I couldn’t use this approach.
Likely not the best option. Saving strings takes up more space and then I would have to map through the array and then add an if statement to set up the frontend properly.

Final Note: I am really hoping to find a solid design that allows for the user to select the cities they have lived in, allow the user to add data only specific to that city, and allow other users to find users by city. There may be other purposes as well so something flexible is really important.

Thanks so much!

MaBeuLux88_xxx · September 14, 2021, 1:53am

Hi @Jason_Tulloch and welcome in the MongoDB Community !

Stackoverflow link in case someone wants to read.

I didn’t understand the idea #1. Could you illustrate with an example document what this would look like?

#2 would look like this if I understood correctly:

{
  "name": "Bob",
  "london": true,
  "paris": false,
  "new-york": true
}

In my opinion, this is a terrible idea as indexing this isn’t efficient at all. With 150 cities in your app, this means that each document now has 150 fields and requires 150 indexes on this collection (at least) => BIG NO-NO. Firstly because the max is 64. Secondly because more than 20 will murder your performances as each update / delete / insert would need to update all the indexes, which is costly when multiplied by a large number of indexes.

#3 is actually the good idea if you ask me:

{
  "name": "Bob",
  "cities": ["NYC", "Paris", "London"]
}

This is easy to index (one field). If storage is really an issue, you could use a reference collection and link to it (with integers to save space) but I think it’s overkill. Especially because WiredTiger will compress the strings anyway automatically so the resulting total size of the collection shouldn’t be that bad.

Side note: If you considered wildcard indexes as a potential solution, I would say it’s a bad idea for the same reasons that #2 is a bad idea.

Related to the answer you got on SO from “Joe”: I strongly disagree. #2 isn’t viable… And it’s not an opinionated answer, it’s a factual one…

I have another solution for you if optimisation is REALLY the heart of the problem:

{
  "user": "Bob",
  "cities": "10110011011101111011"
}

Here, I’m storing the booleans directly in a binary string where NY is the first bit, London the second one, etc. Space is optimised. Index is easy. But querying for “users who went to London” is .

Which makes me think about another viable solution:

{
  "user": "Bob",
  "cities": [1,0,1,1,0,0,1,1,1,0,1,0,0,1]
}

Indexing is easy. Space is optimised. Querying is easy:

db.foo.find({"cities.3": 1})

BUT an index on cities won’t help this query as the index doesn’t know the position of the value in the array… So this means you are stuck with a COLLSCAN on this one with the schema. For this reason, I wouldn’t consider this as a viable option either. Also, it’s annoying to remove a country from the list. Easy to add one though (at the end).

I hope this helps a bit.
Cheers,
Maxime.

Jason_Tulloch · September 14, 2021, 4:01am

@MaBeuLux88_xxx thank you so much for the thoughtful response, this was exactly what I was hoping for in terms of guidance. Apologies that my ideas weren’t as clear as I thought. I have added examples for both #1 and #2 below.

#1 - Every city boolean is its own object in the collection. As new cities are added, the collection is updated to include a new boolean for that city.

const userSchema = mongoose.Schema(
   {
      ...
      isLondon: { type: Boolean },
      isParis: { type: Boolean },
      isNYC: { type: Boolean },
      ...
   }
)

const users = [
   {
      ...
      isLondon: true,
      isParis: false,
      isNYC: true,
      ...
   }
]

#2 - Attribute pattern with a key value pair for the city booleans. Key is the city and value is true/false. I can have all the cities as T/F or just keys for the cities someone has lived in to keep the size down.

const userSchema = mongoose.Schema(
   {
      ...
      cities: [
         {
            k: String,
            v: Boolean,
         }
      ]
      ...
   }
)

const users = [
   {
      ...
      cities: [
         { k: "isLondon", v: true },
         { k: "isParis", v: false },
         { k: "isNYC", v: true }
      ...
   }
]

Regardless of above, my understanding is that the third option I suggested, with an array of the cities may be the best option. My only concern would be that there could be some performance issues mapping through the array and then having an if statement to set up the front end properly. Like I mentioned before there will only about a maximum of 100 cities so the computational needs wouldn’t be that drastic.

Maybe I misread or am mistaken (and I know I was not too clear initially) but wouldn’t #2 only need 1 index for the cities array?

I think how you laid out the cities as a binary string is an interesting concept as well but since users will be able to read-write this data and there will be queries based on cities, I agree it is probably not a viable option.

Curious if the code I wrote out for #1 and #2 changed any of your thoughts, but otherwise thanks again for the detailed explanation it is super helpful.

MaBeuLux88_xxx · September 14, 2021, 5:53am

Hi again,

Ok so in my previous post, my #2 was actually your #1 here (if that makes any sense) !

So now, with your numbers, here is my opinion - which is just an opinion, maybe it’s not perfect - but here it is.

#1 is terrible. Impossible to index to say the least. Also Mongoose is kinda an anti-pattern as it forces a schema over a schemaless/schemafree database . Again - just an opinion - but I find these tools (mongoose, Spring Data MongoDB, …) a bit counter productive. They bring more complexity than they save time & I honestly prefer to stick to the plain JS, Java or Python driver without an extra “proxy”.

#2 indexes would depend on how you are running the queries but I don’t see the “pros” of this schema design. You would need to populate all the cities in the array while in the #3 solution you just store the cities that people visited - which is - I guess - a smaller portion of the actual list for each people. In term of storage I think #3 is way easier to query (and index) and saves more storage.

Regarding this:

Your front-end implementation shouldn’t be THAT coupled to the back-end implementation. The computation should be done in the back-end and hide the actual implementation. The REST or GraphQL interface that the front-end consumes doesn’t have to be impacted even if you choose #1 over #3… But to serve that “convenient” API to the front-end, you would, indeed, require more or less computation & simple or hard queries in the back-end to build the answer for the front-end.
If it’s easier for the front-end to build the UI with

{
  "name": "Bob",
  "London": true,
  "Paris": false,
  ...
}

Then you should return this in your API. But that doesn’t prevent you from storing the data in your collection in an array of strings and compute the transition in your back-end.

By the way, #3 requires to store somewhere the list of all the cities (metadata doc) and this also suggests that this list needs a caching system so you don’t read this metadata doc every 3µs when it’s actually only updated every 6 weeks. I’m going to invent a new proverb just for you!

1 query is better than 2!

I’m glad this was somewhat useful to you !
Cheers,
Maxime.

Jason_Tulloch · September 14, 2021, 11:39am

Agreed about #1 after spending some more time thinking about it. I also appreciate your comment about mongoose, but for the sake of focusing on one ‘issue’ at a time, I will ignore that for now and do some more research on my end. Needless to say, the extra tidbit of advice is helpful.

For #2, I just want to make sure I am not missing something here. My understanding is that the Attribute Pattern is a great way to organize collections. I know that this isn’t the typical use case of the Attribute Pattern but still feel that it could work? Per the MongoDB docs, why couldn’t I just have the following index

{ "cities.k": 1, "cities.v": 1 }

I could also only add in cities that someone has lived in (essentially making all the cities.v = true and then just remove them if the user changes the city to false.

Your thoughts on #3 continue to make sense and could be the most sensible approach. But, I am a little lost as to what you mean by

“#3 requires to store somewhere the list of all the cities (metadata doc0 and this also suggests that this list needs a caching system so you don’t read this metadata doc every 3µs when it’s actually only updated every 6 weeks.”

If you don’t mind explaining a little bit further, that would be great.

Thanks!

MaBeuLux88_xxx · September 14, 2021, 1:27pm

Your index is correct and probably the smarter one to have with the attribute pattern used like this. But if you choose to remove the cities which have v:false… Then it means you can also remove the booleans with v:true as they would never be false.

You would end up with

{
  "name": "Bob",
  "cities": [ {"k": "London"}, {"k" : "Paris"} ]
}

Which looks very close to #3 with a useless layer of complexity… So in the end, you end up with #3.

About my comment about the caching. Let’s take an exemple. Let’s imagine I have this in my collection.

[
  {
    "name": "Bob",
    "cities": ["London", "Paris"]
  },
  {
    "name": "Alice",
    "cities": ["Tokyo", "NY"]
  }
]

And let’s imagine that our front-end wants to print Alice’s page with her world map with red or green pins (visited or not) for each city.

To build the doc that my back-end will send to the front-end (through a REST or GraphQL API I presume), I need the list of all the cities managed by my app. This comes from the metadata collection if I want to make it dynamic. You could hardcode it in the code directly. But adding a new city would require a new version of the back-end at least.

> db.metadata.find({"_id": "cities"})
[
  {
    _id: 'cities',
    cities: [ 'Paris', 'London', 'Tokyo', 'NY', 'Madrid', 'Berlin' ]
  }
]

Answer I send to the front-end to build the UI:

{
  'name': 'Alice',
  'Paris' : false,
  'London': false,
  'Tokyo': true,
  'NY' : true,
  'Madrid': false,
  'Berlin' : false
}

In order to produce this in my back-end, I need to retrieve the list of countries & Alice’s document.

db.metadata.find({"_id": "cities"}) // index on _id by default so all good and only a single doc in this collection anyway...
db.persons.find({"name": "Alice"})  // need index on "name".

The second one is required. But the first query will be spammed for each user that want to build their page and will return the same list in 99.9% of the case - unless you added / removed a city from the list.

If you want to avoid a round trip to the database to fetch an information that only changes every 6 weeks, then you can cache the first query in your back-end and only send that query (== invalidate the cache) when you actually change the list or every hours by default.

Note that another way to answer the front-end request would be to compute directly this answer with an aggregation pipeline. This shouldn’t be too hard to do I guess.

Cheers,
Maxime.

steevej · September 14, 2021, 1:36pm

My sentiment exactly. In addition, I also forgo using POJO in a lot of situations for the simplicity of working directly with documents.

If we have to cast votes, I am going with

Since you already know that cities might be added, all other solutions discussed, except the one above, would require an update script that set to false all new cities.

Jason_Tulloch · September 14, 2021, 6:18pm

@MaBeuLux88_xxx excellent and very fair point about using the array of strings versus a complicated version of key-value pairs without any values. The way you laid this out makes a lot of sense and also helped me feel like I have a better understand of the concept and why I should use #3 overall so cheers!

Your example regarding saving a metadata collection of the cities makes a lot more sense as well, super helpful (am I broken record yet?). I’m not certain how to cache a query in my backend and only send it whenever the list changes, but I guess that’s just another topic I can dive into.

@steevej thanks for joining this thread and the added input! I honestly only learned how to use MongoDB with Mongoose (wish my instructor mentioned this wasn’t necessary) and initial research definitely supports that it could end up being a detriment as the application scales. And agreed that it is counter productive…

MaBeuLux88_xxx · September 14, 2021, 6:50pm

Instead of pulling every X hours from the back-end to refresh the cache without knowing if it’s useful or not, you could use a Change Stream. So it’s directly MongoDB that pushes the update to you when there is one. .

If you are looking for more trainings, we have a fresh batch on MongoDB University. Especially M220JS for you maybe?

Cheers,
Maxime.