What is the Mongo way of storing filters related to documents?

Morning all,

I’ve been struggling with an issue in a growing collection that happens to store filter-related data in a way that may not be the most optimal one: all possible filter combinations in a list (country, city, …; actually, all filters are dynamic, not only in value but in key, so you could introduce a new filter at any time).

This causes quite high cardinality, making the collection barely usable, since you can’t really retrieve more than two weeks of data.

My question here is: how do you react to this kind of explosion in MongoDB? What is the optimal way to store dynamic filters here? (I imagine they can’t be attributes of the document themselves? How bad is a list?)

Is splitting the data in a “collection per filter” approach something realistic to do? If everything is ATM within a my_collection, send everything to my_collection_by_country, my_collection_by_city, my_collection_by_nFilter.

Thanks in advance!

Little bump here, any tips on this?

When a post does not attract any reply for such a long time, it is usually a sign that no one really understands the issue, and we all sit and wait in the hope that someone will understand.

Post sample documents from your collections.

What do you mean by dynamic filters? To me, all filters are dynamic by nature. And by filters, do you mean queries?

Samples of filters will also be useful.

What do you mean by ATM with a my_collection?

My bad, hope this clears the waters a bit:

I’ve got a collection named “bookings”, which contains the different bookings performed by users:

{
    _id: ObjectId("..."),
    timestamp: 123456789,
    hotel_id: "123456789",
    city: "paris",
    country: "france",
    device: "web",
    filters: [
       "city=paris",
       "country=france",
       "device=web",
       "city=paris&country=france",
       "city=paris&device=web",
       "country=france&device=web",
       ...
    ],
    booking_ref: "ABCD"
}
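For reference, a filters array like the one above can be generated from the base fields by enumerating every non-empty combination of key=value pairs. A minimal sketch in plain JavaScript (the buildFilters helper is hypothetical, not the actual code):

```javascript
// Sketch: materialize every non-empty combination of sorted key=value pairs,
// which is why the array holds 2^n - 1 entries for n filterable fields.
function buildFilters(fields) {
  const pairs = Object.keys(fields)
    .sort()
    .map((k) => `${k}=${fields[k]}`);
  const combos = [];
  // Enumerate all non-empty subsets with a bitmask.
  for (let mask = 1; mask < 1 << pairs.length; mask++) {
    combos.push(pairs.filter((_, i) => mask & (1 << i)).join("&"));
  }
  return combos;
}

buildFilters({ city: "paris", country: "france", device: "web" });
// 7 entries: 3 singles, 3 pairs, 1 triple
```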

What I mean by dynamic filters is the following: here I’ve got city, country and device as the fields I want to filter on. The problem is that, when performing a find, I want to be able to search by any filter, in any combination.

AFAIK, for indexes to work properly the ordering of the fields must be strict. That is, if I added an index on country, city, device, I would be able to do find({country: "france"}), but not find({city: "paris"}) or find({city: "paris", device: "web"}), since those won’t use the index.

So what I did was put all possible filter combinations in that filters list: I convert the input filters into a &-separated string and look into the list to check whether it’s there, having filters as part of the index.

If I want to search for country = france, city = paris, I will convert that into country=france&city=paris.

And this can’t be optimal by any means; at least when filter cardinality explodes in size it’s not manageable, and query times could certainly be faster.
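The explosion is easy to quantify: storing every non-empty combination of n filterable fields means 2^n - 1 strings per document (and a matching number of index keys, since an index on an array field is multikey). A quick sketch:

```javascript
// Number of filter strings stored per document for n filterable fields:
// every non-empty subset, i.e. 2^n - 1.
function filterCount(n) {
  return 2 ** n - 1;
}

[3, 5, 10].map(filterCount);
// [7, 31, 1023]
```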

What would be the ideal way to search by any filter combination? And what is the proper way to store them in Mongo?

Also, with “ATM within a my_collection” I was thinking out loud: all documents for all filters are in the same collection at the moment. Would it be feasible to split those by collection, like bookings_by_country, bookings_by_city, and so on?

I think you are missing some basic knowledge about MongoDB documents, indexes and queries.

So you should take M001, M121 and M201 from https://university.mongodb.com/. You should also take one of the M220 flavors for the programming language of your choice to see how queries are used in a real application.

Any pointers to specific sections in those courses, or to what exactly is wrong in my approach?

None of the filter strings is JSON.

Data is duplicated (but the first representation is JSON at least).

And there are misconceptions about indexes:

Because the ordering of fields in strict equality queries is not important: {city: "paris", device: "web"} is the same query as {device: "web", city: "paris"}, and the query planner, as explained in the course, will reorder the fields to leverage the same appropriate index. And nothing stops you from putting your queried fields in a consistent order in your back end before sending the query to the server.
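That last suggestion can be sketched as a tiny back-end helper (normalizeQuery is a hypothetical name). It is not required for index selection, since the planner reorders equality predicates anyway, but it keeps query shapes consistent:

```javascript
// Sketch: put equality-query fields in a consistent (alphabetical) order
// before sending the query to the server.
function normalizeQuery(query) {
  return Object.fromEntries(
    Object.entries(query).sort(([a], [b]) => a.localeCompare(b))
  );
}

normalizeQuery({ device: "web", city: "paris" });
// { city: "paris", device: "web" }
```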

Thank you for such a fast reply!

Regarding this part:

Because ordering of fields in strict equality queries is not important; {city:paris,device:web} is the same query as {device:web,city:paris} and the query planner

What if I’m missing a starting column of the index? If I have country, city, device as the index and I’m only querying city and device, should the back end cover the missing country part with something such as {country: {$exists: true}}?

That’s the concept I mostly remember: you can omit columns on the right side but not on the left side, and in this case one could filter by any of country, city or device.
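That rule of thumb can be written down as a small predicate. This is a simplification (the planner can still use an index in other ways, e.g. scanning a prefix and filtering the rest in memory), but it captures the prefix rule from the docs:

```javascript
// Sketch of the index-prefix rule: an equality query is fully served by a
// compound index only if the queried fields form a prefix of the index
// definition. Field order within the query itself does not matter.
function usesIndexPrefix(indexFields, queriedFields) {
  const queried = new Set(queriedFields);
  let prefixLen = 0;
  while (prefixLen < indexFields.length && queried.has(indexFields[prefixLen])) {
    prefixLen++;
  }
  return prefixLen === queried.size;
}

usesIndexPrefix(["country", "city", "device"], ["city", "device"]);  // false
usesIndexPrefix(["country", "city", "device"], ["city", "country"]); // true
```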

I think @eddy_turbox does not necessarily lack basic knowledge, but is facing a real problem we face as well. Typically, research guides you to the Attribute Pattern, but it’s slow and gets even slower when adding more conditions.

I guess @eddy_turbox was driving the Attribute Pattern further by materializing it into an array of flat strings to reduce the number of conditions. I came up with the same idea yesterday, but I left out the concatenation of all permutations.
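For readers who haven’t seen it, the Attribute Pattern mentioned here reshapes each filterable field into a {k, v} entry of an array, so that a single compound multikey index on attributes.k / attributes.v can serve equality lookups on any field. A sketch of the transformation (names are illustrative):

```javascript
// Sketch: reshape flat fields into Attribute Pattern entries.
function toAttributes(fields) {
  return Object.entries(fields).map(([k, v]) => ({ k, v }));
}

toAttributes({ city: "paris", country: "france", device: "web" });
// [ { k: "city", v: "paris" },
//   { k: "country", v: "france" },
//   { k: "device", v: "web" } ]
```

Lookups then combine one $elemMatch condition per filter, e.g. { attributes: { $elemMatch: { k: "city", v: "paris" } } }, which matches the “slower when adding more conditions” observation above.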

A query would look like this:

{
  "filters": {
    "$all": ["city=paris", "country=france", "device=web"]
  }
  // other criteria backed by the index
}

I saw in the explain output that during the index scan only one criterion is matched (in @eddy_turbox’s example, e.g. city=paris); then documents are fetched and filtered in a filter stage using the other criteria.
This of course over-fetches many documents; in my example, 60k documents are fetched and then filtered down to e.g. 10k.

When I add pagination patterns (skip/limit, cursor-based), it gets even more complicated and sometimes counter-intuitive performance-wise.

In my limited experiments with only 1 million records, the collection scan with a partially covering index wins most of the time for low-cardinality filters with tens of thousands of results, but of course this would not be sustainable going forward.

@steevej, maybe this helps to view it from a slightly different angle. I would be glad if you had further insights, even if it’s just confirming what we see, or if there are further modeling tricks.

Adding redundant data, with the associated indexes, will increase your working set and thus negatively impact performance and needed resources (RAM and disk).

When you have documents like:

{
  city: "paris" ,
  country: "france" ,
  device: "web" ,
}

You have all the data required to perform the queries you want, because you simply have to run the query:

{ country : france , city : paris , device : web }

and with the appropriate indexes it will be much more performant than having the redundant string things. The following is an appropriate index:

{ city : 1 , country : 1 , device : 1 }

Note that I purposely use country first in the query and city first in the index to make the point that the field order of the query can and will be reordered to match the order of an index, and also to make the point that having city first in the index is likely to be more performant, because city names have higher cardinality and lower frequency than country or device.

Defining the appropriate indexes is the kind of knowledge you get when taking the courses.

But then there would be an issue if you want to search only by device, right? Or by device + country. As stated in the docs, if a query omits a particular index prefix, it is unable to make use of any index fields that follow that prefix.

And creating indexes for those fields alone hurts performance (at least from what I’ve tested in the past).

True

Creating an index to support your use case should not hurt performance; on the contrary, unless of course you are low on RAM. But adding your redundant field=blabla filters array will worsen your performance even more.

Maybe your tests were not appropriate.

Got you, thanks!

What worries me in the case of dynamic filters (let’s say tomorrow I add a couple more filters): having a list is transparent, new data will contain those automatically, and the same goes for accessing it, which is dynamic and won’t require a code change.

However, creating indexes every time a filter is removed/updated may be a challenge on big collections… It’s a complex problem for sure.

I suppose it’s worth trading performance for flexibility in these cases.

Take a look at

Yes, slower than direct key:value.

Yes, slower when adding more conditions.

But way faster and more efficient than the alternatives discussed here.

Always indeed!

This was an interesting thread for sure!

Thank you both @steevej and @Manuel_Reil1 for replying and discussing! :slight_smile:

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.