Heterogeneous collection and indexing difficulties

I have a huge heterogeneous collection (i.e., different subclasses of data in the same collection; think e-commerce catalog data, where all products are in one collection with hugely disjoint attribute sets but some common attributes inherited from a parent type).

There can be a few thousand such types (subclasses), each with about 50-60 attributes, and they are pretty dynamic: new subtypes get added all the time. The documents can also be nested. The single collection contains a few hundred million documents.

Now, given this, the number of "distinct" attributes that I need to index on the collection is pretty high, perhaps a few hundred, and therefore approximately that many indexes on the same collection. With MongoDB's 64-indexes-per-collection limit, I cannot keep adding indexes (and even otherwise it's not an approach that warms the cockles of my heart). Any of these attributes is searchable and needs to be performant.

I have the following options:

  • Bite the bullet and split each class into its own collection. This would solve the primary problem, but introduce an extensibility issue. Newer types are added all the time, and the last thing I want is to keep creating collections; it's very dynamic. So this is ruled out overall. BTW, this is one of the reasons why I went with a document store to start with.
  • Secondary indexes - Ingest data into Elasticsearch, query ES first to retrieve document IDs, then retrieve the actual data from Mongo. This would work, but keeping ES and Mongo in sync is another problem. Is there another way to build out a large set of secondary indexes to supplement the Mongo indexes?
  • Cache - Run the queries against a cache that "papers over" Mongo. Too much to build and support long term.
  • Other workarounds such as "mapping" attributes - Create indexes on 64 static attributes. The incoming data, with all its attributes, will map onto these 64 attributes, and we'll use the mapping information to interpret the data when we read it. This ensures that we never have to go beyond a fixed set of static "surrogate" attributes to index.
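To make the last option concrete, here is a minimal sketch of the surrogate-mapping idea. All names (slot fields `idx000`..`idx063`, the helper functions) are hypothetical, not an established pattern; the per-type mapping would itself have to be stored somewhere so readers can translate slots back to real attribute names.

```python
# Hypothetical sketch of the "surrogate attribute" workaround.
# Each subtype's real attribute names are assigned to a fixed pool of
# indexed slot fields; only the slot fields ever get MongoDB indexes.

SLOT_COUNT = 64  # assumed fixed pool size, matching the index budget

def build_mapping(attribute_names):
    """Assign each real attribute name of one subtype to a surrogate slot."""
    if len(attribute_names) > SLOT_COUNT:
        raise ValueError("type has more attributes than available slots")
    return {name: f"idx{i:03d}" for i, name in enumerate(sorted(attribute_names))}

def to_surrogate(doc, mapping):
    """Rewrite a document's dynamic attributes onto the surrogate slot fields."""
    return {mapping.get(k, k): v for k, v in doc.items()}

mapping = build_mapping(["color", "screenSize"])
stored = to_surrogate({"color": "red", "screenSize": 15}, mapping)
# stored == {"idx000": "red", "idx001": 15}
```

The obvious costs are the ones the post names: every read must consult the mapping, and a slot field holds values of mixed types and meanings across subtypes.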

Thoughts? How have you solved this problem?

Can’t seem to edit my post. The last option again is a workaround and a horribly broken design, so not my preference TBH.

Welcome back @kembhootha_k !

I would go with an option not listed yet – apply the Attribute Pattern to your data modelling use case since:

  • There is a subset of fields that share common characteristics and you may want to sort or query on that subset of fields

  • You need to add a dynamic range of attributes that may only be found in a small subset of documents

The Attribute Pattern allows you to efficiently index using key/value attribute pairs for dynamic attributes.
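A minimal sketch of that reshaping, assuming field names from the sample document later in this thread (`commonAtt*` for common fields, `a` for the attribute array): common fields stay top-level, and every dynamic attribute moves into one array of `{"k": ..., "v": ...}` pairs, so a single multikey compound index (e.g. `db.products.createIndex({"a.k": 1, "a.v": 1})`) covers all of them.

```python
# Reshape a flat document into the Attribute Pattern.
# COMMON_FIELDS is an assumption standing in for the parent type's schema.
COMMON_FIELDS = {"id", "commonAtt1", "commonAtt2"}

def to_attribute_pattern(doc):
    # Keep common/inherited fields as-is.
    out = {f: v for f, v in doc.items() if f in COMMON_FIELDS}
    # Move every dynamic attribute into the single indexable "a" array.
    out["a"] = [{"k": k, "v": v} for k, v in doc.items() if k not in COMMON_FIELDS]
    return out

doc = {"id": "A12355", "commonAtt1": "value1", "CustomAtt1": "Value3"}
print(to_attribute_pattern(doc))
# {'id': 'A12355', 'commonAtt1': 'value1', 'a': [{'k': 'CustomAtt1', 'v': 'Value3'}]}
```

Queries on any dynamic attribute then take the same shape, e.g. `{"a": {"$elemMatch": {"k": "CustomAtt1", "v": "Value3"}}}`, and all hit the one compound index.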

For more reference patterns, please see Building with Patterns: A Summary.

If MongoDB Atlas is an option for your use case, Atlas Search is an integrated search solution based on Apache Lucene (similar to ES). Configuration of indexing is done via the Atlas UI/API, index sync is automatic, and queries are performed via the standard MongoDB API ($search aggregation stage).
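For reference, a dynamic-mapping Atlas Search index definition is a small config fragment; with `dynamic` set to `true`, supported field types are indexed automatically, which suits a heavily heterogeneous schema (field-by-field static mappings are also possible if only a subset should be searchable):

```json
{
  "mappings": {
    "dynamic": true
  }
}
```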

Regards,
Stennie


Thank you. I will have a look at the Attribute Pattern.

Further, is the Atlas Search approach recommended for non-analytics use cases, where a small percentage of documents missing from the search indexes could lead to a catastrophe? I have zero tolerance for out-of-sync data between the primary store and secondary indexes.

Much appreciated!

@Stennie_X

Thank you for the inputs. I spent some time looking at the Attribute Pattern and it certainly would work. However, I have the following concerns with the Attribute Pattern.

  1. The Attribute Pattern seems to dilute a document store’s capabilities. Most of us use document stores like MongoDB for their fundamental tenet: the flexibility of schemaless/extensible-schema documents. Given that, the Attribute Pattern seems like a workaround (IMHO). Put another way, couldn’t the same thing be done in a regular (non-document) database with a large vertical table that stores the attributes as key/value pairs (parent object ID, attribute ID, value as string), without a document store?

  2. Size of indexes - If I don’t need all my attributes indexed, only some of them (say 20-30% of attributes), the Attribute Pattern wouldn’t let me do that, leading to unnecessarily large indexes?

  3. It doesn’t handle nested documents well. If I have a document with flexible attributes that contains nested documents which also have flexible attributes, I cannot use this pattern for an arbitrary level or type of nesting.

Sample document for this discussion, using the Attribute Pattern:

{
   "id": "A12355",
   "commonAtt1": "value1",
   "commonAtt2": "value2",
   "a": [
      {"k": "CustomAtt1", "v": "Value3"},
      {"k": "CustomAtt2", "v": "Value4"},
      {"k": "CustomAtt3", "v": "Value5"}
   ],
   "childDocuments": [{
      // looks very similar to the parent document w.r.t. its attributes
    },
    {
      // looks very similar to the parent document w.r.t. its attributes
    }]
}