Use case for Wildcard Index on all fields

viktor_mex · November 18, 2021, 1:51am

Hi community,

In the official manual it stated that it’s possible create a wildcard index on all fields.

I’m a newbie about MongoDB, but from what I understand indexes are used especially to improve the speed of read operations on most queried fields of a document but slowling the write operations. So I don’t get in which use cases it’s preferrable create an index on all fields instead to let query uses a complete collection scan.

Pavel_Duchovny · November 18, 2021, 6:05am

Hi @viktor_mex

Welcome to MongoDB community

Wild card indexes are meant for specific use case where a query might filter on any one of the fields , but only one. This means index will not assist in compound or more complex queries.

If you know the set of predicts for your application you should create specific indexes.

Read the following for more info

Thanks
Pavel

viktor_mex · November 18, 2021, 11:42am

Hi @Pavel_Duchovny and thank you for the welcome and the response.

I understand the use of wildcard indexes, but I didn’t get the use case of this specific example.

Pavel_Duchovny · November 18, 2021, 11:55am

@viktor_mex ,

In this example a product document has some common fields like product_name while the product_attributes sub fields can vary from product to product.

As MongoDB allow a flexible schema we cannot predict upfront all field name possibilities, however with wild card indexes we can capture any field in the document.

Now the application will be required to search based on one attribute name only like “size” or “color” and the documents that have the attributes with corresponding values can use the wild card index this way.

There is another strategy to use attribute schema design where a regular index of keys and values can be utilised

Let me know if that makes sense.

Thanks
Pavel

viktor_mex · November 18, 2021, 7:09pm

@Pavel_Duchovny ,

thank you for the insights and the linked post, very helpful resources.

Now if I read again the chapter about wildcard indexes in the MongoDB documentation it’s more clear why and when using wildcard indexes with fields that share a similar path, like:

db.products_catalog.createIndex( { "product_attributes.$**" : 1 } )

But it’s not still very clear this use case:

db.products.createIndex( { "$**" : 1 } )

In a collection called “products” I expect there are not only fields like “product_attributes.size”, “product_attributes.color” but also fields like “publishing_date”, “price”, etc. These fields are not related (like instead “product_attributes.$” fields) and an index, if I’m not wrong, has an impact on performance for write operations (and with this kind of indexes I suppose that the impact is bigger than a single or compound index).

So I didn’t still get the use case of creating a wildcard indexes for all fields of a collection, to me (as a newbie) there is no great benefit over collection scan.

Pavel_Duchovny · November 19, 2021, 6:21am

@viktor_mex ,

Lets say you jave two fields with subdocuments and you have to filter on one of the fields in each of them.

{
product_attributes : { ... },
product_company_details : { ... }
...
}

You can save the need to create 2 indexes and create one top level wild card index.

Additionally, if you do not know the fields names that will hold this data and you only dynamic build them during query time, you need to do a top level.

Hope that helps

Pavel

viktor_mex · November 19, 2021, 4:07pm

@Pavel_Duchovny ,

Lets say you jave two fields with subdocuments and you have to filter on one of the fields in each of them. You can save the need to create 2 indexes and create one top level wild card index.

In this case, shouldn’t it be better use a wildcard index only those specific fields, I mean:

db.collection.createIndex(
    { "$**" : 1 },
    { "wildcardProjection" :
        { "fieldA" : 1, "fieldB.fieldC" : 1 }
    }
)

And about this:

Additionally, if you do not know the fields names that will hold this data and you only dynamic build them during query time, you need to do a top level.

If I don’t know any fields names of a collection on which to create an index, shouldn’t be better analyze the queries and evaluate the use of a more specific index (also a wildcard one) to support them effectively? To me (as a newbie) the wildcard index on top level sounds very similar to a collection scan.

Stennie_X · November 20, 2021, 4:22am

Hi @viktor_mex,

Your instincts are correct: you generally should design appropriate data models and indexes rather than indexing all fields. If you are planning to index a collection more broadly with wildcard indexes, you should also try to minimise inclusion of unrelated and low cardinality fields in your data model as they will add likely more overhead than optimisation. In addition to some overhead on write performance, larger indexes will increase your working set and require more server resources (RAM and I/O).

Some of the nominal benefits of using wildcard indexes more broadly are:

Indexed queries on fields with high cardinality will be more efficient than a collection scan
You do not have to specify all of the fields to index, so new fields matching wildcard index paths will automatically be added

The documentation example you referenced is more focused on the latter case:

The product_attributes field can contain arbitrary nested fields, including embedded documents and arrays

Indexed queries are generally (but not unilaterally) more efficient than collection scans. For example, if you have a low cardinality boolean field where the value you query is in a majority of your documents (eg is_published: true), a collection scan may actually be more efficient. This outcome may seem counter-intuitive, but indexes work best on queries that are selective or where you need to sort query results.

If there are fields that your use case will never be included in filter criteria there is also no need to index them.

Reducing Unnecessary Indexes (and unnecessarily indexed fields) will leave more resources (RAM and I/O) for the rest of your deployment.

In theory, yes. In practice the additional field names may be created as a result of user-generated content or data pulled from a third party API where you do not have full control over how (or when) new fields will be added.

As @Pavel_Duchovny suggested, there are Schema Design Patterns like the Attribute Pattern which can help you better model your data to minimise the number and size of indexes required. There are also Schema Design Anti-Patterns to avoid like unnecessary indexes and bloated documents.

Regards,
Stennie

viktor_mex · November 20, 2021, 7:09pm

Hi @Stennie_X ,

thank you for the clear and exhaustive answer and the linked resources!

You and @Pavel_Duchovny made me understood very well the main concepts behind wildcard indexes, so thank you very much!

I would clarify some final points:

Does the “working set” refer to the most used data in a MongoDB database that, for this reason, are usually stored in RAM for quick access? (I searched the term but I’m not sure about the definitions I found)

Why indexes on low cardinality fields may be less efficient of a collection scan? Maybe due to the resources (I/O and RAM) used to retrieve the documents related to the filtered indexes entries? Or is there another reason?

Pavel_Duchovny · November 21, 2021, 6:32am

Hi @viktor_mex ,

The working set refer to the entire foot print that is mostly used (data and indexes), if for example all your queries are covered by indexes (search and all fields) than data should not be fetched therefore won’t be defined as a working set.

When working set does not rit to ram it should be paged in and out from disk. In databases its considered an expensive op .

There is another term called cache hit ratio if all your indexes are in memory and your queries are only accessing via indexes the hit ratio will be closer to optimal.

Now collection scans are not always bad. If the amount of data fetched from the index compared to the entire collection is 95% this means that there is almost a full index scan and each entry needs to fetch a doc. Collection scans are more optimised for large consecutive scans as this is how initial syncs happen in replication. Therefore, if almost every document in the collection needs to be accessed I expect coll scans to be faster and leas resources intensive.

Thanks

viktor_mex · November 22, 2021, 12:21am

Hi @Pavel_Duchovny ,

I got, thank you for the very clear explanation!

system · November 27, 2021, 12:21am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.