Create index on 4 millions documents

Frederic_Klein · June 21, 2023, 4:42pm

Dear community
Context: Atlas M20 cluster, a collection with unexpected about 4 millions documents, still counting.
I am about to add a new field to my documents (nodejs migration script). This field will be a string to be considered as an enum (<5 different values).
The value will be taken from an existing field on which I am currently querying with regex.
I am expecting a performance improvement.
In addition I am considering adding an index on this new field.
Here my questions:

can I expect an even bigger performance boost when querying on this newly indexed field (equality only) ?
should I add the index before migrating the documents with new field value or should I first migrate all documents with the new field and then create the index?
how long will the collection be locked by the index creation process preventing writes?

chris · June 22, 2023, 1:59pm

If the field(or the one you are sourcing from) was not indexed, then yes you will see a performance increase, as the query will perform an index scan instead of a collection scan.

Hovever the low cardinality of the index won’t be as performant as one with higher cardinality.

Its is probably more efficient to do it afterwards. If you create it beforehand there will be an additional 4M index entries to update when you run your migration, the non existing field is indexed as null. But with this number of documents you are unlikely to notice the difference.

Later versions of MongoDB(4.2+) only lock the collection at the start and end of the build. In my experience this is a very short period.

If you have a Pre-Prod cluster test there beforehand to see how you applications perform.

Wuth dedicated Atlas tiers (M10+) you can use a rolling index build to avoid any performance impact from building an index. However there will be a step-down of the primary when it comes to build on that member. I just mention this for completeness not that I think you need it.

Frederic_Klein · June 22, 2023, 3:13pm

Thanks a lot @chris for these answers.
I am running version 5.0.18 on GCP.

When you say “a very short period”, what should I understand: a couple of seconds (2, 5, 10) or minutes?
What will happen to writes (inserts) during that time?

In your experience, how long should I expect the index to build for e.g. 6M entries by next Sunday (lower load on that day)?

And after the index is built, is it a good idea to let my migration script run at full speed, possibly hammering hard on the opcounters or should I rather throttle it?

In addition, I noticed that, although I am querying an aggregation pipeline having as first stage a $match on a range of a non indexed Date (timestamp) field and as second stage a $match ($ne) on another non indexed String field that can hold 3 different values, the Performance Advisor still does not suggest any index.
Any idea why not?
Is it considering that the potential performance improvement would not compensate the cost of the index?