Best way to amend indexes on large existing data

Tuan_Dinh1 · February 18, 2021, 11:40pm

Hi All,

I have the following use case: there is a large amount of live data on a collection (~250GB) with a set of indexes (these are created when db is empty and then data are inserted afterwards). Then, we’ve done some analysis and decided to use a new set of indexes.

Question: what is the best way to deploy the new set of indexes and remove the old set of indexes (that affects existing data as well ) ?

Backup data and reload doesn’t seem to work, because the snapshot would include the old indexes too.
A no-brainer would be just run “db.removeIndexes()”, then “db.createIndexes()” but given the large existing data, it could be very time-consuming (I reckon)

I’m wondering if there’s a better way, or better, what is the industry standard approach to deal with this problem ?

Tuan

Pavel_Duchovny · February 19, 2021, 5:30pm

Hi @Tuan_Dinh1,

Yes we have a rolling maintenance index creation for replica sets and sharded clusters:
https://docs.mongodb.com/manual/tutorial/build-indexes-on-replica-sets/

If this is not possible make sure to use a background methods to not block database activity.

Thanks.
Pavel

Tuan_Dinh1 · February 21, 2021, 11:29pm

Thanks @Pavel_Duchovny, will give it a go.

Tuan_Dinh1 · February 22, 2021, 12:22am

Hi @Pavel_Duchovny,

Thanks again for pointing me to this doc. The idea seems to be rolling out the new indexes on one replica of the replica set at a time, with the primary to be the last. But at each replica, there is no way-around running db.collection.createIndex(). This is great when you have to perform the index creation to live cluster.

Fortunately, we don’t have this constraint in our case. Yes, there will be existing data, but we have blue/green deployment strategy where all operations can be to a new cluster and checked before it’s switched. So basically, we will do:

Stop write operations to the cluster
Make a manual snapshot of the cluster
Create a new cluster with existing data from the snapshot created in step 2.
Perform db.connection.dropIndexes() to new cluster (to remove all existing indexes except for the one on _id)
Perform db.connection.createIndexes() to the new cluster (to create new indexes) (<== This is the key time-consuming operation what I’d like to ask for faster way, but seems there isn’t one)
Blue/Green switch: Make the new cluster the active cluster (all consumer/producer now point to this cluster)
Enable write operation (now to new cluster)
Tear down the old cluster.

Any comments, thoughts ?

Tuan

Pavel_Duchovny · February 22, 2021, 4:28am

Hi @Tuan_Dinh1,

Nope sounds good. In that case a regular index build will be the fastest.

Thanks,
Pavel