TTL Deletes are extremely slow

I have seen this issue posted before, but my setup is sharded replica sets. I think I have found why this is the case, but I'm putting it here in case anyone has some input.

Database has about 4.5 billion documents all around 4K in size.
I have TTL set to 8 days.
Cluster is 16 shards.
Each shard is a PSA replicaset.
All primaries and secondaries have their own separate nodes, with P/S split across 2 DCs.
All arbiters are located on a single node in a third DC.
Each node is a VM with 8 CPU, 64GB MEM, 1TB DISK

With bulk writes I am pushing in 20K documents/sec.

However, the TTL deletion is lucky to be getting a few hundred per second. I am out to about 20 days of retention because deletion is so slow; it's like they are not even moving.
I can see in the logs a lazy handful of REPL CRUD "d" entries on the $oid, which I assume is the replica receiving individual delete events for TTL.
The IOWait is up to about 20% due to deletion and is affecting the inserts.
I believe the issue might be that the operation and replication of the TTL deletes happen one by one, which is way beyond the capability of the DB at this incoming rate.

I have looked at partitioning, but it doesn't really exist even though it comes up in searches. Partitioning seems to be equated with sharding, but it really is a different strategy.

I am thinking perhaps I create a manual partitioning strategy with multiple collection names representing 4-hour periods of time. Then I can calculate the collections to insert into and query in middleware based on the document date/time, and have the old collections dropped at expiry. This seems really clunky; surely a proper partitioning solution would be great to add as a native feature.
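For what it's worth, the bucketing logic for that manual approach is small. Here is a minimal sketch, assuming Python middleware; the `events_` prefix, the 8-day retention, and the function names are all illustrative, not anything MongoDB provides:

```python
from datetime import datetime, timedelta

BUCKET_HOURS = 4
RETENTION = timedelta(days=8)

def bucket_name(ts: datetime) -> str:
    """Map a document timestamp to its 4-hour-bucket collection name."""
    bucket_start = ts.replace(minute=0, second=0, microsecond=0,
                              hour=(ts.hour // BUCKET_HOURS) * BUCKET_HOURS)
    return "events_" + bucket_start.strftime("%Y%m%d_%H")

def expired_buckets(now: datetime, lookback_days: int = 30) -> list[str]:
    """Names of buckets whose entire window is past retention.

    The middleware would drop() each of these collections, which is the
    cheap, O(1)-ish equivalent of deleting all their documents.
    """
    names = []
    cursor = now - timedelta(days=lookback_days)
    while cursor + timedelta(hours=BUCKET_HOURS) <= now - RETENTION:
        names.append(bucket_name(cursor))
        cursor += timedelta(hours=BUCKET_HOURS)
    return names
```

The insert path calls `bucket_name()` on the document timestamp to pick a collection, and a periodic job drops everything `expired_buckets()` returns; a drop replicates as a single oplog entry rather than billions of individual deletes.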

Sharding + Partitioning would have so many benefits.

What's annoying is there doesn't appear to be a fast method for dumping data from a collection while keeping the collection and index definitions intact, like TRUNCATE. Even deleteMany performs poorly.
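The closest substitute for TRUNCATE is dropping the collection and recreating its secondary indexes. A sketch of building those commands, assuming you have already captured the index specs from `listIndexes` (this copies only `key` and `name`; options such as `unique` or `expireAfterSeconds` would need copying too):

```python
def recreate_commands(coll_name: str, index_specs: list[dict]) -> list[dict]:
    """Build the command documents to emulate TRUNCATE: drop the collection,
    then restore its secondary indexes (the _id index is rebuilt automatically
    on first insert)."""
    keep = [s for s in index_specs if s["name"] != "_id_"]
    cmds = [{"drop": coll_name}]
    if keep:
        cmds.append({
            "createIndexes": coll_name,
            "indexes": [{"key": s["key"], "name": s["name"]} for s in keep],
        })
    return cmds
```

Each returned document would be passed to `runCommand` in order. The drop is near-instant regardless of document count, which is exactly the property deleteMany lacks.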

Capped collections are not suitable as they require unique IDs and I am unable to guarantee this due to multiple inserters and the frequency of data.


Hi @John_Torr,

Welcome to the MongoDB community forums :sparkles:

There is a single background thread in mongod that reads the values in the TTL index and removes expired documents from the collection. This background task runs every 60 seconds, so MongoDB starts deleting documents 0 to 60 seconds after the index build completes.

Because the duration of the removal operation depends on the workload of your mongod instance, expired data may exist for some time beyond the 60 second period between runs of the background task.

You can check the number of documents deleted by TTL in the serverStatus command's metrics.ttl.deletedDocuments field.
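Since that counter is cumulative, sampling it twice gives you the effective TTL delete rate, which you can compare against your 20K/s insert rate. A small sketch with made-up sample values; in practice the two dicts would come from successive `db.adminCommand("serverStatus")` calls:

```python
def ttl_delete_rate(sample_a: dict, sample_b: dict, interval_secs: float) -> float:
    """Approximate TTL deletes/sec between two serverStatus samples.

    metrics.ttl.deletedDocuments is a monotonically increasing counter,
    so the rate is simply the delta divided by the sampling interval.
    """
    a = sample_a["metrics"]["ttl"]["deletedDocuments"]
    b = sample_b["metrics"]["ttl"]["deletedDocuments"]
    return (b - a) / interval_secs

# Illustrative samples taken 60 seconds apart (numbers invented):
s1 = {"metrics": {"ttl": {"deletedDocuments": 1_000_000, "passes": 120}}}
s2 = {"metrics": {"ttl": {"deletedDocuments": 1_018_000, "passes": 121}}}
rate = ttl_delete_rate(s1, s2, 60.0)  # 300.0 deletes/sec
```

If this rate stays persistently below your insert rate, the backlog can only grow, which matches the behavior you are describing.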

On replica set members, the TTL background thread only deletes documents when the member is in the primary state; it is idle when the member is in the secondary state. The deletions are then replicated to any secondary replica set members. That is, secondaries depend on TTL deletes being replicated from the primary instead of running their own TTL process. This keeps replica set members consistent. To read more about it, please refer to the documentation link.

It’s likely that the reason for “beyond the capability of the DB” is related more to hardware/network performance rather than a constraint within MongoDB itself.

Here it is important to note that the TTL deletes need to keep up with the rate of incoming data: the rate at which new items are inserted must be the same as or slower than the rate at which items are deleted, otherwise the backlog will never catch up. It's also worth noting that insert operations take priority over TTL deletes.

To resolve this issue, you could consider upgrading your hardware to something capable of handling 20K/s inserts and 20K/s deletes simultaneously (although this is just in general terms, and there are likely more nuanced considerations besides these numbers). Alternatively, if you don't necessarily need the TTL index, you could opt for a batch operation where you identify and delete batches of data after a specified period of time. Either way, the rate at which data is deleted must meet or exceed the rate at which new data is inserted, otherwise the backlog will keep growing.
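One detail that makes the batch-delete approach cheap: if your documents use default ObjectIds, the `_id` already embeds a creation timestamp (a 4-byte big-endian Unix time in the first bytes), so you can range-delete on `_id` without any extra indexed date field. A sketch of computing the cutoff boundary, assuming UTC timestamps; the surrounding driver calls are indicated in comments only:

```python
from datetime import datetime, timezone

def cutoff_object_id_hex(cutoff: datetime) -> str:
    """Smallest possible ObjectId (as hex) for documents created at `cutoff`.

    An ObjectId's first 4 bytes are its creation time as big-endian Unix
    seconds; padding the remaining 8 bytes with zeros gives the lower bound
    of that second, usable in {"_id": {"$lt": ...}} range filters.
    """
    secs = int(cutoff.replace(tzinfo=timezone.utc).timestamp())
    return format(secs, "08x") + "0" * 16

# Batched delete loop against a pymongo collection (illustrative only):
# boundary = ObjectId(cutoff_object_id_hex(datetime.utcnow() - timedelta(days=8)))
# while True:
#     ids = [d["_id"] for d in coll.find({"_id": {"$lt": boundary}},
#                                        {"_id": 1}).limit(50_000)]
#     if not ids:
#         break
#     coll.delete_many({"_id": {"$in": ids}})
```

Running this in controlled batches during quieter periods lets you pace the deletes, unlike the TTL monitor's fixed behavior.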

Correspondingly, with any delete operation in MongoDB, blocks of bytes inside the existing .wt files in the dbPath are marked as free, but the files themselves are not automatically shrunk. So, after a new TTL index deletes a large number of documents, consider resyncing replica set member nodes or running compact to reclaim disk space.

I hope it helps!
