TTL Deletes are extremely slow

I have seen this issue posted before, but my setup is sharded replica sets. I think I have found why this is the case, but I am putting it here in case anyone has some input.

Database has about 4.5 billion documents, all around 4KB in size.
I have TTL set to 8 days.
Cluster is 16 shards.
Each shard is a PSA replica set.
All primaries and secondaries have their own separate nodes, with P/S split across 2 DCs.
All arbiters are located on a single node in a third DC.
Each node is a VM with 8 CPU, 64GB MEM, 1TB DISK

Via bulk writes I am pushing in 20K documents/s.

However, the TTL deletion is lucky to be getting a few hundred documents per second. I am out to about 20 days of retention because deletion is just so slow; it’s like the deletes are not even moving.
I can see in the logs a lazy handful of REPL CRUD “d” entries on the $oid, which I assume is the replica getting individual delete events for TTL.
The IOWait is up to about 20% due to the deletion and is affecting the inserts.
I believe the issue might be that the operation and replication of the TTL deletes happen one by one, which is way beyond the capability of the DB at the incoming rate.

I have looked at partitioning, but it does not really exist even though it comes up in searches; partitioning seems to be equated with sharding, but it really is a different strategy.

I am thinking perhaps I create a manual partitioning strategy with multiple collection names representing 4-hour periods of time. Then I can calculate the collections to insert into/query from in middleware, based on the document date/time, and have the old collections dropped at expiry (rough sketch below). This seems really clunky; surely a proper partitioning solution would be great to add as native.
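Roughly the routing I have in mind, in mongosh; the base name `events` and the 4-hour bucket width are just illustrative:

```javascript
// Route a document to a collection bucketed by 4-hour UTC window.
function bucketCollectionName(date) {
  const ymd = date.toISOString().slice(0, 10).replace(/-/g, "");
  const bucket = Math.floor(date.getUTCHours() / 4) * 4; // 0, 4, 8, 12, 16, 20
  return `events_${ymd}_${String(bucket).padStart(2, "0")}`;
}

// e.g. a document timestamped 2023-03-01T09:15Z lands in events_20230301_08
db.getCollection(bucketCollectionName(new Date())).insertOne({
  ts: new Date(),
  payload: "...",
});
```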

Sharding + Partitioning would have so many benefits.

What’s annoying is there doesn’t appear to be a fast method for dumping data from a collection while keeping the collection and index definitions intact, like TRUNCATE in SQL. Even deleteMany performs poorly.
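The closest I can get is capturing the index definitions, dropping the collection, and recreating it. A rough mongosh sketch (the collection name `events` is a placeholder, and in a sharded cluster the collection would also have to be re-sharded after the drop):

```javascript
// Emulate TRUNCATE: save secondary index specs, drop, recreate.
const indexSpecs = db.events.getIndexes().filter((ix) => ix.name !== "_id_");

db.events.drop(); // near-instant compared to deleteMany

indexSpecs.forEach((ix) => {
  const { key, v, ns, ...options } = ix; // strip internal fields
  db.events.createIndex(key, options); // implicitly recreates the collection
});
```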

Capped collections are not suitable as they require unique IDs and I am unable to guarantee this due to multiple inserters and the frequency of data.

Regards
John

Hi @John_Torr,

Welcome to the MongoDB community forums :sparkles:

There is a single background thread in mongod that reads the values in the TTL index and removes expired documents from the collection. This background task runs every 60 seconds, so MongoDB starts deleting documents 0 to 60 seconds after the TTL index finishes building.

Because the duration of the removal operation depends on the workload of your mongod instance, expired data may exist for some time beyond the 60-second period between runs of the background task.
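For reference, a TTL index like the one described is created with the expireAfterSeconds option; the collection and field names below are placeholders:

```javascript
// TTL index: the background monitor removes documents whose
// createdAt value is older than expireAfterSeconds (8 days here,
// matching the retention from the original post).
db.events.createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 60 * 60 * 24 * 8 }
);
```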

You can check the number of documents deleted by the TTL monitor in the serverStatus command’s metrics.ttl.deletedDocuments field.
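For example, run against each shard’s mongod (rather than through mongos) to see that shard’s progress:

```javascript
// Cumulative TTL counters since this mongod started.
const ttl = db.serverStatus().metrics.ttl;
printjson(ttl); // shows deletedDocuments and passes
```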

In replica sets, the TTL background thread only deletes documents when a member is in the primary state; it is idle when a member is in the secondary state. The deletions are then replicated to the secondary members. That is, secondaries depend on TTL deletes being replicated from the primary instead of running their own TTL passes, which keeps replica set members consistent. To read more about it, please refer to the documentation link.

It’s likely that the reason for “beyond the capability of the DB” is related more to hardware/network performance than to a constraint within MongoDB itself.

Here it is important to note that the TTL index needs to keep up with the rate of incoming data: the rate at which new items are inserted must be the same as or slower than the rate at which expired items are deleted, otherwise the TTL thread will never catch up. It’s also worth noting that insert operations take priority over TTL deletes.

To resolve this issue, you could consider upgrading your hardware to something capable of handling 20K/s inserts and 20K/s deletes simultaneously (although this is just in general terms, and there are likely more nuanced considerations besides these numbers). Alternatively, if you don’t necessarily need the TTL index, you could opt for a batch operation where you identify and delete batches of data after a specified period of time (a minimal sketch follows). However, please note that the rate of data being deleted must exceed the rate of new data being inserted, otherwise the backlog will keep growing.
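A minimal sketch of such a manual purge in mongosh; the collection, field name, cutoff, and batch size are all illustrative:

```javascript
// Delete expired documents in bounded batches on an indexed date
// field, pausing between batches so inserts keep priority.
const cutoff = new Date(Date.now() - 8 * 24 * 60 * 60 * 1000); // 8 days
let batch;
do {
  batch = db.events
    .find({ createdAt: { $lt: cutoff } }, { _id: 1 })
    .limit(50000)
    .toArray()
    .map((doc) => doc._id);
  if (batch.length > 0) {
    db.events.deleteMany({ _id: { $in: batch } });
    sleep(100); // throttle: back off between batches
  }
} while (batch.length > 0);
```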

As with any delete operation in MongoDB, blocks of bytes inside existing .wt files in the dbPath are marked as free, but the files themselves are not automatically trimmed. So after a TTL index deletes a large number of documents, consider resyncing replica set member nodes, or running compact.
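For example:

```javascript
// compact is per-collection and per-node; it is not replicated,
// so run it on each member in turn (secondaries first, then step
// down and compact the former primary).
db.runCommand({ compact: "events" }); // "events" is a placeholder name
```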

I hope it helps!

Best,
Kushagra

Thanks for the response Kushagra,

I have been assessing the DB for a week to make sure my observations are justified. I just want to clear up a few things from the response to define the issue a little more.

The duration of the deletes is not in the area of minutes here; it is days to weeks past the point where the TTL data should have been cleared out.

I am not saying the TTL deletes are not working. They are indeed working and being replicated to the secondaries. However, the rate of deletion is in the order of tens per second.

I had stated the 20K/s inserts to give a general idea of the DB’s performance, which might have been misleading: it can handle 56K/s inserts without too much strain, but generally it is only doing around 16K/s.

I have removed the TTL index for the past week and a half and note the IOWait is way down to around 3-4% per node.

Also, in terms of space: yes, I realize the disk space is not being released. The issue is with the data in the collection not being deleted fast enough.

I still believe the TTL implementation is at issue here; it is probably not suitable for big data.

I consider the DB to be fairly well provisioned at 16 shards, 256 CPU cores, 1TB RAM, and enterprise SSD SAN storage over 10Gb interfaces.

Update:
I dumped my DB and re-initiated it without the TTL index. It ran for 11 days before filling 16TB, totaling around 30 billion documents at 4KB each (given that the DB is compressed on disk). This equates to around 200MB per second of document ingress to the DB.

In this time the IOWait remained around 3-4% and there were no issues with performance. In fact the performance remained great the whole time, right up until I had to dump it again for filling up.

Compare this to before, when the TTL index kicked in: the IOWait then remained between 15-20% and insert performance was impacted.

Question: Does TTL delete records one by one or as a batch operation?

Next steps: I plan to partition the DB data over multiple collections with the date as a component of the name. I will then drop the older collections and create new ones each day. The insertion scripts will be modified to use the correct collection, and queries will utilize the $unionWith aggregation pipeline stage (sketch below).
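Roughly what the query side would look like; `listCollectionsInRange` is a hypothetical middleware helper that maps a date range onto the dated collection names, and the match predicate is just an example shard-key lookup:

```javascript
// Query across date-partitioned collections with $unionWith.
const names = listCollectionsInRange(startDate, endDate); // hypothetical helper
const [first, ...rest] = names;
const predicate = { deviceId: "abc123" }; // example shard-key match

db.getCollection(first).aggregate([
  { $match: predicate },
  ...rest.map((coll) => ({
    $unionWith: { coll, pipeline: [{ $match: predicate }] },
  })),
]);
```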

Suggestion: Provide native partitioning across collections based on a date value from the document.

Hi :wave: @John_Torr,

Based on my understanding, it is relative: it depends on factors like how big the data is quantitatively and whether the workload being processed is aligned with the hardware specs.

That said, since the TTL monitor is single-threaded, large batches of deletes in a single pass may overwhelm the cluster’s IOPS capacity on primary or secondary nodes, as the changes are also being replicated simultaneously.

From MongoDB 6.1, TTL deletion happens in batches; please refer to SERVER-63040 for more details. There are also some tunable setParameters for batched deletes which may make your TTL deletes more efficient, but it all depends on your data size.

By default, MongoDB closes a delete batch at 10 documents, 2MB, or 5 ms, whichever comes first. If you have larger documents you may benefit from raising the data-size threshold, but that would also put more pressure on the storage engine’s cache.
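These thresholds are exposed as server parameters (names as I understand them from the SERVER-63040 work; please verify them against your exact server version). For example, raising the per-batch targets:

```javascript
// Illustrative values only; defaults and limits depend on version.
db.adminCommand({ setParameter: 1, batchedDeletesTargetBatchDocs: 100 });
db.adminCommand({ setParameter: 1, batchedDeletesTargetStagedDocBytes: 8 * 1024 * 1024 });
db.adminCommand({ setParameter: 1, batchedDeletesTargetBatchTimeMS: 10 });
```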

Best,
Kushagra

I am using MongoDB 6.0.4. I run it containerized, so I will need to wait until the mongo Docker Hub image catches up.

I am using 120 separate processes sharing 30 mongos instances, each doing bulk inserts of 1000 documents at a time. Each document is around 4KB, so I would say I am well over the batch thresholds.

My takeaway here is that the TTL index is not intended for big data, hence I believe partitioning will be the way to go. It therefore makes sense to have this functionality provided as a native mechanism; otherwise we can’t really classify MongoDB as suitable for big data if it can’t deal with big clean-up.

Thanks for the info.

I have exactly the same issue here :frowning:

I’m using Amazon DocumentDB 5.0 with a large amount of data (objects that represent audit records): 1.5 million documents per day, and I need to keep one year of data.

Search queries are quick, but my main issue is the clean-up. After one week, deleting one day of data takes 160 seconds.

I’m not sure whether using TTL on 2 or 3TB of data will be efficient. Do you think it can be?

If not, what’s the best solution?

Hi,

I found out that yes, indeed, the TTL monitor performs individual delete operations prior to 6.1, so essentially TTL can’t keep up with inserts under a very heavy load and ends up running you out of disk. From 6.1, TTL deletes are batched.

I opted to partition my data into hourly collections, where I use a function to determine the range of collections a query needs to fetch from, and an aggregation pipeline to union the collections.

This way I get to drop the old collections at practically zero cost (see the sweep sketch below). I also don’t need any indexes for datetimes and only use the shard key as the primary key for the data I need to find.
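A sketch of the retention sweep, assuming hourly collection names of the form events_YYYYMMDD_HH; the base name and retention period are illustrative:

```javascript
// Drop bucket collections whose encoded hour is past retention.
const cutoff = new Date(Date.now() - 8 * 24 * 60 * 60 * 1000); // example retention
db.getCollectionNames()
  .filter((name) => /^events_\d{8}_\d{2}$/.test(name))
  .filter((name) => {
    const [, ymd, hh] = name.match(/^events_(\d{8})_(\d{2})$/);
    const ts = new Date(
      `${ymd.slice(0, 4)}-${ymd.slice(4, 6)}-${ymd.slice(6, 8)}T${hh}:00:00Z`
    );
    return ts < cutoff;
  })
  .forEach((name) => db.getCollection(name).drop());
```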