Document deletes very slow, much slower than inserts

RSommer · May 25, 2022, 3:06pm

Hi we have a very large collection ~900m documents with an average size of 13.6k. Documents live for around a week then need to be deleted, initially we had a capped collection but we didn’t get good enough performance so we tried having a daemon to monitor the number of records in the DB and delete any over a threshold value. This worked for a year or so but the steadily increasing volume we are receiving means this approach is now failing. Investigating the deletes I was surprised to see how slow they are. While inserts are maxing about 15k/s deletes are only 2k/s and often less.

Some typical values from the logs are
Removed 79073933 documents in 794.03105mins
Removed 92086070 documents in 662.8366666666667mins

For context we receive about 300m documents/day that are stored in mongo.

The collection has only 4 single-field indexes in addition to id and the actual delete is performed by javascript loaded into the Mongo shell. The delete logic finds the oldest timestamp in the collection then calculates timestamp2 by stepping forward in time by x minutes until there are enough records where timestamp <= timestamp2 to delete. The actual deletion was by timestamp (indexed field) but we got better performance by using the ID of the latest record we want to delete then executing
collection.deleteMany({ “_id”: { $lte: id } });

There is only the one server, no replica sets. Thankfully the data is not mission-critical and only stored on a ‘best effort’ basis, but lately that best effort is not good enough.

Any thoughts gratefully received.

MaBeuLux88_xxx · May 25, 2022, 3:46pm

Hi @RSommer and welcome in the MongoDB Community !

So just to be clear, you are running a 900m*13.6k = 12,24 TB collection on a single standalone node with about 300m/24/60/60 = 3472 writes per seconds in average. Is that correct?

If that’s correct, 12,24 TB is HUGE. Usually we recommend to start Sharding around 2 TB of uncompressed data. We are 6 times over that value.

Can you share more information about the server specs that you are using to support this? I think you just reached the limits of physic and it’s time to shard and distribute the data on multiple nodes.

Also, I guess you have tried and this approach and it’s probably not working for you, but did you try the TTL indexes?

Oh and also, is this a single node Replica Set or just a stand alone node without an oplog?

Cheers,
Maxime.

RSommer · May 30, 2022, 9:51am

Hi @MaBeuLux88_xxx
Yes correct on the storage side but the writes are very variable much higher during market hours and tailing off steeply at night, it peaks at ~20K inserts and is rarely under 7k during the day.

I did set up a sharded cluster a couple of years ago but throughput was disappointing, I can revisit it, I may have done something wrong. I wasn’t aware of the 2TB guideline.

TTL indexes we didn’t try yet I thought that might be too hard to manage being time based so a flood of data on a particularly busy day would not get deleted promptly.

Standalone machine, no oplog, when I joined the project we had dial node replication and capped collections but we had to remove replication as the volume grew. If we are at the limits of vertical scaling then it’s probably time to revisit shards but should we really expect such a huge discrepancy between insert and delete?

the machine spec is
$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2200.000
CPU max MHz: 2200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4394.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79

with 64Gb installed

Cheers

John_Page · May 30, 2022, 2:30pm

I suspect a lot of my answer is what you expect to hear from a MongoDB employee - but it’s also what you are going to hear from any expert.

(a) Depending on the write concern you are using running a Standalone is not only incredibly dangerous but also potentially a lot slower then running a replica set (A single node replica set only has to journal the oplog not the data) - a 3 node (The minimum required) replica set keeps your data safe.
(b) 12TB is far too large for a single instance , when you are inserting your writes are batched and written to the disk during a checkpoint mostly sequentially. When deleting it’s quite possible there is a bunch of random access, seeks and the need to read those blocks from disk back into cache to delete them - all of which will be much much slower than inserting.

If you want to speed things up here are some options.

Make your data a lot smaller (Have you optimised the schema and usage well?)
Ensure your indexes and ideally all data you edit (including deletes) fits in RAM - this is hard but you need either sharding or much larger servers. 64GB isn’t a very large server these days.

Consider not deleting and inserting (depending how you are using the data) but instead overwriting old documents with new ones retaining only _id , you woudl do this based on date and add upsert in case you have to create a new document.

Partition your collections by age - drop old whole collections to get rid of older documents - use $unionWith when querying to query multiple collections at once (or move to cloud and use the datalake)

This si the sort of thing professional services sort out for companies but there is no getting away from the fact hosting 12+TB of data isn’t cheap.

RSommer · June 7, 2022, 2:19pm

Thanks John, yes using a standalone is not ideal, we used to have replication but the performance lagged behind what we needed, cutting back to the bare minimum has been the only way we’ve been able to keep up (until now) and as we are only storing data on a best effort basis we’ve been able to put up with the inconvenience. Thanks for the info on the deletes, that could well be what we see, the deletes happen at the start of the collection and the inserts at the end so I can imagine theres a lot of paging from disk happening.

Theres not really an option to reduce the data, we are at the mercy of what is being sent to us, we can’t control the json.

$unionWith sounds very interesting, I wondered about keeping a collection a day and just dropping a day at a time but I thought my queries would suffer, I’ll certainly look into this.

Looks like it’s time to shard, I’ll order more memory for the servers and resurrect my sharding experiments.

Sello_Mkantjwa · February 6, 2024, 10:06am

@RSommer, Did the sharding solve your issue?