I’m experiencing intermittent slowness in MongoDB queries. On inspection, the logs consistently show that most of the query time is spent “waiting for cache.”
Here’s the context:
WiredTiger cache size: 850 GB
Total size of active collections: ~400 GB (according to current ops)
Dirty pages: Remain well below 64 GB and rarely spike
Based on this, it seems like there should be enough available cache, and cache eviction pressure appears low.
Yet, some queries still take a long time and appear to stall due to cache wait times.
Has anyone experienced something similar or have insights into what might be causing this behavior?
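For reference, the cache-related figures can be checked directly via serverStatus in mongosh (a quick sketch; stat names may differ slightly between versions):

const cache = db.serverStatus().wiredTiger.cache;
cache["maximum bytes configured"]          // configured WiredTiger cache size
cache["bytes currently in the cache"]      // current cache usage
cache["tracked dirty bytes in the cache"]  // dirty data held in the cache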
Your “waiting for cache” issue is almost certainly because your query threads are getting stuck doing cache housekeeping work. Once the WiredTiger cache gets roughly 95% full, your application (query) threads start doing eviction work themselves instead of just running queries.
That should be safe to do, because you'd just be adding 7 more background threads (assuming you're on the default) that will compete for CPU cycles; assuming you're not CPU bound, it won't be noticeable.
Either way, you should confirm first with that command before updating the conf.
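Something like this in mongosh (a rough sketch; exact stat names can vary a bit between versions):

db.serverStatus().wiredTiger.cache["pages evicted by application threads"]
db.serverStatus().wiredTiger["thread-yield"]["application thread time waiting for cache (usecs)"]

If the first counter keeps climbing, your query threads really are being dragged into eviction; the second one shows how long application threads have been stalled waiting on the cache overall.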
I’ve reviewed the attached graph showing the WiredTiger cache size. It doesn’t appear that the cache usage reaches the 95% threshold—at most, it seems to peak around 80%. However, I’m still observing some eviction activity, as indicated by the slight drop in the green line.
Could this eviction be contributing to the slow queries? And if so, why is eviction occurring even though the cache hasn’t reached the typical 95% threshold?
So the 80% is expected. I should've mentioned that the eviction_target config value defaults to 80% and the eviction_trigger config value defaults to 95%.
If you check the stats:
db.serverStatus().wiredTiger.cache["pages evicted by application threads"]
you'll see that it's not by application threads. The eviction_trigger configuration value (default 95%) is the level at which application threads start to perform eviction.
The 80% (eviction_target) is the level at which WiredTiger tries to keep overall cache usage, so this is normal behavior.
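To put numbers on it, you can compare current cache usage against those two thresholds (a rough mongosh sketch, assuming the default 80/95 values):

const c = db.serverStatus().wiredTiger.cache;
// current cache fill percentage, to compare against eviction_target (80) and eviction_trigger (95)
print((100 * c["bytes currently in the cache"] / c["maximum bytes configured"]).toFixed(1) + "% of the cache in use");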
The problem is that your eviction threads can't keep up with the workload: even though they kick in at 80%, they fall behind and build up a backlog, and that backlog is what causes your “waiting for cache” delays.
Said another way, that single eviction thread is working flat out once usage hits 80%, but it can't keep up with how fast new data is coming in. A single thread becomes insufficient, especially when it must wait for I/O, meaning it has to pause for disk writes to complete before it can continue cleaning.
This leads to your queries waiting in line behind the overwhelmed eviction process to get cache space, even though you theoretically have plenty of room. It's like going to a restaurant and having to wait for clean plates even though the restaurant isn't full.
So you should be able to fix this by adding more eviction workers, so cache cleanup doesn't become a single-threaded bottleneck.
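For example, at runtime via the wiredTigerEngineRuntimeConfig server parameter (example values only, since the right thread counts depend on your hardware; please verify against the docs for your version):

db.adminCommand({
  setParameter: 1,
  // example: allow up to 8 eviction worker threads instead of the current setting
  wiredTigerEngineRuntimeConfig: "eviction=(threads_min=1,threads_max=8)"
})

Then mirror whatever values you settle on in your config file so the change survives a restart.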
apologies for the delay, was heads down in engineering, but made a mental note when i saw this (i was waiting at a restaurant lol)