How to debug page eviction failures?

There was an incident in our Mongo cluster where a secondary got a page fault, after which we started seeing a spike in pages queued for eviction (mongodb_ss_wt_cache_pages_queued_for_eviction) as well as in pages that could not be evicted (mongodb_ss_wt_cache_pages_selected_for_eviction_unable_to_be_evicted). The secondary recovered after ~2.5 hours, but these two metrics have stayed abnormal ever since.

How do we debug this issue? Should we look at any other metrics? Would there be any data loss in case of a server restart?

  • Mongo version: 3.6.2
  • Cluster setup: PSSA
  • OS: Ubuntu 16.04

Attaching the graphs for reference.

Hi @Tejas_Jadhav1, welcome to the community!

The two eviction metrics track an internal WiredTiger process (eviction) and are mainly used by MongoDB engineers to troubleshoot issues. Even then, they are interpreted in combination with other metrics, and rarely, if ever, as standalone numbers.

Typically, if they show large values like the ones you posted (I would consider millions of anything as large 🙂), it means the server is trying hard to process a backlog of work, i.e. it's overwhelmed and trying to stay on top of the work it's given.
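
If you want to see the raw counters those exporter metrics are built from, they live in the wiredTiger.cache section of serverStatus (assuming the mongodb_ss_wt_cache_* names follow the usual exporter mapping). A minimal shell check:

    // Mongo shell: raw WiredTiger counters behind the mongodb_ss_wt_cache_* metrics.
    var cache = db.serverStatus().wiredTiger.cache;
    print(cache["pages queued for eviction"]);
    print(cache["pages selected for eviction unable to be evicted"]);
    // For context on cache pressure, compare current usage with the limit:
    print(cache["bytes currently in the cache"]);
    print(cache["maximum bytes configured"]);

Most of these counters are cumulative since the last restart, so the rate of change is more telling than the absolute number.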

  • Mongo version: 3.6.2

Many improvements have been made to WiredTiger in the years since 3.6.2 was released in January 2018, including performance work that makes the eviction process smarter and more efficient. Also, the 3.6 series has been out of support since April 2021. I would strongly encourage you to upgrade to a supported version (4.2 is the oldest series still supported), and upgrading to the latest version (currently 5.0.9) is best.

Upgrading to the latest version would also ensure that you don't run into old bugs that have already been fixed.
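
As a minimal pre-upgrade sanity check, run on each member; note that featureCompatibilityVersion must be raised at each step of the upgrade path (e.g. it must be "3.6" before you start 4.0 binaries):

    // Mongo shell: confirm what this member is running before planning the upgrade.
    db.version()
    db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })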

  • Cluster setup: PSSA

As per Replica Set Deployment Architectures, it's not recommended to deploy an even number of voting members. A PSSA deployment has three data-bearing nodes plus an arbiter, i.e. four voting members, so I recommend you remove the Arbiter from the replica set.
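
If you do remove it, it's a single command run against the primary; the arbiter's host:port below is a placeholder, so take the real one from rs.conf():

    // Mongo shell, connected to the primary: remove the arbiter from the set.
    rs.remove("arbiter.example.net:27017")  // placeholder host:port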

Would there be any data loss in case of a server restart?

As long as you don't kill -9 the mongod process (i.e. you let it shut down cleanly) and you're using majority write concern for your writes, there should be no risk of data loss, barring hardware problems.
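
To make that concrete, here is a sketch of both halves of that advice; the collection and document are made up:

    // Mongo shell: shut mongod down cleanly instead of kill -9.
    db.getSiblingDB("admin").shutdownServer()

    // A write acknowledged by a majority of data-bearing members survives
    // elections and restarts of individual nodes:
    db.orders.insertOne(
        { sku: "abc", qty: 1 },                 // hypothetical document
        { writeConcern: { w: "majority" } }
    )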

Best regards
Kevin

Thanks for the reply @kevinadi

How can we further debug this? Are there any more metrics that we can look at to identify the cause?

Yes, we have a plan for this and might be done with it in the coming weeks. But before that, we wanted to get clarity on these anomalous metrics we are seeing on Mongo, and whether they would impact our upgrade process in any way.

The other secondary has no voting rights and does not participate in elections. It was created as a backup node in case we see issues on the primary or the existing secondary.

I’m afraid there’s not an easy answer here. Basically, I don’t think there’s anything to debug; the database was overwhelmed at some point, but then it managed to clear the backlog of work after some time, and things got better.

In terms of other metrics, there are hundreds of them (you can see them in db.serverStatus()). However, the best tools in my opinion are mongostat for the overall health of the server and mongotop for checking your collection activities.
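
For example, run from the OS shell against the member you're investigating (the hostname is a placeholder):

    # Server-wide stats, sampled every second; the "dirty" and "used" columns
    # show WiredTiger cache pressure, which drives eviction.
    mongostat --host secondary.example.net:27017 1

    # Per-collection read/write time, refreshed every 5 seconds:
    mongotop --host secondary.example.net:27017 5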

In most cases like this, if you let the server catch up with its work and don't overwhelm it further, the situation typically resolves itself (as you've seen here, I believe).

Also, as I previously mentioned, things are generally better behaved in newer MongoDB versions, so you might not need to do anything other than upgrade 🙂

The other secondary has no voting rights and does not participate in elections. It was created as a backup node in case we see issues on the primary or the existing secondary.

Sorry, I don't follow; isn't that the purpose of a secondary? To be able to step up and take over as primary when there's a problem in the replica set, so that you have high availability? What's the goal of making this secondary not act like a secondary? Does it have lower hardware specs, or is there another reason?
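
(For context for other readers: a member like that is typically configured with votes: 0 and priority: 0, along these lines; the member index is hypothetical.)

    // Mongo shell, on the primary: make one member non-voting and never electable.
    cfg = rs.conf()
    cfg.members[2].votes = 0     // hypothetical index; check rs.conf() for yours
    cfg.members[2].priority = 0  // a non-voting member must have priority 0
    rs.reconfig(cfg)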

Best regards
Kevin

In our case, page eviction has never been this spiky, and even now it is spiking a lot more than before. The spikes after 16:00 in the screenshot above are still happening, which looks worrisome.

Yeah, the setup is unusual. We created that secondary in the past when we were seeing some hardware-related issues on the primary. Just after creating it, though, those failures stopped and did not happen again. Since the new secondary already existed, we decided to keep it as a backup in case of catastrophic hardware failures on both the primary and the existing secondary. We plan to remove it once we are done with the MongoDB version upgrade.