MongoDB upgrade questions

Hello all,

Apologies if this is the wrong area. Anyway, we’ve been upgrading our MongoDB hosts and I’ve run into a few things I’d like answers to. I’ll break it into three sections for ease.

  1. Correct terminology?

  2. MongoDB WT storage and compression within a replica set 3.2.22 on Ubuntu Xenial.

  3. When is a replica set secondary really a secondary, considering its status and optime?

1. Correct terminology

When trying to describe the space freed up by Mongo that is not reallocated, I often use the term “holes.” I don’t know the correct term to convey that there is essentially empty space being taken up which won’t be reused until repairDatabase is run for MMAPv1, or (correct me if I’m wrong) a compact job is run for WT.
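For clarity, these are the operations I mean, run from the mongo shell; the database and collection names are just placeholders:

    // MMAPv1: rewrite the database's data files to reclaim the "holes"
    // (blocks the database and needs a good chunk of free disk space)
    use mydb                                // "mydb" is a placeholder database name
    db.repairDatabase()

    // WiredTiger: rewrite a single collection and release unused space back to the OS
    db.runCommand({ compact: "mycoll" })    // "mycoll" is a placeholder collection name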

2. MongoDB WT storage and compression within a replica set (3.2.22 on Ubuntu Xenial)

I know this is an old version, but we’ve recently begun updating and will be making our way to the most recent version of MongoDB. Until then, we’ve just switched over to WT, as 3.2 was considered a good, stable point to start using it.

First, we noticed that during an initial sync, despite having a decent amount of RAM, mongod would invoke the OOM killer. We saw this in our non-prod environment, which is kept fairly bare (a PSA architecture). The primary has all the resources it needs, the secondary has half the RAM (8 GB), and the arbiter is a tiny host. We converted our primary to WT first, and that went well AFAICT. But when we needed to do our secondary, we ran into memory issues. At that point I bumped the RAM up to match the primary (16 GB) and started the sync again, and we hit the same issue. I found a bug report that felt very similar, SERVER-20159, though it was filed against an earlier version of Mongo (3.0).

We have no swap on our servers, but adding more RAM didn’t seem to help; it took a bit longer to invoke the OOM killer, but it still happened. Surprisingly, without altering any DB parameters, I just kept flushing the buffers/cache in Linux (echo 3 > /proc/sys/vm/drop_caches) and the sync finally completed. I understand MongoDB uses cache rather efficiently, but I couldn’t understand why, if this is cache and not actual RAM allocation, it grew high enough to invoke the OOM killer. We were on kernel 4.15.0-72, BTW. After the initial sync, RAM usage also dropped to about 50%.

I didn’t want to risk OOM kills while working on a higher environment, so I created a task that flushes the buffers/cache whenever memory usage as reported by free reaches 80%, which means I can’t report further on this, but I am curious what may be going on. The higher environment has a PSSSA architecture; one of the secondaries is prohibited from elections and is used as our backup server, so I’ll call it PSSHA for ease of conversation. It is otherwise similar but runs kernel 4.15.0-99 and has a lot more RAM.
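In case it’s relevant, this is roughly how I’ve been trying to tell WT cache usage apart from overall process memory in the mongo shell (field names as reported by serverStatus; I may be reading them wrong):

    // Compare the WiredTiger internal cache against overall mongod memory usage
    var ss = db.serverStatus();

    // Resident/virtual memory of the mongod process, in MB
    print("resident MB: " + ss.mem.resident);
    print("virtual  MB: " + ss.mem.virtual);

    // WiredTiger internal cache: configured maximum vs. what is currently held
    var cache = ss.wiredTiger.cache;
    print("cache max bytes:     " + cache["maximum bytes configured"]);
    print("cache current bytes: " + cache["bytes currently in the cache"]);

As far as I understand, neither number includes the Linux page cache that drop_caches clears, which is part of why I’m confused about where the growth is coming from.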

So, with the PSSHA environment, after migrating one host to WT (type S), we then did a snap copy of the data to switch over another host (type S). This went well; however, upon starting the host up, the sync began and it started processing the oplog. Our disk usage grew quite a bit past the WT host that was already stood up, about 36 GB more by the time rs.status() showed the same optimes. This was concerning, but then we noticed it shrank pretty quickly, and now the difference between the two hosts is about 4 GB (the existing WT host is at 896 GB, the new WT host is at 900 GB). BTW, our oplog size is about 300 GB. I was curious why that happened. I have a couple of theories, which may be wrong, so feel free to correct me:

  • There’s a task which waits for a certain amount of time and/or data to pass before performing compression for DB efficiency (i.e. the compressor runs at a lower priority, or only once there’s a decent amount of data that would compress well)?
  • Perhaps journalling was involved, and over time the journal for the replay aged out, or it was removed once it was committed and so was cleaned up?
  • WT perhaps does do some sort of auto-compaction to reduce space usage? Kind of like THP in Linux: it’s not hugepages, but it kind of is. Again, I’m not sure; I couldn’t find anyone who monitored their Mongo replication this closely, presented the data, and asked these questions, so I’m asking. I have read that there isn’t any compaction unless we run it ourselves, but again, the space usage dropped pretty quickly after the optimes synced up with the other members. (I’ve sketched how I’ve been comparing sizes just after this list.)
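In case it helps anyone answering, this is roughly how I’ve been comparing logical vs. on-disk size while watching the sync (the database name is a placeholder):

    // Rough check of logical vs. on-disk size for one database
    var s = db.getSiblingDB("mydb").stats();     // "mydb" is a placeholder name
    print("dataSize    (uncompressed logical bytes): " + s.dataSize);
    print("storageSize (compressed on-disk bytes):   " + s.storageSize);
    print("indexSize   (on-disk index bytes):        " + s.indexSize);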
3. Replica set: when is a secondary really a secondary?

I’m asking this because when I did a seeded sync, the member switched to SECONDARY practically immediately. I couldn’t find anything that stood out in the logs with regard to storage or replication events, but initially I did see messages regarding the oplog. I used rs.status() to monitor the primary’s and this host’s optimes. Although the secondary showed its state as SECONDARY, its optimes were still a couple of tens of thousands behind the primary. Eventually the optimes synced up to within +/- 5, but I didn’t see anything specific about optimes or a completed oplog sync in the logs on that secondary.

I ask because the switch to SECONDARY was pretty fast. I did see txn-recover, WT local.oplog.rs threads, oplog markers, etc., starting at 17:08:23. The same instance then went to STARTUP2 three seconds later (17:08:26), started the replication applier threads and switched to RECOVERING about 10 ms later (17:08:26.010), then switched to SECONDARY a couple of milliseconds after that (17:08:26.012). However, the host’s optime via rs.status() wasn’t within +/- 5 optimes of the primary until about 12-13 minutes later.
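For reference, this is roughly how I was watching the gap, run against the primary (just a quick shell snippet, so the math may be naive):

    // Print each member's lag behind the primary, based on rs.status() optimeDate
    var status = rs.status();
    var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
    status.members.forEach(function (m) {
        var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;   // Dates subtract to milliseconds
        print(m.name + "  " + m.stateStr + "  lag: " + lagSecs + "s");
    });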

So I’m curious why there’s comfort in switching to SECONDARY despite not being within +/- 5 optimes of the primary. I imagine there’s a good reason; I just don’t know it, and I’d like some warm fuzzies that, in the off chance the primary gets obliterated, we can safely fail over to a SECONDARY that is quite a bit behind in optimes. During this period our storage also grew, as mentioned earlier, and came back down shortly after.

Thank you very much in advance. Apologies if these questions have been asked already!

Zahid Bukhari

Hi Zahid,

I believe some of what you experienced could be solved by upgrading to a supported version of MongoDB. The 3.2 series was released in December 2015, and the whole series went out of support in September 2018. I would recommend upgrading to at least MongoDB 3.6 (which will be supported until April 2021), or better yet, to the newest 4.2 series.

Regarding the OOM issues: some operations in older, out-of-support MongoDB versions are known to use excessive memory, e.g. SERVER-25318, SERVER-26488, and SERVER-25075, among others. Since you mentioned that you have no swap configured, your deployments will be prone to being OOM-killed. If you can, please allow some swap space to alleviate this.

Regarding provisioning: a secondary node’s function is to provide high availability. It’s supposed to take over as the new primary if something happens to the current primary, so it’s recommended to provision secondaries with the same hardware as the primary.

Regarding auto-compaction: WiredTiger never does this. WiredTiger operates under the assumption that your data will keep growing rather than shrinking, so when you delete anything, that space is marked as reusable by WT but not released to the OS. If your data size keeps growing, releasing space to the OS only to request it again later is a net negative: it’s useless work that cancels itself out.
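If you want to see how much of that reusable space a given collection is holding on to, the block-manager section of the collection stats reports it (the names below are just examples):

    // Space WiredTiger has marked reusable inside the collection's file,
    // i.e. space that is not returned to the OS unless you run compact
    var cs = db.getSiblingDB("mydb").getCollection("mycoll").stats();
    print(cs.wiredTiger["block-manager"]["file bytes available for reuse"]);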

Regarding secondary status: once a member reaches SECONDARY in the output of rs.status(), it is ready to take over from the primary at any time. In older MongoDB versions it is possible to see a secondary’s optime fall far behind the primary’s if the replica set receives no writes for an extended period. This was changed by SERVER-23892 in MongoDB 3.4 and newer.
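As a quicker way to see how far behind a secondary is in time (rather than comparing raw optimes), you can use the built-in shell helpers on any member:

    // Shows each secondary's last oplog entry and how far it is behind the primary
    rs.printSlaveReplicationInfo()

    // Shows the primary's oplog size and the time window it covers
    rs.printReplicationInfo()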

Best regards,
Kevin

