Advice for Scaling 2.4.4 on Windows

Disclaimers: I know this is a REALLY old version, but it’s a legacy system that I need to make work for a few more months. Also, I know literally nothing about MongoDB.

I have a cluster of 3 nodes running on Windows as EC2 instances on AWS. The memory allocated to the VMs is 16GB. The data partition is configured with 1000 IOPS. The size of the database is roughly one terabyte. The largest collection is on the order of 400GB or so, with some other 40-100GB collections.

What I’m seeing in logs is a lot of elections and heartbeat failures. Members are being flagged as down, or slow to respond. I’m also seeing a lot of connection drops.

Windows shows there’s only 300MB of free memory, with somewhere around 5,000-8,000 page faults a second on average. The commit charge for the mongod.exe process varies, but is as high as 55GB.

Additional Info: Two of the smaller collections have TTLs. At this point, the only write activity is the TTL processes.

My analysis is that there’s so much swapping going on that it’s destabilizing the cluster, and that’s manifested as seeing hosts down, elections, etc. Is that a reasonable idea?

What would be a reasonable way to address this? I’m thinking bumping memory allocation to 64GB. Would that be sufficient?

Hi @George_Sexton welcome to the community!

MongoDB 2.4.4 was released in June 2013, so almost 10 years ago! Unfortunately that means that we have limited options. Notably, the 2.4 series was using the MMAPv1 storage engine, which was removed in modern MongoDB versions, so their behaviour and performance characteristics are radically different. In fact, the Atlas Live Migration service only goes as far back as the 2.6 series, so migrating to Atlas is out of the question as well.

Due to the age of the infrastructure, I can perhaps offer some pointers on what to look for, but may be unable to give you a more direct solution, unfortunately.

Two of the smaller collections have TTLs. At this point, the only write activity is the TTL processes.

Do you mean that there is no further data going into the database, only getting removed?

My analysis is that there’s so much swapping going on that it’s destabilizing the cluster, and that’s manifested as seeing hosts down, elections, etc. Is that a reasonable idea?

Barring other evidence from the logs when the event happens, I’d say this is a very reasonable analysis. It is possible that the server is so busy swapping that it doesn’t have time to do anything else, although it’s curious to see this apparent resource crunch in a system where no data is being added. But then again, this is MMAPv1, whose performance behaviour I’m not entirely familiar with, especially under (I assume) an equally old Windows :)
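As a rough sanity check on the swapping theory, the figures from the original post can be compared directly. The sketch below reads the quoted "5-8000" as 5,000 to 8,000 page faults per second and assumes, purely for illustration, that every hard fault costs exactly one disk read. The Windows page fault counter also includes soft faults that never touch disk, so the real share of hard faults is unknown here; the function only shows where the break-even point would be.

```python
# Figures quoted in the thread; everything else is an illustrative assumption.
PROVISIONED_IOPS = 1000     # data partition, per the original post
FAULT_RATES = (5000, 8000)  # page faults/sec, reading "5-8000" as 5,000-8,000


def break_even_hard_fraction(faults_per_sec: float, iops: float) -> float:
    """Fraction of faults that must hit disk (one read each) to use up all `iops`."""
    return min(1.0, iops / faults_per_sec)


for rate in FAULT_RATES:
    frac = break_even_hard_fraction(rate, PROVISIONED_IOPS)
    print(f"{rate} faults/s: disk saturates once {frac:.1%} of faults are hard faults")
```

With these numbers the break-even sits between 12.5% and 20% of faults going to disk, so even a modest share of hard faults could keep the volume saturated. Measuring hard faults separately on the server would confirm or rule this out.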

I’m thinking bumping memory allocation to 64GB. Would that be sufficient?

I don’t think there’s any harm in trying this step, although it’s hard to say whether it will be sufficient. The system is on the verge of failing anyway, and adding more RAM usually helps when you suspect that the hardware does not have enough resources to do its work.
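To put the proposed bump in perspective, here is a minimal sketch that compares RAM to the data sizes quoted in the original post (roughly 1 TB in total, about 400GB for the largest collection). It treats all of RAM as available for data, which is an optimistic assumption since the OS and other processes need memory too, and it says nothing about the actual working set, which is not known from the thread.

```python
# Sizes quoted in the thread, rounded; treat them as approximations.
DATA_GB = 1000               # "roughly one terabyte"
LARGEST_COLLECTION_GB = 400  # "on the order of 400GB"


def coverage(ram_gb: float, data_gb: float) -> float:
    """Upper bound on the share of `data_gb` that fits in RAM (capped at 100%)."""
    return min(1.0, ram_gb / data_gb)


for ram_gb in (16, 64):
    print(
        f"{ram_gb} GB RAM: {coverage(ram_gb, DATA_GB):.1%} of all data, "
        f"{coverage(ram_gb, LARGEST_COLLECTION_GB):.1%} of the largest collection"
    )
```

Going from 16GB to 64GB raises the ceiling from about 1.6% to about 6.4% of the total data set. Whether that is enough depends on how much of the data the application and the TTL deletes actually touch.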

If all else fails, I would suggest upgrading to at least MongoDB 2.6, then using Live Migration to migrate to Atlas. From there, you can upgrade to a modern MongoDB version, and then decide whether to dump the data back into an on-prem deployment or simply keep using Atlas from that point onward.

Best regards
Kevin

Kevin,

I’m concerned that there would be compatibility problems with your suggested route. The OS is old, and the code base is equally old. Driver compatibility with the DB is a concern, along with things like TLS versions, supported ciphers, etc.

George

Hello Kevin,
I am working with George and we have a question regarding the upgrade. Which version would you recommend we upgrade to (from 2.4.4) so that:

  • we won’t have compatibility issues when we try to restore the backup from 2.4.4
  • we won’t have to upgrade the driver on the app that’s connecting to MongoDB? (So the new version should ideally support the 2.4.4 drivers)

The idea is to cause as little ripple as possible.
Thanx !
Pascal

Hi @Pascal_Audant , @George_Sexton

we won’t have compatibility issue when we try to restore the backup from 2.4.4

Assuming you’re trying to move away from 2.4.4, I would suggest experimenting with at least MongoDB 2.6, as this is the oldest version that both Atlas and most drivers can handle. Being on at least 2.6 opens up options such as Atlas Live Migration, which might come in handy later.

we won’t have to upgrade the driver on the app that’s connecting to the MongoDB

Most drivers support 2.6 as the oldest server version, up to a certain driver release. For example, the latest Node driver (5.0) supports MongoDB 3.6 as the oldest version, but Node driver 4.1 supports server versions as far back as MongoDB 2.6. See https://www.mongodb.com/docs/drivers/node/current/compatibility/ for more details. Other drivers have a similar compatibility matrix.
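The lower-bound check described above can be sketched as a small lookup. The table below contains only the two Node driver entries mentioned in this reply; the names `MIN_SERVER_FOR_NODE_DRIVER` and `meets_minimum` are hypothetical, and the full matrix at the linked page should be treated as the authority.

```python
# Oldest supported server version per Node driver release, as quoted in this thread.
# Only these two entries come from the thread; consult the linked matrix for the rest.
MIN_SERVER_FOR_NODE_DRIVER = {
    "5.0": "3.6",
    "4.1": "2.6",
}


def parse(version: str) -> tuple:
    """Turn '2.4.4' into (2, 4, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(driver: str, server: str) -> bool:
    """True if `server` is at or above the driver's oldest supported version.

    Checks the lower bound only; it says nothing about an upper bound.
    """
    return parse(server) >= parse(MIN_SERVER_FOR_NODE_DRIVER[driver])


for driver in MIN_SERVER_FOR_NODE_DRIVER:
    for server in ("2.4.4", "2.6.0"):
        status = "ok" if meets_minimum(driver, server) else "too old"
        print(f"Node driver {driver} with server {server}: {status}")
```

By this check a 2.4.4 server falls below the minimum for both driver releases listed, which is why 2.6 keeps coming up as the first target.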

The idea is to cause as little ripple as possible.

Without knowing the exact details of the app and infrastructure, I cannot say how risky any operation will be.

I believe the least risky proposition is to upgrade the instance’s RAM to see if it solves the resource issue. However, staying at 2.4.4 is just as risky, as you could see a repeat of these events later, since the root cause of the issue is still unknown.

If modernizing the infrastructure is the ultimate goal, though, I would experiment with 2.6. MongoDB typically only supports upgrades between adjacent major versions, and 2.6 is one major version up from 2.4.
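The one-step-at-a-time rule means a longer upgrade is a chain of hops. Here is a minimal sketch of that idea; the `RELEASE_SERIES` list is added for illustration (it is not from this thread and stops at 6.0), so check it against the official release notes before planning anything with it.

```python
# Release series in order. This list is an assumption added for illustration;
# verify it against the official release notes.
RELEASE_SERIES = [
    "2.4", "2.6", "3.0", "3.2", "3.4", "3.6",
    "4.0", "4.2", "4.4", "5.0", "6.0",
]


def upgrade_path(current: str, target: str) -> list:
    """Return the series to step through, in order, to get from `current` to `target`."""
    start = RELEASE_SERIES.index(current)  # raises ValueError for unknown series
    end = RELEASE_SERIES.index(target)
    if end < start:
        raise ValueError("downgrades are not covered by this sketch")
    return RELEASE_SERIES[start + 1:end + 1]


print(upgrade_path("2.4", "2.6"))
print(upgrade_path("2.4", "3.6"))
```

Each hop in the returned list is a separate upgrade with its own compatibility notes, which is why going from 2.4 straight to a modern version in place is a long road compared with a single hop to 2.6.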

Another option, if this data is vital to your operation and you’re hesitant about the options we’re currently discussing, is to engage Enterprise Advanced Support for guidance and support in modernizing the infrastructure.

Best regards
Kevin
