Experiences in upgrading from M2 to M5

Hi all,

I thought I’d share my experience of upgrading from M2 to M5 recently and hopefully pass some lessons learned on to others.

So I had a production database running on an M2. I know MongoDB doesn’t recommend running a production system on shared infrastructure, and perhaps this highlights one of the reasons why. The database was about 1.8GB in size, with a couple of collections having more than 3M records. I chose a Saturday morning to start as it would mean I was free from distractions.

Step 1 was to take a copy of the most recent back-up and download it for safe keeping in case anything went wrong.
Step 2 was to disable all of the running scheduled jobs and incoming HTTPS endpoints so that they wouldn’t be putting data into the database during the upgrade.
Step 3 was to unlink the database from the App so that no data could be written to it. However, clicking unlink in the UI didn’t do anything (possible bug?) and trying to disable the link via a deployment also didn’t work. Given I’d already done steps 1 and 2, I wasn’t too worried about it.
Step 4 was to click the upgrade button. A message next to the button said it would take between 7 and 10 minutes.

About an hour into the upgrade I started to get worried that it was going badly. During the upgrade process you lose access to all of the metrics, so I couldn’t check how far through it was. I started to think about rolling back and restoring my back-up, and created a fresh M5 instance to see if I could restore the back-up to it. However, it turns out the format of the downloaded back-up is not suitable for restoring directly back into MongoDB. I read online that I could install a local instance of Mongo and use that to convert the format, but that seemed like a big job and not an area I was familiar with.

After another hour, I started to wonder whether not unlinking the application was causing data to be written to the database and restarting the migration process. I hurriedly wrote a “site under maintenance” page for my website, which I should have done from the beginning, but I had thought the site was only going to be down for 10 minutes so hadn’t bothered.

I thought about contacting support but, it being a Saturday, they weren’t available on my plan. Perhaps choosing a Saturday wasn’t a good idea.

I then had the idea of switching my application’s linked database to the empty M5 database I had created earlier, to divert any remaining traffic away from the migrating database, and just left it for a few more hours to see if the migration would complete. About 7 hours later the migration finally finished, with a message saying that it had failed.

Once the database was back up, I could check the stats and saw that the migration had restarted a couple of times, but the third attempt (after I had diverted the app to the empty database) succeeded, although it still took several hours (much longer than the advertised 10 minutes).

Whilst it said it had failed, it was up and running on an M5 and all of the collections appeared to be in place with data in them, so as far as I was concerned it appeared successful. I linked the app back to the migrated database, took down the maintenance page and re-enabled the endpoints and schedules.

I was a bit worried that the database had shrunk in size from 1.8GB to about 1GB, but I put this down to more efficient use of storage rather than data loss as all my data appeared to be in place.

A few days later I got an email from a user saying they had lost some data, and I started to dig into exactly what was missing. What I discovered was that all of the indexes on the collections had gone missing (not something I had thought to check after the migration). The reason this hadn’t caused a massive performance issue was that Mongo had created some indexes automatically for me (nice feature!); however, it didn’t know that they needed to be unique. I quickly tried to recreate the unique indexes, but they failed due to the presence of duplicate records. I spent a few hours hunting down the duplicates one by one and deleting them, but I wasn’t sure how many I had. Finally I wrote an aggregation to count them all and discovered that I had tens of thousands of duplicate records (not easy to spot among millions of records). I presume these came about from the migration restarting, as many of the ones I spot-checked were quite old data.
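For anyone who ends up in the same spot, the counting aggregation was along these lines (a rough sketch using the Node.js driver; the connection string, database, collection and field names are placeholders for whatever your unique index is supposed to cover):

```typescript
import { MongoClient } from "mongodb";

// Placeholder connection string, database, collection and field names:
// substitute your own, and group on the fields your unique index covers.
const client = new MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net");

async function countDuplicates(): Promise<void> {
  const coll = client.db("myDb").collection("records");
  const result = await coll
    .aggregate([
      // Group on the fields that should be unique and count each group.
      { $group: { _id: { accountId: "$accountId", createdAt: "$createdAt" }, count: { $sum: 1 } } },
      // Keep only the groups with more than one document, i.e. duplicates.
      { $match: { count: { $gt: 1 } } },
      // Add up the extra documents across all groups.
      { $group: { _id: null, duplicates: { $sum: { $subtract: ["$count", 1] } } } },
    ])
    .toArray();
  console.log(result); // shape: [ { _id: null, duplicates: <count> } ]
}
```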

I then wrote a function to find the duplicates via an aggregation, then loop through and delete them using a bulk operation in batches of 1000 (roughly as sketched below). Only then was I able to reapply my indexes. Luckily I had written a script to create all of my indexes and views from scratch, so this step was easy.
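Something like the following, again with made-up names and the Node.js driver for illustration (my actual script used bulk writes, but a deleteMany with $in per batch amounts to the same thing). The idea is to keep the first document in each duplicate group, delete the rest in batches of 1000, then recreate the unique index:

```typescript
import { MongoClient, ObjectId } from "mongodb";

// Same placeholder names as above: substitute your own.
const client = new MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net");

async function removeDuplicatesAndReindex(): Promise<void> {
  const coll = client.db("myDb").collection("records");

  // Find every duplicate group and collect the _ids of all its documents.
  const groups = coll.aggregate([
    { $group: { _id: { accountId: "$accountId", createdAt: "$createdAt" }, ids: { $push: "$_id" } } },
    { $match: { "ids.1": { $exists: true } } }, // only groups with 2 or more documents
  ]);

  let batch: ObjectId[] = [];
  for await (const group of groups) {
    batch.push(...group.ids.slice(1)); // keep the first document, drop the rest
    if (batch.length >= 1000) {
      await coll.deleteMany({ _id: { $in: batch } });
      batch = [];
    }
  }
  if (batch.length > 0) {
    await coll.deleteMany({ _id: { $in: batch } });
  }

  // Only once the duplicates are gone can the unique index be recreated.
  await coll.createIndex({ accountId: 1, createdAt: 1 }, { unique: true });
}
```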

The reason the user thought they had lost data was because an aggregation was failing due to the missing indexes and making it look like there was missing data in the application.

I guess what I should have done, and will do next time, is create a fresh M5 instance, restore the M2 back-up into it, then delete the M2 and re-point the app to the new M5 instance. Hopefully this will be useful to anyone who is considering upgrading from M2 to M5.

I’m also not entirely convinced that disabling the HTTPS endpoints really worked, as I saw log entries showing them running even after I had disabled them.

Requests to MongoDB:

  1. Change the advice on upgrade to say that it could take several hours.
  2. Add advice about disconnecting your application and what steps to take to do this.
  3. Ensure unlinking your application works properly.
  4. Provide some indication of how long the migration may take - e.g. keep access to the stats during the migration process.
  5. Provide the ability to import a back-up that has been downloaded and saved locally.

Sorry for the long post. Hope someone finds it useful.


Welcome aboard!

I also experienced upgrading from M2 to M5 in production, but I only had 1.2GB of data to migrate.

I will be adding my experience of migrating Realm Cloud, maybe it will help someone, who knows…

It was painful too, but less so than yours I trust, probably because I expected the migration to go bananas. That’s why I have the mobile clients check a status flag on my own server (a static file served from Google Cloud, but it could have been a public GitHub repo) before trying to connect to Realm.

The flow is Launch app → Check status flag → Connect to Realm if up, else activate maintenance mode of the app.
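Roughly like this (my clients are iOS, so treat this as a sketch of the shape in TypeScript; the URL and flag name are made up):

```typescript
// Sketch of the launch check. The URL and JSON shape are placeholders for
// whatever static file you host; the real app is an iOS client.
type Status = { realmUp: boolean };

async function startApp(): Promise<void> {
  let status: Status = { realmUp: false }; // fail closed: assume maintenance
  try {
    const res = await fetch("https://example.com/status.json");
    status = (await res.json()) as Status;
  } catch {
    // Couldn't read the flag: stay in maintenance mode rather than risk writes.
  }

  if (status.realmUp) {
    await connectToRealm();   // normal start-up, open the synced Realm
  } else {
    showMaintenanceScreen();  // block writes until the flag flips back
  }
}

// Stand-ins for the app's own logic.
async function connectToRealm(): Promise<void> { /* ... */ }
function showMaintenanceScreen(): void { /* ... */ }
```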

This way I am sure that nobody tries to save data that can’t be saved during server migration.

Migrating the data took about 2.5 hours. It is impossible to guess how long it will take, but you can track it in the Realm logs on your backend. Until you see a final log entry saying “Operation Complete, took x seconds”, it is not over.

This is actually not part of the migration itself; it is part of Realm DB Cloud’s logic to reconstruct itself whenever you modify the DB schema (in the case of migrating, you are basically creating a new schema). So keep in mind that if you switch Dev Mode in prod, you will make Realm rebuild itself, taking another couple of hours during which users can’t save anything to Realm Cloud.

The worst part is that while Realm DB Cloud is rebuilding itself, it can’t serve the correct documents to your users. Your client logic will probably believe it can create a new unique document because it doesn’t see one, when in fact one already exists but can’t be served to the end users for the time being… So even if you have a unique key index, it won’t help. At least, that is my experience.

What I recommend is to have a status flag endpoint so that you can kick users out of your app, stop them from connecting to Realm Cloud, and update your schema without any complication. Maybe make sure you have a query to check for conflicting entries in case something bad happens.

Simon, Jerome,

All I can say is WOW, these posts are a gut punch to read for those of us here at MongoDB working hard to make these products better, and it’s disappointing that we have not responded sooner.

To be intellectually honest and show some vulnerability, I think each of us who read your note line for line probably tried to respond, then felt ashamed and/or just didn’t quite know how to fully respond, and said “I will come back to this later”. Now it’s been 12 days, a second community member has responded with a similar albeit different issue, and another 6 days have passed since then, which likely caused folks who had planned or hoped to respond to further think “now how can I really even begin to respond”! None of that is to make any excuses, but only to simply start engaging with you.

THANK YOU both for sharing the long and incredibly frustrating journeys you’ve been on with us. I believe that both of you have experienced different issues even if they look and sound similar on the surface.

Simon, I believe you suffered from our M2 to M5 upgrade process running into an edge case in the brittleness of the backend processes that move the data (on the backend we pipe mongodump to mongorestore and have occasionally seen classes of errors that require manual intervention to fix; we have a plan to move to a more modern backend utility to power these upgrades in the future, but unfortunately that utility is still in development and we’ve prioritized upgrades from our serverless environment to dedicated clusters ahead of M2 to M5 upgrades, which may have been a mistake in hindsight). The fact that you felt unable to get support when you needed it is also unacceptable: even if this was a small database, your users were counting on it and we let you down. The process you went through to pin down the data issues afterwards sounds nightmarish. I am still not 100% clear on whether you think the data issues derived from your app writing during the upgrade or restore, or whether you believe the backed-up data itself had the issue? If the latter, that is very concerning.

And then Jerome, your issue I believe may be completely different, and related to the fact that upon upgrade the oplog is not preserved; this can cause a Sync-enabled application to lose the ability to stay in sync and to need to re-initialize. We are trying to figure out how to architecturally handle this situation more elegantly: it is unfortunately a nuanced and technically complex topic to properly address. Your suggestion around better ergonomics for managing this state is a good one: ideally we would not need the state at all.

Taking a step back I want to really celebrate both of you for taking a positive “help the community” tone instead of coming in hot and angry as I probably would have done after experiencing these really problematic experiences. Your patience and willingness to help us help the community is really an incredible sign of maturity that all of us at MongoDB appreciate.

-Andrew (SVP Cloud Products)
(we will reach out separately via email)


Hi @Andrew_Davidson - thank you very much for your response (and the separate email).

To answer your question about the data integrity: I don’t think I lost any data; I think the duplicates came about from the restore process restarting - there were way too many records for the application to have generated them in that time, and some of the timestamps went back over a year. My assumption is that the restore process takes place before the (unique) indexes are added, and restarting that process multiple times can cause duplication.

Thanks once again for reaching out, it is much appreciated.


Hi @Andrew_Davidson

Thanks for officially answering and no problem at all, this is what forums are for 🙂

It is exactly as you said: my issue was with the oplog and, as a result, a bigger issue with client reset, which wasn’t gracefully handled on our end.
I’m not sure what the client reset problem was, as we have the same code as in the complete sample provided in the iOS documentation. But I heard there is an upcoming SDK release to improve client reset internally 🤞

