Regularly transferring data from one database to another

Hello there!

I’m using MongoDB Cloud Atlas to host production and staging databases, which are both M10 clusters.

I have written a node script that retrieves a subset of the production database, anonymises the user data and writes it to the staging database. Up until now this has been adequate for my needs.
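For context, the current script does something roughly along these lines (heavily simplified; the database, collection and field names here are made up rather than our real schema):

```js
// Rough sketch of the existing approach: copy a subset of prod, anonymise it,
// then write it to staging. Names and the subset filter are illustrative only.
const { MongoClient } = require('mongodb');
const { randomUUID } = require('crypto');

async function copySubsetToStaging() {
  const prod = await MongoClient.connect(process.env.PROD_URI);
  const staging = await MongoClient.connect(process.env.STAGING_URI);

  try {
    // Pull a recent subset rather than the whole collection.
    const users = await prod
      .db('app')
      .collection('users')
      .find({ createdAt: { $gte: new Date(Date.now() - 30 * 24 * 3600 * 1000) } })
      .toArray();

    // Anonymise anything personally identifiable before it leaves production.
    const anonymised = users.map((u) => ({
      ...u,
      email: `user-${randomUUID()}@example.com`,
      name: 'Test User',
    }));

    const target = staging.db('app').collection('users');
    await target.deleteMany({});
    if (anonymised.length > 0) {
      await target.insertMany(anonymised);
    }
  } finally {
    await prod.close();
    await staging.close();
  }
}

copySubsetToStaging().catch((err) => {
  console.error(err);
  process.exit(1);
});
```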

However, we’ve had some issues with migrations falling over on larger datasets, so we want to be able to run them against a full production dataset on our staging cluster. That way we can have a high degree of confidence in our migrations before we run them in production.

Because we have scheduled backups of our production database, we can restore these to our staging database using the Atlas web UI. Aside from it being a little too easy to accidentally restore over the production database, this works fairly well, except that:

  1. Any backup made on staging after restoring from a production snapshot fails, creating a fallback snapshot which is 3x larger than the production snapshot used to restore the database.

  2. Restoring the data from the production snapshot means that the user data on staging isn’t anonymised.

Ideally we would like to copy the data from prod to staging at a regular interval (say, midnight on a Sunday), so that we always have a decent, ideally anonymised, set of data on staging for QA and testing purposes. My thinking behind the anonymisation is to ensure we are GDPR compliant (although I’m not 100% convinced this is necessary).

I have investigated mongodump; however, I have concerns about running it against the live database because of the performance implications noted in the docs:

When connected to a MongoDB instance, mongodump can adversely affect mongod performance. If your data is larger than system memory, the queries will push the working set out of memory, causing page faults.

Since Atlas takes regular snapshots for us, it would probably make sense to use these instead. One idea I had was to write a cron task to download a snapshot, unarchive it, and use mongorestore to restore the staging database with the production data. This seems reasonable; however, the staging server would have to write the contents of the archive to disk, which would use up quite a lot of space and potentially memory.

Another thought would be to use the Atlas CLI to schedule a restore from prod to staging with a cron job, which also seems plausible. I am a little worried that while developing such a script I might accidentally restore over the live database, but perhaps it’s worth the risk?
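Something along these lines is what I had in mind: a rough, cron-able Node sketch that shells out to the Atlas CLI and hard-codes a guard so it can never target the production cluster. The subcommands, flags and JSON field names below are from memory and would need verifying against `atlas backups snapshots list --help` and `atlas backups restores start --help`.

```js
// Rough sketch only: assumes the Atlas CLI is installed and authenticated, and
// that the subcommands/flags/JSON fields below match your CLI version.
const { execFileSync } = require('child_process');

const PROJECT_ID = process.env.ATLAS_PROJECT_ID;
const SOURCE_CLUSTER = 'production'; // placeholder cluster names
const TARGET_CLUSTER = 'staging';
const ALLOWED_TARGETS = ['staging'];

// Hard guard: refuse to run if the target is ever anything but staging.
if (!ALLOWED_TARGETS.includes(TARGET_CLUSTER) || TARGET_CLUSTER === SOURCE_CLUSTER) {
  throw new Error(`Refusing to restore into cluster "${TARGET_CLUSTER}"`);
}

// Find the most recent snapshot of the production cluster.
const snapshots = JSON.parse(
  execFileSync('atlas', [
    'backups', 'snapshots', 'list', SOURCE_CLUSTER,
    '--projectId', PROJECT_ID,
    '--output', 'json',
  ]).toString()
);
const latest = snapshots.results
  .sort((a, b) => new Date(b.createdAt) - new Date(a.createdAt))[0];

// Kick off an automated restore of that snapshot into the staging cluster.
execFileSync('atlas', [
  'backups', 'restores', 'start', 'automated',
  '--clusterName', SOURCE_CLUSTER,
  '--snapshotId', latest.id,
  '--targetClusterName', TARGET_CLUSTER,
  '--targetProjectId', PROJECT_ID,
  '--projectId', PROJECT_ID,
], { stdio: 'inherit' });
```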

So I guess my questions to the community are:

  1. Has anyone else found themselves in this position, and if so what solution did you come up with in the end?
  2. Is there a way to schedule a restore on staging using a prod snapshot on Atlas already?
  3. Does restoring one database from a snapshot of another cause issues with future backups on the second database, and if so what can be done to mitigate this?
  4. Should I be concerned about having live user data (including sensitive data like email addresses) on a staging server?

Many thanks for your time!


Hey Mike,

Nice to meet you.

This seems like a very interesting use case. I have a couple of ideas about how you might be able to rearchitect this process to be a bit more foolproof. Have you looked at our Atlas Data Federation and Atlas Data Lake offerings at all?

If you’d be interested in chatting, put some time on my calendar here and I can step you through what I’m thinking to see if it’s a good fit. Calendly - Benjamin Flast

Best,
Ben

Thanks! I’ve booked in a slot with you :+1:

I have a suggestion. Don’t take things like this “offline”. The community site gets no benefit from this type of transaction, where the problem is defined publicly, but the solution is procured/defined privately. ):


Good point @Jay_Eno! I suggested it because I thought a conversation would make it easier to clarify some details. But let me follow up here with some of the things I was thinking about.

Depending on a customer’s needs, they could create a “Federated Database Instance” and then use “$out to Atlas” with a scheduled trigger to copy data from the source cluster specified in the virtual collection to the target: https://www.mongodb.com/docs/atlas/data-federation/supported-unsupported/pipeline/out/

Using the background aggregation option for Data Federation will be key here, so that the trigger’s connection closing does not cancel the query.
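As a rough sketch, the aggregation could look something like the following, run against the Federated Database Instance (for example from mongosh, or the equivalent call from a scheduled trigger function). The project/cluster/database/collection names are placeholders, and the anonymisation stage is just an optional illustration; check the $out to Atlas docs for the exact options available in your setup.

```js
// Run against the Federated Database Instance. The virtual collection "users"
// maps to the production cluster; $out writes the results into staging.
db.runCommand({
  aggregate: "users",
  pipeline: [
    // Optional: anonymise personally identifiable fields on the way through,
    // so the data never lands in staging in its raw form.
    {
      $set: {
        email: { $concat: ["user-", { $toString: "$_id" }, "@example.com"] },
        name: "Test User"
      }
    },
    {
      $out: {
        atlas: {
          projectId: "<staging-project-id>",
          clusterName: "staging",
          db: "app",
          coll: "users"
        }
      }
    }
  ],
  cursor: {},
  background: true // keep running even if the trigger/client connection closes
});
```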

Separately, if consistency as of a specific point in time is important, you could use our new Data Lake service to create a consistent snapshot of your collection based on your backups, and then run the $out to Atlas from that snapshot.

Best,
Ben


Hi! I’m looking to implement the same kind of testing with production data on a staging database. Could you please share what solution you chose in the end, if any? I believe it would be valuable for the community as well.