Regularly transferring data from one database to another

Hello there!

I’m using Mongo Cloud Altas to host production and staging databases which are both M10 instances.

I have written a node script that retrieves a subset of the production database, anonymises the user data and writes it to the staging database. Up until now this has been adequate for my needs.

However, we’ve had some issues with migrations that are falling over with larger datasets so we want to be able to run them against a full production dataset on our staging server so that we can have a high degree of confidence in our migrations when we run them in production.

Because we have scheduled backups for our production database we can run these against our staging database using the admin UI of the Atlas web interface. Aside from it being a little too easy to accidentally restore the production database, this works fairly well except that:

  1. Any backup made on staging after restoring from a production snapshot fails, creating a fallover snapshot which is 3x larger in size that the production snapshot used to restore the database.

  2. Restoring the data from the production snapshot means that the user data on staging isn’t anonymised.

Ideally we would like to be able to copy the data from prod to staging on a regular interval (at say, midnight on a Sunday for example), so that we always have a decent set of data on staging for QA and testing purposes, ideally anonymised. My thinking behind this is to ensure we are GDPR complient (although I’m not 100% convinced this is necessary).

I have investigated mongodump however I have concerns about running this against the live database due to performance concerns:

When connected to a MongoDB instance, mongodump can adversely affect mongod performance. If your data is larger than system memory, the queries will push the working set out of memory, causing page faults.

Since Atlas takes regular snapshots for us it would probably make sense to use these instead. One idea I had was write a cron task to download a snapshot, unarchive it and use mongorestore to restore the staging database with the production data. This seems reasonable, however the staging server would have to write the contents of the archive to disk which would use up quite a lot of space and potentially memory.

Another thought would be to use the Atlas CLI to schedule a restore from prod to staging using a cron job, which also seems plausible. I am a little worried about developing such a script in case during development I accidentally restore the live database, but perhaps it’s worth the risk?

So I guess my questions to the community are:

  1. Has anyone else found themselves in this position, and if so what solution did you come up with in the end?
  2. Is there a way to schedule a restore on staging using a prod snapshot on Atlas already?
  3. Does restoring one database from a snapshot of another cause issues with future backups on the second database, and if so what can be done to mitigate this?
  4. Should I be concerned about having live user data (including sensitive data like email addresses) on a staging server?

Many thanks for your time!

Hey Mike,

Nice to meet you.

This seems like a very interesting use case. I have a couple ideas about how you might be able to rearchitect this process to be a bit more full proof. Have you looked at our Atlas Data Federation and Atlas Data Lake offerings at all?

If you’d be interested in chatting, put some time on my calendar here and I can step you through what I’m thinking to see if it’s a good fit. Calendly - Benjamin Flast

Best,
Ben

Thanks! I’ve booked in a slot with you :+1: