"Copy" data from one project to an other

michael_hoeller · March 16, 2021, 8:29pm

Hi there,

As a developer in an Atlas ecosystem, I would like to provision a dev database with a subset of a production database that is located in another project, so that I can develop with a small database holding recent data.

Is it possible to write from one project to another? Has anyone done something like this? Or is there a better solution that I have missed?

Regards,
Michael

Pavel_Duchovny · March 17, 2021, 6:05am

Hi @michael_hoeller,

I think we talked about it in the past.

The only built-in options are a complete restore of a backup to another project, or a complete live migration.

For a subset of the data you need to use dump and restore.

There might be a nice trick: use $out to clone any collection you need into a new temporary database, applying a limit or a filter, and then dump and restore only that database (or those databases).
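
A minimal sketch of the $out idea, assuming MongoDB 4.4 or later (needed for the `{ db, coll }` form of `$out`). The helper name `buildSubsetPipeline`, the temporary database name `tmp_subset`, and the `year` filter are illustrative assumptions, not something from this thread:

```javascript
// Sketch only: builds an aggregation pipeline that clones a filtered,
// size-limited subset of a collection into a temporary database via $out.
// `tempDb` and `tempColl` are hypothetical names chosen for illustration.
function buildSubsetPipeline(filter, limit, tempDb, tempColl) {
  return [
    { $match: filter },
    { $limit: limit },
    // The { db, coll } form of $out requires MongoDB 4.4 or later.
    { $out: { db: tempDb, coll: tempColl } },
  ];
}

// Example: keep at most 1000 movies from 2015 onwards.
const pipeline = buildSubsetPipeline(
  { year: { $gte: 2015 } },
  1000,
  "tmp_subset",
  "movies"
);

// In mongosh, connected to the source cluster, this could be run as:
//   db.getSiblingDB("sample_mflix").movies.aggregate(pipeline)
// Afterwards only the temporary database needs to be dumped and restored.
```

Note that $out used this way writes to the same cluster, so the dump and restore step is still needed to move the subset to another project.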

Thanks,
Pavel

michael_hoeller · March 17, 2021, 6:26am

Hello Pavel
[quote="Pavel_Duchovny, post:2, topic:99319"]
Build in is only an option to complete restore of a backup to another project. Or complete live migrate.
For subset of data you need to use dump and restore.
[/quote]
As you mention, this is not an option: it is either a) a full restore or b) a mongodump with a particular filter, which cannot reflect relations.

Yes, it is simple to use $out and $merge, but in my tests you have to stay within the same project, which is not really what you want in a production environment. So we are back to the actual question, which is about Realm and working across projects.

Regards,
Michael

Pavel_Duchovny · March 17, 2021, 8:44am

@michael_hoeller,

Realm cannot link clusters in an app across projects … BUT you gave me an idea.

It's pretty creative and I have to check it, but it might be interesting. I will share more in the coming days once I have built a PoC.

Thanks
Pavel

michael_hoeller · March 17, 2021, 1:37pm

Hello @Pavel_Duchovny

Thanks a lot, I'll stay tuned.

Regards,
Michael

Pavel_Duchovny · March 18, 2021, 10:17am

Hi @michael_hoeller,

OK, so I tailored a solution. I have to say it's not super straightforward, but it's creative and mind-opening for various things.

The key ability I used is that Realm applications can be linked to Atlas Data Lakes.

Now, Atlas Data Lakes can read/write data from/to MongoDB clusters and from/to S3 buckets. Having said that, there is a limitation: the linked Data Lake must be in the same project as the Realm app, and the Atlas cluster linked in that Data Lake must be in the same project as well.

However, the S3 buckets can be shared across Data Lakes.

In my configuration I have the following topology:

PROD

Project “prod” with id : xxxxxxxxxx:

Cluster “prod” with database sample_mflix and collection movies.
A data lake mapped to the prod cluster as an Atlas store.
S3 bucket store “atlas-datalakepavel” mapped to my S3 storage.

My Data Lake configuration on prod maps the Atlas cluster so that our Realm app can read from it and use the $out operator to write to S3.

{
  "databases": [
    {
      "name": "mydb",
      "collections": [
        {
          "name": "movies",
          "dataSources": [
            {
              "collection": "movies",
              "database": "sample_mflix",
              "storeName": "atlasClusterStore"
            }
          ]
        }
      ]
    }
  ],
  "stores": [
    {
      "provider": "s3",
      "bucket": "atlas-datalakepavel",
      "includeTags": false,
      "name": "s3store",
      "region": "us-east-1"
    },
    {
      "provider": "atlas",
      "clusterName": "prod",
      "name": "atlasClusterStore",
      "projectId": "xxxxxxxx"
    }
  ]
}

My webhook in the prod project's Realm app is linked to this data lake and can therefore write a sample of 10 movies to my S3 store:

// This function is the webhook's request handler.
exports = async function(payload, response) {
  // "data-lake" is the name of the Data Lake service linked to this Realm app.
  const movies = context.services.get("data-lake").db("mydb").collection("movies");

  // Take a random sample of 10 movies and write them to the S3 bucket as JSON.
  await movies.aggregate([
    { "$sample": { "size": 10 } },
    {
      "$out": {
        "s3": {
          "bucket": "atlas-datalakepavel",
          "region": "us-east-1",
          "filename": "10movies/movies",
          "format": { "name": "json" }
        }
      }
    }
  ]).toArray();
};

Once I execute it via curl, a trigger, or any HTTP hook, it creates my source files in the S3 bucket.
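
The webhook above hardcodes the sample size. As a hedged sketch, the pipeline could be built from a value supplied by the caller instead. The helper names `buildS3ExportPipeline` and `parseSampleSize` are hypothetical, and reading the value from a `size` query parameter (for example via `payload.query.size` inside the handler) is an assumption about how the webhook would be called:

```javascript
// Sketch only: the same S3 export as in the webhook, with the sample size
// and target file name passed in. Bucket and region are the ones from the
// post; the function names are illustrative assumptions.
function buildS3ExportPipeline(sampleSize, filename) {
  return [
    { $sample: { size: sampleSize } },
    {
      $out: {
        s3: {
          bucket: "atlas-datalakepavel",
          region: "us-east-1",
          filename: filename,
          format: { name: "json" },
        },
      },
    },
  ];
}

// Parse a query-string value such as "25"; fall back to a default when the
// value is missing, not a number, or not positive.
function parseSampleSize(raw, fallback = 10) {
  const n = Number.parseInt(raw, 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Inside the handler this might be used as (hypothetical parameter name):
//   const size = parseSampleSize(payload.query.size);
//   await movies.aggregate(buildS3ExportPipeline(size, "10movies/movies")).toArray();
const exportPipeline = buildS3ExportPipeline(parseSampleSize("25"), "10movies/movies");
```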

DEV

Project “dev” with id : yyyyyyyyyyy:

Cluster “dev” with database “dev” and collection movies.
A data lake mapped to the dev cluster as an Atlas store.
S3 bucket store “atlas-datalakepavel” mapped to my S3 storage.

Now I map the data lake so that the exported S3 files are exposed as the devMovies collection:

{
  "databases": [
    {
      "name": "mydb",
      "collections": [
        {
          "name": "devMovies",
          "dataSources": [
            {
              "path": "10movies/*",
              "storeName": "s3store"
            }
          ]
        }
      ]
    }
  ],
  "stores": [
    {
      "provider": "s3",
      "bucket": "atlas-datalakepavel",
      "includeTags": false,
      "name": "s3store",
      "region": "us-east-1"
    },
    {
      "provider": "atlas",
      "clusterName": "dev",
      "name": "atlasClusterStore",
      "projectId": "yyyyyyy"
    }
  ]
}

Now I can do the opposite, an import, via a webhook in the dev Realm app connected to this data lake:

// This function is the webhook's request handler.
exports = async function(payload, response) {
  // "devMovies" is the data lake collection backed by the files under 10movies/ in S3.
  const movies = context.services.get("data-lake").db("mydb").collection("devMovies");

  // Write the documents read from S3 into the dev cluster's dev.movies collection.
  await movies.aggregate([
    {
      "$out": {
        "atlas": {
          "projectId": "yyyyyyyyyy",
          "clusterName": "dev",
          "db": "dev",
          "coll": "movies"
        }
      }
    }
  ]).toArray();
};

Once I do that, I get the needed data in my dev cluster.

Here you can be as flexible as you want: the aggregation can transfer data with custom logic, lookups, and many other options.
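
As a hedged illustration of that flexibility, the prod-side export could select recent movies and embed their related comments before writing to S3, so that relations travel with the subset. This is a sketch under assumptions: the `comments` collection would also have to be mapped in the prod data lake, `$lookup` would have to be available there, the field names (`released`, `movie_id`) follow the sample_mflix dataset, and `buildRecentMoviesExport` is a hypothetical helper name:

```javascript
// Sketch only: export the most recent movies together with their comments.
// Assumptions: "comments" is mapped in the data lake next to "movies", and
// the field names follow the sample_mflix dataset.
function buildRecentMoviesExport(cutoff, maxDocs) {
  return [
    // Keep only movies released on or after the cutoff date.
    { $match: { released: { $gte: cutoff } } },
    // Newest first, then cap the size of the subset.
    { $sort: { released: -1 } },
    { $limit: maxDocs },
    // Embed the related comments so the relation is part of the export.
    {
      $lookup: {
        from: "comments",
        localField: "_id",
        foreignField: "movie_id",
        as: "comments",
      },
    },
    // Same S3 target as in the webhook from the post.
    {
      $out: {
        s3: {
          bucket: "atlas-datalakepavel",
          region: "us-east-1",
          filename: "10movies/movies",
          format: { name: "json" },
        },
      },
    },
  ];
}

const recentPipeline = buildRecentMoviesExport(new Date("2015-01-01T00:00:00Z"), 100);
```

The resulting array could be passed to `movies.aggregate(...)` in the prod webhook in place of the `$sample` pipeline.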

I know this is a lot to digest but I hope it might help.

Thanks,
Pavel

michael_hoeller · March 18, 2021, 8:30pm

Hello @Pavel_Duchovny
Thank you very much! This fits perfectly. My initial posting was sparse on details about how to create a subset,
since I didn’t want to overload the question. Your suggestion goes further and provides an aggregation where I can add the logic to get the subset.
Thanks again!
Michael
