Querying multiple mongoDB servers to do a query on a collection

Kim_Ng · June 17, 2020, 7:51am

I’m going to be attempting to set up queries to multiple mongoDB servers to query a collection which has different values but the same schema and was wondering if anyone has done this or if there’s some standard method for approaching this.

Our use case is that I’m working with health data and the model for keeping the data secured is by having a private DB and a public DB for each institute in our consortium. This way each member can control what data is shown publicly to the other consortium members due to data sensitivity. This means that the data will be structured in the same way on each system but contain different data. So what we need to do then is query the DBs and work with the collated results.

What we we’re thinking of doing is opening a connection to each DB, preform the query, and then collate the results for additional processing. Is their a way that mongoDB has that would handle this for us? (ie provide multiple connections and then it handles the query and collation) or do we have to do it on our end?

Justin · June 17, 2020, 11:27pm

Hi @Kim_Ng,

This is essentially what happens with sharding. A proxy layer (mongos) sits between the application and MongoDB replica sets, then when a query is performed it will execute and collate the results.

You can cleverly pick a shard key that distributes the data to whichever shard (or replica set) where you’d like to store the data.

Hope that helps!

Justin

PS. Field level encryption may be useful to you. I suggest taking a look: https://docs.mongodb.com/drivers/use-cases/client-side-field-level-encryption-guide

Kim_Ng · June 18, 2020, 12:22pm

Hi @Justin,

Thanks for the reply.

I had a look at sharding before and it was my impression that it probably didn’t work for our use case as sharding applied to collections accross servers is read/writable to all who have access on that collection. In our case we want to ensure data control for public and private data within each institute. So only that institute would modify it’s own public data. So if we had a collection called samples then in our current understanding we’d have:

Server: instutute_1_private Collection: samples
Server: instutute_1_public Collection: samples (where this is a subset of the other one and potentially have filtered fields)
Server: instutute_2_private Collection: samples
Server: instutute_2_public Collection: samples

Query for samples at institute 1: samples against servers instutute_1_private and instutute_2_public
Query for samples at institute 2: samples against servers instutute_2_private and instutute_1_public

My colleague thinks we may be able to get around this if we utilize both union and sharding so that full data control would still be in place. Following the idea on the last example we’d have:

Server: instutute_1 private shard Collection: samples_private
Server: instutute_1 public shard Collection: samples_public_institute_1
Server: instutute_2 private shard Collection: samples_private
Server: instutute_2 public shard Collection: samples_public_institute_2

Query for samples at institute 1: samples_private union samples_public_institute_2 against servers instutute_1 and instutute_2
Query for samples at institute 1: samples_private union samples_public_institute_1 against servers instutute_1 and instutute_2

If this seems like it may be possible but requires more in depth information perhaps we can also be put into contact with a paid consultant to flesh out the details of this viability of this.

As for the field level encryption we’re looking into that as well as a way to layer on additional data protection, I think it’ll be very useful.

Justin · June 22, 2020, 9:40pm

Sharding doesn’t change the access strategies and restrictions placed on your underlying collections by the application. We can definitely dive into this more if you’re interested in a paid consulting engagement. I’m a consulting engineer and we work with requirements like this often.

Have you worked with a MongoDB Account Executive in the past? He/she can help look at consulting options.

Thanks,

Justin

Kim_Ng · June 24, 2020, 12:52pm

Hi @Justin,

I ran it by my two fellow devs on the project and with the data controls we want in place it makes sense for us to pursue the first strategy and not the sharded approach (namely to limit access with DB credentials). We’ll likely also put an API to control the queries and what can affect the DBs more so.

With this we don’t plan on contacting a consultant currently but may look into it again in the future (haven’t done it before). Thanks for this info and I’ll google how to contact a MongoDB Account Executive when we need in the future.