Data duplication in MongoDB

Hi there,

I do not want to compare MongoDB to relational databases, but I have a question regarding data duplication in collections. In tables we have keys and values (columns and rows), whereas in MongoDB the keys are duplicated in every document of the collection! I know the rule: “data that is related to each other must be stored together”. Is this duplication really necessary? (I don’t see another way to do it in MongoDB.) Another question: if we reference a document (by _id) inside another document, how can we query some specific fields of that referenced document in one query? Thanks

Hi @Soufian_Ta and welcome to the MongoDB Community :muscle: !

For the last part of your question, I think you are referring to the equivalent of a “join” SQL operation. The way to do that with MongoDB is with the $lookup aggregation stage.
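
As a rough illustration of the $lookup approach, here is a minimal sketch of an aggregation pipeline written as plain Python data. The collection names (`posts`, `authors`) and the field names (`author_id`, `slug`, `name`, `thumbnail`) are assumptions made up for this example, not something taken from your schema. With a driver such as PyMongo, a pipeline like this would typically be passed to `collection.aggregate(pipeline)`.

```python
# Hypothetical schema: posts.author_id references authors._id.
pipeline = [
    # Select the one post we want to display (the slug value is illustrative).
    {"$match": {"slug": "my-first-post"}},
    # "Join" the referenced author document into an array field named "author".
    {"$lookup": {
        "from": "authors",
        "localField": "author_id",
        "foreignField": "_id",
        "as": "author",
    }},
    # $lookup produces an array; unwind it into a single sub-document.
    {"$unwind": "$author"},
    # Keep only the fields the page needs, including specific author fields.
    {"$project": {
        "_id": 0,
        "title": 1,
        "content": 1,
        "author.name": 1,
        "author.thumbnail": 1,
    }},
]
```

The final `$project` stage is what answers the "specific fields" part of the question: it trims the joined author down to the fields you actually want returned.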

We usually recommend avoiding this kind of operation because it is more resource-intensive than a “standard” access to a single document.

Let’s take an example. Let’s say you have a blog like Learn - MongoDB Developer Hub.
On each page of the blog, you present the title of the blog post, the content, the number of likes, the author’s name and thumbnail, etc. In SQL, the author would be stored in another table. Same for MongoDB actually: the author would live in its own collection, because you probably don’t want to duplicate the same data everywhere. But you don’t have to duplicate EVERYTHING from the author collection to serve that page.

On a blog post page, you just need the author’s name and thumbnail. But if you visit the author’s profile page in the blog, you will also find their GitHub account, LinkedIn, date of birth, bio, etc. These fields aren’t served on the blog post page, only on the author’s profile page.

By duplicating just a small portion of the data (which, on top of that, is very unlikely to change), you remove the need for a $lookup query and you can serve the blog post with a single, simple query that uses an index on the URL slug.
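
To make this a bit more concrete, here is a small sketch in plain Python of what such a blog post document could look like. Every field name and value here is hypothetical, and the dictionary keyed by slug is only a stand-in for an index on the slug field, not real driver code.

```python
# Hypothetical author document with the full profile.
author = {
    "_id": 1,
    "name": "Jane Doe",
    "thumbnail": "/img/jane.png",
    "github": "janedoe",
    "bio": "Writes about databases.",
}


def make_post(slug, title, content, author):
    """Build a post that embeds only the author fields the page displays."""
    return {
        "slug": slug,
        "title": title,
        "content": content,
        "likes": 0,
        "author": {
            "_id": author["_id"],              # reference kept for the profile page
            "name": author["name"],            # duplicated, rarely changes
            "thumbnail": author["thumbnail"],  # duplicated, rarely changes
        },
    }


post = make_post("my-first-post", "My first post", "Hello!", author)

# Stand-in for an index on "slug": one lookup returns everything the page needs.
posts_by_slug = {post["slug"]: post}
page = posts_by_slug["my-first-post"]
print(page["title"], "by", page["author"]["name"])
```

The profile-only fields (`github`, `bio`) stay in the author document, so only the small, stable subset is duplicated.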

I hope it makes some sort of sense :sweat_smile: !

If you want to learn more about data modeling, I recommend reading these blog posts and learning more about the design patterns. We have an entire series in the MongoDB DevHub.

We also have an entire training course about this on the MongoDB University platform.

Cheers,
Maxime.

Hi Maxime,

Thank you for your quick answer !

I’m trying to get the real point of using MongoDB instead of PostgreSQL, for example! I also have a question about authentication in a sharded environment. I know that authentication is managed by the config server, right? What if the config server goes down for whatever reason? Would we still be able to access the sharded nodes, or would we lose access to the database via the mongo shell? The data should still be there on the servers, but we couldn’t connect to mongo on those servers, right? Thanks

Apologies for the horrible response time. I got a little busy and I needed some time to find the right answer!

I found this article, which explains very well the 2 kinds of authentication that can be used in a sharded cluster.

  • The cluster authentication should be the only one you really need to access the cluster and perform daily queries and operations on the sharded cluster.
  • The local authentication mechanism still exists in a sharded cluster, though, because each shard is actually a “normal” Replica Set which also stores its own local users & privileges. This method should only be used by DBAs and system admins to perform administrative tasks.

The config servers are always a Replica Set (RS), so you have High Availability. If the Primary node in the config servers RS fails, the 2 secondaries will elect a new Primary and you can continue business as usual. There is no SPOF (single point of failure) in a MongoDB sharded cluster.
In a sharded cluster, you always use at least 3 nodes to build the config servers RS.

So what happens if:

  • one of the secondary config servers goes down => nothing. The primary keeps doing its job.
  • the primary goes down => The 2 other nodes will perform an election and elect a new Primary => Read and write operations can continue as usual in the entire sharded cluster.
  • at least 2 nodes go down => The RS can’t reach a majority of its nodes anymore => The last node can’t elect a new Primary and stays secondary (read only). I guess (but I’m not sure) that read / write operations can continue on the shards, but chunk splits & balancing operations can’t happen. This would require immediate attention: restart the 2 dead nodes in the config RS.
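
The three scenarios above all come down to one majority check, which can be sketched in a few lines of Python. This is a simplified model I am assuming for illustration (every node counts as one voting member; priorities, arbiters and network partitions are ignored), and `can_elect_primary` is a hypothetical helper name, not a MongoDB API.

```python
def can_elect_primary(voting_members: int, reachable_members: int) -> bool:
    """Return True if the reachable members form a strict majority."""
    majority = voting_members // 2 + 1
    return reachable_members >= majority


# The scenarios for a 3-node config server replica set.
for down in (0, 1, 2):
    up = 3 - down
    print(f"{down} node(s) down -> primary possible: {can_elect_primary(3, up)}")
```

With 3 nodes the majority is 2, which is why losing one node is harmless and losing two leaves the last node stuck as a secondary.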

I hope I answered your question and clarified how the authentication mechanisms work in MongoDB.

If you want to learn more about RS and sharding quickly, I would recommend the free M103 training on MongoDB University.

Cheers,
Maxime.