Use map vs array of embedded documents?

Hasan_Kumar · July 26, 2021, 7:04pm

I am using mongodb to store one-to-many relationship (similar to https://docs.mongodb.com/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/)

Say a collection of blog posts with comments embedded in it. Which of the below approaches is recommended?

Query Patttern:

For a given blog post, return comments with ids in a given array (around 1000 per second)

Write Pattern:

Update/insert a comment with given id (around 100 per second)

Note:

Postgres is the primary database. I am using mongo as a sort of cache to return the serialized data faster
A blog post has max 5000 comments today. So it will fit within the 16MB document limit. Most posts have around 10 comments.

Approach 1: (array of embedded documents)

    {
      _id: 23, // postId in postgres
      comments: [
         {
           _id: 34, // commentId in postgres
           text: "This is comment 34",
         }, 
         {
           _id: 67,
           text: "This is comment 67",
         }, 
      ]
    }

Approach 2: (map of embedded documents)

    {
      _id: 23, // postId in postgres
      comments: {
         34: {
           _id: 34, // commentId in postgres
           text: "This is comment 34",
         }, 
         67: {
           _id: 67,
           text: "This is comment 67",
         }, 
      }
    }

Takis · July 26, 2021, 9:18pm

The normal way for mongodb is to store data(unknown values) to arrays,
and keys to be metadata(the known schema)

The reasons for that is
1)unknown fields cannot be indexed
in your first approach if you want to index the comment_id,you can do it with
a multikey index,but in second how to make index on comment_id?
2)Dynamic embeded documents are not well supported
For example we cant do get($$k)/remove($$k)/add($$k,$$v) in a fast way.
(if key is a mongodb variable)

If we have constant keys,meaning that we know them at the time we send the
query(not keys in mongodb variables),we can use pipeline stage operators,or even
construct, query parts,but those are weird methods.

Query pattern
For a given blog post, return comments with ids in a given array (around 1000 per second)

Query patttern,first approach
1)find post
2)filter comments you need($filter on the comments map)
*if you have like 10 average it will be very fast

Query pattern,second approach
1)find post
2)how you get the comments?
you contruct on driver something like the bellow,and you send the query after?

 {"$project" {comments.100  1
              comments.20 1}
              ....
              }

This is weird way

I cant be sure the first is always the best,but i think in mongodb,data should go
into arrays,and schema should go into keys.

Hasan_Kumar · July 27, 2021, 3:25am

How to make write pattern (Update/insert a comment with given id) work in case of array of embedded dcouments?
Given I always fetch comments by id, I am assuming Approach 2 will be more efficient than Approach 1 as I don’t have to search through all documents (for large documents)

Takis · July 27, 2021, 9:25pm

Its easy to insert/or update a member(here a document) into an array.

Query operators
Insert new → $push
Update existing → update a member

Or you can do a pipeline update
Insert new → $concat
Update existing → $map

You dont need a pipeline update really,its for more complicated updates,query operators are fine.
See also this

I guess you know the post_id parent,of the comment that will be updated/insert,make an index
on post_id,and it will be very fast,you dont have like too many comments/post.

If you dont know the post_id and you have only the comment_id that you want to update,you
can make a multikey index on comments._id,to find the post document fast.

Hasan_Kumar · July 28, 2021, 5:58pm

From what I read from MongoDB - Safely Upserting Subdocuments Into Arrays there is no single operator to atomically upsert into array of embedded documents.
Is it still the case?

I am worried about ending up with duplicate comments if two parallel threads try to insert($push) simulataneously

Takis · July 28, 2021, 7:16pm

MongoDB updates in 1 document they are atomic,you dont have to worry about anything.
(all the changes that you do in 1 document,with 1 update,will be done all or nothing)

MongoDB after >= 4.0 , we are now in MongoDB 5 supports trascactions so you dont have
to worry about multi-document operations also.

Also mongodb supports left joins with $lookup >= 3.2.

This presentation is probably older,but the last slide shows how to push into an array,
with safely.You dont need transactions here or $lookup.

See the above links also.
If you are new to MongoDB see the mongodb uni
To see all the videos you just enroll in the course,its free,and there is no need to pass the course,you can re-take the course if you want another time.

Hasan_Kumar · July 29, 2021, 6:20am

I was not in favour of using the solution in presentation as it involves making multiple queries for making an upsert. Was wondering if there is a native way to do that in a single statement without encountering race conditions.

I understand changes to a document are atomic in general. But was worried about ending up with two simultaneous $push resulting in two duplicate subdocuments in array.

The stack overflow answer linked also does it in two steps (first steps removes the record from array if it exists and second step inserts). But still feel it is prone to race conditions.
The second solution does something similar but uses the aggregation pipeline. Do you know what lock is acquired by the read statement in a pipeline? i.e, if two servers have sent the same query to mongodb, does mongodb serialize the full pipeline or just the write parts of the pipeline?

Takis · July 29, 2021, 4:27pm

Hello

When you send 1 query,you can change many things in the document and change will be
atomic.MongoDB offers pipeline updates also,that can do all sorts of complicated updates
in 1 document.(you can send multiple queries also,but you need transactions to make them atomic)

For an example of how to update an array with members documents see this query

Its atomic and its fast.It checks if new member => update,else add in the end of the array.
In your case if you already know if new or existing comment,you can make it even faster,
and avoid the check.

Ryan_Murphy · January 22, 2023, 4:41pm

Hey Hasan. I’m running into the same thing now re: deciding between structuring as nested objects vs. arrays. What did you end up going with? What do you recommend after dealing with this?

mart_q · June 22, 2023, 2:26am

After considering both options thoroughly, we decided to using map, because there is no Safe Way to Upsert a series of objects without causing duplications using array