Storing (possibly millions of) comments in a single collection?

Florian_Walther · August 27, 2022, 5:46am

I’ve tried googling this but didn’t find a definitive answer.

I’m implementing a comment system on my website using MongoDB. My plan is to store all comments in a single collection. The schema looks like this:

export interface ResourceComment {
    comment: string,
    user: mongoose.Types.ObjectId,
    resourceId: mongoose.Types.ObjectId,
    parentCommentId?: mongoose.Types.ObjectId,
    replies?: mongoose.Types.DocumentArray<ResourceComment>, // fetched through lookup
}

const resourceCommentSchema = new mongoose.Schema<ResourceComment>({
    comment: { type: String, required: true }, // plain text, so no ref here
    user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
    resourceId: { type: mongoose.Schema.Types.ObjectId, required: true },
    parentCommentId: { type: mongoose.Schema.Types.ObjectId },
}, { timestamps: true })

resourceId defines where this comment belongs.

My questions:

Question 1: Is this schema good? Will I be able to query these comments (by resourceId) fast enough even when the collection grows into the millions?

Question 2: I’m also planning to add comments to my blog posts. Instead of the resourceId the comment belongs to, I would need an identifier for the specific post. Should I use the same schema, add another blogPostId field, and make both blogPostId and resourceId optional? Or should I create a separate model + collection? The rest of the fields are exactly the same and I want to avoid unnecessary duplication.

kevinadi · August 30, 2022, 2:29am

Hi @Florian_Walther

My plan is to store all comments in a single collection

I think this should be fine. It’s probably a better option than embedding the comments as an array of subdocuments inside e.g. a “post” document: if a post attracts a lot of comments, the “post” document keeps growing until it runs into MongoDB’s 16 MB document size limit, which is probably not what you want.

Question 1: Is this schema good? Will I be able to query these comments (by resourceId) fast enough even when the collection grows into the millions?

Well, “good” is relative. I believe that as long as the collection is indexed properly (see Create Indexes to Support Your Queries) and the working set fits in RAM, it should be fast enough. Of course, this also depends on the hardware spec and on whether the hardware can handle the workload.
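
As a rough sketch of what “indexed properly” could mean here, assuming comments are mostly listed per resource, newest first: the key pattern and the matching query shape might look like the following. The names commentsByResourceIndex and buildCommentsQuery are hypothetical, and plain strings stand in for ObjectIds so the snippet runs without a database.

```typescript
// Sort direction as used in MongoDB index key patterns: 1 = ascending, -1 = descending.
type Direction = 1 | -1;

// Hypothetical compound index for "comments of one resource under one parent, newest first".
// With Mongoose this key pattern could be registered via:
//   resourceCommentSchema.index(commentsByResourceIndex)
const commentsByResourceIndex: Record<string, Direction> = {
    resourceId: 1,       // equality match on the owning resource
    parentCommentId: 1,  // equality match: null selects top-level comments
    createdAt: -1,       // createdAt is added by the { timestamps: true } option
};

// Builds a filter and sort whose fields line up with the index above.
// A filter of { parentCommentId: null } matches documents where the field is null or missing.
function buildCommentsQuery(resourceId: string, parentCommentId: string | null = null) {
    return {
        filter: { resourceId, parentCommentId },
        sort: { createdAt: -1 as Direction },
    };
}

// Usage (assumed model name): ResourceCommentModel.find(q.filter).sort(q.sort)
const q = buildCommentsQuery('630a0c1e2f8b9a0012345678');
console.log(q.filter, q.sort);
```

Whether this particular index is the right one depends on the queries you actually run, so it is worth checking the query plan with explain() against realistic data.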

Question 2: I’m also planning to add comments to my blog posts. Instead of the resourceId the comment belongs to, I would need an identifier for the specific post.

To me that use case doesn’t sound too different from the first one you mentioned. If it’s serving the same purpose and you’re expecting a similar usage pattern, I don’t see why you can’t reuse the same schema with minor modifications.

Obligatory caveat: I’m not 100% familiar with the use case you have in mind, so these are just general opinions on my part. Before committing to any one solution, I’d recommend simulating the workload first to see whether the design works.

Best regards
Kevin

Florian_Walther · August 30, 2022, 2:33pm

Thank you for your answer!

Just to clarify, do you recommend putting resource and blog post comments into the same collection or keeping them separate? The only difference in the schema is that one needs a resource id and the other a blog post id to specify where the comment belongs.

Now that I think about it, maybe I don’t even need different names for that field since they’re both ObjectIds? Maybe I can just give it a generic name (which resourceId already kinda is) and use it for both resource and blog post ids.
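
A minimal sketch of that idea, under assumptions: the field names targetId and targetType are hypothetical (not part of the schema above), and the targetType discriminator is an optional extra that only matters if queries need to tell resource comments and blog post comments apart. Strings stand in for ObjectIds.

```typescript
// Hypothetical discriminator for what a comment is attached to.
type CommentTargetType = 'resource' | 'blogPost';

// Same shape as ResourceComment, but with one generic reference field.
interface GenericComment {
    comment: string;
    user: string;                  // stand-in for an ObjectId
    targetId: string;              // id of the resource OR the blog post
    targetType: CommentTargetType; // assumed extra field, see note above
    parentCommentId?: string;
}

// In-memory equivalent of a find({ targetType, targetId }) query.
function commentsFor(
    all: GenericComment[],
    targetType: CommentTargetType,
    targetId: string,
): GenericComment[] {
    return all.filter(c => c.targetType === targetType && c.targetId === targetId);
}

const sample: GenericComment[] = [
    { comment: 'Nice resource', user: 'u1', targetId: 'r1', targetType: 'resource' },
    { comment: 'Great post', user: 'u2', targetId: 'p1', targetType: 'blogPost' },
];
console.log(commentsFor(sample, 'blogPost', 'p1').length);
```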

kevinadi · August 31, 2022, 12:34am

I think this is reasonable. However I’d like to point out that with regard to the _id field, ObjectId is just the default auto-generated value that is unlikely to be duplicated. If you need to, you can use a custom _id field (and thus would perhaps create a more informative reference in the resourceId field).

Using a custom _id field would be an advantage for some applications: if you know the primary key for a collection and have a method to generate it, your app won’t be able to insert two identical documents, since the _id field is uniquely indexed.
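
To illustrate, here is a small sketch that does not talk to MongoDB at all: a Map plays the role of the unique _id index, and makeResourceId is a hypothetical way of deriving a readable key. In a real collection the second insert would fail with a duplicate key error (code 11000) instead of the Error thrown here.

```typescript
// In-memory stand-in for a collection's unique _id index (illustration only).
class InMemoryCollection<T extends { _id: string }> {
    private docs = new Map<string, T>();

    insert(doc: T): void {
        if (this.docs.has(doc._id)) {
            // MongoDB would reject this insert with a duplicate key error (code 11000).
            throw new Error(`duplicate key: ${doc._id}`);
        }
        this.docs.set(doc._id, doc);
    }

    get size(): number {
        return this.docs.size;
    }
}

// Hypothetical deterministic key, e.g. "blogPost:my-first-post".
function makeResourceId(kind: string, slug: string): string {
    return `${kind}:${slug}`;
}

const resources = new InMemoryCollection<{ _id: string; title: string }>();
resources.insert({ _id: makeResourceId('blogPost', 'my-first-post'), title: 'My first post' });

try {
    // Same derived key, so the "unique index" refuses the second document.
    resources.insert({ _id: makeResourceId('blogPost', 'my-first-post'), title: 'Copy' });
} catch (err) {
    console.log((err as Error).message);
}
```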

Best regards
Kevin

Florian_Walther · August 31, 2022, 5:30am

Thank you for the explanation!