What would be the best way to model the data for a forum? Including questions, answers and votes

Hi everyone!
I’m new to MongoDB and would like some help modeling my forum data. The main things I’m currently trying to model are the questions, answers, and votes (I also have a users collection, but I’ve dealt with it already; it was pretty easy). My website is very similar to Reddit or StackOverflow. Each question has a title, a description, a creator, a flag for whether the creator is anonymous, a creation time, and tags. Each answer has content, a creator, the same anonymity flag, and a creation time. Every answer is also linked to a question and possibly to another answer (if it’s a reply to another answer). I also want both the questions and answers to have voting (upvotes and downvotes). If I just embedded everything, it would look something like this:

questions collection:
 - creator: userId
 - createdAt: Date
 - isAnon: boolean
 - title: string
 - description: string
 - tags: string[]
 - answers: AnswerSchema[]
 - upvotes: userId[]
 - downvotes: userId[]

AnswerSchema:
 - creator: userId
 - createdAt: Date
 - isAnon: boolean
 - content: string
 - replies: AnswerSchema[]
 - upvotes: userId[]
 - downvotes: userId[]

That doesn’t seem like a good idea because, even though embedding is considered the better approach most of the time, it limits how much data I can store, since a single MongoDB document is capped at 16 MB. I won’t have many answers/votes in the beginning, but what if my website grows and has a lot of data?
I thought of just doing everything with referencing, so I’ll have four collections: one for questions, one for answers (with a question id field to reference the question, and an answer id field for when it’s a reply), and two for votes (connecting an answer/question to a user, with another field for whether it’s a downvote or an upvote). I would then also add upvotesCount and downvotesCount fields to the questions and answers collections.

This still doesn’t seem perfect because each time I’ll want to update the votes, I’ll need to update two different collections. Also each time I want to get questions/answers and also to get if a user already voted on them, I’ll need to have two different queries and then somehow combine them.

What would you recommend me to do? Use embedding or referencing and where?

Hi @Roi_Bar ,

Can you share some typical test documents from those two collections?

If somebody upvotes a question why would an answer be upvoted? Am I missing something?

Once I see the current schema, I would need the most critical queries, and I can tune it based on those.

In general, I would say that each topic/question should be a document and each top-level answer should be a document. A reply to an inner answer could be embedded, in my opinion.

Now, they can all live in one collection, meaning that a question document will have a “type” : “question” field and an answer will have an answer type. If they are both in the same collection and you have a question id in each answer document, you can run one query for all threads in that question …

db.questionsAndAnswers.find({question_id : xxx }).sort({type : -1, timestamp : 1})
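The descending sort on type works because the string "question" sorts after "answer", so the question comes first, followed by the answers from oldest to newest. Here is a minimal in-memory sketch of that ordering in plain JavaScript. The sample documents, the numeric timestamps, and the helper name sortThread are made up for illustration, and it assumes the question document also stores its own id in question_id (otherwise the filter would match only the answers).

```javascript
// In-memory stand-in for: find({ question_id }).sort({ type: -1, timestamp: 1 })
// Assumption: the question document stores its own id in question_id.
function sortThread(docs, questionId) {
  return docs
    .filter((d) => d.question_id === questionId)
    .sort((a, b) => {
      // type: -1 -> descending string order, so "question" precedes "answer"
      if (a.type !== b.type) return a.type < b.type ? 1 : -1;
      // timestamp: 1 -> oldest first within the same type
      return a.timestamp - b.timestamp;
    });
}

const docs = [
  { _id: "a2", type: "answer", question_id: "q1", timestamp: 30 },
  { _id: "q1", type: "question", question_id: "q1", timestamp: 10 },
  { _id: "a1", type: "answer", question_id: "q1", timestamp: 20 },
  { _id: "a9", type: "answer", question_id: "q2", timestamp: 5 },
];

const thread = sortThread(docs, "q1").map((d) => d._id);
console.log(thread); // question first, then answers by time
```

If more post types are added later, relying on alphabetical order of the type string gets fragile; an explicit numeric rank field would be one way around that.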

Thanks
Pavel

Hi, thanks for your answer!
When somebody upvotes a question, of course, no answer should be upvoted. When I said I’d need to update two different collections, I meant that because the votes are in a different collection, I will need to update both the votes collection (save the user who voted, on what question/answer, and whether it’s an upvote or downvote) and the question/answer document (in a field that stores how many upvotes and downvotes there are, so I don’t need to count the votes each time). That also means that when I want to get a list of questions/answers and also find out, for each of them, whether a specific user has already voted on it (so when the user sees the question/answer, he will see if he voted on it already), I will need two different queries. The first one gets the list of questions/answers; after that, I’ll need to check for each question/answer id whether the user has voted on it, and then somehow combine that data into one object. Hope I explained it clearly enough.

About the replies: if I put the replies inside the answer schema, won’t it limit how many replies an answer can have? I really don’t know how many I’ll have because that depends on how many users I’ll have and how big my website grows. I’m scared of killing the scalability…

About putting questions and answers in the same collection: that’s actually what I’ve done until now. I created a collection for Posts and thought that I would just put every kind of post there (not only questions and answers, because I’m planning to add more things like writing on someone’s wall, posting a status, and more). They all have text content, a creator, a flag for whether the creator is anonymous, and a created date. I thought it would also be easier for things like voting, because then I can have one votes collection that connects user and post, instead of needing to create separate collections for votes on questions and votes on answers.
But, with all that said, I’m still thinking that it might be a bad idea. First of all, each type of post will have different fields (like a question having tags and a description, and an answer having a question id), and it’s harder to manage when they are all in the same collection. Also, I can’t have mongoose schema validation, because each type of post has different rules to validate (even though I have other layers of validation, it’s always good to have the mongoose validation to be extra safe). Even with the votes, I only want questions and answers to have votes, and not other kinds of posts. And even then, what if someday I decide that I don’t want downvotes on questions, for example?

I feel like putting everything in one collection may be messier and more complicated than just creating a collection for each type of post, and that it’ll limit changes afterward. I would be happy to hear your opinion.

Hi @Roi_Bar ,

In terms of upvotes, I am more of a fan of keeping data that is queried and updated together in the same document.

One possibility is to keep, for each post that is being upvoted, an array of the user ids that upvoted it together with the total number of votes/downvotes:

votes : { n : 50,
          users : [ "id1", "id2", ... "id50" ] },
downVotes : ...

Traversing this array on the client side when building the UI should not be problematic, and it will be fast to show a full or empty like for the connected user.
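As a rough sketch of how the embedded votes structure above could be written and read, the helpers below build a filter/update pair intended for something like updateOne(filter, update) and do the client-side check. The helper names buildUpvote and hasUpvoted are hypothetical; the field names votes.n and votes.users follow the example above, and $ne, $push, and $inc are standard MongoDB operators.

```javascript
// Filter/update pair meant for something like posts.updateOne(filter, update).
// The $ne guard makes the update match nothing if the user already upvoted,
// so votes.n and votes.users stay in step.
function buildUpvote(postId, userId) {
  return {
    filter: { _id: postId, "votes.users": { $ne: userId } },
    update: { $push: { "votes.users": userId }, $inc: { "votes.n": 1 } },
  };
}

// Client-side check used to render a full or empty "like" icon.
function hasUpvoted(post, userId) {
  return post.votes.users.includes(userId);
}

const post = { _id: "p1", votes: { n: 2, users: ["id1", "id2"] } };
const { filter, update } = buildUpvote("p1", "id3");
console.log(JSON.stringify(filter), JSON.stringify(update));
console.log(hasUpvoted(post, "id1"), hasUpvoted(post, "id3"));
```

Because the counter and the array live in the same document, a single update changes both together, which is the main attraction of this layout.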

For the replies on a specific inner answer/post, their number would usually be of a lower magnitude compared to the number of messages on the main thread. Therefore I assume that 100-200 inner comments can live in the embedded array. Moreover, this kind of hierarchy is usually shown paginated, so you can keep the top comments embedded, and any click on “load more” can go to an outlier collection which holds the extra-long comment threads …
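A small sketch of that split, under assumptions: the limit of 100 is simply the low end of the "100-200" mentioned above, and splitReplies, hasMoreReplies, and answer_id are hypothetical names rather than anything MongoDB defines.

```javascript
const EMBED_LIMIT = 100; // assumption, taken from the low end of "100-200"

// Keep the first `limit` replies embedded in the answer document and turn
// the rest into standalone documents for a separate overflow collection.
function splitReplies(answerId, replies, limit = EMBED_LIMIT) {
  const embedded = replies.slice(0, limit);
  // Replies past the limit point back to the answer they belong to.
  const overflow = replies
    .slice(limit)
    .map((r) => ({ ...r, answer_id: answerId }));
  return { embedded, overflow, hasMoreReplies: overflow.length > 0 };
}

const replies = Array.from({ length: 103 }, (_, i) => ({ content: `reply ${i}` }));
const result = splitReplies("a1", replies);
console.log(result.embedded.length, result.overflow.length, result.hasMoreReplies);
```

The hasMoreReplies flag would let the UI decide whether to show “load more” without querying the overflow collection first.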

As for mongoose, I don’t have much to share, as I would rather not use it, exactly because of the schema type limitations: it prevents me from using MongoDB polymorphism, which is one of the strongest points of MongoDB. Documents do not have to be the same; they only need to share a common attribute for logical queries … This is a classic example where all content will probably hold a post id. The documents can then have different fields for different types of posts. Your UI should be smart enough to get a post and, based on its structure, display it correctly. Any validation can be done in the business logic of the application backend …

Hope that helps

Pavel

Thanks again!
I think I’ll just go with referencing. I don’t want to have limitations on how many upvotes or downvotes there are, and if I also had the replies embedded, that could be a pretty big limitation because I would have even more data in the document. Also, most of the time I only need to check whether a specific user has voted; I don’t need the whole array of users who voted, and I never show the list of users who have voted on a question/answer.
If I go down that route, what do you think would be the best way to update and get the votes? I need to both update the upvotes/downvotes count on the post and add the vote to the votes collection, and when getting the data I also need to check for each comment whether the user has already voted on it. (I know that in SQL those things are supported with atomic updates and queries with joins; how would you do that in MongoDB?)
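For the read side, one possible shape is the two-query approach described earlier in the thread: fetch the posts, fetch this user’s votes for those post ids (for example with a filter like { userId, postId: { $in: postIds } }), then merge in memory. The sketch below shows only the merge step; attachUserVotes and the field names userId, postId, type, and myVote are assumptions for illustration.

```javascript
// Merge step: annotate each post with the current user's vote, if any.
// userVotes is assumed to hold only this user's votes for the listed posts.
function attachUserVotes(posts, userVotes) {
  const voteByPost = new Map(userVotes.map((v) => [v.postId, v.type]));
  return posts.map((p) => ({ ...p, myVote: voteByPost.get(p._id) ?? null }));
}

const posts = [
  { _id: "q1", upvotesCount: 3, downvotesCount: 0 },
  { _id: "a1", upvotesCount: 1, downvotesCount: 2 },
];
const userVotes = [{ userId: "u1", postId: "a1", type: "down" }];

const merged = attachUserVotes(posts, userVotes);
console.log(merged.map((p) => [p._id, p.myVote]));
```

The Map keeps the merge linear in the number of posts and votes. MongoDB also offers the $lookup aggregation stage for doing a join on the server, which may be worth comparing against this approach.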

About the replies, I need to think about it more. Do I want to sacrifice scalability for speed and ease of working with the data? What if my website grows very big (unlikely, but still, I don’t want to block this option completely)?

I’m still not sure about combining all of the post types into one Posts collection. I mean, either way, I have a database layer that handles working with the different types of posts separately, so what value do I get by combining those into one collection? The only thing I can think of is easier references like with the votes (having one votes collection to connect Posts and Users instead of having two separate votes collections for questions and answers). Isn’t it just easier to have a separate collection for each type of post?

Hi @Roi_Bar ,

If you go down the route of having a document representing whether a user has voted or not, you will need some sort of transaction to update the total on the specific post document. You can use the native MongoDB transactions.

Whether to place the data in one collection or several depends on your code and UI.

If I imagine it correctly, you will probably have different topics for posts, and the main screen will show some preview of the available posts, so I assume that you will have some sort of grouping per category on the main posts. Once a specific post is loaded, you will need to show the first portion of replies/answers …

Therefore I thought that if you spread the posts across collections by type, you will need several queries to group them, whereas with one collection a single query can do it all…

If you feel like separating is better, try it. Remember that MongoDB is a very good database for changing schemas, so moving your application from many collections to one once you are in the air shouldn’t be that heavy a lift…

I don’t think I’ll use transactions because voting is something users do very often, so it should be as fast as possible, and even if once in a million times the question/answer vote count doesn’t get updated, it’s worth it. Also, I’ll probably just handle it myself: when the question/answer update fails or the document does not exist, I’ll delete the vote or restore its previous value in the votes collection.
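To make the counter side of that concrete, here is a minimal sketch of working out what to pass to $inc when a vote is added, removed, or switched. The function name voteCounterDelta and the "up" / "down" / null encoding are assumptions; upvotesCount and downvotesCount are the counter fields mentioned earlier in the thread.

```javascript
// previous / next are "up", "down", or null (no vote).
// Returns the increments for the post's counters; an empty object means
// the vote did not change and the post update can be skipped.
function voteCounterDelta(previous, next) {
  const inc = {};
  if (previous === next) return inc;
  if (previous === "up") inc.upvotesCount = -1;
  if (previous === "down") inc.downvotesCount = -1;
  if (next === "up") inc.upvotesCount = 1;
  if (next === "down") inc.downvotesCount = 1;
  return inc;
}

const firstVote = voteCounterDelta(null, "up");
const switched = voteCounterDelta("up", "down");
const removed = voteCounterDelta("down", null);
console.log(firstVote, switched, removed);
```

Swapping the arguments gives the opposite delta, which is one way to undo a counter change if a later step fails.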

About the topics, I’m not exactly sure what you mean but if I understand correctly you mean I’ll have different topics, and each question belongs to a topic. I thought you meant to combine the questions and answers in one collection. I only have tags for questions, not topics, and I already embedded them in the question schema.

Thank you very much for your help and time!
