What would be the best way to model the data for a forum? Including questions, answers and votes

Hi!

It’s great that you’re working with MongoDB. Your thinking is already on the right track. Let’s dive a little deeper into the approaches you’re considering.

Embedding vs. References

Embedding works well when the data is small and is usually read together with its parent. For example, a question’s title, description, and tags can live directly in the question document, so a single read returns everything needed to display the question.

References work well when the data can grow or change a lot. For example, the number of answers and votes on a question is unbounded, so embedding them makes the question document larger with every answer or vote, slows down reads and updates, and can eventually run into MongoDB’s 16 MB document size limit. In this case, keeping them in separate collections that reference the question scales better.

Question Collection
In the question collection, keep the basic question data (creator, creation time, tags, etc.) and put the answers in a separate collection where each answer references its question. Questions and answers can then be read and updated independently, which keeps the question documents small. You can also keep running vote totals (upvotes and downvotes) in the question document so they can be displayed without an extra query.
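
As a rough illustration, a question document along these lines might look like the sketch below. It uses plain Python dictionaries rather than a live database, and every field name (author_id, upvotes, answer_count, and so on) is an assumption made for the example, not something MongoDB requires.

```python
from datetime import datetime, timezone


def new_question(question_id, author_id, title, body, tags):
    """Build a hypothetical document for a 'questions' collection."""
    return {
        "_id": question_id,
        "author_id": author_id,   # reference to a user document
        "title": title,           # small and always read with the question: embedded
        "body": body,
        "tags": list(tags),       # short array: embedded
        "created_at": datetime.now(timezone.utc),
        "upvotes": 0,             # running totals kept for fast display
        "downvotes": 0,
        "answer_count": 0,
        # Note: no embedded "answers" array; answers live in their own collection.
    }


question = new_question(
    "q1", "u1", "How do I model a forum?", "Question text goes here.", ["mongodb", "schema"]
)
```

The point of the sketch is what is missing: the document carries small, bounded fields and counters, but not the answers themselves.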

Answers Collection
Answers are best stored in a separate collection, with each answer holding a reference to its question. If an answer can be a reply to another answer (i.e. nested), it is also worth storing a reference to the parent answer. A flat collection with these two references can represent any depth of nesting without deeply nested documents.
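
Here is a minimal sketch of that idea, again with plain Python dictionaries standing in for documents. The field names question_id and parent_answer_id and the build_thread helper are hypothetical; the sketch only shows how a flat list of answers can be turned back into a nested thread on the application side.

```python
from collections import defaultdict

# Hypothetical documents from an "answers" collection for one question.
answers = [
    {"_id": "a1", "question_id": "q1", "parent_answer_id": None, "body": "Use references."},
    {"_id": "a2", "question_id": "q1", "parent_answer_id": "a1", "body": "Agreed."},
    {"_id": "a3", "question_id": "q1", "parent_answer_id": None, "body": "Embed small data."},
]


def build_thread(flat_answers):
    """Group a flat list of answers into a nested tree using parent_answer_id."""
    children = defaultdict(list)
    for answer in flat_answers:
        children[answer["parent_answer_id"]].append(answer)

    def attach(parent_id):
        # Copy each answer and add its replies recursively.
        return [
            {**answer, "replies": attach(answer["_id"])}
            for answer in children[parent_id]
        ]

    return attach(None)  # top-level answers have no parent


thread = build_thread(answers)
```

Top-level answers have parent_answer_id set to None, and each entry in the result carries a replies list built from the answers that point at it.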

Votes Collection
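For voting, it is better to create a separate collection with one document per vote, recording who voted, on which question or answer, and in which direction. Casting or changing a vote then touches only the votes collection instead of rewriting question and answer documents, and it becomes easy to check whether a user has already voted. If you also keep cached totals on the question, treat them as derived data that can be recomputed from the votes collection.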
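
The logic could be sketched roughly as follows. This is an in-memory stand-in, not database code: the dictionary key (user_id, target_id) plays the role that a unique compound index on those two fields could play in MongoDB, and the names cast_vote, tally, target_id, and value are assumptions for the example.

```python
def cast_vote(votes, user_id, target_id, value):
    """Record a vote of +1 or -1; a repeat vote by the same user replaces the old one."""
    if value not in (1, -1):
        raise ValueError("value must be +1 or -1")
    # One entry per (user, target) pair, so a user cannot vote twice on the same item.
    votes[(user_id, target_id)] = {
        "user_id": user_id,
        "target_id": target_id,  # _id of the question or answer being voted on
        "value": value,
    }


def tally(votes, target_id):
    """Return (upvotes, downvotes) for one question or answer."""
    values = [v["value"] for v in votes.values() if v["target_id"] == target_id]
    return values.count(1), values.count(-1)


votes = {}
cast_vote(votes, "u1", "q1", 1)
cast_vote(votes, "u2", "q1", -1)
cast_vote(votes, "u1", "q1", -1)  # u1 changes their mind; still one vote from u1
totals = tally(votes, "q1")
```

Because each user has at most one entry per target, changing a vote is a replacement rather than a second record, and the totals can always be recomputed from the votes alone.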

Combining queries
Yes, if you use references, the data for one page is spread across several collections, so a naive approach needs several queries. MongoDB’s aggregation framework addresses this with the $lookup stage, which joins documents from another collection so that a question and its answers come back from a single query.
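
As a sketch, a pipeline that fetches one question together with its answers might be built like this. The collection name answers and the field question_id are assumptions carried over from the earlier description; the pipeline is written as plain Python data, so nothing here talks to a database.

```python
def question_with_answers_pipeline(question_id):
    """Build an aggregation pipeline that joins a question with its answers."""
    return [
        # Select the one question we want to display.
        {"$match": {"_id": question_id}},
        # Join every answer whose question_id equals this question's _id.
        {
            "$lookup": {
                "from": "answers",             # assumed name of the answers collection
                "localField": "_id",
                "foreignField": "question_id",  # assumed reference field on each answer
                "as": "answers",               # joined answers appear under this key
            }
        },
    ]


pipeline = question_with_answers_pipeline("q1")
```

With a driver such as PyMongo, a pipeline like this would be passed to the aggregate method of the questions collection. An index on the answers' question_id field is worth considering so the join does not scan the whole collection.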

Which one to choose?
In my opinion, the best solution is a hybrid approach:

For small data that is often used together, use embedding.

For larger data that keeps growing, such as answers and votes, use references.

Use aggregation to combine data when you need to get everything at once.

This approach will be flexible and scalable, with the ability to adapt as your project grows. If you have any further questions, don’t hesitate to reach out!