Advice for Chat schema design

Hi, in the application I’m developing I need to allow chats between any 2 users. I have the following schema in mind for a Messages collection:

{
  _id: ObjectId(),
  userID1: "userA_ID",
  userID2: "userB_ID",
  sender: "one of the users",
  message: "some text",
  timestamp: time
}

userID1 will always be the first user to start the chat, I’ll save this for every Chat room consisting of 2 people. (EDIT: Should I use lexicographical ordering instead, with userID1 coming before userID2?)

I’ll create an index on the timestamp field so I can sort it in reverse, and then I can do the following to get the data every time the chat room is loaded (with pagination as you scroll up):

db.messages.find({ userID1, userID2 }).sort({ timestamp: -1 })
  .skip(offset).limit(limit)

And this I think should give me my intended behavior.
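One thing to keep in mind with skip(offset): the server still has to walk past the skipped documents, so pages get slower the further back you scroll. A common alternative is to page by timestamp, asking for messages older than the oldest one already on screen. Below is a minimal sketch of that idea; `olderThanFilter` and `oldestLoaded` are hypothetical names, the collection name `messages` is an assumption, and messages sharing the exact same timestamp are not handled.

```javascript
// Build the filter for "the next page of older messages" in one chat.
// `pair` is a { userID1, userID2 } object matching the schema above;
// `oldestLoaded` is the timestamp of the oldest message already shown,
// or null when loading the first page. Both names are illustrative.
function olderThanFilter(pair, oldestLoaded) {
  const filter = { userID1: pair.userID1, userID2: pair.userID2 };
  if (oldestLoaded !== null) {
    // Only messages strictly older than what the client already has.
    filter.timestamp = { $lt: oldestLoaded };
  }
  return filter;
}

// Usage in mongosh (assuming the collection is named `messages`):
//   db.messages.find(olderThanFilter(pair, oldestLoaded))
//     .sort({ timestamp: -1 }).limit(limit)
```

The client keeps the timestamp of the last message it received and passes it back for the next page, so no offset is needed.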

  1. Does this make sense? Is there anything I’m missing out or have overlooked?
  2. Is this the correct practice, storing ALL user messages in one big Messages collection? Of course I’ll be implementing all necessary security/privacy protection but the fact still remains that all these messages are stored in one large collection. Is this a concern from a security/privacy/logical standpoint?

Appreciate any advice, thanks.

That particular query will be better served by a compound index on userID1, userID2, timestamp. See Performance Best Practices: Indexing | MongoDB Blog for the specifics.
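As a rough sketch, creating that index could look like the following. The equality fields come first and the sort field last, so one index can serve both the filter and the reverse-time sort. `chatIndexSpec` and `ensureChatIndex` are hypothetical names, and `messages` is assumed to be a collection handle (db.messages in mongosh, or the equivalent from a driver).

```javascript
// Index spec: the two equality fields first, then the sort field.
// timestamp is descending to match .sort({ timestamp: -1 }).
const chatIndexSpec = { userID1: 1, userID2: 1, timestamp: -1 };

// `messages` is assumed to be a MongoDB collection handle that
// exposes createIndex(), e.g. db.messages in mongosh.
function ensureChatIndex(messages) {
  return messages.createIndex(chatIndexSpec);
}

// Usage in mongosh (collection name is an assumption):
//   ensureChatIndex(db.messages)
```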

In your schema you might also want to add a field “sent_by” or something similar; this way you could moderate or find messages by user with a search feature.

Thanks for pointing this out, it’s a necessity actually. Otherwise I have no way of knowing who actually sent the message, oversight on my part.

I wonder if I should use lexicographical sorting to determine userID1 and userID2, with userID1 coming before userID2, instead of who initiates the chat first.
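For illustration, here is one way the lexicographical ordering could be done, so that the same two users always map to the same (userID1, userID2) pair no matter who starts the chat. `canonicalPair` is a hypothetical helper, and it assumes the IDs are plain strings.

```javascript
// Order two user IDs so the lexicographically smaller one is always
// userID1. Assumes both IDs are strings; JavaScript's < compares
// strings by UTF-16 code unit, which is deterministic.
function canonicalPair(idA, idB) {
  return idA < idB
    ? { userID1: idA, userID2: idB }
    : { userID1: idB, userID2: idA };
}

// Either call order yields the same pair, so one query shape
// finds the chat regardless of who initiated it.
const pair = canonicalPair("userB_ID", "userA_ID");
```

With this in place, the pair is computed the same way when writing a message and when querying for the chat.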

Could you merge userID1, userID2 and sender into just 2 fields?

sender rather than userID1
receiver rather than userID2

This would reduce the size of each document.

Thanks for the suggestion, that makes a lot of sense, then the accompanying query would be the following right?

db.messages.find({ $or: [{ sender: userA, receiver: userB }, { sender: userB, receiver: userA }] })
  .sort({ timestamp: -1 }).skip(offset).limit(limit)

So I would then have to create a compound index on the sender, receiver and timestamp fields?
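A sketch of how that query and its index might fit together, under some assumptions: the index would be an ordinary compound index over the three fields, `messages` is a collection handle such as db.messages, and the helper names are made up for this example. Each branch of the $or has the same shape as the index prefix, so both branches should be able to use it; checking the plan with explain() is the way to confirm.

```javascript
// Matches messages in either direction between two users.
function betweenUsersFilter(userA, userB) {
  return {
    $or: [
      { sender: userA, receiver: userB },
      { sender: userB, receiver: userA },
    ],
  };
}

// Compound index spec: equality fields first, sort field last.
const senderReceiverIndex = { sender: 1, receiver: 1, timestamp: -1 };

// `messages` is assumed to be a collection handle exposing find(),
// e.g. db.messages in mongosh. Returns a cursor, newest first.
function latestMessages(messages, userA, userB, offset, limit) {
  return messages
    .find(betweenUsersFilter(userA, userB))
    .sort({ timestamp: -1 })
    .skip(offset)
    .limit(limit);
}

// Usage in mongosh (collection name is an assumption):
//   db.messages.createIndex(senderReceiverIndex)
//   latestMessages(db.messages, "userA_ID", "userB_ID", 0, 20)
```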

Another option, which involves a second collection: when a new chat is created, add a document like this to a “Conversation” collection:


{ id: 123,
  participants: ['user1', 'user2']
}

Then in each message, instead of userID1 and userID2, you would just have “conversation_id”:

{ sender: 'user1',
  message: 'Hello World',
  timestamp: time,
  conversation_id: 123
}

This way if the members of the group change you only have to change it once in the “Conversation” collection and all messages referencing the ID will see the changes.

This is similar to a one-to-many relationship modeled with references.
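To make the two-collection idea concrete, here is a small sketch that builds both kinds of document. `makeConversation` and `makeMessage` are hypothetical helpers; sorting the participants and rejecting senders who are not participants are assumptions added for the example, not something the approach requires.

```javascript
// Build a Conversation document. Participants are stored sorted so the
// same two users always produce the same array (an assumption made here
// to keep lookups by participants predictable).
function makeConversation(id, userA, userB) {
  return { id: id, participants: [userA, userB].sort() };
}

// Build a message that references its conversation by id instead of
// repeating both user IDs on every message.
function makeMessage(conversation, sender, text, timestamp) {
  if (!conversation.participants.includes(sender)) {
    throw new Error("sender is not part of this conversation");
  }
  return {
    sender: sender,
    message: text,
    timestamp: timestamp,
    conversation_id: conversation.id,
  };
}

const conversation = makeConversation(123, "user2", "user1");
const firstMessage = makeMessage(conversation, "user1", "Hello World", new Date());

// Loading a chat then needs only the conversation id, e.g. in mongosh
// (collection name is an assumption):
//   db.messages.find({ conversation_id: 123 }).sort({ timestamp: -1 })
```

A compound index on conversation_id and timestamp would serve that query in the same way as the index discussed earlier.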

That’s a nice approach! Thank you both for the insights and advice.

I was wondering (which was my second question), is storing a huge amount of random chat data in a collection good practice? I say random because in the Messages collection the ordering will be jumbled up when different users communicate at different times. Although it has no effect on the end users, is that acceptable practice?

A more traditional data structure would be perhaps to keep all this chat in an array within a document (but of course it’s subject to the 16MB BSON size limit), but logically this means there’s no way any messages are interleaved with other messages.

In general I don’t believe that a collection with a lot of documents is an issue. As long as your queries are indexed it shouldn’t be a problem.

> A more traditional data structure would be perhaps to keep all this chat in an array within a document (but of course it’s subject to the 16MB BSON size limit), but logically this means there’s no way any messages are interleaved with other messages.

MongoDB does have a bucket design pattern, in which you store related items in an array.

{
  conversation_id: 12345,
  time: time,
  members: ['user1', 'user2'],
  messages: [
    {
      sender: 'user1',
      message: 'Hello World',
      timestamp: time
    },
    {
      sender: 'user1',
      message: 'Hello World',
      timestamp: time
    }
  ],
  total_messages: 2
}

You could have a field called “total_messages” that holds the number of messages in the bucket; once it hits a certain number, a second bucket is created, so you stay within the 16MB limit and avoid the massive-arrays anti-pattern.
This may be more complicated than is required, though.
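For reference, the bucket rollover is often done with a single upsert: match a bucket for the conversation that still has room, push the message into it, and let the upsert create a fresh bucket when none matches. The sketch below only builds the arguments for such an update; `BUCKET_SIZE`, `bucketAppend`, and the `buckets` collection name are assumptions for this example, and 200 is an arbitrary cap.

```javascript
// Maximum messages per bucket document (arbitrary value for the sketch).
const BUCKET_SIZE = 200;

// Build the filter/update/options for appending one message to a bucket.
// If no bucket for this conversation has fewer than BUCKET_SIZE messages,
// the upsert inserts a new bucket document instead.
function bucketAppend(conversationId, members, msg) {
  return {
    filter: {
      conversation_id: conversationId,
      total_messages: { $lt: BUCKET_SIZE },
    },
    update: {
      $push: { messages: msg },
      $inc: { total_messages: 1 },
      // Applied only when the upsert creates a new bucket.
      $setOnInsert: { members: members, time: msg.timestamp },
    },
    options: { upsert: true },
  };
}

// Usage in mongosh (collection name is an assumption):
//   const op = bucketAppend(12345, ['user1', 'user2'], msg)
//   db.buckets.updateOne(op.filter, op.update, op.options)
```

On insert, conversation_id comes from the equality part of the filter, while messages and total_messages are created by $push and $inc, so a new bucket starts with one message and a count of 1.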

Thank you for the links and explanation. That’s a nice approach as well, but yes it comes with a massive array regardless, and queries become a little more complex.

What we’re doing now without using the bucket design pattern is essentially merging all of these potential arrays into one collection. I think I shall be going ahead with what’s been discussed so far with the sender/receiver method.