MongoDB: count the number of consecutive matching docs and merge them into one

Consider that I have a mongodb collection called chatMessages with these properties (using mongoose on nodejs):

const schema = {
  _id: ObjectID,
  chatID: String,
  type: String,
  message: String,
  senderID: String,
  date: Date
}

Imagine that I have 5 documents in the collection:

[
  {
    _id: '18c9bb49-03f4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'first message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-02T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03f4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'second message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-03T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03g4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'third message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-04T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03g4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'fourth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-05T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03g4-4dc4-10ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'fifth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-06T01:01:11.001Z')
  }
]

I want to query these documents so that consecutive rows (when sorted by date) with type groupedMsgs are represented as a single document, carrying the number of consecutive rows in that run as numConsecutiveItems in the final output. Concretely, I would like the output shown below:

[
  {
    _id: '18c9bb49-03f4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'first message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-02T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03f4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'second message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-03T01:01:11.001Z'),
    numConsecutiveItems: 2
  },
  {
    _id: '18c9bb4w-03g4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'fourth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-05T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03g4-4dc4-10ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'fifth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-06T01:01:11.001Z'),
    numConsecutiveItems: 1
  }
]

Notice that the third message is not in the final output because it has type groupedMsgs and immediately follows another message with type groupedMsgs, and the second message has numConsecutiveItems of 2 for the same reason. Moreover, the fifth message is present because it doesn't immediately follow another groupedMsgs message, and its numConsecutiveItems is 1 for the same reason. What is an aggregation pipeline that can do this for me? My preference would be to avoid $accumulator, $function, and $where, so as not to run JavaScript during the query (which can slow the operation down), but I'm open to all answers nevertheless.
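For reference, here is a plain-JavaScript sketch of the transformation being asked for (not an aggregation pipeline, just a specification of the expected behaviour), assuming the input array is already sorted ascending by date. The function name collapseGroupedMsgs is illustrative, not from any library:

```javascript
// Collapse consecutive runs of type 'groupedMsgs' into their first
// document, annotated with a numConsecutiveItems count of the run length.
// Assumes sortedDocs is already sorted ascending by date.
function collapseGroupedMsgs(sortedDocs) {
  const out = [];
  let prevType = null;
  for (const doc of sortedDocs) {
    if (doc.type === 'groupedMsgs' && prevType === 'groupedMsgs') {
      out[out.length - 1].numConsecutiveItems += 1; // extend the current run
    } else if (doc.type === 'groupedMsgs') {
      out.push({ ...doc, numConsecutiveItems: 1 }); // start a new run
    } else {
      out.push({ ...doc }); // non-grouped messages pass through unchanged
    }
    prevType = doc.type;
  }
  return out;
}
```

Applied to the five sample documents above, this yields the four-document output shown, with numConsecutiveItems of 2 on the second message and 1 on the fifth.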

Hi @Uchechukwu_Ozoemena,

The presented dataset has all the _id's with the same value, so I am not sure how you identify messages into groups. Is it just the fact that they have type groupedMsgs and immediately follow each other by date?

Can you explain the application-side idea behind getting this data set? Wouldn't it be better to group those messages as an embedded array inside a parent doc?

[
  {
    _id: '18c9bb49-03f4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'first message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-02T01:01:11.001Z')
  },
  {
    _id: '18c9bb4w-03f4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'second message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-03T01:01:11.001Z'),
    ConsecutiveItems: [{
      _id: '18c9bb4w-03g4-4dc4-b0ba-a3c45c324347',
      chatID: 'alternateChat',
      type: 'groupedMsgs',
      message: 'third message',
      senderID: 'first_sender_id',
      date: new Date('0001-01-04T01:01:11.001Z')
    }]
  },
  ...
]

Pavel


I agree with @Pavel_Duchovny,

it looks like you just need to adjust your schema to have 2 types of documents:

  1. ‘Regular’ document with all the properties you’ve mentioned
  2. A bucket document that will contain metadata and an array of the specific elements, together with a numConsecutiveItems property that is updated manually or by a $size operation on the array.

Anyway, it looks like you need a solution that is partially managed by the application on writes and is easy to fetch on reads.

The only restriction in the solution I have suggested is that the array shouldn't contain too many elements (I think the magic number is around 1,000).

Hey, thanks for the response and suggestion. The IDs were actually meant to all be unique (I changed one letter in each); I should have used simpler IDs, and strangely it now seems I can't edit the post. Anyway, I want to identify groups based on whether the groupedMsgs follow each other when sorted by time (i.e. the date property). So the second and third messages would be one group containing 2 items because they come right after each other, but the fifth would be another group of 1. I haven't found a pipeline using $group that achieves this: once I add the type as a group ID, all the groupedMsgs get put into one group.
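The usual fix for the "$group lumps everything together" problem is to derive a run identifier that increments each time the type changes between date-sorted neighbours, and then group on the pair { type, runId } instead of type alone. The idea is sketched below in plain JavaScript (assignRunIds is a hypothetical name); in MQL the same running value could be computed with window or accumulator stages, depending on server version:

```javascript
// Assign each date-sorted message a runId that increments whenever the
// type changes from the previous message. Consecutive messages of the same
// type share a runId, so later repeats of the type land in different runs.
function assignRunIds(sortedDocs) {
  let runId = 0;
  return sortedDocs.map((doc, i) => {
    if (i > 0 && doc.type !== sortedDocs[i - 1].type) runId += 1;
    return { ...doc, runId };
  });
}
```

For the five sample messages (txt, groupedMsgs, groupedMsgs, txt, groupedMsgs) the run IDs come out as 0, 1, 1, 2, 3, so the second/third messages group together while the fifth stays in its own group.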

Thanks for the response. Can you please give an example of what the 2 buckets would look like? I’m not sure if the regular one is supposed to contain the grouping or if you mean that it contains the messages in their normal order.

@Uchechukwu_Ozoemena ,

I think @Shay_I's idea is that every chat item will either be classified as a parent that stores pointers to the child "grouped" messages, or it embeds them.

For example:

{
  _id: '18c9bb4w-03f4-4dc4-b0ba-a3c45c324347',
  chatID: 'alternateChat',
  type: 'groupedMsgs',
  message: 'second message',
  senderID: 'first_sender_id',
  date: new Date('0001-01-03T01:01:11.001Z'),
  numConsecutiveItems: 2,
  numConsecutiveIds: ['18c9bb4w-03f4-4dc4-b0ba-a3c45c324347', '18c9bb4w-03g4-4dc4-b0ba-a3c45c324347']
}

Or


{
  _id: '18c9bb4w-03f4-4dc4-b0ba-a3c45c324347',
  chatID: 'alternateChat',
  type: '',
  message: 'second message',
  senderID: 'first_sender_id',
  date: new Date('0001-01-03T01:01:11.001Z'),
  numConsecutiveItems: 2,
  childMessages: [{
    _id: '18c9bb4w-03g4-4dc4-b0ba-a3c45c324347',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'third message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-04T01:01:11.001Z')
  }]
}

Thanks
Pavel


Yes, the more I think of it, the more I think this is what I will have to do. Some of it will have to be done on writes so that reads don't have to do a lot of complex grouping.


You got me right @Pavel_Duchovny

It's just that @Uchechukwu_Ozoemena will have to maintain it, because the bucket definition depends on the sorted message list.

Since we are really talking about chat messages, I assume these buckets wouldn't have to be changed over time, so it should be fine; if that is not the case, you should reconsider the bucket pattern.

On insertion of a new message of type 'groupedMsgs', you should check whether the latest document is a 'groupedMsgs' bucket; if it is, push the new element and increase numConsecutiveItems, otherwise create a new bucket.
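The insertion rule above can be sketched in plain JavaScript. This is an in-memory stand-in, where `chat` is an array ordered by date playing the role of the collection; real code would instead query the latest document and issue an update or insert, and names like insertMessage and childMessages are illustrative (following the embedded-array example earlier in the thread):

```javascript
// Write-time maintenance of the bucket pattern: a 'groupedMsgs' message
// either extends the latest bucket or starts a new one; other messages are
// stored as-is. The parent bucket itself counts as the first item.
function insertMessage(chat, msg) {
  const latest = chat[chat.length - 1];
  if (msg.type === 'groupedMsgs' && latest && latest.type === 'groupedMsgs') {
    latest.childMessages.push(msg);   // extend the existing bucket
    latest.numConsecutiveItems += 1;
  } else if (msg.type === 'groupedMsgs') {
    chat.push({ ...msg, childMessages: [], numConsecutiveItems: 1 }); // new bucket
  } else {
    chat.push({ ...msg });            // regular 'txt' message
  }
}
```

With this shape maintained on writes, the read side is a plain sorted find with no grouping work.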

After that, finding the documents should be very straightforward.


Yep this is what I will end up doing, but with a slight variation for my specific needs. Thank you!
