How can I export a huge dataset using the mongodb package in Node.js?

I have a collection named “Chats” with the following model:

{
  "session_id": "session-000001",
  "messages": [
    {
      "human": "hi",
      "chatbot": "Hello",
      "startTime": 1709660783132,
      "endTime": 1709660788169
    },
    {
      "human": "how is weather today",
      "chatbot": "It's 30 degree celsius",
      "startTime": 1709660789236,
      "endTime": 1709660796201
    }
  ]
}

I would like to export all records that fall within a specific date range based on messages.endTime. Here is my pipeline:

[
  {
    $match: {
      "messages.human": {
        $ne: "",
      },
      "messages.endTime": {
        $gte: 1709663400000,
        $lte: 1712341799999,
      },
    },
  },
  {
    $unwind: "$messages",
  },
]

But this takes more than 1 minute for 10k records when run through the Node.js driver's aggregate() wrapper.
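
For context, a minimal version of how I run the pipeline through the driver looks roughly like this (a sketch only; the connection string, database name, and the toArray() buffering are assumptions, not necessarily my exact code):

const { MongoClient } = require("mongodb");

async function exportChats() {
  // Connection string and database name are placeholders
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const coll = client.db("mydb").collection("Chats");

  const pipeline = [ /* the $match and $unwind stages shown above */ ];

  // Buffers every matching document in memory before anything is written out
  const docs = await coll.aggregate(pipeline).toArray();
  await client.close();
  return docs;
}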

  1. How can I improve the performance?
  2. Are there any settings I need to update/modify in MongoDB?
  3. I'm trying to export 100k records; what's the best approach?

Note: if I add an additional $count stage to the pipeline, it takes only about 5 seconds.

I'm new to MongoDB, so please bear with me if I'm asking a dumb question. Thanks!

Do you have indexes set up on the data that satisfy your match conditions?
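
For example, the range condition in your $match could be served by a multikey index along these lines (just an illustration; I don't know what you currently have):

// A multikey index on the array subfield used in the range condition
db.Chats.createIndex({ "messages.endTime": 1 })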

I feel disappointed when we spend time reading posts and the original author never follows up.

Cheers to you, John.


Yes, it is a shame when we don't hear anything back! It seems pointless to take the time to write a well-formatted post and then just ignore it!


@John_Sewell @steevej I'm extremely sorry for the delayed response. This was my first post, so my account was under review and I missed these replies.

Yes, I have indexes set up on “messages.startTime”, “messages.endTime” and “session_id”.

“messages.human” is a text field that could hold long strings, so I haven't indexed it.


Welcome back 🙂

When you say you have indexes on those three fields, is it three indexes or one compound index on three fields?

Also, for that human field: do you mean to check whether it is an empty string, null, or missing entirely? Those are all different scenarios in MongoDB (which can be confusing, especially the way things like Oracle deal with empty strings… unless it's just me that finds that annoying).
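
For reference, the three cases look different in a query (illustrative shapes only):

// Matches documents where the field holds the empty string
{ "messages.human": "" }

// Matches documents where the field is null OR missing (null matches both)
{ "messages.human": null }

// Matches documents where the field is missing entirely
{ "messages.human": { $exists: false } }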

The fact that the full export takes more than a minute while the same pipeline with a $count stage finishes in about 5 seconds points to a network I/O issue.

Is 100K before or after $unwind?

Since $unwind is the last stage, it would be preferable in terms of network I/O to forgo it. An $unwind stage increases the amount of data transferred because all the data outside the unwound array is duplicated for each element of the array. You could use $project to remove duplicated values that are not required.
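
If you do keep the $unwind, a $project after it along these lines could cut what goes over the wire (the field choices are only an illustration):

{
  $project: {
    _id: 0,
    session_id: 1,
    "messages.human": 1,
    "messages.chatbot": 1,
    "messages.endTime": 1,
  },
}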

I think your $match should use $elemMatch, because I suspect you want both conditions to be true for the same message.
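
Something along these lines, reusing the values from your pipeline, would require both conditions to hold for the same array element:

{
  $match: {
    messages: {
      $elemMatch: {
        human: { $ne: "" },
        endTime: { $gte: 1709663400000, $lte: 1712341799999 },
      },
    },
  },
}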

I also think you would want a $set stage with $filter on messages, so that you only keep the matching elements.
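
Roughly like this (same values as your $match; the $$m variable name is arbitrary):

{
  $set: {
    messages: {
      $filter: {
        input: "$messages",
        as: "m",
        cond: {
          $and: [
            { $ne: ["$$m.human", ""] },
            { $gte: ["$$m.endTime", 1709663400000] },
            { $lte: ["$$m.endTime", 1712341799999] },
          ],
        },
      },
    },
  },
}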

By not exporting 100K documents at all: use the aggregation framework to do whatever computation you would do with the 100K documents on the server rather than on the client.
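
For instance, if the end goal is a summary such as daily message counts or average response times, a $group placed after the $unwind can produce it on the server so only a few documents cross the network (the grouped fields here are purely an illustration):

{
  $group: {
    _id: {
      $dateToString: {
        format: "%Y-%m-%d",
        date: { $toDate: "$messages.endTime" },
      },
    },
    messageCount: { $sum: 1 },
    avgResponseMs: {
      $avg: { $subtract: ["$messages.endTime", "$messages.startTime"] },
    },
  },
}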

Thanks for your detailed input. This is very useful!