How can I export a huge dataset using the mongodb package in Node.js?

I have a collection named “Chats” with the following model:

{
  "session_id": "session-000001",
  "messages": [
    {
      "human": "hi",
      "chatbot": "Hello",
      "startTime": 1709660783132,
      "endTime": 1709660788169
    },
    {
      "human": "how is weather today",
      "chatbot": "It's 30 degree celsius",
      "startTime": 1709660789236,
      "endTime": 1709660796201
    }
  ]
}

I would like to export all records that fall within a specific date range based on messages.endTime. Here is my pipeline:

[
  {
    $match: {
      "messages.human": {
        $ne: "",
      },
      "messages.endTime": {
        $gte: 1709663400000,
        $lte: 1712341799999,
      },
    },
  },
  {
    $unwind: "$messages",
  },
]

But this takes more than 1 minute for 10k records when run through the Node.js driver's aggregate() wrapper.
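
For context, a minimal version of how I run the pipeline through the driver looks roughly like this (a sketch only; the connection string, database name, and the toArray() buffering are assumptions, not necessarily my exact code):

const { MongoClient } = require("mongodb");

async function exportChats() {
  // Connection string and database name are placeholders
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const coll = client.db("mydb").collection("Chats");

  const pipeline = [ /* the $match and $unwind stages shown above */ ];

  // Buffers every matching document in memory before anything is written out
  const docs = await coll.aggregate(pipeline).toArray();
  await client.close();
  return docs;
}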

  1. How can I improve the performance?
  2. Are there any settings I need to update/modify in MongoDB?
  3. I'm trying to export 100k records; what's the best approach?

Note: if I add an additional $count stage to the pipeline, it takes only about 5 seconds.

I'm new to MongoDB, so please bear with me if I'm asking a dumb question. Thanks!

Do you have indexes set up on the data that satisfy your match conditions?
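
For example, the range condition in your $match could be served by a multikey index along these lines (just an illustration; I don't know what you currently have):

// A multikey index on the array subfield used in the range condition
db.Chats.createIndex({ "messages.endTime": 1 })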

I feel disappointed when we spend time reading posts and the original author never follows up.

Cheers to you, John.


Yes, it is a shame when we don't hear anything back! It seems pointless to take the time to write a well-formatted post and then just ignore it!


@John_Sewell @steevej I'm extremely sorry for the delayed response. This was my first post, so my account was under review and I missed these replies.

Yes, I have indexes set up on “messages.startTime”, “messages.endTime” and “session_id”.

“messages.human” is a text field that could hold long strings, so I haven't indexed it.


Welcome back 🙂

When you say you have indexes on those three fields, is it three indexes or one compound index on three fields?

Also, for that human field: do you mean to check whether it is an empty string, null, or missing entirely? Those are all different scenarios in MongoDB (which can be confusing, especially the way things like Oracle deal with empty strings… unless it's just me that finds that annoying).
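
For reference, the three cases look different in a query (illustrative shapes only):

// Matches documents where the field holds the empty string
{ "messages.human": "" }

// Matches documents where the field is null OR missing (null matches both)
{ "messages.human": null }

// Matches documents where the field is missing entirely
{ "messages.human": { $exists: false } }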

The fact that the full export takes more than a minute while the same pipeline with a $count stage finishes in about 5 seconds points to a network I/O issue.

Is 100K before or after $unwind?

Since $unwind is the last stage, it would be preferable in terms of network I/O to forgo it. An $unwind stage increases the amount of data transferred because all the data outside the unwound array is duplicated for each element of the array. You could use $project to remove duplicated values that are not required.
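
If you do keep the $unwind, a $project after it along these lines could cut what goes over the wire (the field choices are only an illustration):

{
  $project: {
    _id: 0,
    session_id: 1,
    "messages.human": 1,
    "messages.chatbot": 1,
    "messages.endTime": 1,
  },
}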

I think your $match should use $elemMatch, because I suspect you want both conditions to be true for the same message.
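
Something along these lines, reusing the values from your pipeline, would require both conditions to hold for the same array element:

{
  $match: {
    messages: {
      $elemMatch: {
        human: { $ne: "" },
        endTime: { $gte: 1709663400000, $lte: 1712341799999 },
      },
    },
  },
}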

I also think you would want a $set stage with $filter on messages, so that you only keep the matching elements.
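
Roughly like this (same values as your $match; the $$m variable name is arbitrary):

{
  $set: {
    messages: {
      $filter: {
        input: "$messages",
        as: "m",
        cond: {
          $and: [
            { $ne: ["$$m.human", ""] },
            { $gte: ["$$m.endTime", 1709663400000] },
            { $lte: ["$$m.endTime", 1712341799999] },
          ],
        },
      },
    },
  },
}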

By not exporting 100K documents at all: use the aggregation framework to do whatever computation you would do with the 100K documents on the server rather than on the client.
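
For instance, if the end goal is a summary such as daily message counts or average response times, a $group placed after the $unwind can produce it on the server so only a few documents cross the network (the grouped fields here are purely an illustration):

{
  $group: {
    _id: {
      $dateToString: {
        format: "%Y-%m-%d",
        date: { $toDate: "$messages.endTime" },
      },
    },
    messageCount: { $sum: 1 },
    avgResponseMs: {
      $avg: { $subtract: ["$messages.endTime", "$messages.startTime"] },
    },
  },
}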

Thanks for your detailed input. This is very useful!