We have a script that reads a collection X, does some processing, and writes to a collection Y. Usually the script takes ~1 hour, but sometimes it is very slow. What can affect performance?
I don't see a big load on my cluster… maybe a lock on my X collection is slowing down my script? Size of my X collection: Long('27776896250').
Without more information about your exact cluster setup we cannot really help.
When you write "read a collection X" and the size of X is 27_776_896_250, do you mean you read all of it and then write all of it back into a collection Y? Where is the script running relative to the server?
Where is the script running relative to the server?
The script runs on another machine but communicates with the cluster over the internal network (1 GB/s).
When you write "read a collection X" and the size of X is 27_776_896_250, do you mean you read all of it and then write all of it back into a collection Y?
I read the data from collection X, do some processing (if needed), and then write it to the new collection Y. I work in batches of 300. I create an empty collection, write all the data, and then create the index (maybe it's better to create the index before? I don't think so, because MongoDB will have to keep it up to date…).
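Simplified, the flow looks something like this (sketched in Python with pymongo here; the connection string, database/collection names, the processing step, and the indexed field are placeholders):

```python
from pymongo import MongoClient

BATCH_SIZE = 300

client = MongoClient("mongodb://host:27017")  # placeholder connection string
db = client["mydb"]                            # placeholder database name
source = db["X"]
target = db["Y"]

def process(doc):
    # placeholder for the per-document processing step
    return doc

batch = []
for doc in source.find({}, batch_size=BATCH_SIZE):
    batch.append(process(doc))
    if len(batch) == BATCH_SIZE:
        target.insert_many(batch, ordered=False)  # unordered: server can apply inserts in parallel
        batch = []
if batch:
    target.insert_many(batch, ordered=False)

# Index created after the load, as described above; creating it before the
# first insert would make MongoDB maintain it during every write instead.
target.create_index("some_field")  # placeholder field name
```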
Let me know if you have suggestions / tips / questions.
Thanks.
One idea, which may or may not be applicable to the whole use case, is to:
take 1 node out of the replica set
make this node a dedicated, non-replicated node where collection Y is written
My idea is that since you read and write a lot of data on the same cluster, you constantly swap the working set in and out, and your cluster might be overloaded by disk I/O. With the new setup, the replica set is not busy replicating all the writes and can serve the reads better.
Anyway, 4 nodes is not a recommended replica set configuration, so you lose nothing by taking one node out.
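You can check whether cache pressure is actually the problem by looking at the WiredTiger cache statistics in serverStatus. A rough sketch, assuming pymongo and a placeholder connection string:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://host:27017")  # placeholder: point at the node you want to inspect
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

used = cache["bytes currently in the cache"]
limit = cache["maximum bytes configured"]
pages_read = cache["pages read into cache"]

print(f"cache usage: {used / limit:.0%}")
print(f"pages read into cache since startup: {pages_read}")
# A cache sitting near 100% while pages are constantly read back in suggests
# the working set does not fit in memory and reads/writes compete for disk I/O.
```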
Can you give more details about:
The metrics you are using for the above.
The hardware configuration of the 4 nodes.
The read and write concerns you use.
Since the script usually takes 1 hour, it looks like an analytics use case, and having dedicated nodes for analytics is sometimes a good choice.
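For example, if you tag one member for analytics in the replica set configuration, the script can target it with a tag-aware read preference. A sketch assuming pymongo, with illustrative names and tags:

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")  # placeholder
analytics_db = client.get_database(
    "mydb",  # placeholder database name
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}]),  # assumes a member tagged this way
)

# Reads through this handle go to the tagged secondary, keeping the full scan
# of collection X away from the members serving the regular workload.
cursor = analytics_db["X"].find({}, batch_size=300)
```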
Have you tried different batch sizes?
I don't know, but it should be straightforward to test. That said, I suspect creating the index beforehand may be better, because building it after the load means every document has to be fetched again, adding a lot of I/O.
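Both things (batch size and index timing) are easy to benchmark against a sample of the data. A rough sketch, assuming pymongo, with placeholder names and sizes:

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://host:27017")  # placeholder connection string
db = client["mydb"]                            # placeholder database name
sample = list(db["X"].find().limit(100_000))   # sample size is arbitrary, just needs to be representative

for batch_size in (100, 300, 1000, 5000):
    db["Y_test"].drop()                        # throwaway target collection
    start = time.monotonic()
    for i in range(0, len(sample), batch_size):
        db["Y_test"].insert_many(sample[i:i + batch_size], ordered=False)
    print(f"batch_size={batch_size}: {time.monotonic() - start:.1f}s")
```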