We have a script that reads a collection X, does some processing, and writes to a collection Y. Usually the script takes ~1 hour, but sometimes it is very slow. What can affect performance?
I don't see a big load on my cluster… maybe a lock on my X collection is slowing down my script? Size of my X collection: Long('27776896250').
Without more information about your exact cluster setup we cannot really help.
When you write "read a collection X" and the size of X is 27_776_896_250, do you mean you read all of it and then write all of it back into a collection Y? Where is the script running relative to the server?
Where is the script running relative to the server?
The script runs on another machine but communicates with the cluster over the internal network (1 GB/s).
When you write "read a collection X" and the size of X is 27_776_896_250, do you mean you read all of it and then write all of it back into a collection Y?
I read the data from collection X, do some processing (if needed), and then write it to the new collection Y. I work in batches of 300. I create an empty collection, write all the data, and then create the index (maybe it's better to create the index before? I don't think so, because MongoDB will have to keep it up to date…).
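Simplified, the flow looks something like this (sketched in Python with pymongo here; the connection string, database/collection names, the processing step, and the indexed field are placeholders):

```python
from pymongo import MongoClient

BATCH_SIZE = 300

client = MongoClient("mongodb://host:27017")  # placeholder connection string
db = client["mydb"]                            # placeholder database name
source = db["X"]
target = db["Y"]

def process(doc):
    # placeholder for the per-document processing step
    return doc

batch = []
for doc in source.find({}, batch_size=BATCH_SIZE):
    batch.append(process(doc))
    if len(batch) == BATCH_SIZE:
        target.insert_many(batch, ordered=False)  # unordered: server can apply inserts in parallel
        batch = []
if batch:
    target.insert_many(batch, ordered=False)

# Index created after the load, as described above; creating it before the
# first insert would make MongoDB maintain it during every write instead.
target.create_index("some_field")  # placeholder field name
```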
Let me know if you have suggestions / tips / questions.
Thanks.
One idea, which may or may not be applicable to the whole use case, is to:
take 1 node out of the replica set
make this node a dedicated, non-replicated node where collection Y is written
My idea is that since you read and write a lot of data on the same cluster, you constantly swap the working set in and out, and your cluster might be overloaded by disk I/O. With the new setup, the replica set is not busy replicating all the writes and can serve the reads better.
Anyway, 4 nodes is not a recommended replica set configuration, so you lose nothing by taking one node out.
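You can check whether cache pressure is actually the problem by looking at the WiredTiger cache statistics in serverStatus. A rough sketch, assuming pymongo and a placeholder connection string:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://host:27017")  # placeholder: point at the node you want to inspect
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

used = cache["bytes currently in the cache"]
limit = cache["maximum bytes configured"]
pages_read = cache["pages read into cache"]

print(f"cache usage: {used / limit:.0%}")
print(f"pages read into cache since startup: {pages_read}")
# A cache sitting near 100% while pages are constantly read back in suggests
# the working set does not fit in memory and reads/writes compete for disk I/O.
```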
Can you give more details about:
The metrics you are using for the above.
The hardware configuration of the 4 nodes.
The read and write concerns you use.
Since the script usually takes 1 hour, it looks like an analytics use case, and having dedicated nodes for analytics is sometimes a good choice.
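For example, if you tag one member for analytics in the replica set configuration, the script can target it with a tag-aware read preference. A sketch assuming pymongo, with illustrative names and tags:

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")  # placeholder
analytics_db = client.get_database(
    "mydb",  # placeholder database name
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}]),  # assumes a member tagged this way
)

# Reads through this handle go to the tagged secondary, keeping the full scan
# of collection X away from the members serving the regular workload.
cursor = analytics_db["X"].find({}, batch_size=300)
```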
Have you tried different batch sizes?
I don't know, but it should be straightforward to test. That said, I suspect creating the index beforehand may be better, because building it after the load means every document has to be fetched again, adding a lot of I/O.
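Both things (batch size and index timing) are easy to benchmark against a sample of the data. A rough sketch, assuming pymongo, with placeholder names and sizes:

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://host:27017")  # placeholder connection string
db = client["mydb"]                            # placeholder database name
sample = list(db["X"].find().limit(100_000))   # sample size is arbitrary, just needs to be representative

for batch_size in (100, 300, 1000, 5000):
    db["Y_test"].drop()                        # throwaway target collection
    start = time.monotonic()
    for i in range(0, len(sample), batch_size):
        db["Y_test"].insert_many(sample[i:i + batch_size], ordered=False)
    print(f"batch_size={batch_size}: {time.monotonic() - start:.1f}s")
```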