Chunk migration performance

After a chunk merge procedure, the number of chunks is very imbalanced between our 2 replicated shards.
(The objective was to even out disk usage.)
During the first ~24 h, chunk moves were not blazing fast but had acceptable performance (maybe 30 s to 1 min for a 32 MB to 64 MB chunk).
At this rate, I expected to have the data balanced in ~1 week, which was OK.

“Suddenly” this performance degraded within a couple of hours.
Moving a single chunk now takes > 1000 s.
Changelog extract:

    "step 1 of 6" : 0,
    "step 2 of 6" : 79,
    "step 3 of 6" : 86,
    "step 4 of 6" : 1344089,
    "step 5 of 6" : 12,
    "step 6 of 6" : 456,
    "to" : "shard1",
    "from" : "shard2",
    "note" : "success"
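
To make the extract easier to read, here is a minimal sketch that sums the step timings and picks out the slowest one. It assumes the values are in milliseconds (which matches the > 1000 s I observe); the `summarize` helper and the variable names are mine, not part of MongoDB.

```python
# Step timings copied from the changelog extract above.
# Assumption: the values are in milliseconds.
steps_ms = {
    "step 1 of 6": 0,
    "step 2 of 6": 79,
    "step 3 of 6": 86,
    "step 4 of 6": 1344089,
    "step 5 of 6": 12,
    "step 6 of 6": 456,
}


def summarize(steps):
    """Return (total time in ms, name of the slowest step, its share of the total)."""
    total_ms = sum(steps.values())
    slowest = max(steps, key=steps.get)
    share = steps[slowest] / total_ms
    return total_ms, slowest, share


total_ms, slowest, share = summarize(steps_ms)
print(f"total: {total_ms / 1000:.1f} s")
print(f"slowest: {slowest} ({steps_ms[slowest] / 1000:.1f} s, {share:.2%} of total)")
```

Under that assumption, almost all of the time is spent in step 4 of 6 (about 1344 s); the other steps are negligible.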

The cluster is usually heavily loaded.

I’ve read a lot about this subject (here, the MongoDB Jira, dba.stackexchange, etc.) but can’t explain this bad performance.
I’ve tried taking the load off the cluster for short periods. The balancer is then the only activity on the cluster, but the slowness remains.
When the cluster is loaded, iotop typically shows something like ~30 MB/s read and ~10 MB/s write.
When the cluster is unloaded, the only process still doing some I/O in iotop on the target shard is the “chunkInserter”, which reads a lot (~4 MB/s) and writes a little (~100 KB/s).
After reading some Jira tickets, I tried the legacy replication protocol version (pv0). No effect.
The MongoDB data filesystems are XFS.
Plenty of RAM, SSD disks.

Any ideas?