Why does the MongoDB Spark connector v10 log "Partitioner Getting collection stats", and why does it take so long for large data sizes?

Hi:
When using the MongoDB connector for Spark 2.x or 3.x, the same volume of data works well with both the sample partitioner and the SplitVector partitioner.
But after upgrading to MongoDB Spark connector v10.x, we hit two strange issues:
1: The log shows "Partitioner Getting collection stats …", and this step takes a very long time for large data sizes.
2: We have two collections, but only the first one is read; the second one is not.

My questions:
Is there any limitation in v10.x, or is there anything we can do to skip the "Partitioner Getting collection stats" operation?
There is no SplitVector partitioner in v10.x; is there a reason it was removed?
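For reference, here is a minimal sketch of the kind of v10-style read configuration involved (the URI, database, and collection names are placeholders, not our real ones; the option keys and the partitioner class name follow the `spark.mongodb.read.*` naming documented for the v10 connector, where the sample partitioner is the default):

```python
# Hypothetical v10 read options; the partitioner setting is the knob that
# drives the partition-planning step where "Getting collection stats" is logged.
read_options = {
    "spark.mongodb.read.connection.uri": "mongodb://localhost:27017",  # placeholder
    "spark.mongodb.read.database": "mydb",        # placeholder name
    "spark.mongodb.read.collection": "events",    # placeholder name
    # Explicitly selecting the default sample partitioner:
    "spark.mongodb.read.partitioner":
        "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner",
    # Target partition size in MB (string-valued, per Spark option conventions):
    "spark.mongodb.read.partitioner.options.partition.size.mb": "128",
}

# In a live Spark session this would be used roughly as:
# df = spark.read.format("mongodb").options(**read_options).load()
```

This is only a sketch of the configuration surface, not a confirmed workaround for the slow stats call.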


I have exactly the same problem. My jobs get stuck in the partition-size calculation phase; after 40 minutes there is no progress. I cannot migrate to connector 10.2 because of this.