How to throttle mongo-spark-connector

My app server creates queries and inserts data into Mongo based on live user actions. This work is important and should take precedence over the reads Spark performs against Mongo for data analysis, which run concurrently. At present we get timeouts when trying to do live action-based queries during Spark read tasks.

How do I throttle down the load mongo-spark-connector puts on Mongo so that my live input can continue to be inserted while Spark is reading from Mongo?

UPDATE: Maybe a clue to controlling the load from Spark is understanding what that load depends on: the number of partitions, the number of cores for the job, or something in the Spark or job config?

There are a few things to consider. First, make sure your Spark job connections specify a read preference of secondary or secondaryPreferred. This will take the read burden off of the primary. If you are also writing and still have issues, you may want to add more mongos routers and use the log files to troubleshoot further where the bottleneck is.
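For example, the read preference can be set per read through the connector's ReadConfig. A minimal sketch, assuming mongo-spark-connector 2.x/3.x (the class names differ in the 10.x API) and a placeholder connection URI:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("analytics")
  .config("spark.mongodb.input.uri", "mongodb://host1,host2/mydb.events") // placeholder URI
  .getOrCreate()

// Route analytical reads to secondaries so the primary stays free for live traffic.
val readConfig = ReadConfig(
  Map("readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(spark.sparkContext))
)

val df = MongoSpark.load(spark, readConfig)
```

The same effect can be had by appending readPreference=secondaryPreferred to the connection string itself.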

We understand how to scale Mongo, but that is not our problem. The problem we face is that Spark can be HUGE in terms of the number of cores used. We should not have to scale Mongo to support this, since we only use Spark for a few hours per week to generate ML models. The rest of the time Mongo performs quite well for data input, which comes in continuously. What we need to do is scale the mongo-spark-connector so it doesn't overload the Mongo nodes we have already scaled to fit our live load (plus some margin). In short, we need to scale the load Spark puts on Mongo, not scale Mongo to handle the Spark load, which (without some way to throttle mongo-spark-connector) is FAR in excess of what the system normally needs.

For example, when we write from Spark we can "repartition" the Spark Dataset, and this indirectly throttles the number of connections made for the write.
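A minimal sketch of that write-side trick, assuming connector 2.x/3.x with spark.mongodb.output.uri already set on the session (the function name, DataFrame, and partition count are placeholders):

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.DataFrame

// Each output partition becomes one writing task, so capping the number of
// partitions caps the number of simultaneous connections hitting Mongo.
def throttledSave(results: DataFrame, maxConcurrentWriters: Int): Unit =
  MongoSpark.save(results.repartition(maxConcurrentWriters))
```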

But for reads there is no Dataset to repartition until after the Spark read has already happened.
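The closest read-side knob we have found so far is the connector's partitioner configuration (option names per the 2.x/3.x ReadConfig; prefixed with spark.mongodb.input. when set on the Spark conf). A larger partition size means fewer partitions, and therefore fewer concurrent cursors against Mongo, though it is unclear whether this reliably caps concurrency:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate() // existing session with spark.mongodb.input.uri set

// Fewer, larger partitions -> fewer simultaneous read cursors on Mongo.
val gentleRead = ReadConfig(
  Map(
    "partitioner" -> "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB" -> "256" // connector default is 64; bigger = fewer partitions
  ),
  Some(ReadConfig(spark.sparkContext))
)

val df = MongoSpark.load(spark, gentleRead)
```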

Hope this helps explain the issue and many thanks for your attention. We love Mongo and hope that solving this will allow others to use it in a similar way.

Throttling does not really make much sense in Spark: by slowing down mongo-spark-connector operations you are likely holding on to precomputed results that take up memory, and executors with their resources linger around longer, potentially preventing other jobs from running on the cluster.

You can decrease the number of executors and/or cores, and perhaps increase the number of partitions, though the Spark Mongo connector will still try to insert as quickly as possible. The best you can do is to make Spark less efficient, but why would you want to do that at the expense of overall resource utilization!?
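For the record, capping the job's footprint that way might look like this (a sketch; spark.executor.instances applies on YARN/Kubernetes and spark.cores.max on standalone, so the effective knobs depend on your cluster manager):

```scala
import org.apache.spark.sql.SparkSession

// Shrink the analytics job so it cannot open too many concurrent connections.
val spark = SparkSession.builder()
  .appName("weekly-ml-extract")            // placeholder app name
  .config("spark.executor.instances", "4") // fewer executors (YARN/Kubernetes)
  .config("spark.executor.cores", "2")     // fewer concurrent tasks per executor
  .config("spark.cores.max", "8")          // total-core ceiling (standalone mode)
  .getOrCreate()
```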

Making sense or not, throttling might be needed if (as in our case) your bottleneck is the resources allocated to the Mongo DB. We must write huge amounts of data (~GB) with a limited RU/s allocation. Therefore, we would like to stop writing once the maximum limit has been reached, wait for the time period to end, and write another batch once a new time period begins.
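As far as I know the connector has no built-in rate limiter, so one crude workaround is to split the Dataset on the driver and pace the writes. A sketch, again assuming connector 2.x/3.x with spark.mongodb.output.uri set; the function name is hypothetical, and the batch count and window length are made-up numbers to tune against your RU/s budget:

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.DataFrame

// Write one slice of the data per time window, sleeping out the remainder.
def pacedSave(df: DataFrame, numBatches: Int, windowMillis: Long): Unit = {
  val weights = Array.fill(numBatches)(1.0) // randomSplit normalizes these
  df.randomSplit(weights).foreach { batch =>
    MongoSpark.save(batch)                  // one bounded burst of writes...
    Thread.sleep(windowMillis)              // ...then wait for the next RU/s window
  }
}
```

This trades total job time for a bounded write rate, which matches the requirement: the live traffic matters more than how fast the batch finishes.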