Worker in mongosh

I’m porting some old aggregation scripts that used ScopedThread from parallelTester.js to split mapReduce operations across multiple threads and make better use of the available cores. The scripts are run directly in the mongo shell.

I have rewritten the threading part with the Node.js Worker API, but I am hitting a roadblock: threads spawned that way don’t inherit any of the mongosh environment, so neither db nor connect() is available. I cannot pass db as workerData either, because it cannot be cloned. If I install mongodb via npm, I can require it on the main thread, but the children cannot find any npm modules (I tried npm install with and without -g).

I’m just about to give up and rewire the whole thing so that it doesn’t run in parallel threads at all, but I was hoping that someone here might have had some success using Worker inside mongosh?

Here’s some mockup code as an example:

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const {MongoClient} = require('mongodb');
const uri = "mongodb+srv://localhost/mydb?retryWrites=true";

const bar = 'global info';

if (isMainThread) {
	const threads = new Set();
	threads.add(new Worker(__filename, { workerData: { foo: '1', bar }}));
	threads.add(new Worker(__filename, { workerData: { foo: '2', bar }}));
	for (let worker of threads) {
		worker.on('error', (err) => { throw err; });
		worker.on('exit', () => {
			threads.delete(worker);
			if (threads.size === 0) {
				console.log('done');
			}
		});
		worker.on('message', (msg) => {
			console.log('Worker Message: ',  msg);
		});
	}
} else {
	async function main(){
		const client = new MongoClient(uri);

	}
	main().catch(console.error);
	parentPort.postMessage(workerData);
}

Running this yields two error messages, one for each of the threads:

mongosh --quiet mydb ./testWorkerThreads.js

Uncaught:
Error: Cannot find module 'mongodb'
Require stack:
- ./testWorkerThreads.js
Uncaught:
Error: Cannot find module 'mongodb'
Require stack:
- ./testWorkerThreads.js
done

Using “db” inside the threads (which I’d prefer, as I wouldn’t need new connections) gets me a similar result - db is not defined.

Any ideas?

@Markus_Wollny Right, worker_threads inside mongosh just refers to the Node.js API, and when you use one you’ll get a plain Node.js worker thread, without mongosh APIs.

It is odd that require('mongodb') works in the main thread but not in Worker threads. That might be worth investigating; e.g. what does require.resolve('mongodb') return in the main thread, and how does that relate to the path of the script you’re running? (If I locally run npm install mongodb and then run a similar test script, it works just fine.)
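One way to run that check is a small diagnostic near the top of the script, so that it executes once in the main thread and again in each worker started from __filename. This is only a sketch: describeResolution is a hypothetical helper name, and I am assuming process and __dirname are visible in both contexts (the typeof guard covers the case where __dirname is not).

```javascript
const { isMainThread } = require('worker_threads');

// Report where a module would be loaded from in the current thread,
// or the error code if resolution fails.
function describeResolution(name) {
  try {
    return { name, resolved: require.resolve(name) };
  } catch (err) {
    return { name, error: err.code };
  }
}

// Runs in the main thread and again in every worker that re-executes
// this file, so the two outputs can be compared side by side.
console.log({
  thread: isMainThread ? 'main' : 'worker',
  cwd: process.cwd(),
  scriptDir: typeof __dirname === 'string' ? __dirname : '(not defined)',
  hasDb: typeof db !== 'undefined', // mongosh global; not expected in a plain worker
  mongodb: describeResolution('mongodb'),
});
```

If the main thread and the workers report different results for mongodb, comparing cwd and scriptDir should show which directory each lookup started from.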

That being said, it might be worth taking a step back and looking at the bigger picture:

  • Is this a good use case for mongosh? If you’re at the point where you want to use the driver directly already, it might be worth considering doing just that and running your scripts using plain Node.js instead of mongosh.
  • What exactly are you trying to parallelize here? If you’re referring to server-side mapReduce, then parallelism on the client won’t help a lot, especially considering that mongosh and the Node.js driver support parallelism for server-side operations without any threading at all (unlike the legacy shell).
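To illustrate the second point, here is a minimal sketch of running one server-side operation per bundle concurrently from a single thread with the Node.js driver. It uses an aggregation rather than mapReduce, and the collection name pageviews, the field site, the bundle shape, and the helper names runBundles and makeAggregateTask are all assumptions made for the example.

```javascript
// Start one async task per bundle and wait for all of them.
// No worker threads: the client only awaits the server's responses.
async function runBundles(bundles, processBundle) {
  const settled = await Promise.allSettled(bundles.map((b) => processBundle(b)));
  return settled.map((outcome, i) => ({
    bundle: bundles[i],
    ok: outcome.status === 'fulfilled',
    result: outcome.status === 'fulfilled' ? outcome.value : String(outcome.reason),
  }));
}

// Hypothetical per-bundle task; collection and field names are made up.
function makeAggregateTask(db) {
  return (bundle) => db.collection('pageviews').aggregate([
    { $match: { site: { $in: bundle.sites } } },
    { $group: { _id: '$site', hits: { $sum: 1 } } },
  ]).toArray();
}

// Not invoked automatically; call main(uri) from a plain Node.js script.
async function main(uri) {
  const { MongoClient } = require('mongodb'); // loaded only when main() runs
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const bundles = [{ sites: ['site-a', 'site-b'] }, { sites: ['site-c'] }];
    console.log(await runBundles(bundles, makeAggregateTask(client.db())));
  } finally {
    await client.close();
  }
}
```

Promise.allSettled is used so that one failing bundle does not hide the results of the others; whether the server actually executes the operations in parallel depends on the deployment.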

Thank you for your response. As I said, these MR tasks are very old and now run on a server with a lot more I/O than when they were originally deployed. They perform analytics aggregations for a couple of websites, and each thread processes a bundle of one or more sites. Parallel processing was originally used to avoid having to set up separate tasks for each bundle while making sure that the bundles would make use of the multiple cores. The segmentation for these tasks is somewhat different from parallelization within an aggregation, such as that provided by $facet in an aggregation pipeline.

But again, I guess you’re correct on both counts - the threading here is probably more a case of premature optimization and I’ll likely get rid of it. Regardless, I’d simply like to understand what’s going on here, otherwise this would feel too much like surrender.

Your suggestion of trying require.resolve has helped tremendously. I now understand that the main thread will always search for the module in the current directory, whereas the children will look for the module in the directory of the script. An easy workaround is to install the module globally and reference the full path in the require call, like this:

const {MongoClient} = require('/usr/local/lib/node_modules/mongodb');
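A variation that avoids hard-coding the path would be to resolve the module once in the main thread and hand the resulting path to each worker via workerData. The sketch below is untested inside mongosh; resolveInMainThread and loadInWorker are hypothetical helper names, and the worker source is passed inline with eval: true only to keep the example in one piece (the original script would keep using __filename).

```javascript
const { Worker } = require('worker_threads');

// Resolve the module where lookup is known to work; null if it is missing.
function resolveInMainThread(name) {
  try {
    return require.resolve(name);
  } catch (err) {
    return null;
  }
}

// Inline worker body: load the module from the path it was given and
// report the names it exports back to the parent.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  const loaded = require(workerData.modulePath);
  parentPort.postMessage(Object.keys(loaded));
`;

// Start a worker that loads the module from the given path.
function loadInWorker(modulePath) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { modulePath } });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}
```

With mongodb installed, loadInWorker(resolveInMainThread('mongodb')) should resolve to a list of export names that includes MongoClient, regardless of which directory the worker searches from.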

No more guesswork in either case. I just checked with require.resolve() at this point, and that does look promising, but I will probably just remove the threading altogether. I just wanted to understand what’s going on - there might still be a more valid use case, so this solution might still come in handy.

So thank you very much!

