Saving the results of an aggregation pipeline into a text file

Hello everyone!

I have an aggregation pipeline that hunts for doublets (duplicate documents) in my database, and I need to write the cursor output to a text file. I’m working from the terminal and I have Ubuntu on my machine :slight_smile:
Can anyone help me with this?
This is my aggregation :
db.candidates.aggregate([
  { $group: { _id: { name: "$name", xing: "$xing" }, count: { $sum: 1 }, docs: { $push: "$_id" } } },
  { $match: { count: { $gt: 1 } } }
])
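For context, the pipeline groups candidates by the name/xing pair, counts each group, and keeps only the groups with more than one document. In plain JavaScript, over a small made-up array just for illustration, it does roughly this:

```javascript
// Plain-JS sketch of what the $group/$match pipeline computes,
// over a small hard-coded sample (not real data).
const docs = [
  { _id: 1, name: 'A', xing: 'X' },
  { _id: 2, name: 'A', xing: 'X' },
  { _id: 3, name: 'B', xing: 'Y' },
];

// $group: one bucket per { name, xing } pair, with a count and the member _ids.
const groups = new Map();
for (const d of docs) {
  const key = JSON.stringify({ name: d.name, xing: d.xing });
  if (!groups.has(key)) {
    groups.set(key, { _id: { name: d.name, xing: d.xing }, count: 0, docs: [] });
  }
  const g = groups.get(key);
  g.count += 1;
  g.docs.push(d._id);
}

// $match: keep only buckets with more than one document.
const duplicates = [...groups.values()].filter((g) => g.count > 1);
// → [ { _id: { name: 'A', xing: 'X' }, count: 2, docs: [ 1, 2 ] } ]
```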

Well, thank you in advance!

Pietro

1 Like

Hi @Pietro_Simcic !

Are you using mongosh or the legacy shell (mongo)? If you are using mongosh, you can print a cursor to a file like this:

cursor = db.test.aggregate([...]);
cursorAsStream = stream.Readable.from(cursor.map(entry => EJSON.stringify(entry) + '\n'));
cursorAsStream.pipe(fs.createWriteStream('/tmp/out'));

where the script takes a cursor, turns it into a Node.js stream of Extended-JSON-formatted lines, and then writes that data into an output file.

5 Likes

Hey @Anna_Henningsen !
Amazing!
So that means I can store the cursor in a variable in mongosh?
And in the last line, I just need to swap /tmp/out with a path to the designated file right?

1 Like

@Pietro_Simcic Yes, you can store cursors in variables. If you need the results as arrays, you can also call .toArray() on them and store that in a variable (however, streaming will work much better when you’re dealing with large datasets).

And yup, just replace it with the filename you want it to have :+1:

2 Likes

I tried this:

cursor = db.test.aggregate([{$group:{_id:{name:"$name",xing:"$xing"},count:{$sum:1},docs:{$push:"$_id"}}},{$match:{count:{$gt:1}}}]);
cursorAsStream = stream.Readable.from(cursor.map(entry => EJSON.stringify(entry) + '\n'));
cursorAsStream.pipe(fs.createWriteStream('/home/blatta/Documents/IT-REC/doublets'));

But I’ve got this error:

MongoCursorExhaustedError: Cursor is exhausted

Any ideas?

1 Like

@Pietro_Simcic Can you wrap that whole piece of code in { ... } curly brackets? If you paste it line-by-line into the shell, it won’t work because mongosh tries to print the cursor output after the first line, before it has even seen that there is more code.

2 Likes

@Anna_Henningsen the result is an empty file.
I also tried to add an extension but I have the same result.
If needed I can send you the WriteStream object.

1 Like

@Pietro_Simcic Can you share the exact steps that you are running?

1 Like

sure @Anna_Henningsen !

{cursor = db.test.aggregate([{$group:{_id:{name:"$name",xing:"$xing"},count:{$sum:1},docs:{$push:"$_id"}}},{$match:{count:{$gt:1}}}]);
cursorAsStream = stream.Readable.from(cursor.map(entry => EJSON.stringify(entry) + '\n'));
cursorAsStream.pipe(fs.createWriteStream('/home/blatta/Documents/IT-REC/doublets'));}
1 Like

@Pietro_Simcic If I run

db.test.drop(); /* clear the test collection */
db.test.insertMany([
{name: 'A', xing: 'X'},
{name: 'A', xing: 'X'},
{name: 'B', xing: 'Y'},
]);

(just for test data) and then run your exact script, I get a file with one line, as I would expect. I’m not sure what is different here, tbh.

1 Like

It’s also giving me a file with only one line.

1 Like

@Pietro_Simcic How many results are you expecting?

1 Like

@Anna_Henningsen 3705 :frowning:

1 Like

@Pietro_Simcic Some things to check:

  • Are you looking at the right file?
  • Does the single line you are looking at contain all the duplicates (i.e. are these all duplicates of a single name/xing combination) or are you looking at multiple groups?
  • What happens when you add cusorAsStream.pipe(process.stderr); before the other .pipe() command (in the same {...} block)? Are you seeing more results then?
1 Like

Also: Are you running a recent mongosh version, and does the aggregation take a long time to run? (The command returns control to the user immediately, but it may take a while for the aggregation to actually run and for the results to be written to disk.)

1 Like

Maybe, just maybe,

cursor = db.test.aggregate( … )

should be written as

cursor = await db.test.aggregate( … )

1 Like
  1. Yes, I’m looking at the right file, there is nothing else in the folder :).
  2. This is the content of the file:
    {"_id":{"name":"A","xing":"X"},"count":2,"docs":[{"$oid":"61377252a41e6aebfbd98cae"},{"$oid":"61377252a41e6aebfbd98caf"}]}
  3. When I add the cusorAsStream.pipe(process.stderr); like this:

{cursor = db.test.aggregate([{ $group: { _id: { name: "$name", xing: "$xing" }, count: { $sum: 1 }, docs: { $push: "$_id" } } }, { $match: { count: { $gt: 1 } } }]); cursorAsStream = stream.Readable.from(cursor.map((entry) => EJSON.stringify(entry) + '\n')); cusorAsStream.pipe(process.stderr); cursorAsStream.pipe(fs.createWriteStream('/home/blatta/Documents/IT-REC/doublets'));}

I get this error: ReferenceError: cusorAsStream is not defined

  4. The aggregation is quick, I get an almost instant response.
  5. My version of mongosh is 1.0.5.
1 Like

Because you wrote cusorAsStream rather than cursorAsStream.

2 Likes

@steevej No – in the shell, async/await is implicit, and in both the shell and modern versions of the Node.js driver, db.test.aggregate() returns a Cursor object, not a Promise.

@Pietro_Simcic

This is the content of the file:

Did you overwrite your data with my test data?

when I add the cusorAsStream.pipe(process.stderr); in this way :

There’s a typo in there (cusor, not cursor).

3 Likes

yep thank you @steevej !
Yes, I followed your instructions, Anna.

1 Like