Saving the results of an aggregation pipeline into a text file

Hello everyone!

I have an aggregation pipeline that hunts for doublets (duplicate documents) in my database, and I need to write the cursor output to a text file. I’m working from the terminal and I have Ubuntu on my machine :slight_smile:
Can anyone help me with this?
This is my aggregation :
db.candidates.aggregate([
  { $group: { _id: { name: "$name", xing: "$xing" }, count: { $sum: 1 }, docs: { $push: "$_id" } } },
  { $match: { count: { $gt: 1 } } }
])
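For context, the pipeline groups candidates by the name/xing pair, counts each group, and keeps only the groups with more than one document. In plain JavaScript, over a small made-up array just for illustration, it does roughly this:

```javascript
// Plain-JS sketch of what the $group/$match pipeline computes,
// over a small hard-coded sample (not real data).
const docs = [
  { _id: 1, name: 'A', xing: 'X' },
  { _id: 2, name: 'A', xing: 'X' },
  { _id: 3, name: 'B', xing: 'Y' },
];

// $group: one bucket per { name, xing } pair, with a count and the member _ids.
const groups = new Map();
for (const d of docs) {
  const key = JSON.stringify({ name: d.name, xing: d.xing });
  if (!groups.has(key)) {
    groups.set(key, { _id: { name: d.name, xing: d.xing }, count: 0, docs: [] });
  }
  const g = groups.get(key);
  g.count += 1;
  g.docs.push(d._id);
}

// $match: keep only buckets with more than one document.
const duplicates = [...groups.values()].filter((g) => g.count > 1);
// → [ { _id: { name: 'A', xing: 'X' }, count: 2, docs: [ 1, 2 ] } ]
```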

Well, thank you in advance!

Pietro

1 Like

Hi @Pietro_Simcic !

Are you using mongosh or the legacy shell (mongo)? If you are using mongosh, you can print a cursor to a file like this:

cursor = db.test.aggregate([...]);
cursorAsStream = stream.Readable.from(cursor.map(entry => EJSON.stringify(entry) + '\n'));
cursorAsStream.pipe(fs.createWriteStream('/tmp/out'));

where the script takes a cursor, turns it into a Node.js stream of Extended-JSON-formatted lines, and then writes that data into an output file.

5 Likes

Hey @Anna_Henningsen !
Amazing!
So that means I can store the cursor in a variable in mongosh?
And in the last line, I just need to swap /tmp/out with a path to the designated file right?

1 Like

@Pietro_Simcic Yes, you can store cursors in variables. If you need the results as arrays, you can also call .toArray() on them and store that in a variable (however, streaming will work much better when you’re dealing with large datasets).

And yup, just replace it with the filename you want it to have :+1:

2 Likes

I tried this:

cursor = db.test.aggregate([{$group:{_id:{name:"$name",xing:"$xing"},count:{$sum:1},docs:{$push:"$_id"}}},{$match:{count:{$gt:1}}}]);
cursorAsStream = stream.Readable.from(cursor.map(entry => EJSON.stringify(entry) + '\n'));
cursorAsStream.pipe(fs.createWriteStream('/home/blatta/Documents/IT-REC/doublets'));

But I’ve got this error:

MongoCursorExhaustedError: Cursor is exhausted

Any ideas?

1 Like

@Pietro_Simcic Can you wrap that whole piece of code in { ... } curly brackets? If you paste it line-by-line into the shell, it won’t work because mongosh tries to print the cursor output after the first line, before it has even seen that there is more code.

2 Likes

@Anna_Henningsen the result is an empty file.
I also tried to add an extension but I have the same result.
If needed I can send you the WriteStream object.

1 Like

@Pietro_Simcic Can you share the exact steps that you are running?

1 Like

sure @Anna_Henningsen !

{cursor = db.test.aggregate([{$group:{_id:{name:"$name",xing:"$xing"},count:{$sum:1},docs:{$push:"$_id"}}},{$match:{count:{$gt:1}}}]);
cursorAsStream = stream.Readable.from(cursor.map(entry => EJSON.stringify(entry) + '\n'));
cursorAsStream.pipe(fs.createWriteStream('/home/blatta/Documents/IT-REC/doublets'));}
1 Like

@Pietro_Simcic If I run

db.test.drop(); /* clear the test collection */
db.test.insertMany([
{name: 'A', xing: 'X'},
{name: 'A', xing: 'X'},
{name: 'B', xing: 'Y'},
]);

(just for test data) and then run your exact script, I get a file with one line, as I would expect. I’m not sure what is different here, tbh.

1 Like

It’s also giving me a file with only one line.

1 Like

@Pietro_Simcic How many results are you expecting?

1 Like

@Anna_Henningsen 3705 :frowning:

1 Like

@Pietro_Simcic Some things to check:

  • Are you looking at the right file?
  • Does the single line you are looking at contain all the duplicates (i.e. are these all duplicates of a single name/xing combination) or are you looking at multiple groups?
  • What happens when you add cusorAsStream.pipe(process.stderr); before the other .pipe() command (in the same {...} block)? Are you seeing more results then?
1 Like

Also: Are you running a recent mongosh version, and does the aggregation take a long time to run? (The command returns control to the user immediately, but it may take a while for the aggregation to actually run and for the results to be written to disk.)

1 Like

Maybe, just maybe,

cursor = db.test.aggregate( … )

should be written as

cursor = await db.test.aggregate( … )

1 Like
  1. Yes, I’m looking at the right file, there is nothing else in the folder :).
  2. This is the content of the file:
    {"_id":{"name":"A","xing":"X"},"count":2,"docs":[{"$oid":"61377252a41e6aebfbd98cae"},{"$oid":"61377252a41e6aebfbd98caf"}]}
  3. When I add the cusorAsStream.pipe(process.stderr); like this:

{cursor = db.test.aggregate([{ $group: { _id: { name: "$name", xing: "$xing" }, count: { $sum: 1 }, docs: { $push: "$_id" } } }, { $match: { count: { $gt: 1 } } }]); cursorAsStream = stream.Readable.from(cursor.map((entry) => EJSON.stringify(entry) + '\n')); cusorAsStream.pipe(process.stderr); cursorAsStream.pipe(fs.createWriteStream('/home/blatta/Documents/IT-REC/doublets'));}

I get this error: ReferenceError: cusorAsStream is not defined

  4. The aggregation is quick, I get an almost instant response.
  5. My version of mongosh is 1.0.5.
1 Like

Because you wrote cusorAsStream rather than cursorAsStream.

2 Likes

@steevej No – in the shell, async/await is implicit, and in both the shell and modern versions of the Node.js driver, db.test.aggregate() returns a Cursor object, not a Promise.

@Pietro_Simcic

This is the content of the file:

Did you overwrite your data with my test data?

when I add the cusorAsStream.pipe(process.stderr); in this way :

There’s a typo in there (cusor, not cursor).

3 Likes

yep thank you @steevej !
Yes, I followed your instructions, Anna.

1 Like