When to use aggregation pipelines instead of processing data at the app level?

empty2 · July 17, 2023, 6:45pm

I am learning the aggregation framework and I have some questions. I would be very grateful for your answers.

When should I use aggregation pipelines, and when should I use simple find queries and then process data in code?
Does the aggregation pipeline execute in the database and only the result transfers via the network?

Kobe_W · July 17, 2023, 8:23pm

It can save you a few round trips. And the db code may sometimes be optimized for special tasks, who knows.

Aggregation can run on a db shard or on mongos, on a secondary or on the primary.

Jack_Woehr · July 17, 2023, 8:39pm

Aggregation is the mainstream of MongoDB programming.
Use it all the time because that’s the way to get familiar.

empty2 · July 17, 2023, 8:52pm

But I guess we cannot use just one option every time. Aggregation pipelines have drawbacks. Sometimes writing the same functionality in a programming language like JS would be easier.

And I am unsure when to use aggregation pipelines and when to skip them.

Jack_Woehr · July 17, 2023, 8:55pm

Well, if you were doing this in a relational database, when would you use:

SQL query, possibly complex with JOIN
Stored procedure
Simple or multiple queries and the rest in procedural program logic?

It’s the same thing. But it’s hard to do much in MongoDB without aggregation beyond a simple find().

empty2 · July 17, 2023, 9:06pm

But I can do everything with a simple find(), update() and so on, combining them at the application level.

For example, instead of lookup I can write something like this:

const x = await dbFindX({
  _id: 'someId',
});
const y = await dbFindY({
  _id: x.yId,
});

And almost everything can be written the same way at the application level instead of at the database level.
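For comparison, the two finds above could be collapsed into a single server-side call with $lookup. This is only a sketch under assumptions: the collection name `y`, the output field `yDoc`, and the helpers `buildLookupPipeline` and `findXWithY` are made up for illustration, and `xCollection` is assumed to be a Collection object from the official Node.js driver.

```javascript
// Build a pipeline that fetches one document from "x" and joins the
// matching document from "y" in the same call. Names are hypothetical.
function buildLookupPipeline(xId) {
  return [
    { $match: { _id: xId } },
    {
      $lookup: {
        from: 'y',             // assumed name of the second collection
        localField: 'yId',
        foreignField: '_id',
        as: 'yDoc',            // $lookup always produces an array here
      },
    },
    // Turn the array into a single embedded document (kept if no match).
    { $unwind: { path: '$yDoc', preserveNullAndEmptyArrays: true } },
  ];
}

// xCollection is assumed to be a driver Collection for "x".
async function findXWithY(xCollection, xId) {
  const results = await xCollection
    .aggregate(buildLookupPipeline(xId))
    .toArray();
  return results[0] ?? null;
}
```

For a single document looked up by _id the gain is one round trip; the difference tends to matter more when many documents are joined at once.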

John_Sewell · July 17, 2023, 10:28pm

You can do that, but running server side you get access to the indexes, the data and the server runtime in their purest form. Say you wanted to group 5M records by fields A and B, summing field C, and then output the results to a new collection, or use them to update another collection, for monthly sales figure aggregation or similar.
You can pull all that data back to the client, limiting fields, but then you need to process all of it client side, form the updates and send them back to the server.
Or you can do it in one aggregation call to the server, which runs in the server process and, with covering indexes, can work largely in memory, even over a cluster of sharded collections.
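A rough sketch of the single call described above, next to the in-application equivalent of the grouping step. The field names A, B and C come from the post; the target collection `monthlyTotals` and the function names are assumptions, `records` is assumed to be a Node.js driver Collection, and $merge needs MongoDB 4.2 or later.

```javascript
// Server-side version: group by A and B, sum C, store the result.
const groupAndStorePipeline = [
  { $group: { _id: { a: '$A', b: '$B' }, totalC: { $sum: '$C' } } },
  // $merge must be the last stage; "monthlyTotals" is a hypothetical target.
  {
    $merge: {
      into: 'monthlyTotals',
      whenMatched: 'replace',
      whenNotMatched: 'insert',
    },
  },
];

// records is assumed to be a driver Collection.
async function runOnServer(records) {
  // With $merge no documents come back; toArray() just runs the cursor.
  await records.aggregate(groupAndStorePipeline).toArray();
}

// Application-level equivalent of the $group stage: every document
// has to be transferred to the client before this can run.
function groupInApp(docs) {
  const totals = new Map();
  for (const { A, B, C } of docs) {
    const key = JSON.stringify([A, B]);
    totals.set(key, (totals.get(key) ?? 0) + C);
  }
  return [...totals].map(([key, totalC]) => {
    const [a, b] = JSON.parse(key);
    return { _id: { a, b }, totalC };
  });
}
```

In the application-level version the results would still have to be written back with separate update calls, which is the extra traffic the post is pointing at.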

Personally I use it constantly throughout the day to explore and analyse data. I can sit down with a BA who needs something and in the space of minutes we can drill into the data, slicing and dicing it and then collating and outputting, all from one window.

Of course it has its place. You shouldn’t use it to reformat everything that flows into or out of your application; you should model the data as it’s used and cope with the edge cases or special situations separately. If 99% of the application is presenting data in a certain way, then keeping it together makes sense, and you can then add indexes etc. to help with the special cases, or build something specific for them.

As you can probably tell, I like it. You can build up aggregation pipelines gradually and debug them easily by adding and removing stages, even debugging performance with explain. It’s probably what I find most powerful about Mongo.
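One way to picture that stage-by-stage workflow, as a sketch only: the stages and field names below (`status`, `customerId`, `amount`) are invented for illustration, `collection` is assumed to be a Node.js driver Collection, and `firstStages` and `inspect` are hypothetical helpers.

```javascript
// Hypothetical pipeline for an orders-like collection.
const stages = [
  { $match: { status: 'shipped' } },
  { $group: { _id: '$customerId', total: { $sum: '$amount' } } },
  { $sort: { total: -1 } },
  { $limit: 10 },
];

// Keep only the first n stages so intermediate output can be inspected.
function firstStages(pipeline, n) {
  return pipeline.slice(0, n);
}

// collection is assumed to be a driver Collection.
async function inspect(collection, n) {
  const partial = firstStages(stages, n);
  // Look at a handful of documents as they leave stage n.
  const sample = await collection
    .aggregate([...partial, { $limit: 5 }])
    .toArray();
  // Ask the server how it would execute those stages.
  const plan = await collection.aggregate(partial).explain('executionStats');
  return { sample, plan };
}
```

Calling `inspect(collection, 1)`, then `inspect(collection, 2)` and so on shows what each added stage does to the data and to the query plan.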

Of course, if you’re using a purely relational model…then use a relational database, but don’t use Oracle…friends don’t let friends use Oracle…

empty2 · July 19, 2023, 8:29am

I guess the most important reasons why we should not use aggregation pipelines are:

We replace an easy-to-understand language (e.g. JavaScript) with a “MongoDB vendor language”, and in the future it would be more difficult to maintain the code base.
Also, the Node.js driver for MongoDB is typed very badly, so we lose all the power of TypeScript and have no compile-time validation.

@Jack_Woehr @John_Sewell
What do you think about that?

John_Sewell · July 19, 2023, 9:17am

While vendor lock-in is an issue, there is a balance to be found. For example, if I want to limit data from a SQL-based server: SQL Server, Oracle and others have different options for Limit/Top, so I could have my client pull all the data back and only take the first X records. Or, if the grouping styles were different, I could ignore all the server functions available and implement grouping, rollup etc. in client code.

For the second point, I can’t comment that much as I don’t use TypeScript (much…I’m learning Angular at the moment).

As I said before, for your bread and butter calls to Mongo for your app, you probably don’t want to be doing complex aggregations as the data will be in a usable format.
For other scenarios, say reports that are run on occasion or statistics generation, it comes down to whether you want to harness the power of the server-side processing that aggregation gives you or rewrite the server code in an interpreted language. With large datasets this could be a huge performance hit, plus you are now maintaining code that reimplements something that already exists on the server.

If you really want to abstract things out, then put an extra data access abstraction layer in there so your application can talk to Mongo, Oracle or csv files, but you definitely are creating more complex code now!
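A minimal sketch of what such a data access layer might look like. Everything here is assumed for illustration: the `totalsByRegion()` contract, the `region` and `amount` fields, and the class names. The MongoDB variant expects a Node.js driver Collection.

```javascript
// In-memory implementation, e.g. for tests or rows loaded from a CSV file.
class InMemorySalesRepository {
  constructor(rows) {
    this.rows = rows;
  }

  async totalsByRegion() {
    const totals = {};
    for (const { region, amount } of this.rows) {
      totals[region] = (totals[region] ?? 0) + amount;
    }
    return totals;
  }
}

// MongoDB implementation that pushes the same work to the server.
class MongoSalesRepository {
  constructor(collection) {
    this.collection = collection; // assumed driver Collection
  }

  async totalsByRegion() {
    const docs = await this.collection
      .aggregate([{ $group: { _id: '$region', total: { $sum: '$amount' } } }])
      .toArray();
    return Object.fromEntries(docs.map((d) => [d._id, d.total]));
  }
}

// Application code depends only on the totalsByRegion() contract.
async function formatReport(repo) {
  const totals = await repo.totalsByRegion();
  return Object.keys(totals)
    .sort()
    .map((region) => `${region}: ${totals[region]}`)
    .join('\n');
}
```

The application no longer cares where the numbers come from, but there are now two implementations to keep in step, which is the added complexity mentioned above.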

Jack_Woehr · July 19, 2023, 2:22pm

easy to understand vs MongoDB

If you want to use an RDB, you use SQL. If you want to use MongoDB, you use aggregation. It’s just part of the game. MongoDB aggregation pipelining is the most efficient way to run complex aggregation over a MongoDB database.

Node.js driver typed very badly.

Perhaps, I’m not an expert. It’s open source. Contribute code to the project to fix that deficiency.

In the end it’s your decision. What works for you, works for you. You don’t need anyone’s approval to do what you are doing!