The most performant method of querying paginated data with MongoDB

This topic is all about which approach is more performant. I am new to MongoDB & NoSQL and this may be obvious to some, but I am trying to figure out the most efficient way for my back-end to let the front-end fetch paginated data that combines two fields. Essentially I want to know whether plan A or plan B below is better: storing more data in the database, versus storing less data but having to amend, merge and sort it before returning it to the front-end.

To provide some context, I want the back-end to merge the affiliatePayments and affiliateCommissions fields in date order and expose the pagination-friendly result to the front-end via an API.

Current user model (with an example user):

_id: ObjectId('631a5067bcfeb5940686f144')
email: "test@user.com"
hashedPassword: "test"
dateCreated: 2022-09-08T20:28:23.766+00:00
plan: "FREE"
directAffiliateSignup: false
referralPromoCode: "f2f11fed-ab02-46df-8eff-ba3d8cc3fb93"
affiliateRate: 0.3
affiliatePayments: [{date: 2022-07-27, amount: 2, id: someMongoId}, {date: 2022-08-23, amount: 3, id: someMongoId}]
affiliateCommissions: [{date: 2022-08-01, amount: 10, id: someMongoId}, {date: 2022-09-01, amount: 4, id: someMongoId}]

The API will also need to differentiate between each affiliatePayment and affiliateCommission, so each affiliate payment will need a type field with the value 'payment' and each affiliate commission will need a type field with the value 'commission'.

So the API should return something like this example data:

[{date: 2022-07-27, amount: 2, id: someMongoId, type: 'payment'}, {date: 2022-08-01, amount: 10, id: someMongoId, type: 'commission'}, {date: 2022-08-23, amount: 3, id: someMongoId, type: 'payment'}, {date: 2022-09-01, amount: 4, id: someMongoId, type: 'commission'}]

As the affiliatePayments & affiliateCommissions fields don’t have a type field stored in the database, I assumed the plan to return the correct data would be something like this:

Plan A:

  1. fetch all affiliatePayments & affiliateCommissions
  2. add a 'type' key with the value 'payment' to each affiliate payment record
  3. add a 'type' key with the value 'commission' to each affiliate commission record
  4. merge affiliatePayments & affiliateCommissions into single array
  5. sort (merged array) in date order
  6. limit & skip (just for pagination)
  7. return data to FE
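
For what it's worth, here is a rough sketch of how I imagine Plan A looking as a single aggregation pipeline (mongosh syntax; userId, page and pageSize are just placeholders, and I haven't tested this):

db.users.aggregate([
  { $match: { _id: userId } },
  // tag each entry with a type, then merge the two arrays into one
  { $project: {
      transactions: {
        $concatArrays: [
          { $map: {
              input: "$affiliatePayments",
              as: "p",
              in: { $mergeObjects: ["$$p", { type: "payment" }] }
          } },
          { $map: {
              input: "$affiliateCommissions",
              as: "c",
              in: { $mergeObjects: ["$$c", { type: "commission" }] }
          } }
        ]
      }
  } },
  // flatten the merged array, sort by date, then paginate
  { $unwind: "$transactions" },
  { $replaceRoot: { newRoot: "$transactions" } },
  { $sort: { date: 1 } },
  { $skip: page * pageSize },
  { $limit: pageSize }
])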

Plan B, however, would involve simpler processing but more data stored in the database: affiliatePayments and affiliateCommissions would each store a type on every array entry, and an additional affiliateTransactions field would already contain the example data above - essentially a duplicate of affiliatePayments and affiliateCommissions, pre-merged into a single array.

Plan B:

  1. Fetch affiliateTransactions
  2. limit & skip (just for pagination)
  3. return data to FE
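
Again only as an untested sketch, I imagine Plan B boiling down to a single query with a $slice projection on the pre-merged array (assuming affiliateTransactions is kept sorted by date whenever an entry is appended; userId, page and pageSize are placeholders):

db.users.findOne(
  { _id: userId },
  // $slice: [skip, limit] pages through the already-merged, already-sorted array
  { affiliateTransactions: { $slice: [page * pageSize, pageSize] } }
)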

So my key question is this: which plan is more performant?

Is it better to store a merged affiliateTransactions list (and extra type fields on affiliatePayments and affiliateCommissions) in the database? It’ll essentially copy affiliatePayments and affiliateCommissions. Or is it better to store less data in the database & accept the more expensive query processing that’ll presumably use more bandwidth?

Am I correct in thinking I should be reluctant to add extra fields to the database? It would make things much easier, but isn't it unclean, inefficient and more expensive? I don't know the scale of the trade-off between the two options, and I could be making wrong assumptions - e.g. does querying more data cost more bandwidth, does storing more data cost more, etc.

Any advice would be greatly appreciated here. Thank you

Note: I also wonder whether I should get rid of the affiliatePayments and affiliateCommissions fields and just have one affiliateTransactions field (containing both the payments and commissions data). However, I have a separate query for fetching just the payments and another for just the commissions, and keeping them in separate lists avoids having to filter a single affiliateTransactions field.
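
For reference, my rough (untested) understanding is that if I did collapse everything into a single affiliateTransactions field, the payments-only and commissions-only queries could still be served with a $filter projection, something like:

db.users.aggregate([
  { $match: { _id: userId } },
  { $project: {
      // keep only the entries whose type is "payment"
      affiliatePayments: {
        $filter: {
          input: "$affiliateTransactions",
          as: "t",
          cond: { $eq: ["$$t.type", "payment"] }
        }
      }
  } }
])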

Hi @Nick_Smith and welcome to the community!!

For plan A, if performing these modifications reduces the query execution time, the recommendation would be to do so.

However, for plan B, if adding the extra data to each document improves performance, this could also be a preferred method.

But please note that skip+limit is not the most performant way to do pagination.
To avoid it, range queries or the bucket pattern would be useful.
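
For example, a range-based approach pages by remembering the last value seen rather than skipping over documents. As a rough illustration only (it assumes the transactions live in their own collection, or in buckets, rather than in embedded arrays; lastSeenDate and pageSize are placeholders):

db.affiliateTransactions.find(
  // fetch the next page strictly after the last date returned on the previous page
  { userId: userId, date: { $gt: lastSeenDate } }
).sort({ date: 1 }).limit(pageSize)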

As far as I understand your use case, there are generally trade-offs between disk space and execution time. Depending on your hardware, plan A might be performant enough for 99% of your use case, but plan B might be better once the collection hits a certain size.

However, I would note that in most cases, the less work the database has to do to return the result, the better it performs. That is, the plan that requires the fewest steps would usually be relatively more performant compared to a more complex execution plan. But I would not put this generalisation into production without extensive testing on the hardware you have, along with the workload you're expecting in your day-to-day operation.

This depends on how much data you're adding to each document. If the extra fields simplify your workflow and queries, and result in better performance, I don't think you should hesitate to add them. But again, this also depends on extensive testing using your expected workload.

Let us know if you have any further questions.

Best regards
Aasawari

Thank you very much Aasawari. I will take your advice and I think I will choose option B, and research the bucket pattern you suggested.

Best wishes,
Nick

Quick question: I have read the bucket pattern article and noticed that the history array is unbounded. I thought this would be an anti-pattern, since the array could grow without limit and breach the 16MB document size limit? If that's true, how can pagination be improved? I can see how skipping becomes less efficient as the data grows, though.
