MongoDB - Newbie question on Aggregation

Jacques_Luckhoff · July 9, 2020, 7:37am

Hi all,

Im completely new to Mongo and trying to understand how it stores data but specifically its capability to aggregate data.

The reason I am asking is that I am currently looking for to replace a relational system with billions or records of orderdata.
effectively there are about 150 million orders in the system but through all the lines, transactions and services this results in billons of records across multiple tables.
I understand that Mongo stores its data in Json/bson formats but I am not clear on how it would perform aggregating this

Please correct me if I am wrong but my thinking would be that in Mongo it would be possible to store a document (Json/bson) that contains for example an order and some (maybe all) of its lines/transactions

Something like this:

{
  "_id": 1,
  "reference : "ORDERXYZ",
  "orderdate" : "2020-01-01",
  "orderlines" : [
    {
      "position" : 1,
      "description" : "blue box",
      "quantity" : 1,
      "price" : 1.99
    }, {
      "position" : 2,
      "description" : "red box",
      "quantity" : 2,
      "price" : 3.99
    }
  ]
}

So in my example the idea would be to have a 150 million documents stored like this … instead of 800 million rows accross multiple tables:

My question is how would Mongo perform if I were to query and aggregate this data? For example in Classic SQL assuming I have a order and orderline table I could use the following query to give me a list of orders containing “red boxes” and the total value they make up in each order:

SELECT o.reference, SUM(l.price * l.quantity) AS summedvalue
FROM order o
INNER JOIN lines l
ON o.id = l.order_id
WHERE l.description = ‘red box’
GROUP BY o.reference

I am sure this is possible in Mongo but was wondering if someone could help me out on how the qery performance would compare to “classic” relational databases and also whether my understanding of how Mongo works is correct.

Many thanks

j

Prasad_Saya · July 9, 2020, 11:11am

Hello @Jacques_Luckhoff, welcome to the MongoDB forum.

The aggregation query for the corresponding SQL query would be as follows (this is run from mongo shell):

db.orders.aggregate( [
  { 
      $match: { "orderlines.description": "red box" } 
  },
  { 
      $unwind: "$orderlines" 
  },
  { 
      $group: { 
          _id: { "reference": "$reference" }, 
          sum_value: { $sum: { $cond: [ { $eq: [ "$orderlines.description", "red box" ] }, 
                                        { $multiply: [ "$orderlines.quantity", "$orderlines.price" ] },
                                        0
                                ]
          } } 
      } 
  }
] )

This output will be like this:

{ "_id" : { "reference" : "ORDERXYZ" }, "sum_value" : 7.98 }

The aggregation query has three stages; and the first is the $match. The query reads the collection and filters documents in the match stage. The filtered documents are the ones with the order line item description as “red box”. The $unwind stage flattens the array field orderlines- this is for grouping and summing in the next stage, the $group.

In the above query the documents are scanned one at a time in the match stage, and are filtered. This can take a long time scanning all the collection documents and filter based upon the supplied predicate. To improve the performance, we can create an index on the fields used for filtering - in this case the field of the orderlines.description, a field in an embedded document of an array. Create an index and verify it is created from mongo shell:

db.orders.createIndex( { "orderlines.description" : 1 } )
db.orders.getIndexes()

The index on an array field is called as Multikey Index.

Further, verify the index is applied on the query. This is by generating a query plan with the explain() method on the query:

db.orders.explain().aggregate( [ { ... } ] )

The generated plan document shows the index usage, this is indicated by a IXSCAN. When there is no index usage it will be a COLLSCAN (collection scan).

Indexes are used to make your queries run faster (or perform better). There are many types of indexes, including unique, compound, text, etc. And, indexes are used not only on query filters, but also for sort operations. Within aggregation queries, there are ways to optimally define and utilize indexes - see Aggregation Pipeline Optimization and Indexing Strategies.

Jacques_Luckhoff · July 10, 2020, 7:50am

Hi Prassad,

Thank you for your very informative answer, it was exactly the information I was trying to find out

Regards

Jacques

system · July 15, 2020, 7:50am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.