Views vs Denormalized Data for Effective Querying of Highly-Normalized Data

Hi there!

I’ve got my hands on a large, normalized database in MongoDB (LONG STORY for another time…).

After a few months of study I believe I have a solid grasp on how/where to appropriately denormalize data, so I’m preparing to do just that. However, as I’ve been studying the Aggregation Pipeline (shout out to the MongoDB University course! :smiling_face_with_three_hearts: ) I’ve also been using it to create Views of data drawn from across a bunch of collections. These Views appear sufficient to provide the (read-only) data the frontend needs to display for reports/large tables/etc…
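
For anyone curious what I mean, here’s a rough sketch of the kind of pipeline I’ve been saving as a View (the collection and field names here are made up for illustration, not my actual schema):

```javascript
// Hypothetical pipeline joining a normalized "orders" collection to
// its referenced "customers" collection. In mongosh you'd save it as
// a view with: db.createView("orderReport", "orders", orderReportPipeline)
const orderReportPipeline = [
  {
    $lookup: {
      from: "customers",       // the referenced collection
      localField: "customerId",
      foreignField: "_id",
      as: "customer",
    },
  },
  { $unwind: "$customer" },    // each order references exactly one customer
  {
    $project: {
      orderDate: 1,
      total: 1,
      "customer.name": 1,      // flatten just the fields the report needs
      "customer.email": 1,
    },
  },
];
```

The frontend then queries the view like any read-only collection.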

My question: Does it make sense to take the time/effort to maintain a large & highly-denormalized database or can I get by on a few key denormalized fields + some well-aggregated Views?

For the Record: I’m using MongooseJS with ObjectID refs everywhere & more-or-less one schema Model per collection. That architecture has allowed me to use Populate when pulling small-to-medium batches of [normalized] data, easy peasy. The catch is that Populate isn’t so great for pulling huge swaths of multi-level referenced data (& MongooseJS never promised as much/I’m not pointing fingers :sweat_smile:), so that’s why I’ve been considering denormalization, despite the notion that implementing & maintaining it could be a hassle. Then gaining a hint of Aggregation Pipeline proficiency got me curious about the effectiveness of using Views, especially for read-only data.

I understand this topic could potentially live in Drivers & ODMs but I’m not here to ask about MongooseJS or debate Populate’s effectiveness, etc. Mostly I’m curious if anyone has any input and/or if there are known pitfalls re creating a large handful of views to make querying a highly-normalized MongoDB considerably faster/more efficient.

Thank you in advance for any responses!

Hi @Ches_Calloway ,

Welcome back to the MongoDB community.

Please bear in mind that a view is ultimately just a saved pipeline query, so every query against it re-executes the full pipeline.

You also can’t index views.

What MongoDB offers is a way to persist a view’s results using $out or $merge, and to keep them up to date by remerging the data.

This will allow you to persist and index the denormalized data for your application queries and significantly improve access to that data. The cost is extra disk space and the maintenance work of keeping the duplicated data consistent…
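
As a sketch of that pattern (collection and field names are hypothetical), the reporting pipeline simply ends in a $merge stage, so its results land in a real collection that can be indexed; re-running the pipeline remerges fresh data over the stale rows:

```javascript
// Hypothetical "on-demand materialized view": aggregate the normalized
// source and upsert the results into a persisted target collection.
const refreshReportPipeline = [
  // summarize each customer's spend from the normalized orders
  { $group: { _id: "$customerId", totalSpent: { $sum: "$total" } } },
  {
    $merge: {
      into: "customerTotals",    // persisted target collection
      on: "_id",                 // match existing docs on _id
      whenMatched: "replace",    // overwrite stale results
      whenNotMatched: "insert",  // add newly-seen customers
    },
  },
];
// In mongosh:  db.orders.aggregate(refreshReportPipeline)
// The target is a normal collection, so it can be indexed:
//   db.customerTotals.createIndex({ totalSpent: -1 })
```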

Hopefully that helps
Thanks
Pavel


Hi @Pavel_Duchovny ,

Thank you for your response; I hadn’t considered that Views can’t take advantage of indexing. That reason alone is enough to take the option off the table :upside_down_face:

Regarding $out or $merge: how do I ensure that the data in the target collection stays updated? Surely the aggregation pipeline must be rerun every time the source collection (or collections, if we take any $lookup stages into account) is updated?

Regarding running the aggregation pipeline: is it possible to create a listener in MongoDB itself to automate this process, or do I have to trigger it by manually calling a function (from my application logic, via cron, etc.)? I see “Atlas Triggers” as an option from my Atlas Cluster view, but is that my only option? My lack of knowledge makes me think there is likely some obvious way of which I’m unaware.

Thank you again for your assistance, it has already proved invaluable! :pray: :raised_hands:

Best,
Ches Calloway

Hi @Ches_Calloway ,

You will need to automate the process yourself.

Atlas Triggers are definitely a good option.
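
Roughly, a scheduled trigger runs a function on a cron schedule that re-runs the $merge pipeline. A minimal sketch (the data source, database, and collection names below are placeholders, not anything from your setup):

```javascript
// Hypothetical body of a scheduled Atlas Trigger function. In the Atlas
// Functions editor this function would be assigned to `exports`, and the
// trigger's schedule (e.g. every minute) decides how often it runs.
const mergePipeline = [
  { $group: { _id: "$customerId", totalSpent: { $sum: "$total" } } },
  {
    $merge: {
      into: "customerTotals",    // persisted, indexable target
      whenMatched: "replace",
      whenNotMatched: "insert",
    },
  },
];

const refreshMaterializedView = function () {
  // "mongodb-atlas" is the default name of the linked data source;
  // `context` is provided by the Atlas Functions runtime.
  const orders = context.services
    .get("mongodb-atlas")
    .db("reporting")             // placeholder database name
    .collection("orders");       // placeholder source collection

  // Drive the pipeline to completion so $merge actually writes.
  return orders.aggregate(mergePipeline).toArray();
};
```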

I have written a blog post to help with this:

Thanks
Pavel


Hello again @Pavel_Duchovny ,

Thank you once again for your response! Other than splitting where my application logic lives, are there any known drawbacks to using these triggers? In particular I’m curious about costs. Searching the MongoDB docs, I only see references to Triggers > Billing in relation to a Realm application. Cost is a concern given the high frequency with which I’d need to run the aggregation.

I also see in your blog post (which is great, by the way!) that your cron is set to run every 5 minutes. I would likely have to run mine every 30 seconds or 1 minute due to how quickly users would expect updates to show in various front-end views.

Thank you again and who do we need to talk to in order to get you on the MongoDB Podcast to talk about Automatic Refresh of Materialized Views?

Best,
Ches Calloway

Hi @Ches_Calloway ,

Atlas Triggers are based on a Realm application created behind the scenes, so the same billing as any other Realm app applies.

Running every minute or so is OK as long as you believe your $merge will finish in under a minute :slight_smile:

Regarding the podcast, I can raise it with the relevant people. Thanks for the kind words :wink:

Thanks
Pavel