7 Things I Learned While Modeling Data for YouTube Stats

Published: Feb 02, 2021

  • MongoDB
  • Charts
  • Realm

By Lauren Schaefer

Mark Smith, Maxime Beugnet, and I recently embarked on a project to automatically retrieve daily stats about videos on the MongoDB YouTube channel. Our management team had been painfully pulling these stats every month in a complicated spreadsheet. In an effort to win brownie points with our management team and get in a little programming time, we worked together as a team of three over two weeks to rapidly develop an app that pulls daily stats from the YouTube API, stores them in a MongoDB Atlas database, and displays them in a MongoDB Charts dashboard.

Screenshot of the Charts dashboard. Charts include Views / Videos, Subscribers Gained & Lost / Videos, Likes & Dislikes / Videos, and Minutes Watched / Videos
Screenshot of the MongoDB Charts dashboard that contains charts about the videos our team has posted on YouTube

Mark, Max, and I each owned a piece of the project. Mark handled the OAuth authentication, Max created the charts in the dashboard, and I was responsible for figuring out how to retrieve and store the YouTube stats.

To learn more about what we built and why we built it the way we did, check out the recording of the Twitch stream where Mark, Max, and I talked about our app.

In this post, I'll share seven things I learned while modeling the data for this app. But, before I jump into what I learned, I'll share a bit of context about how I modeled the data.

#Our Data Model

Our project had a tight two-week deadline, so we made quick decisions in our effort to rapidly develop a minimum viable product. When we began, we didn't even know how we wanted to display the data, which made modeling the data even more challenging.

I ended up creating two collections:

  • youtube_videos: stores metadata about each of the videos on the MongoDB YouTube channel.
  • youtube_stats: stores daily YouTube stats (bucketed by month) about every video in the youtube_videos collection.

Every day, a scheduled trigger calls a Realm serverless function that is responsible for calling the YouTube PlaylistItems API. This API returns metadata about all of the videos on the MongoDB YouTube channel. The metadata is stored in the youtube_videos collection. Below is a document from the youtube_videos collection (some of the information is redacted):

{
  "_id": "8CZs-0it9r4",
  "kind": "youtube#playlistItem",
  "isDA": true,
  ...
  "snippet": {
    "publishedAt": 2020-09-30T15:05:30.000+00:00,
    "channelId": "UCK_m2976Yvbx-TyDLw7n1WA",
    "title": "Schema Design Anti-Patterns - Part 1",
    "description": "When modeling your data in MongoDB...",
    "thumbnails": {
      ...
    },
    "channelTitle": "MongoDB",
    ...
  }
}

Every day, another trigger calls a Realm serverless function that is responsible for calling the YouTube Reports API. The stats that this API returns are stored in the youtube_stats collection. Below is a document from the collection (some of the stats are removed to keep the document short):

{
  "_id": "8CZs-0it9r4_2020_12",
  "month": 12,
  "year": 2020,
  "videoId": "8CZs-0it9r4",
  "stats": [
    {
      "date": 2020-12-01T00:00:00.000+00:00,
      "views": 21,
      "likes": 1
      ...
    },
    {
      "date": 2020-12-02T00:00:00.000+00:00,
      "views": 29,
      "likes": 1
      ...
    },
    ...
    {
      "date": 2020-12-31T00:00:00.000+00:00,
      "views": 17,
      "likes": 0
      ...
    }
  ]
}
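
The _id in the document above looks like the video ID, year, and month joined with underscores. Assuming that convention (and no zero-padding of the month, which is a guess), one day's stats could be appended to the matching monthly bucket with an upsert. The sketch below is in Python and is not the code from our app. The helper name build_bucket_upsert is hypothetical, and it only builds the filter and update documents that a driver call such as update_one(filter, update, upsert=True) would receive.

```python
from datetime import datetime, timezone


def build_bucket_upsert(video_id, day, daily_stats):
    """Return (filter, update) documents that append one day's stats to a monthly bucket.

    Hypothetical helper: the "<videoId>_<year>_<month>" _id format is inferred
    from the sample document, not confirmed by the app's source code.
    """
    bucket_id = f"{video_id}_{day.year}_{day.month}"
    filter_doc = {"_id": bucket_id}
    update_doc = {
        # Only written when the upsert creates a new bucket document.
        "$setOnInsert": {"videoId": video_id, "year": day.year, "month": day.month},
        # Always append the day's stats to the bucket's array.
        "$push": {"stats": {"date": day, **daily_stats}},
    }
    return filter_doc, update_doc


day = datetime(2020, 12, 1, tzinfo=timezone.utc)
filter_doc, update_doc = build_bucket_upsert("8CZs-0it9r4", day, {"views": 21, "likes": 1})
```

With an upsert, the first stat of a new month creates the bucket and every later stat for that month is pushed onto the same stats array.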

To be clear, I'm not saying this was the best way to model our data; this is the data model we ended up with after two weeks of rapid development. I'll discuss some of the pros and cons of our data model throughout the rest of this post.

If you'd like to take a peek at our code and learn more about our app, visit https://github.com/mongodb-developer/dream.

#What I Learned

Without further ado, let's jump into the seven things I learned while rapidly modeling YouTube data.

#1. Duplicating data is scary—even for those of us who have been coaching others to do so

One of the rules of thumb when modeling data for MongoDB is that data that is accessed together should be stored together. We teach developers that duplicating data is OK, especially if you won't be updating it often.

April screams and covers her eyes
Duplicating data can feel scary at first

When I began figuring out how I was going to use the YouTube API and what data I could retrieve, I realized I would need to make two API calls: one to retrieve a list of videos with all of their metadata and another to retrieve the stats for those videos. For ease of development, I decided to store the information from those two API calls in separate collections.

I wasn't sure what data was going to need to be displayed alongside the stats (put another way, I wasn't sure what data was going to be accessed together), so I duplicated none of the data. I knew that if I were to duplicate the data, I would need to maintain the consistency of that duplicate data. And, to be completely honest, maintaining duplicate data was a little scary given the time crunch we were under and the lack of software development process we were following.

In the current data model, I can easily gather stats about likes, dislikes, views, etc., for a given video ID, but I will have to use $lookup to join the data with the youtube_videos collection in order to tell you anything more. Even something that seems relatively simple, like listing the video's name alongside the stats, requires the use of $lookup. The $lookup operation required to join the data in the two collections isn't that complicated, but best practices suggest limiting $lookup, as these operations can negatively impact performance.
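
To make the join concrete, here is a minimal sketch of such an aggregation pipeline, written as a Python list of stages that could be passed to a driver's aggregate() call on youtube_stats. It relies on the videoId field in youtube_stats matching the _id field in youtube_videos, as in the sample documents above. The function name and the projected field names are my own choices for illustration.

```python
def stats_with_title_pipeline(video_id):
    """Build a pipeline that joins a video's monthly stats with its title.

    Illustrative only: the stage layout and output field names are assumptions.
    """
    return [
        # Keep only the buckets that belong to one video.
        {"$match": {"videoId": video_id}},
        # Join each bucket with its metadata document in youtube_videos.
        {
            "$lookup": {
                "from": "youtube_videos",
                "localField": "videoId",
                "foreignField": "_id",
                "as": "video",
            }
        },
        # $lookup produces an array; flatten it to a single embedded document.
        {"$unwind": "$video"},
        # Return the stats together with the video's title.
        {"$project": {"year": 1, "month": 1, "stats": 1, "title": "$video.snippet.title"}},
    ]


pipeline = stats_with_title_pipeline("8CZs-0it9r4")
```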

While we were developing our minimum viable product, I weighed the ease of development by avoiding data duplication against the potential performance impact of splitting our data. Ease of development won.

Now that I know I need information like the video's name and publication date with the stats, I can implement the Extended Reference Pattern. I can duplicate some of the information from the youtube_videos collection in the youtube_stats collection. Then, I can create an Atlas trigger that will watch for changes in the youtube_videos collection and automatically push those changes to the youtube_stats collection. (Note that if I were using a self-hosted database instead of an Atlas-hosted database, I could use a change stream instead of an Atlas trigger to ensure the data remained consistent.)
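
As a rough illustration of the Extended Reference Pattern described here, the sketch below builds the update that a trigger or change stream handler could apply to youtube_stats whenever a youtube_videos document changes. Everything in it is an assumption rather than code from our app: the function name, the embedded field name video, and the choice to copy only the title and publication date.

```python
def build_extended_reference_update(video_doc):
    """Return (filter, update) that copies selected video metadata into its stats buckets.

    Hypothetical helper; the embedded "video" field name is an assumption.
    """
    snippet = video_doc.get("snippet", {})
    # Every bucket for this video shares the same videoId.
    filter_doc = {"videoId": video_doc["_id"]}
    update_doc = {
        "$set": {
            "video": {
                "title": snippet.get("title"),
                "publishedAt": snippet.get("publishedAt"),
            }
        }
    }
    return filter_doc, update_doc


changed_video = {
    "_id": "8CZs-0it9r4",
    "snippet": {"title": "Schema Design Anti-Patterns - Part 1", "publishedAt": "2020-09-30T15:05:30Z"},
}
filter_doc, update_doc = build_extended_reference_update(changed_video)
```

The pair would typically be passed to an update_many call so that every monthly bucket for the video receives the same copy of the metadata.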

Duplicating data isn't as scary when (1) you are confident which data needs to be duplicated and (2) you use Atlas triggers or change streams to make sure the data remains consistent.

#2. Use the Bucket Pattern only when you will benefit from the buckets

I love schema design patterns (check out this blog series or this free MongoDB University course to learn more) and schema design anti-patterns (check out this blog series or this YouTube video series to learn more).

When I was deciding how to store the daily YouTube stats, I realized I had time-series data. I knew the Bucket Pattern was useful for time-series data, so I decided to implement that pattern. I decided to create a bucket of stats for a certain timeframe and store all of the stats for that timeframe for a single video in a document.

I wasn't sure how big my buckets should be. I knew I didn't want to fall into the trap of the Massive Arrays Anti-Pattern, so I didn't want my buckets to be too large. In the spirit of moving quickly, I decided a month was a good bucket size and figured I could adjust as needed.

Young girl dumps a bucket of water on her unsuspecting mom's head at the beach
How big should your bucket be? Big enough to startle your mom.

The buckets turned out to be really handy during development as I could easily see all of the stats for a video for a given month to ensure they were being pulled correctly.

However, the buckets didn't end up helping my teammates and me much in our app. We didn't have so much data that we were worried about reducing our index sizes. We didn't implement the Computed Pattern to pre-compute monthly stats. And we didn't run queries that benefited from having the data grouped by month.

Looking back, creating a document for every video every day would have been fine. We didn't benefit from any of the advantages of the Bucket Pattern. If our requirements were to change, we certainly could benefit from the Bucket Pattern. However, in this case, I added the complexity of grouping the stats into buckets but didn't get the benefits, so it wasn't really worth it.
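
For comparison, the unbucketed alternative (one document per video per day) can be derived from an existing bucket with a few lines of code. This is a minimal sketch, assuming the bucket shape shown earlier in this post; the function name is hypothetical, and the _id is left out so the database would generate one.

```python
from datetime import datetime, timezone


def bucket_to_daily_docs(bucket):
    """Split one monthly bucket into one document per day (illustrative helper)."""
    return [
        # Each daily document carries the video ID plus that day's stats.
        {"videoId": bucket["videoId"], **day}
        for day in bucket.get("stats", [])
    ]


bucket = {
    "_id": "8CZs-0it9r4_2020_12",
    "month": 12,
    "year": 2020,
    "videoId": "8CZs-0it9r4",
    "stats": [
        {"date": datetime(2020, 12, 1, tzinfo=timezone.utc), "views": 21, "likes": 1},
        {"date": datetime(2020, 12, 2, tzinfo=timezone.utc), "views": 29, "likes": 1},
    ],
}
daily_docs = bucket_to_daily_docs(bucket)
```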

#3. Use a date field to label date-based buckets

As I described in the previous section, I decided to bucket my YouTube video stats by month. I needed a way to indicate the date range for each bucket, so each document contains a field named year and a field named month. Both fields store values of type long. For example, a document for the month of January 2021 would have "year": 2021 and "month": 1.

Liza asks, "Is it a date?"
No, I wasn't storing date information as a date. But perhaps I should have.

My thinking was that we might want to compare months from multiple years (for example, we could compare stats in January for 2019, 2020, and 2021), and this data model would allow us to do that.

Another option would have been to use a single field of type date to indicate the date range. For example, for the month of January, I could have set "date": new Date("2021-01"). This would allow me to perform date-based calculations in my queries.
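
To show what a date-typed label makes possible, here is a small Python sketch. It assumes each bucket stores the first day of its month (in UTC) in a field named date, as in the example above. The helper names are hypothetical. The filter it builds selects every bucket in a range of months, including ranges that cross a year boundary, which is clumsy to express with separate year and month fields.

```python
from datetime import datetime, timezone


def bucket_start(year, month):
    """First instant of the given month in UTC, used as the bucket's date label."""
    return datetime(year, month, 1, tzinfo=timezone.utc)


def month_range_filter(first, last):
    """Query filter for buckets from `first` through `last`, both (year, month), inclusive."""
    last_year, last_month = last
    # The exclusive upper bound is the first day of the month after `last`.
    if last_month == 12:
        end = bucket_start(last_year + 1, 1)
    else:
        end = bucket_start(last_year, last_month + 1)
    return {"date": {"$gte": bucket_start(*first), "$lt": end}}


# Buckets for November 2020 through January 2021.
query = month_range_filter((2020, 11), (2021, 1))
```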

As with all data modeling considerations in MongoDB, the best option comes down to your use case and how you will query the data. Use a field of type date for date-based buckets if you want to query using dates.

#4. Cleaning data you receive from APIs will make working with the data easier

As I mentioned toward the beginning of this post, I was responsible for retrieving and storing the YouTube data. My teammate Max was responsible for creating the charts to visualize the data.

I didn't pay too much attention to how the data I was getting from the API was formatted—I just dumped it into the database. (Have I mentioned that we were working as fast as we could?)

Cement truck dumps cement into car
As long as the data is being dumped into the database, who cares what format it's in?

As Max began building the charts, he raised a few concerns about the way the data was formatted. The date the video was published was being stored as a string instead of a date. Also, the month and year were being stored as strings instead of longs.

Max was able to do type conversions in MongoDB Charts, but ultimately, we wanted to store the data in a way that would be easy to use whether we were visualizing the data in Charts or querying the data using the MongoDB Query Language (MQL).

The fixes were simple. After retrieving the data from the API, I converted the data to the ideal type before sending it to the database. Take a look at line 37 of my function if you'd like to see an example of how I did this.
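
The same kind of cleanup can be sketched in Python. This is not the code from my function; it is a minimal illustration that assumes the API delivers the publication date as an ISO 8601 string ending in "Z" and the month and year as numeric strings. The helper names are hypothetical.

```python
from datetime import datetime


def clean_video(raw):
    """Return a copy of a video document with publishedAt converted to a datetime."""
    cleaned = dict(raw)
    snippet = dict(raw.get("snippet", {}))
    published = snippet.get("publishedAt")
    if isinstance(published, str):
        # Older Python versions do not accept a trailing "Z", so spell out the UTC offset.
        snippet["publishedAt"] = datetime.fromisoformat(published.replace("Z", "+00:00"))
    cleaned["snippet"] = snippet
    return cleaned


def clean_bucket_label(year, month):
    """Convert year and month values that arrive as strings into integers."""
    return {"year": int(year), "month": int(month)}


video = clean_video({"_id": "8CZs-0it9r4", "snippet": {"publishedAt": "2020-09-30T15:05:30Z"}})
label = clean_bucket_label("2020", "12")
```

Converting once, at write time, means Charts, MQL queries, and any future consumer all see the same types.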

If you're pulling data from an API, consider if it's worth remodeling or reformatting the data before storing it. It's a small thing that could make your and your teammates' jobs much easier in the future.

#5. Optimizing for your use case is really hard when you don't fully know what your use case will be

OK, yes, this is kind of obvious.

Teenager points at herself and says, "Duh!"

Allow me to elaborate.

As we began working on our application, we knew that we wanted to visually display YouTube stats on a dashboard. But we didn't know what stats we would be able to pull from the API or how we would want to visualize the data. Our approach was to put the data in the database and then figure it out.

As I modeled our data, I didn't know what our final use case would be—I didn't know how the data would be accessed. So, instead of following the rule of thumb that data that is accessed together should be stored together, I modeled the data in the way that was easiest for me to work with while retrieving and storing the data.

One of the nice things about using MongoDB is that you have a lot of flexibility in your schema, so you can make changes as requirements develop and change. (The Schema Versioning Pattern provides a pattern for how to do this successfully.)

As Max was showing off how he created our charts, I learned that he created an aggregation pipeline inside of Charts that calculates the fiscal year quarter (for example, January of 2021 is in Q4 of Fiscal Year 2021) and adds it to each document in the youtube_stats collection. Several of our charts group the data by quarter, so we need this field.

I was pretty impressed with the aggregation pipeline Max built to calculate the fiscal year. However, if I had known that calculating the quarter was one of our requirements when I was modeling the data, I could have calculated the fiscal year quarter and stored it inside of the youtube_stats collection so that any chart or query could leverage it. If I had gone this route, I would have been using the Computed Pattern.

Now that I know we have a requirement to display the fiscal year quarter, I can write a script to add the fiscal_year_quarter field to the existing documents. I could also update the function that creates new documents in the youtube_stats collection to calculate the fiscal year quarter and store it in new documents.
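
Here is a rough sketch of how that calculation could look in Python. The only fact it relies on from this post is that January of 2021 falls in Q4 of Fiscal Year 2021. From that single example it assumes the fiscal year starts on February 1 and is named after the calendar year in which it ends, so adjust the arithmetic if your fiscal calendar differs. The function names and the shape of the stored fiscal_year_quarter value are hypothetical.

```python
def fiscal_year_quarter(year, month):
    """Return (fiscal_year, quarter), assuming the fiscal year runs February through January."""
    # February is month 0 of the fiscal year; January is month 11.
    months_into_fiscal_year = (month - 2) % 12
    quarter = months_into_fiscal_year // 3 + 1
    # January closes the fiscal year named after the current calendar year.
    fiscal_year = year if month == 1 else year + 1
    return fiscal_year, quarter


def with_fiscal_year_quarter(bucket):
    """Return a copy of a monthly bucket with a computed fiscal_year_quarter field."""
    fiscal_year, quarter = fiscal_year_quarter(bucket["year"], bucket["month"])
    return {**bucket, "fiscal_year_quarter": {"fiscal_year": fiscal_year, "quarter": quarter}}


updated = with_fiscal_year_quarter({"videoId": "8CZs-0it9r4", "year": 2021, "month": 1})
```

Storing the result on each bucket is the Computed Pattern in miniature: the value is calculated once when the document is written instead of every time a chart is rendered.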

Modeling data in MongoDB is all about your use case. When you don't know what your use case is, modeling data becomes a guessing game. Remember that it's OK if your requirements change; MongoDB's flexible schema allows you to update your data model as needed.

#6. There is no "right way" to model your data

I confess that I've told developers who are new to using MongoDB this very thing: There is no "right way" to model your data. Two applications that utilize the same data may have different ideal data models based on how the applications use the data.

However, the perfectionist in me went a little crazy as I modeled the data for this app. In more than one of our team meetings, I told Mark and Max that I didn't love the data model I had created. I didn't feel like I was getting it "right."

Jimmy Fallon says, "Perfect," dabs tears from his eye, and says "Wow"
I just want my data model to be perfect. Is that too much to ask?

As I mentioned above, the problem was that I didn't know the use case that I was optimizing for as I was developing the data model. I was making guesses and feeling uncomfortable. Because I was using a non-relational database, I couldn't just normalize the data systematically and claim I had modeled the data correctly.

The flexibility of MongoDB gives you so much power but can also leave you wondering if you have arrived at the ideal data model. You may find, as I did, that you need to revisit your data model as your requirements become clearer or change. And that's OK.

(Don't let the flexibility of MongoDB's schema freak you out. You can use MongoDB's schema validation when you are ready to lock down part or all of your schema.)
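
As an illustration of what locking down part of the schema could look like, here is a sketch of a $jsonSchema validator for the youtube_stats collection, written as a Python dictionary. The set of required fields and the bounds are my assumptions based on the sample document earlier in this post, not rules our app enforces.

```python
# Hypothetical validator for youtube_stats; the field list mirrors the sample document.
youtube_stats_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["videoId", "year", "month", "stats"],
        "properties": {
            "videoId": {"bsonType": "string"},
            "year": {"bsonType": ["int", "long"]},
            "month": {"bsonType": ["int", "long"], "minimum": 1, "maximum": 12},
            "stats": {
                "bsonType": "array",
                "items": {
                    "bsonType": "object",
                    # Every daily entry must carry a real date, not a string.
                    "required": ["date"],
                    "properties": {"date": {"bsonType": "date"}},
                },
            },
        },
    }
}
```

A validator like this can be attached to an existing collection with the collMod database command, or supplied when the collection is created.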

#7. Determine how much you want to tweak your data model based on the ease of working with the data and your performance requirements

Building on the previous lesson (that there is no "right way" to model your data), data models can almost always be improved. As you identify what your queries will be, or as your queries change, you will likely find new ways to optimize your data model.

The question becomes, "When is your data model good enough?" The perfectionist in me struggled with this question. Should I continue optimizing? Or is the data model we have good enough for our requirements?

To answer this question, I found myself asking two more questions:

  • Are my teammates and I able to easily work with the data?
  • Is our app's performance good enough?

The answers to the questions can be a bit subjective, especially if you don't have hard performance requirements, like a web page must load in X milliseconds.

In our case, we did not define any performance requirements. Our front end is currently a Charts dashboard. So, I wondered, "Is our dashboard loading quickly enough?" And the answer is yes: Our dashboard loads pretty quickly. Charts utilizes caching with a default one-hour refresh to ensure the charts load quickly. Once a user loads the dashboard in their browser, the charts remain displayed—even while waiting for the charts to get the latest data when the cache is refreshed.

If your developers are able to easily work with the data and your app's performance is good enough, your data model is probably good enough.

Ariana Grande says, "You are good enough."

#Summary

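Every time I work with MongoDB, I learn something new. In the process of working with a team to rapidly build an app, I learned a lot about data modeling in MongoDB:

  • Duplicating data is scary, even for those of us who have been coaching others to do so.
  • Use the Bucket Pattern only when you will benefit from the buckets.
  • Use a date field to label date-based buckets.
  • Cleaning data you receive from APIs will make working with the data easier.
  • Optimizing for your use case is really hard when you don't fully know what your use case will be.
  • There is no "right way" to model your data.
  • Determine how much you want to tweak your data model based on the ease of working with the data and your performance requirements.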

If you're interested in learning more about data modeling, I highly recommend the following resources:

  • M320: Data Modeling
  • MongoDB Schema Design Patterns
  • MongoDB Schema Design Anti-Patterns

Remember, every use case is different, so every data model will be different. Focus on how you will be using the data.

If you have any questions about data modeling, I encourage you to join the MongoDB Community. It's a great place to ask questions. MongoDB employees and community members are there every day to answer questions and share their experiences. I hope to see you there!