7 Things I Learned While Modeling Data for YouTube Stats
Lauren Schaefer
Published Jan 10, 2022 • Updated Sep 23, 2022
Mark Smith, Maxime Beugnet, and I recently embarked on a project to automatically retrieve daily stats about videos on the MongoDB YouTube channel. Our management team had been painfully pulling these stats every month in a complicated spreadsheet. In an effort to win brownie points with our management team and get in a little programming time, we worked together as a team of three over two weeks to rapidly develop an app that pulls daily stats from the YouTube API, stores them in a MongoDB Atlas database, and displays them in a MongoDB Charts dashboard.
Mark, Max, and I each owned a piece of the project. Mark handled the OAuth authentication, Max created the charts in the dashboard, and I was responsible for figuring out how to retrieve and store the YouTube stats.
In this post, I'll share seven things I learned while modeling the data for this app. But, before I jump into what I learned, I'll share a bit of context about how I modeled the data.
If you prefer to watch a video instead of reading text, look no further.
To learn more about what we built and why we built it the way we did, check out the recording of the Twitch stream below where Mark, Max, and I shared about our app.
If you'd like the video version of this article, check out the live stream Mark, Max, and I hosted. We received some fantastic questions from the audience, so you'll discover some interesting nuggets in the recording.
If you'd prefer a more concise video that only covers the contents of this article, check out the recording below.
Our project had a tight two-week deadline, so we made quick decisions in our effort to rapidly develop a minimum viable product. When we began, we didn't even know how we wanted to display the data, which made modeling the data even more challenging.
I ended up creating two collections:
youtube_videos: stores metadata about each of the videos on the MongoDB YouTube channel.
youtube_stats: stores daily YouTube stats (bucketed by month) about every video in the youtube_videos collection.
Every day, a scheduled trigger calls a Realm serverless function that is responsible for calling the YouTube PlaylistItems API. This API returns metadata about all of the videos on the MongoDB YouTube channel. The metadata is stored in the youtube_videos collection. Below is a document from the youtube_videos collection (some of the information is redacted):
Every day, another trigger calls a Realm serverless function that is responsible for calling the YouTube Reports API. The stats that this API returns are stored in the youtube_stats collection. Below is a document from the collection (some of the stats are removed to keep the document short):
To be clear, I'm not saying this was the best way to model our data; this is the data model we ended up with after two weeks of rapid development. I'll discuss some of the pros and cons of our data model throughout the rest of this post.
If you'd like to take a peek at our code and learn more about our app, visit https://github.com/mongodb-developer/dream.
Without further ado, let's jump into the seven things I learned while rapidly modeling YouTube data.
One of the rules of thumb when modeling data for MongoDB is data that is accessed together should be stored together. We teach developers that duplicating data is OK, especially if you won't be updating it often.
When I began figuring out how I was going to use the YouTube API and what data I could retrieve, I realized I would need to make two API calls: one to retrieve a list of videos with all of their metadata and another to retrieve the stats for those videos. For ease of development, I decided to store the information from those two API calls in separate collections.
I wasn't sure what data was going to need to be displayed alongside the stats (put another way, I wasn't sure what data was going to be accessed together), so I duplicated none of the data. I knew that if I were to duplicate the data, I would need to maintain the consistency of that duplicate data. And, to be completely honest, maintaining duplicate data was a little scary given the time crunch we were under and the lack of a software development process.
In the current data model, I can easily gather stats about likes, dislikes, views, etc., for a given video ID, but I will have to use $lookup to join the data with the youtube_videos collection in order to tell you anything more. Even something that seems relatively simple, like listing the video's name alongside the stats, requires the use of $lookup. The $lookup operation required to join the data in the two collections isn't that complicated, but best practices suggest limiting $lookup as these operations can negatively impact performance.
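To make the join concrete, here is a minimal sketch of an aggregation pipeline that lists each video's title next to its stats. The collection names come from this post, but the field names (videoId, title, stats) are assumptions for illustration and may not match the fields in our actual documents.

```javascript
// Hypothetical field names: `videoId` on the stats documents is assumed to
// match `_id` on the video metadata documents.
const statsWithTitlesPipeline = [
  {
    // Join each stats document with its matching video metadata document.
    $lookup: {
      from: "youtube_videos",
      localField: "videoId",
      foreignField: "_id",
      as: "video",
    },
  },
  // $lookup returns an array; flatten it to a single embedded document.
  { $unwind: "$video" },
  // Keep the bucket labels and stats, plus only the title from the join.
  { $project: { year: 1, month: 1, stats: 1, title: "$video.title" } },
];

// With a connected driver or in mongosh, this would run as something like:
//   db.youtube_stats.aggregate(statsWithTitlesPipeline)
console.log(JSON.stringify(statsWithTitlesPipeline, null, 2));
```

Every chart or query that needs the title has to repeat the first two stages, which is the cost that storing the title alongside the stats would remove.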
While we were developing our minimum viable product, I weighed the ease of development by avoiding data duplication against the potential performance impact of splitting our data. Ease of development won.
Now that I know I need information like the video's name and publication date with the stats, I can implement the Extended Reference Pattern. I can duplicate some of the information from the youtube_videos collection in the youtube_stats collection. Then, I can create an Atlas trigger that will watch for changes in the youtube_videos collection and automatically push those changes to the youtube_stats collection. (Note that if I were using a self-hosted database instead of an Atlas-hosted database, I could use a change stream instead of an Atlas trigger to ensure the data remained consistent.)
Duplicating data isn't as scary when (1) you are confident which data needs to be duplicated and (2) you use Atlas triggers or change streams to make sure the data remains consistent.
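As a rough sketch of what such a trigger or change stream handler might do, the helper below turns a change event from the youtube_videos collection into the update that would refresh the duplicated fields in youtube_stats. This is not code from our app: the function name and the videoId, title, and publishedAt fields are hypothetical, and it assumes the event includes the full document (for updates, that requires enabling full document lookup).

```javascript
// Builds the filter and update needed to copy a video's title and
// publication date into every stats bucket for that video.
// `videoId`, `title`, and `publishedAt` are hypothetical field names.
function buildExtendedReferenceUpdate(changeEvent) {
  // Only inserts, updates, and replaces carry a document worth copying.
  const handled = ["insert", "update", "replace"];
  if (!handled.includes(changeEvent.operationType) || !changeEvent.fullDocument) {
    return null;
  }
  const video = changeEvent.fullDocument;
  return {
    filter: { videoId: changeEvent.documentKey._id },
    update: { $set: { title: video.title, publishedAt: video.publishedAt } },
  };
}

// Example change event, shaped like the events MongoDB change streams emit.
const sampleEvent = {
  operationType: "update",
  documentKey: { _id: "abc123" },
  fullDocument: {
    _id: "abc123",
    title: "Intro to Data Modeling",
    publishedAt: new Date("2021-01-15"),
  },
};

const extendedRefOp = buildExtendedReferenceUpdate(sampleEvent);
console.log(extendedRefOp);
// Inside a trigger function or change stream handler, the result would be
// passed to something like:
//   statsCollection.updateMany(extendedRefOp.filter, extendedRefOp.update)
```

Keeping the event-to-update logic in a plain function makes it easy to test without a database connection.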
I love schema design patterns (check out this blog series or this free MongoDB University course to learn more) and schema design anti-patterns (check out this blog series or this YouTube video series to learn more).
When I was deciding how to store the daily YouTube stats, I realized I had time-series data. I knew the Bucket Pattern was useful for time-series data, so I decided to implement that pattern. I decided to create a bucket of stats for a certain timeframe and store all of the stats for that timeframe for a single video in a document.
I wasn't sure how big my buckets should be. I knew I didn't want to fall into the trap of the Massive Arrays Anti-Pattern, so I didn't want my buckets to be too large. In the spirit of moving quickly, I decided a month was a good bucket size and figured I could adjust as needed.
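One common way to fill a monthly bucket is an upsert that pushes each day's stats into an array on the bucket document. The sketch below illustrates that idea rather than reproducing our function: year and month are the bucket fields from our data model, while the helper name, videoId, stats, and the shape of each array entry are assumptions.

```javascript
// Builds an upsert that appends one day's stats to the monthly bucket for a
// video. The `videoId` and `stats` field names and the shape of each array
// entry are hypothetical; `year` and `month` label the bucket.
function buildBucketUpsert(videoId, day, dailyStats) {
  return {
    filter: {
      videoId: videoId,
      year: day.getUTCFullYear(),
      month: day.getUTCMonth() + 1, // getUTCMonth() is zero-based
    },
    update: { $push: { stats: { date: day, ...dailyStats } } },
    options: { upsert: true }, // create the bucket on the first day of a month
  };
}

const bucketOp = buildBucketUpsert("abc123", new Date("2021-01-15"), {
  views: 42,
  likes: 3,
});
console.log(JSON.stringify(bucketOp, null, 2));
// With a driver, this would map onto:
//   collection.updateOne(bucketOp.filter, bucketOp.update, bucketOp.options)
```

With monthly buckets, each stats array holds at most 31 entries, which keeps it well away from the Massive Arrays Anti-Pattern.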
The buckets turned out to be really handy during development as I could easily see all of the stats for a video for a given month to ensure they were being pulled correctly.
However, the buckets didn't end up helping my teammates and me much in our app. We didn't have so much data that we were worried about reducing our index sizes. We didn't implement the Computed Pattern to pre-compute monthly stats. And we didn't run queries that benefited from having the data grouped by month.
Looking back, creating a document for every video every day would have been fine. We didn't benefit from any of the advantages of the Bucket Pattern. If our requirements were to change, we certainly could benefit from the Bucket Pattern. However, in this case, I added the complexity of grouping the stats into buckets but didn't get the benefits, so it wasn't really worth it.
As I described in the previous section, I decided to bucket my YouTube video stats by month. I needed a way to indicate the date range for each bucket, so each document contains a field named year and a field named month. Both fields store values of type long. For example, a document for the month of January 2021 would have "year": 2021 and "month": 1.
My thinking was that we might want to compare months from multiple years (for example, we could compare stats in January for 2019, 2020, and 2021), and this data model would allow us to do that.
Another option would have been to use a single field of type date to indicate the date range. For example, for the month of January, I could have set "date": new Date("2021-01"). This would allow me to perform date-based calculations in my queries.
As with all data modeling considerations in MongoDB, the best option comes down to your use case and how you will query the data. Use a field of type date for date-based buckets if you want to query using dates.
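Here is a small sketch of how a date-labeled bucket could be created and queried. The helper names are hypothetical, and the date field name follows the example above.

```javascript
// Returns the first instant (UTC) of the given month, usable as a bucket label.
function bucketDate(year, month) {
  return new Date(Date.UTC(year, month - 1, 1)); // Date.UTC months are zero-based
}

// Builds a filter matching every bucket from `start` up to, but not
// including, `end`.
function bucketRangeFilter(start, end) {
  return { date: { $gte: start, $lt: end } };
}

const jan2021 = bucketDate(2021, 1);
console.log(jan2021.toISOString()); // 2021-01-01T00:00:00.000Z

// All buckets in calendar year 2021:
console.log(bucketRangeFilter(bucketDate(2021, 1), bucketDate(2022, 1)));
```

A single range filter like this is harder to express with separate year and month fields once the range crosses a year boundary, while comparing the same month across years is simpler with the separate fields.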
As I mentioned toward the beginning of this post, I was responsible for retrieving and storing the YouTube data. My teammate Max was responsible for creating the charts to visualize the data.
I didn't pay too much attention to how the data I was getting from the API was formatted—I just dumped it into the database. (Have I mentioned that we were working as fast as we could?)
As Max began building the charts, he raised a few concerns about the way the data was formatted. The date the video was published was being stored as a string instead of a date. Also, the month and year were being stored as string instead of long.
Max was able to do type conversions in MongoDB Charts, but ultimately, we wanted to store the data in a way that would be easy to use whether we were visualizing the data in Charts or querying the data using the MongoDB Query Language (MQL).
The fixes were simple. After retrieving the data from the API, I converted the data to the ideal type before sending it to the database. Take a look at line 37 of my function if you'd like to see an example of how I did this.
If you're pulling data from an API, consider if it's worth remodeling or reformatting the data before storing it. It's a small thing that could make your and your teammates' jobs much easier in the future.
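The conversion step can be as small as the sketch below. It is an illustration rather than the code from our function, and the helper name and input field names (videoId, publishedAt, year, month) are assumptions.

```javascript
// Converts string values from a hypothetical API response into the types we
// want to store. The input field names are assumptions for illustration.
function cleanApiRecord(raw) {
  const publishedAt = new Date(raw.publishedAt);
  if (Number.isNaN(publishedAt.getTime())) {
    throw new Error(`Unparseable date: ${raw.publishedAt}`);
  }
  return {
    ...raw,
    publishedAt, // string -> Date
    year: parseInt(raw.year, 10), // string -> number
    month: parseInt(raw.month, 10), // string -> number
  };
}

const cleaned = cleanApiRecord({
  videoId: "abc123",
  publishedAt: "2021-01-15T17:00:00Z",
  year: "2021",
  month: "01",
});
console.log(cleaned);
```

Note that a plain JavaScript number is not necessarily stored as a long. Depending on the driver or runtime, you may need to wrap the value in an explicit 64-bit integer type before inserting it.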
OK, yes, this is kind of obvious.
Allow me to elaborate.
As we began working on our application, we knew that we wanted to visually display YouTube stats on a dashboard. But we didn't know what stats we would be able to pull from the API or how we would want to visualize the data. Our approach was to put the data in the database and then figure it out.
As I modeled our data, I didn't know what our final use case would be—I didn't know how the data would be accessed. So, instead of following the rule of thumb that data that is accessed together should be stored together, I modeled the data in the way that was easiest for me to work with while retrieving and storing the data.
One of the nice things about using MongoDB is that you have a lot of flexibility in your schema, so you can make changes as requirements develop and change. (The Schema Versioning Pattern provides a pattern for how to do this successfully.)
As Max was showing off how he created our charts, I learned that he created an aggregation pipeline inside of Charts that calculates the fiscal year quarter (for example, January of 2021 is in Q4 of Fiscal Year 2021) and adds it to each document in the youtube_stats collection. Several of our charts group the data by quarter, so we need this field.
I was pretty impressed with the aggregation pipeline Max built to calculate the fiscal year. However, if I had known that calculating the quarter was one of our requirements when I was modeling the data, I could have calculated the fiscal year quarter and stored it inside of the youtube_stats collection so that any chart or query could leverage it. If I had gone this route, I would have been using the Computed Pattern.
Now that I know we have a requirement to display the fiscal year quarter, I can write a script to add the fiscal_year_quarter field to the existing documents. I could also update the function that creates new documents in the youtube_stats collection to calculate the fiscal year quarter and store it in new documents.
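The calculation itself could look something like the sketch below. It assumes the fiscal year starts on February 1 and is named for the calendar year in which it ends. That assumption is consistent with the one example given in this post (January 2021 falls in Q4 of Fiscal Year 2021), but it is still an assumption, and the FY2021Q4 label format is made up for illustration.

```javascript
// Computes a fiscal year quarter label from a calendar year and month (1-12).
// ASSUMPTION: the fiscal year starts on February 1 and is named for the
// calendar year in which it ends. Adjust the constant for your own calendar.
const FISCAL_YEAR_START_MONTH = 2;

function fiscalYearQuarter(year, month) {
  // Months elapsed since the start of the fiscal year (0-11).
  const monthsIn = (month - FISCAL_YEAR_START_MONTH + 12) % 12;
  const quarter = Math.floor(monthsIn / 3) + 1;
  // Months on or after the start month belong to the fiscal year that ends
  // in the next calendar year (unless the fiscal year starts in January).
  const rollsOver = FISCAL_YEAR_START_MONTH > 1 && month >= FISCAL_YEAR_START_MONTH;
  const fiscalYear = rollsOver ? year + 1 : year;
  return `FY${fiscalYear}Q${quarter}`;
}

console.log(fiscalYearQuarter(2021, 1)); // FY2021Q4
console.log(fiscalYearQuarter(2021, 2)); // FY2022Q1
```

Because each bucket already stores its year and month, a backfill script could compute the value once per document and write it to the fiscal_year_quarter field.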
Modeling data in MongoDB is all about your use case. When you don't know what your use case is, modeling data becomes a guessing game. Remember that it's OK if your requirements change; MongoDB's flexible schema allows you to update your data model as needed.
I confess that I've told developers who are new to using MongoDB this very thing: There is no "right way" to model your data. Two applications that utilize the same data may have different ideal data models based on how the applications use the data.
However, the perfectionist in me went a little crazy as I modeled the data for this app. In more than one of our team meetings, I told Mark and Max that I didn't love the data model I had created. I didn't feel like I was getting it "right."
As I mentioned above, the problem was that I didn't know the use case that I was optimizing for as I was developing the data model. I was making guesses and feeling uncomfortable. Because I was using a non-relational database, I couldn't just normalize the data systematically and claim I had modeled the data correctly.
The flexibility of MongoDB gives you so much power but can also leave you wondering if you have arrived at the ideal data model. You may find, as I did, that you may need to revisit your data model as your requirements become more clear or change. And that's OK.
(Don't let the flexibility of MongoDB's schema freak you out. You can use MongoDB's schema validation when you are ready to lock down part or all of your schema.)
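For example, a validator for the stats collection might look like the sketch below. Only the year and month fields (of type long) come from this post. The videoId and stats fields are hypothetical, and this is not a validator we actually deployed.

```javascript
// A sketch of a $jsonSchema validator for the stats collection. Only `year`
// and `month` (type long) come from this post; `videoId` and `stats` are
// hypothetical field names.
const youtubeStatsValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["videoId", "year", "month"],
    properties: {
      videoId: { bsonType: "string" },
      year: { bsonType: "long" },
      month: { bsonType: "long", minimum: 1, maximum: 12 },
      stats: { bsonType: "array", items: { bsonType: "object" } },
    },
  },
};

// In mongosh, the validator could be attached to an existing collection with:
//   db.runCommand({ collMod: "youtube_stats", validator: youtubeStatsValidator })
console.log(JSON.stringify(youtubeStatsValidator, null, 2));
```

Listing only a few required fields locks down the parts of the schema you are sure about while leaving the rest flexible.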
Building on the previous thing I learned (that there is no "right way" to model your data), data models can likely always be improved. As you identify what your queries will be or your queries change, you will likely find new ways you can optimize your data model.
The question becomes, "When is your data model good enough?" The perfectionist in me struggled with this question. Should I continue optimizing? Or is the data model we have good enough for our requirements?
To answer this question, I found myself asking two more questions:
- Are my teammates and I able to easily work with the data?
- Is our app's performance good enough?
The answers to the questions can be a bit subjective, especially if you don't have hard performance requirements, like a web page must load in X milliseconds.
In our case, we did not define any performance requirements. Our front end is currently a Charts dashboard. So, I wondered, "Is our dashboard loading quickly enough?" And the answer is yes: Our dashboard loads pretty quickly. Charts utilizes caching with a default one-hour refresh to ensure the charts load quickly. Once a user loads the dashboard in their browser, the charts remain displayed—even while waiting for the charts to get the latest data when the cache is refreshed.
If your developers are able to easily work with the data and your app's performance is good enough, your data model is probably good enough.
Every time I work with MongoDB, I learn something new. In the process of working with a team to rapidly build an app, I learned a lot about data modeling in MongoDB.
If you're interested in learning more about data modeling, I highly recommend the following resources:
- Free MongoDB University Course: M320: Data Modeling
- Blog Series: MongoDB Schema Design Patterns
- YouTube Video Series: MongoDB Schema Design Anti-Patterns
- Blog Series: MongoDB Schema Design Anti-Patterns
Remember, every use case is different, so every data model will be different. Focus on how you will be using the data.
If you have any questions about data modeling, I encourage you to join the MongoDB Community. It's a great place to ask questions. MongoDB employees and community members are there every day to answer questions and share their experiences. I hope to see you there!