Entangled: A Story of Data Re-modeling and 10x Storage Reduction

Nenad Milosavljevic • 5 min read • Published Dec 14, 2023 • Updated Dec 14, 2023
Node.js, JavaScript, MongoDB
One of the most distinctive projects I've worked on is an application named Entangled. Developed in partnership with the Princeton Engineering Anomalies Research lab (PEAR), The Global Consciousness Project, and the Institute of Noetic Sciences, Entangled aims to test human consciousness.
The application uses a quantum random number generator to measure the influence of human consciousness. This quantum generator is essential because conventional computers, due to their deterministic nature, cannot generate truly random numbers. The quantum generator produces random sequences of 0s and 1s. In large datasets, the numbers of 0s and 1s should be approximately equal.
For the quantum random number generation, we used an in-house Quantis QRNG USB device. This device is plugged into our server, and through specialized drivers, we programmatically obtain the random sequences directly from the USB device.
Experiments were conducted to determine if a person could influence these quantum devices with their thoughts, specifically by thinking about more 0s or 1s. The results were astonishing, demonstrating the real potential of this influence.
To expand this test globally, we developed a new application. This platform allows users to sign up and track their contributions. The system generates a new random number for each user every second. Every hour, these contributions are grouped for analysis at personal, city, and global levels. We calculate the standard deviation of these contributions, and if this deviation exceeds a certain threshold, users receive notifications.
This data supports various experiments. For instance, in the "Earthquake Prediction" experiment, we use the contributions from all users in a specific area. If the standard deviation is higher than the set threshold, it may indicate that users have predicted an earthquake.
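The article does not spell out how the deviation is computed, so the following is only a minimal sketch of one common way to do it, assuming each generated bit is an independent fair coin flip. The function names and the threshold value are hypothetical and not taken from Entangled.

```javascript
// Hypothetical helper: how far an hourly sum of random bits sits from its
// expected value, measured in standard deviations (a z-score).
// Assumes every bit is an independent fair coin flip (p = 0.5).
function hourlyZScore(totalSum, sampleCount) {
  const expectedSum = sampleCount / 2;       // mean of n fair bits
  const stdDev = Math.sqrt(sampleCount) / 2; // sqrt(n * 0.5 * 0.5)
  return (totalSum - expectedSum) / stdDev;
}

// Assumed threshold; the article does not state the real one.
const Z_THRESHOLD = 3;

function exceedsThreshold(totalSum, sampleCount, threshold = Z_THRESHOLD) {
  return Math.abs(hourlyZScore(totalSum, sampleCount)) > threshold;
}

// One bit per second gives 3,600 bits per user per hour.
const SAMPLES_PER_HOUR = 60 * 60;

console.log(hourlyZScore(1830, SAMPLES_PER_HOUR));     // 1
console.log(exceedsThreshold(1900, SAMPLES_PER_HOUR)); // true
```

With 3,600 bits per hour, the expected sum is 1,800 and one standard deviation is 30, which is why only the sum of 1s needs to be kept for this kind of check.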
If you want to learn more about Entangled, you can check the official website.

Hourly-metrics schema modeling

I was the lead backend developer, and since MongoDB is my preferred database for all projects, it was a natural choice for Entangled.
For the backend development, I chose Node.js (Express), along with the Mongoose library for schema definition and data modeling. Mongoose, an Object Data Modeling (ODM) library for MongoDB, is widely used in the Node.js ecosystem for its ability to provide a straightforward way to model our application data.
Careful schema modeling was crucial due to the anticipated scaling of the database. Remember, we were generating one random number per second for each user.
My initial instinct was to create hourly-based schemas, aligning with our hourly analytics snapshots. The initial schema was structured as follows:
  • User: a reference to the "Users" collection
  • Total Sum: the sum of the user's random numbers for the hour; because each number is either 0 or 1, the sum alone was sufficient for later analysis
  • Generated At: the timestamp of the snapshot
  • Data File: a reference to the "Data Files" collection, which contains all random numbers generated by all users in a given hour
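As a rough illustration of the fields listed above, an hourly document might look like the plain object built below. The key names (`user`, `totalSum`, `generatedAt`, `dataFile`) and the string identifiers are assumptions for this sketch; the article does not show the actual Mongoose schema.

```javascript
// Hypothetical shape of one hourly-metrics document.
// Key names and ID formats are illustrative, not the project's real schema.
function makeHourlyMetric(userId, dataFileId, bits, generatedAt) {
  return {
    user: userId,                                        // reference to "Users"
    totalSum: bits.reduce((sum, bit) => sum + bit, 0),   // count of 1s
    generatedAt,                                         // snapshot timestamp
    dataFile: dataFileId,                                // reference to "Data Files"
  };
}

const hourlyDoc = makeHourlyMetric(
  "user-1",
  "file-42",
  [0, 1, 1, 0, 1],
  new Date("2023-12-14T10:00:00Z")
);
console.log(hourlyDoc.totalSum); // 3
```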
Although intuitive, this schema faced a significant scaling challenge. We estimated over 100,000 users soon after launch. This meant about 2.4 million records daily or 72 million records monthly. Consequently, we were looking at approximately 5GB of data (including storage and indexes) each month.
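The record counts above follow directly from one document per user per hour. A small sketch of the arithmetic, assuming a 30-day month as the article's estimate does:

```javascript
// Document-count estimate for the hourly schema (30-day month assumed).
const users = 100_000;
const hoursPerDay = 24;
const daysPerMonth = 30;

const hourlyDocsPerDay = users * hoursPerDay;             // one doc per user per hour
const hourlyDocsPerMonth = hourlyDocsPerDay * daysPerMonth;

console.log(hourlyDocsPerDay.toLocaleString("en-US"));   // 2,400,000
console.log(hourlyDocsPerMonth.toLocaleString("en-US")); // 72,000,000
```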
This encouraged me to explore alternative approaches.

Daily-metrics schema modeling

I explored whether alternative modeling approaches could further optimize storage requirements while also enhancing scalability and cost-efficiency.
A significant observation was that out of 5GB of total storage, 3.5GB was occupied by indexes, a consequence of the large volume of documents.
This led me to experiment with a schema redesign, shifting from hourly to daily metrics.
Rather than storing metrics for just one hour in each document, I now aggregated an entire day's metrics in a single document. Each document included a "samples" array with 24 entries, one for each hour of the day.
This approach works well because the array has a fixed size: a day has only 24 hours. That is very different from the anti-pattern of unbounded, ever-growing arrays in MongoDB.
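To make the daily layout concrete, here is a minimal in-memory sketch that folds hourly records into one document per user per day with a fixed 24-slot "samples" array. The key names, the use of UTC days, and `null` for hours without data are assumptions for illustration; the article does not show the real schema or how documents were written.

```javascript
// Fold hourly records into one document per user per UTC day.
// Each daily document carries a fixed-size array of 24 hourly sums.
function toDailyDocuments(hourlyRecords) {
  const byKey = new Map();
  for (const { user, generatedAt, totalSum } of hourlyRecords) {
    const day = generatedAt.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
    const key = `${user}|${day}`;
    if (!byKey.has(key)) {
      // null marks an hour with no data yet (an assumption of this sketch)
      byKey.set(key, { user, day, samples: new Array(24).fill(null) });
    }
    byKey.get(key).samples[generatedAt.getUTCHours()] = totalSum;
  }
  return [...byKey.values()];
}

const dailyDocs = toDailyDocuments([
  { user: "user-1", generatedAt: new Date("2023-12-14T00:00:00Z"), totalSum: 1795 },
  { user: "user-1", generatedAt: new Date("2023-12-14T05:00:00Z"), totalSum: 1812 },
  { user: "user-1", generatedAt: new Date("2023-12-15T00:00:00Z"), totalSum: 1801 },
]);
console.log(dailyDocs.length);            // 2
console.log(dailyDocs[0].samples.length); // 24
```

Because one document now covers 24 hourly snapshots, each index needs one entry per user per day instead of one per user per hour.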
This minor modification had a significant impact. The storage requirement for a month's worth of data drastically dropped from 5GB to just 0.49GB. This was mainly due to the decrease in index size, from 3.5GB to 0.15GB. The number of documents required each month dropped from 72 million to 3 million.
Encouraged by these results, I didn't stop there. My next step was to consider the potential benefits of shifting to a monthly-metrics schema. Could this further optimize our storage? This was the question that drove my next phase of exploration.

Monthly-metrics schema modeling

The monthly-metrics schema was essentially identical to the daily-metrics schema. The key difference lay in how the data was stored in the "samples" array, which now contained approximately 720 records representing a full month's metrics.
This adjustment was expected to further reduce the document count to around 100,000 documents for a month, leading me to anticipate even greater storage optimization. However, the actual results were surprising.
Upon storing a month's worth of data under this new schema, the total storage (data plus indexes) unexpectedly increased from 0.49GB to 0.58GB. The index shrank to 0.006GB, but the data itself grew from 0.34GB to 0.58GB. This increase is likely due to the methods MongoDB's WiredTiger storage engine uses to compress arrays internally.
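The monthly "samples" layout described above can be sketched as a mapping from a timestamp to a slot in the array. This is only an illustration under assumed conventions (UTC time, slots ordered by day and then hour); the function names are hypothetical. It also shows why the array holds approximately, not exactly, 720 entries.

```javascript
// Slot of a timestamp inside a monthly "samples" array,
// ordered by day of month and then by hour (UTC assumed).
function monthlySlot(generatedAt) {
  return (generatedAt.getUTCDate() - 1) * 24 + generatedAt.getUTCHours();
}

// Number of hourly slots in a given month (monthIndex is 0-based).
function slotsInMonth(year, monthIndex) {
  // Day 0 of the following month is the last day of this month.
  const days = new Date(Date.UTC(year, monthIndex + 1, 0)).getUTCDate();
  return days * 24;
}

console.log(monthlySlot(new Date("2023-12-14T10:00:00Z"))); // 322
console.log(slotsInMonth(2023, 10)); // 720 (November, 30 days)
console.log(slotsInMonth(2023, 11)); // 744 (December, 31 days)
```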

Summary

Below is a detailed summary of the different approaches and their respective results for one month’s worth of data:
| | Hourly Document | Daily Document | Monthly Document |
|---|---|---|---|
| Document Size | 0.098 KB | 1.67 KB | 49.18 KB |
| Total Documents (per month) | 72,000,000 (100,000 users * 24 hours * 30 days) | 3,000,000 (100,000 users * 30 days) | 100,000 (100,000 users) |
| Storage Size | 1.45 GB | 0.34 GB | 0.58 GB |
| Index Size | 3.49 GB | 0.15 GB | 0.006 GB |
| Total Storage (Data + Index) | 4.94 GB | 0.49 GB | 0.58 GB |
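The totals in the table above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures reported in the table; the object layout and helper names are illustrative.

```javascript
// Figures copied from the summary table (GB, one month of data).
const results = {
  hourly:  { storageGB: 1.45, indexGB: 3.49 },
  daily:   { storageGB: 0.34, indexGB: 0.15 },
  monthly: { storageGB: 0.58, indexGB: 0.006 },
};

const totalGB = (r) => r.storageGB + r.indexGB;
const indexShare = (r) => r.indexGB / totalGB(r);

for (const [name, r] of Object.entries(results)) {
  console.log(
    `${name}: total ${totalGB(r).toFixed(3)} GB, ` +
    `index share ${(indexShare(r) * 100).toFixed(0)}%`
  );
}

// Hourly-to-daily reduction in total storage.
const hourlyToDaily = totalGB(results.hourly) / totalGB(results.daily);
console.log(hourlyToDaily.toFixed(1)); // 10.1
```

Indexes account for roughly 71% of the hourly total but only about 31% of the daily total and about 1% of the monthly one. The monthly column sums to 0.586 GB, which the table reports as 0.58 GB, presumably because the inputs are themselves rounded.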

Conclusion

In this exploration of schema modeling for the Entangled project, we investigated the challenges and solutions for managing large-scale data in MongoDB.
Our journey began with hourly metrics, which, while intuitive, posed significant scaling challenges due to the large volume of data and index size.
This prompted a shift to daily metrics, which cut total storage roughly tenfold (from 4.94GB to 0.49GB), primarily due to a significant decrease in index size.
The experiment with monthly metrics offered an unexpected twist. Although it further reduced the number of documents, it increased the overall storage size, likely due to the internal compression mechanics of MongoDB's WiredTiger storage engine.
This case study highlights the critical importance of schema design in database management, especially when dealing with large volumes of data. It also emphasizes the need for continuous experimentation and optimization to balance storage efficiency, scalability, and cost.
If you want to learn more about designing efficient schemas with MongoDB, I recommend checking out the MongoDB Data Modeling Patterns series.
