I’m working on a project where I need to manage several workflows, each associated with a set of players. My goal is to have a separate collection for each workflowId containing the data of the players associated with that workflow, with each workflow having a variable number of players. Instead of deleting players with delete operations in a single collection where data is constantly being inserted, we are considering dropping the entire collection for each specific workflow when required, rather than deleting individual records or using TTL.
Is this a good idea, or could it cause problems in the MongoDB engine?
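To illustrate the idea, here is a simplified sketch in Python/pymongo; the per-workflow naming scheme and database name are just placeholders, not part of our actual system:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["games"]  # hypothetical database name

def players_collection(workflow_id: str):
    # One collection per workflow, e.g. "players_workflow42" (naming is illustrative).
    return db[f"players_{workflow_id}"]

def add_players(workflow_id: str, players: list[dict]):
    # Inserts always land in that workflow's own collection.
    players_collection(workflow_id).insert_many(players)

def cleanup_workflow(workflow_id: str):
    # Instead of delete_many() or TTL, the whole collection is dropped in one operation.
    players_collection(workflow_id).drop()
```

Note that dropping also discards the collection's indexes, so they would need to be recreated if the same collection name is reused later.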
Hi @Josue_Guerrero1 … your idea of dropping and adding collections sounds a little bit “heavyweight” and likely to require more resources compared with updating fields in the collection. However, one can imagine situations in which it might make sense. In any case, there is not enough information in your description to make a “scientific” evaluation. Why would you not merely update fields? What’s the win here? Easier coding, or some other advantage?
In addition to what @Jack_Woehr said, I also wanted to provide some additional context in case you end up with a large number of collections. There is no hard limit on how many collections you can have in a database, but here are the general recommendations our product team has for each instance type:
Hello @Jack_Woehr, thanks for your swift response. Yes, this is because, in the current system, we believe a collection that is constantly inserting and deleting data (via TTL) is experiencing performance issues. We ran a performance test, and the insertions are very slow; this collection could receive more than 1 million documents per day. I don’t particularly like this approach, but I’d like to understand the consequences of dropping the collection. Would it lock the database or cause other issues?
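For context, this is a simplified sketch of the kind of TTL setup I mean; the `conclusions` collection and `createdAt` field names are placeholders:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
conclusions = client["games"]["conclusions"]  # hypothetical names

# TTL index: documents expire roughly two days after their createdAt timestamp
# (the ~2-day retention is described further down in the thread).
# The background TTL monitor removes expired documents individually, which
# competes with the heavy insert load on the same collection.
conclusions.create_index(
    [("createdAt", ASCENDING)],
    expireAfterSeconds=2 * 24 * 60 * 60,
)
```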
Maybe you should look at your data design more closely. It’s hard to give you a concrete answer based on what we’ve been told. Possibly your design should be more relational.
@Jack_Woehr Thanks for your response. I’ll try to give you an example of the situation and why I’m leaning toward this approach.
Currently, we have information that needs to be filtered based on specific criteria. Each filter can match a number of users, and since these users are distributed across different databases, we need to store this information gradually in temporary storage (called conclusions). Later, using an AND operation, we match the players who meet the criteria in more than one filter and group them together.
The challenge here is that a single filter could match over 250,000 users. Additionally, these conclusions are temporary and only remain in the database for a limited period of time.
Example 1:
Let’s say we have two filters:
Filter 1: Players from the United States
Filter 2: Players who play a certain game
If Filter 1 matches 300,000 users and Filter 2 matches 150,000, we store these matches temporarily in the conclusions collection. Then we use an AND operation to find the users who match both filters (e.g., players from the United States who also play the specific game).

The problem is that we insert this information in batches of about 5,000 users, and the process is very slow. The data is temporary and typically stays in the collection for about 2 days before being deleted by TTL. Furthermore, the process runs in parallel: the same job can be executed multiple times in the service, depending on the number of instances available on the server.

The issue lies in the insertion process: retrieving the information is very fast, but inserting the data into the collection is too slow.
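To make this concrete, here is a simplified sketch of the batch insert and the AND step; the collection and field names (`conclusions`, `filterId`, `playerId`) are just placeholders for the example:

```python
from pymongo import MongoClient, InsertOne

client = MongoClient("mongodb://localhost:27017")
conclusions = client["games"]["conclusions"]  # hypothetical names

def store_matches(filter_id: str, player_ids: list[str], batch_size: int = 5000):
    # Batched bulk writes of ~5,000 documents per call, one batch at a time.
    for start in range(0, len(player_ids), batch_size):
        ops = [
            InsertOne({"filterId": filter_id, "playerId": pid})
            for pid in player_ids[start:start + batch_size]
        ]
        # ordered=False lets the server keep applying writes past individual errors.
        conclusions.bulk_write(ops, ordered=False)

def players_matching_all(filter_ids: list[str]):
    # "AND" of several filters: group by player and keep only the players
    # that appear under every filterId in the list.
    pipeline = [
        {"$match": {"filterId": {"$in": filter_ids}}},
        {"$group": {"_id": "$playerId", "hits": {"$addToSet": "$filterId"}}},
        {"$match": {"hits": {"$size": len(filter_ids)}}},
    ]
    return [doc["_id"] for doc in conclusions.aggregate(pipeline)]
```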
@Josue_Guerrero1, I infer from your use case that you’re mining your conclusions from disparate databases that come either from different generations of game design or from partnerships or acquisitions.
Perhaps your operation could benefit from an enterprise event bus model (vide Kafka, etc.) where there’s a constant flow of information rather than a “scoop and dump” approach?
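For illustration only, a minimal sketch of that kind of flow with a Kafka producer; the topic name and payload shape are made up for the example:

```python
import json
from kafka import KafkaProducer  # kafka-python

# Each filter match is published as it is produced, instead of being
# accumulated in a temporary "conclusions" collection and bulk-loaded later.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_match(filter_id: str, player_id: str):
    # Topic name "filter-matches" is hypothetical.
    producer.send("filter-matches", {"filterId": filter_id, "playerId": player_id})

producer.flush()
```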