I’m working on a project where I need to manage several workflows, each associated with a set of players. My goal is to have a separate collection for each workflowId containing the data of the players associated with that workflow, with each workflow having a variable number of players. Instead of deleting players with delete operations in a single collection where data is constantly being inserted, we are considering dropping the entire collection for each specific workflow when required, rather than deleting individual records or using TTL.
Is this a good idea, or could it cause problems in the MongoDB engine?
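To illustrate the idea, here is a simplified sketch in Python/pymongo; the per-workflow naming scheme and database name are just placeholders, not part of our actual system:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["games"]  # hypothetical database name

def players_collection(workflow_id: str):
    # One collection per workflow, e.g. "players_workflow42" (naming is illustrative).
    return db[f"players_{workflow_id}"]

def add_players(workflow_id: str, players: list[dict]):
    # Inserts always land in that workflow's own collection.
    players_collection(workflow_id).insert_many(players)

def cleanup_workflow(workflow_id: str):
    # Instead of delete_many() or TTL, the whole collection is dropped in one operation.
    players_collection(workflow_id).drop()
```

Note that dropping also discards the collection's indexes, so they would need to be recreated if the same collection name is reused later.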
Hi @Josue_Guerrero1 … your idea of dropping and adding collections sounds a little bit “heavyweight” and likely to require more resources compared with updating fields in the collection. However, one can imagine situations in which it might make sense. In any case, there is not enough information in your description to make a “scientific” evaluation. Why would you not merely update fields? What’s the win here? Easier coding, or some other advantage?
In addition to what @Jack_Woehr said, I also wanted to provide some additional context in case you end up with a large number of collections. There is no hard limit on how many collections you can have in a database, but here are the general recommendations our product team has for each instance type:
Hello @Jack_Woehr, thanks for your swift response. Yes, this is because, in the current system, we believe a collection that is constantly inserting and deleting data (via TTL) is experiencing performance issues. We ran a performance test, and the insertions are very slow; this collection could receive more than 1 million documents per day. I don’t particularly like this approach, but I’d like to understand the consequences of dropping the collection. Would it lock the database or cause other issues?
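For context, this is a simplified sketch of the kind of TTL setup I mean; the `conclusions` collection and `createdAt` field names are placeholders:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
conclusions = client["games"]["conclusions"]  # hypothetical names

# TTL index: documents expire roughly two days after their createdAt timestamp
# (the ~2-day retention is described further down in the thread).
# The background TTL monitor removes expired documents individually, which
# competes with the heavy insert load on the same collection.
conclusions.create_index(
    [("createdAt", ASCENDING)],
    expireAfterSeconds=2 * 24 * 60 * 60,
)
```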
Maybe you should look at your data design more closely. It’s hard to give you a concrete answer based on what we’ve been told. Possibly your design should be more relational.
@Jack_Woehr Thanks for your response. I’ll try to give you an example of the situation and why I’m leaning toward this approach.
Currently, we have information that needs to be filtered based on specific criteria. Each filter can match a number of users, and since these users are distributed across different databases, we need to store this information gradually in temporary storage (called conclusions). Later, using an AND operation, we match the players who meet the criteria in more than one filter and group them together.
The challenge here is that a single filter could match over 250,000 users. Additionally, these conclusions are temporary and only remain in the database for a limited period of time.
Example 1:
Let’s say we have two filters:
Filter 1: Players from the United States
Filter 2: Players who play a certain game
If Filter 1 matches 300,000 users and Filter 2 matches 150,000, we store these matches temporarily in the conclusions collection. Then we use an AND operation to find the users who match both filters (e.g., players from the United States who also play the specific game).

The problem is that we insert this information in batches of about 5,000 users, and the process is very slow. The data is temporary and typically stays in the collection for about 2 days before being deleted by TTL. Furthermore, the process runs in parallel: the same job can be executed multiple times in the service, depending on the number of instances available on the server.

The issue lies in the insertion process: retrieving the information is very fast, but inserting the data into the collection is too slow.
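To make this concrete, here is a simplified sketch of the batch insert and the AND step; the collection and field names (`conclusions`, `filterId`, `playerId`) are just placeholders for the example:

```python
from pymongo import MongoClient, InsertOne

client = MongoClient("mongodb://localhost:27017")
conclusions = client["games"]["conclusions"]  # hypothetical names

def store_matches(filter_id: str, player_ids: list[str], batch_size: int = 5000):
    # Batched bulk writes of ~5,000 documents per call, one batch at a time.
    for start in range(0, len(player_ids), batch_size):
        ops = [
            InsertOne({"filterId": filter_id, "playerId": pid})
            for pid in player_ids[start:start + batch_size]
        ]
        # ordered=False lets the server keep applying writes past individual errors.
        conclusions.bulk_write(ops, ordered=False)

def players_matching_all(filter_ids: list[str]):
    # "AND" of several filters: group by player and keep only the players
    # that appear under every filterId in the list.
    pipeline = [
        {"$match": {"filterId": {"$in": filter_ids}}},
        {"$group": {"_id": "$playerId", "hits": {"$addToSet": "$filterId"}}},
        {"$match": {"hits": {"$size": len(filter_ids)}}},
    ]
    return [doc["_id"] for doc in conclusions.aggregate(pipeline)]
```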
@Josue_Guerrero1, I infer from your use case that you’re mining your conclusions from disparate databases that come either from different generations of game design or from partnerships or acquisitions.
Perhaps your operation could benefit from an enterprise event bus model (vide Kafka, etc.) where there’s a constant flow of information rather than a “scoop and dump” approach?
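For illustration only, a minimal sketch of that kind of flow with a Kafka producer; the topic name and payload shape are made up for the example:

```python
import json
from kafka import KafkaProducer  # kafka-python

# Each filter match is published as it is produced, instead of being
# accumulated in a temporary "conclusions" collection and bulk-loaded later.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_match(filter_id: str, player_id: str):
    # Topic name "filter-matches" is hypothetical.
    producer.send("filter-matches", {"filterId": filter_id, "playerId": player_id})

producer.flush()
```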