Proposal for Enhancing Collection Initialization Control in MongoDB Insertions

alvarolb · February 11, 2024, 10:23am

Hello MongoDB Community,

I’m reaching out to discuss and propose an enhancement for MongoDB, specifically targeting the way we handle collection creation during document insertion processes. As we navigate through the development and maintenance of MongoDB applications, one aspect that often comes up is the automatic creation of collections when new documents are inserted. This feature, while convenient in many scenarios, can sometimes obscure certain operational nuances, particularly when dealing with specialized collections such as timeseries.

The core of the proposal is the introduction of a new parameter within the insertion command options (e.g., require_existing_collection on the optional arguments to a MongoDB insert operation) that would disable the automatic creation of collections upon the insertion of new documents. The motivation behind this suggestion stems from the need to ensure that collections are initialized properly before use, especially in cases where specific indexes, metaFields, timeFields, and other configurations are crucial for the collection’s operation.

Currently, without this parameter, developers might find themselves needing to verify the existence of a collection before each insertion or maintaining this information in a cache. This approach, however, is prone to errors and inconsistencies, particularly if changes are made to the collection outside the current operational context. For timeseries collections, which require a different initialization process than regular collections, the ability to explicitly control when and how a collection is created becomes even more important.
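As a rough illustration of the workaround described above, the check-then-insert pattern might look like the sketch below. It assumes a PyMongo-style `Database` handle passed in as `db`. The names `CollectionMissing`, `insert_existing_only`, `init_timeseries`, and `insert_reading`, and the field names `ts` and `device`, are made up for this example and are not part of any MongoDB API.

```python
class CollectionMissing(Exception):
    """Raised by this sketch when the target collection does not exist."""


def insert_existing_only(db, name, doc):
    """Insert doc only if collection `name` already exists (check-then-insert)."""
    # list_collection_names accepts a filter, so only this one name is looked up.
    if name not in db.list_collection_names(filter={"name": name}):
        raise CollectionMissing(name)
    return db[name].insert_one(doc)


def init_timeseries(db, name):
    """Create the collection explicitly with its time series options."""
    # "ts" and "device" are assumed field names, chosen for illustration only.
    db.create_collection(name, timeseries={"timeField": "ts", "metaField": "device"})


def insert_reading(db, name, doc):
    """Try the insert; if the collection is missing, initialize it and retry once."""
    try:
        return insert_existing_only(db, name, doc)
    except CollectionMissing:
        init_timeseries(db, name)
        return insert_existing_only(db, name, doc)
```

Note that the existence check and the insert are two separate round trips, so another client can create or drop the collection in between. That gap is exactly what a server-side option would close.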

Implementing such a parameter would offer several benefits:

Improved Initialization Control: It would allow developers to detect insert failures due to non-existent collections, prompting a proper initialization sequence that includes necessary index creation and other setup procedures.
Enhanced Efficiency: By avoiding the need to check for a collection’s existence or maintain a potentially inconsistent cache, operations can be streamlined, focusing on handling exceptions meaningfully.
Consistency in Timeseries Collections: For timeseries collections, which demand specific initialization steps, this control is invaluable. Ensuring that collections are set up correctly before any data insertion can significantly impact performance and functionality.

This proposal does not aim to detract from the flexibility and ease of use that MongoDB offers. Instead, it seeks to augment the database’s capability to handle more complex scenarios where collection initialization plays a critical role in the application’s performance and reliability.

I am eager to hear your thoughts on this proposal. Your feedback, insights, and any further suggestions would be immensely valuable in assessing the feasibility and potential impact of this enhancement on MongoDB’s future versions.

Thank you for considering this proposal, and I look forward to our fruitful discussions.

John_Page · February 13, 2024, 12:13pm

Having permissions to insert into a collection implicitly gives the ability to create it. My thoughts on this would be to separate out createCollection from insert as a permission so explicit creation would fail for a correctly scoped application user. The advantage here is you don’t need to change code everywhere you use insert and you can call out that creating collections, like creating indexes, should be an explicit task in production.

alvarolb · February 13, 2024, 2:32pm

Thank you for your thoughtful response and suggestion to separate createCollection permissions from insert permissions. This approach indeed introduces a clear separation of concerns and enhances security by ensuring that collection creation is an explicit action, particularly valuable in production environments.

However, while this solution offers significant benefits in terms of security and operational clarity, it may not align well with the requirements of all application contexts, especially those that are dynamic and user-driven, such as in IoT applications that I am currently developing. In these environments, the ability to create collections dynamically in response to user interactions is crucial for flexibility and scalability.

The main challenge with the proposed permission separation approach in such dynamic environments includes:

Loss of Granularity: The need to manage multiple users or roles (one for insertion and another for collection creation) introduces complexity, especially when the application logic requires the dynamic creation of collections. This granularity is essential for IoT backends, where users might need to create their own “buckets” of information on-the-fly.
Operational Complexity: Implementing separate permissions complicates the application’s backend logic, requiring additional logic to handle permission errors and switch between users or roles depending on the operation. This increases the risk of bugs and makes the system harder to maintain.
Reduced Flexibility: For applications that thrive on being able to dynamically adjust to user needs by creating new collections with specific configurations (e.g., timeseries collections with particular indexes), the need to predefine and explicitly create these collections before insertion can hinder responsiveness and agility.
Increased Development Overhead: The requirement to detect insertion failures due to permission errors, followed by initiating collection creation with a different user, introduces significant complexity into the development process. This approach necessitates establishing a new communication session with the database, handling authentication for a different user or role, and managing the context switch between different operations. Such steps not only complicate the codebase but also impact the developer experience negatively. This added complexity can lead to a more cumbersome development process, detracting from the efficiency of error handling and recovery mechanisms, and resulting in additional overhead for maintaining and debugging the application.

Given these considerations, while I appreciate the security and clarity benefits of the suggested approach, I propose maintaining the flexibility to control collection creation through the insertion process itself. Introducing an optional parameter (e.g., require_existing_collection) could provide a more nuanced control, allowing developers to specify when collections should be created dynamically and when they should exist beforehand. This would offer a balanced solution that accommodates both the need for explicit collection management in production and the flexibility required in dynamic, user-centric applications.

I believe a solution that offers flexibility while still enabling strict control when needed could be more beneficial across a wider range of MongoDB use cases.

The issue of automatic creation of databases and collections in MongoDB is a recurring concern within the community, as evidenced by numerous discussions and questions across various forums:

alvarolb · February 20, 2024, 3:58pm

Any other thoughts on this topic?

Daniel_Oliver · June 14, 2024, 6:31am

I’ve only just found this thread, so I hope it’s not too late. I’ve found myself banging my head against a keyboard on multiple occasions when I’ve forgotten to change database before executing an insert, only to realise an hour later why my change isn’t working as expected.

We have existing processes to manage our indexes/overall schema, so adding an option require_existing_collection to thousands of operations in code could take a significant amount of effort and not address the mode by which most mistakes happen; making manual changes on the command-line.

Could the proposed createCollection workflow be made a config-level behaviour? From my perspective, the “challenges” you list above are all good things for deployments with well established collection formats and management mechanisms.

Eugene_Desyatnik · February 11, 2025, 7:38am

I agree with this proposal. There are risks with creating collections on the fly, with respect to performance. When an application creates a collection, and it begins to grow unbeknownst to operations support teams, issues surface around performance for a collection that support teams were not aware existed. I understand there are use cases where flexibility is required; however, MongoDB supports a wide variety of analytical as well as transactional applications across a wide range of industries. In certain transactional systems, the introduction of a new collection – as well as new queries against that collection in the code base – is already not possible on the fly in the application layer, as it requires a formal deploy of the application code on the front end side. Quite similarly, there should be a required and accompanying database code deploy that precedes it, which introduces the collection as well as all indexes deemed initially beneficial to support the new application code. This would also ensure the new collection is created in all non-production environments as well, as that new code base goes through the SDLC process, and that the collections present in DEV, TEST, and PROD are in alignment after the release has fully moved up through each of them and into PROD.

aneroid · February 11, 2025, 9:27am

I support the idea of having a way to prevent automatic collection creation.

But I think this issue occurs because of the convenience in the language APIs:

db.xyz.insert_one() in Python or db.xyz.insertOne() in JS will each treat “xyz” like it was a collection name; as long as it doesn’t conflict with an existing method on the db object.
This dynamic attribute lookup means doing db.get_collection("xyz").insert is less convenient to use/type, so the shortcut gets used and leads to these kinds of errors.
I don’t think many experienced devs would use the shortcut-form and it’s a common pitfall/trap for anyone new to MongoDB, regardless of overall experience.
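To make the pitfall concrete, here is a toy model of the dynamic attribute lookup described above. It is not driver code: `ToyDatabase` and `ToyCollection` are invented for illustration and only imitate the behaviour where any unknown attribute on the db object is treated as a collection name and the first insert brings the collection into existence.

```python
class ToyCollection:
    """Stand-in for a collection handle; nothing exists until the first write."""

    def __init__(self, db, name):
        self._db = db
        self.name = name

    def insert_one(self, doc):
        # The first write implicitly creates the collection.
        self._db._store.setdefault(self.name, []).append(doc)


class ToyDatabase:
    """Toy db object that resolves unknown attributes to collection handles."""

    def __init__(self):
        self._store = {}

    def get_collection(self, name):
        return ToyCollection(self, name)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real methods take priority.
        if name.startswith("_"):
            raise AttributeError(name)
        return self.get_collection(name)

    def list_collection_names(self):
        return sorted(self._store)


db = ToyDatabase()
db.users.insert_one({"name": "ada"})
db.usres.insert_one({"name": "bob"})  # typo: silently becomes a second collection
print(db.list_collection_names())  # ['users', 'usres']
```

The misspelled name raises no error at any point. The only symptom is an unexpected extra collection.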

Instead of a per-command flag, I’d prefer a Server-side flag/env var either on the specific DB (preferred) or the whole instance. For example:

db.adminCommand({ setParameter: 1, autoCreateCollections: false })
it would default to true to remain compatible with all existing code
DevOps, Leads, Sys Admins can explicitly set the flag on new and existing instances to prevent auto-creation in the future.

Main reason for not wanting a per-command flag: Anyone who “didn’t check the object type before inserting” or “didn’t use get_collection” will also not use or forget to use the require_existing_collection flag.
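Until something like this exists on the server, a database-wide switch can only be approximated on the client. The sketch below wraps a PyMongo-style `Database` handle so that both `db.xyz` and `db["xyz"]` fail for unknown names. `StrictDatabase` and its `auto_create_collections` argument are hypothetical names that mirror the flag suggested above; they are not an existing driver or server feature.

```python
class StrictDatabase:
    """Wraps a database handle and refuses collection names that do not exist yet."""

    def __init__(self, db, auto_create_collections=False):
        self._db = db
        self._auto_create = auto_create_collections

    def get_collection(self, name):
        # One extra round trip per lookup; acceptable for a sketch, costly at scale.
        if not self._auto_create and name not in self._db.list_collection_names():
            raise LookupError(f"collection {name!r} does not exist")
        return self._db[name]

    __getitem__ = get_collection

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Real methods are defined on the wrapped class; forward those unchanged.
        if hasattr(type(self._db), name):
            return getattr(self._db, name)
        # Anything else is the collection shortcut, so it goes through the check.
        return self.get_collection(name)
```

Because the guard is set once where the connection is created, individual call sites do not need to remember a flag. It remains a client-side check, though, so any code path that bypasses the wrapper (including a shell session) can still create collections implicitly.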