I need to build a standalone desktop Java app to process a data file. One requirement is that a file can contain up to 100 GB of data. The data is not easy to divide into portions, so I need to process it in one shot. To do so, I need to store the data at various stages in a DB, which leads to about 200 GB of data storage. Assuming there is enough hard disk space, I don't know how much memory MongoDB needs for this amount of data. Is MongoDB suitable for this usage? Thanks for your advice.
Welcome to the MongoDB Community @V_W !
There is no prescriptive answer for how much RAM is required as this really depends on your use case, system resources (storage type and speed), workload (applications competing for the same resources), and performance expectations. I assume you are talking about 100GB of uncompressed data, which could be significantly less storage size if the data is reasonably compressible.
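To get a rough feel for how compressible your data is before loading anything, you could compress a representative sample locally. The sketch below uses the JDK's zlib implementation (`java.util.zip.Deflater`) as a proxy only. WiredTiger uses snappy block compression by default (zlib and zstd are options), so the on-disk ratio in MongoDB will differ. The class and method names are made up for this illustration.

```java
import java.util.zip.Deflater;

// Hypothetical helper: estimates how compressible a sample is with zlib.
// This is only a proxy; MongoDB's actual storage size depends on the
// configured block compressor and on document/index overhead.
final class CompressionEstimate {

    /** Returns the zlib-compressed size of the sample, in bytes. */
    static long compressedSize(byte[] sample) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(sample);
            deflater.finish();
            byte[] buffer = new byte[8192];
            long total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Compressed size divided by raw size (1.0 for an empty sample). */
    static double ratio(byte[] sample) {
        if (sample.length == 0) {
            return 1.0;
        }
        return (double) compressedSize(sample) / sample.length;
    }

    /** Scales a raw size by a measured ratio to project a stored size. */
    static long projectedBytes(long rawBytes, double ratio) {
        return Math.round(rawBytes * ratio);
    }

    public static void main(String[] args) {
        byte[] sample = new byte[1 << 20]; // placeholder: load real sample data here
        double r = ratio(sample);
        long raw = 100L * 1024 * 1024 * 1024; // 100 GB of raw input
        System.out.printf("ratio=%.4f projected=%d bytes%n", r, projectedBytes(raw, r));
    }
}
```

Run this against real sample files rather than the zero-filled placeholder, and treat the result as an order-of-magnitude guide rather than a prediction.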
I recommend testing your outcomes with some representative data. You may find a data generation tool like mgeneratejs helpful in this regard. If you have some existing test data, you can also extend this using a recipe like "mongodb - duplicate a collection into itself" on Stack Overflow.
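Since the application is in Java, a small generator on that side can also produce test payloads of a chosen size. This is a minimal sketch, not a replacement for mgeneratejs: the class name, the `randomFraction` knob, and the zero-filled remainder are all assumptions made for illustration, and real data will have a different structure.

```java
import java.util.Random;

// Hypothetical generator for synthetic byte-array payloads.
// The first part of each payload is pseudo-random (hard to compress),
// the remainder is left as zeros (easy to compress).
final class SyntheticPayloads {

    /**
     * @param size           payload length in bytes
     * @param randomFraction share of the payload filled with random bytes, 0.0 to 1.0
     * @param seed           seed, so the same inputs always give the same payload
     */
    static byte[] payload(int size, double randomFraction, long seed) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be >= 0");
        }
        if (randomFraction < 0.0 || randomFraction > 1.0) {
            throw new IllegalArgumentException("randomFraction must be in [0, 1]");
        }
        byte[] out = new byte[size];
        int randomBytes = (int) Math.round(size * randomFraction);
        byte[] noise = new byte[randomBytes];
        new Random(seed).nextBytes(noise);
        System.arraycopy(noise, 0, out, 0, randomBytes);
        return out;
    }

    public static void main(String[] args) {
        // Example: a 500 KB payload, half random and half zeros.
        byte[] p = payload(500 * 1024, 0.5, 1L);
        System.out.println("generated " + p.length + " bytes");
    }
}
```

Seeding the generator keeps test runs repeatable, which makes before/after comparisons easier to trust.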
If you have specific scenarios that could perhaps be tuned (for example ingesting, modelling, or updating data) you could start a discussion with more details such as your specific MongoDB driver & server versions, a slow command or query (and associated explain() output), a snippet of code, and how you are measuring the execution time.
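When reporting how you measure execution time, it helps to show the harness itself. The following is one possible sketch using only `System.nanoTime()`, with a few warm-up runs and a median over repeated runs. The class name and the placeholder workload are assumptions. In practice the `Runnable` would wrap your own read or write call.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical timing helper: warm up, run several times, report the median.
final class TimingSketch {

    /** Runs op `warmup` times untimed, then `runs` times timed (nanoseconds). */
    static long[] sample(Runnable op, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            op.run();
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            op.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** Median of the samples; for an even count, the mean of the two middle values. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the operation you want to measure.
        Runnable work = () -> Arrays.sort(new Random(1).ints(100_000).toArray());
        long[] elapsed = sample(work, 3, 11);
        System.out.printf("median: %.3f ms%n", median(elapsed) / 1_000_000.0);
    }
}
```

The median is less sensitive than the mean to one-off pauses such as garbage collection, which is why it is used here.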
I also recommend reviewing schema design patterns (and anti-patterns) that may apply to your use case:
Thanks very much for your information.
With the data samples I have, I haven't noticed a big jump in memory usage for a 500 KB byte array along with the processed data. I use MongoDB for temporary data storage, holding a plain byte array and a collection of data entities, so I don't need any complex patterns. With Spring Data MongoDB, I don't deal directly with any native MongoDB features for reading/writing data. Issues may arise when I have a much larger data sample.
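One issue that can show up with larger samples is the 16 MB limit on a single BSON document: a byte array above that size cannot be stored in one document, and MongoDB provides GridFS for such cases. As a rough illustration of the underlying idea only, the sketch below splits a byte array into fixed-size chunks and joins them back. The class name and chunk size are assumptions, and this is not how GridFS or Spring Data MongoDB are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a large payload into chunks that each fit
// comfortably inside one document, and reassembles them afterwards.
final class PayloadChunker {

    /** Splits payload into consecutive chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] payload, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be > 0");
        }
        List<byte[]> chunks = new ArrayList<>();
        // long offset avoids int overflow for payloads near Integer.MAX_VALUE
        for (long offset = 0; offset < payload.length; offset += chunkSize) {
            int from = (int) offset;
            int to = (int) Math.min(payload.length, offset + chunkSize);
            chunks.add(Arrays.copyOfRange(payload, from, to));
        }
        return chunks;
    }

    /** Concatenates chunks in order back into a single array. */
    static byte[] join(List<byte[]> chunks) {
        long total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        byte[] out = new byte[Math.toIntExact(total)];
        int position = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, out, position, chunk.length);
            position += chunk.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[40 * 1024 * 1024]; // placeholder 40 MB payload
        List<byte[]> chunks = split(payload, 8 * 1024 * 1024); // 8 MB per chunk (assumed)
        System.out.println(chunks.size() + " chunks");
    }
}
```

Each chunk would then be stored as its own document together with an owner id and a sequence number, so the payload can be read back in order.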