Writing stream to collection without putting it all into memory

Hello,
I want to insert rows into a collection from a Java 8 stream. Is there a way to insert larger-than-memory streams, and maybe do it in parallel? The only way I could think of is to split the stream into chunks (since it's a stream, I don't know the exact size) and insert each chunk sequentially.
Is there a better way to do this?
I am using the sync driver.
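For reference, the chunk-by-chunk approach I had in mind is roughly this sketch (connection string, collection name, and the sample stream are just placeholders):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

public class StreamInsert {

    private static final int BATCH_SIZE = 1000;

    // Pull documents lazily through the stream's iterator, so the whole
    // stream never has to sit in memory, and insert them batch by batch.
    static void insertInBatches(Stream<Document> stream, MongoCollection<Document> collection) {
        List<Document> batch = new ArrayList<>(BATCH_SIZE);
        Iterator<Document> it = stream.iterator();
        while (it.hasNext()) {
            batch.add(it.next());
            if (batch.size() == BATCH_SIZE) {
                collection.insertMany(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            collection.insertMany(batch); // flush the last partial batch
        }
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll = client.getDatabase("test").getCollection("rows");
            // Placeholder source: in my case the stream is produced elsewhere and is larger than RAM.
            Stream<Document> rows = Stream.generate(() -> new Document("value", Math.random())).limit(5_000_000);
            insertInBatches(rows, coll);
        }
    }
}
```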

Hi @Ali_ihsan_Erdem1,

I don’t understand what you are trying to do exactly.

  • Where do the streams come from?
  • What is “larger than RAM” exactly?
  • What are you running in prod for MongoDB? You mentioned “chunks”, so is it a sharded cluster?
  • How many Java processes are you running in parallel to handle these streams?
  • If you are inserting sequentially, you don’t have the correct shard key. See Choose a shard key.

Let's say I have a 20 GB CSV file I want to insert into a MongoDB collection, and I have 8 GB of RAM. Is there a way to do this other than multiple insert_one calls or splitting the data into chunks and writing it chunk by chunk?

Are you going to transform the data from the CSV (like transforming dates into ISODates, or making sure some geolocation data is stored as a valid GeoJSON point), or do you just want to insert the CSV “as is”?

If you just want to load it as is, then I would recommend using mongoimport with --type=csv.

You can use -j to set the number of insertion workers, but if you really want to go fast, the easiest way is probably to cut the file into 20 parts and spawn 20 jobs on 20 different machines. But then, is your cluster strong enough to ingest that much data? :stuck_out_tongue:
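Something along these lines, with made-up file, database, and collection names:

```bash
# --headerline takes the field names from the first CSV row;
# -j is the short form of --numInsertionWorkers.
mongoimport --uri="mongodb://localhost:27017" \
  --db=test --collection=bigload \
  --type=csv --headerline \
  --numInsertionWorkers=8 \
  --file=data.csv
```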

Cheers,
Maxime.

I know mongoimport, and like I said, I know that I can chunk my data into pieces.
I was looking for a more elegant solution, like handing the MongoDB driver an iterator and expecting the driver to do some magic.
I can resize my cluster to the moon, that is not the problem, but in my previous experience MongoDB can't leverage all the hardware. We get "out of memory" errors on aggregations despite having an enormous amount (256 GB) of RAM and relatively small (2 GB max) collections. This is why I am rewriting our pipeline in Java.

This triggers my spidey-sense that you might need to add the {allowDiskUse: true} option in your aggregation command.

You probably get that because your pipeline is trying to use more than 100MB of RAM and needs to write to temporary disk files.
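Since you are on the Java sync driver, that option is set on the iterable returned by aggregate(). A minimal sketch (the pipeline itself is just a placeholder):

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.Arrays;

public class AggregationWithDiskUse {

    // Placeholder pipeline: group on a "status" field and count documents per group.
    // allowDiskUse(true) lets blocking stages (group, sort, ...) spill to temporary
    // files on disk instead of failing once they hit the in-memory limit.
    static void run(MongoCollection<Document> collection) {
        collection.aggregate(Arrays.asList(
                        Aggregates.group("$status", Accumulators.sum("count", 1))))
                .allowDiskUse(true)
                .forEach(doc -> System.out.println(doc.toJson()));
    }
}
```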

Back to your CSV issue, I wouldn’t use insert_one at all in this situation, as each insert_one operation would need a TCP round trip to acknowledge each write operation.

I would - indeed - use a bulkWrite operation instead (or an insertMany) to reduce the number of TCP round trips. I would send the bulk write every 1,000 docs or so, maybe 10,000 if they are small.
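A rough sketch of that batching pattern with the Java sync driver (connection string, file name, and the naive comma-splitting are just placeholders; a real loader would use a proper CSV parser):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.InsertOneModel;
import org.bson.Document;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CsvBulkLoader {

    private static final int BATCH_SIZE = 1000; // bump towards 10,000 for very small docs

    public static void main(String[] args) throws IOException {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017");
             BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {

            MongoCollection<Document> coll = client.getDatabase("test").getCollection("bigload");
            List<InsertOneModel<Document>> batch = new ArrayList<>(BATCH_SIZE);

            // Naive CSV handling: assumes a header line and no quoted commas.
            String[] fields = reader.readLine().split(",");

            String line;
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",");
                Document doc = new Document();
                for (int i = 0; i < fields.length && i < values.length; i++) {
                    doc.append(fields[i], values[i]);
                }
                batch.add(new InsertOneModel<>(doc));

                if (batch.size() == BATCH_SIZE) {
                    coll.bulkWrite(batch); // one round trip per 1,000 docs instead of 1,000 round trips
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.bulkWrite(batch);     // flush the last partial batch
            }
        }
    }
}
```

Only one line of the file and one batch of documents are held in memory at a time, so this works for a 20 GB file on an 8 GB machine.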

Cheers,
Maxime.

I did add allowDiskUse: true to my pipeline. Still the same error.

If you feel like it, we can have a look at the pipeline in another topic and try to find the problem. Feel free to tag me.

Ideally I would need a way to reproduce the problem, which is most probably not easy with just a few sample docs. But if you could provide a few sample docs + the pipeline + the expected output + the error message, I think we can have a look at it.

Cheers,
Maxime.
