What are the best practices for reading and writing large byte arrays?

Paulo_Morgado · April 9, 2024, 1:11am

I need to read and write large byte arrays (1 or 2 MB).

My API receives streams to read from and write to.

I’d like to avoid large object heap allocations.

Is there a MongoDB API to read and write large byte arrays without allocations and copying?

James_Kovacs · April 9, 2024, 11:23pm

Hi, Paulo,

Welcome to the MongoDB Community Forums.

If you already have the byte[] in memory, you can assign it to your model by reference and then persist it with the rest of the model.

// coll is assumed to be an IMongoCollection<Document>
var bytes = new byte[] { 1, 2, 3 };
coll.InsertOne(new Document { Data = bytes });

class Document
{
    public ObjectId Id { get; set; }
    public byte[] Data { get; set; }
}

During serialization, the byte[] will be written to the output buffer used to serialize the wire messages, but intermediate copies won’t be made.

However, you probably want to serialize from a stream or IEnumerable<T>. Unfortunately, this isn’t possible because BSON binary is written as:

signed_byte(5) <<e_name>> <<numBytes>> <<binarySubtype>> <<bytes>>
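
To make that layout concrete, here is a minimal sketch that encodes one binary element by hand using only the .NET base class library. BsonBinaryLayout and EncodeBinaryElement are hypothetical names made up for illustration and are not part of the driver; the driver's own writer does this work for you. The subtype default of 0x00 is the generic binary subtype.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not a driver API.
public static class BsonBinaryLayout
{
    public static byte[] EncodeBinaryElement(string name, byte[] payload, byte subtype = 0x00)
    {
        using (var ms = new MemoryStream())
        {
            ms.WriteByte(0x05);                          // element type: binary

            byte[] nameBytes = Encoding.UTF8.GetBytes(name);
            ms.Write(nameBytes, 0, nameBytes.Length);    // e_name ...
            ms.WriteByte(0x00);                          // ... as a null-terminated cstring

            // numBytes: 32-bit little-endian length, written BEFORE the payload.
            // This is why the total size must be known up front.
            byte[] length = new byte[4];
            BinaryPrimitives.WriteInt32LittleEndian(length, payload.Length);
            ms.Write(length, 0, length.Length);

            ms.WriteByte(subtype);                       // binarySubtype
            ms.Write(payload, 0, payload.Length);        // bytes
            return ms.ToArray();
        }
    }
}
```

For the name "Data" and the payload { 1, 2, 3 }, this produces 14 bytes: 05 44 61 74 61 00 03 00 00 00 00 01 02 03. The length field sits ahead of the payload, so a writer cannot start emitting the element until it knows how many bytes will follow.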

Even if your Stream supports Stream.Length (which not all streams do), the signature for BsonWriter.WriteBytes is:

public abstract void WriteBytes(byte[] bytes);

Support for streams would have to be plumbed from the serializers all the way to the wire - and is made more difficult by the fact that you can’t tell if Stream.Length will succeed until you attempt to read it. Even then, I have seen custom stream implementations either lie about their length or not return all their data on error.

Thus you need to know the number of bytes up front, which typically means reading the entire stream or fully enumerating the IEnumerable<byte> to determine its total length. Copying can therefore be minimized but not eliminated, because the BSON binary data must be prefixed with its 32-bit length.
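
As a rough sketch of what "read the entire stream first" can look like, the helper below buffers a stream into a byte[] whose length is then known. StreamBuffering and ReadAllBytes are hypothetical names, not driver APIs. Stream.Length is used only as a capacity hint on seekable streams and is never trusted as the final size, in line with the caveat above about streams that misreport their length.

```csharp
using System.IO;

// Hypothetical helper for illustration only; not a driver API.
public static class StreamBuffering
{
    public static byte[] ReadAllBytes(Stream source)
    {
        // Use the reported length only to pre-size the buffer.
        int capacity = 0;
        if (source.CanSeek)
        {
            long remaining = source.Length - source.Position;
            if (remaining > 0 && remaining <= int.MaxValue)
            {
                capacity = (int)remaining;
            }
        }

        using (var ms = new MemoryStream(capacity))
        {
            source.CopyTo(ms); // reads from the current position to the end

            // If the hint was exact, hand back the internal buffer and skip one copy.
            byte[] buffer = ms.GetBuffer();
            return ms.Length == buffer.Length ? buffer : ms.ToArray();
        }
    }
}
```

The returned array can be assigned to a byte[] property such as Data. For a 1 or 2 MB payload it will still be a single large array, so this addresses the length requirement but not the large object heap concern.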

Please let me know if you have additional questions.

Sincerely,
James

Paulo_Morgado · April 11, 2024, 10:24pm

Thanks @James_Kovacs.

Controlling the source of the byte[] and pooling it is easy.

But whatever I do to throw a 1MB byte[] into MongoDB, when I ask for it back, the C# driver will allocate a 1MB byte[] and that goes into the large object heap.

I’m experimenting with chunking the data using something like this:

class Document
{
    public ObjectId Id { get; set; }
    public byte[][] Data { get; set; }
}

and using 32 KB chunks to keep driver allocations out of the large object heap.
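
A minimal sketch of that chunking idea, using only the base class library: split an input stream into arrays of at most 32 KB for the byte[][] property, and write them back to an output stream on the way out. ChunkedBytes, ReadChunks and WriteChunks are hypothetical names, not driver APIs. .NET places objects of 85,000 bytes or more on the large object heap, so 32 KB chunks stay well under that threshold.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; not a driver API.
public static class ChunkedBytes
{
    public const int ChunkSize = 32 * 1024;

    // Splits the stream into arrays of at most chunkSize bytes each.
    public static byte[][] ReadChunks(Stream source, int chunkSize = ChunkSize)
    {
        var chunks = new List<byte[]>();
        var buffer = new byte[chunkSize]; // scratch buffer, reused for every chunk

        while (true)
        {
            // Stream.Read may return fewer bytes than requested, so keep
            // reading until the scratch buffer is full or the stream ends.
            int filled = 0;
            while (filled < chunkSize)
            {
                int read = source.Read(buffer, filled, chunkSize - filled);
                if (read == 0) break;
                filled += read;
            }

            if (filled == 0) break;

            var chunk = new byte[filled]; // exact-size copy; only the last chunk is short
            Buffer.BlockCopy(buffer, 0, chunk, 0, filled);
            chunks.Add(chunk);

            if (filled < chunkSize) break; // end of stream reached
        }

        return chunks.ToArray();
    }

    // Writes the chunks back out in order, without joining them first.
    public static void WriteChunks(byte[][] chunks, Stream destination)
    {
        foreach (byte[] chunk in chunks)
        {
            destination.Write(chunk, 0, chunk.Length);
        }
    }
}
```

With this shape, a 100,000-byte input becomes four chunks (three of 32,768 bytes and one of 1,696), and the result of ReadChunks can be assigned directly to Data. Whether the driver's own serialization buffers stay off the large object heap is a separate question this sketch does not answer.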

Cheers,
Paulo