C# Reading invalid UTF-8 strings

CodeMyst_N_A · February 3, 2025, 9:53pm

I have a MongoDB database that was created by a program written in the D programming language. I did something stupid and stored binary data as a string: I cast a byte array to a string without any encoding and stored that in MongoDB.
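To make the mistake concrete, here is a minimal Python sketch (not the original D code) of what ends up on disk. It hand-encodes a BSON string element following the layout in the BSON spec and never checks that the payload is UTF-8. The field name `data` and the sample bytes are made up for illustration.

```python
import struct

def bson_string_element(name: str, payload: bytes) -> bytes:
    """Encode one BSON string element (type 0x02) without validating the payload."""
    # Layout: type byte, NUL-terminated key, int32 length (payload + trailing NUL),
    # payload bytes, NUL terminator.
    return (
        b"\x02"
        + name.encode("utf-8") + b"\x00"
        + struct.pack("<i", len(payload) + 1)
        + payload + b"\x00"
    )

def bson_document(elements: bytes) -> bytes:
    """Wrap encoded elements in a document: int32 total size, elements, NUL."""
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical field name and sample bytes, for illustration only.
raw = bytes([0xE2, 0x28, 0xA1, 0xFF])
doc = bson_document(bson_string_element("data", raw))
```

The resulting document is structurally well formed, but the string payload is not valid UTF-8, which is what a validating driver trips over when it reads the field back.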

Right now I’m running into issues since I want to migrate the database to a proper format using C#.

I have tried reading the data in a bunch of ways: getting the typed collection, getting the raw BsonDocument collection, and using a BsonBinaryReader.

Nothing worked, and I ended up getting errors like this:

Unable to translate bytes [E2] at index 0 from specified code page to Unicode.
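As far as I can tell, this is a strict UTF-8 decoder rejecting the first byte of the value. A small Python sketch of the same kind of failure (this is not the C# driver code, and the sample bytes are invented):

```python
def first_invalid_utf8_index(data: bytes) -> int:
    """Return the index of the first byte a strict UTF-8 decoder rejects, or -1."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        return exc.start
    return -1

# 0xE2 announces a three-byte sequence, but 0x28 is not a continuation byte,
# so decoding fails at index 0.
sample = bytes([0xE2, 0x28, 0xA1])
bad_index = first_invalid_utf8_index(sample)
```

Here `bad_index` is 0, matching the "bytes [E2] at index 0" in the message above.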

I also tried typing the field as a BsonValue instead of a string, but that also failed.

I also tried the JavaScript driver, since I know it has a disableUtf8Validation option. That did read the data, but not well enough: the bytes it returned are not exactly the original ones (I tried a bunch of things here to get it to output the same bytes).
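My guess at why the bytes differ, sketched in Python: I am assuming that a lenient decode replaces every invalid sequence with U+FFFD, so the original bytes cannot be recovered from the resulting string. I have not confirmed that this is what the JavaScript driver does internally, and the sample bytes are invented.

```python
def lossy_roundtrip(data: bytes) -> bytes:
    """Decode leniently, then re-encode, the way a non-validating reader might."""
    # errors="replace" substitutes U+FFFD for each undecodable sequence.
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")

original = bytes([0xE2, 0x28, 0xA1])
recovered = lossy_roundtrip(original)
# recovered is EF BF BD 28 EF BF BD: both invalid bytes became U+FFFD (EF BF BD),
# so the original 0xE2 and 0xA1 are gone.
```

Once the replacement has happened, different invalid inputs collapse to the same output, which would explain why no amount of post-processing gave me the same bytes back.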

Does anyone have any clue what I might be able to try next? The only solution I have left is to “preprocess” the database with the same driver that was used to create it in the first place, and only then use it from C#. That would be very annoying to do, however.