Data corruption and large fields


I am trying to reproduce a very weird problem that one customer reported. I am not sure whether this is the correct category, but you will tell me, I guess.

My customer uses Mongo 4.2.19 on Windows.

My application is a headless CMS based on CQRS pattern and the events are stored in a MongoDB collection. Basically the schema for events looks like this:

  "streamName": "...",
  "bodyJson": "..."

As you can see, I am not storing the events directly as BSON documents. I serialize them to JSON first and store them as strings, because not all of my JSON objects are valid BSON objects.
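To illustrate the idea (a minimal Python sketch rather than my actual C# code; the event content and stream name are made up, only the field names match the schema above):

```python
import json

# The event payload may contain keys that are not valid BSON field
# names (e.g. keys containing dots or starting with "$"), so the
# event is serialized to a JSON string and stored as an opaque field.
event = {"$type": "ContentCreated", "data.field": 42}

doc = {
    "streamName": "contents-123",   # hypothetical stream name
    "bodyJson": json.dumps(event),  # event stored as a plain string
}

# Round-trip: the stored string deserializes back to the original object.
restored = json.loads(doc["bodyJson"])
assert restored == event
```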

The problem is that sometimes the JSON cannot be deserialized anymore because it is corrupt. The differences are small: misplaced colons and the like.

The customer has sent me an example to reproduce it, which does not work on my machine. In this example he creates a large document (1 MB) with array properties in the first run. In each subsequent run he fetches all documents, clones some of the array fields, and makes an update. So the document structure does not change, but the documents become bigger over time. After a few runs he starts to see the deserialization problem.

If there were a bug in the JSON serializer, it should have surfaced on the first run already. It is a very popular serializer for C#, so I doubt there is a general bug. So it seems to me that something goes wrong in the network stack or on the MongoDB side. Perhaps something with compression? Is there a threshold after which fields are compressed? My documents have only one large field, which grows over time.

I know it is unlikely that something like this is in the MongoDB code, but I have no more ideas right now.

I analyzed an example and found this:

 "PredictionProbability": 0.9197f6ceee565a4"}],"PassageType"

It is very interesting: when you compare it with a non-corrupt version, it looks like part of the string was simply not written to the database.


So it seems to me there are at least these places to look at:

  1. The C# JSON serializer. I have no idea how the serializer could produce something like this.
  2. The C# MongoDB driver. Perhaps something goes wrong when writing the string to a buffer?
  3. The network. Perhaps packets get lost, but if I understand correctly, MongoDB uses a TCP connection, and TCP would retransmit lost packets or close the connection if it cannot.
  4. MongoDB itself, but this would be a very severe bug that should have been detected earlier.
  5. A disk error. I have no idea what MongoDB would do in this case, but why would it only happen with large strings?
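One way I could try to narrow down this list is to store a hash of the JSON next to it at write time and verify it at read time: if the hash no longer matches on read, the corruption happened after serialization (driver, network, server, or disk); if the hash matches but the JSON is corrupt, the serializer itself produced the bad output. A Python sketch of the idea (the `bodyHash` field and the sample values are made up for illustration):

```python
import hashlib
import json

def wrap_body(body_json: str) -> dict:
    """Attach a checksum before handing the string to the driver."""
    return {
        "bodyJson": body_json,
        "bodyHash": hashlib.sha256(body_json.encode("utf-8")).hexdigest(),
    }

def verify_body(doc: dict) -> bool:
    """Recompute the checksum after reading the document back."""
    actual = hashlib.sha256(doc["bodyJson"].encode("utf-8")).hexdigest()
    return actual == doc["bodyHash"]

doc = wrap_body(json.dumps({"PredictionProbability": 0.9197}))
assert verify_body(doc)      # intact round-trip

# Simulate corruption between write and read: the checksum mismatch
# would prove the string changed after serialization.
doc["bodyJson"] = doc["bodyJson"].replace("0.9197", "0.9197f6c")
assert not verify_body(doc)
```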

Hi @Sebastian_Stehle

This is curious indeed. I tend to think that all the libraries you mentioned would be well-tested against this type of corruption issue.

I would suggest you do step-by-step testing to determine the point of failure, e.g. at every step, verify that the output of the relevant library is what you expect, without corruption.
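For example, the per-step checks could look like this (an illustrative Python sketch; substitute your C# equivalents, and replace the in-memory dict with a real insert/find against the deployment):

```python
import json

original = {"items": ["a"] * 1000}  # stand-in for a large event payload

# Step 1: the serializer's output should already deserialize cleanly.
body = json.dumps(original)
assert json.loads(body) == original

# Step 2: driver/server round-trip, simulated here with a plain dict
# acting as the collection; against a real deployment this would be
# an insert followed by a read of the same document.
fake_collection = {}
fake_collection["doc1"] = {"bodyJson": body}
fetched = fake_collection["doc1"]["bodyJson"]

# Step 3: what was read back should be byte-identical to what was
# written, and still deserialize to the original object.
assert fetched == body
assert json.loads(fetched) == original
```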

However, I think it's likely that the error occurs in step 1, step 2, or the glue between them. You also mentioned that a customer of yours reported this. Is it possible that something is missing on their end? At this point there are too many variables at play, and it's difficult to determine exactly what happened.

It would be helpful to narrow the scope of the issue, e.g. with a set of inputs that reproduces it.

Best regards

Thanks for the answer. I tried that, and I cannot reproduce it on my machine. The customer also sent me a sample to reproduce it, which didn't help either, and when we tested it via screen sharing on his machine, he could not reproduce it anymore.

Is there a retry mechanism in the MongoDB driver? Let's say a packet gets lost (no idea how this could happen with TCP), the server detects an invalid request, and the driver retries; then MongoDB could insert invalid content if the corrupted payload still parses as a valid document.

Yes, this feature is called retryable writes, and it has been around since MongoDB 3.6.
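For reference, retryable writes are controlled per connection and can be toggled via the connection string (illustrative URI, hostname is a placeholder):

```
mongodb://db.example.com:27017/?retryWrites=true
```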

However, I don't believe this would be the cause of the corruption issue your client is seeing. The feature is fully specced and exhaustively tested.

Since the issue is not reproducible, it's really hard to say what's causing it, and how. However, I would look at TCP issues last, since the protocol has been around for a very long time. It's very, very unlikely this kind of corruption would be caused by TCP, especially since it has been used to transfer gigantic amounts of data reliably every day since the 70s. Ditto with disk errors: if there is any disk corruption, WiredTiger would know about it (we've seen many of these instances 🙂).

Having said that, please do update us on the situation once a reliable method of reproducing the corruption is found. I’ll be happy to try to assist in that case.

Best regards


This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.