Is it possible that MongoDB doesn't write to disk/journal at all under high load?

A couple of days ago, one of the servers hosting our MongoDB containers experienced an incredibly high load (more than 200x the usual average; we’re still investigating where it came from). However, all our apps relying on the databases seemed to behave normally and the general metrics of the server were still fine (it was responsive to requests, total CPU was < 20%, total memory < 50%, total disk usage < 30%, no active swapping), so we decided to wait a while before restarting the whole server to bring it back into a stable state (we had to restart, as we no longer had SSH or console access at that point due to the high load, but that is a different issue).

Again, during the incident all our apps behaved normally (nodejs + mongoose) and claimed that data had been written to the database and could also be retrieved normally.

The interesting (as in “wtf”) thing happened after we restarted the whole server (once there was no more user interaction), regained control and restarted the containers. Everything seemed to be back to normal. The only problem: all the data that should have been written (according to the apps) was gone.

So, my question:

Is it possible that MongoDB keeps data in memory only, without even writing to the journal, for a long time (1h+) when there are problems like an extremely high load?

For me it simply doesn’t make any sense that the apps were behaving normally, users could interact with it as they usually do, were able to “save” data to the database and retrieve it later on (no client-side caching involved), but the data was not persisted to the database at all. We were also able to confirm that there had been no overwriting of data on our side.

We’re using an Ubuntu 18.04 server that hosts vanilla mongo 4.0.7 instances (journaling enabled, no replica sets). The client is Node.js + mongoose.

Have you ruled out a filesystem / hard drive error? For example, if the server has a RAID setup with a write cache, writes can appear from MongoDB’s perspective to have been written to disk while they are actually still sitting in the write cache. Those caches usually have a backup to prevent data loss on a server crash, but they do fail sometimes.
As you mention containers, the problem could also be in the container layer, in a similar fashion: the container filesystem thinks the write completed properly, but it was never actually written to persistent storage.

I haven’t experienced this kind of situation with MongoDB, but it has happened to us in the past with other systems. We remedied the situation somewhat by disabling the write caches on the servers, so that problems would surface immediately instead of only being noticed after a server reboot.


Check the driver’s write concern. Just because journalling is enabled on the server does not mean the driver is asking the server to journal a write, or even to acknowledge it.

{w:0} or {w:1, j:false} for example may not persist to disk immediately.
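To make the difference concrete, here is a rough sketch of what a write concern asks a standalone mongod (no replica set, as in the question) to confirm before the driver reports success. This is based on my reading of the MongoDB write concern documentation; `describeAck` is a made-up helper for illustration, not part of any driver.

```javascript
// Hypothetical helper (not a driver API): describes what a standalone mongod
// confirms before a write is reported as successful, per the write concern docs.
function describeAck({ w = 1, j } = {}) {
  // j:true takes precedence, even when combined with w:0.
  if (j === true) {
    return "acknowledged after the write reaches the on-disk journal";
  }
  // w:0 without j:true is fire-and-forget.
  if (w === 0) {
    return "not acknowledged at all";
  }
  // w:1 with j unspecified or false: success only means "applied in memory".
  return "acknowledged once applied in memory";
}

console.log(describeAck({ w: 0 }));
console.log(describeAck({ w: 1, j: false }));
console.log(describeAck({ w: 1, j: true }));
```

Note that even in the in-memory case, WiredTiger normally flushes the journal about every 100 ms (`storage.journal.commitIntervalMs`), so on healthy storage the window for loss should be far smaller than an hour.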

If you’re really concerned about persistence, use a replica set and w:majority.

Thank you both for your help!

@kerbe Currently, we cannot rule out a hard drive problem, and the few pointers I have seem to point in that direction. I still don’t know whether it’s the disk itself, though.

@chris
I looked into that and the default write concern is {w:1} (we haven’t had the need to tune the write concern manually yet, so it’s not defined anywhere in the code), but I couldn’t really find out what the default value for the journal (j) parameter is. Is there some documentation on this somewhere?

It should be indicated in the driver API. I think most of them default to w:1, j:true.

@chris Thank you once again for your response. I am currently investigating this, but I believe mongoose does not set a default write concern unless it’s a bulk write. It’s possible to set one explicitly, but otherwise I believe the MongoDB defaults are used.
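For anyone who wants to set it explicitly rather than rely on defaults, one option is to put the write concern into the connection string. A minimal sketch: `w`, `journal` and `wtimeoutMS` are standard MongoDB connection string options, while `withWriteConcern`, the host and the `appdb` database name are made up for illustration. It assumes the URI already contains a database path.

```javascript
// Hypothetical helper: appends explicit write-concern options to a MongoDB URI.
// Assumes the URI already has a "/database" part (required before "?options").
function withWriteConcern(uri, { w = 1, journal = true, wtimeoutMS } = {}) {
  const params = new URLSearchParams();
  params.set("w", String(w));
  params.set("journal", String(journal));
  if (wtimeoutMS !== undefined) {
    params.set("wtimeoutMS", String(wtimeoutMS));
  }
  // Append to existing options if there are any, otherwise start the query part.
  const separator = uri.includes("?") ? "&" : "?";
  return uri + separator + params.toString();
}

const uri = withWriteConcern("mongodb://localhost:27017/appdb", {
  w: 1,
  journal: true,
  wtimeoutMS: 5000,
});
console.log(uri);
// prints "mongodb://localhost:27017/appdb?w=1&journal=true&wtimeoutMS=5000"
```

The resulting string can then be passed to `mongoose.connect(uri)`. With `journal=true`, a write is only reported as successful once it has reached the journal, although that still cannot help if a layer below MongoDB falsely reports the disk write as done.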

Also, @kerbe, it turns out that our cloud provider had an issue at the hypervisor level, which resulted in the high load and the weird behaviour we experienced.

‘It is possible to commit no mistakes and still lose. That is not weakness, that is life.’

Did they also confirm that this caused the data loss, or that I/O was not behaving correctly, or is it still a mystery why the data wasn’t there after the restart?