Mongod doesn’t have control over the WiredTigerHS.wt history store

Hi MongoDB Community,

I hope this message finds you well. I wanted to raise awareness of a critical issue we’ve been experiencing in our PSA (Primary-Secondary-Arbiter) architecture, particularly when one of the secondary nodes goes down.

The problem lies in the WiredTigerHS.wt file, which grows excessively until it occupies the entire disk and causes outages. I’ve raised a MongoDB JIRA SERVER ticket (https://jira.mongodb.org/browse/SERVER-84108) to address this issue. However, the MongoDB team is currently facing challenges in allocating resources to work on it.

To help expedite the resolution, I’d like to provide a straightforward set of steps to reproduce the problem:

1. Set up a PSA replica set.
2. Create approximately 20,000 records with a reasonable payload.
3. Bring down one of the secondary nodes.
4. Perform a bulk update of the 20,000 records, as demonstrated below:
   db.session.updateMany({}, { $set: { status: "Modified" }})
   db.session.updateMany({}, { $set: { status2: "Modified" }})
5. Repeat the update process a few times.
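
For anyone who wants to script the steps above, here is a minimal sketch for the mongo shell. The helper names (`makeSessionDocs`, `runUpdateRounds`), the 1 KB payload and the number of rounds are assumptions for illustration, not part of the original report. The sketch also varies the value written in each round so that every pass actually rewrites every document; the original steps reuse the same value.

```javascript
// Build n documents with a filler payload (the payload size is an assumption).
function makeSessionDocs(n, payloadBytes) {
  const filler = "x".repeat(payloadBytes);
  const docs = [];
  for (let i = 0; i < n; i++) {
    docs.push({ _id: i, status: "New", status2: "New", payload: filler });
  }
  return docs;
}

// Run the two updateMany calls from the steps `rounds` times.
// `coll` is any object exposing updateMany(filter, update), e.g. db.session.
function runUpdateRounds(coll, rounds) {
  const results = [];
  for (let r = 1; r <= rounds; r++) {
    results.push(coll.updateMany({}, { $set: { status: "Modified-" + r } }));
    results.push(coll.updateMany({}, { $set: { status2: "Modified-" + r } }));
  }
  return results;
}

// Example use on the primary, after the secondary has been stopped:
//   db.session.insertMany(makeSessionDocs(20000, 1024));
//   runUpdateRounds(db.session, 5);
```

Checking the size of WiredTigerHS.wt in the dbPath between rounds should show whether the growth described here reproduces in your environment.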

Upon observation, you’ll notice that the WiredTigerHS.wt file grows significantly with each update.

After going through the MongoDB source code, I found the WiredTiger engine runtime config parameter history_store.file_max, which sets the maximum size of the history store file. However, when the file size exceeds this value, WiredTiger triggers a PANIC and mongod restarts. Consequently, there is no effective control mechanism to prevent the disk consumption problem.

  mongo host1:27717 --eval 'db.adminCommand( { "setParameter": 1, "wiredTigerEngineRuntimeConfig": "history_store=(file_max=104857600)"})'
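
The value 104857600 in the command above is 100 MiB expressed in bytes. As a small sketch, the config string can be built from a size in MiB; the helper names `historyStoreConfig` and `exceedsCap` are hypothetical and only illustrate the arithmetic.

```javascript
// Build the runtime config string for a history store cap given in MiB.
function historyStoreConfig(maxMiB) {
  if (!Number.isInteger(maxMiB) || maxMiB <= 0) {
    throw new Error("maxMiB must be a positive integer");
  }
  const bytes = maxMiB * 1024 * 1024;
  return "history_store=(file_max=" + bytes + ")";
}

// True when a file size in bytes is above the cap. Crossing the cap is what
// triggers the WT_PANIC described in this post; it does not throttle growth.
function exceedsCap(fileSizeBytes, maxMiB) {
  return fileSizeBytes > maxMiB * 1024 * 1024;
}

// Example use in the shell:
//   db.adminCommand({ setParameter: 1,
//                     wiredTigerEngineRuntimeConfig: historyStoreConfig(100) });
```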

I’ve sought assistance through the JIRA ticket, but due to the lack of response from the Mongo Team, I’m reaching out to the community for additional insights or potential solutions. It’s crucial to highlight that the minSnapshotHistoryWindowInSeconds parameter also doesn’t seem to make any difference.

Any guidance or assistance regarding this issue would be greatly appreciated.

Thanks,
Venkataraman

{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"-",        "id":23089,   "ctx":"thread61450","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":574}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"-",        "id":23090,   "ctx":"thread61450","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"thread61451","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1707172787:938206][3050376:0x7fe300ded700], file:collection-0--3783294461088769059.wt, eviction-server: __wt_hs_insert_updates, 804: WiredTigerHS: file size of 106291200 exceeds maximum size 104857600: WT_PANIC: WiredTiger library panic"}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"CONTROL",  "id":6384300, "ctx":"thread61450","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"thread61452","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1707172787:938428][3050376:0x7fe3005ec700], file:collection-0--3783294461088769059.wt, eviction-server: __wt_hs_insert_updates, 804: WiredTigerHS: file size of 106291200 exceeds maximum size 104857600: WT_PANIC: WiredTiger library panic"}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"-",        "id":23089,   "ctx":"thread61452","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":574}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"-",        "id":23090,   "ctx":"thread61452","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"-",        "id":23089,   "ctx":"thread61451","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":574}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F",  "c":"-",        "id":23090,   "ctx":"thread61451","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2024-02-05T22:39:48.056+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"sessionmgr03:27717"}}
{"t":{"$date":"2024-02-05T22:39:48.168+00:00"},"s":"I",  "c":"CONTROL",  "id":31380,   "ctx":"thread61450","msg":"BACKTRACE","attr":{"bt":{"backtrace":[{"a":"5623FD042365","b":"5623F90C6000","o":"3F7C365","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.361","s+":"215"},{"a":"5623FD044DE9","b":"5623F90C6000","o":"3F7EDE9","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"5623FD03D206","b":"5623F90C6000","o":"3F77206","s":"abruptQuit","s+":"66"},{"a":"7FE308F21D10","b":"7FE308F0F000","o":"12D10","s":"funlockfile","s+":"50"},{"a":"7FE308B98ACF","b":"7FE308B4A000","o":"4EACF","s":"gsignal","s+":"10F"},{"a":"7FE308B6BEA5","b":"7FE308B4A000","o":"21EA5","s":"abort","s+":"127"},{"a":"5623FA4DAAB9","b":"5623F90C6000","o":"1414AB9","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","s+":"F6"},{"a":"5623F9FB2388","b":"5623F90C6000","o":"EEC388","s":"_ZN5mongo12_GLOBAL__N_141mdb_handle_error_with_startup_suppressionEP18__wt_event_handlerP12__wt_sessioniPKc.cold.1149","s+":"16"},{"a":"5623FA7EF083","b":"5623F90C6000","o":"1729083","s":"__eventv","s+":"403"},{"a":"5623F9FC49CD","b":"5623F90C6000","o":"EFE9CD","s":"__wt_panic_func","s+":"BB"},{"a":"5623F9FD0586","b":"5623F90C6000","o":"F0A586","s":"__wt_hs_insert_updates.cold.11","s+":"55"},{"a":"5623FA7CE218","b":"5623F90C6000","o":"1708218","s":"__rec_write_wrapup","s+":"398"},{"a":"5623FA7CFACA","b":"5623F90C6000","o":"1709ACA","s":"__wt_reconcile","s+":"6DA"},{"a":"5623FA79CFC5","b":"5623F90C6000","o":"16D6FC5","s":"__wt_evict","s+":"1935"},{"a":"5623FA793762","b":"5623F90C6000","o":"16CD762","s":"__evict_page","s+":"6A2"},{"a":"5623FA794028","b":"5623F90C6000","o":"16CE028","s":"__evict_lru_pages","s+":"78"},{"a":"5623FA798E14","b":"5623F90C6000","o":"16D2E14","s":"__wt_evict_thread_run","s+":"74"},{"a":"5623FA7FFE09","b":"5623F90C6000","o":"1739E09","s":"__thread_run","s+":"39"},{"a":"7FE308F171CA","b":"7FE308F0F000","o":"81CA","s":"start_thread","s+":"EA"
},{"a":"7FE308B83E73","b":"7FE308B4A000","o":"39E73","s":"clone","s+":"43"}],"processInfo":{"mongodbVersion":"5.0.20","gitVersion":"2cd626d8148120319d7dca5824e760fe220cb0de","compiledModules":[],"uname":{"sysname":"Linux","release":"4.18.0-477.27.1.el8_8.x86_64","version":"#1 SMP Thu Sep 21 06:49:25 EDT 2023","machine":"x86_64"},"somap":[{"b":"5623F90C6000","elfType":3,"buildId":"A8EA7166EFC23E0D3802F8AFEFEF2186CF5E5BBD"},{"b":"7FE308F0F000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"76F163FDBAA9E91050B456A7E5EA8AC78563BD29"},{"b":"7FE308B4A000","path":"/lib64/libc.so.6","elfType":3,"buildId":"44ED73CF68E8FA608DA3B301146C81A0A77A5619"}]}}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FD042365","b":"5623F90C6000","o":"3F7C365","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.361","s+":"215"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FD044DE9","b":"5623F90C6000","o":"3F7EDE9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FD03D206","b":"5623F90C6000","o":"3F77206","s":"abruptQuit","s+":"66"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308F21D10","b":"7FE308F0F000","o":"12D10","s":"funlockfile","s+":"50"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308B98ACF","b":"7FE308B4A000","o":"4EACF","s":"gsignal","s+":"10F"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308B6BEA5","b":"7FE308B4A000","o":"21EA5","s":"abort","s+":"127"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA4DAAB9","b":"5623F90C6000","o":"1414AB9","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","s+":"F6"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623F9FB2388","b":"5623F90C6000","o":"EEC388","s":"_ZN5mongo12_GLOBAL__N_141mdb_handle_error_with_startup_suppressionEP18__wt_event_handlerP12__wt_sessioniPKc.cold.1149","s+":"16"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7EF083","b":"5623F90C6000","o":"1729083","s":"__eventv","s+":"403"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623F9FC49CD","b":"5623F90C6000","o":"EFE9CD","s":"__wt_panic_func","s+":"BB"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623F9FD0586","b":"5623F90C6000","o":"F0A586","s":"__wt_hs_insert_updates.cold.11","s+":"55"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7CE218","b":"5623F90C6000","o":"1708218","s":"__rec_write_wrapup","s+":"398"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7CFACA","b":"5623F90C6000","o":"1709ACA","s":"__wt_reconcile","s+":"6DA"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA79CFC5","b":"5623F90C6000","o":"16D6FC5","s":"__wt_evict","s+":"1935"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA793762","b":"5623F90C6000","o":"16CD762","s":"__evict_page","s+":"6A2"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA794028","b":"5623F90C6000","o":"16CE028","s":"__evict_lru_pages","s+":"78"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA798E14","b":"5623F90C6000","o":"16D2E14","s":"__wt_evict_thread_run","s+":"74"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7FFE09","b":"5623F90C6000","o":"1739E09","s":"__thread_run","s+":"39"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308F171CA","b":"7FE308F0F000","o":"81CA","s":"start_thread","s+":"EA"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308B83E73","b":"7FE308B4A000","o":"39E73","s":"clone","s+":"43"}}}

{"t":{"$date":"2024-02-05T22:40:01.827+00:00"},"s":"I",  "c":"CONTROL",  "id":20698,   "ctx":"-","msg":"***** SERVER RESTARTED *****"}
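
For anyone triaging similar logs, here is a rough sketch that pulls the history store panic out of a structured mongod log line and reports by how much the cap was exceeded. The function name `parseHistoryStorePanic` is hypothetical, and the regular expression is based only on the message text shown in the log above, so it may not match other versions.

```javascript
// Return details of a "WiredTigerHS ... exceeds maximum size" panic found in
// one structured log line, or null if the line is something else.
function parseHistoryStorePanic(line) {
  let entry;
  try {
    entry = JSON.parse(line);
  } catch (e) {
    return null; // not a JSON log line
  }
  const message = entry && entry.attr && entry.attr.message;
  if (typeof message !== "string") {
    return null;
  }
  const match = message.match(
    /WiredTigerHS: file size of (\d+) exceeds maximum size (\d+)/
  );
  if (!match) {
    return null;
  }
  const fileSize = Number(match[1]);
  const maxSize = Number(match[2]);
  return {
    timestamp: entry.t ? entry.t.$date : undefined,
    fileSize: fileSize,
    maxSize: maxSize,
    overBy: fileSize - maxSize,
  };
}
```

Applied to the first "WiredTiger error" entry above, this reports a file size of 106291200 bytes against a cap of 104857600, that is 1433600 bytes over.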

@kevinadi @chris would you be able to help with this, please?

Best advice is: Don’t use arbiters.

Likely this is due to the majority commit point, but I have not had time to look at this properly.


Thanks, @Chris. But removing the arbiter won’t solve the problem. Also, we are not using readConcern majority, and I changed the server’s defaultWriteConcern to w:1.

As for the MongoDB workaround of setting priority=0 and votes=0 when a secondary member is lagging or unavailable, it doesn’t look like a native solution. It also violates MongoDB’s fault tolerance claim: in a 5-member replica set, 2 members can go down and MongoDB should still run healthily. But since 5.0, that’s no longer the case.
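
For reference, the workaround and the fault tolerance arithmetic discussed here could be sketched as below. The helper names `demoteMember` and `faultTolerance` are hypothetical, and the host name in the usage comment is only taken from the log earlier in the thread as an example. `rs.conf()` and `rs.reconfig()` are the standard shell helpers.

```javascript
// Return a copy of a replica set config in which the member with the given
// host has votes: 0 and priority: 0. The input config is not modified.
function demoteMember(config, host) {
  let found = false;
  const members = config.members.map(function (m) {
    if (m.host !== host) {
      return m;
    }
    found = true;
    return Object.assign({}, m, { votes: 0, priority: 0 });
  });
  if (!found) {
    throw new Error("no member with host " + host);
  }
  return Object.assign({}, config, { members: members });
}

// Number of voting members that can be lost while a majority can still
// elect a primary: n minus the majority, floor(n / 2) + 1.
function faultTolerance(votingMembers) {
  const majority = Math.floor(votingMembers / 2) + 1;
  return votingMembers - majority;
}

// Example use on the primary (host name is illustrative):
//   rs.reconfig(demoteMember(rs.conf(), "sessionmgr03:27717"));
```

With 5 voting members the majority is 3, so `faultTolerance(5)` is 2, which is the figure quoted above. That number only covers elections; it says nothing about how the history store behaves while a data-bearing member is down.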