The WiredTiger history_store.file_max configuration option can cause complete replica set unavailability in distributed deployments. When the history store file exceeds the configured limit, MongoDB immediately panics and terminates, potentially causing loss of quorum and complete service outage with no automatic recovery mechanism.
This is an extension of the issue I reported earlier, "Mongod doesnt have control over WiredTigerHS.wt History Store - #4 by venkataraman_r", which was closed by MongoDB with SERVER ticket https://jira.mongodb.org/browse/SERVER-84108 on the grounds that 5.0 was EOL. But this issue exists in all versions that support the history store (HS).
Bug Type
- Severity: Critical
- Priority: High
- Category: Storage Engine / Replica Sets
- Component: WiredTiger History Store
Environment
- MongoDB Version: All versions with WiredTiger history store support (tested in 7.0 as well)
- Storage Engine: WiredTiger
- Deployment: Replica Set
Problem Description
Current Behavior
- When history_store.file_max is configured and exceeded, MongoDB immediately panics with WT_PANIC
- The panic causes immediate process termination via fassert()
- No graceful degradation, warnings, or recovery options are available
- After restart, the oversized history store file persists, causing immediate re-panic on first write operation
- This creates an infinite restart loop until manual intervention
Critical Failure Scenario
In a 5-member replica set across 3 sites (2+2+1 arbiter):
- One site (2 members) goes down
- Remaining primary handles increased load → history store grows
- History store exceeds file_max → primary panics and shuts down
- Loss of primary + previous site failure = no quorum
- Entire replica set becomes unavailable
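The quorum arithmetic behind this scenario can be checked with a minimal sketch. It assumes all five members (including the arbiter) hold one vote each, and that a primary needs a strict majority of the configured voters; the function names are illustrative only.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect or keep a primary: a strict majority of configured voters."""
    return voting_members // 2 + 1


def has_quorum(voting_members: int, reachable: int) -> bool:
    """True if the reachable voters can still form a majority."""
    return reachable >= majority(voting_members)


# Scenario from this report: 5 voting members laid out as 2 + 2 + 1 arbiter.
TOTAL = 5
after_site_loss = TOTAL - 2        # one 2-member site is down: 3 voters reachable
after_panic = after_site_loss - 1  # the surviving primary panics: 2 voters reachable

print(has_quorum(TOTAL, after_site_loss))  # True: 3 of 5 is still a majority
print(has_quorum(TOTAL, after_panic))      # False: 2 of 5 cannot elect a primary
```

After the site failure the set has no spare vote left, so the single panic on the primary is enough to take the whole set down.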
Root Cause Analysis
Design Flaws
- No graceful degradation: Immediate panic instead of warnings or throttling
- No startup validation: Size check only occurs during write operations, not at startup
- No automatic cleanup: No mechanism to reduce history store size during emergencies
- Poor failure isolation: Storage limit can cause replica set quorum loss
Code References
// src/third_party/wiredtiger/src/history/hs_rec.c:766
if ((uint64_t)hs_size > max_hs_size)
    WT_ERR_PANIC(session, WT_PANIC,
      "WiredTigerHS: file size of %" PRIu64 " exceeds maximum size %" PRIu64,
      (uint64_t)hs_size, max_hs_size);
// src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp:191
fassert(28559, retCode != WT_PANIC || storageGlobalParams.repair);
Expected Behavior
Immediate Fixes Needed
- Configurable panic behavior: history_store=(file_max=10GB, on_limit=warn|throttle|panic)
- Graceful degradation options:
  - warn: Log warnings but continue operations
  - throttle: Slow writes, reject long-running reads
  - cleanup: Auto-truncate oldest history entries
- Startup validation: Check file size during startup and provide recovery options
- Emergency recovery mode: Allow startup with temporary limit override
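To make the requested behavior concrete, here is a rough Python model of how an on_limit setting could change the outcome of the size check. Note that on_limit is the option proposed in this report and does not exist in WiredTiger today; OnLimit, HistoryStoreLimitExceeded and check_history_store are hypothetical names, and the real change would live in the C code referenced above.

```python
from enum import Enum


class OnLimit(Enum):
    """Hypothetical values for the proposed on_limit option."""
    WARN = "warn"
    THROTTLE = "throttle"
    PANIC = "panic"


class HistoryStoreLimitExceeded(Exception):
    """Stands in for WT_PANIC in this model."""


def check_history_store(hs_size: int, file_max: int, on_limit: OnLimit) -> str:
    """Return the action the write path would take for the given sizes.

    file_max == 0 is treated as unbounded. The comparison mirrors the
    existing check (hs_size > max_hs_size).
    """
    if file_max == 0 or hs_size <= file_max:
        return "ok"
    if on_limit is OnLimit.WARN:
        return "warn"      # log and keep serving writes
    if on_limit is OnLimit.THROTTLE:
        return "throttle"  # slow writes, reject long-running reads
    # PANIC reproduces the current behavior.
    raise HistoryStoreLimitExceeded(
        f"WiredTigerHS: file size of {hs_size} exceeds maximum size {file_max}"
    )
```

With PANIC as the default, existing deployments would keep the current behavior, while operators of multi-site replica sets could opt into warn or throttle.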
Long-term Improvements
- Proactive monitoring: Built-in metrics and alerting before hitting limits
- Automatic cleanup: Background process to manage history store size
- Better documentation: Clear warnings about replica set availability risks
- Default behavior change: Consider making file_max=0 (unbounded) the recommended production setting
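Until something like this is built in, an external check can approximate the proactive monitoring item. The sketch below assumes WiredTigerHS.wt sits directly in the dbpath and that the operator passes in the same byte limit that was configured as file_max; the 80% and 95% thresholds and the function names are arbitrary choices for illustration.

```python
import os
from typing import Optional


def history_store_usage(dbpath: str, file_max_bytes: int) -> Optional[float]:
    """Fraction of file_max used by WiredTigerHS.wt.

    Returns None when the limit is unbounded (0) or the file is not found.
    """
    hs_file = os.path.join(dbpath, "WiredTigerHS.wt")
    if file_max_bytes <= 0 or not os.path.exists(hs_file):
        return None
    return os.path.getsize(hs_file) / file_max_bytes


def alert_level(usage: Optional[float],
                warn_at: float = 0.8,
                critical_at: float = 0.95) -> str:
    """Map a usage fraction to an alert level (thresholds are assumptions)."""
    if usage is None:
        return "unknown"
    if usage >= critical_at:
        return "critical"
    if usage >= warn_at:
        return "warning"
    return "ok"
```

Running the same check before starting mongod would also cover the startup validation gap: a file already over the limit can be flagged before the first write triggers another panic.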