So far, this issue has only ever happened when the number of documents in the shard is greater than 500,000,000. That could just be a coincidence, but I’ve erased the entire DB and re-imported all the data numerous times, and whenever the failure occurs it seems to take out multiple shards, not just one.
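(If it helps to reproduce the check: the per-shard document counts can be read off the mongo shell’s getShardDistribution() helper, e.g. for the collection that fails below:

mongos> use myapp
mongos> db.comments.getShardDistribution()

)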
I’m running the nodes in Docker.
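Each shard node is started roughly like this; the container name, volume path, and replica set name are simplified placeholders, and the image tag matches the server version shown below:

$ docker run -d --name shard1 \
    -v /data/shard1:/data/db \
    mongo:4.2.8 \
    mongod --shardsvr --replSet rs1 --bind_ip_all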
$ mongod --version
db version v4.2.8
git version: 43d25964249164d76d5e04dd6cf38f6111e21f5f
OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1804
distarch: x86_64
target_arch: x86_64
It occurs on insert and leaves the database in an unrepairable state (running --repair fails with a duplicate key error):
2020-08-31T15:42:29.003+0000 I - [initandlisten] Index Build: inserting keys from external sorter into index: 35096500/835088325 4%
2020-08-31T15:42:33.501+0000 I STORAGE [initandlisten] Index builds manager failed: 70273619-2562-4e0e-880e-6fe21c4397a7: myapp.comments: DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: "2b00042f7481c7b056c4b410d28f33cf" } }: E11000 duplicate key error collection: myapp.comments index: _id_ dup key: { _id: "2b00042f7481c7b056c4b410d28f33cf" }
2020-08-31T15:42:33.501+0000 I STORAGE [initandlisten] Index build failed: 70273619-2562-4e0e-880e-6fe21c4397a7: myapp.comments ( 8e03ebdc-bbcb-491a-aa49-deec30dde3ad ): DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: "2b00042f7481c7b056c4b410d28f33cf" } }: E11000 duplicate key error collection: myapp.comments index: _id_ dup key: { _id: "2b00042f7481c7b056c4b410d28f33cf" }
2020-08-31T15:42:33.501+0000 F - [initandlisten] Fatal assertion 51076 DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: "2b00042f7481c7b056c4b410d28f33cf" } }: E11000 duplicate key error collection: myapp.comments index: _id_ dup key: { _id: "2b00042f7481c7b056c4b410d28f33cf" } at src/mongo/db/index_builds_coordinator.cpp 1035
2020-08-31T15:42:33.501+0000 F - [initandlisten]
***aborting after fassert() failure
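For completeness, the repair attempt above is the standard invocation (the dbpath here is a placeholder for the actual data volume):

$ mongod --repair --dbpath /data/db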
At first I thought the storage itself was causing the issue, but all the SSDs passed full read/write tests. I eventually swapped out the storage anyway just to be sure, and I still hit the same failure.
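(The read/write testing was along the lines of a full destructive badblocks pass plus a SMART check; the device name here is a placeholder:)

$ badblocks -wsv /dev/sdX
$ smartctl -a /dev/sdX

The corruption itself shows up in the logs like this: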
2020-08-31T14:05:13.615+0000 E STORAGE [conn57] WiredTiger error (0) [1598882713:615525][1:0x7f47acb92700], file:collection-19--1551888894885693755.wt, WT_CURSOR.search: __wt_bm_corrupt_dump, 135: {98792763392, 24576, 0x2ad0120a}: (chunk 1 of 24): <snipped> Raw: [1598882713:615525][1:0x7f47acb92700], file:collection-19--1551888894885693755.wt, WT_CURSOR.search: __wt_bm_corrupt_dump, 135: {98792763392, 24576, 0x2ad0120a}: (chunk 1 of 24): <snipped>
2020-08-31T14:05:13.619+0000 E STORAGE [conn57] WiredTiger error (-31802) [1598882713:619635][1:0x7f47acb92700], file:collection-19--1551888894885693755.wt, WT_CURSOR.search: __wt_block_read_off, 292: collection-19--1551888894885693755.wt: fatal read error: WT_ERROR: non-specific WiredTiger error Raw: [1598882713:619635][1:0x7f47acb92700], file:collection-19--1551888894885693755.wt, WT_CURSOR.search: __wt_block_read_off, 292: collection-19--1551888894885693755.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
2020-08-31T14:05:13.619+0000 E STORAGE [conn57] WiredTiger error (-31804) [1598882713:619667][1:0x7f47acb92700], file:collection-19--1551888894885693755.wt, WT_CURSOR.search: __wt_panic, 490: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1598882713:619667][1:0x7f47acb92700], file:collection-19--1551888894885693755.wt, WT_CURSOR.search: __wt_panic, 490: the process must exit and restart: WT_PANIC: WiredTiger library panic
2020-08-31T14:05:13.619+0000 F - [conn57] Fatal Assertion 50853 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 414
2020-08-31T14:05:13.619+0000 F - [conn57]
***aborting after fassert() failure
----- BEGIN BACKTRACE -----
<snipped>
mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55bb10216631]
mongod(+0x28BDC4C) [0x55bb10215c4c]
mongod(+0x28BDCD6) [0x55bb10215cd6]
libpthread.so.0(+0x12890) [0x7f47c8ceb890]
libc.so.6(gsignal+0xC7) [0x7f47c8926e97]
libc.so.6(abort+0x141) [0x7f47c8928801]
mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x55bb0e66c62b]
mongod(+0xA50666) [0x55bb0e3a8666]
mongod(+0xEB7BBB) [0x55bb0e80fbbb]
mongod(__wt_err_func+0x90) [0x55bb0e3b8f28]
mongod(__wt_panic+0x39) [0x55bb0e3b938c]
mongod(+0xA6DC29) [0x55bb0e3c5c29]
mongod(__wt_bm_read+0x13F) [0x55bb0e8ee69f]
mongod(__wt_bt_read+0x92) [0x55bb0e853002]
mongod(+0xF03C43) [0x55bb0e85bc43]
mongod(__wt_page_in_func+0x3DE) [0x55bb0e85ca7e]
mongod(__wt_row_search+0x93D) [0x55bb0e88963d]
mongod(__wt_btcur_search+0x893) [0x55bb0e846253]
mongod(+0xE51955) [0x55bb0e7a9955]
mongod(+0xE0A33D) [0x55bb0e76233d]
mongod(_ZN5mongo31WiredTigerRecordStoreCursorBase9seekExactERKNS_8RecordIdE+0xA6) [0x55bb0e762686]
mongod(_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE+0x90) [0x55bb0f102200]
mongod(_ZN5mongo11IDHackStage6doWorkEPm+0xEA) [0x55bb0f0cf57a]
mongod(_ZN5mongo9PlanStage4workEPm+0x68) [0x55bb0f0dd668]
mongod(_ZN5mongo11UpdateStage6doWorkEPm+0x3DB) [0x55bb0f0ff05b]
mongod(_ZN5mongo11UpsertStage6doWorkEPm+0x7B) [0x55bb0f100feb]
mongod(_ZN5mongo9PlanStage4workEPm+0x68) [0x55bb0f0dd668]
mongod(_ZN5mongo16PlanExecutorImpl12_getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0x230) [0x55bb0f124b90]
mongod(_ZN5mongo16PlanExecutorImpl7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x4D) [0x55bb0f1252ed]
mongod(_ZN5mongo16PlanExecutorImpl11executePlanEv+0x5D) [0x55bb0f1255ad]
mongod(_ZN5mongo14performUpdatesEPNS_16OperationContextERKNS_9write_ops6UpdateE+0xCD8) [0x55bb0ee2e228]
mongod(+0x14CA456) [0x55bb0ee22456]
mongod(+0x14C8C5D) [0x55bb0ee20c5d]
mongod(+0x118C3E8) [0x55bb0eae43e8]
mongod(+0x118E654) [0x55bb0eae6654]
mongod(_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE+0x41A) [0x55bb0eae73ea]
mongod(_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x3C) [0x55bb0ead4f6c]
mongod(_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE+0xEC) [0x55bb0eae0cfc]
mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x17F) [0x55bb0eaddfdf]
mongod(+0x1187C6C) [0x55bb0eadfc6c]
mongod(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x182) [0x55bb0f93bce2]
mongod(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x10D) [0x55bb0eadaead]
mongod(_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE+0x753) [0x55bb0eadc723]
mongod(_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE+0x316) [0x55bb0eadd2d6]
mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0xDB) [0x55bb0eaddf3b]
mongod(+0x1187C6C) [0x55bb0eadfc6c]
mongod(+0x1FE414B) [0x55bb0f93c14b]
mongod(+0x2639A45) [0x55bb0ff91a45]
mongod(+0x2639AA4) [0x55bb0ff91aa4]
libpthread.so.0(+0x76DB) [0x7f47c8ce06db]
libc.so.6(clone+0x3F) [0x7f47c8a0988f]
----- END BACKTRACE -----
My best theory right now is that I’ve accidentally stumbled upon some strange edge case where the contents of a single document actually manage to corrupt the entire database. At any rate, I hope I’m wrong, because I really like MongoDB.