Hi all,
We are running MongoDB 4.4. Recently during an infrastructure outage, the number of open files (our limit is set to 500 000) was being reached and causing the consistency check of WiredTiger to get reset every 20 minutes. I increased the number of open files/ processes to 3 000 000 and restarted the services and the WiredTiger consistency check could continue to completion without being re-initialized.
Shortly after the MongoDB automation agent brought up the replica into a healthy state, the entire replica became unresponsive and displayed a warning/health problem. I saw the following errors being logged by the automation agent:
[2022-07-03T08:03:39.809+0000] [.info] [src/runtimestats/printer.go:mainLoop:58] <runtimestatsPrinter> [08:03:39.809] memory:rss=250593280
alloc=70680792 totalalloc=44663666952 mallocs=344157089 frees=343503736 heapinuse=76693504 heapobjects=653353 gcpausetotalns=176823910 gcpauselastns=78201 gcnum=1328 utimens=365716808000 stimens=190250093000 minflt=2464936 majflt=437
[2022-07-03T08:03:41.410+0000] [.info] [src/config/config.go:ReadClusterConfig:439] [08:03:41.410] Retrieving cluster config from https://api-agents.mongodb.com/agents/api/automation/conf/v1/58adb0513b34b952980b36d2?av=12.0.5.7560&aos=linux&aa=x86_64&ab=64&ad=ubuntu1804&ah=parag-r1ew1-mongo01-p.comapi.internal&ahs=parag-r1ew1-mongo01-p.comapi.internal&at=1656832598695...
[2022-07-03T08:03:41.510+0000] [.info] [main/components/agent.go:LoadClusterConfig:277] [08:03:41.510] clusterConfig unchanged
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:runListCollections:2890] <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:func1:2842] <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:getCollectionsHelper:2870] <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:OplogSize:3304] <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/replsetctl.go:AskAboutOplog:201] <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [state/stateutil/fickleutil.go:currentOplogState:950] <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [state/stateutil/stateutil.go:ComputeCurrentState:100] <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/director/director.go:updateCurrentState:739] <server-rs_16> [08:03:55.683] Error calling ComputeState : <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/director/director.go:planAndExecute:522] <server-rs_16> [08:03:55.683] Failed to compute states : <server-rs_16> [08:03:55.683] Error calling ComputeState : <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/director/director.go:mainLoop:395] <server-rs_16> [08:03:55.683] Failed to planAndExecute : <server-rs_16> [08:03:55.683] Failed to compute states : <server-rs_16> [08:03:55.683] Error calling ComputeState : <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
These blocks of errors continued until I restarted the physical Linux Node. I can see each message is displaying i/o timeout at the end. Has this got to do with Networking or the fact that even though I could see the Open File count was higher than 500 000, some other limit had been reached? Just to let you know, during this period, all other resources, memory, cpu and networking were completed under utilized.
Thanks and Regards,
Alex