Replica completely unresponsive - Number of open files

Hi all,

We are running MongoDB 4.4. Recently during an infrastructure outage, the number of open files (our limit is set to 500 000) was being reached and causing the consistency check of WiredTiger to get reset every 20 minutes. I increased the number of open files/ processes to 3 000 000 and restarted the services and the WiredTiger consistency check could continue to completion without being re-initialized.

Shortly after the MongoDB automation agent brought up the replica into a healthy state, the entire replica became unresponsive and displayed a warning/health problem. I saw the following errors being logged by the automation agent:

[2022-07-03T08:03:39.809+0000] [.info] [src/runtimestats/printer.go:mainLoop:58] <runtimestatsPrinter> [08:03:39.809] memory:rss=250593280
alloc=70680792	totalalloc=44663666952	mallocs=344157089	frees=343503736	heapinuse=76693504	heapobjects=653353	gcpausetotalns=176823910	gcpauselastns=78201	gcnum=1328	utimens=365716808000	stimens=190250093000	minflt=2464936	majflt=437
[2022-07-03T08:03:41.410+0000] [.info] [src/config/config.go:ReadClusterConfig:439] [08:03:41.410] Retrieving cluster config from https://api-agents.mongodb.com/agents/api/automation/conf/v1/58adb0513b34b952980b36d2?av=12.0.5.7560&aos=linux&aa=x86_64&ab=64&ad=ubuntu1804&ah=parag-r1ew1-mongo01-p.comapi.internal&ahs=parag-r1ew1-mongo01-p.comapi.internal&at=1656832598695...
[2022-07-03T08:03:41.510+0000] [.info] [main/components/agent.go:LoadClusterConfig:277] [08:03:41.510] clusterConfig unchanged
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:runListCollections:2890] <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:func1:2842] <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:getCollectionsHelper:2870] <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/processctl.go:OplogSize:3304] <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/mongoctl/replsetctl.go:AskAboutOplog:201] <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [state/stateutil/fickleutil.go:currentOplogState:950] <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [state/stateutil/stateutil.go:ComputeCurrentState:100] <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/director/director.go:updateCurrentState:739] <server-rs_16> [08:03:55.683] Error calling ComputeState : <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/director/director.go:planAndExecute:522] <server-rs_16> [08:03:55.683] Failed to compute states : <server-rs_16> [08:03:55.683] Error calling ComputeState : <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout
[2022-07-03T08:03:55.683+0000] [.error] [src/director/director.go:mainLoop:395] <server-rs_16> [08:03:55.683] Failed to planAndExecute : <server-rs_16> [08:03:55.683] Failed to compute states : <server-rs_16> [08:03:55.683] Error calling ComputeState : <server-rs_16> [08:03:55.683] Error getting fickle state for current state : <server-rs_16> [08:03:55.683] Failed to ask about oplog size : <server-rs_16> [08:03:55.683] Could not determine oplog size on mongo-server-docs.internal:27017 (local=false) : <server-rs_16> [08:03:55.683] Error getting collection information for db=local coll=oplog.rs : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error executing WithClientFor() for cp=mongo-server-docs.internal:27017 (local=false) connectMode=SingleConnect : <server-rs_16> [08:03:55.683] Error getting collections for db local : <server-rs_16> [08:03:55.683] Error running ListCollections with filter map[name:oplog.rs] : connection(mongo-server-docs.internal:27017[-93]) incomplete read of message header: read tcp 10.10.10.10:52302->11.11.11.11:27017: i/o timeout

These blocks of errors continued until I restarted the physical Linux Node. I can see each message is displaying i/o timeout at the end. Has this got to do with Networking or the fact that even though I could see the Open File count was higher than 500 000, some other limit had been reached? Just to let you know, during this period, all other resources, memory, cpu and networking were completed under utilized.

Thanks and Regards,
Alex

Hi @Alex_Meyer1

It appears that your deployment is using Cloud Manager. Since your deployment is likely a very large one (due to needing that many open files) and undoubtedly a complex one, I believe troubleshooting the issue would also likely to require significant amount of specific knowledge about the deployment itself and other operational concerns. Since Cloud Manager is part of the MongoDB Enterprise Advanced subscription, you might want to contact MongoDB Support. Otherwise if you’re currently evaluating Cloud Manager, I’d be happy to link you with sales to discuss further options.

Best regards
Kevin