MongoDB switches from primary state to secondary during script

Hi friends,

I have a MongoDB replica set running on my k8s cluster, set up with the Bitnami Helm chart (bitnami/mongodb in the bitnami/charts repo on GitHub).

I have a cron job that runs every night to store data in my MongoDB. It failed last night at around 1–3am because mongo was no longer in a primary state:

MongoError: Not primary while writing

When I check the logs of the MongoDB arbiter, I can see the following around the same time:

kubectl logs -f mongo-prod-mongodb-arbiter-0

First I get a set of entries like these, reporting a slow query or that serverStatus was very slow:

{"t":{"$date":"2020-11-17T03:19:19.376+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn3","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"replSetHeartbeat":"rs0","configVersion":492521,"hbv":1,"from":"<redacted>6-<redacted>.us-east-2.elb.amazonaws.com:27017","fromId":0,"term":14,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1605583150,"i":1}},"signature":{"hash":{"$binary":{"base64":"GGP8UlQZ1+TrxWk2hronxraFYrU=","subType":"0"}},"keyId":6855964983600087045}},"$db":"admin"},"numYields":0,"reslen":489,"locks":{},"protocol":"op_msg","durationMillis":3476}}
{"t":{"$date":"2020-11-17T03:19:21.251+00:00"},"s":"I",  "c":"COMMAND",  "id":20499,   "ctx":"ftdc","msg":"serverStatus was very slow","attr":{"timeStats":{"after basic":1279,"after asserts":1287,"after connections":1288,"after electionMetrics":1690,"after extra_info":1690,"after flowControl":1690,"after globalLock":1690,"after locks":1691,"after logicalSessionRecordCache":1710,"after mirroredReads":1712,"after network":1712,"after opLatencies":1723,"after opReadConcernCounters":1723,"after opcounters":1723,"after opcountersRepl":1723,"after oplogTruncation":1756,"after repl":5239,"after security":5579,"after storageEngine":7089,"after tcmalloc":7089,"after trafficRecording":7089,"after transactions":7089,"after transportSecurity":7089,"after twoPhaseCommitCoordinator":7089,"after wiredTiger":7101,"at end":7118}}}
{"t":{"$date":"2020-11-17T03:19:23.436+00:00"},"s":"I",  "c":"COMMAND",  "id":20499,   "ctx":"ftdc","msg":"serverStatus was very slow","attr":{"timeStats":{"after basic":17,"after asserts":17,"after connections":17,"after electionMetrics":17,"after extra_info":17,"after flowControl":17,"after globalLock":17,"after locks":17,"after logicalSessionRecordCache":17,"after mirroredReads":17,"after network":338,"after opLatencies":354,"after opReadConcernCounters":398,"after opcounters":398,"after opcountersRepl":398,"after oplogTruncation":576,"after repl":697,"after security":707,"after storageEngine":810,"after tcmalloc":1015,"after trafficRecording":1028,"after transactions":1038,"after transportSecurity":1038,"after twoPhaseCommitCoordinator":1065,"after wiredTiger":1075,"at end":1113}}}
{"t":{"$date":"2020-11-17T03:19:26.085+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn3","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"replSetHeartbeat":"rs0","configVersion":492521,"hbv":1,"from":"<redacted>6-<redacted>.us-east-2.elb.amazonaws.com:27017","fromId":0,"term":14,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1605583163,"i":2}},"signature":{"hash":{"$binary":{"base64":"r6eVme2iBLtlxWnwJyYhawoEin4=","subType":"0"}},"keyId":6855964983600087045}},"$db":"admin"},"numYields":0,"reslen":489,"locks":{},"protocol":"op_msg","durationMillis":149}}

Then eventually the data-bearing member switches to the SECONDARY state:

{"t":{"$date":"2020-11-17T03:22:18.507+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn7501","msg":"client metadata","attr":{"remote":"100.96.4.176:38788","client":"conn7501","doc":{"driver":{"name":"NetworkInterfaceTL","version":"4.4.1"},"os":{"type":"Linux","name":"PRETTY_NAME=\"Debian GNU/Linux 10 (buster)\"","architecture":"x86_64","version":"Kernel 4.9.0-11-amd64"}}}}
{"t":{"$date":"2020-11-17T03:22:18.508+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn7500","msg":"Successful authentication","attr":{"mechanism":"SCRAM-SHA-256","principalName":"__system","authenticationDatabase":"local","client":"100.96.4.176:38784"}}
{"t":{"$date":"2020-11-17T03:22:18.550+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn7501","msg":"Successful authentication","attr":{"mechanism":"SCRAM-SHA-256","principalName":"__system","authenticationDatabase":"local","client":"100.96.4.176:38788"}}
{"t":{"$date":"2020-11-17T03:22:20.175+00:00"},"s":"I",  "c":"REPL",     "id":21215,   "ctx":"ReplCoord-40","msg":"Member is in new state","attr":{"hostAndPort":"<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017","newState":"SECONDARY"}}

It doesn't always recover, but this time it went back into the PRIMARY state:

{"t":{"$date":"2020-11-17T03:22:28.913+00:00"},"s":"I",  "c":"ELECTION", "id":23980,   "ctx":"conn7499","msg":"Responding to vote request","attr":{"request":"{ replSetRequestVotes: 1, setName: \"rs0\", dryRun: true, term: 14, candidateIndex: 0, configVersion: 492521, configTerm: -1, lastCommittedOp: { ts: Timestamp(1605583310, 7), t: 14 } }","response":"{ term: 14, voteGranted: true, reason: \"\" }","replicaSetStatus":"Current replSetGetStatus output: { set: \"rs0\", date: new Date(1605583348912), myState: 7, term: 14, syncSourceHost: \"\", syncSourceId: -1, heartbeatIntervalMillis: 2000, majorityVoteCount: 2, writeMajorityCount: 1, votingMembersCount: 2, writableVotingMembersCount: 1, optimes: { lastCommittedOpTime: { ts: Timestamp(1605583310, 7), t: 14 }, lastCommittedWallTime: new Date(1605583310813), appliedOpTime: { ts: Timestamp(1605583310, 7), t: 14 }, durableOpTime: { ts: Timestamp(0, 0), t: -1 }, lastAppliedWallTime: new Date(1605583310813), lastDurableWallTime: new Date(0) }, members: [ { _id: 0, name: \"<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017\", health: 1.0, state: 2, stateStr: \"SECONDARY\", uptime: 37487, optime: { ts: Timestamp(1605583310, 7), t: 14 }, optimeDurable: { ts: Timestamp(1605583310, 7), t: 14 }, optimeDate: new Date(1605583310000), optimeDurableDate: new Date(1605583310000), lastHeartbeat: new Date(1605583348185), lastHeartbeatRecv: new Date(1605583348696), pingMs: 277, lastHeartbeatMessage: \"\", syncSourceHost: \"\", syncSourceId: -1, infoMessage: \"\", configVersion: 492521, configTerm: -1 }, { _id: 1, name: \"mongo-prod-mongodb-arbiter-0.mongo-prod-mongodb-arbiter-headless.mongodb.svc.cluster.local:27017\", health: 1.0, state: 7, stateStr: \"ARBITER\", uptime: 0, syncSourceHost: \"\", syncSourceId: -1, infoMessage: \"\", configVersion: 492521, configTerm: -1, self: true, lastHeartbeatMessage: \"\" } ] }"}}
{"t":{"$date":"2020-11-17T03:22:28.918+00:00"},"s":"I",  "c":"ELECTION", "id":23980,   "ctx":"conn7499","msg":"Responding to vote request","attr":{"request":"{ replSetRequestVotes: 1, setName: \"rs0\", dryRun: false, term: 15, candidateIndex: 0, configVersion: 492521, configTerm: -1, lastCommittedOp: { ts: Timestamp(1605583310, 7), t: 14 } }","response":"{ term: 15, voteGranted: true, reason: \"\" }","replicaSetStatus":"Current replSetGetStatus output: { set: \"rs0\", date: new Date(1605583348918), myState: 7, term: 15, syncSourceHost: \"\", syncSourceId: -1, heartbeatIntervalMillis: 2000, majorityVoteCount: 2, writeMajorityCount: 1, votingMembersCount: 2, writableVotingMembersCount: 1, optimes: { lastCommittedOpTime: { ts: Timestamp(1605583310, 7), t: 14 }, lastCommittedWallTime: new Date(1605583310813), appliedOpTime: { ts: Timestamp(1605583310, 7), t: 14 }, durableOpTime: { ts: Timestamp(0, 0), t: -1 }, lastAppliedWallTime: new Date(1605583310813), lastDurableWallTime: new Date(0) }, members: [ { _id: 0, name: \"<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017\", health: 1.0, state: 2, stateStr: \"SECONDARY\", uptime: 37487, optime: { ts: Timestamp(1605583310, 7), t: 14 }, optimeDurable: { ts: Timestamp(1605583310, 7), t: 14 }, optimeDate: new Date(1605583310000), optimeDurableDate: new Date(1605583310000), lastHeartbeat: new Date(1605583348185), lastHeartbeatRecv: new Date(1605583348696), pingMs: 277, lastHeartbeatMessage: \"\", syncSourceHost: \"\", syncSourceId: -1, infoMessage: \"\", configVersion: 492521, configTerm: -1 }, { _id: 1, name: \"mongo-prod-mongodb-arbiter-0.mongo-prod-mongodb-arbiter-headless.mongodb.svc.cluster.local:27017\", health: 1.0, state: 7, stateStr: \"ARBITER\", uptime: 0, syncSourceHost: \"\", syncSourceId: -1, infoMessage: \"\", configVersion: 492521, configTerm: -1, self: true, lastHeartbeatMessage: \"\" } ] }"}}
{"t":{"$date":"2020-11-17T03:22:30.187+00:00"},"s":"I",  "c":"REPL",     "id":21215,   "ctx":"ReplCoord-40","msg":"Member is in new state","attr":{"hostAndPort":"<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017","newState":"PRIMARY"}}

{"t":{"$date":"2020-11-17T03:23:29.463+00:00"},"s":"I",  "c":"REPL",     "id":21216,   "ctx":"ReplCoord-40","msg":"Member is now in state RS_DOWN","attr":{"hostAndPort":"<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017","heartbeatMessage":"Request 18699 timed out, deadline was 2020-11-17T03:23:13.293+00:00, op was RemoteCommand 18699 -- target:[<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017] db:admin expDate:2020-11-17T03:23:13.285+00:00 cmd:{ replSetHeartbeat: \"rs0\", configVersion: 492521, hbv: 1, from: \"mongo-prod-mongodb-arbiter-0.mongo-prod-mongodb-arbiter-headless.mongodb.svc.cluster.local:27017\", fromId: 1, term: 15 }"}}

rs.status()

rs0:PRIMARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2020-11-17T08:53:33.966Z"),
	"myState" : 1,
	"term" : NumberLong(16),
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"majorityVoteCount" : 2,
	"writeMajorityCount" : 1,
	"votingMembersCount" : 2,
	"writableVotingMembersCount" : 1,
	"optimes" : {
		"lastCommittedOpTime" : {
			"ts" : Timestamp(1605603211, 1),
			"t" : NumberLong(16)
		},
		"lastCommittedWallTime" : ISODate("2020-11-17T08:53:31.190Z"),
		"readConcernMajorityOpTime" : {
			"ts" : Timestamp(1605603211, 1),
			"t" : NumberLong(16)
		},
		"readConcernMajorityWallTime" : ISODate("2020-11-17T08:53:31.190Z"),
		"appliedOpTime" : {
			"ts" : Timestamp(1605603211, 1),
			"t" : NumberLong(16)
		},
		"durableOpTime" : {
			"ts" : Timestamp(1605603211, 1),
			"t" : NumberLong(16)
		},
		"lastAppliedWallTime" : ISODate("2020-11-17T08:53:31.190Z"),
		"lastDurableWallTime" : ISODate("2020-11-17T08:53:31.190Z")
	},
	"lastStableRecoveryTimestamp" : Timestamp(1605603191, 1),
	"electionCandidateMetrics" : {
		"lastElectionReason" : "electionTimeout",
		"lastElectionDate" : ISODate("2020-11-17T03:24:18.778Z"),
		"electionTerm" : NumberLong(16),
		"lastCommittedOpTimeAtElection" : {
			"ts" : Timestamp(1605583378, 1),
			"t" : NumberLong(15)
		},
		"lastSeenOpTimeAtElection" : {
			"ts" : Timestamp(1605583393, 1),
			"t" : NumberLong(15)
		},
		"numVotesNeeded" : 2,
		"priorityAtElection" : 5,
		"electionTimeoutMillis" : NumberLong(10000),
		"numCatchUpOps" : NumberLong(0),
		"newTermStartDate" : ISODate("2020-11-17T03:24:18.784Z"),
		"wMajorityWriteAvailabilityDate" : ISODate("2020-11-17T03:24:18.868Z")
	},
	"members" : [
		{
			"_id" : 0,
			"name" : "<redacted>-<redacted>.us-east-2.elb.amazonaws.com:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 57390,
			"optime" : {
				"ts" : Timestamp(1605603211, 1),
				"t" : NumberLong(16)
			},
			"optimeDate" : ISODate("2020-11-17T08:53:31Z"),
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "",
			"electionTime" : Timestamp(1605583458, 1),
			"electionDate" : ISODate("2020-11-17T03:24:18Z"),
			"configVersion" : 492521,
			"configTerm" : -1,
			"self" : true,
			"lastHeartbeatMessage" : ""
		},
		{
			"_id" : 1,
			"name" : "mongo-prod-mongodb-arbiter-0.mongo-prod-mongodb-arbiter-headless.mongodb.svc.cluster.local:27017",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 19766,
			"lastHeartbeat" : ISODate("2020-11-17T08:53:33.072Z"),
			"lastHeartbeatRecv" : ISODate("2020-11-17T08:53:33.078Z"),
			"pingMs" : NumberLong(0),
			"lastHeartbeatMessage" : "",
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "",
			"configVersion" : 492521,
			"configTerm" : -1
		}
	],
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1605603211, 1),
		"signature" : {
			"hash" : BinData(0,"nSv0QPiJ+uvO9A8ljcDIpICTHqg="),
			"keyId" : NumberLong("6855964983600087045")
		}
	},
	"operationTime" : Timestamp(1605603211, 1)
}

Can someone help me understand what is happening here and what I can do to fix it?

FYI: I've also posted this question on Stack Overflow ("why is mongodb switching from primary state to secondary"), where a user mentioned it may have something to do with locking performance.

My script makes requests to an API and inserts the data it receives into MongoDB. It runs 10 requests at a time concurrently, and a single request may insert anywhere between 0 and 1,500 documents.
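
For scale, each API response ends up as one bulk insert, roughly like this (the collection and variable names here are made up for illustration):

db.apiResults.insertMany(docs)

so with 10 concurrent requests the primary can receive up to ~15,000 documents at essentially the same moment.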

You have one data node and an arbiter; that does not make a valid replica set. You may as well remove your arbiter, as it provides nothing beneficial to your deployment. A replica set should have at least 3 data nodes. If you really, really, really… cannot run 3 data nodes, then use 2 data nodes and an arbiter.

You will need to look at the logs from the data node for clues, as well as the system logs and metrics (CPU, RAM, swap, disk I/O). This has the signature of a mongod under too much load, becoming unresponsive and triggering an election. That lack of performance could be due to a lack of resources or to configuration issues.

Did the container get restarted by the orchestrator (k8s)? Are there resource constraints on the container?

Even though it is running in a container, still look to the Production Notes and Operations Checklist. Transparent Huge Pages, filesystem type, and swappiness can all have an impact on your deployment.
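
For example, two of the host-level settings that most often bite can be checked on the Kubernetes worker node itself (not inside the pod):

cat /sys/kernel/mm/transparent_hugepage/enabled   (the Production Notes recommend this be disabled, i.e. show [never])
cat /proc/sys/vm/swappiness                       (a low value such as 1 is recommended)

These are kernel settings, so the container inherits whatever the node is configured with.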

There are tips specifically for AWS that may or may not apply.

You have one data node and an arbiter; that does not make a valid replica set. You may as well remove your arbiter, as it provides nothing beneficial to your deployment. A replica set should have at least 3 data nodes. If you really, really, really… cannot run 3 data nodes, then use 2 data nodes and an arbiter.

Makes sense. I set up MongoDB on my k8s cluster using the Bitnami Helm chart, and I wanted to use a replica set since the docs mention that standalone is not appropriate for production. You're right that what I have right now is effectively a single-data-node replica set; I will update that in the future.
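
From what I can tell, the chart exposes values for this, so something along these lines should give three data-bearing members and drop the arbiter (I haven't verified the exact value names for my chart version, and I'm assuming the release is called mongo-prod):

helm upgrade mongo-prod bitnami/mongodb -n mongodb --set architecture=replicaset --set replicaCount=3 --set arbiter.enabled=false

With 3 data nodes a majority (2) survives the loss of any one member, so an unresponsive primary fails over instead of leaving no writable node.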

You will need to look at the logs from the data node for clues, as well as the system logs and metrics (CPU, RAM, swap, disk I/O).

The logs I've shown here are from the arbiter. As far as I can remember, the logs from the pod where the actual data node runs didn't show anything of significance during that time period, but I will check again the next time it occurs.

I can see from the arbiter logs that there are "Slow query" and "serverStatus was very slow" messages, but I'm not sure what causes those two.

When you say system logs, are these different from the logs I'm getting from the k8s pod? The structure described at https://docs.mongodb.com/manual/reference/log-messages/ is very similar to what I've posted above. If you are referring to something different, where would I find those system logs?

Is there anything you can recommend for looking at the metrics? I've been looking into mongotop and mongostat.
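
For anyone following along, both can be run against the data pod with something like this (the pod name is my guess from the chart's naming convention, and this assumes the database tools are present in the image):

kubectl exec -it mongo-prod-mongodb-0 -n mongodb -- mongostat -u root -p <password> --authenticationDatabase admin 5
kubectl exec -it mongo-prod-mongodb-0 -n mongodb -- mongotop -u root -p <password> --authenticationDatabase admin 5

and kubectl top pod -n mongodb (which needs metrics-server) gives the container-level CPU/memory alongside them.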

This has the signature of a mongod under too much load, becoming unresponsive and triggering an election. That lack of performance could be due to a lack of resources or to configuration issues.

This would make sense, as it only occurs when my script runs. During the rest of the day it's fine.

Did the container get restarted by the orchestrator (k8s)? Are there resource constraints on the container?

I'm unsure. Kubernetes shows how many times it has restarted, but it does not provide any event logging as to why it was restarted.
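
Next time it happens I'll check kubectl describe pod mongo-prod-mongodb-0 -n mongodb (the Last State / Reason fields and the resources section) and kubectl get events -n mongodb; as I understand it, those should show whether the container was OOM-killed or evicted and what limits it is running under. (Again, the pod name is my guess from the chart's naming.)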

I have an update with some stats/metrics:

CPU/memory usage

I'm not sure how accurate this is, but before the script ran and during the first hour or two, the mongo data pod was using around 1.5–2 GB of memory. Long after the script finished, it is still using ~6 GB.

logging

During the first 30 minutes of the script running, the application produces 15k log entries, and then every 30 minutes after that it progressively gets slower. I believe it's because of these mongo slow queries, but I'm not ruling out the possibility that the HTTP requests might also be slowing down. (For every HTTP request I'm storing the data in mongo.)

mongostat

I can see that the used, vsize, and res values increased after the script ran compared to before.

before:

insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn set repl                time
    *0    *0     *0     *0       0     1|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   499b   48.6k   10 rs0  PRI Nov 18 18:13:17.228
    *0     2     *0     *0       0    13|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  3.83k    116k   10 rs0  PRI Nov 18 18:13:18.226
    *0    *0     *0     *0       0     3|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   878b   50.7k   10 rs0  PRI Nov 18 18:13:19.227
    *0    *0     *0     *0       0     3|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   740b   49.4k   10 rs0  PRI Nov 18 18:13:20.226
    *0     1     *0     *0       0     7|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  2.85k    111k   10 rs0  PRI Nov 18 18:13:21.228
    *0    *0     *0     *0       0     3|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   740b   49.4k   10 rs0  PRI Nov 18 18:13:22.226
    *0     1     *0     *0       0     7|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  2.85k    111k   10 rs0  PRI Nov 18 18:13:23.227
    *0    *0     *0     *0       0     4|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   946b   50.4k   10 rs0  PRI Nov 18 18:13:24.226
    *0    *0     *0     *0       0     1|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   497b   48.4k   10 rs0  PRI Nov 18 18:13:25.231
    *0     2     *0     *0       0    11|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  3.52k    115k   10 rs0  PRI Nov 18 18:13:26.226
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn set repl                time
    *0    *0     *0     *0       0     1|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   495b   48.3k   10 rs0  PRI Nov 18 18:13:27.234
    *0     2     *0     *0       0    13|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  3.86k    116k   10 rs0  PRI Nov 18 18:13:28.226
    *0    *0     *0     *0       0     3|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   878b   50.6k   10 rs0  PRI Nov 18 18:13:29.228
    *0    *0     *0     *0       0     3|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   740b   49.4k   10 rs0  PRI Nov 18 18:13:30.226
    *0     1     *0     *0       0     7|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  2.85k    111k   10 rs0  PRI Nov 18 18:13:31.227
    *0    *0     *0     *0       0     3|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   740b   49.4k   10 rs0  PRI Nov 18 18:13:32.226
    *0     1     *0     *0       0     7|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  2.85k    111k   10 rs0  PRI Nov 18 18:13:33.228
    *0    *0     *0     *0       0     4|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   947b   50.5k   10 rs0  PRI Nov 18 18:13:34.226
    *0    *0     *0     *0       0     1|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0   499b   48.6k   10 rs0  PRI Nov 18 18:13:35.227
    *0     2     *0     *0       0    11|0  0.1% 8.8%       0 2.14G 425M 0|0 1|0  3.51k    114k   10 rs0  PRI Nov 18 18:13:36.226

after:

insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn set repl                time
    *0    *0     *0     *0       0     2|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0   757b   91.5k   13 rs0  PRI Nov 19 08:32:45.914
    *0     2     *0     *0       0    12|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.79k    159k   13 rs0  PRI Nov 19 08:32:46.908
    *0    *0     *0     *0       0     5|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.26k   90.1k   13 rs0  PRI Nov 19 08:32:47.911
    *0     2     *0     *0       0    10|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.36k    156k   13 rs0  PRI Nov 19 08:32:48.911
    *0    *0     *0     *0       0     3|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0   765b   92.4k   13 rs0  PRI Nov 19 08:32:49.908
    *0    *0     *0     *0       0     4|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.19k   93.7k   13 rs0  PRI Nov 19 08:32:50.909
    *0     2     *0     *0       0    10|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.31k    157k   13 rs0  PRI Nov 19 08:32:51.908
    *0    *0     *0     *0       0     4|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.00k   92.9k   13 rs0  PRI Nov 19 08:32:52.906
    *0     1     *0     *0       0     8|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.11k    155k   13 rs0  PRI Nov 19 08:32:53.908
    *0    *0     *0     *0       0     7|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.61k   96.0k   13 rs0  PRI Nov 19 08:32:54.906
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn set repl                time
    *0    *0     *0     *0       0     2|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0   760b   91.8k   13 rs0  PRI Nov 19 08:32:55.909
    *0     2     *0     *0       0    12|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.77k    158k   13 rs0  PRI Nov 19 08:32:56.907
    *0    *0     *0     *0       0     6|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.50k   95.7k   13 rs0  PRI Nov 19 08:32:57.908
    *0     2     *0     *0       0    10|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.36k    156k   13 rs0  PRI Nov 19 08:32:58.906
    *0    *0     *0     *0       0     2|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0   758b   91.6k   13 rs0  PRI Nov 19 08:32:59.912
    *0    *0     *0     *0       0     5|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.19k   94.0k   13 rs0  PRI Nov 19 08:33:00.909
    *0     2     *0     *0       0    10|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.31k    157k   13 rs0  PRI Nov 19 08:33:01.908
    *0    *0     *0     *0       0     4|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.00k   92.9k   13 rs0  PRI Nov 19 08:33:02.906
    *0     1     *0     *0       0     8|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  3.11k    155k   13 rs0  PRI Nov 19 08:33:03.908
    *0    *0     *0     *0       0     7|0  0.1% 77.2%       0 4.76G 2.74G 0|0 1|0  1.61k   96.0k   13 rs0  PRI Nov 19 08:33:04.906

MongoDB's WiredTiger engine will by default take the greater of (available RAM − 1 GB) / 2 or 256 MB for the WiredTiger cache. Beyond that, WiredTiger will utilise the filesystem cache too. This use of memory is expected.
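
For example, if the container sees 8 GB of RAM, the default cache works out to (8 − 1) / 2 = 3.5 GB. You can confirm the configured limit from the shell with:

db.serverStatus().wiredTiger.cache["maximum bytes configured"]

and if the pod has a memory limit it can be worth setting the cache size explicitly (the --wiredTigerCacheSizeGB mongod option, or storage.wiredTiger.engineConfig.cacheSizeGB in the config file) so the cache is sized against what the container is actually allowed to use.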

Used WiredTiger cache is 77.2% and CPU is negligible. My guess is storage I/O performance. As I mentioned previously, look at the Production Notes and Operations Checklist.

Enabling MongoDB free monitoring (among other options) can give some extra data. The logs from the data node may have additional detail about the slowdown.
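
Free monitoring is a one-liner from the mongo shell:

db.enableFreeMonitoring()

and db.getFreeMonitoringStatus() then shows the hosted dashboard URL (db.disableFreeMonitoring() turns it off again).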

You're doing 10 concurrent writes, which is not a huge amount, and since you state it is performing inserts there should be no concurrency issues at the document level.

An insert will create a lot of storage writes (there is a quick way to watch this land on disk below).
A MongoDB insert will:

  • update data in-memory
  • update indexes in-memory
  • write to journal
  • write to storage / checkpoint
  • somewhere in here there is also a write to the oplog, which eventually goes to storage as well
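
If you want to see how much of that actually reaches the disk while the script runs, something like iostat on the worker node is usually enough (assuming the sysstat package is installed there):

iostat -x 5

A high %util and a growing await on the EBS device during the run would point at the volume rather than at mongod itself.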

mongostat : mongostat · GitHub
mongotop: mongotop · GitHub

(new users can only share 2 links at a time :/)

I've had my script running during the day so that I could monitor it, and I've collected some further information.

free monitoring dashboard

I can see from the free monitoring dashboard that my CPU utilization is above 100%. I am quite confused by this, because my Grafana dashboard, which pulls in metrics from Kubernetes, does not say the same, and my EC2 node in the AWS console shows 50–75% utilization.

You mentioned it could be a storage I/O performance issue. My MongoDB in k8s uses a PVC that provisions an AWS EBS volume, and I checked the metrics for that specific volume for the time the script was running.
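
(For reference, the volume metrics live in CloudWatch under the AWS/EBS namespace: VolumeReadOps / VolumeWriteOps, VolumeQueueLength and, for gp2 volumes, BurstBalance, which drops towards zero once the volume has used up its IOPS burst credits. They can also be pulled from the CLI, e.g.:

aws cloudwatch get-metric-statistics --namespace AWS/EBS --metric-name BurstBalance --dimensions Name=VolumeId,Value=<vol-id> --start-time 2020-11-17T00:00:00Z --end-time 2020-11-17T06:00:00Z --period 300 --statistics Average

with the volume ID taken from the PV that backs the PVC.)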

logs

Within the mongo data node logs I can see many examples of "Slow query", all to do with inserts into my mongo collection.
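
They can be filtered out of the structured log with something like (pod name guessed from the chart's naming):

kubectl logs mongo-prod-mongodb-0 -n mongodb | grep '"msg":"Slow query"'

and the durationMillis attribute on each entry shows how long the individual insert took.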

mongo-data-node: mongo-data-node · GitHub
mongo-arbiter: mongo-arbiter · GitHub

The hover-over help explains that this is per 1 CPU. So it is 100% of 1 CPU; if this is a 2-CPU system you would consider that 50%.

Looking at your Grafana, that system is often oversubscribed: the load average is exceeding your core count. Just after 17:05 the load average really climbs. The consistently high load would explain a slowdown, and the ramp-up at the end is something I have experienced as a system lock-up.

The mongo monitoring looks nominal to me, which makes me think there is another factor on the host creating this load. Another container or process, perhaps.
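
A quick way to confirm that is to see what else is scheduled on that node and how loaded it is:

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
kubectl top node <node-name>

and then compare the pods' requests/limits against the node's allocatable CPU.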


Hey, could you share what the solution was, if you remember?
I'm dealing with almost the same problem. I have a cron job that runs at 12 AM every night, and at around 1 AM the state switches and I get a NotPrimaryError from pymongo. I still can't locate the problem.