May 2021

We have a three-node cluster (one primary, two secondaries) with 150 GB of data and heavy read and write traffic.
When the cluster was running version 3.6, we observed a large number of slow queries, with some operations taking more than 10 seconds to process.
As part of performance optimisation, and to get multi-document transactions, we upgraded MongoDB to 4.0.23. Query processing speed improved, but within 2-3 hours under load the primary goes down and one of the secondary nodes becomes unhealthy.
Why does MongoDB survive on version 3.6 under high load, but not on 4.0?
Can anyone help here?

Oplog: 71 GB
RAM: 64 GB
Instance: r4.2xlarge EC2 instance


Please find the output of the mongostat command.

insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
187 *0 1855 45 196 544|0 5.1% 79.9% 0 34.8G 28.3G 0|456 6|128 3.21m 120m 1264 rs0 PRI Apr 23 17:33:31.297
198 *0 1619 21 199 519|0 5.2% 80.1% 0 34.8G 28.3G 0|469 4|128 2.93m 125m 1264 rs0 PRI Apr 23 17:33:32.298
172 *0 3200 48 158 613|0 5.7% 80.4% 0 34.8G 28.3G 0|268 1|128 4.17m 122m 1263 rs0 PRI Apr 23 17:33:33.318
101 *0 7378 86 14 731|0 4.9% 79.2% 0 34.8G 28.3G 0|343 3|128 6.05m 133m 1262 rs0 PRI Apr 23 17:33:34.296
133 *0 2521 42 11 419|0 5.6% 79.9% 0 34.8G 28.3G 0|393 6|128 3.30m 188m 1261 rs0 PRI Apr 23 17:33:35.302
136 *0 3022 83 126 536|0 5.5% 80.0% 0 34.8G 28.3G 0|422 2|127 4.09m 178m 1261 rs0 PRI Apr 23 17:33:36.316
175 *0 2585 52 175 623|0 5.2% 79.8% 0 34.8G 28.3G 0|449 4|128 3.66m 158m 1268 rs0 PRI Apr 23 17:33:37.298
95 *0 2027 31 180 528|0 5.1% 79.9% 0 34.8G 28.3G 0|429 4|128 2.92m 127m 1270 rs0 PRI Apr 23 17:33:38.298
83 *0 2418 41 149 523|0 4.9% 79.4% 0 34.8G 28.3G 0|419 4|128 3.65m 146m 1270 rs0 PRI Apr 23 17:33:39.297
148 *0 2288 45 171 556|0 4.9% 79.7% 0 34.8G 28.3G 0|453 1|128 3.26m 109m 1270 rs0 PRI Apr 23 17:33:40.307
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
144 *0 1333 27 242 535|0 5.1% 80.0% 0 34.8G 28.3G 0|489 4|128 2.57m 103m 1271 rs0 PRI Apr 23 17:33:41.297
140 *0 1125 22 226 458|0 5.1% 80.4% 0 34.8G 28.3G 0|498 3|128 2.42m 96.7m 1271 rs0 PRI Apr 23 17:33:42.298
183 *0 1778 50 210 530|0 5.1% 80.2% 0 34.8G 28.3G 0|474 8|128 2.77m 122m 1271 rs0 PRI Apr 23 17:33:43.296
217 *0 1680 30 217 556|0 4.8% 79.9% 0 34.8G 28.3G 0|408 1|127 3.01m 122m 1271 rs0 PRI Apr 23 17:33:44.311
200 *0 4268 49 123 611|0 5.2% 80.4% 0 34.8G 28.3G 0|239 2|128 4.68m 136m 1271 rs0 PRI Apr 23 17:33:45.601
160 *0 6644 81 24 644|0 5.7% 80.8% 0 34.8G 28.3G 0|303 3|128 6.11m 213m 1271 rs0 PRI Apr 23 17:33:46.300
62 *0 5432 85 57 554|0 5.0% 79.9% 0 34.8G 28.3G 0|254 1|126 5.20m 148m 1271 rs0 PRI Apr 23 17:33:47.311
89 *0 9086 90 26 545|0 5.2% 80.2% 0 34.8G 28.3G 0|146 3|128 6.07m 130m 1274 rs0 PRI Apr 23 17:33:48.300
30 *0 8915 72 7 541|0 4.9% 79.7% 0 34.8G 28.3G 0|243 1|128 6.29m 128m 1279 rs0 PRI Apr 23 17:33:49.318
33 *0 13579 56 2 375|0 5.4% 80.1% 0 34.8G 28.3G 0|73 1|123 7.62m 37.7m 1275 rs0 PRI Apr 23 17:33:50.409
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
23 *0 12968 86 7 380|0 5.8% 80.0% 0 34.8G 28.3G 0|54 2|127 7.81m 111m 1272 rs0 PRI Apr 23 17:33:51.331
12 *0 6958 82 14 548|0 5.1% 79.9% 0 34.8G 28.3G 0|268 3|128 4.90m 259m 1273 rs0 PRI Apr 23 17:33:52.298
95 *0 8353 105 61 453|0 5.5% 80.2% 0 34.8G 28.3G 0|70 3|128 6.28m 103m 1275 rs0 PRI Apr 23 17:33:53.299
95 *0 8484 103 65 410|0 5.3% 79.7% 0 34.8G 28.3G 0|171 3|128 6.08m 138m 1278 rs0 PRI Apr 23 17:33:54.299
90 *0 9836 52 49 163|0 4.6% 79.5% 0 34.8G 28.3G 0|13 1|128 5.71m 89.4m 1279 rs0 PRI Apr 23 17:33:55.343

Hi @Rakhi_Maheshwari, welcome to the community!

From the mongostat output you posted, it appears that the node was heavily utilised, right up to its configured memory limits. I would think that the server was killed by the OOM killer. Can you confirm whether this is the case? Note that by default EC2 instances do not come configured with swap, so if any process takes too much memory, the OS will simply kill it.
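
For example, something like this should show whether the OOM killer ran (log locations differ by distro; Amazon Linux keeps kernel messages in /var/log/messages, Debian/Ubuntu in /var/log/syslog):

# Check the kernel ring buffer for OOM killer activity
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"

# Check the persisted syslog (whichever file exists on your distro)
sudo grep -i -E "out of memory|oom-killer|killed process" /var/log/messages /var/log/syslog 2>/dev/null

# On systemd-based distros the same messages are in the journal
sudo journalctl -k | grep -i -E "out of memory|killed process"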

If the server keeps getting killed by the OOM killer, your instance size is likely too small for the workload you are putting on it. One possible solution is to provision larger hardware and see if the issue persists.

Why does MongoDB survive on version 3.6 under high load, but not on 4.0?

You mentioned earlier that you upgraded to 4.0 for multi-document transactions. Is your app using this feature? If so, multi-document transactions will incur additional memory load (see Performance Best Practices: Transactions and Read / Write Concerns), especially if there are a lot of them.
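
If you are not sure whether transactions are actually being opened, the transactions section of serverStatus (available on mongod from 4.0.2) gives a rough count since startup. A quick check from the shell, with connection/auth options adjusted to your setup:

# Print transaction counters (totalStarted, totalCommitted, totalAborted, ...)
mongo --quiet --eval 'printjson(db.serverStatus().transactions)'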

On another note, I would encourage you to try out the newer MongoDB versions (4.4.5 currently) and see if it helps your situation, since there are many improvements made since 4.0.

Best regards,
Kevin

Hi kevinadi,
Thanks for your suggestion.
Regarding the OOM killer, we have checked all the logs in /var/log/messages but there is no message indicating an OOM kill. Is there any other way to check this?
As for multi-document transactions, we have not yet raised the feature compatibility version to 4.0, so additional memory load from transactions is unlikely.
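
For reference, the feature compatibility version can be confirmed from the mongo shell (connection/auth options omitted here); multi-document transactions only become available once it is set to "4.0":

# Show the current featureCompatibilityVersion
mongo --quiet --eval 'printjson(db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 }))'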

Also, we have now downgraded back to version 3.6 and are running with very low load, but the MongoDB nodes are still crashing: one of the secondary nodes was promoted to primary, started responding very slowly, and ultimately reached an unhealthy state, whereas the node that was originally primary did not behave unusually.

Please find some additional stats below, and let us know if you can help us out here.

Current version 3.6 memory consumption stats:

rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]
32212254720
rs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC:     28084039952 (26783.0 MiB) Bytes in use by application
MALLOC: +    7536099328 ( 7187.0 MiB) Bytes in page heap freelist
MALLOC: +     374013696 (  356.7 MiB) Bytes in central cache freelist
MALLOC: +       2279168 (    2.2 MiB) Bytes in transfer cache freelist
MALLOC: +     260880624 (  248.8 MiB) Bytes in thread cache freelists
MALLOC: +     114385152 (  109.1 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   36371697920 (34686.8 MiB) Actual memory used (physical + swap)
MALLOC: +    2148909056 ( 2049.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   38520606976 (36736.1 MiB) Virtual address space used
MALLOC:
MALLOC:          610748 Spans in use
MALLOC:             449 Thread heaps in use
MALLOC:            4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
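
For reference, the same numbers can be pulled non-interactively and watched over time with something like the following (a sketch; connection/auth options omitted):

# Compare configured WiredTiger cache size with bytes currently in cache, plus the tcmalloc summary
mongo --quiet --eval '
  var c = db.serverStatus().wiredTiger.cache;
  print("configured: " + c["maximum bytes configured"]);
  print("in cache:   " + c["bytes currently in the cache"]);
  print(db.serverStatus().tcmalloc.tcmalloc.formattedString);
'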

Error log before the secondary transitioned into primary:

2021-05-02T18:29:58.804+0000 I COMMAND [conn448567] command admin.$cmd command: isMaster { ismaster: 1, $clusterTime: { clusterTime: Timestamp(1619980198, 12), signature: { hash: BinData(0, ), keyId: } }, $db: "admin", $readPreference: { mode: "primary" } } numYields:0 reslen:678 locks:{} protocol:op_msg 572ms
2021-05-02T18:29:58.804+0000 I NETWORK [listener] connection accepted from ip:54178 #457578 (916 connections now open)
2021-05-02T18:29:58.805+0000 I NETWORK [conn457577] received client metadata from ip:54176 conn457577: { driver: { name: "mongo-java-driver|legacy", version: "3.11.2" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.14.219-119.340.amzn1.x86_64" }, platform: "Java/Eclipse OpenJ9/1.8.0_252-b09" }
2021-05-02T18:29:58.805+0000 I COMMAND [conn457572] command dbname.$cmd command: saslContinue { saslContinue: 1, conversationId: 1, payload: BinData(0, ), $db: "dbname" } numYields:0 reslen:203 locks:{} protocol:op_query 1018ms
2021-05-02T18:29:58.805+0000 I COMMAND [conn457573] command dbname.$cmd command: saslContinue { saslContinue: 1, conversationId: 1, payload: BinData(0, ), $db: "dbname" } numYields:0 reslen:203 locks:{} protocol:op_query 1018ms

Data size: 94 GB
RAM: 64 GB
Cache: 30 GB
Instance size: r4.2xlarge

Now that you are back on 3.6 and still experiencing the same issue, I think you need to focus your attention on the platform.

Review and ensure you are implementing the recommendations in the Production Notes and Operations Checklist, specifically the AWS EC2 section.
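
For example, a few of the host-level items from the Production Notes can be checked like this (the device name below is only an example, so substitute your data volume, and this assumes a single mongod on the host):

# Transparent huge pages should report "never"
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Readahead on the data volume (the Production Notes recommend a low value)
sudo blockdev --getra /dev/xvdf   # replace /dev/xvdf with your data volume

# Effective resource limits of the running mongod process
cat /proc/$(pgrep -x mongod)/limits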

Review your mongod logs, check for warnings at startup, and fix them.
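
For example (the log path is an assumption; use whatever your systemLog.path points to):

# Warning-severity lines in the mongod log (timestamp, then "W")
grep -E "^[^ ]+ W " /var/log/mongodb/mongod.log | head -50

# Startup warnings can also be printed from the mongo shell
mongo --quiet --eval 'printjson(db.adminCommand({ getLog: "startupWarnings" }))'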

Is the whole host 'crashing', or just mongod?

Are the hosts dedicated to MongoDB, or is other software co-located on them?

How do the other host metrics look: load average, I/O stats, CPU usage?
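
For example (iostat is part of the sysstat package):

# 1/5/15-minute load averages
uptime

# Per-device I/O utilisation and latency, refreshed every 5 seconds
iostat -xm 5

# Busiest processes by CPU/memory (single batch snapshot)
top -b -n 1 | head -20

# Free memory and swap
free -m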

Also check dmesg for OOM.

Hi @chris

We have now done an AMI restore and the server is working as expected, but we still have not reached any conclusion about what effect the upgrade to 4.0 and the subsequent downgrade had on the MongoDB environment.

Please find the stats from before the AMI restore:
CPU: fluctuating between 6% and 50%
Load average (1 min): 29.5

Mongod reached an unhealthy / unreachable state.

That is a normalised load average of 3.69 (29.5 / 8 vCPUs). This should be below 1 on a healthy system. Given that the actual CPU usage you reported is low, you are bottlenecked elsewhere. This could be memory pressure and swapping to disk (without swap you get a very high load before the OOM killer runs), or your disk I/O is hitting a limit.
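
For example, a one-liner sketch to compute the same normalised figure on the host:

# 1-minute load average divided by the number of vCPUs
awk -v ncpu=$(nproc) '{ printf "normalised load (1 min): %.2f\n", $1 / ncpu }' /proc/loadavg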

@chris With the same load average, the server was running fine until the upgrade/downgrade activity. Now that we have done the AMI restore, the server is running fine again.
Could some configuration have changed during the upgrade to 4.0 and the subsequent downgrade?
Is there any way to figure this out?
We are now hesitant to proceed with the upgrade to 4.0 at all, because it would put the production environment at risk.

Please help us here.

You may think it is running fine. It's not. With a load average that high, you are under-resourced somewhere. In my opinion the changes you made (upgrade and downgrade) are highlighting this issue, not causing it.