We have a three-node cluster with one primary and two secondary nodes, 150 GB of data, and heavy read and write operations.
When the cluster was operating on version 3.6, a large number of slow queries were observed and data packet processing was taking more than 10 seconds.
As part of performance optimisation, and to get multi-document transactions, we upgraded MongoDB to version 4.0.23. Query processing speed improved, but within 2-3 hours under load the primary goes down and one of the secondary nodes becomes unhealthy.
Why does MongoDB survive on version 3.6 under high load, but not on 4.0?
Can anyone help here?
From the mongostat output you posted, it appears that the node was heavily utilized, up to its configured memory usage limits. I would think that the server was killed by the OOM killer. Can you confirm if this is the case? Note that by default EC2 instances do not come configured with swap, so if any process takes a lot of memory, the OS will just kill it.
If the server continually gets killed by the OOM killer, it seems that your instance size is too small for the workload you're putting on it. One possible solution is to provision larger hardware and see if the issue persists.
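For reference, a couple of common ways to check for OOM kills on a typical Linux host (a sketch; exact log paths vary by distro):

sudo dmesg -T | grep -i -E 'killed process|out of memory'    # kernel ring buffer
sudo journalctl -k | grep -i 'out of memory'                 # systemd-based hosts
sudo grep -i -E 'killed process|out of memory' /var/log/messages /var/log/syslog

If mongod was killed by the kernel, you should see a line naming the mongod process and its memory usage at the time of the kill.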
Why does MongoDB survive on version 3.6 under high load, but not on 4.0?
You mentioned earlier that you upgraded to 4.0 for multi-document transactions. Is your app using this feature? If yes, multi-document transactions will incur additional memory load (see Performance Best Practices: Transactions and Read/Write Concerns), especially if there are a lot of them.
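One way to check whether the app is actually using transactions is the transactions section of serverStatus on a 4.0+ node (a sketch; exact field names can vary slightly by version):

rs0:PRIMARY> db.serverStatus().transactions
// non-zero totalStarted / totalCommitted / totalAborted counters indicate the feature is in use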
On another note, I would encourage you to try out the newer MongoDB versions (4.4.5 currently) and see if it helps your situation, since there have been many improvements since 4.0.
Hi kevinadi,
Thanks for your suggestion.
For the case of the OOM killer, we have checked all the logs in /var/log/messages but there is no log message indicating this. Is there any other way of checking this?
And in the context of multi-document transactions, we have not yet set the feature compatibility version to 4.0, so there is less chance of additional memory load.
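(The current FCV can be confirmed with the standard getParameter command, shown here for reference:)

rs0:PRIMARY> db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })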
Also, we have now downgraded to version 3.6 and we are running with very low load, but MongoDB nodes are still crashing: one of the secondary nodes got promoted to primary, started responding very slowly and ultimately reached an unhealthy state, whereas the node which was originally the primary was not behaving unusually.
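For reference, member state transitions can be followed with rs.status() (a sketch; name, stateStr and health are standard fields):

rs0:PRIMARY> rs.status().members.forEach(function(m) { print(m.name + "  " + m.stateStr + "  health=" + m.health); })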
Please find some additional stats and let us know if you can help us out here.
Current version: 3.6
Memory consumption stats:
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["maximum bytes configured"]
32212254720
rs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC: 28084039952 (26783.0 MiB) Bytes in use by application
MALLOC: + 7536099328 ( 7187.0 MiB) Bytes in page heap freelist
MALLOC: + 374013696 ( 356.7 MiB) Bytes in central cache freelist
MALLOC: + 2279168 ( 2.2 MiB) Bytes in transfer cache freelist
MALLOC: + 260880624 ( 248.8 MiB) Bytes in thread cache freelists
MALLOC: + 114385152 ( 109.1 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 36371697920 (34686.8 MiB) Actual memory used (physical + swap)
MALLOC: + 2148909056 ( 2049.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 38520606976 (36736.1 MiB) Virtual address space used
MALLOC:
MALLOC: 610748 Spans in use
MALLOC: 449 Thread heaps in use
MALLOC: 4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
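A couple of related counters that can be correlated with the tcmalloc figures above (standard serverStatus fields on 3.6/4.0, shown here for reference only):

rs0:PRIMARY> db.serverStatus().mem                                               // resident / virtual memory in MiB
rs0:PRIMARY> db.serverStatus().wiredTiger.cache["bytes currently in the cache"]  // current WiredTiger cache usage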
Error log before the secondary transitioned into primary:
Now we have done an AMI restore and the server is working as expected, but we have still not reached any conclusion about what effect the upgrade to 4.0 and the subsequent downgrade had on the MongoDB environment.
Please find the stats captured before the AMI restore:
CPU: fluctuating between 6% and 50%
Load average per 1 min: 29.5
Mongod reached an unhealthy / not reachable state.
That is a normalised load average of 3.69 (29.5 / 8 vCPU). This should be below 1 on a healthy system. Given that the actual CPU usage you reported is low, you're bottlenecked elsewhere. This could be memory pressure and swapping to disk (without swap you get a very high load before the OOM killer fires), or your disk I/O is hitting a limit.
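To narrow down where the bottleneck is, standard Linux tools are usually enough (a sketch; adjust for your distro):

nproc            # number of vCPUs, to normalise the load average
uptime           # 1/5/15-minute load averages
vmstat 1 5       # si/so columns show swapping, b column shows blocked processes
iostat -x 1 5    # %util and await show whether disk I/O is saturated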
@chris With the same load average, the server was running fine until the upgrade-and-downgrade activity. Now that we have done the AMI restore, the server is running fine again.
Could it be that during the upgrade to 4.0 and the downgrade some configuration was changed?
Is there any way to figure this out?
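(One way to compare the effective server configuration before and after would be, for example:

rs0:PRIMARY> db.adminCommand({ getCmdLineOpts: 1 })   // shows the configuration the server is actually running with

)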
Now we are not sure whether to even proceed with the upgrade to 4.0, because on the production environment it will be a risk.
You may think it is running fine. It's not. With a load average that high you are under-resourced somewhere. In my opinion the changes you made (upgrade and downgrade) are highlighting this issue, not causing it.
New & Unread Topics
Topic list, column headers with buttons are sortable.