Hi Aasawari,
Thank you for the reply! I’ll try to clarify things a bit:
Firstly, could you confirm if you have seen similar issues before the upgrade has happened or is this the first time you are seeing the issues?
- The issue wasn’t observed before the upgrade. Moreover, the upgrade itself required shutting the nodes down in sequence, and it went all the way through without a hiccup.
However, during the upgrade the replica set contained 1 extra hidden secondary.
Unless I’m missing something, this suggests that the issue appeared only after all servers landed on 7.0.8.
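For context, when I say all servers landed on 7.0.8 with the hidden secondary still in the set, I’m going by the kind of quick mongosh check sketched below (a sketch, not a transcript of anything we actually ran):

```js
// mongosh, connected to the replica set (sketch only)
rs.status().members.forEach(m =>
  print(`${m.name}  ${m.stateStr}`)        // PRIMARY / SECONDARY / ARBITER
);
rs.conf().members.filter(m => m.hidden);   // lists the extra hidden secondary
// Run this part against each node individually to confirm its binary version:
db.adminCommand({ buildInfo: 1 }).version; // expected "7.0.8"
```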
- For a PSA architecture, when one data-bearing node is shut down, the other data node becomes the primary. This is expected. But could you help in understanding what is meant by the application not working?
As mentioned in the documentation for PSA architecture,
To clarify: yes, I understand that stepping down the primary is expected to make another node the primary. What I meant is that the replica set had a perfectly valid status, with an active primary, whenever we shut down one of the data-bearing nodes.
What I meant by the application not working is exactly that: it stopped making any progress. Unfortunately we didn’t have the time or ability to track it more deeply, but what I can say is that there is absolutely nothing in the logs for that window (neither the mongod log nor the application log, including any errors from the native Node.js MongoDB driver).
It looked very much as if all write queries were waiting for a write acknowledgement that never came.
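To make that suspicion concrete: if the writes really are stuck waiting for acknowledgement, adding an explicit wtimeout to a test write should make it fail fast instead of hanging. Something along these lines (a hypothetical mongosh sketch; test.ping is just a scratch namespace I’m making up for illustration):

```js
// Run while one data-bearing node is shut down (hypothetical reproduction)
db.getSiblingDB("test").ping.insertOne(
  { at: new Date() },
  { writeConcern: { w: "majority", wtimeout: 5000 } } // give up after 5 s
);
// If this returns a write concern error along the lines of "waiting for
// replication timed out", the writes really are blocked on majority
// acknowledgement rather than failing for some other reason.
```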
Could you confirm if you are facing the issue in the PSS architecture?
No, the issue was not present when I rolled out one more data-bearing node. That allowed us to perform the maintenance we needed, but it also strengthened the suspicion that the issue is related to the write concern.
I’m going to repeat the process next week. Can you suggest anything to troubleshoot it?
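One thing I’m planning to check myself before the next attempt is which write concern is actually in effect on the cluster, as opposed to whatever the clients send. A mongosh sketch of the server-side half of that check:

```js
// Shows any cluster-wide default write concern that has been set, plus the
// implicit default the server derives for this topology (run on the primary)
db.adminCommand({ getDefaultRWConcern: 1, inMemory: true });
```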
PS: I’ve just had an idea about what the reason could be. All mongo servers are firewalled, and each one of them whitelists only specific clients and the other mongo servers; however, I think we don’t whitelist any clients on the arbiter. I guess that if clients are unable to connect to the arbiter, they may be unable to determine its role in the replica set (i.e., they may treat it as just another secondary) and end up explicitly sending a majority write concern to the server with their write requests?
I’ll make sure the arbiter is whitelisted for the clients before the next attempt and see if it helps; a quick way to verify that is sketched below.
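In case it’s useful, this is roughly how I intend to verify that the clients can reach the arbiter and see its role (the hostname is a placeholder, not our real one):

```js
// From one of the application hosts, once the firewall rule is in place:
//   mongosh "mongodb://arbiter.example.net:27017" --eval "db.hello()"
// Inside mongosh the same check is simply:
db.hello();
// For an arbiter the reply should include { arbiterOnly: true, ... }, which is
// what the driver uses during topology discovery to learn the node's role.
```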