Hi Aasawari,
Thank you for the reply! I’ll try to clarify things a bit:
Firstly, could you confirm if you have seen similar issues before the upgrade has happened or is this the first time you are seeing the issues?
- The issue wasn’t observed before the upgrade. Moreover, the upgrade itself required shutting the nodes down in sequence, and it went all the way through without a hiccup.
However, during the upgrade the replica set contained 1 extra hidden secondary.
Unless I’m missing something, this suggests that the issue appeared only after all servers landed on 7.0.8.
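For context, when I say all servers landed on 7.0.8 with the hidden secondary still in the set, I’m going by the kind of quick mongosh check sketched below (a sketch, not a transcript of anything we actually ran):

```js
// mongosh, connected to the replica set (sketch only)
rs.status().members.forEach(m =>
  print(`${m.name}  ${m.stateStr}`)        // PRIMARY / SECONDARY / ARBITER
);
rs.conf().members.filter(m => m.hidden);   // lists the extra hidden secondary
// Run this part against each node individually to confirm its binary version:
db.adminCommand({ buildInfo: 1 }).version; // expected "7.0.8"
```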
- For a PSA architecture, when one data-bearing node is shut down, the other data node becomes the primary. This is expected. But could you help in understanding what is meant by the application not working?
As mentioned in the documentation for PSA architecture,
To clarify: yes, I understand that stepping down the primary is expected to make another node the primary. What I meant is that the replica set had a perfectly valid status, with an active primary, whenever we shut down one of the data-bearing nodes.
What I meant by the application not working is exactly that: it stopped making any progress. Unfortunately we didn’t have the time or ability to track it more deeply, but what I can say is that there is absolutely nothing in the logs for that window (neither the mongod log nor the application log, including any errors from the native Node.js MongoDB driver).
It looked very much as if all write queries were waiting for a write acknowledgement that never came.
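To make that suspicion concrete: if the writes really are stuck waiting for acknowledgement, adding an explicit wtimeout to a test write should make it fail fast instead of hanging. Something along these lines (a hypothetical mongosh sketch; test.ping is just a scratch namespace I’m making up for illustration):

```js
// Run while one data-bearing node is shut down (hypothetical reproduction)
db.getSiblingDB("test").ping.insertOne(
  { at: new Date() },
  { writeConcern: { w: "majority", wtimeout: 5000 } } // give up after 5 s
);
// If this returns a write concern error along the lines of "waiting for
// replication timed out", the writes really are blocked on majority
// acknowledgement rather than failing for some other reason.
```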
Could you confirm if you are facing the issue in the PSS architecture?
No, the issue was not present when I rolled out one more data-bearing node. That allowed us to perform the maintenance we needed, but it also strengthened the suspicion that the issue is related to the write concern.
I’m going to repeat the process next week. Can you suggest anything to troubleshoot it?
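One thing I’m planning to check myself before the next attempt is which write concern is actually in effect on the cluster, as opposed to whatever the clients send. A mongosh sketch of the server-side half of that check:

```js
// Shows any cluster-wide default write concern that has been set, plus the
// implicit default the server derives for this topology (run on the primary)
db.adminCommand({ getDefaultRWConcern: 1, inMemory: true });
```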
PS: I’ve just had an idea about what the reason could be. All mongo servers are firewalled, and each one of them whitelists only specific clients and the other mongo servers; however, I think we don’t whitelist any clients on the arbiter. I guess that if clients are unable to connect to the arbiter, they may be unable to determine its role in the replica set (i.e., they may treat it as just another secondary) and end up explicitly sending a majority write concern to the server with their write requests?
I’ll make sure the arbiter is whitelisted for the clients before the next attempt and see if it helps; a quick way to verify that is sketched below.
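In case it’s useful, this is roughly how I intend to verify that the clients can reach the arbiter and see its role (the hostname is a placeholder, not our real one):

```js
// From one of the application hosts, once the firewall rule is in place:
//   mongosh "mongodb://arbiter.example.net:27017" --eval "db.hello()"
// Inside mongosh the same check is simply:
db.hello();
// For an arbiter the reply should include { arbiterOnly: true, ... }, which is
// what the driver uses during topology discovery to learn the node's role.
```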