May 2024

I am using MongoDB Community Edition to self-host a private replica set with two members:

  • Windows 10 (19045.4046) - mongod 7.0.7
  • Debian 12 - mongod 7.0.8

Without TLS enabled, the replica set functions perfectly.

I have now enabled TLS. I also generated a certificate authority and additional certificates for each server and some clients, all using RSA/SHA-256.

The servers have a config of the form:

storage:
  dbPath: <path to data>
systemLog:
  destination: file
  logAppend: true
  path: <path to log>
net:
  port: 27017
  bindIp: 0.0.0.0
  tls:
    mode: requireTLS
    CAFile: <path to ca file>
    certificateKeyFile: <path to key file>
replication:
  replSetName: rs0

I can successfully connect to both servers using mongosh or Compass, but there are issues with the state of the replica set.

rs.status() from the primary node:

{
  _id: 0,
  name: '<primary node hostname>:27017',
  health: 1,
  state: 1,
  stateStr: 'PRIMARY',
  uptime: 82,
  optime: { ts: Timestamp({ t: 1715243969, i: 1 }), t: Long('60') },
  optimeDate: ISODate('2024-05-09T08:39:29.000Z'),
  lastAppliedWallTime: ISODate('2024-05-09T08:39:29.172Z'),
  lastDurableWallTime: ISODate('2024-05-09T08:39:29.172Z'),
  syncSourceHost: '',
  syncSourceId: -1,
  infoMessage: 'Could not find member to sync from',
  electionTime: Timestamp({ t: 1715243909, i: 1 }),
  electionDate: ISODate('2024-05-09T08:38:29.000Z'),
  configVersion: 8,
  configTerm: 60,
  self: true,
  lastHeartbeatMessage: ''
},
{
  _id: 1,
  name: '<secondary node hostname>:27017',
  health: 1,
  state: 2,
  stateStr: 'SECONDARY',
  uptime: 79,
  optime: { ts: Timestamp({ t: 1715238754, i: 1 }), t: Long('52') },
  optimeDurable: { ts: Timestamp({ t: 1715238754, i: 1 }), t: Long('52') },
  optimeDate: ISODate('2024-05-09T07:12:34.000Z'),
  optimeDurableDate: ISODate('2024-05-09T07:12:34.000Z'),
  lastAppliedWallTime: ISODate('2024-05-09T07:12:34.747Z'),
  lastDurableWallTime: ISODate('2024-05-09T07:12:34.747Z'),
  lastHeartbeat: ISODate('2024-05-09T08:39:37.264Z'),
  lastHeartbeatRecv: ISODate('1970-01-01T00:00:00.000Z'),
  pingMs: Long('1'),
  lastHeartbeatMessage: '',
  syncSourceHost: '',
  syncSourceId: -1,
  infoMessage: '',
  configVersion: 8,
  configTerm: 52
}
]

rs.status() from the secondary node:

{
  _id: 0,
  name: '<primary node hostname>:27017',
  health: 0,
  stateStr: '(not reachable/healthy)',
  uptime: 0,
  optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
  optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
  optimeDate: ISODate('1970-01-01T00:00:00.000Z'),
  optimeDurableDate: ISODate('1970-01-01T00:00:00.000Z'),
  lastAppliedWallTime: ISODate('1970-01-01T00:00:00.000Z'),
  lastDurableWallTime: ISODate('1970-01-01T00:00:00.000Z'),
  lastHeartbeat: ISODate('2024-05-09T08:39:00.829Z'),
  lastHeartbeatRecv: ISODate('2024-05-09T08:38:59.331Z'),
  pingMs: Long('0'),
  lastHeartbeatMessage: 'Error connecting to <primary node hostname>:27017 (<ip address>:27017) :: caused by :: onInvoke :: caused by :: The client and server cannot communicate, because they do not possess a common algorithm.',
  syncSourceHost: '',
  syncSourceId: -1,
  infoMessage: '',
  configVersion: -1,
  configTerm: -1
},
{
  _id: 1,
  name: '<secondary node hostname>:27017',
  health: 1,
  state: 2,
  stateStr: 'SECONDARY',
  uptime: 53,
  optime: { ts: Timestamp({ t: 1715238754, i: 1 }), t: Long('52') },
  optimeDate: ISODate('2024-05-09T07:12:34.000Z'),
  lastAppliedWallTime: ISODate('2024-05-09T07:12:34.747Z'),
  lastDurableWallTime: ISODate('2024-05-09T07:12:34.747Z'),
  syncSourceHost: '',
  syncSourceId: -1,
  infoMessage: '',
  configVersion: 8,
  configTerm: 52,
  self: true,
  lastHeartbeatMessage: ''
}
]

Replication does seem to work to some extent: write operations can be performed when connected to either server (again with either mongosh or Compass). They hang and need to be explicitly terminated, but the changes are reflected in subsequent queries. This is not an issue when TLS is disabled.

Using

security:
  clusterAuthMode: x509

gives the same results, except an admin user needs to be created and authenticated before rs.status() can be invoked.

It seems that the issue might be related to TLS incompatibilities, and I suspect that Windows is the culprit.

I have tried different TLS versions in Windows’ Internet Options and different values of net.tls.disabledProtocols in the server configs, but I have not had any success.
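For reference, a minimal sketch of where net.tls.disabledProtocols sits in a config of the form shown earlier. The protocol list is purely an illustrative assumption (it would leave TLS 1.2 as the only protocol both members offer), not a confirmed fix:

```yaml
net:
  tls:
    mode: requireTLS
    # Illustrative value only (an assumption, not a verified fix):
    # disabling TLS 1.0, 1.1 and 1.3 leaves TLS 1.2 as the only protocol on offer.
    disabledProtocols: TLS1_0,TLS1_1,TLS1_3
```

The same value would need to be set on both members and each mongod restarted for it to take effect.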

Any help would be greatly appreciated!

To add to this, the secondary node’s logs are filled with the same error message:

{"t":{"$date":"2024-05-09T15:31:37.936+02:00"},"s":"I", "c":"REPL_HB", "id":23974, "ctx":"ReplCoord-6","msg":"Heartbeat failed after max retries","attr":{"target":"<primary node hostname>:27017","maxHeartbeatRetries":2,"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to <primary host name>:27017 (<ip address>:27017) :: caused by :: onInvoke :: caused by :: The client and server cannot communicate, because they do not possess a common algorithm."}}}

There are no traces of this in the primary node’s logs.

11 months later

My tip would be to capture the traffic with tcpdump and pay special attention to the signature algorithms in the TLS handshake. I have often had errors when mixing Windows and Linux clients with MongoDB. It seems that MongoDB prefers rsa_pkcs1_sha1 as the signature algorithm, even if better signature algorithms are available on client and server. With OpenSSL 3 you have to set CipherString = DEFAULT:@SECLEVEL=0 in the /etc/ssl/openssl.cnf file to enable SHA-1. With OpenSSL 1.x it is CipherString = DEFAULT:@SECLEVEL=1.
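As a rough sketch, the setting would typically live in the system-wide default section of the OpenSSL config on the Linux member. The section names below are an assumption based on a stock Debian openssl.cnf and may differ on other systems:

```ini
# /etc/ssl/openssl.cnf (section names assumed from a stock Debian file)
[openssl_init]
ssl_conf = ssl_sect

[ssl_sect]
system_default = system_default_sect

[system_default_sect]
# OpenSSL 3: security level 0 is needed to allow SHA-1 signatures.
# With OpenSSL 1.x the equivalent would be DEFAULT:@SECLEVEL=1.
CipherString = DEFAULT:@SECLEVEL=0
```

This lowers the security level for every application that uses the system OpenSSL defaults, not only mongod, and mongod has to be restarted to pick it up.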