I have a setup with a primary and a secondary node, both working fine with about 5 TB of data each. Recently I added a new node for replication, but it has been stuck syncing for 50 days. The initial sync keeps dropping and restarting, and I can't figure out why.

Here is what I have tried so far. The oplog was initially 50 GB; I increased it to 150 GB, but the problem persists. I also took an EBS snapshot of the volume, backed up the EC2 instance, and attached a volume created from the snapshot, but MongoDB fails to start every time. I deleted the log and mongod.lock files as suggested in forums, but that hasn't helped. I also increased the IOPS from 8k to 15k with no improvement.

I'm using MongoDB 6 with a GP3 volume type. Around 2 million records are inserted or updated per day. I'm really frustrated after trying so many things without success, and I need a proper way to get the new node to fully sync and work correctly.
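For context, this is how I checked and resized the oplog from mongosh (a minimal sketch using the standard commands; the 153600 MB value corresponds to the 150 GB I mentioned):

// Show the current oplog size and the time window it covers
rs.printReplicationInfo()

// Resize the oplog on this node (size is given in megabytes; 153600 MB = 150 GB)
db.adminCommand({ replSetResizeOplog: 1, size: 153600 })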

After going through the logs, I found this error:
[root@ip-10-10-101-243 ec2-user]# sed -n '/2025-04-06T02:20:28.700+00:00/,/2025-04-06T02:20:20/p' /mongodb/log/mongod.log
{"t":{"$date":"2025-04-06T02:20:28.700+00:00"},"s":"I", "c":"COMMAND", "id":20499, "ctx":"ftdc","msg":"serverStatus was very slow","attr":{"timeStats":{"after basic":0,"after activeIndexBuilds":0,"after asserts":0,"after batchedDeletes":0,"after bucketCatalog":0,"after catalogStats":0,"after connections":0,"after electionMetrics":0,"after extra_info":0,"after flowControl":0,"after globalLock":0,"after indexBulkBuilder":0,"after indexStats":0,"after locks":0,"after logicalSessionRecordCache":0,"after mirroredReads":0,"after network":0,"after opLatencies":0,"after opcounters":0,"after opcountersRepl":0,"after oplog":0,"after oplogTruncation":0,"after readConcernCounters":0,"after repl":0,"after scramCache":0,"after security":0,"after storageEngine":0,"after tcmalloc":10699,"after tenantMigrations":10699,"after trafficRecording":10699,"after transactions":10699,"after transportSecurity":10699,"after twoPhaseCommitCoordinator":10699,"after wiredTiger":10699,"at end":10700}}}
{"t":{"$date":"2025-04-06T02:20:41.373+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"10.10.111.108:48920","uuid":"4bf4b259-891c-47d5-8d96-2b39bb6077ed","connectionId":7319,"connectionCount":186}}
{"t":{"$date":"2025-04-06T02:20:41.375+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn7319","msg":"client metadata","attr":{"remote":"10.10.111.108:48920","client":"conn7319","doc":{"application":{"name":"publish.downloadinfocount"},"driver":{"name":"mongo-csharp-driver","version":"2.27.0"},"os":{"type":"Linux","name":"Debian GNU/Linux 12 (bookworm)","architecture":"x86_64"},"platform":".NET 8.0.14"}}}
{"t":{"$date":"2025-04-06T02:20:41.438+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"10.10.111.108:48928","uuid":"496ad60e-6674-4300-8d47-b23c26e7088b","connectionId":7320,"connectionCount":187}}

I researched this and it seems to be a resource issue, but when I checked resource utilization, it was not even at 50% on my EC2 instance or its volume. Please help me resolve this issue with my MongoDB replica set.
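In case it helps, these are the checks I have been running from mongosh while the new node is syncing (a sketch using standard commands; nothing here is specific to my setup):

// On the syncing node: replica set status plus initial sync progress (available since MongoDB 4.2.1)
db.adminCommand({ replSetGetStatus: 1, initialSync: 1 })

// On the primary: how far behind each secondary currently is
rs.printSecondaryReplicationInfo()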

Hi @khizar_b,
I am facing the same issue. We recently moved from Atlas to a self-hosted environment because of cost.
We are using MongoDB 7.
Physical Data Size: ~8TB
Logical Data Size: ~2TB

I also raised a similar query on the community yesterday ( Topic Link ), and we are currently running our production on a standalone deployment, which is not recommended. If you have found a solution, could you please share it with me too, or are you still looking for help?

Note: I'm not sure whether sharing other topics is relevant here, but I have shared it for now; please let me know if that violates any rules.

Hey @Mehul_Sanghvi,

I faced a similar issue after moving from Atlas to self-hosted. Check whether your oplog window is large enough to cover the time the initial sync takes, and confirm all nodes run the same MongoDB version and storage settings. If those look fine, reviewing the sync logs on both the new node and its sync source should reveal what is causing the failure.
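For the first two checks, here is a quick sketch using standard mongosh helpers, run on each node:

// Oplog size and the time window it currently covers
rs.printReplicationInfo()

// MongoDB version and storage engine on this node
db.version()
db.serverStatus().storageEngine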

Hey @Paul_Welch ,

I realized I was approaching this the wrong way. Before anything else, we need to ensure that all resources and configuration are identical across all three nodes.

  1. The first step is to convert the standalone instance into a single-node replica set (just the primary).
  2. Once that's done, we take an EBS snapshot of the primary's data volume and attach volumes created from it to the other two nodes.
  3. This ensures the replica set identity (stored in the local database) matches the primary.
  4. After that, we can simply run rs.add() to add the remaining nodes as secondaries and complete the PSS replica set, as sketched below.
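Roughly, the mongosh side of those steps looks like this (a sketch; the hostnames and replica set name are placeholders, and each mongod must already be started with the same replSetName and be reachable from the others):

// Step 1: on the standalone node, after restarting mongod with --replSet rs0
rs.initiate()   // creates a single-node replica set with this node as primary

// Step 4: once the other two nodes are running on volumes restored from the snapshot
rs.add("node-b.internal:27017")
rs.add("node-c.internal:27017")

// Confirm all three members reach PRIMARY/SECONDARY state
rs.status().members.map(m => ({ name: m.name, state: m.stateStr }))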