Thanks for your reply. We did identify a few of these causes, but we are not sure why, for example, a simple insert takes around 4 minutes during the same window as the replication lag, or whether that is a cause or a symptom of the lag. In the insert below, waitForWriteConcernDurationMillis (the time in milliseconds spent waiting for the write concern to be satisfied) is huge. Could that just be a symptom of the replication lag, given that the write requires acknowledgement from 2 nodes? (There is a small timing sketch after the log excerpt showing how we plan to test this.)
```
"totalOplogSlotDurationMicros":268,"ninserted":1,"keysInserted":5,"numYields":0,"reslen":230,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":2}},"FeatureCompatibilityVersion":{"acquireCount":{"w":2}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":2}},"Database":{"acquireCount":{"w":2}},"Collection":{"acquireCount":{"w":2}},"Mutex":{"acquireCount":{"r":2}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":1},"readConcern":{"level":"local","provenance":"implicitDefault"},
"writeConcern":{"w":"majority","wtimeout":0,"provenance":"implicitDefault"},"waitForWriteConcernDurationMillis":249425,
"storage":{"data":{"bytesRead":8365,"timeReadingMicros":17}},"remote":"192.168.254.110:13601","protocol":"op_msg","durationMillis":249425}}
```
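To try to separate the local write cost from the wait for replication acknowledgement, we were thinking of timing the same single-document insert under w:1 versus w:"majority", roughly like the sketch below. This is only an illustration, not something from the support ticket; the connection string, database, and collection names are placeholders, not our real ones.

```python
# Illustrative sketch: time one insert under w:1 and one under w:"majority"
# to see how much latency comes from waiting for secondary acknowledgement.
import time
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://192.168.254.110:13601/?replicaSet=rs0")  # placeholder URI and replica set
db = client["testdb"]  # placeholder database name

for wc in (WriteConcern(w=1), WriteConcern(w="majority", wtimeout=10000)):
    coll = db.get_collection("lag_probe", write_concern=wc)  # placeholder collection name
    start = time.monotonic()
    coll.insert_one({"probe": True, "ts": time.time()})
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"w={wc.document.get('w')}: acknowledged in {elapsed_ms:.1f} ms")
```

If the w:1 insert returns in milliseconds while the w:"majority" one stalls, that would suggest the time is being spent waiting for a secondary to catch up (i.e. a symptom of the lag) rather than on the write itself.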
We did raise a MongoDB support ticket for this, and they closed it after giving the pointers below. But what should our next steps be?
Replication lag on the secondary node 00-01 began to increase from Point A, coinciding with the spikes in write requests exceeding the configured maximum Disk IOPS of approximately 16K.
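To keep an eye on this ourselves, we were planning to estimate per-member lag by comparing each member's optimeDate against the primary's in replSetGetStatus; a rough sketch (placeholder connection string) would be:

```python
# Illustrative sketch: report each replica set member's lag behind the primary.
from pymongo import MongoClient

client = MongoClient("mongodb://192.168.254.110:13601/")  # placeholder URI
status = client.admin.command("replSetGetStatus")

primary_optime = next(m["optimeDate"] for m in status["members"] if m["stateStr"] == "PRIMARY")
for m in status["members"]:
    lag_secs = (primary_optime - m["optimeDate"]).total_seconds()
    print(f"{m['name']} ({m['stateStr']}): lag ~ {lag_secs:.0f} s")
```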
Between points A and B, we observed a high ss wt cache fill ratio, indicating significant utilization of the WT cache. Typically, eviction threads start removing cache content when this ratio falls between 80% and 95%.
Additionally, there was a notable rise in ss wt cache dirty fill ratio, representing the percentage of the total WT cache size that is dirty. Eviction threads begin evicting pages when this value reaches 5%, and it peaked at approximately 4.855%.
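If we understand correctly, both ratios can be recomputed from serverStatus, so we could sample them ourselves around Point A; a rough sketch (placeholder connection string, using the standard WiredTiger cache statistic names) is:

```python
# Illustrative sketch: compute the WiredTiger cache fill ratio and dirty fill ratio
# from serverStatus, the same ratios the support note refers to.
from pymongo import MongoClient

client = MongoClient("mongodb://192.168.254.110:13601/")  # placeholder URI
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

max_bytes = cache["maximum bytes configured"]
fill_ratio = cache["bytes currently in the cache"] / max_bytes * 100
dirty_ratio = cache["tracked dirty bytes in the cache"] / max_bytes * 100
print(f"cache fill ratio:  {fill_ratio:.2f}%  (eviction pressure roughly above 80-95%)")
print(f"cache dirty ratio: {dirty_ratio:.2f}%  (dirty-page eviction roughly above 5%)")
```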
Around Point A, when the replication lag was approximately 4 minutes, we noticed spikes in ss metrics repl buffer count, which indicates the number of operations stored in the local replication buffer waiting to be applied.
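Presumably this corresponds to serverStatus metrics.repl.buffer on the lagging secondary, which we could poll directly; a sketch is below (placeholder hostname, and these field names may vary slightly across server versions):

```python
# Illustrative sketch: read the replication buffer counters on the lagging secondary.
from pymongo import MongoClient

# placeholder host; directConnection ensures serverStatus is read from that member
client = MongoClient("mongodb://secondary-00-01:27017/", directConnection=True)
buf = client.admin.command("serverStatus")["metrics"]["repl"]["buffer"]
print(f"ops waiting to be applied: {buf['count']}")
print(f"buffer usage: {buf['sizeBytes']} of {buf['maxSizeBytes']} bytes")
```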
Regarding queries, there were spikes in opcounters. During the same period, we observed very high ss metrics queryExecutor scannedObjects, indicating the number of documents scanned per second during query execution. A high ratio of "scannedObjects" to "scanned" suggests poorly indexed or unindexed queries.
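To quantify this on our side, we could sample the queryExecutor counters over an interval and compute the scannedObjects-to-scanned ratio; a rough sketch (placeholder connection string, 60-second window chosen arbitrarily) is:

```python
# Illustrative sketch: sample queryExecutor counters twice and compute the ratio of
# documents examined to index keys examined over the interval.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://192.168.254.110:13601/")  # placeholder URI

def query_executor_counters():
    return client.admin.command("serverStatus")["metrics"]["queryExecutor"]

before = query_executor_counters()
time.sleep(60)  # arbitrary sampling window
after = query_executor_counters()

keys_examined = after["scanned"] - before["scanned"]
docs_examined = after["scannedObjects"] - before["scannedObjects"]
ratio = docs_examined / keys_examined if keys_examined else float("inf")
print(f"index keys examined: {keys_examined}, documents examined: {docs_examined}, ratio: {ratio:.1f}")
```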
Query activity included insert and update operations, as evidenced by spikes in ss opcountersRepl insert and ss opcountersRepl update on the primary. These metrics correlate with the disk nvme1n1 write requests issued, indicating a significant number of write operations during the periods of replication lag.
Furthermore, due to the heavy insert workload, disk utilization spiked; concurrently, read/write requests issued reached around 30K IOPS, exceeding the configured maximum Disk IOPS of approximately 16K.
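To confirm whether we are actually saturating the provisioned IOPS during these windows, we could sample the per-disk I/O counters on the host (assuming psutil is available there); a rough sketch for the nvme1n1 device mentioned above:

```python
# Illustrative sketch: estimate read/write IOPS on the data volume by sampling
# per-disk I/O counters twice and dividing by the interval.
import time
import psutil

DISK = "nvme1n1"  # device name taken from the support note; adjust if different
INTERVAL = 10     # seconds, chosen arbitrarily

before = psutil.disk_io_counters(perdisk=True)[DISK]
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)[DISK]

read_iops = (after.read_count - before.read_count) / INTERVAL
write_iops = (after.write_count - before.write_count) / INTERVAL
print(f"{DISK}: ~{read_iops:.0f} read IOPS, ~{write_iops:.0f} write IOPS (provisioned limit ~16K)")
```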