A few weeks ago, we upgraded our MongoDB sharded replication environment from 4.2 to 4.4.28. After the upgrade, we observed that replication stalls all of a sudden in our main production environment.
I am seeing this error:
{"t":{"$date":"2024-03-13T14:38:13.265+00:00"},"s":"I", "c":"REPL", "id":21275, "ctx":"ReplCoordExtern-1","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1710328506,"i":1899}},"t":99},"attemptsRemaining":1,"error":"CursorNotFound: Error while getting the next batch in the oplog fetcher :: caused by :: cursor id 4877507614192920788 not found"}}
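To make the two timestamps in that entry easier to compare, here is a minimal Python sketch that parses the log line (with plain ASCII quotes) and computes how far the last fetched oplog entry lags behind the time the error was logged. The function name `fetch_gap` is hypothetical and only for illustration, and reading the gap as "time since the fetcher last received an entry" is an assumption, not something the log states.

```python
import json
from datetime import datetime, timezone

# The log entry from above, with straight quotes so it parses as JSON.
LOG_LINE = (
    '{"t":{"$date":"2024-03-13T14:38:13.265+00:00"},"s":"I","c":"REPL","id":21275,'
    '"ctx":"ReplCoordExtern-1","msg":"Recreating cursor for oplog fetcher due to error",'
    '"attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1710328506,"i":1899}},"t":99},'
    '"attemptsRemaining":1,"error":"CursorNotFound: Error while getting the next batch '
    'in the oplog fetcher :: caused by :: cursor id 4877507614192920788 not found"}}'
)


def fetch_gap(line):
    """Return (time the entry was logged) - (lastOpTimeFetched) as a timedelta."""
    entry = json.loads(line)
    # Wall-clock time at which mongod wrote this log entry.
    logged_at = datetime.fromisoformat(entry["t"]["$date"])
    # "t" inside $timestamp is seconds since the Unix epoch.
    seconds = entry["attr"]["lastOpTimeFetched"]["ts"]["$timestamp"]["t"]
    last_fetched = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return logged_at - last_fetched


if __name__ == "__main__":
    print(fetch_gap(LOG_LINE))  # prints 3:23:07.265000
```

For this entry the gap is about 3 hours 23 minutes, which is well inside the roughly 22 hours of oplog we keep.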
After some research, we upgraded MongoDB from 4.4.28 to 4.4.29 two days ago, based on the report described in
which claims that the bug is fixed. But this morning, at 4 AM my time, the replication stalled again. The issue is marked as fixed, but I don't think it is.
FYI, I was running "compact" against a 3 TB collection. Could that cause the problem? Our 4.4.28 replication had stalled a few times even when we weren't running "compact".
This happens on a secondary, and once the replication stalls, the CPU and disk usage drop to almost nil.
We have constant traffic 24/7. It does have a lot of writes, but it is constantly busy. The stall does not happen when MongoDB is especially busy; it just happens all of a sudden. We have about 22 hours of oplog. When I restart the mongod process, replication starts again, so we hadn't fallen outside the oplog window.
Again, this wasn't an issue when we were running on 4.2 for 1-2 years. Before that we were on 3.6, before that on 3.2, and before that on 2.8. The server has been running for 6-7 years.
We are still having this problem. Replication stalls 2-3 times a week. Restarting mongod fixes the problem. I am seeing this in the log, if that helps.