We have a cluster with a collection of ~1M documents. Each document is also moderately large, at ~1 KB.
I tried using Mongosync to sync this remote cluster to a MongoDB cluster running on my machine. During the collection copy phase, however, the tool frequently gets “stuck”: the sync stays in the collection copy phase and mongod continuously consumes a large amount of CPU, but the mongosync logs do not show any increase in `estimatedCopiedBytes`.
Recently I had to stop the sync during the collection copy phase and start it again after several hours. I noticed that the progress, as measured by `estimatedCopiedBytes`, was still stuck, with mongod still using a lot of CPU. Stopping mongosync did not stop mongod’s CPU usage, so I used mongosh to run `db.currentOp()` to figure out what was going on.
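(Roughly what I ran — the namespace below is a placeholder; `db.currentOp()` accepts a filter document to narrow the output to active operations on one collection:)

```js
// Show only active operations on the collection being copied
// ("mydb.MyCollection" is a placeholder namespace)
db.currentOp({ active: true, ns: "mydb.MyCollection" })
```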
I noticed that mongosync had issued several commands to delete entire `_id` ranges, roughly the following:
```js
db.MyCollection.deleteMany({
  $and: [
    { '$expr': { '$gte': [ '$_id', 'some id' ] } },
    { '$expr': { '$lt': [ '$_id', 'a greater id' ] } }
  ]
})
```
Which is fair, but the deletion itself was also progressing very slowly, at a few hundred documents per second. Mind you, I have a fairly recent machine with a half-decent CPU and a fast NVMe SSD. So I ran an `explain()` on the query, and the results were interesting:
```js
db.MyCollection.find(
  { $and: [
      { '$expr': { '$gte': [ '$_id', 'some id' ] } },
      { '$expr': { '$lt': [ '$_id', 'a greater id' ] } }
  ] },
  { _id: 1 }
).explain()

winningPlan: {
  stage: 'PROJECTION_SIMPLE',
  transformBy: { _id: 1 },
  inputStage: {
    stage: 'FETCH',
    filter: {
      '$and': [
        { '$expr': { '$gte': [ '$_id', { '$const': 'some id' } ] } },
        { '$expr': { '$lt': [ '$_id', { '$const': 'a greater id' } ] } }
      ]
    },
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { _id: 1 },
      indexName: '_id_',
      isMultiKey: false,
      multiKeyPaths: { _id: [] },
      isUnique: true,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { _id: [ '["some id", "a greater id")' ] }
    }
  }
}
```
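The execution statistics tell the same story; a quick way to see the per-document work, using the same placeholder bounds (I’d expect `totalDocsExamined` to be non-zero here and zero for the covered form shown further down):

```js
// Ask for execution statistics and check how many documents were fetched
db.MyCollection.find(
  { $and: [
      { '$expr': { '$gte': [ '$_id', 'some id' ] } },
      { '$expr': { '$lt': [ '$_id', 'a greater id' ] } }
  ] },
  { _id: 1 }
).explain('executionStats').executionStats.totalDocsExamined
```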
This does not look efficient. Not only is MongoDB doing a completely unnecessary fetch and filter, it does not even realize that the `_id` index covers the query. Changing the query to an equivalent one without `$expr` fixes the issue:
```js
db.MyCollection.find(
  { $and: [ { _id: { $gte: 'some id' } }, { _id: { $lt: 'a greater id' } } ] },
  { _id: 1 }
).explain()

winningPlan: {
  stage: 'PROJECTION_COVERED',
  transformBy: { _id: 1 },
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { _id: 1 },
    indexName: '_id_',
    isMultiKey: false,
    multiKeyPaths: { _id: [] },
    isUnique: true,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { _id: [ '["some id", "a greater id")' ] }
  }
}
```
That looks much better, and is how I’d expect MongoDB to run the query: a bounded index scan over the `_id` index feeding a covered projection, with no fetch to be seen.
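If mongosync’s range deletes were expressed the same way, I’d expect them to avoid the per-document `$expr` evaluation as well. For comparison, a sketch of the equivalent delete without `$expr` (same placeholder bounds as above; I haven’t benchmarked this against the original form):

```js
// Equivalent _id range delete expressed with plain comparison operators,
// so the filter can be satisfied by the _id index bounds alone
db.MyCollection.deleteMany({ _id: { $gte: 'some id', $lt: 'a greater id' } })
```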
So I guess my question is two-fold:

- Is this a known limitation of the query planner with `$expr` queries? We are using MongoDB 6.0.
- Why is mongosync deleting documents in the first place? And how can I avoid it getting stuck?