Quick update: we ended up forcing the problematic primary to step down (stepDown with force) and removed it from the shard. We then updated the shard's record in the config db (db.shards.update) to reflect the new primary's host and port. The issues are gone now.
Thanks for the support, Kobe!