We have identified a set of bugs in Ops Manager (v4.2 and v4.4)/Cloud Manager/Atlas Legacy Backups which can lead to backup snapshot corruption on clusters running MongoDB 4.2 or greater. These bugs do not affect the data on the source clusters in any way. However, data recovery from the corrupted snapshot is not possible at this time. See the “Impact” section below for more information.
Ops Manager versions 4.4.4 and 4.2.20 contain the fix for these bugs and are available now. All Ops Manager users running earlier versions of Ops Manager and backing up MongoDB 4.2+ with FCV 4.2+ should upgrade Ops Manager as soon as possible. See impact details and workaround options below.
Cloud Manager already contains the fix. Customers who need to restore a Legacy Backup should use a snapshot created before April 21, 2020 or after September 27, 2020 to ensure a viable snapshot.
Atlas already contains the fix. Customers who need to restore a Legacy Backup should use a snapshot created before August 11, 2020 or after September 27, 2020. This issue does not apply to customers using Cloud Provider Snapshots (recently renamed “Cloud Backups”).
Customers who perform an automated restore or manually extract a downloaded snapshot without encountering error messages have not been impacted by the bugs in this advisory and such operations should complete successfully. A failure to restore from a corrupted snapshot will be evident.
Only backup snapshots which were generated under the following conditions may be impacted:
Ops Manager Backup created utilizing MongoDB 4.2+ with FCV 4.2+
Cloud Manager Backup Snapshot created between April 21, 2020 and September 27, 2020
Atlas Legacy Backup Snapshot created between August 11, 2020 and September 27, 2020.
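The date windows above can be checked mechanically when reviewing a list of snapshots. The following is a minimal sketch, not an official tool: the product keys and the function name are hypothetical, and treating both boundary dates as part of the affected window is an assumption based on the "before ... or after ..." guidance earlier in this advisory. Ops Manager snapshots have no date window here and depend on the Ops Manager version instead.

```python
from datetime import date

# Date windows as stated in this advisory. Treating both end dates as
# inclusive is an assumption; the keys are hypothetical labels.
AFFECTED_WINDOWS = {
    "cloud_manager": (date(2020, 4, 21), date(2020, 9, 27)),
    "atlas_legacy": (date(2020, 8, 11), date(2020, 9, 27)),
}


def may_be_affected(product: str, snapshot_date: date) -> bool:
    """Return True if a snapshot taken on snapshot_date falls in the window."""
    start, end = AFFECTED_WINDOWS[product]
    return start <= snapshot_date <= end
```

A snapshot that falls inside the window is only potentially affected; one outside the window is the safer choice for a restore.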
Restoration of an affected snapshot will exhibit the following behavior:
An attempt to download an affected backup snapshot will result in a failure with an "invalid tar header" error message. Once the tar is downloaded, it will fail to uncompress, potentially leaving automation in an incomplete state and requiring intervention. In the case where blocks themselves are corrupted, restoring the files to the dbpath will result in the failure to start the mongod node, leaving the node itself in a corrupted state.
If an automated restore is initiated using an affected snapshot, the restore process will initiate but will appear to hang indefinitely. The process will need to be cancelled by MongoDB Support.
Attempting to perform a queryable restore on an affected snapshot will also fail. The queryable restore may fail to mount, or querying an affected collection itself may return an error.
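For a manually downloaded snapshot, the archive can be test-read before any files are copied into a dbpath. The sketch below uses only Python's standard tarfile module; the function name is hypothetical, and the check is an illustration rather than a procedure prescribed by this advisory. It detects an unreadable archive (such as an invalid tar header) but cannot detect corrupted blocks inside otherwise readable files.

```python
import tarfile
import zlib


def archive_is_readable(path: str) -> bool:
    """Read every member of a tar archive; return False on any read error."""
    try:
        # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
        with tarfile.open(path, mode="r:*") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Stream the contents in 1 MiB chunks without storing them.
                    while extracted.read(1024 * 1024):
                        pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

Running this against a copy of the download, before touching the target node, avoids leaving a mongod dbpath partially populated.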
The data on the snapshot source cluster has not experienced any loss or any impact to its integrity.
For Cloud Manager or Atlas customers, no action is required.
While upgrading to a fixed version will resolve this issue, Ops Manager customers who wish to stay on their current version but avoid this issue may perform the following steps:
Add this property to your conf-mms.properties configuration file: mms.featureFlag.backup.incrementalWtEnabled=disabled
Restart the Ops Manager App Server.
After the restart, the block size of all backups using Blockstore Snapshot Storage should be updated to 16MB. When the next snapshot is performed, it will be a complete snapshot, without the potential corruption.
Please note: If you subsequently upgrade Ops Manager, be sure to remove this property from your configuration to regain access to the incremental backup feature.
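The property change in the steps above, and its later removal after an upgrade, can be sketched as plain text manipulation of the properties file contents. This is an illustrative sketch only: the helper names are hypothetical, the code operates on a string rather than on a real file because the location of conf-mms.properties depends on the installation, and the simple key=value parsing is an assumption about the file's layout.

```python
FLAG = "mms.featureFlag.backup.incrementalWtEnabled"


def _without_key(text: str, key: str) -> list:
    """Return the lines of text, dropping any 'key=value' entry for key."""
    return [
        line for line in text.splitlines()
        if line.split("=", 1)[0].strip() != key
    ]


def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key=value set exactly once."""
    lines = _without_key(text, key)
    lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def remove_property(text: str, key: str) -> str:
    """Return properties text with any entry for key removed."""
    lines = _without_key(text, key)
    return "\n".join(lines) + ("\n" if lines else "")
```

Calling set_property(contents, FLAG, "disabled") corresponds to the workaround, and remove_property(contents, FLAG) corresponds to the cleanup after upgrading. The Ops Manager App Server still needs to be restarted for a change to take effect.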
Modifying a custom setting in Ops Manager.
This can be used to set the mms.featureFlag.backup.incrementalWtEnabled flag.