We have identified certain scenarios in our Atlas Serverless system that may lead to data integrity issues for Serverless instances under specific conditions.
To balance load and maintain a high quality of service, the Atlas Serverless system occasionally moves serverless instance data between database servers. We have identified two independent issues in this live migration system.
As part of the live migration process, existing traffic from serverless application clients must be rerouted from donor servers to recipient servers once the migration is complete. This rerouting happens during a critical cutover, in which routing metadata is updated to reflect the new hosting location of the migrated instance data so that all subsequent client traffic is directed to the correct server. We have identified a potential race condition that may temporarily affect the routing of operations during the cutover period. Specifically, operations performed on connections established before the cutover may still be directed to the donor server rather than the recipient server, even after the migration has completed. This incorrect routing could result in stale data being returned for read operations or data loss for write operations. We estimate that this window lasts from milliseconds to two seconds under normal network conditions.
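The following is a toy model of this failure mode, not the Atlas Serverless routing code. The `Server`, `RoutingTable`, and `Connection` classes are hypothetical, and the model assumes that a connection resolves its target server once, when it is established. Under that assumption, a connection opened before the cutover keeps reading from and writing to the donor.

```python
class Server:
    """Hypothetical in-memory stand-in for a database server."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class RoutingTable:
    """Hypothetical routing metadata: records the current host of an instance."""
    def __init__(self, host):
        self.host = host


class Connection:
    """Assumption: the target server is resolved once, at connection time."""
    def __init__(self, routing):
        self.target = routing.host

    def write(self, key, value):
        self.target.data[key] = value

    def read(self, key):
        return self.target.data.get(key)


donor, recipient = Server("donor"), Server("recipient")
routing = RoutingTable(donor)

# A client connects and writes before the migration.
old_conn = Connection(routing)
old_conn.write("balance", 100)

# Cutover: data is copied and the routing metadata now names the recipient.
recipient.data = dict(donor.data)
routing.host = recipient

# A connection opened after the cutover reaches the recipient.
new_conn = Connection(routing)
new_conn.write("balance", 150)

# The pre-cutover connection still targets the donor.
stale_read = old_conn.read("balance")   # donor's old value, not 150
old_conn.write("status", "paid")        # lands on the donor only
lost_write = new_conn.read("status")    # the recipient never saw this write
```

In this model `stale_read` is the pre-migration value and `lost_write` is `None`, which corresponds to the stale-read and lost-write outcomes described above.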
Upon completion of the live migration, any active database commands on the migrated database or collection on the donor servers are interrupted. Atlas Serverless then automatically retries these commands on the recipient servers to minimize disruption for serverless application clients. However, we have identified that certain multi-write, non-transactional database commands are not safely auto-retried by Atlas Serverless, which can lead to incorrect update and delete behavior. Specifically:
The issues noted above, while related to the same system, are independent. For example, a serverless instance that has undergone a data migration might suffer from none, either, or both of those issues.
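As a general illustration of why retrying a multi-write, non-transactional command can be unsafe, the sketch below reruns an interrupted multi-document increment from the start. It is not a description of the Atlas Serverless retry mechanism. The helpers `update_many_inc` and `naive_retry` are hypothetical, and the sketch assumes that writes applied before the interruption remain visible to the retried command.

```python
def update_many_inc(docs, field, amount, fail_after=None):
    """Increment `field` on every document, one document at a time.

    Hypothetical helper: raises after `fail_after` documents have been
    updated, to mimic a command interrupted partway through.
    """
    for i, doc in enumerate(docs):
        if fail_after is not None and i == fail_after:
            raise ConnectionError("interrupted mid-command")
        doc[field] += amount


def naive_retry(docs, field, amount, fail_after):
    """Rerun the whole command from the start after an interruption."""
    try:
        update_many_inc(docs, field, amount, fail_after)
    except ConnectionError:
        # The first attempt already modified some documents; nothing here
        # accounts for that partial progress.
        update_many_inc(docs, field, amount)


docs = [{"_id": 1, "qty": 10}, {"_id": 2, "qty": 10}, {"_id": 3, "qty": 10}]
naive_retry(docs, "qty", 5, fail_after=2)
quantities = [d["qty"] for d in docs]
```

The intended result is 15 for every document, but the two documents updated before the interruption are incremented twice and end at 20. An idempotent operation, such as setting a field to a fixed value, would not show this effect under the same retry.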
We resolved these issues within the Atlas Serverless control plane as of June 20, 2024. Serverless instances that have undergone at least one data migration in the past may have been affected. Due to the nature of these issues, we are unable to definitively confirm which instances were impacted. Our internal assessment suggests that such incidents are rare, but we are issuing this notice out of an abundance of caution.
If you are concerned about the impact of these issues, we recommend that you cross-reference your data with any other records that can help verify its integrity. If you have further questions, please open a support case or start a chat with the Atlas Support team.
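One possible way to carry out such a cross-reference is sketched below, assuming you hold an independent copy of the expected documents (for example, from an upstream system or an export) and that each document has a unique `_id`. The `fingerprint` and `compare` helpers are hypothetical and work on plain Python dictionaries. Loading the documents from your database and from your reference source is left out.

```python
import hashlib
import json


def fingerprint(doc):
    """Return a stable hash of a document, independent of key order."""
    canonical = json.dumps(doc, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare(reference_docs, database_docs, key="_id"):
    """Report ids that are missing, unexpected, or different in the database."""
    ref = {d[key]: fingerprint(d) for d in reference_docs}
    db = {d[key]: fingerprint(d) for d in database_docs}
    return {
        "missing": sorted(ref.keys() - db.keys()),
        "unexpected": sorted(db.keys() - ref.keys()),
        "mismatched": sorted(
            k for k in ref.keys() & db.keys() if ref[k] != db[k]
        ),
    }


# Illustrative data only.
reference = [{"_id": 1, "qty": 15}, {"_id": 2, "qty": 15}, {"_id": 3, "qty": 15}]
database = [{"_id": 1, "qty": 20}, {"_id": 2, "qty": 15}]
report = compare(reference, database)
```

With the sample data, `report` lists `_id` 3 as missing and `_id` 1 as mismatched. Any difference found this way is only a starting point for investigation, since legitimate application writes also change documents.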