Razorpay RDS Multi-AZ Failover and Data Loss in December 2019

On a Friday evening in December 2019, Razorpay experienced an outage when its main application, “API”, began returning 5xx errors because it could not connect to its RDS (MySQL) database. An AWS Multi-AZ failover occurred, and the application recovered automatically after approximately 110 seconds. However, the failover broke replication on all five replica instances, which then had to be recreated, a process that took six hours.

The following day, during follow-up calls with AWS RDS tech support, the team discovered that a few records were missing from the database, indicating data loss. For a fintech company, losing transactional data is critical. Further investigation revealed that 5 seconds of data were lost around the time of the failover. The team recovered the lost data during an 8-hour marathon meeting by digging through binary logs and application trace logs.

The root cause was identified as an incorrect MySQL configuration parameter, innodb_flush_log_at_trx_commit, which was set to 2 instead of the recommended 1. With a value of 2, InnoDB writes the redo log at each commit but flushes it to disk only about once per second, so a committed transaction is not guaranteed to be on disk when the commit returns. Consequently, during the Multi-AZ failover, the standby instance (which became the new primary) did not have the most recently committed data, leading to the data loss.
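
A check for this kind of misconfiguration could be sketched as follows. This is a minimal illustration, not Razorpay's tooling: the function name audit_durability and the idea of passing in a dictionary of variables (for example, collected from SHOW GLOBAL VARIABLES) are assumptions made for the example. The two expected values are the durable settings named in this postmortem.

```python
# Durable values for the two parameters discussed in this postmortem.
DURABLE_SETTINGS = {
    "innodb_flush_log_at_trx_commit": "1",
    "sync_binlog": "1",
}


def audit_durability(variables):
    """Return (name, actual, expected) for each setting that is not durable.

    `variables` is a hypothetical dict of MySQL variable names to values,
    e.g. built from the rows returned by SHOW GLOBAL VARIABLES.
    """
    findings = []
    for name, expected in DURABLE_SETTINGS.items():
        actual = str(variables.get(name, "missing"))
        if actual != expected:
            findings.append((name, actual, expected))
    return findings


# The configuration described in the incident: redo log flushing relaxed,
# binary log flushing strict.
incident_config = {"innodb_flush_log_at_trx_commit": 2, "sync_binlog": 1}
for name, actual, expected in audit_durability(incident_config):
    print(f"{name} is {actual}, expected {expected}")
```

Run against the configuration from the incident, the check flags only innodb_flush_log_at_trx_commit, which matches the root cause above.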

The sync_binlog parameter, by contrast, was correctly set to 1, so binary logs were flushed to disk at every commit. The replicas read these binary log events and applied them, but because the same transactions were not durable on the primary, they were missing from the new primary after the failover. When the new primary then inserted the same data, the replicas hit duplicate-key errors, which broke replication on all of them. The remediation involved changing innodb_flush_log_at_trx_commit to 1 on the primary instance and correcting the lost data.
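
The sequence of events can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not a reproduction of MySQL internals: the Replica class, the DuplicateKeyError exception, and the use of sequential integer keys (standing in for an auto-increment primary key that the new primary reuses) are all hypothetical.

```python
class DuplicateKeyError(Exception):
    """Raised when a replica is asked to insert a key it already holds."""


class Replica:
    def __init__(self):
        self.rows = {}

    def apply(self, key, value):
        # A replica applying an insert for an existing key cannot continue.
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = value


def simulate_failover():
    replica = Replica()
    durable_on_primary = {}  # rows that survive the failover

    # Rows 1-3: committed, durable on the primary, and replicated.
    for key in (1, 2, 3):
        durable_on_primary[key] = f"txn-{key}"
        replica.apply(key, f"txn-{key}")

    # Rows 4-5: the binary log was flushed, so the replica applied them,
    # but the redo log was not yet on disk, so the failover loses them.
    for key in (4, 5):
        replica.apply(key, f"txn-{key}")

    # After failover, the new primary only knows rows 1-3 and reuses key 4.
    next_key = max(durable_on_primary) + 1
    durable_on_primary[next_key] = "txn-new"
    try:
        replica.apply(next_key, "txn-new")
    except DuplicateKeyError as exc:
        return f"replication stopped: duplicate key {exc.args[0]}"
    return "replication healthy"


print(simulate_failover())  # replication stopped: duplicate key 4
```

In this model the replica ends up ahead of the new primary, which is why the fix had to restore durability on the primary rather than change anything on the replicas.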
