Twilio billing system incident of July 2013
Twilio · billing system
On July 18, 2013, Twilio experienced an incident with its billing system. A temporary network partition caused all Redis slaves to disconnect simultaneously and then request a full synchronization from the master. The load from these simultaneous resynchronizations overwhelmed the master, degrading and eventually failing the services that relied on it.
The overload on the Redis master was misdiagnosed and the master was restarted, but it came back up with an incorrect configuration. That configuration made the master attempt recovery from a nonexistent AOF (append-only file), which lost all in-flight account balance data. It also caused the master to boot in read-only mode, so account balances could not be updated.
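The failure mode described above, a restart that picks up a configuration pointing at a missing AOF, is the kind of thing a pre-restart check could catch. The following is a minimal sketch only, not Twilio's tooling: `parse_redis_conf` and `preflight_problems` are hypothetical helper names, and the check assumes the single-file AOF layout controlled by the `appendonly`, `appendfilename`, and `dir` directives. The summary does not say why the master booted read-only; the replica check below covers one possible cause and is labeled as an assumption.

```python
import os


def parse_redis_conf(text):
    """Parse 'directive value' lines from redis.conf-style text into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip().strip('"') if len(parts) > 1 else ""
        settings[key] = value
    return settings


def preflight_problems(settings):
    """Return reasons NOT to restart a node intended to be the master."""
    problems = []

    # AOF recovery is enabled, so the file it will load from must exist.
    if settings.get("appendonly", "no").lower() == "yes":
        aof_path = os.path.join(
            settings.get("dir", "."),
            settings.get("appendfilename", "appendonly.aof"),
        )
        if not os.path.exists(aof_path):
            problems.append(
                "appendonly is enabled but %s does not exist" % aof_path
            )

    # Assumption: a node configured as a replica would not accept balance
    # writes. The incident summary does not state that this was the cause.
    for directive in ("slaveof", "replicaof"):
        if directive in settings:
            problems.append(
                "%s is set; this node would not start as a master" % directive
            )

    return problems
```

A restart script could refuse to proceed whenever `preflight_problems(parse_redis_conf(conf_text))` returns a non-empty list, which turns a silent data-loss path into an explicit stop.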
With account balances effectively reset to zero and the system unable to update them, Twilio’s auto-recharge system repeatedly attempted to charge customer credit cards for usage. As a result, 1.4% of Twilio’s customers received multiple erroneous charges, and some of their accounts were suspended. All customers also experienced delays in usage reports.
Twilio engineers took the billing system offline multiple times to prevent further charges and restore data. Refunding all erroneous charges took approximately 24 hours. Every impacted account also received a credit equal to 10% of its Twilio spend over the previous 30 days.
As part of the remediation, the original Redis cluster was replaced and the incorrect configuration was corrected. To prevent recurrence, direct restarts of the Redis master have been disabled; future recoveries will instead pivot to a slave. The critical flaw in the auto-recharge system, which failed dangerously by continuing to charge cards when balances could not be updated, is also being addressed.
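To illustrate what "failing safely" could mean for an auto-recharge path, here is a minimal sketch. It is not Twilio's fix, which the summary does not describe. The class `RechargeGuard`, its parameters, and the `store_writable` flag are all hypothetical. The idea is to refuse to charge whenever the balance store cannot be updated, and to cap how often any one account can be charged within a time window.

```python
import time


class RechargeGuard:
    """Decide whether an auto-recharge may proceed, refusing when in doubt."""

    def __init__(self, max_charges=1, window_seconds=3600, clock=time.time):
        self.max_charges = max_charges        # charges allowed per window
        self.window_seconds = window_seconds  # length of the window
        self.clock = clock                    # injectable for testing
        self._history = {}                    # account_id -> charge timestamps

    def may_charge(self, account_id, store_writable):
        """Return True only if charging is safe under both rules."""
        # Rule 1: if the balance cannot be updated after a charge,
        # do not charge at all (fail safe, not fail dangerous).
        if not store_writable:
            return False

        # Rule 2: cap repeated charges per account within the window.
        now = self.clock()
        recent = [
            t for t in self._history.get(account_id, [])
            if now - t < self.window_seconds
        ]
        self._history[account_id] = recent
        return len(recent) < self.max_charges

    def record_charge(self, account_id):
        """Record a completed charge so later checks can count it."""
        self._history.setdefault(account_id, []).append(self.clock())
```

A caller would check `may_charge(...)` before contacting the payment processor and call `record_charge(...)` only after a charge succeeds. Under this sketch, a read-only balance store results in no charges rather than repeated ones.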