Slack Outage on January 4th, 2021

Slack · Slack

On January 4th, 2021, Slack experienced a significant outage that began during the morning in the Americas. Error rates started to climb, and by 7:00 AM PST Slack was unavailable because of widespread network degradation and saturation of its web tier. A mini-peak in traffic at the top of the hour made matters worse, increasing packet loss and latency.

During the incident, automated systems marked instances as unhealthy and attempted to replace them, while autoscaling shrank the web tier because CPU utilization appeared low. An attempt to scale up manually by adding 1,200 servers failed because Slack’s provision-service, which configures new instances, became overloaded. Because provision-service communicated over the same degraded network, it hit resource bottlenecks (the Linux open files limit and an AWS quota), which prevented new healthy instances from coming online.
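The downscaling behavior described above can be illustrated with a small guard that refuses to shrink a tier while it looks unhealthy. This is a minimal sketch, not Slack's autoscaler: the `TierMetrics` type, the `scaling_decision` function, and every threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TierMetrics:
    """Hypothetical snapshot of web tier metrics (all values are fractions, 0.0 to 1.0)."""
    cpu_utilization: float
    error_rate: float
    packet_loss: float


def scaling_decision(
    m: TierMetrics,
    cpu_low: float = 0.25,          # assumed scale-down threshold
    cpu_high: float = 0.70,         # assumed scale-up threshold
    max_error_rate: float = 0.01,   # assumed "healthy" error ceiling
    max_packet_loss: float = 0.001, # assumed "healthy" packet loss ceiling
) -> str:
    """Return "scale_up", "scale_down", or "hold" for one evaluation cycle."""
    if m.cpu_utilization >= cpu_high:
        return "scale_up"
    if m.cpu_utilization <= cpu_low:
        # Low CPU alone is ambiguous: the tier may be idle, or its workers
        # may be stalled on a degraded network. Only shrink when the other
        # signals also look healthy.
        if m.error_rate > max_error_rate or m.packet_loss > max_packet_loss:
            return "hold"
        return "scale_down"
    return "hold"


# Low CPU during a network incident: the guard holds capacity.
print(scaling_decision(TierMetrics(cpu_utilization=0.10, error_rate=0.20, packet_loss=0.05)))
# Low CPU on a quiet, healthy tier: shrinking is allowed.
print(scaling_decision(TierMetrics(cpu_utilization=0.10, error_rate=0.0, packet_loss=0.0)))
```

The design choice here is to treat CPU as a necessary but not sufficient signal for scaling down, so a tier that is idle only because requests are failing is left alone.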

The root cause was identified as network saturation in an AWS Transit Gateway (TGW) that links Slack’s Virtual Private Clouds (VPCs). Slack’s unusual traffic pattern, with a sharp increase on the first working day after holidays, caused a sudden surge in demand that the AWS-managed TGWs did not scale fast enough to meet, leading to packet loss.

Slack engineers disabled downscaling, cleared broken instances, and relied on load balancer ‘panic mode’ and retries to restore service. By 9:15 AM PST, Slack was degraded but functional. AWS engineers, once alerted, manually increased TGW capacity, which resolved the underlying network issue by 10:40 AM PST, after which error rates returned to normal.
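The ‘panic mode’ mentioned above is a behavior some load balancers offer: when too few backends pass health checks, the balancer stops trusting the health checks and spreads traffic across all backends. The following is a rough sketch of that idea only. The `eligible_hosts` function, the host names, and the 50% threshold are assumptions for illustration and do not describe the configuration Slack used.

```python
def eligible_hosts(hosts: dict[str, bool], panic_threshold: float = 0.5) -> list[str]:
    """Pick the hosts that may receive traffic.

    `hosts` maps a host name to its latest health check result.
    `panic_threshold` is an assumed value: if the healthy fraction drops
    below it, health results are ignored and every host is eligible.
    """
    if not hosts:
        return []
    healthy = [name for name, ok in hosts.items() if ok]
    if len(healthy) / len(hosts) < panic_threshold:
        # Panic mode: sending everything to a handful of "healthy" hosts
        # would overload them, so spread load across the whole pool and
        # let retries absorb the failures.
        return list(hosts)
    return healthy


# Hypothetical pool where only one of four hosts passes its health check.
pool = {"web-1": True, "web-2": False, "web-3": False, "web-4": False}
print(eligible_hosts(pool))  # all four hosts, because 25% healthy is below 50%
```

Combined with client retries, this trades a higher per-request failure rate for avoiding a total collapse onto the few hosts still marked healthy.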

As remediation, AWS is reviewing its TGW scaling algorithms. Slack plans to request preemptive TGW upscaling before future holiday seasons and to move its monitoring dashboard services into the same VPC as its databases, removing their dependency on the TGW. Slack will also load test provision-service regularly and reevaluate its autoscaling configurations to prevent similar issues.

Keywords

slack, aws, network, transit gateway, packet loss, autoscaling, provisioning, web tier