Postmortem Index

Explore incident reports from various companies

Heroku

A change to the core application that manages the underlying infrastructure for the Common Runtime included a dependency upgrade that caused a timing lock issue that greatly reduced the throughput of our task workers. This dependency change, coupled with a failure to appropriately scale up due to increased workload scheduling, caused the application’s work queue to build up. Contributing to the issue, the team was not alerted immediately that new router instances were not being initialized correctly on startup largely because of incorrectly configured alerts. These router instances were serving live traffic already but were shown to be in the wrong boot state, and they were deleted via our normal processes due to failing readiness checks. The deletion caused a degradation of the associated runtime cluster while the autoscaling group was creating new instances. This reduced pool of router instances caused requests to fail as more requests were coming in faster than the limited number of routers could handle. This is when customers started noticing issues with the service.

Source | Edit Metadata | JSON file