Postmortem Index

Explore incident reports from various companies

Intermittent downtime from repeated crashes

incident.io · incident.io application

On Friday, November 18th, 2022, incident.io experienced 13 minutes of intermittent downtime over a 32-minute period, from 15:40 to 16:12 GMT. The downtime was caused by repeated crashes of their Go monolith application, which disrupted customer access to the service.

The core issue was an unhandled panic within the Go application, triggered by a “poison pill” message on the GCP Pub/Sub asynchronous message queue. This message caused a specific handler to panic, and a subtlety of how Go’s panic recovery works across goroutines allowed that panic to crash the entire application.

The problem stemmed from the Google Cloud Pub/Sub client’s sub.Receive method, which spawns new goroutines to handle messages. While the parent function had a deferred recover(), a recover() only catches panics raised in the goroutine that deferred it, so the panic in the child goroutine went unhandled and terminated the application. Heroku’s dyno crash restart policy exacerbated the outage by introducing cool-off periods between restarts, prolonging the downtime.
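The goroutine semantics behind the failure are easy to reproduce. This minimal, self-contained sketch (illustrative only, not incident.io’s code) shows that a deferred recover() in a parent function does not catch a panic raised in a goroutine it spawns; running it crashes the whole process despite the recover:

```go
package main

import "fmt"

func main() {
	// This deferred recover protects only main's own goroutine.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// recover() only intercepts panics raised in the goroutine that
		// deferred it, so this panic bypasses the recover in main and
		// terminates the entire process.
		panic("poison pill message")
	}()
	<-done
}
```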

Investigation was hampered by Heroku’s log buffering and dropping of large stack traces from Go panics, and the immediate termination of the app prevented Sentry from reporting the crash. Engineers eventually identified the problematic Pub/Sub subscription by looking for unacknowledged messages and purged several queues, which stabilized the application.
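For context on what “purging” a subscription involves: with the Go client, seeking a subscription to the present moment acknowledges all outstanding messages, poison pill included. The following is a hedged sketch under that assumption, not incident.io’s actual remediation; the project and subscription IDs are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Seeking to "now" marks every outstanding message as acknowledged,
	// effectively purging the subscription's backlog.
	sub := client.Subscription("my-subscription") // placeholder subscription ID
	if err := sub.SeekToTime(ctx, time.Now()); err != nil {
		log.Fatal(err)
	}
}
```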

Two key mitigations were implemented. First, a deferred recover() was added inside the message handler passed to sub.Receive, so that panics are caught in the child goroutines where they actually occur. Second, the monolith was split into separate Heroku dynos for web, worker, and cron processes, isolating failures so that one crashing component can no longer take down the entire service.
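A minimal sketch of the first mitigation, assuming the standard cloud.google.com/go/pubsub client; the process function and identifiers are illustrative, not incident.io’s code:

```go
package worker

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// receiveWithRecovery wraps the message handler with its own recover().
// Receive invokes the handler on goroutines it spawns, so the recover
// must live inside the handler itself to catch panics there.
func receiveWithRecovery(ctx context.Context, sub *pubsub.Subscription) error {
	return sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("handler panicked: %v", r)
				// Nack so Pub/Sub redelivers (or dead-letters) the message
				// instead of the panic crashing the whole process.
				msg.Nack()
			}
		}()
		process(msg)
		msg.Ack()
	})
}

// process is a hypothetical stand-in for real message-handling logic.
func process(msg *pubsub.Message) { _ = msg }
```

Note that nacking a poison pill without a dead-letter policy causes it to be redelivered, so pairing this handler with a Pub/Sub dead-letter topic keeps a single bad message from looping forever.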

Keywords

heroku · go · pub/sub · gcp · panic · goroutine · monolith · downtime · crash