Postmortem Index

Explore incident reports from various companies

Incident.io intermittent database connection pool timeouts

incident.io · database connection pool

Incident.io experienced intermittent application timeouts over a two-week period earlier this year, degrading the customer experience and surfacing as “context canceled” errors in their error reporting. Despite initial investigation, no obvious cause, such as a recent code change or traffic spike, was apparent.

Traces revealed that HTTP requests were waiting up to 20 seconds to acquire a connection from the Go database/sql connection pool. The contention was spread across many endpoints rather than isolated to a single slow query, which made diagnosis challenging.
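This kind of pool starvation is visible in the stdlib itself: `sql.DBStats` exposes `WaitCount` and `WaitDuration`, the number of times and total time callers blocked waiting for a free connection. A minimal, self-contained sketch (the in-memory driver and table name are hypothetical, standing in for Postgres so the behaviour is reproducible locally):

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"sync"
	"time"
)

// slowDriver is a stand-in driver whose every statement takes 50ms,
// so pool contention is observable without a real database.
type slowDriver struct{}

func (slowDriver) Open(string) (driver.Conn, error) { return slowConn{}, nil }

type slowConn struct{}

func (slowConn) Prepare(string) (driver.Stmt, error) { return nil, errors.New("unused") }
func (slowConn) Close() error                        { return nil }
func (slowConn) Begin() (driver.Tx, error)           { return nil, errors.New("unused") }

// ExecContext makes database/sql use this path instead of Prepare.
func (slowConn) ExecContext(context.Context, string, []driver.NamedValue) (driver.Result, error) {
	time.Sleep(50 * time.Millisecond)
	return driver.RowsAffected(1), nil
}

var registerOnce sync.Once

// runContended fires n concurrent statements through a pool of one
// connection and returns the pool statistics afterwards.
func runContended(n int) sql.DBStats {
	registerOnce.Do(func() { sql.Register("slow", slowDriver{}) })
	db, _ := sql.Open("slow", "")
	defer db.Close()
	db.SetMaxOpenConns(1) // tiny pool: every statement but one must queue

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			db.ExecContext(context.Background(), "UPDATE hypothetical SET x = 1")
		}()
	}
	wg.Wait()
	return db.Stats()
}

func main() {
	stats := runContended(4)
	fmt.Printf("goroutines that waited: %d, total time waiting: %v\n",
		stats.WaitCount, stats.WaitDuration)
}
```

With the pool capped at one connection, three of the four goroutines block exactly the way incident.io's HTTP handlers did, just on a smaller scale.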

Initial remediation included optimizing long-neglected queries, adding database indices, rewriting inefficient queries, and setting a one-second lock timeout on transactions. They also began processing Slack events asynchronously to reduce immediate database load. None of these measures fully resolved the intermittent timeouts.
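The one-second lock timeout can be scoped to a single transaction in Postgres with `SET LOCAL`; a sketch of the idea (the table and statement are hypothetical, not incident.io's actual schema):

```sql
BEGIN;
-- Abort any statement in this transaction that waits more than 1s for a lock,
-- instead of queueing behind the lock holder indefinitely.
SET LOCAL lock_timeout = '1s';
UPDATE incidents SET status = 'resolved' WHERE id = 42;
COMMIT;
```

`SET LOCAL` reverts automatically at commit or rollback, so the timeout never leaks to other work on the same connection.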

To better diagnose the problem, incident.io improved their observability by writing custom middleware on top of ngrok/sqlmw, which let them track the total time each operation spent holding a database connection. The root cause turned out to be an unnecessary transaction wrapping every Slack modal submission.

Removing these unnecessary transactions, and adding explicit transactions only where transactional guarantees were actually required, resolved the issue. The problem was not one slow operation but the aggregate of many small, fast transactions exhausting the connection pool. The company reports being timeout-free in the four months since the fix.

Keywords

database · connection pool · timeouts · go · postgres · slack · transactions · performance