Mailgun Website Intermittent Timeouts

Mailgun · mongodb, api, smtp, tracking services, website

2017-01-12 config-change

On January 12th at 14:09 UTC, Mailgun engineers began responding to alerts regarding their click/open tracking services and primary MongoDB clusters. Within ten minutes, it was determined that secondary MongoDB servers, used for most read operations, were under heavy load and failing to serve requests, leading to elevated errors in tracking services. This degradation was attributed to a gradual increase in connections to these servers, which coincided with MongoDB configuration changes made the previous day.

At 15:15 UTC, the engineering team attempted to roll back the recent configuration changes. However, due to human error, the rollback inadvertently introduced a new change that redirected API and SMTP services from the secondary MongoDB servers to the primary. The primary MongoDB server became overloaded and unstable, which led to an elevated error rate for Mailgun’s API, SMTP, and website starting at 15:55 UTC.

Between 16:00 and 19:00 UTC, engineers continued troubleshooting, resizing MongoDB servers and deploying connection limits to stabilize the system. These efforts restored approximately 75% of typical API request throughput in the affected region. Further review of the MongoDB configurations identified the need for additional connection limits on services connecting to the primary MongoDB server.

At 20:28 UTC, these new connection limits were deployed, along with a re-deployment of configurations to prefer secondary servers for reads. These changes successfully stabilized the MongoDB server, and services in the region resumed normal operations at 20:35 UTC. Post-incident analysis of logs did not reveal a clear root cause for the initial connection increase, but the imposed connection limits are expected to prevent similar issues.
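
The two settings described above, a preference for secondary reads and a cap on connections, can both be expressed on the client side through standard MongoDB connection string options. The following is a minimal sketch under stated assumptions, not Mailgun's actual configuration: the host names, database name, replica set name, pool size, and the helper `build_mongo_uri` are all hypothetical, and the report does not say whether the limits were applied in the client, the server, or elsewhere.

```python
from urllib.parse import urlencode


def build_mongo_uri(hosts, database, replica_set, max_pool_size=50):
    """Build a MongoDB connection string (hypothetical helper).

    readPreference=secondaryPreferred sends reads to secondaries when
    one is available; maxPoolSize caps the connections that a single
    client process opens to each server.
    """
    options = {
        "replicaSet": replica_set,
        "readPreference": "secondaryPreferred",
        "maxPoolSize": max_pool_size,  # assumed value, not from the report
    }
    return f"mongodb://{','.join(hosts)}/{database}?{urlencode(options)}"


# Hypothetical hosts and names, for illustration only.
uri = build_mongo_uri(
    ["db1.example.internal:27017", "db2.example.internal:27017"],
    database="tracking",
    replica_set="rs0",
)
print(uri)
```

Keeping both options in one generated string means a rollback replaces them together, which may reduce the chance of restoring one setting while silently changing the other.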

Corrective actions included deploying redundant tracking infrastructure across two regions, implementing connection limits across all services to prevent any single service from causing an overload, and permanently increasing the size of the primary MongoDB cluster, a change that was also rolled out to other data centers. Planned improvements include enhanced failure handling for tracking services.
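
The idea behind per-service connection limits can be illustrated with a small in-process sketch. It assumes each service is given a fixed budget of concurrent database connections and that requests beyond the budget are rejected immediately rather than queued. The class `ConnectionLimiter`, the service names, and the limit values are hypothetical; the report does not describe how Mailgun implemented its limits.

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Caps concurrent connections per service (hypothetical helper)."""

    def __init__(self, limits):
        # One bounded semaphore per service, sized to that service's budget.
        self._semaphores = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    @contextmanager
    def connection(self, service):
        semaphore = self._semaphores[service]
        # Fail fast instead of waiting, so a busy service cannot pile up
        # pending connections against the database.
        if not semaphore.acquire(blocking=False):
            raise RuntimeError(f"connection limit reached for {service}")
        try:
            yield
        finally:
            semaphore.release()


# Assumed budgets, for illustration only.
limiter = ConnectionLimiter({"api": 1, "tracking": 2})

with limiter.connection("api"):
    try:
        with limiter.connection("api"):
            pass
    except RuntimeError as exc:
        print(exc)
```

Because each service draws from its own budget, exhausting the `api` limit in this sketch leaves the `tracking` budget untouched.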

Keywords

mongodb, tracking services, api, smtp, website, connection limits, configuration, january 2017