Amazon EC2 DNS Resolution Issues in AP-NORTHEAST-2
On November 22, 2018, between 8:19 AM and 9:43 AM KST, Amazon EC2 instances in the Asia Pacific (Seoul) region (AP-NORTHEAST-2) experienced DNS resolution issues. AWS engineering was alerted at 8:21 AM KST and began working on a resolution, identifying the root cause by 8:48 AM KST. Full recovery for DNS queries from within EC2 instances was achieved by 9:43 AM KST.
The incident was caused by a reduction in the number of healthy hosts within the EC2 DNS resolver fleet, which provides recursive DNS service to EC2 instances. This reduction led to DNS queries from within EC2 instances failing. EC2 network connectivity and DNS resolution outside of EC2 instances were not affected.
The root cause was a configuration update that incorrectly removed the setting specifying the minimum number of healthy hosts for the EC2 DNS resolver fleet in the Seoul Region. With the setting absent, the system fell back to a very low default value for minimum healthy hosts, which left fewer healthy hosts in service.
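The failure mode described above can be illustrated with a minimal Python sketch. The key name `min_healthy_hosts`, the fallback value of 1, and the host count of 40 are all hypothetical; the summary does not describe the actual AWS configuration format or its default.

```python
# Hypothetical illustration only: the key name, default, and numbers are
# assumptions and do not reflect AWS's actual configuration system.
DEFAULT_MIN_HEALTHY_HOSTS = 1  # assumed "very low" fallback value


def effective_min_healthy(config: dict) -> int:
    """Return the minimum healthy host count, silently using the
    default when the setting is absent from the configuration."""
    return config.get("min_healthy_hosts", DEFAULT_MIN_HEALTHY_HOSTS)


before_update = {"region": "ap-northeast-2", "min_healthy_hosts": 40}
after_update = {"region": "ap-northeast-2"}  # the update dropped the setting

print(effective_min_healthy(before_update))  # 40
print(effective_min_healthy(after_update))   # 1
```

A silent fallback of this kind keeps the system running when a key is missing, but it also hides the mistake, which is why a missing required setting is often better treated as an error than as a request for the default.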
To prevent recurrence, AWS immediately validated and ensured correct capacity settings for the EC2 DNS resolver service across all regions. AWS is implementing semantic configuration validation for all EC2 DNS resolver configuration updates to ensure that the minimum healthy hosts setting is always sufficient. Additionally, throttling is being added to limit the amount of healthy host capacity that can be removed from service each hour, which is intended to prevent large-scale downscaling of the EC2 DNS resolver fleet even if an invalid configuration parameter is introduced.
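The two safeguards described above, semantic validation and hourly removal throttling, might look roughly like the following Python sketch. Every name, threshold, and percentage here is an assumption made for illustration; the summary does not say how AWS implements either mechanism.

```python
# Hypothetical sketch: names, the 5% hourly limit, and the floor value
# are illustrative assumptions, not AWS's implementation.
class ConfigValidationError(ValueError):
    """Raised when a configuration update fails semantic validation."""


def validate_resolver_config(config: dict, required_floor: int) -> None:
    """Reject an update whose minimum healthy host setting is missing,
    not an integer, or below the required floor."""
    value = config.get("min_healthy_hosts")
    if value is None:
        raise ConfigValidationError("min_healthy_hosts is missing")
    if isinstance(value, bool) or not isinstance(value, int):
        raise ConfigValidationError("min_healthy_hosts must be an integer")
    if value < required_floor:
        raise ConfigValidationError(
            f"min_healthy_hosts={value} is below the floor of {required_floor}"
        )


def hosts_removable_now(
    hour_start_hosts: int,
    current_hosts: int,
    target_hosts: int,
    removed_this_hour: int,
    max_percent_per_hour: int = 5,
) -> int:
    """Return how many hosts may be removed right now without exceeding
    the hourly budget, regardless of how low the target is."""
    hourly_budget = hour_start_hosts * max_percent_per_hour // 100
    remaining_budget = max(0, hourly_budget - removed_this_hour)
    wanted = max(0, current_hosts - target_hosts)
    return min(wanted, remaining_budget)


# Even with a bad target of 1 host, at most 5 of 100 hosts leave this hour.
print(hosts_removable_now(100, 100, 1, removed_this_hour=0))  # 5
```

The two checks are complementary: validation stops a bad update before it is applied, while the throttle bounds the damage if a bad value gets through anyway, leaving time for operators to notice and intervene.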