{"UUID":"76a3bbe4-ed5a-4c4a-a851-01d77e3ae63f","URL":"https://aws.amazon.com/message/17908/","ArchiveURL":"","Title":"AWS Direct Connect disruption in Tokyo (AP-NORTHEAST-1) on September 2, 2021","StartTime":"2021-09-01T22:30:00Z","EndTime":"2021-09-02T04:42:00Z","Categories":["cloud"],"Keywords":["direct connect","tokyo","ap-northeast-1","network","packet loss","connectivity","aws","protocol"],"Company":"Amazon","Product":"AWS Direct Connect","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T19:51:57.170943Z","Summary":"A new failover-optimization protocol had been enabled in network device OS for 8 months without issue. A customer traffic pattern produced packets matching a very specific signature that triggered a latent defect in the OS, causing devices in one Direct Connect network layer to fail. Failed devices weren't automatically removed from service, so engineers manually drained them, only for additional devices to fail with the same bug. Disabling the new protocol restored Direct Connect to Tokyo after ~6 hours.","Description":"On September 2, 2021, beginning at 7:30 AM JST, AWS Direct Connect customers in the Tokyo (AP-NORTHEAST-1) Region experienced intermittent connectivity issues and elevated packet loss. Customers began to see recovery by 12:30 PM JST, with full resolution by 1:42 PM JST. Other network services, including inter-Availability Zone traffic, internet connectivity, AWS VPN, and Direct Connect to other AWS Regions, were not affected.\n\nThe incident was caused by the failure of a subset of network devices within a single network layer along the path from Direct Connect edge locations to the Tokyo Region's datacenter network. These devices were not correctly removed from service by the usual automated processes, which instead alerted engineers to an unusually high failure rate. Initial attempts by engineers to manually remove affected devices and reset them provided only temporary relief, as additional devices subsequently failed.\n\nEngineers eventually linked the failures to a new protocol, introduced in January 2021, designed to optimize network reaction times for infrequent convergence events. This protocol, which had operated without issue for eight months, was found to interact with a very specific and rare customer traffic pattern, triggering a latent defect in the network device operating system.\n\nTo resolve the issue, engineers disabled the new protocol in a single Availability Zone to confirm its effectiveness, then prepared to roll out this change across the entire Tokyo Region. Disabling the protocol successfully stabilized the affected network devices and restored the Direct Connect service to normal operation.\n\nAWS confirmed the root cause to be a latent defect in the network device operating system, which was exposed by the new protocol and specific traffic patterns. While the new protocol and OS had undergone extensive testing and a phased rollout, this particular combination of conditions was not identified. AWS has disabled the protocol in Tokyo and is developing enhanced methods to detect and remediate such issues proactively in other regions."}