Postmortem Index


PythonAnywhere storage volume failure on 7 July 2020

PythonAnywhere · storage volume

2020-07-07 · cloud · hardware

On July 7, 2020, at 16:06 UTC, PythonAnywhere experienced an unplanned outage due to a storage volume failure on one of their file servers, ‘livefile1’. This immediately impacted their own website and user programs, including websites, scheduled tasks, and always-on tasks dependent on that volume. The issue later spread to other hosted sites during recovery efforts.

The root cause was identified as an extremely rare multi-component failure of an Amazon EBS volume, which is part of a highly redundant storage system. This failure occurred at the first level of PythonAnywhere’s three-tiered redundancy, effectively causing a redundant array of disks to stop working, an event estimated to happen once every 20 years.

Customers whose data was stored on the affected file server experienced prolonged unavailability of their websites and tasks. Other users faced brief outages during a rolling reboot of servers. PythonAnywhere’s own site was also unavailable or slow. Despite the significant downtime, no user data was lost or put at risk, thanks to PythonAnywhere’s backup and mirroring systems.

The PythonAnywhere team, with AWS support, rebuilt the file server’s volumes from backup snapshots. All sites were reported as up and running by 21:24 UTC, with full volume warm-up completed by 23:50 UTC. Moving forward, PythonAnywhere plans to improve the speed of identifying underlying causes, streamline file server volume rebuilding, limit the scope of future outages to only directly affected users, and optimize the process of warming up new volumes.
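The warm-up step mentioned above exists because an EBS volume created from a snapshot loads its blocks lazily, so the first read of each block is slow until it has been fetched once. A common way to warm a volume is simply to read every block sequentially. The following is a minimal sketch of that idea, not PythonAnywhere's actual procedure, which this summary does not describe; the function name `warm_up` and the chunk size are assumptions chosen for illustration.

```python
def warm_up(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read every byte of `path` once and return the number of bytes read.

    Hypothetical helper: `path` would be a block device on a real host,
    but any readable file works for trying the sketch out.
    """
    total = 0
    # buffering=0 opens an unbuffered raw file, so each read() goes
    # straight to the operating system instead of a Python-level buffer.
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:  # empty bytes means end of file/device
                break
            total += len(chunk)
    return total
```

On a real server the path would be the attached device (for example something like `/dev/xvdf`, a name assumed here) and the call would need root privileges. A sequential read of a large volume takes time proportional to its size, which is consistent with warm-up finishing well after the sites themselves were back up.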

Keywords

storage, ebs, aws, file server, outage, pythonanywhere, volume failure, raid