King's College London Strand Data Centre storage failure

King's College London · Strand Data Centre / HP 3PAR

2016-10-17 hardware

On 17 October 2016, one of the four controllers in the principal HP 3PAR storage system at King’s College London’s Strand Data Centre failed. There was no user impact at that point, because the system was designed to keep serving from the remaining controllers. HP hardware engineers attended on site and replaced the failed controller. Instead of returning to a normal state, the storage system went offline and many of its disks began failing simultaneously, leading to a complete loss of data on that array.

The failure was traced to a flaw in the storage-controller firmware. HP had issued an updated firmware release several weeks earlier which would have allowed the controller swap to proceed without an outage, but the IT team had not had the opportunity to apply the routine update before the incident.

The College had multiple backup systems in place. Had any of them performed as intended, the data could have been restored and the day would have ended as a routine annoyance. Instead, they collectively failed: the IT team had not understood the central importance of the tape backups in the layered scheme, did not follow the documented backup procedures completely, and had stopped backing up some data to tape because of capacity constraints, without telling the College. Backup restoration had never been tested end to end, so the response team could not give meaningful recovery-time estimates while the incident was in progress. A considerable amount of data, including Admissions records and academic research stored on shared drives, could not be reconstructed and is permanently lost.
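The untested restore path is the part of this incident that lends itself most directly to automation. What follows is a minimal sketch of an end-to-end restore drill, not a description of anything the College or HP ran. It assumes a backup that can be restored into a scratch directory and a checksum manifest recorded from the live data at backup time. The names restore_drill and build_manifest are hypothetical, and the shutil.copytree call is only a stand-in for whatever restore command a real backup product provides.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(backup_dir: Path, manifest: dict[str, str]) -> dict:
    """Restore into a scratch area, compare against the manifest, and time it.

    Assumption: copying backup_dir stands in for the real restore step.
    """
    started = time.monotonic()
    with tempfile.TemporaryDirectory() as scratch:
        target = Path(scratch) / "restored"
        shutil.copytree(backup_dir, target)
        restored = build_manifest(target)
    elapsed = time.monotonic() - started

    # Files the manifest expects but the restore did not produce.
    missing = sorted(set(manifest) - set(restored))
    # Files that came back with different contents.
    corrupt = sorted(
        name for name in manifest
        if name in restored and restored[name] != manifest[name]
    )
    return {
        "ok": not missing and not corrupt,
        "missing": missing,
        "corrupt": corrupt,
        "seconds": elapsed,
    }
```

Run on a schedule, a drill of this kind would surface data that has quietly dropped out of the backup set (the "missing" list) and would yield a measured restore time that could inform recovery-time estimates during an incident.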

Contributing factors: the College had purchased “proactive support” from HP four years earlier but not the “enhanced support” tier introduced in 2015 that provides risk assessment and change-management advice for complex systems; users had stored critical academic and administrative data on shared drives without realising those were not a long-term archive; IT had been unable to negotiate maintenance windows for full disaster-recovery tests; and an ambitious four-year IT transformation programme had outrun the organisation’s capacity to absorb the change. The Strand Data Centre itself was assessed as no longer fit for purpose and slated for replacement.

Keywords

kings college london, 3par, hpe, storage controller, firmware, backup failure, tape backup, data loss, strand data centre, pa consulting