This past summer, there were two major outages of Vancouver's SkyTrain
service, a driverless system that has been in operation since 1986. An
investigation resulted in a report.
CBC article about the issue:
> www.cbc.ca/news/canada/british-columbia/skytrain-shutdown-investigation-makes-20-recommendations-1.2838964
Actual report:
> http://www.translink.ca/~/media/documents/about_translink/media/2014/tra_1795_independent_review_booklet_final.ashx
Basically, an electrician doing work on the UPS caused one breaker to
trip, cutting off lights and other equipment in the system control room.
The staff didn't know which breaker it was, assumed the whole system
was down (the trains were in fact still moving), and rebooted the whole
UPS instead of trying to figure out which breaker had tripped.
In doing that reboot, they actually shut down the whole system, and
once it came back up, the software had no awareness of where each train
was, so nothing could move.
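
As an aside, that last failure is the kind of thing periodic state
checkpointing is meant to cover: a control system that persists its
last-known vehicle positions can restart without a blank map. A minimal
sketch follows; the file path and the train/block identifiers are
invented, and I make no claim this resembles the real software:

#!/usr/bin/env python3
"""Checkpoint last-known train positions so a cold restart does not
begin with a blank map. Hypothetical sketch only."""
import json
from pathlib import Path

CHECKPOINT = Path("/var/lib/control/positions.json")  # hypothetical path

def save_positions(positions):
    """Atomically write {train id: track block} to durable storage."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(positions))
    tmp.replace(CHECKPOINT)  # rename is atomic on POSIX filesystems

def restore_positions():
    """On startup, reload the last checkpoint instead of assuming nothing."""
    try:
        return json.loads(CHECKPOINT.read_text())
    except (OSError, ValueError):
        return {}  # no usable checkpoint: operators must re-register trains

if __name__ == "__main__":
    save_positions({"train-101": "block-12", "train-102": "block-47"})
    print(restore_positions())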
This is a good example of a disaster recovery plan that did not
consider ALL aspects, notably documentation of which systems were on
which UPS breaker, how to tell which UPS breakers had tripped, and how
to reset them without rebooting the whole UPS.
Also, the plan lacked a "what has failed, and what is still running"
assessment before taking any action.
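
Such an assessment does not have to be elaborate. Here is a minimal
sketch of a pre-action survey, assuming each subsystem exposes
something that can be probed over TCP; the subsystem names, hosts, and
ports are all made up:

#!/usr/bin/env python3
"""Pre-action survey: report what has failed and what is still running.

Everything here is hypothetical -- the subsystem names, hosts, and
ports are invented for illustration, not taken from any real control
room."""
import socket

# (subsystem name, host, TCP port) -- all hypothetical examples.
SUBSYSTEMS = [
    ("train control",   "atc.example.net", 4000),
    ("signalling",      "sig.example.net", 4001),
    ("control-room UI", "ui.example.net",  4002),
    ("UPS monitor",     "ups.example.net", 4003),
]

def is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def survey():
    """Print what is up and what is down before anyone touches a breaker."""
    running = [name for name, host, port in SUBSYSTEMS if is_up(host, port)]
    failed = [name for name, _, _ in SUBSYSTEMS if name not in running]
    print("STILL RUNNING:", ", ".join(running) or "(nothing)")
    print("FAILED:       ", ", ".join(failed) or "(nothing)")
    if running:
        print("Shared infrastructure is carrying live load -- do not reboot it.")

if __name__ == "__main__":
    survey()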
I was once caught in a situation where I had not planned for the
failure of an H4000 AUI-to-thick-coax transceiver. This had many
consequences, as the production system was still processing but
transactions were not being sent over Ethernet to the volume-shadowed
disks at the backup site.
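
The lesson for me was that "the link exists" and "data is actually
arriving at the other end" are two different checks. A minimal
end-to-end heartbeat, assuming both sites expose the replicated volume
as a file path (the paths here are hypothetical):

#!/usr/bin/env python3
"""End-to-end replication heartbeat -- a sketch, not a product.

Writes a unique token on the primary copy of a replicated volume, then
checks whether the copy visible at the backup site has received it.
Both paths are hypothetical."""
import time
import uuid
from pathlib import Path

PRIMARY_MARKER = Path("/mnt/primary/replicated/heartbeat")  # hypothetical
BACKUP_MARKER = Path("/mnt/backup/replicated/heartbeat")    # hypothetical
PROPAGATION_WAIT = 30.0  # seconds allowed for replication to catch up

def replication_ok():
    """Write a token at the primary; confirm it shows up at the backup."""
    token = uuid.uuid4().hex
    PRIMARY_MARKER.write_text(token)
    time.sleep(PROPAGATION_WAIT)
    try:
        return BACKUP_MARKER.read_text() == token
    except OSError:
        return False

if __name__ == "__main__":
    if replication_ok():
        print("OK: data written at the primary is visible at the backup")
    else:
        print("ALERT: replication to the backup site appears broken")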
Ever since that event, I have had an interest in looking at similar
failures where something was not planned for in the disaster recovery
plan.