As reported by Debora Weber-Wulff in comp.risks, on January 14, 2009
at 14:03 all of the German national train system's computers shut down.
No ticket machines would work, all Internet pages returned 404 errors,
and the displays in the train stations all went blank. It is unknown
how many of the train control computers died.
The problem was that the AC power failed. They had everything on one
big (supposedly) Uninterruptible Power Supply (UPS) -- a classic
single point of failure blunder by the system designers.
The computer center of the Deutsche Bahn in Mahlsdorf (Berlin)
was upgrading the UPS, and during the upgrade the power to all
of the computers failed at the same time.
They supposedly had a backup system, but the cut-over to the backup
system did not work. Clearly they had never tested it by stopping
the main UPS.
It took many hours to get the system back up and running. It seems
that several computers each assumed at startup that the others were
already up and running -- a classic chicken-and-egg flaw. In
addition, the AC power system lacked the capacity to power up all
of the computers at the same time, so someone had to manually turn
each one off, restore the main power, and turn them back on in the
correct sequence.
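
To illustrate the sequencing problem described above, here is a minimal sketch (mine, not anything Deutsche Bahn is known to use) of how a power-up order could be computed ahead of time. It assumes you can write down which machine needs which other machines before it can start; the machine names, the startup_batches function, and the max_batch limit (standing in for limited AC capacity) are all hypothetical. It uses Python's standard graphlib module (Python 3.9 or later), which refuses to produce an order if the dependencies form a chicken-and-egg loop.

```python
from graphlib import TopologicalSorter, CycleError


def startup_batches(depends_on, max_batch):
    """Return a list of batches of machines to power on, in order.

    depends_on maps each machine to the machines that must already be
    running before it starts. max_batch caps how many machines are
    switched on at once (a stand-in for limited power capacity).
    Raises CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError on a chicken-and-egg loop
    batches = []
    while sorter.is_active():
        # Everything whose prerequisites are already up.
        ready = sorted(sorter.get_ready())
        # Split into chunks so no batch exceeds the power budget.
        for i in range(0, len(ready), max_batch):
            batches.append(ready[i:i + max_batch])
        sorter.done(*ready)
    return batches


# Hypothetical dependency map, for illustration only.
EXAMPLE = {
    "dns": [],
    "database": ["dns"],
    "ticketing": ["database", "dns"],
    "web": ["database"],
    "displays": ["ticketing"],
}

# Two machines that each wait for the other: no valid order exists.
CIRCULAR = {"a": ["b"], "b": ["a"]}
```

Calling startup_batches(EXAMPLE, 2) gives one machine or pair of machines per step, with every prerequisite started in an earlier step. Calling it on CIRCULAR raises CycleError, which is the kind of flaw you want to discover on paper rather than during an outage.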
As expected, there was much public speculation about hackers,
terrorists, viruses, etc., but the root cause was stupidity by
the system designers. There shouldn't have been a single point
of failure.
Supposedly they have found the weak point and fixed it so that this
particular failure mode won't happen again, but there is no evidence
that they have replaced the giant single-point-of-failure UPS with
a redundant distributed system, or that they have scheduled tests
(presumably in the middle of the night when a failure is least
disruptive) to ensure that the ability to cut over to the backup
UPS has not silently failed.
The last time I set up a network, I tested it by spending many
hours unplugging and restoring the power to servers, UPS units,
firewall boxes, etc., as well as unplugging various Ethernet
cables. No single point of failure caused any interruption of
service. This isn't rocket science; all it takes is careful
system design and testing to make sure the system actually works
when various parts of it fail.
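
The unplug-one-thing-at-a-time test above can also be rehearsed on paper before touching any hardware. The following is only a rough sketch under my own assumptions: the system is modeled as a directed graph of "feeds" links from a source to a load, and the component names and both function names are made up for the example. It removes each component in turn and reports the ones whose loss cuts the load off. It is no substitute for actually pulling the plugs, since a model cannot reveal a cut-over that silently fails.

```python
from collections import deque


def reachable(links, start, goal, removed):
    """Breadth-first search from start to goal, skipping one removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, source, load):
    """Return components whose failure alone disconnects load from source."""
    components = set(links) | {n for targets in links.values() for n in targets}
    components -= {source, load}  # only test the parts in between
    return sorted(
        c for c in components if not reachable(links, source, load, c)
    )


# Hypothetical layouts, for illustration only.
ONE_BIG_UPS = {"mains": ["ups"], "ups": ["servers"]}
TWO_UPS = {
    "mains": ["ups_a", "ups_b"],
    "ups_a": ["servers"],
    "ups_b": ["servers"],
}
```

For ONE_BIG_UPS the function reports the UPS as a single point of failure; for TWO_UPS it reports nothing, because either unit can fail and the servers are still fed.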
-Guy Macon <http://www.GuyMacon.com/>