This past summer, there were two major outages of Vancouver's SkyTrain
service, a driverless system that has been in operation since 1986. An
investigation resulted in a report.
CBC article about the issue:
> www.cbc.ca/news/canada/british-columbia/skytrain-shutdown-investigation-makes-20-recommendations-1.2838964
Actual report:
> http://www.translink.ca/~/media/documents/about_translink/media/2014/tra_1795_independent_review_booklet_final.ashx
Basically, an electrician doing work on the UPS caused one breaker to
trip, cutting off lights and other equipment in the system control room.
The staff didn't know which breaker it was, assumed the whole system
was down (the trains were in fact still moving), and rebooted the whole
UPS instead of trying to figure out which breaker had tripped.
In doing that reboot, they actually shut down the whole system, and
once it came back up, the software had no awareness of where each train
was, so nothing could move.
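
As an aside, that last failure is the kind of thing periodic state
checkpointing is meant to cover: a control system that persists its
last-known vehicle positions can restart without a blank map. A minimal
sketch follows; the file path and the train/block identifiers are
invented, and I make no claim this resembles the real software:

#!/usr/bin/env python3
"""Checkpoint last-known train positions so a cold restart does not
begin with a blank map. Hypothetical sketch only."""
import json
from pathlib import Path

CHECKPOINT = Path("/var/lib/control/positions.json")  # hypothetical path

def save_positions(positions):
    """Atomically write {train id: track block} to durable storage."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(positions))
    tmp.replace(CHECKPOINT)  # rename is atomic on POSIX filesystems

def restore_positions():
    """On startup, reload the last checkpoint instead of assuming nothing."""
    try:
        return json.loads(CHECKPOINT.read_text())
    except (OSError, ValueError):
        return {}  # no usable checkpoint: operators must re-register trains

if __name__ == "__main__":
    save_positions({"train-101": "block-12", "train-102": "block-47"})
    print(restore_positions())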
This is a good example of a disaster recovery plan that did not
consider ALL aspects, notably documentation of which systems were on
which UPS breaker, how to tell which UPS breakers had tripped, and how
to reset them without rebooting the whole UPS.
Also, the plan lacked a "what has failed, and what is still running"
assessment before taking any action.
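
Such an assessment does not have to be elaborate. Here is a minimal
sketch of a pre-action survey, assuming each subsystem exposes
something that can be probed over TCP; the subsystem names, hosts, and
ports are all made up:

#!/usr/bin/env python3
"""Pre-action survey: report what has failed and what is still running.

Everything here is hypothetical -- the subsystem names, hosts, and
ports are invented for illustration, not taken from any real control
room."""
import socket

# (subsystem name, host, TCP port) -- all hypothetical examples.
SUBSYSTEMS = [
    ("train control",   "atc.example.net", 4000),
    ("signalling",      "sig.example.net", 4001),
    ("control-room UI", "ui.example.net",  4002),
    ("UPS monitor",     "ups.example.net", 4003),
]

def is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def survey():
    """Print what is up and what is down before anyone touches a breaker."""
    running = [name for name, host, port in SUBSYSTEMS if is_up(host, port)]
    failed = [name for name, _, _ in SUBSYSTEMS if name not in running]
    print("STILL RUNNING:", ", ".join(running) or "(nothing)")
    print("FAILED:       ", ", ".join(failed) or "(nothing)")
    if running:
        print("Shared infrastructure is carrying live load -- do not reboot it.")

if __name__ == "__main__":
    survey()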
I was once caught in a situation where I had not planned for the
failure of an H4000 AUI-to-thick-coax transceiver. This had many
consequences, as the production system was still processing but
transactions were not being sent over Ethernet to the volume-shadowed
disks at the backup site.
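
The lesson for me was that "the link exists" and "data is actually
arriving at the other end" are two different checks. A minimal
end-to-end heartbeat, assuming both sites expose the replicated volume
as a file path (the paths here are hypothetical):

#!/usr/bin/env python3
"""End-to-end replication heartbeat -- a sketch, not a product.

Writes a unique token on the primary copy of a replicated volume, then
checks whether the copy visible at the backup site has received it.
Both paths are hypothetical."""
import time
import uuid
from pathlib import Path

PRIMARY_MARKER = Path("/mnt/primary/replicated/heartbeat")  # hypothetical
BACKUP_MARKER = Path("/mnt/backup/replicated/heartbeat")    # hypothetical
PROPAGATION_WAIT = 30.0  # seconds allowed for replication to catch up

def replication_ok():
    """Write a token at the primary; confirm it shows up at the backup."""
    token = uuid.uuid4().hex
    PRIMARY_MARKER.write_text(token)
    time.sleep(PROPAGATION_WAIT)
    try:
        return BACKUP_MARKER.read_text() == token
    except OSError:
        return False

if __name__ == "__main__":
    if replication_ok():
        print("OK: data written at the primary is visible at the backup")
    else:
        print("ALERT: replication to the backup site appears broken")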
Ever since that event, I have had an interest in looking at similar
failures where something was not planned for in the disaster recovery
plan.