Rackspace Outage

Nigel

Nov 4, 2009, 5:19:42 AM

to Bambuu Beta

Dear Tom,

At approximately 12:29am CST this morning, our Dallas-Fort Worth
(DFW) data center experienced a power disruption, and consequently an
interruption of our services. The power disruption was the result of
issues during a maintenance effort that was scheduled and expected to
be non-impacting.

This summer our DFW facility had power issues, and as a result, we
invested significant resources to improve all aspects of our power
systems. Last night, during one of these steps, we encountered issues
and had a brief loss in power. The power disruption was approximately
5 minutes in duration. Despite this short power disruption, many
customers experienced downtime that was significantly longer. Since
the power disruption hit the core of many of our cloud services,
recovery of full operations required more effort than simple recovery
of power. The experience you had last night is not acceptable to us.

Here is what we know about the events:

· The scheduled maintenance was planned to occur from 12:05am - 6:05am
CST in our DFW data center. This maintenance was part of a
preventative maintenance schedule for several PDUs in UPS Cluster G at
the DFW data center. At approximately 12:29am CST, all PDUs behind UPS
Cluster G lost power. The PDUs were down for a total of 5 minutes
before power was restored.

· Although the power outage was very brief (5 minutes), it forced a
hard reboot of a portion of our cloud infrastructure. As
our engineers worked to bring hardware back online, we experienced
several unforeseen hardware failures. Further complicating our
recovery effort, the incident also created internal DNS issues, which
caused additional delays. With that said, the vast majority of cloud
customers affected by this outage had service restored within one
hour's time (many in as little as five minutes); however, depending
upon the service, a few customers experienced service interruptions
for up to a few hours.

Here is how we plan to deal with it:

· We have invested massively in the DFW facility to ensure it delivers
at a level you expect from Rackspace - despite last night, we feel
very good about our plan and have high confidence in the DFW facility
- clearly we have to prove it.

· We are reviewing our maintenance notifications - we typically do not
share information on expected non-impacting events, but clearly we
need to ensure we calibrate these events and are fully transparent.

· We are reviewing our procedures and systems for quickly resuming
cloud operations when an unexpected event like this occurs -
unexpected events will happen, and our job is to minimize their impact.

We live by high standards and clearly have not lived up to them. We
welcome any feedback. If you would like a call from me, or anyone on
our senior team to discuss these issues personally, please reply with
a phone number.

We have work to do to earn back your trust. We will not rest until we
have.

Thank you,

Emil Sayegh
General Manager, The Rackspace Cloud
