Why did it crash?


Martin Sweeney

May 7, 2009, 4:10:13 AM
to scalr-discuss
So my farm decided to crash this morning, all backups and database
bundles worked fine and another set of instances are in its place.
Hurrah!

What concerns me is why all four instances decided to crash within 3
minutes of each other. They're not connected by anything other than
connections to databases and memcache servers etc, but they all went
at once.

Instance 'i-46f94bxx' found in database but not found on EC2. Crashed.
Instance 'i-9a9014xx' found in database but not found on EC2. Crashed.
Instance 'i-29009axx' found in database but not found on EC2. Crashed.
Instance 'i-27a2ccxx' found in database but not found on EC2. Crashed.

Is there anywhere I can find more info on this other than my Event
log?

M.

Alex Kovalyov

May 7, 2009, 10:11:52 AM
to scalr-discuss
Martin, it was a user error on the Scalr.net side. A dev version of the
poller went nuts and selectively terminated instances on a few farms
before it was killed.

Niv

May 7, 2009, 10:22:20 AM
to scalr-discuss
ruined my day & upcoming weekend + major data loss + ~20 extra
instances running for several hours doing nothing. yay.

Niv

May 7, 2009, 10:30:16 AM
to scalr-discuss
and i have to add that the cause of the major data loss is your no-
good way of doing the snapshots. once a snapshot creation starts, the
older snapshot is immediately corrupt.
your human error caused my instances to crash mid-snapshot creation
and when restarted, the servers failed to download the snapshot and
kept terminating.
this bug was submitted more than six months ago and you've done
absolutely nothing to fix it.

Donovan Bray

May 7, 2009, 9:41:12 PM
to scalr-...@googlegroups.com
We lost instances out of several farms, luckily no known corruption.
But I agree the snapshots should use the same pattern the mysql dumps
do. It's basic backup practice to never overwrite your last backup
with your next, it's like backing up to the same tape every night.
You are eventually going to get bit and bit hard. We created a task
to grab the periodic snapshots and store them off s3, but would still
like to have rolling snapshots.
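
To make the "rolling" idea concrete, here is a minimal sketch of the pattern, not Scalr's actual snapshot code. It assumes the snapshot is just a blob of bytes written to a local directory; the function name `write_rolling_snapshot`, the `snapshot-<timestamp>.bin` naming, and the `keep` parameter are all made up for illustration. The point is that the new snapshot goes to a temporary file first and only gets its final name once it is completely written, so a crash mid-snapshot never touches the previous good copy.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_rolling_snapshot(data: bytes, backup_dir: Path, keep: int = 3, now=None) -> Path:
    """Write a new timestamped snapshot, then prune all but the newest `keep`.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%S%fZ")
    final_path = backup_dir / f"snapshot-{stamp}.bin"

    # Write to a temporary file in the same directory so the rename below
    # stays on one filesystem. Older snapshots are not touched at this stage.
    fd, tmp_name = tempfile.mkstemp(dir=backup_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Only a fully written file ever receives a snapshot-*.bin name.
        os.replace(tmp_name, final_path)
    except BaseException:
        # A failed write leaves the previous snapshots intact.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise

    # Prune only after the new snapshot is safely in place.
    # The timestamp format sorts lexicographically in chronological order.
    snapshots = sorted(backup_dir.glob("snapshot-*.bin"))
    for old in snapshots[:-keep]:
        old.unlink()
    return final_path
```

With `keep=3`, a farm that dies in the middle of a snapshot still has up to three complete older snapshots to restore from; the interrupted write only leaves behind (or cleans up) a `.partial` file.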

Cole

May 7, 2009, 3:08:00 PM
to scalr-discuss
Whoa, this is kind of a deal breaker here! Did this really happen?
RightScale's seeming quite cost-effective now if this is the case!

rainier2

Jul 6, 2009, 2:39:15 PM
to scalr-discuss
Hey, just looking for a little closure here.

Was this a newly deployed production poller, or was it the dev poller
that broke out of the dev sandbox?

Has Scalr.net taken any actions to prevent a similar problem in the
future?


Thanks!

Esé

Jul 13, 2009, 7:41:53 PM
to scalr-discuss
hey folks,

would love to get an update on this as well. it's a little terrifying
to hear about rogue dev scalr processes killing production farms. are
there safeguards in place now to prevent this kind of thing happening?

hoping for a speedy response, thanks!

E.

Nickolas Toursky

Jul 14, 2009, 12:04:19 PM
to scalr-...@googlegroups.com
Hi guys,

We developed a new staging environment after this happened.
It gives us the ability to test new features more thoroughly before
deploying them live.

Nick


kenvogt

Jul 14, 2009, 1:33:08 PM
to scalr-discuss
So I'm not clear as to where things stand now. Are there rolling
snapshots or not?

On Jul 14, 9:04 am, Nickolas Toursky <hin...@gmail.com> wrote:
> Hi guys,
>
> We have developed a new staging environment after this has happened.
> It gives us an ability to test the new features more accurately before
> deploying them live.
>
> Nick
>
> 2009/7/14 Esé <opusdpeng...@gmail.com>: