So my farm decided to crash this morning. All backups and database
bundles worked fine, and another set of instances is in its place.
Hurrah!
What concerns me is why all four instances decided to crash within 3
minutes of each other. They're not connected by anything other than
connections to databases and memcache servers etc, but they all went
at once.
Instance 'i-46f94bxx' found in database but not found on EC2. Crashed.
Instance 'i-9a9014xx' found in database but not found on EC2. Crashed.
Instance 'i-29009axx' found in database but not found on EC2. Crashed.
Instance 'i-27a2ccxx' found in database but not found on EC2. Crashed.
Is there anywhere I can find more info on this other than my Event
log?
Martin, it was a user error on the Scalr.net side. A dev version of the
poller went nuts and selectively terminated instances on a few farms
before it was killed.
On May 7, 11:10, Martin Sweeney <martin.swee...@gmail.com> wrote:
> So my farm decided to crash this morning. All backups and database
> bundles worked fine, and another set of instances is in its place.
> Hurrah!
> What concerns me is why all four instances decided to crash within 3
> minutes of each other. They're not connected by anything other than
> connections to databases and memcache servers etc, but they all went
> at once.
> Instance 'i-46f94bxx' found in database but not found on EC2. Crashed.
> Instance 'i-9a9014xx' found in database but not found on EC2. Crashed.
> Instance 'i-29009axx' found in database but not found on EC2. Crashed.
> Instance 'i-27a2ccxx' found in database but not found on EC2. Crashed.
> Is there anywhere I can find more info on this other than my Event
> log?
and i have to add that the cause of the major data loss is your no-
good way of doing the snapshots. once a snapshot creation starts, the
older snapshot is immediately corrupt.
your human error caused my instances to crash mid-snapshot creation
and when restarted, the servers failed to download the snapshot and
kept terminating.
this bug was submitted more than six months ago and you've done
absolutely nothing to fix it.
On May 7, 5:22 pm, Niv <nivsin...@gmail.com> wrote:
> ruined my day & upcoming weekend + major data loss + ~20 extra
> instances running for several hours doing nothing. yay.
> On May 7, 5:11 pm, Alex Kovalyov <alex.koval...@gmail.com> wrote:
> > Martin, it was a user error on the Scalr.net side. A dev version of the
> > poller went nuts and selectively terminated instances on a few farms
> > before it was killed.
We lost instances out of several farms, luckily with no known corruption.
But I agree the snapshots should use the same pattern the MySQL dumps
do. It's basic backup practice to never overwrite your last backup
with your next; it's like backing up to the same tape every night.
You are eventually going to get bitten, and bitten hard. We created a task
to grab the periodic snapshots and store them off S3, but would still
like to have rolling snapshots.
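
For anyone wondering what "rolling snapshots" could look like in practice, here is a minimal Python sketch. It is not Scalr's implementation, which this thread does not describe: the function name `write_rolling_snapshot`, the `snapshot-NNNNNNNN.bin` naming scheme, and the use of a local directory as the snapshot store are all assumptions made for illustration. The idea is only that a new snapshot is written under a temporary name, promoted once it is complete, and older snapshots are pruned afterwards, so a crash mid-write should leave the previous snapshot intact.

```python
import os
import tempfile
from pathlib import Path


def write_rolling_snapshot(snapshot_dir: Path, data: bytes, keep: int = 3) -> Path:
    """Write a new snapshot next to the old ones, then prune to `keep` copies.

    Hypothetical helper: a local directory stands in for the real
    snapshot store, and the file naming scheme is invented.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snapshot_dir.mkdir(parents=True, exist_ok=True)

    # Pick the next sequence number from the snapshots already on disk.
    existing = sorted(snapshot_dir.glob("snapshot-*.bin"))
    next_seq = int(existing[-1].stem.split("-")[1]) + 1 if existing else 1
    final = snapshot_dir / f"snapshot-{next_seq:08d}.bin"

    # Write under a temporary name first, so an interrupted write never
    # replaces or truncates a snapshot that was already complete.
    fd, tmp_name = tempfile.mkstemp(dir=snapshot_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, final)  # rename within the same directory
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise

    # Prune only after the new snapshot is fully in place.
    snapshots = sorted(snapshot_dir.glob("snapshot-*.bin"))
    for old in snapshots[:-keep]:
        old.unlink()
    return final
```

Calling it repeatedly with `keep=3` leaves the three most recent snapshot files in the directory; a restore would then pick the newest file that does not end in `.partial`.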
On May 7, 2009, at 7:30 AM, Niv <nivsin...@gmail.com> wrote:
> and i have to add that the cause of the major data loss is your no-
> good way of doing the snapshots. once a snapshot creation starts, the
> older snapshot is immediately corrupt.
> your human error caused my instances to crash mid-snapshot creation
> and when restarted, the servers failed to download the snapshot and
> kept terminating.
> this bug was submitted more than six months ago and you've done
> absolutely nothing to fix it.
would love to get an update on this as well. it's a little terrifying
to hear about rogue dev scalr processes killing production farms. are
there safeguards in place now to prevent this kind of thing happening?
hoping for a speedy response, thanks!
E.
On Jul 6, 11:39 am, rainier2 <nick.stie...@gmail.com> wrote:
> Hey, just looking for a little closure here.
> Was this a newly deployed production poller, or was it the dev poller
> Has Scalr.net taken any actions to prevent a similar problem in the
> future?
> Thanks!
> On May 7, 12:08 pm, Cole <coleflour...@gmail.com> wrote:
> > Whoa, this is kind of a deal breaker here! Did this really happen?
> > RightScale is seeming quite cost-effective now if this is the case!
We developed a new staging environment after this happened.
It lets us test new features more accurately before
deploying them live.