bosh vm resurrection <- how should this work?


da...@castlelaing.com

Jan 20, 2014, 8:32:55 AM
to bosh-...@cloudfoundry.org
Could someone explain how the command

bosh vm resurrection [<job>] [<index>] <new_state>
    Enable/Disable resurrection for a given vm

is supposed to work?

I've activated resurrection for all the jobs in my cf-release deployment, which is managed by a MicroBOSH on AWS (see below), but I can't see any resurrection happening.

I was expecting that if I terminated one of my AWS VMs, BOSH would automatically recreate it after a few minutes.

But I can't seem to make this happen.

Does this feature work?

Or have I just misunderstood how it is supposed to work?

Thanks!

D


ubuntu@ip-10-0-0-10:~$ bosh vms --details
Deployment `monitor-cloud'

Director task 363

Task 363 done

+--------------------+---------+-------------------------+--------+~~+--------------+
| Job/index          | State   | Resource Pool           | IPs    |~~| Resurrection |
+--------------------+---------+-------------------------+--------+~~+--------------+
| api/0              | running | vpc-public-small        | 10.0.0.|~~| active       |
| cloud_controller/0 | running | vpc-private-small       | 10.0.1.|~~| active       |
| core/0             | running | vpc-private-small       | 10.0.1.|~~| active       |
| data/0             | running | vpc-private-small       | 10.0.1.|~~| active       |
| dea/0              | running | vpc-private-medium      | 10.0.1.|~~| active       |
| dea_spot/0         | running | vpc-private-medium-spot | 10.0.1.|~~| active       |
| dea_spot/1         | running | vpc-private-medium-spot | 10.0.1.|~~| active       |
+--------------------+---------+-------------------------+--------+~~+--------------+

MicroBOSH details:

ubuntu@ip-10-0-0-10:~$ bosh status
Config
             /home/ubuntu/labs-operations/deployments/monitor-cloud/bosh_config.yml

Director
  Name       microbosh-monitor-cloud
  URL        https://10.0.1.10:25555
  Version    1.5.0.pre.1657 (release:ea2068ab bosh:ea2068ab)
  User       admin
  UUID       f38e81de-d07b-4b56-b691-4f005745575b
  CPI        aws
  dns        enabled (domain_name: microbosh)
  compiled_package_cache disabled
  snapshots  disabled

Deployment
  Manifest   /home/ubuntu/labs-operations/deployments/monitor-cloud/monitor-cloud/monitor-cloud.yml

James Bayer

Jan 20, 2014, 11:06:57 AM
to bosh-users
Talked to Ferdy (sitting next to me on an empty train; happy MLK day, everyone!). Ferdy intends to reply later, but basically there are two settings:
1) enabling resurrection globally on the director (happens when you deploy bosh)
2) enabling resurrection on each job

Ferdy thinks you may only be doing 2). More details later.
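
A quick way to check setting 1) is to look in the BOSH (or MicroBOSH) deployment manifest for the health monitor's resurrector switch. The sketch below assumes the property is spelled `resurrector_enabled` and sits under `hm` in the manifest's properties; that name is not given anywhere in this thread, so verify it against your BOSH version. The function name and the manifest path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch only: succeeds if the manifest contains a line "resurrector_enabled: true".
# ASSUMPTION: the director-level switch is a property named resurrector_enabled
# (under hm); the thread itself never names it.
manifest_enables_resurrector() {
  grep -Eq '^[[:space:]]*resurrector_enabled:[[:space:]]*true[[:space:]]*$' "$1"
}

# Usage (the path is a placeholder):
#   manifest_enables_resurrector ./micro_bosh.yml && echo "resurrector enabled"
```

This is a plain text match, not a YAML parse, so it only tells you whether such a line exists, not where it is nested.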



--
Thank you,

James Bayer

Dr Nic Williams

Jan 20, 2014, 11:14:25 AM
to bosh-...@cloudfoundry.org, bosh-users
Also, resurrection doesn't show up in "bosh tasks recent"; instead add the --no-filter flag and you'll see any resurrections that are (locking and) running on your deployment.

Ferran Rodenas

Jan 20, 2014, 1:01:48 PM
to bosh-...@cloudfoundry.org
The CLI command (and the automatic recreation of a terminated VM) only works if BOSH has the resurrector plugin enabled (it is disabled by default). Once it is enabled, the CLI command is useful for temporarily disabling resurrection for a job you intend to work on, so that you can re-enable it later.

PS: And before anyone says it: I agree it's not very intuitive.

- Ferdy
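
The pause-and-re-enable workflow described above might look like the following sketch. It uses the `bosh vm resurrection [<job>] [<index>] <new_state>` syntax quoted at the top of the thread and assumes `on` and `off` are accepted values for `<new_state>`. The `set_resurrection` helper and the `BOSH` variable are hypothetical names introduced here; by default the script only echoes the commands (a dry run), so set `BOSH=bosh` to run them against a real director.

```shell
#!/bin/sh
# Sketch: pause resurrection for one job instance while working on it,
# then hand it back to the resurrector.
# BOSH defaults to "echo bosh" (dry run); export BOSH=bosh to run for real.
BOSH="${BOSH:-echo bosh}"

# Usage: set_resurrection <job> <index> <on|off>
set_resurrection() {
  case "$3" in
    on|off) $BOSH vm resurrection "$1" "$2" "$3" ;;
    *) echo "state must be 'on' or 'off'" >&2; return 1 ;;
  esac
}

set_resurrection dea 0 off   # pause before manual maintenance
# ... do the maintenance here ...
set_resurrection dea 0 on    # re-enable afterwards
$BOSH vms --details          # inspect the Resurrection column
```

None of this has any effect unless the resurrector plugin is enabled on the director in the first place.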



David Laing

Jan 20, 2014, 3:11:57 PM
to bosh-...@cloudfoundry.org

Aha!

Thanks all.

I shall go twiddle some knobs and dial some more buttons.

Resurrection + spot instances is going to be awesome!

Dear CFO - I just reduced our cloud costs by 90%.

Dear David - well done!  Have half of those cost savings as a bonus.

Brrring.  Goes the alarm clock...

Dr Nic Williams

Jan 20, 2014, 3:38:14 PM
to bosh-...@cloudfoundry.org, bosh-...@cloudfoundry.org
Make sure you get a baseline for your current costs so you can show an obvious, instant improvement ;)

dashbo...@labs.cityindex.com

Jan 21, 2014, 2:38:05 AM
to bosh-...@cloudfoundry.org
Woo-hoo.  Thanks everyone - my terminated spot instances have been resurrected!

$ bosh tasks recent --no-filter

+-----+-------+-------------------------+-------+--------------+----------------------------------------------+
| #   | State | Timestamp               | User  | Description  | Result                                       |
+-----+-------+-------------------------+-------+--------------+----------------------------------------------+
| 383 | done  | 2014-01-21 07:34:29 UTC | admin | scan and fix | scan and fix complete                        |
+-----+-------+-------------------------+-------+--------------+----------------------------------------------+
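
To pick the resurrector's work out of a longer task list, the table can be filtered for the "scan and fix" description shown above. This is a minimal sketch: `resurrection_tasks` is a hypothetical helper name, and it assumes resurrection always appears with that description, which this thread shows for only one task.

```shell
#!/bin/sh
# Keep only the "scan and fix" rows of a `bosh tasks` table read from stdin.
# Returns success even when nothing matches, so it is safe under `set -e`.
resurrection_tasks() {
  grep -F 'scan and fix' || true
}

# Usage against a live director:
#   bosh tasks recent --no-filter | resurrection_tasks
#
# Self-contained demo using the header and the row posted in this thread:
printf '%s\n' \
  '| #   | State | Timestamp               | User  | Description  | Result                |' \
  '| 383 | done  | 2014-01-21 07:34:29 UTC | admin | scan and fix | scan and fix complete |' \
  | resurrection_tasks
```

The demo prints only the row for task 383, since the header line does not contain the description text.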

Dr Nic Williams

Jan 21, 2014, 4:37:42 AM
to bosh-users
*applauds*

Ryan Grenz

Jun 6, 2014, 8:49:40 AM
to bosh-...@cloudfoundry.org
Hi all,

Can I ask what the intended behaviour of resurrection is if an ESX host containing a number of VMs in a deployment suddenly goes offline?
I can see endless attempts to power off the VM in the vSphere logs, so resurrection is attempting to work, but because the ESX host is disconnected, it fails to complete.

The ESX host should come back online soon, but what if it's completely broken and the VMs are lost? I can attempt to delete the VM reference with bosh cck, but I get this error:

Director task 9376
  Started applying problem resolutions
  Started applying problem resolutions > out_of_sync_vm 541: Ignore problem. Done (00:00:00)
  Started applying problem resolutions > unresponsive_agent 512: Delete VM reference (DANGEROUS!). Failed: VM `512' has a cloud id, please use a different resolution. (00:00:10)
  Started applying problem resolutions > unresponsive_agent 509: Delete VM reference (DANGEROUS!). Failed: VM `509' has a cloud id, please use a different resolution. (00:00:10)
  Started applying problem resolutions > unresponsive_agent 504: Delete VM reference (DANGEROUS!). Failed: VM `504' has a cloud id, please use a different resolution. (00:00:10)
  Started applying problem resolutions > unresponsive_agent 524: Delete VM reference (DANGEROUS!). Failed: VM `524' has a cloud id, please use a different resolution. (00:00:10)
  Started applying problem resolutions > unresponsive_agent 523: Delete VM reference (DANGEROUS!). Failed: VM `523' has a cloud id, please use a different resolution. (00:00:10)
  Started applying problem resolutions > unresponsive_agent 507: Delete VM reference (DANGEROUS!). Failed: VM `507' has a cloud id, please use a different resolution. (00:00:10)
   Failed applying problem resolutions (00:01:00)

Task 9376 done

There is some debate in our team about whether there should be automatic resolution in this situation: if the ESX host comes back online, so will its VMs, so automatically recreating them on other ESX hosts may cause trouble later.
I guess BOSH could track a list of dead machines so that, if they come back online and their agents call back into the director, they could be terminated immediately.

What's the best thing to do in this situation, though?

Thanks,

Ryan