Hi all,
Can I ask what the intended behaviour of resurrection is when an ESX host containing a number of VMs in a deployment suddenly goes offline?
I can see endless attempts to power off the VMs in the vSphere logs, so resurrection is trying to work, but because the ESX host is disconnected, it never completes.
The ESX host should come back online soon, but what if it's completely broken and the VMs are lost? I can attempt to delete the VM references with bosh cck, but I get this error:
Director task 9376
Started applying problem resolutions
Started applying problem resolutions > out_of_sync_vm 541: Ignore problem. Done (00:00:00)
Started applying problem resolutions > unresponsive_agent 512: Delete VM reference (DANGEROUS!). Failed: VM `512' has a cloud id, please use a different resolution. (00:00:10)
Started applying problem resolutions > unresponsive_agent 509: Delete VM reference (DANGEROUS!). Failed: VM `509' has a cloud id, please use a different resolution. (00:00:10)
Started applying problem resolutions > unresponsive_agent 504: Delete VM reference (DANGEROUS!). Failed: VM `504' has a cloud id, please use a different resolution. (00:00:10)
Started applying problem resolutions > unresponsive_agent 524: Delete VM reference (DANGEROUS!). Failed: VM `524' has a cloud id, please use a different resolution. (00:00:10)
Started applying problem resolutions > unresponsive_agent 523: Delete VM reference (DANGEROUS!). Failed: VM `523' has a cloud id, please use a different resolution. (00:00:10)
Started applying problem resolutions > unresponsive_agent 507: Delete VM reference (DANGEROUS!). Failed: VM `507' has a cloud id, please use a different resolution. (00:00:10)
Failed applying problem resolutions (00:01:00)
Task 9376 done
There is some debate in our team about whether resolution should be automatic in this situation: if the ESX host comes back online, so will the VMs, so automatically recreating them on other ESX hosts may cause trouble later.
I guess BOSH could track a list of dead machines such that if they come back online and their agents call back into the director, they could be immediately terminated.
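To make that idea concrete, here is a rough Python sketch of what I have in mind. This is not existing BOSH code: the TombstoneRegistry class, its method names, and the agent IDs are all made up for illustration, and I am only assuming the director could hook something like this into its agent heartbeat handling.

```python
import time


class TombstoneRegistry:
    """Hypothetical record of VMs declared lost and recreated elsewhere."""

    def __init__(self):
        # Maps agent ID -> time at which the VM was declared dead.
        self._dead = {}

    def mark_dead(self, agent_id):
        """Record that this agent's VM was given up on and replaced."""
        self._dead[agent_id] = time.time()

    def on_heartbeat(self, agent_id):
        """Decide what to do when an agent calls back into the director."""
        if agent_id in self._dead:
            # The VM was already replaced, so the returning copy is stale.
            return "terminate"
        return "accept"


registry = TombstoneRegistry()
registry.mark_dead("agent-512")  # VM lost along with the ESX host

print(registry.on_heartbeat("agent-512"))  # stale VM reappears
print(registry.on_heartbeat("agent-541"))  # healthy VM, never marked dead
```

The point is just that the decision is made when the old agent reconnects, rather than trying to reach a VM on a host that is offline.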
What's the best thing to do in this situation though?
Thanks,
Ryan