What is supposed to happen when a CRASHED droplet is reported on dea.heartbeat

Cornelia Davis

unread,

Jun 19, 2013, 1:25:14 AM6/19/13

to vcap...@cloudfoundry.org

I'm trying to understand exactly what happens when an instance of my app dies and at the moment I'm not seeing the behavior I expected. In short, I am

pushing a simple ruby app (not bound to any services).
scaling the app to 2 instances
using the warden repl to kill one of the warden containers

At this point I see the following message on NATS

Msg received on [dea.heartbeat] : '{"droplets":[{"cc_partition":"ng","droplet":"8799bc94-aec2-4d9b-89fe-afe0b19ee70e","version":"3ab998c8-30b0-4463-900c-5b24f5661d5e","instance":"648c67d73451faa56e3f06c116170ea5","index":0,"state":"CRASHED","state_timestamp":1371618457.4275367},{"cc_partition":"ng","droplet":"8799bc94-aec2-4d9b-89fe-afe0b19ee70e","version":"3ab998c8-30b0-4463-900c-5b24f5661d5e","instance":"912bae76e7bdb5e6a109f277c26d3329","index":1,"state":"RUNNING","state_timestamp":1371618268.9850419}],"dea":"0-decc0f9337d3e9dd5d2f4fb5c720a9e2","prod":false}'

As you can see, one of the two instances has a state of CRASHED. I can look in the health_manager log and see that the heartbeat message is getting processed, but a new app instance is never started. I've done an initial read through the health manager code and haven't been able to find any evidence that a restart is tried.

What is supposed to happen as a result of the above message?

This is all on v2, btw.

(cross-posting on stack overflow)

James Bayer

unread,

Jun 19, 2013, 4:37:21 AM6/19/13

to vcap...@cloudfoundry.org

i believe you should see a message on the droplet.exited channel sent by the dea and listened to by the HM. later you should see a hm.request message sent from HM to the CC indicating that the HM thinks there should be another instance of that app running. later you should see the CC actually start a new instance on a DEA:

03:04:37 PM     droplet.exited           app: gopaste, reason: CRASHED, index: 11                                                   
03:04:37 PM     hm.request               app: gopaste, operation: start, app last updated: 2013-06-17 15:04:15 -0700, indices: 11

03:04:38 PM     dea.1.start              app: gopaste, dea: 1, index: 11, uris: gopaste.some.systemdomain

i highly recommend that you use alex's excellent cf tunnel-watch bosh.some.systemdomain watch

you can get the tools cf plugin with gem install tools-cf-plugin

if you have a set of reproducer steps for a bug, let us know.

--

Thank you,

James Bayer

Cornelia Davis

unread,

Jun 19, 2013, 8:17:33 PM6/19/13

to vcap...@cloudfoundry.org

Yes, when I destroyed the warden container I did see the droplet.exited message, but did not see any hm.request or dea start messages to follow. Is it the droplet.exited message that kicks off the restart process? What if that message is lost? The dea.heartbeat doesn't cause any type of recovery?

Right now the state I am in is this:

$ cf watch hello

12:11:01 AM dea.heartbeat dea: 0, running: 1

12:11:02 AM router.register app: hello, dea: 0, uris: hello.vcap.me, host: 10.0.2.15, port: 61007

12:11:11 AM dea.heartbeat dea: 0, running: 1

12:11:21 AM dea.heartbeat dea: 0, running: 1

But looking at the app instance object:

cf curl get /v2/apps/8799bc94-aec2-4d9b-89fe-afe0b19ee70e

{

"metadata": {

"guid": "8799bc94-aec2-4d9b-89fe-afe0b19ee70e",

"url": "/v2/apps/8799bc94-aec2-4d9b-89fe-afe0b19ee70e",

"created_at": "2013-06-19 04:33:21 +0000",

"updated_at": "2013-06-19 05:04:26 +0000"

},

"entity": {

"name": "hello",

"production": false,

"space_guid": "166b77a8-b79a-4a65-92cc-38dd7edf48f4",

"stack_guid": "6c8e656b-61fc-4ead-bf92-269b7ca2058d",

"buildpack": null,

"detected_buildpack": "Ruby/Rack",

"environment_json": {

},

"memory": 256,

"instances": 2,

"disk_quota": 1024,

"state": "STARTED",

"version": "3ab998c8-30b0-4463-900c-5b24f5661d5e",

"command": null,

"console": true,

"debug": null,

"staging_task_id": "cc261fa404d053ca4f737f9f82b10221",

"service_bindings_url": "/v2/apps/8799bc94-aec2-4d9b-89fe-afe0b19ee70e/service_bindings",

"routes_url": "/v2/apps/8799bc94-aec2-4d9b-89fe-afe0b19ee70e/routes",

"app_events_url": "/v2/apps/8799bc94-aec2-4d9b-89fe-afe0b19ee70e/app_events",

"space_url": "/v2/spaces/166b77a8-b79a-4a65-92cc-38dd7edf48f4",

"stack_url": "/v2/stacks/6c8e656b-61fc-4ead-bf92-269b7ca2058d"

}

As you can see, it says it should have two instances.

So, again, if that droplet.exited message is not acted upon for whatever reason, am I stuck? Is there no other recovery mechanism?

Thanks,

Cornelia

James Bayer

unread,

Jun 20, 2013, 3:38:08 AM6/20/13

to vcap...@cloudfoundry.org

the HM is supposed to notice after some time passes listening to the dea messages that there are not as many app instances running as the cloud controller asked there to be. then it should nudge the cloud controller to tell it to start another one. that may take some time. we've had a few edge cases in the HM logic recently, so it's possible that you've hit one of those.

https://github.com/cloudfoundry/health_manager/commits/master

Cornelia Davis

unread,

Jun 21, 2013, 12:52:20 AM6/21/13

to vcap...@cloudfoundry.org

Okay, I'm making so progress but have been banging my head against the wall for a while on one thing. When I run Alex's watcher I get something that looks like this:

04:00:07 AM dea.heartbeat dea: 0, running: 1

04:00:10 AM router.register app: hello, dea: 0, uris: hello.customer-app-domain1.com, host: 10.0.2.15, port: 61007

04:00:17 AM router.unregister app: hello, dea: 0, uris: hello.customer-app-domain1.com, host: 10.0.2.15, port: 61007

04:00:17 AM droplet.exited app: hello, reason: CRASHED, index: 0

04:00:17 AM dea.heartbeat dea: 0, crashed: 1

couldn't deal w/ cloudcontrollers.hm.requests.ng '{"droplet":"8ebb0367-2915-418b-bdd8-e5e94599bdce","op":"START","last_updated":null,"version":null,"indices":[0]}': TypeError: can't convert nil into an exact number

That error message at the bottom is because the watcher tool cannot format the null value for "last_updated" - And the cloud controller respondent also looks at that last_updated value during processing, so the cc never handles this message appropriately. I'm trying to track down how the last_updated value could be null (and version too) and I'm getting a bit cross-eyed in the code. Does anyone have any idea on how that value could be null?

James Bayer

unread,

Jun 21, 2013, 3:38:23 AM6/21/13

to vcap...@cloudfoundry.org

Cornelia, we just found some new issues in the HM logic with edge/race conditions, so it's probably not going to do all the right things until this known issue is fixed.

It sounds like you're encountering a potentially new bug also. I've asked some people to review your note.

James Bayer

unread,

Jun 21, 2013, 3:50:20 AM6/21/13

to vcap...@cloudfoundry.org

i opened a new bug based on this detail from alex suraci:

I think that can happen if the health manager sees a crashed instance before it's done a bulk update, i.e. it sees a droplet crash that it knew nothing about. This means the HM will send a start request (with low priority) for an instance it's never seen before, so it can't know the last_modified or version. And yes, it does look like the CC will drop this request on the floor:

https://github.com/cloudfoundry/cloud_controller_ng/blob/master/lib/cloud_controller/health_manager_respondent.rb#L62

This will be doing (nil).to_i == (app's updated_at).to_i, which should be false.

Cornelia Davis

unread,

Jun 21, 2013, 10:29:22 AM6/21/13

to vcap...@cloudfoundry.org

Okay, that seems to make a lot of sense. How often does the HM do a bulk update?

James Bayer

unread,

Jun 21, 2013, 11:06:57 AM6/21/13

to vcap...@cloudfoundry.org

defaults to 60 sec it looks like: https://github.com/cloudfoundry/health_manager/blob/master/config/health_manager.yml#L31

Cornelia Davis

unread,

Jun 21, 2013, 1:54:32 PM6/21/13

to vcap...@cloudfoundry.org

Thanks James - I did find that right after I asked - it's quite straight forward to configure.

And I've finally got it:

05:16:38 PM dea.heartbeat dea: 0, running: 1

05:16:38 PM router.unregister app: hello, dea: 0, uris: hello.vcap.me, host: 10.0.2.15, port: 61003

05:16:38 PM droplet.exited app: hello, reason: CRASHED, index: 0

05:16:39 PM hm.request app: hello, operation: start, app last updated: 2013-06-21 17:00:11 +0000, indices: 0

05:16:39 PM dea.0.start app: hello, dea: 0, index: 0, uris: hello.vcap.me

05:16:42 PM dea.heartbeat dea: 0, running: 1

05:16:42 PM router.register app: hello, dea: 0, uris: hello.vcap.me, host: 10.0.2.15, port: 61007

05:16:48 PM dea.heartbeat dea: 0, crashed: 1, running: 1

Check that restart out!

Turned out not to be a race condition, rather a configuration problem - the bulk api host (https://github.com/cloudfoundry/health_manager/blob/master/config/health_manager.yml#L20) wasn't set right for me. I was doing this all on the Altoros vagrant-based installer (https://github.com/Altoros/cf-vagrant-installer) - will be sending a pull request to them for these fixes.

Thanks for the help James - it was your and Alex's comments that helped me look in the right place.

Manuel Garcia

unread,

Jun 24, 2013, 10:25:07 PM6/24/13

to vcap...@cloudfoundry.org

Merged!

Thank you guys. Great finding.

Reply all

Reply to author

Forward