Confusion about adding random noise to the delay calculated to restart a flapping instance in HM

Hongliang Sun

Dec 22, 2013, 6:02:19 AM

to vcap...@cloudfoundry.org

Hi, everyone,

While studying the source code of HealthManager 2.0, I found that when HM deals with a flapping instance, it restarts the instance immediately once the calculated delay has elapsed. I think that really makes sense. But when calculating the delay, one step adds random noise to it to avoid a storm of simultaneous restarts. What is more, the notes say that this step is necessary because delayed restarts bypass nudger's queue.

I can't quite follow the sentence "this is necessary because delayed restarts bypass nudger's queue".

Can someone give me a more specific explanation?

Best Regards

Hongliang Sun

James Bayer

Dec 22, 2013, 8:46:56 AM

to vcap...@cloudfoundry.org

pivotal has now been using hm9000 in production for run.pivotal.io for several weeks. sometime soon, likely within 1-2 weeks into the new year, the CF team will recommend that everyone else also start using hm9000 instead of health_manager_next, and will have nice getting started instructions for how people should switch over, how to config BOSH deployment manifests, etc. there is a nice pairing video here that discusses some of the patterns used in hm9000, but also talks a lot about go best practices [1]. docs are here [2].

in the meantime, i believe what "this is necessary because delayed restarts bypass nudger's queue" means is that these restart messages get sent immediately on NATS instead of being buffered in a queue, as happens in other scenarios. CC may act on them immediately, so not having noise was observed to cause storms in certain scenarios. someone on the runtime team may be able to confirm/fix this explanation or better explain it.

[1] http://www.youtube.com/watch?v=Yvoe2JPyhas&feature=youtu.be&ref=twdec20

[2] https://github.com/cloudfoundry/hm9000/blob/master/README.md
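To make the queue point concrete, here is a minimal sketch of the difference, not HM's actual code. It assumes a hypothetical queue that sends at most `batch_size` requests per tick, and compares it with timers that each send their restart as soon as they expire. The names `drain_in_batches` and `fire_timers` are invented for illustration.

```python
from collections import Counter, deque


def drain_in_batches(requests, batch_size):
    """Hypothetical queue: send at most batch_size requests per tick.

    Returns the number of requests sent on each tick.
    """
    queue = deque(requests)
    sent_per_tick = []
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        sent_per_tick.append(len(batch))
    return sent_per_tick


def fire_timers(delays):
    """Delayed restarts that bypass the queue.

    Each restart is sent the moment its timer expires, so the result
    maps each expiry time to the number of restarts sent at that time.
    """
    return Counter(delays)


# 100 restarts through the queue: the batch size caps each burst.
queued = drain_in_batches(range(100), batch_size=20)

# 100 restarts on timers with the same delay: they all fire together.
direct = fire_timers([40] * 100)
```

With the queue, no tick carries more than 20 requests. With identical timers, all 100 restarts land in the same instant, which is the burst that added noise is meant to break up.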

--

Thank you,

James Bayer

Hongliang Sun

Dec 22, 2013, 10:49:59 AM

to vcap...@cloudfoundry.org

Thanks for your reply, James.

It is just a confusion that has been bothering me a lot. Actually, I am very interested in Golang, and I heard Pivotal engineers give talks at Platform: the Cloud Foundry Conference several months ago. Those talks taught me a lot about HM.

I do already know that these immediate start requests are sent to the Cloud Controller directly via NATS, without queuing.

That is to say, if no noise is added, a storm of simultaneous restarts occurs. This is the part I cannot understand. Does "storm" mean a series of continuous restarts?

As I understand it, doubling the delay with every additional crash will avoid some of the regular crashes of an app. Is the noise just meant to simulate the actual situation more closely?

Best Regards

Hongliang Sun

James Bayer

Dec 22, 2013, 12:05:41 PM

to vcap...@cloudfoundry.org

no, i believe the noise is intended to make sure that a bunch of restart messages don't arrive at the same time, to avoid this effect [1] in a system where many flapping apps are restarted. think of the case where the health manager job is restarted and begins restarting many apps. you don't want all of these apps to restart at precisely the same time.

[1] http://www.youtube.com/watch?v=xox9BVSu7Ok
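A rough sketch of the idea, under stated assumptions rather than HM's real implementation: the delay doubles with each crash up to a cap, and a random offset is added so that apps with the same crash count do not restart at the same moment. The function name `restart_delay`, the parameter names, and the default values are all hypothetical.

```python
import random


def restart_delay(crash_count, min_delay=1.0, max_delay=480.0,
                  max_noise=10.0, rng=random):
    """Return a restart delay in seconds for a flapping instance.

    All names and default values here are illustrative assumptions.
    """
    # Double the delay for every additional crash, up to max_delay.
    base = min(max_delay, min_delay * 2 ** (crash_count - 1))
    # Random offset in [-limit, +limit]; the limit never exceeds the
    # base delay, so the result cannot be negative.
    limit = min(base, max_noise)
    noise = rng.uniform(-1.0, 1.0) * limit
    return base + noise


# 100 apps that have each crashed 5 times share a base delay of
# 16 seconds, but the noise spreads their restarts over 6 to 26 seconds.
rng = random.Random(0)
delays = [restart_delay(5, rng=rng) for _ in range(100)]
```

Without the noise term every one of those 100 apps would be restarted exactly 16 seconds later, all at once.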

Hongliang Sun

Dec 23, 2013, 7:03:25 AM

to vcap...@cloudfoundry.org

Thanks for the detailed explanation, James.

I think I have got the point now. The random noise handles the situation where many flapping apps would otherwise be restarted at the same time.
