Scale-down of Vert.x microservices in Kubernetes


Anil

Apr 19, 2017, 7:43:56 AM
to vert.x

Hi Team,

We are running Vert.x/Hazelcast microservices on Kubernetes, and we see requests getting timed out or terminated during scale-down of pods.

We have two verticles (standard and worker) in the application. The standard verticle receives the request and pushes it to the event bus. The worker verticle consumes the event from the event bus, processes it and responds.

Current problem: when pods get the terminate signal, in-flight requests are terminated and time out.

Could you please clarify below ?

1. Is there any way to achieve a graceful shutdown of a Vert.x application, i.e. stop accepting new input requests and new events from the clustered event bus, finish processing the currently running requests, and then terminate?
2. What is the default termination policy? We noticed that Ctrl+C is a forceful termination.
3. Can we achieve #1 using a Runtime shutdown hook?

Please let me know if you have any questions. Thanks.
 
Thanks

Jochen Mader

Apr 19, 2017, 9:03:54 AM
to ve...@googlegroups.com
If Vert.x is shut down gracefully it will correctly deregister everything.
My guess is that the Docker containers you use don't correctly terminate the Vert.x process running inside.
That causes Vert.x to simply die, and the other nodes have to wait for the heartbeat timeout to trigger removal of the dead node.
So I'd suggest checking how shutdown is handled inside the containers.


ayo...@isec.ng

Apr 20, 2017, 8:06:56 AM
to vert.x
Hi Anil,

I'm interested in how you were able to set this up. Please kindly shed some light on it.

Thanks
Best Regards

Anil

Apr 20, 2017, 8:14:56 AM
to vert.x
Hi Jochen,

Kubernetes gives 30 seconds of grace time. During those 30 seconds the pod is not assigned any new REST requests, but the worker verticle of the same pod can still receive messages from the clustered event bus.

I will try adding a shutdown hook that calls vertx.close(). I'm not sure whether vertx.close() gracefully unregisters consumers.

Thanks

Anil

Apr 20, 2017, 8:15:43 AM
to vert.x
Hi,

Could you please elaborate? I did not get your question, sorry.

Thanks

Tim Fox

Apr 20, 2017, 8:41:20 AM
to vert.x
Vert.x already registers a shutdown hook which will call vertx.close(); adding another one won't accomplish anything.

vertx.close() will unregister any event bus handlers before completing.

Anil

Apr 20, 2017, 8:49:55 AM
to vert.x
Thanks Tim.

Is there any way, on receiving the terminate signal, to stop consuming messages from the clustered event bus, finish processing the current messages, and then terminate?

Thanks.

Jonas Bergström

Apr 24, 2017, 2:57:02 AM
to vert.x
Unfortunately the Vert.x undeploy doesn't wait for outstanding requests to complete between unregistering a handler and taking down the verticle. This means there is a race condition where event producers may send events to a verticle that gets undeployed before the event arrives, in which case the event times out, as you are experiencing.
We handle it by wrapping all bus messages with a retry mechanism that sends the message again if we get a timeout exception. This works for idempotent events and should be there anyway to handle server crashes. It solves most cases, except when the retried event is also unfortunate enough to be sent to the undeployed handler (very rare for us, but it has happened during Hazelcast issues)...
However, Vert.x should make sure that outstanding events already dispatched to a bus handler complete before the handler is closed. If you are using Hazelcast, the subs map is replicated and not always up to date (synced every 100 ms by default, I think), which means there is always a window of opportunity to send a message to a non-existing handler.
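
For illustration, a minimal sketch of such a retry-on-timeout wrapper against the Vert.x 3.x event bus API (the address, send timeout and attempt count are placeholders, not our actual production code):

import io.vertx.core.AsyncResult;
import io.vertx.core.Handler;
import io.vertx.core.Vertx;
import io.vertx.core.eventbus.DeliveryOptions;
import io.vertx.core.eventbus.Message;
import io.vertx.core.eventbus.ReplyException;
import io.vertx.core.eventbus.ReplyFailure;

public class RetryingSender {

  private final Vertx vertx;

  public RetryingSender(Vertx vertx) {
    this.vertx = vertx;
  }

  // Sends and re-sends on reply timeout; only safe if the consumer is idempotent.
  public <T> void send(String address, Object body, int attemptsLeft,
                       Handler<AsyncResult<Message<T>>> done) {
    DeliveryOptions opts = new DeliveryOptions().setSendTimeout(2000); // placeholder timeout
    vertx.eventBus().<T>send(address, body, opts, ar -> {
      if (ar.failed() && attemptsLeft > 1 && isTimeout(ar.cause())) {
        send(address, body, attemptsLeft - 1, done);   // retry: the handler may have moved to another node
      } else {
        done.handle(ar);                               // success, non-timeout failure, or out of attempts
      }
    });
  }

  private static boolean isTimeout(Throwable t) {
    return t instanceof ReplyException
        && ((ReplyException) t).failureType() == ReplyFailure.TIMEOUT;
  }
}

The standard verticle would then go through this instead of calling eventBus().send() directly, e.g. new RetryingSender(vertx).send("work.address", payload, 3, ar -> ...), with "work.address" again just a placeholder.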

We _always_ get this error during rolling upgrades, and the impact is that some of the API requests get a high response time.

/ Jonas

Jochen Mader

Apr 24, 2017, 3:33:01 AM
to ve...@googlegroups.com
I am not 100% sure whether this would work, but if it is really required to make sure that no more events arrive, it might be an option to override Verticle.stop and deregister the handlers manually.
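
Roughly something like this, as an untested sketch on Vert.x 3.x (the address is a placeholder):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;
import io.vertx.core.eventbus.MessageConsumer;
import io.vertx.core.json.JsonObject;

public class WorkerVerticle extends AbstractVerticle {

  private MessageConsumer<JsonObject> consumer;

  @Override
  public void start() {
    consumer = vertx.eventBus().consumer("work.address", msg -> {
      // ... do the work, then msg.reply(...) ...
    });
  }

  @Override
  public void stop(Future<Void> stopFuture) {
    // Deregister the handler first, so the clustered subs map stops routing
    // new events to this node, then let the undeploy complete.
    consumer.unregister(ar -> stopFuture.complete());
  }
}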



Tim Fox

Apr 24, 2017, 4:17:11 AM
to vert.x
Unless you globally synchronize everything, as long as you allow different parts of the system to be shut down at different times it's always going to be possible to have events which are never received and processed, or where processing started and never finished. That's not a Vert.x limitation - it would be true in any distributed system. Even if you could code some kind of "global clean shutdown", it's not going to protect you from the case of one of your machines or processes suddenly dying, also leaving your messages in a partially processed or unprocessed state.

Instead of trying to "fix the world" (which you can never do - see King Cnut for a deeper discussion of this issue), if you care that your messages are processed "once and only once" you should code your handlers to be idempotent and implement retry at the sender side if the sender doesn't receive confirmation that the message was processed within a timeout. Again, this isn't a Vert.x-specific thing, just a sensible way to code a distributed system based on message processing :)
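
As a rough illustration of the consumer side, a de-duplicating handler keyed on an id the sender puts on every message (address, field name and the in-memory set are just placeholders; across instances you'd need a shared store):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.json.JsonObject;

import java.util.HashSet;
import java.util.Set;

public class IdempotentWorkerVerticle extends AbstractVerticle {

  // In-memory only for illustration; a clustered or persistent store is needed
  // if duplicates can arrive on a different instance.
  private final Set<String> processedIds = new HashSet<>();

  @Override
  public void start() {
    vertx.eventBus().<JsonObject>consumer("work.address", msg -> {
      String id = msg.body().getString("id");   // the sender attaches a unique id
      if (!processedIds.add(id)) {
        msg.reply(new JsonObject().put("status", "already-processed")); // duplicate from a retry
        return;
      }
      // ... do the actual work exactly once ...
      msg.reply(new JsonObject().put("status", "ok"));
    });
  }
}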

Julien Viet

Apr 24, 2017, 4:26:04 AM
to ve...@googlegroups.com
Well said :-)

I find Jonas's feedback from a practical situation with Hazelcast very interesting, as it explains how this can happen in practice with Vert.x and Hazelcast.


Jonas Bergström

Apr 24, 2017, 5:15:43 AM
to vert.x
Right. I do think it would be nice if Vert.x supported regular rolling upgrades without losing any messages though, since that's such a common use case - we do it several times per day. It doesn't imply that you have to "fix the world" or do "global synchronization"; a pragmatic "unregister, wait 100 ms, undeploy" would be enough, and pretty much all upgrades would pass without the need to resend events.

/ Jonas

Tim Fox

Apr 24, 2017, 5:43:07 AM
to vert.x


On Monday, 24 April 2017 10:15:43 UTC+1, Jonas Bergström wrote:
Right. I do think it would be nice if Vert.x supported regular rolling upgrades without losing any messages though,


But it won't mean you won't lose messages. Processes can fail, get stuck or crash, and machines can crash. This can result in lost messages whether or not the system supports what you want (not that it can really support that anyway). So, if you really care about not losing messages, then you'd have to code that retry / idempotency logic _anyway_.


 
since that's such a common use case. We do it several times per day. It doesn't imply that you have to "fix the world" or do "global synchronization"; a pragmatic "unregister, wait 100 ms, undeploy" would be enough

What if there are messages in the local queue that are going to take longer than 100 ms to process? What if there is a GC pause that takes longer than 100 ms? What if the node crashes? All these cases will result in "lost" messages. A solution like this is never going to fulfil your requirement of "regular rolling upgrades without losing any messages".

(100 ms is very specific to your use case. You could, of course, just implement your own delay of 100 ms on shutdown - this would be just a few lines of code in Vert.x.)

 
and pretty much all upgrades would pass without the need to resend events.


What about the ones that don't?

Jonas Bergström

Apr 24, 2017, 6:51:00 AM
to vert.x
That's what I said :). We do use automatic retries for idempotent messages to cover the cases where messages are lost for whatever reason.

I mainly have one issue with redeploys causing message failures: we log a WARN on event retries because I consider that to be an exceptional occurrence. The warn log triggers monitoring alerts, and I don't want to configure the alerts to suppress warnings of a certain type during redeploys.

I am curious though how I would implement my own delay as you mention, because I've tried it and it didn't work. In the undeploy hook in a verticle I added a timer of a few seconds and completed the undeploy when the timer triggered, but I lost messages anyway, which made me believe that the subs map isn't updated until the undeploy completes. But maybe I did it wrong - is that how it should be done?
I wouldn't use a plain timeout though; instead I would keep a counter of ongoing requests and complete the undeploy when that counter is down to 0. That also seems possible to do generically in Vert.x, but I'd be happy if I can do it myself using the Vert.x APIs.
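
Roughly what I have in mind, as an untested sketch on Vert.x 3.x (the address and poll interval are placeholders):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;
import io.vertx.core.eventbus.MessageConsumer;
import io.vertx.core.json.JsonObject;

public class DrainingWorkerVerticle extends AbstractVerticle {

  private MessageConsumer<JsonObject> consumer;
  private int inFlight;   // only touched on this verticle's context, so no synchronization

  @Override
  public void start() {
    consumer = vertx.eventBus().consumer("work.address", msg -> {
      inFlight++;
      doWork(msg.body(), result -> {     // doWork is a stand-in for the real async processing
        msg.reply(result);
        inFlight--;
      });
    });
  }

  @Override
  public void stop(Future<Void> stopFuture) {
    // Stop accepting new events first, then complete the undeploy
    // only once all in-flight messages have been replied to.
    consumer.unregister(ar -> waitForDrain(stopFuture));
  }

  private void waitForDrain(Future<Void> stopFuture) {
    if (inFlight == 0) {
      stopFuture.complete();
    } else {
      vertx.setTimer(100, t -> waitForDrain(stopFuture));
    }
  }

  private void doWork(JsonObject body, io.vertx.core.Handler<JsonObject> done) {
    done.handle(new JsonObject().put("ok", true));   // placeholder for the real work
  }
}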

Generally speaking about the subs map, it would be nice if handlers that time out were deprioritized (similar to what Netflix Ribbon can do in its load-balancer strategy), because sometimes we have seen that even the retried event was sent to the handler that had been undeployed (during temporary Hazelcast issues when subs map changes aren't propagated properly).


/ Jonas

ayo...@isec.ng

Apr 24, 2017, 7:03:12 AM
to vert.x
Hi Anil, I guess my question was way too abstract, sorry about that.

So, basically, the question is: how were you able to set up Vert.x on a Kubernetes cluster? I'm really interested in how you did it, the services you used, etc.
Could you write a post on Medium about how you set up Vert.x on Kubernetes, for other developers who are interested in having the same infrastructure setup?

Thanks.

I hope my question is clearer now. :-)




Tim Fox

Apr 24, 2017, 7:09:26 AM
to vert.x


On Monday, 24 April 2017 11:51:00 UTC+1, Jonas Bergström wrote:
That's what I said :). We do use automatic retries for idempotent messages to cover the cases where messages are lost for whatever reason.
 

I mainly have one issue with redeploys causing message failures: we log a WARN on event retries because I consider that to be an exceptional occurrence.

I wouldn't consider it an exceptional occurrence. In any distributed system it's a normal occurrence if you want to retain the ability to cycle distributed components whenever you want. In general you don't have control over who sends messages and when, so there is always a chance of unprocessed messages in flight any time you cycle one of your components. That's just a normal scenario. Failure and lack of availability are not exceptional conditions.

I think that if you want to write a robust distributed system you should code for idempotency and retry as normal, and let messages be resent quickly when the components come back up.

Diego Feitosa

Apr 27, 2017, 3:38:38 PM
to vert.x
Hi Anil,

It seems we both have a similar setup and requirements. Two days ago I started a thread about it and was able to land on a potential solution, though I'm not sure whether it has any implications.

On the note about using a shutdown hook: if you are running your application with io.vertx.core.Launcher, it already registers a shutdown hook that calls vertx.close() and terminates everything immediately. There's no way to detect that something is being processed and wait for it to finish.

The solution I'm evaluating is to stop receiving any traffic into that server and allow a grace period for all the already accepted requests to finish. As we cannot rely on the SIGTERM/SIGKILL sent by Kubernetes when terminating a pod, the main idea is to use a preStop command (as described here) to invoke an internal endpoint that will terminate Vert.x. I'm still working on it and need to wire everything together to validate that the solution is sound; I will keep updating the shutdown code here.
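
For what it's worth, a bare-bones sketch of what that internal endpoint could look like (the port, path and the small delay before closing are assumptions I still need to validate):

import io.vertx.core.AbstractVerticle;

public class ShutdownEndpointVerticle extends AbstractVerticle {

  @Override
  public void start() {
    // Internal-only HTTP server; the Kubernetes preStop hook would call
    // something like: curl -s http://localhost:8081/shutdown
    vertx.createHttpServer().requestHandler(req -> {
      if ("/shutdown".equals(req.path())) {
        req.response().setStatusCode(202).end("shutting down");
        // small delay so the response is flushed, then close Vert.x,
        // which deregisters the clustered event bus handlers
        vertx.setTimer(500, t -> vertx.close());
      } else {
        req.response().setStatusCode(404).end();
      }
    }).listen(8081);
  }
}

The idea being that the preStop hook calls this endpoint, and the termination grace period gives in-flight requests time to finish before the SIGTERM arrives.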

If you have any feedback or see any flaws in it, let me know!

Cheers,
-Diego

Clement Escoffier

Apr 27, 2017, 3:43:14 PM
to ve...@googlegroups.com
Alternatively, we could add a way in the Launcher to notify you when the shutdown is initiated (before vertx.close()).

Clement 


Anil

May 3, 2017, 3:41:27 AM
to vert.x
Thanks everyone for the good learning conversation.

Hi Diego,

We did the same. We added a preStop grace period, which stops new requests from being sent to the pod that is set to shut down, and added retries for event bus timeouts for all requests of the current instance. Still, it is not 100% accurate; we have seen < 1% failures during scale-down under heavy load.

Thanks

Thomas SEGISMONT

May 4, 2017, 3:49:37 AM
to ve...@googlegroups.com
Not a bad idea Clément. VertxLifecycleHooks has methods for startup and verticle deployment, but nothing for undeploy and shutdown.

Do you know if there's a ticket for this already?


Clement Escoffier

May 8, 2017, 3:57:52 AM
to ve...@googlegroups.com
On 4 May 2017, at 09:49, Thomas SEGISMONT <tsegi...@gmail.com> wrote:

Not a bad idea Clément. VertxLifecycleHooks has methods for startup and verticle deployment, but nothing for undeploy and shutdown.

Do you know if there's a ticket for this already?
