How to do rolling upgrades


Jonas Bergström

Oct 20, 2016, 3:26:33 AM
to vert.x
Hi all,
We're running about 20 Vert.x instances with different business logic in them, typically two or three instances per service type. Each service has 2-5 verticles deployed.
When we do rolling upgrades (take one service instance down, bring the new version up, and repeat until all instances are replaced) we often get error logs from Hazelcast and subs-map inconsistencies, which eventually sort themselves out, but in the meantime we get a number of failing requests.
So the question is: what is the proper procedure for rolling upgrades? Should we be doing it in another way?
What we would like is for Vert.x to support some sort of two-step shutdown procedure, so we can take a running instance out of service before shutting it down. I.e. unregister verticles so no more messages are delivered, wait until all ongoing messages have been handled, and disconnect from Hazelcast. Then we can take down the instance.
Is there a way to implement that today?
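
Roughly what I have in mind, as a sketch (the DrainableVerticle class, the drainAndClose() method and the 5-second grace period are made up to illustrate the idea, not something we have today):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.eventbus.MessageConsumer;

public class DrainableVerticle extends AbstractVerticle {

    private MessageConsumer<Object> consumer;

    @Override
    public void start() {
        consumer = vertx.eventBus().consumer("ping", msg -> msg.reply("pong"));
    }

    // Step 1: stop receiving new messages. Step 2: give in-flight handlers a
    // grace period to finish. Step 3: leave the cluster by closing Vert.x.
    public void drainAndClose() {
        consumer.unregister(unreg ->
            vertx.setTimer(5000, id ->
                vertx.close(done -> System.exit(0))));
    }
}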

BR / Jonas

Jochen Mader

Oct 20, 2016, 8:07:22 AM
to ve...@googlegroups.com
Vert.x does exactly that when shut down correctly.
Make sure that instances are taken down correctly so Vert.x can unsubscribe from the cluster.
Vert.x doesn't have a chance to deregister from Hazelcast if the JVM is simply killed.
There has been a discussion on this topic quite recently; search the list.
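
If you embed Vert.x yourself instead of starting it via the launcher, a JVM shutdown hook along these lines (just a sketch, the class name is an example) makes sure close() gets to run when the process receives a normal termination signal:

import java.util.concurrent.CountDownLatch;

import io.vertx.core.Vertx;

public class CleanShutdown {

    // Block the shutdown hook until Vert.x has closed, so event-bus handlers
    // are deregistered from the cluster before the JVM exits.
    public static void install(Vertx vertx) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            CountDownLatch latch = new CountDownLatch(1);
            vertx.close(ar -> latch.countDown());
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}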




--
Jochen Mader | Lead IT Consultant

codecentric AG | Elsenheimerstr. 55a | 80687 München | Deutschland
tel: +49 89 215486633 | fax: +49 89 215486699 | mobil: +49 152 51862390
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de | www.more4fi.de

Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

Jonas Bergström

Oct 20, 2016, 8:34:00 AM
to vert.x
All I have been able to find is that we should trigger a clean stop when taking down nodes. We send SIGTERM, which, as I understand it, also triggers the shutdown hooks.
But we still get issues. I will investigate more; I just wanted to know whether this is the right way to do it or not.

/Jonas



Asher Tarnopolski

Oct 20, 2016, 11:19:27 AM
to vert.x
You said you unregister your handlers, which stops them from receiving new messages, then wait a while to let them complete the computations they already started, and then complete the undeployment of your Vert.x instance. Don't these steps give you the desired solution?

Jonas Bergström

Oct 20, 2016, 6:08:19 PM
to vert.x
No, that's not what I said. I said that I want it to work like that, but it doesn't.
Is it possible to unregister handlers without stopping the app? Then I can do that in a "prepare-for-shutdown" endpoint.
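
Something like this is what I'm hoping is possible (sketch only; assuming the MessageConsumer returned by eventBus().consumer() can be unregistered at any time, and the admin port and path are made up):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.eventbus.MessageConsumer;

public class PingService extends AbstractVerticle {

    private MessageConsumer<Object> consumer;

    @Override
    public void start() {
        consumer = vertx.eventBus().consumer("ping", msg -> msg.reply("pong"));

        // Hypothetical admin endpoint: take this instance out of service
        // without undeploying anything yet.
        vertx.createHttpServer()
            .requestHandler(req -> {
                if ("/prepare-for-shutdown".equals(req.path())) {
                    consumer.unregister(ar -> req.response().end("draining"));
                } else {
                    req.response().setStatusCode(404).end();
                }
            })
            .listen(8081);
    }
}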

/ Jonas

Jochen Mader

Oct 21, 2016, 7:17:52 AM
to ve...@googlegroups.com
I think something is going wrong in your application and you should debug why that is.
The behavior of Vert.x is that all handlers registered by a verticle get deregistered, and then it shuts down.
The cluster should be in a clean state afterwards.
And that's also the behavior I see whenever I do that.


--
You received this message because you are subscribed to the Google Groups "vert.x" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vertx+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Asher Tarnopolski

Oct 21, 2016, 8:52:49 AM
to vert.x
Maybe Jonas is saying that once he stops the app, it is undeployed before the handlers have completed the tasks they were in the middle of, and therefore before they have replied to the messages they received.
I guess in that case one can override the main verticle's stop() so that the stopFuture is completed only after a pause.
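
Something like this, as a rough sketch (the 2-second pause is arbitrary):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;

public class MainVerticle extends AbstractVerticle {

    @Override
    public void stop(Future<Void> stopFuture) {
        // Give handlers that are still in the middle of processing messages
        // some time to reply before the verticle is actually undeployed.
        vertx.setTimer(2000, id -> stopFuture.complete());
    }
}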

Jonas Bergström

Oct 22, 2016, 8:44:53 AM
to vert.x
I did the simplest possible test now. I have a web proxy that receives HTTP requests and sends ping messages to a service, and responds when the message comes back from the service. The two are deployed in separate Vert.x instances.

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;
import io.vertx.core.eventbus.DeliveryOptions;

public class Proxy extends AbstractVerticle {

    @Override
    public void start(Future<Void> fut) {
        vertx
            .createHttpServer()
            .requestHandler(request ->
                // Forward each HTTP request as a "ping" message over the event bus
                // and answer based on whether a reply arrives within 1 second.
                vertx.eventBus().send("ping", null, new DeliveryOptions().setSendTimeout(1000),
                    response -> {
                        if (response.succeeded())
                            request.response().end("<h1>Got pong</h1>");
                        else
                            request.response().setStatusCode(500).end("<h1>Got error</h1>");
                    }))
            .listen(8080, result -> {
                if (result.succeeded()) {
                    fut.complete();
                } else {
                    fut.fail(result.cause());
                }
            });
    }
}

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;

public class Service extends AbstractVerticle {

    @Override
    public void start(Future<Void> fut) {
        // Reply "pong" to every "ping" message received over the clustered event bus.
        vertx.eventBus().consumer("ping", message -> message.reply("pong"));
        fut.complete();
    }
}

When I start a single proxy and two service instances, run "ab -n 5000 -c 5 http://localhost:8080/", and take down one of the service instances (by Ctrl-C in the terminal) while ab is running, I often get failed requests.
Default config is used.

What am I missing? Why is this test not working?

BR  / Jonas

Jochen Mader

Oct 22, 2016, 10:29:01 AM
to ve...@googlegroups.com
Ctrl-C sends SIGINT to the JVM, which triggers a non-graceful shutdown.
A graceful shutdown should be:
- Run your application.
- Switch to a separate shell.
- Use 'ps' to find the Vert.x process and run 'kill -15 <pid>'.

This sends a SIGTERM to Vert.x and should allow a graceful shutdown.



Jonas Bergström

Oct 22, 2016, 1:52:13 PM
to vert.x
Well, that doesn't seem right. The shutdown hooks are run on Ctrl-C as well, unless Vert.x does special signal handling.
Anyway, I tried what you suggested and I still get errors on most test runs. So unless I have misunderstood something, or there is some config I need to tweak, rolling upgrades seem broken. And in this test I don't even have to wait for outstanding requests to complete (unless verticles don't allow an ongoing handler to respond to a message after undeploy has been triggered).

/ Jonas 

Jonas Bergström

Oct 23, 2016, 2:42:37 PM
to vert.x
In another post, Tim Fox says: "... servers are being shutdown. In that case you should expect any in transit messages to go missing and code your app to deal with timeouts and retry accordingly. ...". If that is true, then I'd say rolling upgrades aren't supported. Sure, we need automatic retries etc. to handle unexpected node failures, but not all requests are idempotent, and of course we'd like to keep the failure rate for those to a minimum. We are expecting a few hundred requests per second as the normal case, so if we run e.g. 3 service instances, there will be quite a few failing requests on each upgrade.
I don't understand why Vert.x doesn't make sure that the consumer verticle's registration is successfully removed from the subs map before it is undeployed. Replicated maps in Hazelcast are synced after 100 ms by default; I guess I can set that to 0, though, which should help quite a lot.
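
If I go down that route, I'd pass a programmatic Hazelcast config to the cluster manager, roughly like this (a sketch; I haven't verified which Hazelcast property actually controls the sync delay, so that part is only a comment, and the class name is made up):

import com.hazelcast.config.Config;

import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class TunedClusterMain {

    public static void main(String[] args) {
        Config hazelcastConfig = new Config();
        // TODO: tune the replication/sync settings for the subs map here;
        // the exact property depends on the Hazelcast version in use.

        HazelcastClusterManager mgr = new HazelcastClusterManager(hazelcastConfig);
        Vertx.clusteredVertx(new VertxOptions().setClusterManager(mgr), res -> {
            if (res.succeeded()) {
                res.result().deployVerticle(new Service());
            } else {
                res.cause().printStackTrace();
            }
        });
    }
}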

/ Jonas 