OpenShift - Hazelcast thread blocks


Simon LEDUNOIS

Jun 2, 2021, 5:19:30 AM
to vert.x
Hi,

I'm currently facing two issues using Hazelcast on OpenShift/Kubernetes.
I have 4 clustered services deployed on Kubernetes. Hazelcast uses a DNS discovery strategy to connect the services.

My first issue: downscaling a service pod triggers a thread block on the cluster members. Upscaling works fine: the new member joins the cluster and the gateway discovers it correctly (upscaling logs). However, downscaling my pods triggers a thread block on my cluster members and freezes the application (downscaling logs).
If needed, here are my cluster configuration and my headless service.
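For readers without my attachments: a minimal sketch of the kind of DNS-based Kubernetes join I'm describing, written with the programmatic Hazelcast config instead of cluster.xml (the headless service DNS name below is a placeholder, not my real configuration, and it assumes the Hazelcast Kubernetes discovery plugin is on the classpath):

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class ClusteredMain {
  public static void main(String[] args) {
    Config hazelcastConfig = new Config();
    JoinConfig join = hazelcastConfig.getNetworkConfig().getJoin();
    // Multicast does not work on Kubernetes/OpenShift, so disable it
    join.getMulticastConfig().setEnabled(false);
    // Kubernetes discovery in DNS lookup mode, pointing at the headless service
    join.getKubernetesConfig()
        .setEnabled(true)
        .setProperty("service-dns", "my-headless-service.my-namespace.svc.cluster.local"); // placeholder

    HazelcastClusterManager clusterManager = new HazelcastClusterManager(hazelcastConfig);
    VertxOptions options = new VertxOptions().setClusterManager(clusterManager);
    Vertx.clusteredVertx(options, ar -> {
      if (ar.succeeded()) {
        Vertx vertx = ar.result();
        // deploy verticles with vertx.deployVerticle(...) here
      } else {
        ar.cause().printStackTrace();
      }
    });
  }
}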

Second issue: I created two routes that display discovery records. The first one displays the records I publish programmatically, and the second one displays the Kubernetes records (records that contain kubernetes.uuid metadata; a rough sketch of this route follows the list below). Adding and removing services publishes and withdraws records correctly on the first route, but the second one keeps accumulating services.
Here is what happens:
  1. K8S route renders 4 services;
  2. I upscale service 1 from 1 to 3 pods;
  3. K8S route renders 6 services;
  4. I downscale service 1 from 3 to 1 pod;
  5. K8S route renders 6 services;
  6. I rebuild service 1; the build autodeploys service 1;
  7. K8S route renders 7 services.
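For context, the second route is basically a metadata filter on the Vert.x service discovery. A simplified sketch of what it looks like (class name and route path are placeholders, not my exact code):

import java.util.stream.Collectors;

import io.vertx.core.Vertx;
import io.vertx.core.json.JsonArray;
import io.vertx.ext.web.Router;
import io.vertx.servicediscovery.Record;
import io.vertx.servicediscovery.ServiceDiscovery;

public class DiscoveryRoutes {
  // Mounts a route that lists every record carrying a kubernetes.uuid metadata entry
  public static void mount(Vertx vertx, Router router) {
    ServiceDiscovery discovery = ServiceDiscovery.create(vertx);
    router.get("/discovery/kubernetes").handler(ctx ->
      discovery.getRecords(
        record -> record.getMetadata().getString("kubernetes.uuid") != null,
        ar -> {
          if (ar.succeeded()) {
            JsonArray body = new JsonArray(ar.result().stream()
              .map(Record::toJson)
              .collect(Collectors.toList()));
            ctx.response()
              .putHeader("Content-Type", "application/json")
              .end(body.encodePrettily());
          } else {
            ctx.fail(ar.cause());
          }
        }));
  }
}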
Maybe I have a mistake in my cluster configuration or my Kubernetes configuration?

Thanks for your help,

Simon LEDUNOIS

Julien Viet

Jun 2, 2021, 7:35:24 AM
to vert.x
Hi,

Do you have stack trace dumps to share with us?

Thomas SEGISMONT

Jun 2, 2021, 8:09:09 AM
to vert.x
Hi Simon,

On Wed, Jun 2, 2021 at 11:19 AM, Simon LEDUNOIS <simon.l...@gmail.com> wrote:
Hi,

I'm currently facing two issues using Hazelcast on OpenShift/Kubernetes.
I have 4 clustered services deployed on Kubernetes. Hazelcast uses a DNS discovery strategy to connect the services.

My first issue: downscaling a service pod triggers a thread block on the cluster members. Upscaling works fine: the new member joins the cluster and the gateway discovers it correctly (upscaling logs). However, downscaling my pods triggers a thread block on my cluster members and freezes the application (downscaling logs).

Can you file an issue on GitHub please? It seems like we need to update the HealthCheck procedure: if HZ can't reply in a timely manner, the pod can be considered not healthy.
 
If needed, here are my cluster configuration and my headless service.

Second issue: I created two routes that display discovery records. The first one displays the records I publish programmatically, and the second one displays the Kubernetes records (records that contain kubernetes.uuid metadata). Adding and removing services publishes and withdraws records correctly on the first route, but the second one keeps accumulating services.
Here is what happens:
  1. K8S route renders 4 services;
  2. I upscale service 1 from 1 to 3 pods;
  3. K8S route renders 6 services;
  4. I downscale service 1 from 3 to 1 pod;
  5. K8S route renders 6 services;
  6. I rebuild service 1; the build autodeploys service 1;
  7. K8S route renders 7 services.
Maybe I have a mistake in my cluster configuration or my Kubernetes configuration?


Can you please share a small reproducer for that? Thank you
 

Thanks for your help,

Simon LEDUNOIS

--

Simon LEDUNOIS

Jun 2, 2021, 8:19:41 AM
to vert.x
Hi Julien,

Thanks for your response. I have attached the log file to this reply.

Best regards,

payment-2-txp6b.log

Simon LEDUNOIS

Jun 2, 2021, 8:29:17 AM
to vert.x
Hi,

If you need a reproducer, here is my PoC project: https://github.com/SLedunois/vertx-modular-platform/tree/openshift. Don't forget to check out the openshift branch.

Best regards,

Julien Viet

Jun 2, 2021, 11:36:46 AM
to vert.x
This actually seems related to Vert.x health checks and not to Vert.x clustering, as the thread dump shows:

io.vertx.core.VertxException: Thread blocked
at java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@11.0.11/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:357)
at app//com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:634)
at app//com.hazelcast.internal.util.FutureUtil.executeWithDeadline(FutureUtil.java:389)
at app//com.hazelcast.internal.util.FutureUtil.returnWithDeadline(FutureUtil.java:270)
at app//com.hazelcast.internal.util.FutureUtil.returnWithDeadline(FutureUtil.java:240)
at app//com.hazelcast.internal.partition.PartitionServiceProxy.isClusterSafe(PartitionServiceProxy.java:132)
at app//io.vertx.spi.cluster.hazelcast.ClusterHealthCheck.lambda$createProcedure$0(ClusterHealthCheck.java:41)
at app//io.vertx.spi.cluster.hazelcast.ClusterHealthCheck$$Lambda$1242/0x0000000840538840.handle(Unknown Source)
at app//io.vertx.ext.healthchecks.impl.DefaultProcedure.check(DefaultProcedure.java:46)
at app//io.vertx.ext.healthchecks.impl.DefaultCompositeProcedure.check(DefaultCompositeProcedure.java:76)
at app//io.vertx.ext.healthchecks.impl.HealthChecksImpl.compute(HealthChecksImpl.java:177)
at app//io.vertx.ext.healthchecks.impl.HealthChecksImpl.checkStatus(HealthChecksImpl.java:116)
at app//io.vertx.ext.healthchecks.impl.HealthChecksImpl.checkStatus(HealthChecksImpl.java:130)
at app//io.vertx.ext.healthchecks.impl.HealthCheckHandlerImpl.handle(HealthCheckHandlerImpl.java:107)
at app//io.vertx.ext.healthchecks.impl.HealthCheckHandlerImpl.handle(HealthCheckHandlerImpl.java:28)


So I think we should open an issue in vertx-health-check.

Julien Viet

Jun 2, 2021, 11:39:59 AM
to vert.x
Well, actually it is in vertx-hazelcast, and it comes down to the ClusterHealthCheck.

My apologies for the confusion.

Julien

Thomas SEGISMONT

Jun 2, 2021, 12:01:18 PM
to vert.x
Can you please try to replace the cluster health check procedure with this:
static Handler<Promise<Status>> createProcedure(Vertx vertx) {
  Objects.requireNonNull(vertx);
  return healthCheckPromise -> {
    // Run isClusterSafe() on a worker thread so the event loop is never blocked
    vertx.executeBlocking(promise -> {
      VertxInternal vertxInternal = (VertxInternal) Vertx.currentContext().owner();
      HazelcastClusterManager clusterManager = (HazelcastClusterManager) vertxInternal.getClusterManager();
      PartitionService partitionService = clusterManager.getHazelcastInstance().getPartitionService();
      promise.complete(new Status().setOk(partitionService.isClusterSafe()));
    }, false, healthCheckPromise);
  };
}
You can set the maximum time Hazelcast waits for the isClusterSafe result by assigning the hazelcast.graceful.shutdown.max.wait property (in the cluster XML or as a system property) a value lower than the default (600 seconds).
Also, you can register the check with a timeout (in ms) so that the check fails if it does not complete in time.
hc.register("cluster-health", 2000, procedure);
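Putting it together, the wiring could look roughly like this (just a sketch, assuming the createProcedure above and a vertx-web Router; the /health path is arbitrary):

import io.vertx.core.Vertx;
import io.vertx.ext.healthchecks.HealthCheckHandler;
import io.vertx.ext.healthchecks.HealthChecks;
import io.vertx.ext.web.Router;

static Router healthRouter(Vertx vertx) {
  HealthChecks hc = HealthChecks.create(vertx);
  // Register the blocking procedure above with a 2000 ms timeout
  hc.register("cluster-health", 2000, createProcedure(vertx));
  Router router = Router.router(vertx);
  // Expose the checks over HTTP, e.g. for a Kubernetes liveness/readiness probe
  router.get("/health").handler(HealthCheckHandler.createWithHealthChecks(hc));
  return router;
}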

Simon LEDUNOIS

Jun 2, 2021, 12:06:32 PM
to vert.x
Hi Julien,

Thanks for your time. Indeed, after removing the cluster health check there are no more thread blocks. However, the cluster still freezes when a member leaves, exactly when I get these logs:

INFO: [172.17.0.10]:5701 [dev] [4.0.2] Connection[id=6, /172.17.0.10:5701->/172.17.0.8:55134, qualifier=null, endpoint=[172.17.0.8]:5701, alive=false, connectionType=MEMBER] closed. Reason: Connection closed by the other side
Jun 02, 2021 4:02:56 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Connecting to /172.17.0.8:5701, timeout: 10000, bind-any: true
Jun 02, 2021 4:02:56 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Could not connect to: /172.17.0.8:5701. Reason: SocketException[Connection refused to address /172.17.0.8:5701]
Jun 02, 2021 4:02:56 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Connecting to /172.17.0.8:5701, timeout: 10000, bind-any: true
Jun 02, 2021 4:02:56 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Could not connect to: /172.17.0.8:5701. Reason: SocketException[Connection refused to address /172.17.0.8:5701]
Jun 02, 2021 4:02:56 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Connecting to /172.17.0.8:5701, timeout: 10000, bind-any: true
Jun 02, 2021 4:03:06 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Could not connect to: /172.17.0.8:5701. Reason: SocketTimeoutException[null]
Jun 02, 2021 4:03:06 PM com.hazelcast.internal.nio.tcp.TcpIpConnector
INFO: [172.17.0.10]:5701 [dev] [4.0.2] Connecting to /172.17.0.8:5701, timeout: 10000, bind-any: true
Jun 02, 2021 4:03:16 PM com.hazelcast.internal.cluster.ClusterService
INFO: [172.17.0.10]:5701 [dev] [4.0.2]
Members {size:4, ver:23} [
Member [172.17.0.14]:5701 - 1ff812a7-09cb-4c46-a999-15906bbafbdb
Member [172.17.0.7]:5701 - d62730b5-1054-404b-a853-09235b76226e
Member [172.17.0.13]:5701 - 24b51d9a-e414-4f75-a212-1b8787c9156e
Member [172.17.0.10]:5701 - b1553dd1-3fb0-4ad6-bda7-18e879a3218b this
]


Simon LEDUNOIS

Jun 3, 2021, 3:05:13 AM
to vert.x
Hi,

Creating the procedure with a Vert.x blocking promise fixes the stack trace issue, but my cluster still freezes.

I will create the vertx-health-check issue as soon as possible.

Best regards,

On Wednesday, June 2, 2021 at 6:01:18 PM UTC+2, tsegi...@gmail.com wrote:

Thomas SEGISMONT

Jun 3, 2021, 3:50:48 AM
to vert.x
No need to create an issue on Vert.x Health Checks. The event loop blocked issue lies in vertx-hazelcast.
I went ahead and created an issue yesterday: https://github.com/vert-x3/vertx-hazelcast/issues/142

Have you tried:

- to set a timeout for the cluster-health procedure execution?
- to set the hazelcast.graceful.shutdown.max.wait property?

Also, when you scale down, how many nodes do you close at once?

Simon LEDUNOIS

Jun 3, 2021, 11:51:31 AM
to vert.x
Hi,

In fact, I saw your issue this morning.

I tried to set a timeout for the cluster health check procedure execution:
healthChecks = HealthChecks.create(vertx).register("cluster-health", 2000, createProcedure(vertx));

I tried to set the max wait in HZ properties:
<property name="hazelcast.graceful.shutdown.max.wait">30</property>

And every service has a Rolling update strategy like the following:
strategy:
  type: Rolling
  rollingParams:
    updatePeriodSeconds: 10
    intervalSeconds: 20
    timeoutSeconds: 600
    maxUnavailable: 1
    maxSurge: 1

Something weird happens during the scale down: my containers stop with exit code 143. Minishift shows me the message "The container store did not stop cleanly when terminated (exit code 143)". Maybe this could produce the freeze issue, but I don't understand why. My containers are built with Source-to-Image.

Best regards,

Simon LEDUNOIS

Simon LEDUNOIS

Jun 3, 2021, 11:52:20 AM
to vert.x
Update: the only recommendation that I have not implemented is the lite members one.

Thomas SEGISMONT

Jun 4, 2021, 3:34:39 AM
to vert.x
On Thu, Jun 3, 2021 at 5:52 PM, Simon LEDUNOIS <simon.l...@gmail.com> wrote:
Update: the only recommendation that I have not implemented is the lite members one.

You can try that as a last resort.
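For reference, turning a Vert.x node into a Hazelcast lite member boils down to one flag on the config (sketch only; it assumes a separate set of full Hazelcast data members runs elsewhere in the cluster to hold the partition data):

import com.hazelcast.config.Config;
import io.vertx.spi.cluster.hazelcast.ConfigUtil;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

// Load the cluster.xml found on the classpath, then mark this node as a lite member:
// it holds no partition data, so stopping it does not trigger partition migrations.
Config config = ConfigUtil.loadConfig();
config.setLiteMember(true);
HazelcastClusterManager clusterManager = new HazelcastClusterManager(config);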

Do you have OutOfMemory errors in the logs? That could explain why the containers do not make any progress.

 

Simon LEDUNOIS

Jun 7, 2021, 10:45:48 AM
to vert.x
I just tried in the OpenShift cloud dev sandbox and there are no more issues. My latest issue (the freeze on scale down) was due to my laptop's limitations.

Everything is OK now.

Thanks for your help.

Best regards,

Simon LEDUNOIS

Thomas SEGISMONT

Jun 7, 2021, 10:51:40 AM
to vert.x
Thanks for letting us know
