I'm running a cluster with 4 machines, each running an instance of my service on a java8-jvm, I'm trying
to do rolling restarts of the jvms and I intermittently see TargetNotMemberException during map operations
on some other host in the cluster, just after a node has been shut down.
Using Hazelcast3.7.1 in embedded mode with default settings, on Linux Centos 7, using only distributed maps and locks.
I do the restart of a node like this:
PartitionService partitionService = hazelcastInstance.getPartitionService();
while (!partitionService.isClusterSafe()) {
log.info("Hazelcast cluster is NOT safe for shutdown, sleeping 10 seconds.");
try {
Thread.sleep(10 * 1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
log.info("Trying to shut down hazelcast");
hazelcastInstance.shutdown();
Thread.sleep(10*1000);
} catch (Throwable t) {
log.error("Exception", t);
}
System.exit(0);
Then i wait 120 seconds before the node is restarted again.
---
In the scenario we see here, we have 4 hosts [tq-app1-perf1, tq-app2-perf1, tq-app3-perf1, tq-app4-perf1],
we have already done some restarts, and right now tq-app1-perf1 is beeing shut down, and an exception occurs
on tq-app2-perf1
Logs on tq-app1-perf1:
17:09:55.688 INFO [se.xxx.yyyyy.cluster.ClusterServiceImpl ][::] registerNodeAsInactive:1 (registerNodeAsInactive:288) [GentleShutdownManager_1]
17:09:55.690 INFO [se.xxx.yyyyy.cluster.ClusterServiceImpl ][::] Cluster topology: [4, 3, 2] / [4, 3, 2] (logTopology:270) [GentleShutdownManager_1]
17:09:55.718 INFO [se.xxx.yyyyy.server.GentleShutdownManager ][::] Trying to shut down hazelcast (shutdownHazelcast:186) [GentleShutdownManager_1]
17:09:55.718 INFO [com.hazelcast.core.LifecycleService ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] [tq-app1-perf1]:5701 is SHUTTING_DOWN (log:64) [GentleShutdownManager_1]
17:09:55.721 INFO [com.hazelcast.internal.partition.impl.MigrationManager ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Shutdown request of [tq-app1-perf1]:5701 is handled (log:64) [GentleShutdownManager_1]
17:09:55.741 INFO [com.hazelcast.internal.partition.impl.MigrationManager ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Re-partitioning cluster data... Migration queue size: 204 (log:64) [hz.tq.migration]
17:09:56.941 INFO [com.hazelcast.instance.Node ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Shutting down connection manager... (log:64) [GentleShutdownManager_1]
17:09:56.944 INFO [com.hazelcast.instance.Node ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Shutting down node engine... (log:64) [GentleShutdownManager_1]
17:09:56.961 INFO [com.hazelcast.instance.NodeExtension ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Destroying node NodeExtension. (log:64) [GentleShutdownManager_1]
17:09:56.962 INFO [com.hazelcast.instance.Node ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Hazelcast Shutdown is completed in 1242 ms. (log:64) [GentleShutdownManager_1]
17:09:56.962 INFO [com.hazelcast.core.LifecycleService ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] [tq-app1-perf1]:5701 is SHUTDOWN (log:64) [GentleShutdownManager_1]
17:10:06.962 INFO [se.xxx.yyyyy.server.GentleShutdownManager ][::] Hazelcast is now shut down (shutdownHazelcast:190) [GentleShutdownManager_1]
---
Logs on tq-app2-perf1:
17:09:57.937 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:57.937 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:57.939 WARN [com.hazelcast.internal.cluster.impl.operations.MasterConfirmationOperation][::] [tq-app2-perf1]:5701 [dev] [3.7.1] [tq-app3-perf1]:5701 has sent MasterConfirmation, but this node is not master! (log:66) [hz.tq.generic-operation.thread-1]
17:09:57.939 WARN [com.hazelcast.internal.cluster.impl.operations.MasterConfirmationOperation][::] [tq-app2-perf1]:5701 [dev] [3.7.1] [tq-app4-perf1]:5701 has sent MasterConfirmation, but this node is not master! (log:66) [hz.tq.generic-operation.thread-0]
17:09:57.972 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:57.972 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:57.973 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:57.974 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:58.039 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:58.039 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:58.072 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:58.073 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:58.074 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:58.074 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:58.139 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]
17:09:58.140 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]
17:09:58.173 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-5]
17:09:58.174 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-5]
17:09:58.175 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-5]
17:09:58.175 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-5]
17:09:58.240 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/
172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-5]
17:09:58.241 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/
172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/
172.16.99.40:5701] (log:64) [hz.tq.cached.thread-5]
17:09:58.241 WARN [com.hazelcast.nio.tcp.TcpIpConnectionMonitor ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Removing connection to endpoint [tq-app1-perf1]:5701 Cause => java.net.SocketException {Connection refused to address tq-app1-perf1/
172.16.99.40:5701}, Error-Count: 5 (log:66) [hz.tq.cached.thread-5]
17:09:58.242 INFO [com.hazelcast.internal.cluster.ClusterService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Old master [tq-app1-perf1]:5701 left the cluster, assigning new master Member [tq-app2-perf1]:5701 - bbfeeba1-d910-4986-bd33-c65a71a44587 this (log:64) [hz.tq.cached.thread-5]
17:09:58.242 INFO [com.hazelcast.internal.cluster.ClusterService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Removing Member [tq-app1-perf1]:5701 - 16d607a9-c772-427b-b456-9f60c5a8a4e8 (log:64) [hz.tq.cached.thread-5]
17:09:58.245 INFO [com.hazelcast.transaction.TransactionManagerService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Committing/rolling-back alive transactions of Member [tq-app1-perf1]:5701 - 16d607a9-c772-427b-b456-9f60c5a8a4e8, UUID: 16d607a9-c772-427b-b456-9f60c5a8a4e8 (log:64) [hz.tq.cached.thread-9]
17:09:58.247 INFO [se.xxx.yyyyy.server.context.YyyyYContextConfigurationRunlevel3 ][::] Hazelcast member removed:tq-app1-perf1/
172.16.99.40:5701 (memberRemoved:352) [hz.tq.event-4]
17:09:58.247 INFO [com.hazelcast.internal.partition.InternalPartitionService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Fetching most recent partition table! my version: 5554 (log:64) [hz.tq.migration]
17:09:58.248 INFO [com.hazelcast.internal.cluster.ClusterService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1]
Members [3] {
Member [tq-app2-perf1]:5701 - bbfeeba1-d910-4986-bd33-c65a71a44587 this
Member [tq-app3-perf1]:5701 - b50b447d-8987-4f4f-b810-0673cfb22708
Member [tq-app4-perf1]:5701 - b9152a7b-3f67-4487-95b9-238c637650b0
}
(log:64) [hz.tq.cached.thread-5]
17:09:58.252 INFO [com.hazelcast.internal.partition.InternalPartitionService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Most recent partition table version: 5554 (log:64) [hz.tq.migration]
17:09:58.288 INFO [com.hazelcast.internal.partition.impl.MigrationManager ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Partition balance is ok, no need to re-partition cluster data... (log:64) [hz.tq.migration]
17:09:58.485 WARN [com.hazelcast.map.impl.query.MapQueryEngineImpl ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not get results (log:66) [yyyyy-akka.actor.default-dispatcher-3]
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [tq-app1-perf1]:5701, partitionId: -1, operation: com.hazelcast.map.impl.query.QueryOperation, service: hz:impl:mapService
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:85) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.map.impl.query.MapQueryEngineImpl.addResultsOfPredicate(MapQueryEngineImpl.java:605) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.map.impl.query.MapQueryEngineImpl.invokeQueryAllPartitions(MapQueryEngineImpl.java:492) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.map.impl.proxy.MapProxyImpl.entrySet(MapProxyImpl.java:633) [hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.map.impl.proxy.MapProxyImpl.entrySet(MapProxyImpl.java:622) [hazelcast-all-3.7.1.jar:3.7.1]
at se.xxx.commons.cluster.HazelcastClusteredMapAdaptor.entrySet(HazelcastClusteredMapAdaptor.java:115) [yyyyy-commons.jar:DEV.1195-SNAPSHOT]
at se.xxx.yyyyy.cluster.ClusterServiceImpl$ClusterSubscriber.handleMemberLost(ClusterServiceImpl.java:80) [yyyyy.jar:DEV.1195-SNAPSHOT]
at se.xxx.yyyyy.cluster.ClusterServiceImpl$ClusterSubscriber.onReceive(ClusterServiceImpl.java:55) [yyyyy.jar:DEV.1195-SNAPSHOT]
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [akka-actor_2.11-2.4.10.jar:na]
at akka.actor.Actor$class.aroundReceive(Actor.scala:484) [akka-actor_2.11-2.4.10.jar:na]
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [akka-actor_2.11-2.4.10.jar:na]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [akka-actor_2.11-2.4.10.jar:na]
at akka.actor.ActorCell.invoke(ActorCell.scala:495) [akka-actor_2.11-2.4.10.jar:na]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [akka-actor_2.11-2.4.10.jar:na]
at akka.dispatch.Mailbox.run(Mailbox.scala:224) [akka-actor_2.11-2.4.10.jar:na]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [akka-actor_2.11-2.4.10.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.11.8.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.11.8.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.8.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.8.jar:na]
Caused by: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [tq-app1-perf1]:5701, partitionId: -1, operation: com.hazelcast.map.impl.query.QueryOperation, service: hz:impl:mapService
at com.hazelcast.spi.impl.operationservice.impl.Invocation.initInvocationTarget(Invocation.java:324) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:256) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.spi.impl.operationservice.impl.Invocation.access$300(Invocation.java:94) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.spi.impl.operationservice.impl.Invocation$InvocationRetryTask.run(Invocation.java:530) ~[hazelcast-all-3.7.1.jar:3.7.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_102]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_102]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_102]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92) ~[hazelcast-all-3.7.1.jar:3.7.1]
at ------ submitted from ------.(Unknown Source) ~[na:na]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:111) ~[hazelcast-all-3.7.1.jar:3.7.1]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74) ~[hazelcast-all-3.7.1.jar:3.7.1]
... 20 common frames omitted
Am I doing something wrong in my configuration or in the shutdown procedure, or is it an intermittent bug in hazelcast?
Regards
/Pär