Intermittent TargetNotMemberException during map operation just after another node has shut down.

89 views

Skip to first unread message

Pär Svensson

unread,

Sep 29, 2016, 5:17:16 AM9/29/16

to Hazelcast

I'm running a cluster with 4 machines, each running an instance of my service on a java8-jvm, I'm trying

to do rolling restarts of the jvms and I intermittently see TargetNotMemberException during map operations

on some other host in the cluster, just after a node has been shut down.

Using Hazelcast3.7.1 in embedded mode with default settings, on Linux Centos 7, using only distributed maps and locks.

I do the restart of a node like this:

PartitionService partitionService = hazelcastInstance.getPartitionService();

while (!partitionService.isClusterSafe()) {

log.info("Hazelcast cluster is NOT safe for shutdown, sleeping 10 seconds.");

try {

Thread.sleep(10 * 1000);

} catch (InterruptedException e) {

e.printStackTrace();

}

log.info("Trying to shut down hazelcast");

hazelcastInstance.shutdown();

Thread.sleep(10*1000);

log.info("Hazelcast is now shut down");

} catch (Throwable t) {

log.error("Exception", t);

}

System.exit(0);

Then i wait 120 seconds before the node is restarted again.

---

In the scenario we see here, we have 4 hosts [tq-app1-perf1, tq-app2-perf1, tq-app3-perf1, tq-app4-perf1],

we have already done some restarts, and right now tq-app1-perf1 is beeing shut down, and an exception occurs

on tq-app2-perf1

Logs on tq-app1-perf1:

17:09:55.688 INFO [se.xxx.yyyyy.cluster.ClusterServiceImpl ][::] registerNodeAsInactive:1 (registerNodeAsInactive:288) [GentleShutdownManager_1]

17:09:55.690 INFO [se.xxx.yyyyy.cluster.ClusterServiceImpl ][::] Cluster topology: [4, 3, 2] / [4, 3, 2] (logTopology:270) [GentleShutdownManager_1]

17:09:55.718 INFO [se.xxx.yyyyy.server.GentleShutdownManager ][::] Trying to shut down hazelcast (shutdownHazelcast:186) [GentleShutdownManager_1]

17:09:55.718 INFO [com.hazelcast.core.LifecycleService ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] [tq-app1-perf1]:5701 is SHUTTING_DOWN (log:64) [GentleShutdownManager_1]

17:09:55.721 INFO [com.hazelcast.internal.partition.impl.MigrationManager ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Shutdown request of [tq-app1-perf1]:5701 is handled (log:64) [GentleShutdownManager_1]

17:09:55.741 INFO [com.hazelcast.internal.partition.impl.MigrationManager ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Re-partitioning cluster data... Migration queue size: 204 (log:64) [hz.tq.migration]

17:09:56.941 INFO [com.hazelcast.instance.Node ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Shutting down connection manager... (log:64) [GentleShutdownManager_1]

17:09:56.944 INFO [com.hazelcast.instance.Node ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Shutting down node engine... (log:64) [GentleShutdownManager_1]

17:09:56.961 INFO [com.hazelcast.instance.NodeExtension ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Destroying node NodeExtension. (log:64) [GentleShutdownManager_1]

17:09:56.962 INFO [com.hazelcast.instance.Node ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] Hazelcast Shutdown is completed in 1242 ms. (log:64) [GentleShutdownManager_1]

17:09:56.962 INFO [com.hazelcast.core.LifecycleService ][::] [tq-app1-perf1]:5701 [dev] [3.7.1] [tq-app1-perf1]:5701 is SHUTDOWN (log:64) [GentleShutdownManager_1]

17:10:06.962 INFO [se.xxx.yyyyy.server.GentleShutdownManager ][::] Hazelcast is now shut down (shutdownHazelcast:190) [GentleShutdownManager_1]

---

Logs on tq-app2-perf1:

17:09:57.937 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:57.937 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:57.939 WARN [com.hazelcast.internal.cluster.impl.operations.MasterConfirmationOperation][::] [tq-app2-perf1]:5701 [dev] [3.7.1] [tq-app3-perf1]:5701 has sent MasterConfirmation, but this node is not master! (log:66) [hz.tq.generic-operation.thread-1]

17:09:57.939 WARN [com.hazelcast.internal.cluster.impl.operations.MasterConfirmationOperation][::] [tq-app2-perf1]:5701 [dev] [3.7.1] [tq-app4-perf1]:5701 has sent MasterConfirmation, but this node is not master! (log:66) [hz.tq.generic-operation.thread-0]

17:09:57.972 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:57.972 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:57.973 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:57.974 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:58.039 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:58.039 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:58.072 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:58.073 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:58.074 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:58.074 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:58.139 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-9]

17:09:58.140 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-9]

17:09:58.173 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-5]

17:09:58.174 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-5]

17:09:58.175 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-5]

17:09:58.175 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-5]

17:09:58.240 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Connecting to tq-app1-perf1/172.16.99.40:5701, timeout: 0, bind-any: true (log:64) [hz.tq.cached.thread-5]

17:09:58.241 INFO [com.hazelcast.nio.tcp.InitConnectionTask ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not connect to: tq-app1-perf1/172.16.99.40:5701. Reason: SocketException[Connection refused to address tq-app1-perf1/172.16.99.40:5701] (log:64) [hz.tq.cached.thread-5]

17:09:58.241 WARN [com.hazelcast.nio.tcp.TcpIpConnectionMonitor ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Removing connection to endpoint [tq-app1-perf1]:5701 Cause => java.net.SocketException {Connection refused to address tq-app1-perf1/172.16.99.40:5701}, Error-Count: 5 (log:66) [hz.tq.cached.thread-5]

17:09:58.242 INFO [com.hazelcast.internal.cluster.ClusterService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Old master [tq-app1-perf1]:5701 left the cluster, assigning new master Member [tq-app2-perf1]:5701 - bbfeeba1-d910-4986-bd33-c65a71a44587 this (log:64) [hz.tq.cached.thread-5]

17:09:58.242 INFO [com.hazelcast.internal.cluster.ClusterService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Removing Member [tq-app1-perf1]:5701 - 16d607a9-c772-427b-b456-9f60c5a8a4e8 (log:64) [hz.tq.cached.thread-5]

17:09:58.245 INFO [com.hazelcast.transaction.TransactionManagerService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Committing/rolling-back alive transactions of Member [tq-app1-perf1]:5701 - 16d607a9-c772-427b-b456-9f60c5a8a4e8, UUID: 16d607a9-c772-427b-b456-9f60c5a8a4e8 (log:64) [hz.tq.cached.thread-9]

17:09:58.247 INFO [se.xxx.yyyyy.server.context.YyyyYContextConfigurationRunlevel3 ][::] Hazelcast member removed:tq-app1-perf1/172.16.99.40:5701 (memberRemoved:352) [hz.tq.event-4]

17:09:58.247 INFO [com.hazelcast.internal.partition.InternalPartitionService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Fetching most recent partition table! my version: 5554 (log:64) [hz.tq.migration]

17:09:58.248 INFO [se.xxx.yyyyy.server.context.YyyyYContextConfigurationRunlevel3 ][::] Current Hazelcast cluster members: [tq-app2-perf1/172.16.99.41:5701 tq-app3-perf1/172.16.99.42:5701 tq-app4-perf1/172.16.99.43:5701 ] (logMembers:341) [hz.tq.event-4]

17:09:58.248 INFO [com.hazelcast.internal.cluster.ClusterService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1]

Members [3] {

Member [tq-app2-perf1]:5701 - bbfeeba1-d910-4986-bd33-c65a71a44587 this

Member [tq-app3-perf1]:5701 - b50b447d-8987-4f4f-b810-0673cfb22708

Member [tq-app4-perf1]:5701 - b9152a7b-3f67-4487-95b9-238c637650b0

}

(log:64) [hz.tq.cached.thread-5]

17:09:58.252 INFO [com.hazelcast.internal.partition.InternalPartitionService ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Most recent partition table version: 5554 (log:64) [hz.tq.migration]

17:09:58.288 INFO [com.hazelcast.internal.partition.impl.MigrationManager ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Partition balance is ok, no need to re-partition cluster data... (log:64) [hz.tq.migration]

17:09:58.485 WARN [com.hazelcast.map.impl.query.MapQueryEngineImpl ][::] [tq-app2-perf1]:5701 [dev] [3.7.1] Could not get results (log:66) [yyyyy-akka.actor.default-dispatcher-3]

java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [tq-app1-perf1]:5701, partitionId: -1, operation: com.hazelcast.map.impl.query.QueryOperation, service: hz:impl:mapService

at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:85) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.map.impl.query.MapQueryEngineImpl.addResultsOfPredicate(MapQueryEngineImpl.java:605) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.map.impl.query.MapQueryEngineImpl.invokeQueryAllPartitions(MapQueryEngineImpl.java:492) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.map.impl.proxy.MapProxyImpl.entrySet(MapProxyImpl.java:633) [hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.map.impl.proxy.MapProxyImpl.entrySet(MapProxyImpl.java:622) [hazelcast-all-3.7.1.jar:3.7.1]

at se.xxx.commons.cluster.HazelcastClusteredMapAdaptor.entrySet(HazelcastClusteredMapAdaptor.java:115) [yyyyy-commons.jar:DEV.1195-SNAPSHOT]

at se.xxx.yyyyy.cluster.ClusterServiceImpl$ClusterSubscriber.handleMemberLost(ClusterServiceImpl.java:80) [yyyyy.jar:DEV.1195-SNAPSHOT]

at se.xxx.yyyyy.cluster.ClusterServiceImpl$ClusterSubscriber.onReceive(ClusterServiceImpl.java:55) [yyyyy.jar:DEV.1195-SNAPSHOT]

at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [akka-actor_2.11-2.4.10.jar:na]

at akka.actor.Actor$class.aroundReceive(Actor.scala:484) [akka-actor_2.11-2.4.10.jar:na]

at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [akka-actor_2.11-2.4.10.jar:na]

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [akka-actor_2.11-2.4.10.jar:na]

at akka.actor.ActorCell.invoke(ActorCell.scala:495) [akka-actor_2.11-2.4.10.jar:na]

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [akka-actor_2.11-2.4.10.jar:na]

at akka.dispatch.Mailbox.run(Mailbox.scala:224) [akka-actor_2.11-2.4.10.jar:na]

at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [akka-actor_2.11-2.4.10.jar:na]

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.11.8.jar:na]

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.11.8.jar:na]

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.8.jar:na]

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.8.jar:na]

Caused by: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [tq-app1-perf1]:5701, partitionId: -1, operation: com.hazelcast.map.impl.query.QueryOperation, service: hz:impl:mapService

at com.hazelcast.spi.impl.operationservice.impl.Invocation.initInvocationTarget(Invocation.java:324) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:256) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.spi.impl.operationservice.impl.Invocation.access$300(Invocation.java:94) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.spi.impl.operationservice.impl.Invocation$InvocationRetryTask.run(Invocation.java:530) ~[hazelcast-all-3.7.1.jar:3.7.1]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_102]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_102]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_102]

at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92) ~[hazelcast-all-3.7.1.jar:3.7.1]

at ------ submitted from ------.(Unknown Source) ~[na:na]

at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:111) ~[hazelcast-all-3.7.1.jar:3.7.1]

at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74) ~[hazelcast-all-3.7.1.jar:3.7.1]

... 20 common frames omitted

Am I doing something wrong in my configuration or in the shutdown procedure, or is it an intermittent bug in hazelcast?

Regards

/Pär

Vassilis Bekiaris

unread,

Sep 29, 2016, 5:58:47 AM9/29/16

to haze...@googlegroups.com

Hi Pär,

the specific exception you shared in your log files does not affect the correctness of your map query; the exception is logged from here: https://github.com/vbekiaris/hazelcast/blob/ed7b856d5876eb258a5f2eae6881d5b33864423b/hazelcast/src/main/java/com/hazelcast/map/impl/query/MapQueryEngineImpl.java#L501-L501

After this exception is logged, the query logic locates partitions which have not yet been queried and executes the query on each remaining partition, so you should be obtaining correct results.

In general, TargetNotMemberException may occur when a node is not yet aware of another node leaving the cluster; however it is a RetryableException (https://github.com/hazelcast/hazelcast/blob/d0336a3b51893688e0d8116ec5c94d428a252e39/hazelcast/src/main/java/com/hazelcast/spi/exception/RetryableException.java#L23) and the operation will be retried. I don't see anything wrong with your code, perhaps logging this exception at WARN level is more scary than it should be.

Cheers!
Vassilis

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/995c3168-8703-4bdd-9e9b-6cdd17fc071d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward

0 new messages