Hazelcast shutdown race condition using hazelcast.graceful.shutdown.max=3

698 views

Skip to first unread message

Justin Cranford

unread,

Nov 26, 2013, 11:17:26 AM11/26/13

to haze...@googlegroups.com

I am using these settings with Hazelcast 2.5.0 on my 3-node Tomcat cluster (each running on different physical machines).

hazelcast.socket.keep.alive=false

hazelcast.heartbeat.interval.seconds=3 (default was 1)

hazelcast.max.no.heartbeat=15 (default was 300)

hazelcast.graceful.shutdown.max=3 (default was 600)

I started with heartbeat=15sec is to react to dead members quickly while the cluster is running, so the remaining members remain responsive. However, this interferes with normal Tomcat shutdown because when I do an ordered shutdown of Tomcat 1, then Tomcat 2, and finally Tomcat 3, the second node always hangs 15-18sec waiting for re-partition.

To fix Tomcat 2 hanging, I added hazelcast.graceful.shutdown.max=3. That indeed fixed my orderly shutdown hang on Tomcat 2, but now I get these two Hazelcast exceptions on Tomcat 2 instead. It seems like there is a race condition in Hazelcast shutdown.

a) Hazelcast Instance is not active

b) INFO: Illegal access: this web application instance has been stopped already. Could not load com.hazelcast.impl.PartitionManager$6. The eventual following stack trace is caused by an error thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access, and has no functional impact.

Here are the logs from my 3 Tomcat nodes during shutdown:

>>> Tomcat 1 (192.168.1.80, No problems here)

Nov 22, 2013 2:00:45 PM org.apache.coyote.http11.Http11BaseProtocol pause
INFO: Pausing Coyote HTTP/1.1 on http-127.0.0.1-8001
Nov 22, 2013 2:00:46 PM org.apache.catalina.core.StandardService stop
INFO: Stopping service Catalina
Nov 22, 2013 2:00:46 PM org.apache.catalina.core.StandardWrapper unload
INFO: Waiting for 1 instance(s) to be deallocated

>>> Tomcat 2 (192.168.1.30, Shutdown is fast but noisy and Hazelcast throws exceptions that indicate a race condition in shutdown)

Nov 22, 2013 2:00:49 PM com.hazelcast.nio.ReadHandler
WARNING: [192.168.1.30]:5900 [ClusterManager] hz._hzInstance_1_ClusterManager.IO.thread-1 Closing socket to endpoint Address[192.168.1.80]:5900, Cause:java.io.EOFException
Nov 22, 2013 2:00:51 PM org.apache.coyote.http11.Http11BaseProtocol pause
INFO: Pausing Coyote HTTP/1.1 on http-127.0.0.1-8001
Nov 22, 2013 2:00:52 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Waiting for ongoing immediate migration tasks: 184
Nov 22, 2013 2:00:52 PM org.apache.catalina.core.StandardService stop
INFO: Stopping service Catalina
Nov 22, 2013 2:00:52 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Waiting for ongoing immediate migration tasks: 183
Nov 22, 2013 2:00:53 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Waiting for ongoing immediate migration tasks: 183
Nov 22, 2013 2:00:53 PM com.hazelcast.impl.Node
WARNING: [192.168.1.30]:5900 [ClusterManager] Graceful shutdown could not be completed in 3 seconds!
Nov 22, 2013 2:00:53 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Failed migrating from Member [192.168.1.47]:5900
com.hazelcast.core.MemberLeftException: Member [192.168.1.47]:5900 has left cluster!
at com.hazelcast.core.DistributedTask$Inner.innerSetMemberLeft(DistributedTask.java:276)
at com.hazelcast.impl.ExecutorManager$MemberCall.get(ExecutorManager.java:617)
at com.hazelcast.core.DistributedTask$Inner.get(DistributedTask.java:244)
at com.hazelcast.core.DistributedTask.get(DistributedTask.java:146)
at com.hazelcast.impl.PartitionManager$Migrator.run(PartitionManager.java:934)
at com.hazelcast.impl.PartitionManager$MigrationService.safeRun(PartitionManager.java:1064)
at com.hazelcast.impl.PartitionManager$MigrationService.safeRunImmediate(PartitionManager.java:1072)
at com.hazelcast.impl.PartitionManager$MigrationService.run(PartitionManager.java:1020)
Nov 22, 2013 2:00:53 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Migration task has failed => MigrationRequestTask{partitionId=0, from=Address[192.168.1.47]:5900, to=Address[192.168.1.30]:5900, replicaIndex=1, migration=false, diffOnly=true, selfCopyReplicaIndex=-1}
Nov 22, 2013 2:00:53 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Error [java.lang.IllegalStateException: Hazelcast Instance is not active!] while executing MigrationRequestTask{partitionId=0, from=Address[192.168.1.47]:5900, to=Address[192.168.1.30]:5900, replicaIndex=1, migration=false, diffOnly=true, selfCopyReplicaIndex=-1}
Nov 22, 2013 2:00:53 PM com.hazelcast.impl.PartitionManager
WARNING: [192.168.1.30]:5900 [ClusterManager] Hazelcast Instance is not active!
java.lang.IllegalStateException: Hazelcast Instance is not active!
at com.hazelcast.impl.Node.checkNodeState(Node.java:299)
at com.hazelcast.cluster.ClusterService.enqueueAndWait(ClusterService.java:132)
at com.hazelcast.impl.BaseManager.enqueueAndWait(BaseManager.java:1154)
at com.hazelcast.impl.PartitionManager$Migrator.migrationTaskFailed(PartitionManager.java:962)
at com.hazelcast.impl.PartitionManager$Migrator.run(PartitionManager.java:956)
at com.hazelcast.impl.PartitionManager$MigrationService.safeRun(PartitionManager.java:1064)
at com.hazelcast.impl.PartitionManager$MigrationService.safeRunImmediate(PartitionManager.java:1072)
at com.hazelcast.impl.PartitionManager$MigrationService.run(PartitionManager.java:1020)
Nov 22, 2013 2:00:54 PM org.apache.coyote.http11.Http11BaseProtocol destroy
INFO: Stopping Coyote HTTP/1.1 on http-127.0.0.1-8001
Nov 22, 2013 2:00:54 PM org.apache.catalina.loader.WebappClassLoader loadClass
INFO: Illegal access: this web application instance has been stopped already. Could not load com.hazelcast.impl.PartitionManager$6. The eventual following stack trace is caused by an error thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access, and has no functional impact.
java.lang.IllegalStateException
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1324)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1284)
at com.hazelcast.impl.PartitionManager.shutdown(PartitionManager.java:331)
at com.hazelcast.impl.ConcurrentMapManager.shutdown(ConcurrentMapManager.java:210)
at com.hazelcast.impl.Node.doShutdown(Node.java:392)
at com.hazelcast.impl.Node.shutdown(Node.java:332)
at com.hazelcast.impl.Node$NodeShutdownHookThread.run(Node.java:502)
Exception in thread "hz.ShutdownThread" java.lang.NoClassDefFoundError: com/hazelcast/impl/PartitionManager$6
at com.hazelcast.impl.PartitionManager.shutdown(PartitionManager.java:331)
at com.hazelcast.impl.ConcurrentMapManager.shutdown(ConcurrentMapManager.java:210)
at com.hazelcast.impl.Node.doShutdown(Node.java:392)
at com.hazelcast.impl.Node.shutdown(Node.java:332)
at com.hazelcast.impl.Node$NodeShutdownHookThread.run(Node.java:502)
Caused by: java.lang.ClassNotFoundException: com.hazelcast.impl.PartitionManager$6
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1438)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1284)
... 5 more

>>> Tomcat 2 (192.168.1.47, Shutdown is fast with minor expected noise)

Nov 22, 2013 2:00:46 PM com.hazelcast.nio.ReadHandler
WARNING: [192.168.1.47]:5900 [ClusterManager] hz._hzInstance_1_ClusterManager.IO.thread-1 Closing socket to endpoint Address[192.168.1.80]:5900, Cause:java.io.EOFException
Nov 22, 2013 2:00:46 PM com.mycompany.ap memberRemoved
WARNING: ApplicationClusterManager.memberRemoved Detected for Tomcat '192.168.1.80:5900' in application cluster. Event: MembershipEvent {Member [192.168.1.80]:5900} removed
Nov 22, 2013 2:00:50 PM com.hazelcast.nio.ReadHandler
WARNING: [192.168.1.47]:5900 [ClusterManager] hz._hzInstance_1_ClusterManager.IO.thread-2 Closing socket to endpoint Address[192.168.1.30]:5900, Cause:java.io.EOFException
Nov 22, 2013 2:00:50 PM com.mycompany.ap memberRemoved
WARNING: ApplicationClusterManager.memberRemoved Detected for Tomcat '192.168.1.30:5900' in application cluster. Event: MembershipEvent {Member [192.168.1.30]:5900} removed
Nov 22, 2013 2:00:52 PM org.apache.coyote.http11.Http11BaseProtocol pause
INFO: Pausing Coyote HTTP/1.1 on http-127.0.0.1-8001
Nov 22, 2013 2:00:53 PM org.apache.catalina.core.StandardService stop
INFO: Stopping service Catalina

Reply all

Reply to author

Forward

0 new messages