Re: Cluster Failure during Network disconnect


Joe Planisky

Dec 5, 2012, 9:35:21 AM
to haze...@googlegroups.com
Hi Jamie,

You can reduce the 5 minute wait time on network outages by setting the "hazelcast.icmp.enabled" configuration property to "true". See section 12.6 (Advanced Configuration Properties) in the manual for how to do that. With ICMP pings enabled, network outages are detected within a few seconds.
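For reference, the property can be set in hazelcast.xml or as a JVM system property. A minimal sketch (the property name is from the manual; the surrounding config structure is the standard Hazelcast properties element):

```xml
<!-- hazelcast.xml: enable ICMP pings so a dead NIC is detected
     within seconds instead of the default heartbeat timeout -->
<hazelcast>
    <properties>
        <property name="hazelcast.icmp.enabled">true</property>
    </properties>
</hazelcast>
```

The same thing can be done without touching the XML by passing -Dhazelcast.icmp.enabled=true on the java command line when starting the node.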

I can't really help with the reconnection issues except to commiserate. It's not clear from your email if the reconnections always fail, or if they sometimes work and sometimes fail. We've seen occasional reconnection issues, but they usually work as expected for us. All I can offer is that more recent versions of Hazelcast seem to reconnect more reliably than older versions. You might mention which version of Hazelcast you're using, and if it's not the latest, you might want to retry your testing with the latest released version.

--
Joe


On Dec 4, 2012, at 4:42 PM, jamie.m...@gmail.com wrote:

> I'm having an issue with testing a 'disconnected node' scenario with
> Hazelcast. The intention is to simulate a network failure by bringing down
> the NIC on a virtual machine and seeing how Hazelcast manages this
> situation, e.g.:
> 1) Startup three nodes on separate hosts.
> 2) populate a partition store cache with data.
> 3) Start reading back data
> 4) Disconnect one of the nodes.
> Windows: right click, Disable on the network adapter. Then right click
> enable on the network adapter.
> Linux: sudo ifdown eth0; then sudo ifup eth0;
> 5) Wait
> 6) Reconnect. Observe behaviour.
>
> Network is setup with the following nodes:
> Linux Server (rhel5.5) - 10.1.2.42
> Linux Server (Centos5.5) - 10.1.2.48
> VM - Linux Server (rhel5) - 10.1.128.59
> VM - Windows Server 2008r2 - 10.1.128.115
> Both Linux servers + one of the two VMs are run at once (i.e. 3 nodes
> total).
>
> I've encountered some problems.
> 1) When the node disconnects, there is a five minute timeout where the
> cluster (or at least the node I am connected to) stops processing get
> requests.
> (I'm assuming this can be solved by changing
> hazelcast.max.no.heartbeat.seconds, but I've yet to check)
> 2) On a Windows Server 2008r2 host, the client fails to reconnect in two
> ways:
> 2a) If the network is brought up before 5 mins, the Windows client is unable
> to reconnect, and in the case below the cluster then crashes.
> 2b) If the network is brought up after the five mins, the Windows
> client does not reconnect.
> 3) On Linux, after the five minute timeout, the disconnected node will not
> reconnect without being restarted.
>
> From earlier experiments it didn't appear to make any difference whether
> the Windows node or one of the Linux nodes was the cluster master.
> I have tried setting the DisableDHCPMediaSense
> (HKLM\System\CurrentControlSet\Services\Tcpip\Parameters) as suggested by
> this article (albeit for Coherence, but it sounded like a similar issue)
> without success.
> https://forums.oracle.com/forums/thread.jspa?threadID=935405
>
> Questions
> 1) Is this scenario (network going down) expected to work? If so, any
> suggestions on what is going wrong on Windows?
> 2) Should the cluster continue to work while waiting for a node to timeout?
> 3) Is it expected that, once the 5min timeout has been exceeded, a
> client should not be able to reconnect? I thought this was the purpose of
> the split-brain merge process?
>
> Many thanks,
>
> Jamie
>
>
>
> Disconnecting a Windows client:
> Disconnecting and reconnecting within 5mins.
> ===========================
>
>
> Linux / Cluster logs
> ====================
>
> 2012-12-04 13:39:53.727 0.7s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.2.48]:5701
> [ap_cluster_Administrator]
>
> Members [3] {
> Member [10.1.2.42]:5701
> Member [10.1.2.48]:5701 this
> Member [10.1.128.115]:5701
> }
>
> 2012-12-04 13:39:53.728 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] : Member added
> @ 1354628393727: MembershipEvent {Member [10.1.128.115]:5701} added
> 2012-12-04 13:39:53.728 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] : Cluster size
> @ 133376: 3
> 2012-12-04 13:39:53.728 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.ConnectionManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] 48769 accepted socket connection from
> /10.1.128.115:5701
> 2012-12-04 13:39:54.045 0.3s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] : Migration
> started @ 133693: MigrationEvent{partitionId=14, oldOwner=Member
> [10.1.2.48]:5701 this, newOwner=Member [10.1.128.115]:5701}
> ...
> ... Lots of migrations.....
> ...
> 2012-12-04 13:39:56.641 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] : Migration
> completed @ 136289: MigrationEvent{partitionId=75, oldOwner=Member
> [10.1.2.42]:5701, newOwner=Member [10.1.128.115]:5701}
> 2012-12-04 13:39:57.270 0.6s INFO [main] : Throughput measurement @
> 136918: 377
> 2012-12-04 13:39:58.271 1.0s INFO [main] : Throughput measurement @
> 137919: 2167
> ...
> ... Repeated get requests continue....
> ...
> 2012-12-04 13:40:13.290 1.0s INFO [main] : Throughput measurement @
> 152938: 2952
> 2012-12-04 13:40:14.291 1.0s INFO [main] : Throughput measurement @
> 153939: 2900
> ...
> ... <<then I disconnect the nic... there is no more output from the
> clusters log.. this is the 5 min timeout described.>>
> ...
>
> ...
> ... <<about now I reconnect the nic>>
> ...
>
> 2012-12-04 13:43:46.227 211.9s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-0] :
> <com.hazelcast.nio.Connection> [10.1.2.48]:5701 [ap_cluster_Administrator]
> Connection [Address[10.1.128.115]:5701] lost. Reason:
> java.io.IOException[Connection reset by peer]
> 2012-12-04 13:43:46.227 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-0] :
> <com.hazelcast.nio.ReadHandler> [10.1.2.48]:5701 [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-0 Closing socket to
> endpoint Address[10.1.128.115]:5701, Cause:java.io.IOException: Connection
> reset by peer
> 2012-12-04 13:43:46.731 0.5s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.ConnectionManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] 58763 accepted socket connection from
> /10.1.128.115:5701
> 2012-12-04 13:43:47.234 0.5s INFO [main] : Throughput measurement @
> 154104: 2648
> 2012-12-04 13:43:47.234 INFO [main] : Throughput measurement @
> 366882: 0
> 2012-12-04 13:43:54.789 7.6s WARN [main] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] Caller ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> target=Address[10.1.128.115]:5701 / connected=true, redoCount=16,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> 1:Address[10.1.2.42]:5701
> 2:Address[10.1.2.48]:5701
> }
> }
> 2012-12-04 13:43:55.295 0.5s WARN [main] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] Caller ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> target=Address[10.1.128.115]:5701 / connected=true, redoCount=17,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> 1:Address[10.1.2.42]:5701
> 2:Address[10.1.2.48]:5701
> }
> }
> ...
> ... <<The above then repeats until redoCount reaches 90>>
> ...
>
> 2012-12-04 13:44:32.040 0.5s WARN [main] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] Caller ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> target=Address[10.1.128.115]:5701 / connected=true, redoCount=90,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> 1:Address[10.1.2.42]:5701
> 2:Address[10.1.2.48]:5701
> }
> }
> 2012-12-04 13:44:32.543 0.5s WARN [main] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] Caller ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> target=Address[10.1.128.115]:5701 / connected=true, redoCount=91,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> 1:Address[10.1.2.42]:5701
> 2:Address[10.1.2.48]:5701
> }
> }
> 2012-12-04 13:44:32.543 INFO [main] : Disconnecting from provider
> 2012-12-04 13:44:32.543 INFO [main] :
> <com.hazelcast.impl.LifecycleServiceImpl> [10.1.2.48]:5701
> [ap_cluster_Administrator] Address[10.1.2.48]:5701 is SHUTTING_DOWN
> 2012-12-04 13:44:33.089 0.5s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2] :
> <com.hazelcast.nio.Connection> [10.1.2.48]:5701 [ap_cluster_Administrator]
> Connection [Address[10.1.2.42]:5701] lost. Reason:
> java.io.EOFException[null]
> 2012-12-04 13:44:33.090 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2] :
> <com.hazelcast.nio.ReadHandler> [10.1.2.48]:5701 [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2 Closing socket to
> endpoint Address[10.1.2.42]:5701, Cause:java.io.EOFException
> 2012-12-04 13:44:33.091 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1] :
> <com.hazelcast.nio.Connection> [10.1.2.48]:5701 [ap_cluster_Administrator]
> Connection [Address[10.1.128.115]:5701] lost. Reason:
> java.io.EOFException[null]
> 2012-12-04 13:44:33.091 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1] :
> <com.hazelcast.nio.ReadHandler> [10.1.2.48]:5701 [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1 Closing socket to
> endpoint Address[10.1.128.115]:5701, Cause:java.io.EOFException
> 2012-12-04 13:44:33.422 0.3s INFO [main] : <com.hazelcast.initializer>
> [10.1.2.48]:5701 [ap_cluster_Administrator] Destroying node initializer.
> 2012-12-04 13:44:33.422 INFO [main] : <com.hazelcast.impl.Node>
> [10.1.2.48]:5701 [ap_cluster_Administrator] Hazelcast Shutdown is completed
> in 839 ms.
> 2012-12-04 13:44:33.422 INFO [main] :
> <com.hazelcast.impl.LifecycleServiceImpl> [10.1.2.48]:5701
> [ap_cluster_Administrator] Address[10.1.2.48]:5701 is SHUTDOWN
> 2012-12-04 13:44:33.422 ERROR [main] : Terminating client due to
> exception:
> com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_GET] Redo
> threshold[90] exceeded! Last redo cause: REDO_MEMBER_UNKNOWN, Name:
> c:tbl_DistStorageProvider_rob_001_1
> at
> com.hazelcast.impl.BaseManager$ResponseQueueCall.getRedoAwareResult(BaseManager.java:649)
> at
> com.hazelcast.impl.BaseManager$ResponseQueueCall.getResult(BaseManager.java:636)
> at
> com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:464)
> at
> com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555)
> at
> com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:460)
> at
> com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555)
> at
> com.hazelcast.impl.BaseManager$RequestBasedCall.objectCall(BaseManager.java:502)
> at
> com.hazelcast.impl.BaseManager$ResponseQueueCall.objectCall(BaseManager.java:555)
> at
> com.hazelcast.impl.BaseManager$RequestBasedCall.objectCall(BaseManager.java:508)
> at
> com.hazelcast.impl.BaseManager$ResponseQueueCall.objectCall(BaseManager.java:555)
> at
> com.hazelcast.impl.ConcurrentMapManager$MGet.get(ConcurrentMapManager.java:1124)
> at com.hazelcast.impl.MProxyImpl$MProxyReal.get(MProxyImpl.java:781)
> at com.hazelcast.impl.MProxyImpl.get(MProxyImpl.java:114)
> at
> com.company.test.corr.storage.distributed.provider.AbstractDistStorageProvider.get(AbstractDistStorageProvider.java:26)
> at Test.go(Test.java:148)
> at Test.main(Test.java:85)
>
> ...
> ... And it seems at this point the cluster is dead, java process is marked
> as <defunct>.
> ...
>
> Windows client logs:
> ====================
>
> 2012-12-04 13:39:53.671 INFO [main] : Dumb Server was started
> successfully.
>
> ...
> ... <<then I disconnect the nic>>
> ...
>
> 2012-12-04 13:40:12.703 19.0s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2] :
> <com.hazelcast.nio.Connection> [10.1.128.115]:5701
> [ap_cluster_Administrator] Connection [Address[10.1.2.48]:5701] lost.
> Reason: java.io.IOException[An existing connection was forcibly closed by
> the remote host]
> 2012-12-04 13:40:12.703 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2] :
> <com.hazelcast.nio.ReadHandler> [10.1.128.115]:5701
> [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2 Closing socket to
> endpoint Address[10.1.2.48]:5701, Cause:java.io.IOException: An existing
> connection was forcibly closed by the remote host
> 2012-12-04 13:40:12.703 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1] :
> <com.hazelcast.nio.Connection> [10.1.128.115]:5701
> [ap_cluster_Administrator] Connection [Address[10.1.2.42]:5701] lost.
> Reason: java.io.IOException[An existing connection was forcibly closed by
> the remote host]
> 2012-12-04 13:40:12.703 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1] :
> <com.hazelcast.nio.ReadHandler> [10.1.128.115]:5701
> [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1 Closing socket to
> endpoint Address[10.1.2.42]:5701, Cause:java.io.IOException: An existing
> connection was forcibly closed by the remote host
> 2012-12-04 13:40:13.546 0.8s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.48:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 13:40:13.546 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-2] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.42:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 13:40:14.562 1.0s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-2] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.42:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 13:40:14.562 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.48:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 13:40:15.578 1.0s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.42:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 13:40:15.578 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.ConnectionMonitor> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing connection to endpoint
> Address[10.1.2.42]:5701 Cause => java.net.SocketException {Network is
> unreachable: connect}, Error-Count: 5
> 2012-12-04 13:40:15.578 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing Address Address[10.1.2.42]:5701
> 2012-12-04 13:40:15.578 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.Node> [10.1.128.115]:5701 [ap_cluster_Administrator] **
> setting master address to Address[10.1.2.48]:5701
> 2012-12-04 13:40:15.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-2] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.48:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 13:40:15.593 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-2] :
> <com.hazelcast.nio.ConnectionMonitor> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing connection to endpoint
> Address[10.1.2.48]:5701 Cause => java.net.SocketException {Network is
> unreachable: connect}, Error-Count: 5
> 2012-12-04 13:40:15.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] : Member
> removed @ 1354628415593: MembershipEvent {Member [10.1.2.42]:5701} removed
> 2012-12-04 13:40:15.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator]
>
> Members [2] {
> Member [10.1.2.48]:5701
> Member [10.1.128.115]:5701 this
> }
>
> 2012-12-04 13:40:15.609 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing Address Address[10.1.2.48]:5701
> 2012-12-04 13:40:15.609 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.Node> [10.1.128.115]:5701 [ap_cluster_Administrator] **
> setting master address to Address[10.1.128.115]:5701
> 2012-12-04 13:40:15.640 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] : Member
> removed @ 1354628415640: MembershipEvent {Member [10.1.2.48]:5701} removed
> 2012-12-04 13:40:15.640 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Starting to send partition replica diffs...true
> 2012-12-04 13:40:15.640 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator]
>
> Members [1] {
> Member [10.1.128.115]:5701 this
> }
>
> 2012-12-04 13:40:18.812 3.2s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.MigrationThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Total 0 partition replica diffs have been
> processed.
> 2012-12-04 13:40:18.812 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.MigrationThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Re-partitioning cluster data... Immediate-Tasks:
> 0, Scheduled-Tasks: 0
>
> ...
> ... <<about now I reconnect the nic>>
> ...
>
> 2012-12-04 13:42:14.078 115.3s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.SocketAcceptor> [10.1.128.115]:5701
> [ap_cluster_Administrator] 5701 is accepting socket connection from
> /10.1.2.42:47155
> 2012-12-04 13:42:14.078 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.ConnectionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] 5701 accepted socket connection from
> /10.1.2.42:47155
> 2012-12-04 13:42:20.750 6.7s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:42:30.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:42:40.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:42:50.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:00.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:10.765 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:20.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:30.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:40.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:44.687 3.9s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.SocketAcceptor> [10.1.128.115]:5701
> [ap_cluster_Administrator] 5701 is accepting socket connection from
> /10.1.2.48:58763
> 2012-12-04 13:43:44.687 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.ConnectionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] 5701 accepted socket connection from
> /10.1.2.48:58763
> 2012-12-04 13:43:50.750 6.1s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:43:53.250 2.5s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Handler ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> caller=Address[10.1.2.48]:5701 / connected=true, redoCount=16,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> }
> }
> 2012-12-04 13:43:53.750 0.5s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Handler ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> caller=Address[10.1.2.48]:5701 / connected=true, redoCount=17,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> }
> }
> 2012-12-04 13:43:54.250 0.5s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Handler ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> caller=Address[10.1.2.48]:5701 / connected=true, redoCount=18,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> }
> }
>
> ...
> ... <<The above then repeats until redoCount reaches 90>>
> ...
>
> 2012-12-04 13:44:29.484 0.5s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Handler ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> caller=Address[10.1.2.48]:5701 / connected=true, redoCount=88,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> }
> }
> 2012-12-04 13:44:29.984 0.5s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Handler ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> caller=Address[10.1.2.48]:5701 / connected=true, redoCount=89,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> }
> }
> 2012-12-04 13:44:30.484 0.5s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.ConcurrentMapManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Handler ->
> RedoLog{name=c:tbl_DistStorageProvider_rob_001_1,
> redoType=REDO_MEMBER_UNKNOWN, operation=CONCURRENT_MAP_GET,
> caller=Address[10.1.2.48]:5701 / connected=true, redoCount=90,
> migrating=null
> partition=Partition [14]{
> 0:Address[10.1.128.115]:5701
> }
> }
> 2012-12-04 13:44:30.750 0.3s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:44:31.031 0.3s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing Address Address[10.1.2.48]:5701
> 2012-12-04 13:44:31.062 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.nio.Connection> [10.1.128.115]:5701
> [ap_cluster_Administrator] Connection [Address[10.1.2.48]:5701] lost.
> Reason: Explicit close
> 2012-12-04 13:44:40.750 9.7s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:44:50.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:45:00.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:45:10.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:45:20.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:45:30.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:45:40.750 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:45:50.734 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:46:00.734 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:46:10.734 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:46:20.734 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:46:30.734 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:46:40.734 10.0s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] This is the master node and received a
> PartitionRuntimeState from Address[10.1.2.42]:5701. Ignoring incoming
> state!
> 2012-12-04 13:46:46.609 5.9s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.impl.MulticastService> [10.1.128.115]:5701
> [ap_cluster_Administrator] You probably have too long Hazelcast
> configuration!
> java.net.NoRouteToHostException: No route to host: Datagram send failed
> at java.net.TwoStacksPlainDatagramSocketImpl.send(Native Method)
> at java.net.DatagramSocket.send(DatagramSocket.java:676)
> at com.hazelcast.impl.MulticastService.send(MulticastService.java:173)
> at
> com.hazelcast.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:112)
> at
> com.hazelcast.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:63)
> at
> com.hazelcast.impl.SplitBrainHandler.access$000(SplitBrainHandler.java:24)
> at com.hazelcast.impl.SplitBrainHandler$1.doRun(SplitBrainHandler.java:49)
> at com.hazelcast.impl.FallThroughRunnable.run(FallThroughRunnable.java:22)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> at
> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
>
>
>
>
> Disconnecting a Windows client:
> Disconnecting and reconnecting over 5mins.
> ===========================
>
>
> Linux / Cluster logs
> ====================
> 2012-12-04 23:47:48.440 7.9s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.2.48]:5701
> [ap_cluster_Administrator]
>
> Members [3] {
> Member [10.1.2.42]:5701
> Member [10.1.2.48]:5701 this
> Member [10.1.128.115]:5701
> }
>
> 2012-12-04 23:47:48.440 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] : Member added
> @ 1354664868440: MembershipEvent {Member [10.1.128.115]:5701} added
> 2012-12-04 23:47:48.445 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] :
> <com.hazelcast.nio.ConnectionManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] 51863 accepted socket connection from
> /10.1.128.115:5701
> 2012-12-04 23:47:48.700 0.3s INFO [main] : Connected to all 3 cluster
> nodes
> 2012-12-04 23:47:48.703 INFO [main] : Cluster size @ 15616: 3
> ...
> ... Lots of migrations.....
> ...
> 2012-12-04 23:47:49.290 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] : Migration
> completed @ 16203: MigrationEvent{partitionId=266, oldOwner=Member
> [10.1.2.42]:5701, newOwner=Member [10.1.128.115]:5701}
> 2012-12-04 23:48:04.760 15.5s INFO [main] : Finished ADDS: 16,052 ms
> total, 0.642 ms per item, 1,557 items per second
> 2012-12-04 23:48:04.762 INFO [main] : Cluster size @ 31675: 3
> 2012-12-04 23:48:05.762 1.0s INFO [main] : Throughput measurement @
> 32675: 2421
> 2012-12-04 23:48:06.763 1.0s INFO [main] : Throughput measurement @
> 33676: 2795
>
> ...
> ... Repeated get requests continue....
> ...
> 2012-12-04 23:48:24.947 1.0s INFO [main] : Throughput measurement @
> 51860: 2986
> 2012-12-04 23:48:25.948 1.0s INFO [main] : Throughput measurement @
> 52861: 2802
> 2012-12-04 23:48:26.952 1.0s INFO [main] : Throughput measurement @
> 53865: 2912
> ...
> ... <<then I disconnect the nic... there is no more output from the
> clusters log.. this is the 5 min timeout described. Eventually after 5mins
> logging resumes.>>
> ...
> 2012-12-04 23:53:27.195 300.2s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.2.48]:5701
> [ap_cluster_Administrator] Removing Address Address[10.1.128.115]:5701
> 2012-12-04 23:53:27.196 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.nio.Connection> [10.1.2.48]:5701 [ap_cluster_Administrator]
> Connection [Address[10.1.128.115]:5701] lost. Reason: Explicit close
> 2012-12-04 23:53:27.197 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] : Member
> removed @ 1354665207197: MembershipEvent {Member [10.1.128.115]:5701}
> removed
> 2012-12-04 23:53:27.197 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] : Cluster size
> @ 354110: 2
> 2012-12-04 23:53:27.210 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.2.48]:5701
> [ap_cluster_Administrator]
>
> Members [2] {
> Member [10.1.2.42]:5701
> Member [10.1.2.48]:5701 this
> }
>
> 2012-12-04 23:53:27.711 0.5s INFO [main] : Ignoring a few operations
> between lastCompleted and lastLogged: 1 in 0 ms
> 2012-12-04 23:53:27.711 INFO [main] : Throughput measurement @
> 354624: 0
> 2012-12-04 23:53:28.712 1.0s INFO [main] : Throughput measurement @
> 355625: 7688
> 2012-12-04 23:53:29.713 1.0s INFO [main] : Throughput measurement @
> 356626: 7999
> 2012-12-04 23:53:30.714 1.0s INFO [main] : Throughput measurement @
> 357627: 7844
> 2012-12-04 23:53:31.715 1.0s INFO [main] : Throughput measurement @
> 358628: 7093
> 2012-12-04 23:53:31.882 0.2s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] : Migration
> started @ 358795: MigrationEvent{partitionId=258, oldOwner=Member
> [10.1.2.48]:5701 this, newOwner=Member [10.1.2.42]:5701}
> 2012-12-04 23:53:31.890 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-2] : Migration
> completed @ 358803: MigrationEvent{partitionId=258, oldOwner=Member
> [10.1.2.48]:5701 this, newOwner=Member [10.1.2.42]:5701}
> 2012-12-04 23:53:32.716 0.8s INFO [main] : Throughput measurement @
> 359629: 7257
> 2012-12-04 23:53:33.717 1.0s INFO [main] : Throughput measurement @
> 360630: 7595
> 2012-12-04 23:53:34.718 1.0s INFO [main] : Throughput measurement @
> 361631: 7345
>
>
>
> Windows client logs:
> ====================
>
> 2012-12-04 23:47:38.328 INFO [main] : PID = 656@BJS-win2008r2
> 2012-12-04 23:47:38.328 INFO [main] : Max memory = 5726666752
> 2012-12-04 23:47:38.390 0.1s CRIT [main] : JavaMemoryUsageLogger
> starting with total committed JVM size currently 61 MB and max memory 5,461
> MB logging to file:
> c:\dev\jamie_net_down_hazelcast\company-test\system\corr-plugins\storage\provider\robustness\DistStorageProvider_rob_003\Output\amd64-win64-msvc9_Hazelcast_2.4\mem-localhost-server-01.out
> 2012-12-04 23:47:38.421 INFO [main] :
> <com.company.test.utils.JulTocompanyLoggerBridge> Activated: sending JDK
> log messages to Commons Logging
> 2012-12-04 23:47:38.421 INFO [main] : Starting Hazelcast on host:
> BJS-win2008r2 cluster name: ap_cluster_Administrator
> 2012-12-04 23:47:38.546 0.1s INFO [main] :
> <com.hazelcast.impl.AddressPicker> Prefer IPv4 stack is true.
> 2012-12-04 23:47:38.625 0.1s INFO [main] :
> <com.hazelcast.impl.AddressPicker> Picked Address[10.1.128.115]:5701, using
> socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local
> is true
> 2012-12-04 23:47:39.125 0.5s INFO [main] : <com.hazelcast.system>
> [10.1.128.115]:5701 [ap_cluster_Administrator] Hazelcast Community Edition
> 2.4 (20121017) starting at Address[10.1.128.115]:5701
> 2012-12-04 23:47:39.125 INFO [main] : <com.hazelcast.system>
> [10.1.128.115]:5701 [ap_cluster_Administrator] Copyright (C) 2008-2012
> Hazelcast.com
> 2012-12-04 23:47:39.187 0.1s INFO [main] :
> <com.hazelcast.impl.LifecycleServiceImpl> [10.1.128.115]:5701
> [ap_cluster_Administrator] Address[10.1.128.115]:5701 is STARTING
> 2012-12-04 23:47:39.250 0.1s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:39.250 INFO [main] :
> <com.hazelcast.impl.MulticastJoiner> [10.1.128.115]:5701
> [ap_cluster_Administrator] Connecting to master node:
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:39.265 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-1] :
> <com.hazelcast.nio.ConnectionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] 62838 accepted socket connection from
> /10.1.2.42:5701
> 2012-12-04 23:47:39.812 0.6s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:39.828 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.Node> [10.1.128.115]:5701 [ap_cluster_Administrator] **
> setting master address to Address[10.1.2.42]:5701
> 2012-12-04 23:47:40.312 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:40.812 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:41.312 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:41.812 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:42.312 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:42.812 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:43.312 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:43.812 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:44.312 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:44.812 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:45.312 0.5s INFO [main] : <com.hazelcast.impl.Node>
> [10.1.128.115]:5701 [ap_cluster_Administrator] ** setting master address to
> Address[10.1.2.42]:5701
> 2012-12-04 23:47:45.312 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator]
>
> Members [3] {
> Member [10.1.2.42]:5701
> Member [10.1.2.48]:5701
> Member [10.1.128.115]:5701 this
> }
>
> 2012-12-04 23:47:45.312 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.SocketAcceptor> [10.1.128.115]:5701
> [ap_cluster_Administrator] 5701 is accepting socket connection from
> /10.1.2.48:51863
> 2012-12-04 23:47:45.312 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.nio.ConnectionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] 5701 accepted socket connection from
> /10.1.2.48:51863
> 2012-12-04 23:47:46.906 1.6s INFO [main] :
> <com.hazelcast.impl.LifecycleServiceImpl> [10.1.128.115]:5701
> [ap_cluster_Administrator] Address[10.1.128.115]:5701 is STARTED
> 2012-12-04 23:47:46.906 INFO [main] : Connected to cluster:
> Cluster [3] { Member [10.1.2.42]:5701 Member [10.1.2.48]:5701 Member
> [10.1.128.115]:5701 this }
> 2012-12-04 23:47:46.906 INFO [main] : Dumb Server was started
> successfully.
> ...
> ... The network is disconnected.
> ...
> 2012-12-04 23:48:24.234 37.3s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1] :
> <com.hazelcast.nio.Connection> [10.1.128.115]:5701
> [ap_cluster_Administrator] Connection [Address[10.1.2.42]:5701] lost.
> Reason: java.io.IOException[An existing connection was forcibly closed by
> the remote host]
> 2012-12-04 23:48:24.234 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1] :
> <com.hazelcast.nio.ReadHandler> [10.1.128.115]:5701
> [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-1 Closing socket to
> endpoint Address[10.1.2.42]:5701, Cause:java.io.IOException: An existing
> connection was forcibly closed by the remote host
> 2012-12-04 23:48:24.250 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2] :
> <com.hazelcast.nio.Connection> [10.1.128.115]:5701
> [ap_cluster_Administrator] Connection [Address[10.1.2.48]:5701] lost.
> Reason: java.io.IOException[An existing connection was forcibly closed by
> the remote host]
> 2012-12-04 23:48:24.250 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2] :
> <com.hazelcast.nio.ReadHandler> [10.1.128.115]:5701
> [ap_cluster_Administrator]
> hz._hzInstance_1_ap_cluster_Administrator.IO.thread-2 Closing socket to
> endpoint Address[10.1.2.48]:5701, Cause:java.io.IOException: An existing
> connection was forcibly closed by the remote host
> 2012-12-04 23:48:24.500 0.3s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.42:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 23:48:24.515 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.48:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 23:48:25.515 1.0s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.42:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 23:48:25.515 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.48:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 23:48:26.531 1.0s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.42:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 23:48:26.531 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-3] :
> <com.hazelcast.nio.ConnectionMonitor> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing connection to endpoint
> Address[10.1.2.42]:5701 Cause => java.net.SocketException {Network is
> unreachable: connect}, Error-Count: 5
> 2012-12-04 23:48:26.531 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing Address Address[10.1.2.42]:5701
> 2012-12-04 23:48:26.531 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.Node> [10.1.128.115]:5701 [ap_cluster_Administrator] **
> setting master address to Address[10.1.2.48]:5701
> 2012-12-04 23:48:26.578 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] :
> <com.hazelcast.nio.SocketConnector> [10.1.128.115]:5701
> [ap_cluster_Administrator] Could not connect to: /10.1.2.48:5701. Reason:
> SocketException[Network is unreachable: connect]
> 2012-12-04 23:48:26.578 WARN
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-5] :
> <com.hazelcast.nio.ConnectionMonitor> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing connection to endpoint
> Address[10.1.2.48]:5701 Cause => java.net.SocketException {Network is
> unreachable: connect}, Error-Count: 5
> 2012-12-04 23:48:26.578 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] : Member
> removed @ 1354664906578: MembershipEvent {Member [10.1.2.42]:5701} removed
> 2012-12-04 23:48:26.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator]
>
> Members [2] {
> Member [10.1.2.48]:5701
> Member [10.1.128.115]:5701 this
> }
>
> 2012-12-04 23:48:26.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Removing Address Address[10.1.2.48]:5701
> 2012-12-04 23:48:26.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.Node> [10.1.128.115]:5701 [ap_cluster_Administrator] **
> setting master address to Address[10.1.128.115]:5701
> 2012-12-04 23:48:26.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] : Member
> removed @ 1354664906593: MembershipEvent {Member [10.1.2.48]:5701} removed
> 2012-12-04 23:48:26.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Starting to send partition replica diffs...true
> 2012-12-04 23:48:26.593 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.ServiceThread] :
> <com.hazelcast.cluster.ClusterManager> [10.1.128.115]:5701
> [ap_cluster_Administrator]
>
> Members [1] {
> Member [10.1.128.115]:5701 this
> }
>
> 2012-12-04 23:48:29.843 3.2s INFO
> [hz._hzInstance_1_ap_cluster_Administrator.MigrationThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Total 0 partition replica diffs have been
> processed.
> 2012-12-04 23:48:29.859 INFO
> [hz._hzInstance_1_ap_cluster_Administrator.MigrationThread] :
> <com.hazelcast.impl.PartitionManager> [10.1.128.115]:5701
> [ap_cluster_Administrator] Re-partitioning cluster data... Immediate-Tasks:
> 0, Scheduled-Tasks: 0
> ...
> ... The client then does nothing until I reconnect the network. At this
> point it seems to fail although the process remains alive but without any
> further logging.
> ...
> 2012-12-04 23:54:39.265 369.4s WARN
> [hz._hzInstance_1_ap_cluster_Administrator.cached.thread-4] :
> <com.hazelcast.impl.MulticastService> [10.1.128.115]:5701
> [ap_cluster_Administrator] You probably have too long Hazelcast
> configuration!
> java.net.NoRouteToHostException: No route to host: Datagram send failed
> at java.net.TwoStacksPlainDatagramSocketImpl.send(Native Method)
> at java.net.DatagramSocket.send(DatagramSocket.java:676)
> at com.hazelcast.impl.MulticastService.send(MulticastService.java:173)
> at
> com.hazelcast.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:112)
> at
> com.hazelcast.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:63)
> at
> com.hazelcast.impl.SplitBrainHandler.access$000(SplitBrainHandler.java:24)
> at com.hazelcast.impl.SplitBrainHandler$1.doRun(SplitBrainHandler.java:49)
> at com.hazelcast.impl.FallThroughRunnable.run(FallThroughRunnable.java:22)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> at
> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)

Jamie McNaught

Dec 11, 2012, 12:48:56 PM
to haze...@googlegroups.com
Hi Joe,

Thanks for the reply and suggestion. Enabling ICMP brought the timeouts down to under 10 seconds. However, I've been having other problems with "get" operations returning null when the disconnected node reconnects. I'll put up a separate post about that with more details.
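For anyone else who lands on this thread: the property Joe pointed to can be set declaratively. A minimal sketch, assuming the stock 2.x hazelcast.xml layout (see section 12.6 of the manual for the full list of properties):

```xml
<!-- hazelcast.xml (2.x) -- sketch only; merge into your existing config -->
<hazelcast>
    <properties>
        <!-- Enable ICMP pings so a dead member is detected within seconds
             instead of waiting out the ~5 minute connection timeout. -->
        <property name="hazelcast.icmp.enabled">true</property>
    </properties>
</hazelcast>
```

The same property can also be set programmatically or as a JVM system property (-Dhazelcast.icmp.enabled=true) before the instance starts.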

Sorry about the delay in getting back to you... I'd been hoping to come back with good news that all was well.

Jamie