Cluster nodes connection loss


Paolo Di Tommaso

Mar 24, 2014, 5:08:21 PM
to haze...@googlegroups.com
Dear all, 

I'm setting up a Hazelcast cluster in a cloud-based environment.

I've noticed that quite often some nodes suddenly lose their connection and reconnect to the cluster after a few seconds.

I think it could be due to high latency in the cloud network. What is your experience with this? Is there a recommended configuration for a cloud environment?

Below you can find the node log trace. 


Thanks,
Paolo 



Mar-24 12:40:03.507 [hz._hzInstance_1_nextflow.IO.thread-in-0] INFO  com.hazelcast.nio.TcpIpConnection - [172.16.1.115]:5701 [nextflow] Connection [Address[172.16.1.92]:5701] lost. Reason: java.io.IOException[Connection reset by peer]
Mar-24 12:40:03.507 [hz._hzInstance_1_nextflow.IO.thread-out-0] WARN  com.hazelcast.nio.WriteHandler - [172.16.1.115]:5701 [nextflow] hz._hzInstance_1_nextflow.IO.thread-out-0 Closing socket to endpoint Address[172.16.1.92]:5701, Cause:java.nio.channels.ClosedChannelException
Mar-24 12:40:03.508 [hz._hzInstance_1_nextflow.IO.thread-in-0] WARN  com.hazelcast.nio.ReadHandler - [172.16.1.115]:5701 [nextflow] hz._hzInstance_1_nextflow.IO.thread-in-0 Closing socket to endpoint Address[172.16.1.92]:5701, Cause:java.io.IOException: Connection reset by peer
Mar-24 12:40:03.739 [hz._hzInstance_1_nextflow.cached.thread-1] INFO  com.hazelcast.nio.SocketConnector - [172.16.1.115]:5701 [nextflow] Connecting to /172.16.1.92:5701, timeout: 0, bind-any: false
Mar-24 12:40:03.741 [hz._hzInstance_1_nextflow.cached.thread-1] INFO  com.hazelcast.nio.SocketConnector - [172.16.1.115]:5701 [nextflow] Could not connect to: /172.16.1.92:5701. Reason: SocketException[Connection refused to address /172.16.1.92:5701]
Mar-24 12:40:04.739 [hz._hzInstance_1_nextflow.cached.thread-1] INFO  com.hazelcast.nio.SocketConnector - [172.16.1.115]:5701 [nextflow] Connecting to /172.16.1.92:5701, timeout: 0, bind-any: false
Mar-24 12:40:04.740 [hz._hzInstance_1_nextflow.cached.thread-1] INFO  com.hazelcast.nio.SocketConnector - [172.16.1.115]:5701 [nextflow] Could not connect to: /172.16.1.92:5701. Reason: SocketException[Connection refused to address /172.16.1.92:5701]
Mar-24 12:40:05.740 [hz._hzInstance_1_nextflow.cached.thread-2] INFO  com.hazelcast.nio.SocketConnector - [172.16.1.115]:5701 [nextflow] Connecting to /172.16.1.92:5701, timeout: 0, bind-any: false
Mar-24 12:40:05.741 [hz._hzInstance_1_nextflow.cached.thread-2] INFO  com.hazelcast.nio.SocketConnector - [172.16.1.115]:5701 [nextflow] Could not connect to: /172.16.1.92:5701. Reason: SocketException[Connection refused to address /172.16.1.92:5701]
Mar-24 12:40:05.741 [hz._hzInstance_1_nextflow.cached.thread-2] WARN  com.hazelcast.nio.ConnectionMonitor - [172.16.1.115]:5701 [nextflow] Removing connection to endpoint Address[172.16.1.92]:5701 Cause => java.net.SocketException {Connection refused to address /172.16.1.92:5701}, Error-Count: 5
Mar-24 12:40:05.742 [hz._hzInstance_1_nextflow.cached.thread-3] INFO  com.hazelcast.cluster.ClusterService - [172.16.1.115]:5701 [nextflow] Master Address[172.16.1.92]:5701 left the cluster. Assigning new master Member [172.16.1.115]:5701 this
Mar-24 12:40:05.742 [hz._hzInstance_1_nextflow.cached.thread-3] INFO  com.hazelcast.cluster.ClusterService - [172.16.1.115]:5701 [nextflow] Removing Member [172.16.1.92]:5701
Mar-24 12:40:05.765 [hz._hzInstance_1_nextflow.cached.thread-3] INFO  com.hazelcast.cluster.ClusterService - [172.16.1.115]:5701 [nextflow] 

Members [1] {
Member [172.16.1.115]:5701 this
}

Mar-24 12:40:05.765 [hz._hzInstance_1_nextflow.migration] INFO  c.h.partition.PartitionService - [172.16.1.115]:5701 [nextflow] Partition balance is ok, no need to re-partition cluster data... 
Mar-24 12:40:05.772 [hz._hzInstance_1_nextflow.cached.thread-4] INFO  nextflow.executor.HzDaemon - Nextflow cluster member remove: Member [172.16.1.92]:5701
Mar-24 12:40:06.777 [hz._hzInstance_1_nextflow.IO.thread-Acceptor] INFO  com.hazelcast.nio.SocketAcceptor - [172.16.1.115]:5701 [nextflow] Accepting socket connection from /172.16.1.92:36418
Mar-24 12:40:06.777 [hz._hzInstance_1_nextflow.IO.thread-Acceptor] INFO  c.h.nio.TcpIpConnectionManager - [172.16.1.115]:5701 [nextflow] 5701 accepted socket connection from /172.16.1.92:36418
Mar-24 12:40:12.267 [hz._hzInstance_1_nextflow.cached.thread-5] INFO  nextflow.executor.HzDaemon - Nextflow cluster member added: Member [172.16.1.92]:5701
Mar-24 12:40:12.268 [hz._hzInstance_1_nextflow.cached.thread-4] INFO  com.hazelcast.cluster.ClusterService - [172.16.1.115]:5701 [nextflow] 

Members [2] {
Member [172.16.1.115]:5701 this
Member [172.16.1.92]:5701
}

Paolo Di Tommaso

Apr 2, 2014, 11:15:50 AM
to haze...@googlegroups.com
Hi all, 

I'm testing a small Hazelcast cluster (both 3.1.6 and 3.2) in a cloud environment and I've noticed that nodes disconnect and reconnect quite frequently.

Does anybody have an idea why this happens? Is there any "magic" tuning required for the cloud? (I suspect it may be caused by high latency in the network.)


Thanks,
Paolo

Enes Akar

Apr 2, 2014, 4:12:12 PM
to haze...@googlegroups.com

Which instance types do you use? For some types, Amazon does not guarantee a high-quality internal network.

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/58b8af6e-6bca-46b4-8756-c47f7e1afab3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Enes Akar

Apr 2, 2014, 4:13:07 PM
to haze...@googlegroups.com

Sorry, I assumed you use AWS. Do you?

Paolo Di Tommaso

Apr 2, 2014, 4:26:26 PM
to haze...@googlegroups.com
Hi, well, no. I'm using "medium" instances on this platform: https://www.opensciencedatacloud.org

If the cause is network latency, is there any configuration property that can be tuned?

Moreover, I've noticed the following warnings on the client side; at the same time, I'm losing some items in my distributed data structure.



Apr-02 10:39:20.298 [InSelector] WARN  c.h.c.c.nio.ClientConnection - Connection [/172.16.1.133:5701] lost. Reason: java.io.IOException[Connection reset by peer]
Apr-02 10:39:20.299 [InSelector] WARN  c.h.c.c.nio.ClientReadHandler - InSelector Closing socket to endpoint Address[172.16.1.133]:5701, Cause:java.io.IOException: Connection reset by peer
Apr-02 10:39:34.868 [InSelector] WARN  c.h.c.c.nio.ClientConnection - Connection [/172.16.1.159:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
Apr-02 10:39:34.869 [InSelector] WARN  c.h.c.c.nio.ClientReadHandler - InSelector Closing socket to endpoint Address[172.16.1.159]:5701, Cause:java.io.EOFException: Remote socket closed!

Thanks for helping.

Cheers,
Paolo
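
On the question of tunable properties: Hazelcast 3.x reads a set of failure-detection properties, either from its configuration or from JVM system properties. The sketch below collects four of them and applies them as system properties before the instance is created. The class and method names are hypothetical, and the values are illustrative assumptions rather than recommendations from this thread; property names and defaults should be checked against the documentation for the Hazelcast version in use. Relaxing these limits only makes members more tolerant of slow or briefly failing connections and would not help if the remote member is actually going down.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: collect failure-detection properties and apply them as JVM system properties. */
final class HazelcastTimeoutSketch {

    /** Illustrative values only (assumptions, not taken from the thread). */
    static Map<String, String> relaxedFailureDetection() {
        Map<String, String> props = new LinkedHashMap<>();
        // How often each member sends a heartbeat to the others, in seconds.
        props.put("hazelcast.heartbeat.interval.seconds", "5");
        // How long a member may stay silent before it is considered dead, in seconds.
        props.put("hazelcast.max.no.heartbeat.seconds", "120");
        // Minimum interval between two counted connection faults, in milliseconds.
        props.put("hazelcast.connection.monitor.interval", "500");
        // Number of connection faults tolerated before the endpoint is removed.
        props.put("hazelcast.connection.monitor.max.faults", "10");
        return props;
    }

    /** Call this before creating the Hazelcast instance so the values are picked up at startup. */
    static void applyAsSystemProperties(Map<String, String> props) {
        for (Map.Entry<String, String> entry : props.entrySet()) {
            System.setProperty(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> props = relaxedFailureDetection();
        applyAsSystemProperties(props);
        // Print what was applied so the effective settings can be checked in the startup log.
        for (String key : props.keySet()) {
            System.out.println(key + "=" + System.getProperty(key));
        }
    }
}
```

The same values could also be passed on the command line as `-D` options, which avoids touching application code.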





besui...@gmail.com

Apr 29, 2016, 5:22:46 AM
to Hazelcast
Hi Paolo,

Did you ever find the root cause of this issue? We have been using Hazelcast 3.2.3 for our application and we observe the same issue, where the client disconnects and then reconnects every 2 hours. We are hosting our application on Linux servers:

[4/28/16 12:35:37:209 MST] 00000056 ClientConnect W com.hazelcast.client.connection.nio.ClientConnection  Connection [/xx.xx.xx.136:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
[4/28/16 12:35:37:210 MST] 00000056 ClientReadHan W com.hazelcast.client.connection.nio.ClientReadHandler  InSelector Closing socket to endpoint Address[xx.xx.xx.136]:5701, Cause:java.io.EOFException: Remote socket closed!
[4/28/16 12:35:37:221 MST] 00000058 ClientConnect W com.hazelcast.client.connection.nio.ClientConnection  Connection [/xx.xx.xx.136:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
[4/28/16 12:35:37:222 MST] 00000058 ClientReadHan W com.hazelcast.client.connection.nio.ClientReadHandler  InSelector Closing socket to endpoint Address[xx.xx.xx.136]:5701, Cause:java.io.EOFException: Remote socket closed!
[4/28/16 12:35:38:161 MST] 00000055 ClientConnect W com.hazelcast.client.connection.nio.ClientConnection  Connection [/xx.xx.xx.136:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
[4/28/16 12:35:38:162 MST] 00000055 ClientReadHan W com.hazelcast.client.connection.nio.ClientReadHandler  InSelector Closing socket to endpoint Address[xx.xx.xx.136]:5701, Cause:java.io.EOFException: Remote socket closed!
[4/28/16 12:36:52:165 MST] 00000050 ClientCluster W com.hazelcast.client.spi.ClientClusterService  Error while listening cluster events! -> ClientConnection{live=true, writeHandler=com.hazelcast.client.connection.nio.ClientWriteHandler@44184418, readHandler=com.hazelcast.client.connection.nio.ClientReadHandler@43e543e5, connectionId=504, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/yy.yy.y.133:40888 remote=/xx.xx.xx.136:5701]}, remoteEndpoint=Address[xx.xx.xx.136]:5701}, Error: java.io.IOException: Connection timed out
[4/28/16 12:36:52:167 MST] 00000050 ClientConnect W com.hazelcast.client.connection.nio.ClientConnection  Connection [null] lost. Reason: Socket explicitly closed
[4/28/16 12:36:52:167 MST] 00000050 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_7_hzadmin][3.2.3] is CLIENT_DISCONNECTED
[4/28/16 12:36:52:173 MST] 00000052 ClientCluster W com.hazelcast.client.spi.ClientClusterService  Error while listening cluster events! -> ClientConnection{live=true, writeHandler=com.hazelcast.client.connection.nio.ClientWriteHandler@401e401e, readHandler=com.hazelcast.client.connection.nio.ClientReadHandler@3feb3feb, connectionId=538, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/yy.yy.y.133:36997 remote=/xx.xx.xx.138:5702]}, remoteEndpoint=Address[xx.xx.xx.138]:5702}, Error: java.io.IOException: Connection timed out
[4/28/16 12:36:52:174 MST] 00000052 ClientConnect W com.hazelcast.client.connection.nio.ClientConnection  Connection [null] lost. Reason: Socket explicitly closed
[4/28/16 12:36:52:175 MST] 00000052 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_6_hzadmin][3.2.3] is CLIENT_DISCONNECTED
[4/28/16 12:36:53:118 MST] 00000051 ClientCluster W com.hazelcast.client.spi.ClientClusterService  Error while listening cluster events! -> ClientConnection{live=true, writeHandler=com.hazelcast.client.connection.nio.ClientWriteHandler@2240224, readHandler=com.hazelcast.client.connection.nio.ClientReadHandler@1f101f1, connectionId=507, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/yy.yy.y.133:42665 remote=/xx.xx.xx.139:5702]}, remoteEndpoint=Address[xx.xx.xx.139]:5702}, Error: java.io.IOException: Connection timed out
[4/28/16 12:36:53:119 MST] 00000051 ClientConnect W com.hazelcast.client.connection.nio.ClientConnection  Connection [null] lost. Reason: Socket explicitly closed
[4/28/16 12:36:53:120 MST] 00000051 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_8_hzadmin][3.2.3] is CLIENT_DISCONNECTED
[4/28/16 12:36:53:236 MST] 00000052 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_6_hzadmin][3.2.3] is CLIENT_CONNECTED
[4/28/16 12:36:53:236 MST] 00000050 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_7_hzadmin][3.2.3] is CLIENT_CONNECTED
[4/28/16 12:36:53:237 MST] 00000052 ClientCluster I com.hazelcast.client.spi.ClientClusterService
Members [8] {
        Member [xx.xx.xx.136]:5701
        Member [xx.xx.xx.136]:5702
        Member [xx.xx.xx.137]:5701
        Member [xx.xx.xx.137]:5702
        Member [xx.xx.xx.138]:5701
        Member [xx.xx.xx.138]:5702
        Member [xx.xx.xx.139]:5701
        Member [xx.xx.xx.139]:5702
}
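
One way to confirm the reported two-hour pattern is to pull the `CLIENT_DISCONNECTED` timestamps for a single client out of the log and look at the gaps between them. The sketch below does that for the log format shown above. It assumes the timestamps use a 24-hour clock and ignores the time-zone label; the class and method names are hypothetical, and the second sample line in `main` is synthetic, added only to produce a gap.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: measure the time between CLIENT_DISCONNECTED events of one client. */
final class DisconnectIntervalSketch {

    // Matches a prefix such as "[4/28/16 12:36:52:167 MST]"; the zone label is ignored.
    private static final Pattern STAMP = Pattern.compile(
            "^\\[(\\d{1,2}/\\d{1,2}/\\d{2} \\d{1,2}:\\d{2}:\\d{2}:\\d{3}) \\w+\\]");

    // Assumption: month/day/two-digit year and a 24-hour clock.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/uu H:mm:ss:SSS");

    /** Returns the timestamps of CLIENT_DISCONNECTED lines logged by the given client. */
    static List<LocalDateTime> disconnectTimes(List<String> lines, String clientName) {
        List<LocalDateTime> times = new ArrayList<>();
        for (String line : lines) {
            if (!line.contains("HazelcastClient[" + clientName + "]")
                    || !line.contains("is CLIENT_DISCONNECTED")) {
                continue;
            }
            Matcher m = STAMP.matcher(line);
            if (m.find()) {
                times.add(LocalDateTime.parse(m.group(1), FORMAT));
            }
        }
        return times;
    }

    /** Returns the durations between consecutive timestamps. */
    static List<Duration> gaps(List<LocalDateTime> times) {
        List<Duration> result = new ArrayList<>();
        for (int i = 1; i < times.size(); i++) {
            result.add(Duration.between(times.get(i - 1), times.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                // Line taken from the log above.
                "[4/28/16 12:36:52:167 MST] 00000050 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_7_hzadmin][3.2.3] is CLIENT_DISCONNECTED",
                // Synthetic line, two hours later, for illustration only.
                "[4/28/16 14:36:52:167 MST] 00000050 LifecycleServ I com.hazelcast.core.LifecycleService  HazelcastClient[hz.client_7_hzadmin][3.2.3] is CLIENT_DISCONNECTED");
        for (Duration gap : gaps(disconnectTimes(lines, "hz.client_7_hzadmin"))) {
            System.out.println("gap: " + gap);
        }
    }
}
```

Filtering by client name matters here because several clients in the same JVM log their own disconnect within a second of each other, which would otherwise hide the longer interval.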
 