icmp


Rune

May 13, 2014, 5:41:44 AM
to haze...@googlegroups.com
Hi


Hazelcast 3.2.1
Two nodes on separate Centos machines virtualized on Hyper-V
Java 7

Enabled icmp (read elsewhere that this is recommended to get faster node alive/dead confirmations).

My properties are

hz.icmp.enabled=true
hz.icmp.timeout=3000
hz.icmp.ttl=1


However, I get these messages in the log

2014-05-13 10:59:26,997 WARN  com.hazelcast.cluster.ClusterService - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Address[10.230.48.190]:5701 will ping Address[10.230.48.189]:5701
2014-05-13 10:59:27,002 WARN  com.hazelcast.cluster.ClusterService - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Address[10.230.48.190]:5701 couldn't ping Address[10.230.48.189]:5701
2014-05-13 10:59:28,139 WARN  com.hazelcast.spi.OperationService - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Member [10.230.48.189]:5701 has left cluster!
2014-05-13 11:02:18,959 WARN  com.hazelcast.cluster.TcpIpJoiner - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Address[10.230.48.190]:5701 is merging [tcp/ip] to Address[10.230.48.189]:5701
2014-05-13 11:02:18,960 WARN  com.hazelcast.cluster.PrepareMergeOperation - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Preparing to merge... Waiting for merge instruction...
2014-05-13 11:02:18,961 WARN  com.hazelcast.cluster.MergeClustersOperation - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Address[10.230.48.190]:5701 is merging to Address[10.230.48.189]:5701, because: instructed by master Address[10.230.48.190]:5701


This is strange as I am able to ping the address from the command line.


The rest interface also reports that both nodes are up and connected

Members [2] {
	Member [10.230.48.190]:5701 this
	Member [10.230.48.189]:5701
}

ConnectionCount: 11
AllConnectionCount: 5


When running two nodes on my local machine, I do not get this error.
This is a problem for the QA cluster.
Also, the test cluster, which runs on the exact same virtual machines with the same setup, does not produce these warnings, which is kind of strange.


The logs also report

2014-05-13 11:05:17,762 WARN  com.hazelcast.spi.impl.BasicInvocation - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@8f40491, partitionId=270, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=110, callTimeout=60000, target=Address[10.230.48.190]:5701}, Reason: com.hazelcast.spi.exception.RetryableHazelcastException: Map is not ready!!!
2014-05-13 11:06:18,959 WARN  com.hazelcast.cluster.TcpIpJoiner - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Address[10.230.48.190]:5701 is merging [tcp/ip] to Address[10.230.48.189]:5701
2014-05-13 11:06:18,959 WARN  com.hazelcast.cluster.PrepareMergeOperation - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Preparing to merge... Waiting for merge instruction...
2014-05-13 11:06:18,959 WARN  com.hazelcast.cluster.MergeClustersOperation - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Address[10.230.48.190]:5701 is merging to Address[10.230.48.189]:5701, because: instructed by master Address[10.230.48.190]:5701
2014-05-13 11:06:19,780 WARN  com.hazelcast.partition.InternalPartitionService - [10.230.48.190]:5701 [qa-cluster] [3.2.1] Owner of partition is being removed! Possible data loss for partition[0]. PartitionReplicaChangeEvent{partitionId=0, replicaIndex=0, oldAddress=Address[10.230.48.190]:5701, newAddress=null}

Sending callables to the cluster generates a lot of "Map is not ready!!!" exceptions and "Owner of partition is being removed! Possible data loss for partition" warnings.

And it finishes with

2014-05-13 11:06:31,863 ERROR com.hazelcast.cluster.ClusterService - [10.230.48.190]:5701 [qa-cluster] [3.2.1] While merging...
java.util.concurrent.ExecutionException: com.hazelcast.core.HazelcastException: java.lang.ClassNotFoundException: LATEST_UPDATE
        at java.util.concurrent.FutureTask.report(Unknown Source)
        at java.util.concurrent.FutureTask.get(Unknown Source)
        at com.hazelcast.cluster.ClusterServiceImpl.waitOnFutureInterruptible(ClusterServiceImpl.java:675)
        at com.hazelcast.cluster.ClusterServiceImpl.access$600(ClusterServiceImpl.java:86)
        at com.hazelcast.cluster.ClusterServiceImpl$6.run(ClusterServiceImpl.java:656)
        at com.hazelcast.instance.LifecycleServiceImpl.runUnderLifecycleLock(LifecycleServiceImpl.java:103)
        at com.hazelcast.cluster.ClusterServiceImpl.merge(ClusterServiceImpl.java:629)
        at com.hazelcast.cluster.MergeClustersOperation.run(MergeClustersOperation.java:54)
        at com.hazelcast.spi.impl.BasicOperationService.processOperation(BasicOperationService.java:363)
        at com.hazelcast.spi.impl.BasicOperationService.runOperation(BasicOperationService.java:228)
        at com.hazelcast.cluster.AbstractJoiner.startClusterMerge(AbstractJoiner.java:256)
        at com.hazelcast.cluster.TcpIpJoiner.searchForOtherClusters(TcpIpJoiner.java:472)
        at com.hazelcast.cluster.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:47)
        at com.hazelcast.cluster.SplitBrainHandler.run(SplitBrainHandler.java:37)
        at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:186)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
        at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)
Caused by: com.hazelcast.core.HazelcastException: java.lang.ClassNotFoundException: LATEST_UPDATE
        at com.hazelcast.util.ExceptionUtil.rethrow(ExceptionUtil.java:45)
        at com.hazelcast.map.MapService.getMergePolicy(MapService.java:268)
        at com.hazelcast.map.MapService$Merger.run(MapService.java:289)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at com.hazelcast.util.executor.CompletableFutureTask.run(CompletableFutureTask.java:57)
        ... 5 more
Caused by: java.lang.ClassNotFoundException: LATEST_UPDATE
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at com.hazelcast.nio.ClassLoaderUtil.loadClass(ClassLoaderUtil.java:113)
        at com.hazelcast.nio.ClassLoaderUtil.newInstance(ClassLoaderUtil.java:63)
        at com.hazelcast.map.MapService.getMergePolicy(MapService.java:264)
        ... 9 more


With icmp disabled, all of these errors are gone.


Noctarius

May 13, 2014, 5:49:17 AM
to haze...@googlegroups.com
Hi Rune,

You have multiple problems:

1. For ICMP, I would guess your TTL is too small, since packets die after 1 hop.
2. The merge-policy needs to be a fully qualified classname: see here http://hazelcast.org/docs/latest/manual/html-single/#how-is-split-brain-syndrome-handled

Chris

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/0ffcc0eb-2373-435c-9020-a46e5219d6a3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rune

May 14, 2014, 5:11:55 AM
to haze...@googlegroups.com, noctar...@googlemail.com
Hi and thanks for your reply

I can try to increase the time to live.
However, running ping from the command line gives

PING grid10001.tine.no (10.230.48.189) 56(84) bytes of data.
64 bytes from grid10001.tine.no (10.230.48.189): icmp_seq=1 ttl=64 time=0.829 ms
64 bytes from grid10001.tine.no (10.230.48.189): icmp_seq=2 ttl=64 time=0.973 ms
64 bytes from grid10001.tine.no (10.230.48.189): icmp_seq=3 ttl=64 time=1.13 ms
64 bytes from grid10001.tine.no (10.230.48.189): icmp_seq=4 ttl=64 time=0.689 ms


So according to this, 64 hops / ms should be enough?
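
On the TTL question: the `ttl=64` in the ping output is the hop budget still left on the reply packet when it arrived. It is a hop count, not a time in milliseconds, and it is unrelated to the `hz.icmp.ttl=1` setting. The sketch below assumes that Hazelcast passes its TTL and timeout values to the JDK's three-argument `InetAddress.isReachable(NetworkInterface, int, int)`; the names `TtlProbe` and `reachableWithin` are made up for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;

final class TtlProbe {

    // Returns true if the host answers within maxHops hops and timeoutMs milliseconds.
    // maxHops is a hop count (IP TTL), not a duration; 0 means "use the default".
    static boolean reachableWithin(String host, int maxHops, int timeoutMs) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            // null = let the JVM pick the network interface.
            return addr.isReachable(null, maxHops, timeoutMs);
        } catch (IOException e) {
            // Name resolution failure or network error: treat as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println("ttl=1:  " + reachableWithin(host, 1, 3000));
        System.out.println("ttl=64: " + reachableWithin(host, 64, 3000));
    }
}
```

If both lines print the same result, the TTL value is probably not what decides the outcome on that network.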


Merge policy was set like this
.setMergePolicy("LATEST_UPDATE")
so I will update to
.setMergePolicy("com.hazelcast.map.merge.LatestUpdateMapMergePolicy")
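
The stack trace earlier in the thread shows `MapService.getMergePolicy` handing the configured string to a class loader, which is why a bare `LATEST_UPDATE` ends in `ClassNotFoundException`. Here is a minimal stand-alone sketch of that kind of check, using only the JDK. `MergePolicyNameCheck` and `isLoadable` are illustrative names, not Hazelcast API.

```java
final class MergePolicyNameCheck {

    // True if the string names a class that the current class loader can find.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, MergePolicyNameCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isLoadable("LATEST_UPDATE"));
        System.out.println(isLoadable("com.hazelcast.map.merge.LatestUpdateMapMergePolicy"));
    }
}
```

The second check can only succeed when the Hazelcast jar is on the classpath.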


Joe Planisky

May 14, 2014, 9:43:56 AM
to haze...@googlegroups.com
Note that on Linux-based hosts, only root can use ICMP. “ping” from the command line works because the ping utility is suid root.

Under the covers, Hazelcast’s ICMP feature uses Java’s InetAddress.isReachable(). InetAddress.isReachable() first tries to use ICMP to contact the other host. If that fails (which it will on Linux unless it’s running as root), it then tries to open a TCP connection to port 7 on the other host. If it is able to connect OR if it gets an explicit “connection refused” error, it assumes the other host is online; otherwise it assumes the host is not.

That means that for Hazelcast’s ICMP feature to work on Linux hosts, you must either run your application as root (don’t do that), or your firewalls and security groups must allow TCP connections to port 7. Some people have suggested that you also need to have an echo service running on port 7, but we haven’t found that to be the case; the explicit “connection refused” error serves to indicate that the machine is alive and reachable.
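
The fallback described above can be approximated in plain Java. This is a rough sketch of the behaviour, not the JDK's actual native implementation. `EchoPortProbe` and `looksAlive` are hypothetical names, and treating every `ConnectException` as "connection refused" is a simplification.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class EchoPortProbe {

    private static final int ECHO_PORT = 7;

    // True if a TCP connect to port 7 succeeds or is actively refused.
    static boolean looksAlive(String host, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, ECHO_PORT), timeoutMs);
            return true;   // something is listening on port 7
        } catch (ConnectException e) {
            return true;   // refused: the host answered, so it is up
        } catch (IOException e) {
            return false;  // timeout, no route, unresolved name, ...
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(host + (looksAlive(host, 5000) ? " looks alive" : " looks dead"));
    }
}
```

A firewall that silently drops packets to port 7 makes the connect time out, so a live host would look dead in this sketch.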

Here’s a simple test you can run on your cluster nodes to see if isReachable() works like you expect.

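import java.net.InetAddress;

public class IsReachableTest {

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java IsReachableTest <host>");
            return;
        }
        // Resolve the host name or IP literal given on the command line.
        InetAddress addr = InetAddress.getByName(args[0]);
        System.out.print(addr.getHostAddress());
        // Tries ICMP first (needs root on Linux), then falls back to TCP port 7.
        if (addr.isReachable(5000)) {
            System.out.println(" is reachable.");
        } else {
            System.out.println(" is NOT reachable.");
        }
    }
}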

Hope this helps.


Joe

Rune

May 15, 2014, 5:50:59 AM
to haze...@googlegroups.com
Thanks

Ran the test class on our cluster; running it with sudo is the only way to make it work, as you wrote.

I am disabling icmp and reverting to the default heartbeat.

This is probably worth a note in the docs.

Noctarius

May 15, 2014, 6:41:49 AM
to haze...@googlegroups.com
This is not specific to Hazelcast; it comes from the underlying implementation on Linux. But yes, we can mention it in the docs.
You can also open port 7 in the firewall and use the echo protocol; the fallback implementation will then work.
