Hector keeps retrying connections to dead nodes


Filippo Diotalevi

Jun 6, 2013, 9:10:29 AM
to hector...@googlegroups.com
Hi,
I'm having some problems with Hector, which seems to keep trying to connect to nodes that are marked as dead (see the logs below my signature).

Specifically, one node is marked as down and then, immediately afterwards, "discovered" as a new node of the cluster. Connections also seem to be attempted against that node anyway, causing many timeouts in the logs.

Cassandra version: 1.2
Hector version: 1.0-5

Hector initialisation:

        // Connection settings shared by every host in the cluster
        CassandraHostConfigurator chc = new CassandraHostConfigurator(hosts);
        chc.setClockResolution(ClockResolution.MICROSECONDS_SYNC);
        chc.setAutoDiscoverHosts(true);              // poll the ring for unknown nodes
        chc.setRetryDownedHosts(true);               // keep probing hosts marked as down
        chc.setRetryDownedHostsQueueSize(8);
        chc.setRetryDownedHostsDelayInSeconds(120);
        chc.setMaxActive(20);
        chc.setCassandraThriftSocketTimeout(2000);   // milliseconds

        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", chc);

        // Read at ONE, write at QUORUM
        ConfigurableConsistencyLevel cp = new ConfigurableConsistencyLevel();
        cp.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
        cp.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

        // Retry on numRetries hosts, sleeping 100 ms between attempts
        FailoverPolicy failoverPolicy = new FailoverPolicy(numRetries, 100);
        Keyspace keyspace = HFactory.createKeyspace(keyspaceName, cluster, cp, failoverPolicy);
        return keyspace;


Can anyone shed some light? 

Thanks,
-- 
Filippo 

--- Logs follow ------

10:03:59,156  WARN CassandraHostRetryService:213 - Downed cassandra-04.stag.vvvvvvv.com(xxx.xxx.20.64):9160 host still appears to be down: Unable to open transport to cassandra-04.stag.vvvvvvv.com(xxx.xxx.20.64):9160 , java.net.SocketTimeoutException: connect timed out
10:03:59,156  INFO CassandraHostRetryService:159 - Downed Host retry status false with host: cassandra-04.stag.vvvvvvv.com(xxx.xxx.20.64):9160
10:04:05,827  INFO NodeAutoDiscoverService:108 - Found a node we don't know about xxx.xxx.20.64(xxx.xxx.20.64):9160 for TokenRange TokenRange(start_token:151646312376237217242810867221430285223, end_token:24040424780885293444045389434517205927, endpoints:[xxx.xxx.20.61, xxx.xxx.20.64, xxx.xxx.20.62], rpc_endpoints:[0.0.0.0, 0.0.0.0, 0.0.0.0], endpoint_details:[EndpointDetails(host:xxx.xxx.20.61, datacenter:datacenter1, rack:rack1), EndpointDetails(host:xxx.xxx.20.64, datacenter:datacenter1, rack:rack1), EndpointDetails(host:xxx.xxx.20.62, datacenter:datacenter1, rack:rack1)])
10:04:05,828  INFO NodeAutoDiscoverService:108 - Found a node we don't know about xxx.xxx.20.64(xxx.xxx.20.64):9160 for TokenRange TokenRange(start_token:109111016511119909309889041292459258791, end_token:151646312376237217242810867221430285223, endpoints:[xxx.xxx.20.63, xxx.xxx.20.61, xxx.xxx.20.64], rpc_endpoints:[0.0.0.0, 0.0.0.0, 0.0.0.0], endpoint_details:[EndpointDetails(host:xxx.xxx.20.63, datacenter:datacenter1, rack:rack1), EndpointDetails(host:xxx.xxx.20.61, datacenter:datacenter1, rack:rack1), EndpointDetails(host:xxx.xxx.20.64, datacenter:datacenter1, rack:rack1)])
10:04:05,828  INFO NodeAutoDiscoverService:108 - Found a node we don't know about xxx.xxx.20.64(xxx.xxx.20.64):9160 for TokenRange TokenRange(start_token:24040424780885293444045389434517205927, end_token:66575720646002601376967215363488232359, endpoints:[xxx.xxx.20.64, xxx.xxx.20.62, xxx.xxx.20.63], rpc_endpoints:[0.0.0.0, 0.0.0.0, 0.0.0.0], endpoint_details:[EndpointDetails(host:xxx.xxx.20.64, datacenter:datacenter1, rack:rack1), EndpointDetails(host:xxx.xxx.20.62, datacenter:datacenter1, rack:rack1), EndpointDetails(host:xxx.xxx.20.63, datacenter:datacenter1, rack:rack1)])
10:04:05,828  INFO NodeAutoDiscoverService:70 - Found 1 new host(s) in Ring
10:04:05,829  INFO NodeAutoDiscoverService:72 - Addding found host xxx.xxx.20.64(xxx.xxx.20.64):9160 to pool
10:04:07,830 ERROR HConnectionManager:119 - Transport exception host to HConnectionManager: xxx.xxx.20.64(xxx.xxx.20.64):9160
me.prettyprint.hector.api.exceptions.HectorTransportException: Unable to open transport to xxx.xxx.20.64(xxx.xxx.20.64):9160 , java.net.SocketTimeoutException: connect timed out
    at me.prettyprint.cassandra.connection.client.HThriftClient.open(HThriftClient.java:144)
    at me.prettyprint.cassandra.connection.client.HThriftClient.open(HThriftClient.java:26)
    at me.prettyprint.cassandra.connection.ConcurrentHClientPool.createClient(ConcurrentHClientPool.java:147)
    at me.prettyprint.cassandra.connection.ConcurrentHClientPool.<init>(ConcurrentHClientPool.java:53)
    at me.prettyprint.cassandra.connection.RoundRobinBalancingPolicy.createConnection(RoundRobinBalancingPolicy.java:67)
    at me.prettyprint.cassandra.connection.HConnectionManager.addCassandraHost(HConnectionManager.java:112)
    at me.prettyprint.cassandra.connection.NodeAutoDiscoverService.doAddNodes(NodeAutoDiscoverService.java:74)
    at me.prettyprint.cassandra.connection.NodeAutoDiscoverService$QueryRing.run(NodeAutoDiscoverService.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:680)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: connect timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:183)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at me.prettyprint.cassandra.connection.client.HThriftClient.open(HThriftClient.java:138)
    ... 16 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
    at java.net.Socket.connect(Socket.java:529)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:178)
    ... 18 more

Patricio Echagüe

Jun 6, 2013, 11:36:45 AM
to hector-users

That is expected. Is the node decommissioned?

If not, it will still be discoverable.


Rafael Neves

Jun 6, 2013, 1:16:28 PM
to hector...@googlegroups.com
Use Firebrand, man. It's the best framework for hector-client.



--
Regards,
Rafael Neves
Graduating in Systems Analysis and Development
Pursuing a postgraduate degree in Software Engineering
Junior Analyst!!!

Contact: Ranev...@gmail.com

Filippo Diotalevi

Jun 6, 2013, 5:14:38 PM
to hector...@googlegroups.com

On Thursday, June 6, 2013 4:36:45 PM UTC+1, Patricio Echague wrote:

That is expected. Is the node decommissioned?

If not, it will still be discoverable.

Thanks.
However, this is a temporary failure. I was under the impression that nodetool decommission was only for permanently removing a node.

Is it also the best practice for temporarily removing a node from the ring?

--
Filippo 

Patricio Echagüe

Jun 6, 2013, 7:15:38 PM
to hector-users
Not at all. What I was saying is that Hector prints the exception (and that is OK) because it's trying to connect to a node that is in the ring but happens to be down.

If you decommission the node, it will just leave the ring forever.
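
To make the difference concrete, here is a toy model of the behaviour seen in the logs. It is a sketch only and is not Hector's code: the class RingDiscoveryModel, its methods, and the host names are all made up for illustration. It assumes that a discovery pass treats any ring member missing from the client's pool as a new host.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only, NOT Hector's implementation. All names here are hypothetical.
final class RingDiscoveryModel {
    // Endpoints the cluster reports as ring members.
    private final Set<String> ring = new LinkedHashSet<>();
    // Hosts the client currently keeps connections to.
    private final Set<String> pool = new LinkedHashSet<>();

    RingDiscoveryModel(Collection<String> endpoints) {
        ring.addAll(endpoints);
        pool.addAll(endpoints);
    }

    // The client drops a host after a failed connection; the ring is unchanged.
    void markDown(String host) {
        pool.remove(host);
    }

    // The node leaves the ring for good, so nothing can rediscover it.
    void decommission(String host) {
        ring.remove(host);
        pool.remove(host);
    }

    // One discovery pass: ring members missing from the pool are added back.
    Set<String> discover() {
        Set<String> found = new LinkedHashSet<>(ring);
        found.removeAll(pool);
        pool.addAll(found);
        return found;
    }

    public static void main(String[] args) {
        RingDiscoveryModel model =
                new RingDiscoveryModel(Arrays.asList("node-1", "node-2", "node-3", "node-4"));

        model.markDown("node-4");
        System.out.println(model.discover()); // prints [node-4]

        model.decommission("node-4");
        System.out.println(model.discover()); // prints []
    }
}
```

In this model a node that is merely down keeps coming back on every discovery pass, while a decommissioned node does not, which matches the sequence of "still appears to be down" followed by "Found a node we don't know about" in the logs.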

If what you want is to silence the auto-discovery service, you can set a different logger threshold in log4j so that messages from that service are not shown.
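
As a rough illustration, a log4j.properties fragment along these lines should do it. It assumes log4j 1.x and that Hector names its loggers after the fully qualified class names that appear in the stack trace in the original post. The levels chosen are only a suggestion.

```
# Assumption: log4j 1.x, logger names equal to the class names in the stack trace.
# Hide the INFO "Found a node we don't know about" messages:
log4j.logger.me.prettyprint.cassandra.connection.NodeAutoDiscoverService=WARN

# The ERROR with the stack trace is logged by HConnectionManager, not by the
# discovery service. Uncomment only if you accept losing its other errors too:
# log4j.logger.me.prettyprint.cassandra.connection.HConnectionManager=FATAL
```

Raising the threshold only hides the messages. Hector will still try to connect to the node for as long as it is part of the ring.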

