Hector state when a Cassandra node goes down

smh

Feb 22, 2011, 8:43:41 PM
to hector-users
Hi,

I am using Hector 0.7.0-26 in my application, persisting data over the
network to a Cassandra cluster of 3 nodes.
Everything works well until I manually bring down one of the Cassandra
nodes, which I need to do as part of negative testing on my application.
After the node goes down, I see Hector exceptions on the client side, and
it appears that Hector keeps trying to connect to the downed host.
Below is the exception trace, which shows up repeatedly in the logs.
My question is: when Hector sees that a node is down, doesn't it close
the connections to that node and stop retrying until it detects that the
node is up again?
What should be done on the client side (while using Hector) to ensure
that Hector cleans up the connections to a dead node and stops trying to
reuse it?

[pool-1-thread-1] ERROR (HThriftClient.java:88) - Unable to open transport to asp.corp.apple.com(17.108.122.70):9162
org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:84)
    at me.prettyprint.cassandra.connection.CassandraHostRetryService$RetryRunner.verifyConnection(CassandraHostRetryService.java:114)
    at me.prettyprint.cassandra.connection.CassandraHostRetryService$RetryRunner.run(CassandraHostRetryService.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 13 more




Patricio Echagüe

Feb 22, 2011, 8:48:34 PM
to hector...@googlegroups.com, smh
Could you please provide your CassandraHostConfigurator values?

Subrahmanya Harve

Feb 22, 2011, 8:58:34 PM
to Patricio Echagüe, hector...@googlegroups.com
Please find the configuration of CassandraHostConfigurator below.

MaxActive=50
MaxIdle=10
MaxWaitTimeWhenExhausted=1000
ThriftSocketTimeout=1000
ExhaustedPolicy=WHEN_EXHAUSTED_FAIL
LoadBalancingPolicy=RoundRobinBalancingPolicy
ConsistencyLevelPolicy=QuorumAllConsistencyLevelPolicy
FailoverPolicy=ON_FAIL_TRY_ONE_NEXT_AVAILABLE
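
(For reference, here is a sketch of how these values map onto code. This is a hypothetical illustration: the setter names are from memory for Hector 0.7.x and may differ slightly between builds, and the host list is a placeholder.)

    // Hypothetical Hector 0.7.x configuration; verify setter names against your build.
    CassandraHostConfigurator conf =
        new CassandraHostConfigurator("host1:9160,host2:9160,host3:9160"); // placeholder hosts
    conf.setMaxActive(50);
    conf.setMaxIdle(10);
    conf.setMaxWaitTimeWhenExhausted(1000);     // ms
    conf.setCassandraThriftSocketTimeout(1000); // ms
    conf.setExhaustedPolicy(ExhaustedPolicy.WHEN_EXHAUSTED_FAIL);
    conf.setLoadBalancingPolicy(new RoundRobinBalancingPolicy());
    // The consistency level policy is typically supplied when creating the Keyspace.
    Cluster cluster = HFactory.getOrCreateCluster("my-cluster", conf);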


Patricio Echagüe

Feb 22, 2011, 9:10:28 PM
to hector...@googlegroups.com
Hey, I'm trying to find the code for that version; I believe there was a fix for that.

Looking at the current code, it seems to be addressed in version .28:

    private boolean verifyConnection(CassandraHost cassandraHost) {
      if ( cassandraHost == null ) {
        return false;
      }
      boolean found = false;
      HThriftClient client = new HThriftClient(cassandraHost);
      try {
        // Open a fresh connection and make a lightweight call to confirm
        // the host is actually serving requests again.
        client.open();
        found = client.getCassandra().describe_cluster_name() != null;
        client.close();
      } catch (HectorTransportException he) {
        log.error("Downed {} host still appears to be down: {}", cassandraHost, he.getMessage());
      } catch (Exception ex) {
        log.error("Downed Host retry failed attempt to verify CassandraHost", ex);
      }
      return found;
    }



Nate McCall

Feb 23, 2011, 11:42:42 AM
to hector...@googlegroups.com, Patricio Echagüe
Actually this is working as anticipated. The following line from the
stack trace:

me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:84)

indicates this is the host retry service (running in a background
thread) attempting to connect to the downed host every 10 seconds (by
default). What was just fixed in master and tip of 0.7.0 was an issue
with incorrect handling of UnavailableException on a consistency level
failure. This will be released at some point today (marked as
0.7.0-28).

Looking at the above trace again, that error message could probably be
clearer about what is going on. I'll clean that up today as well.
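
(As an aside, the probe cadence is configurable on CassandraHostConfigurator. A hedged sketch, assuming the 0.7.x setter names I remember; verify against your build:)

    // Assumed Hector 0.7.x API; names from memory.
    CassandraHostConfigurator conf = new CassandraHostConfigurator("host1:9160"); // placeholder host
    conf.setRetryDownedHosts(true);             // keep probing hosts marked down
    conf.setRetryDownedHostsDelayInSeconds(30); // probe every 30s instead of the 10s default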


smh

Feb 23, 2011, 3:18:46 PM
to hector-users

Thank you.
I will download and use hector-0.7.0-28.jar at my next checkpoint.
Meanwhile, here is a different issue I noticed.
For the same state of the Cassandra nodes, I am seeing different Hector
behavior on separate client machines. On 2 machines, transactions are
going through to Cassandra just fine (except for some timeouts), but on
the 3rd machine, Hector is throwing a lot of the exceptions below.
My question is: how can one Hector client be throwing these exceptions
while the other Hector clients run smoothly, with all 3 pushing load to
the cluster? Note that I am using the same Hector version,
hector-0.7.0-26.jar, in all 3 clients.
Also, I am assuming that "Retry burden pushed out to client." means that
the consumer of the Hector API now owns the responsibility to retry. If
this is true, what would your recommendations be for cleaning up and
retrying the transactions?

2011-02-22 18:59:09,894 [main] ERROR (CassandraService.java:2023) - me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
2011-02-22 18:59:09,895 [main] ERROR (CassandraService.java:2099) - me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
2011-02-22 18:59:09,896 [main] ERROR (CassandraService.java:582) - me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
[... the same HectorException repeated many more times within the same second, from CassandraService.java lines 582, 2023, and 2099 ...]
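
(For the retry question above, a minimal client-side sketch. This is a hypothetical illustration, not an official Hector recipe: 'mutator' stands in for an existing me.prettyprint.hector.api.mutation.Mutator<String>, and the insert call is a placeholder for the real write.)

    // Hypothetical retry wrapper around a Hector write.
    int maxAttempts = 3;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        mutator.insert("rowKey", "ColumnFamily",
            HFactory.createStringColumn("name", "value")); // placeholder write
        break; // success, stop retrying
      } catch (HectorException he) {
        if (attempt == maxAttempts) {
          throw he; // give up after the last attempt
        }
        try {
          Thread.sleep(1000L * attempt); // simple linear backoff
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw he;
        }
      }
    }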



Nate McCall

Feb 23, 2011, 3:28:06 PM
to hector...@googlegroups.com, smh
This means no connections to the cluster could be established. Can you
verify connectivity from the third host to the cluster? Might a firewall
adjustment or similar be required?
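
(A quick way to test reachability from the suspect machine, independent of Hector, using plain JDK sockets; the host and port below are placeholders for your node's Thrift address:)

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ProbeThriftPort {
      public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        try {
          // 2-second connect timeout; a ConnectException here means the
          // port is unreachable from this machine.
          socket.connect(new InetSocketAddress("asp.corp.apple.com", 9162), 2000);
          System.out.println("Thrift port reachable");
        } finally {
          socket.close();
        }
      }
    }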

Subrahmanya Harve

Feb 23, 2011, 3:44:39 PM
to Nate McCall, hector...@googlegroups.com

Connectivity is fine, since I also see successful transactions from the
3rd client within the same minute. However, I should also mention that I
am using several threads to push the load, so it could be that a couple
of threads hit the "All host pools marked down" exceptions while the
others worked fine. Even if that is the case, it still does not explain
how one or two threads behave differently from the rest. Could it be an
issue with Hector recovering connections once it detects that the host
is no longer down?
All 3 machines belong to the same cloud, so no particular firewall
adjustment should be required.

Nate McCall

Feb 23, 2011, 3:57:52 PM
to Subrahmanya Harve, hector...@googlegroups.com
I wonder if the 3rd client has just gotten unlucky and marked the hosts
as down due to failed connection attempts. Since this is cloud-based,
the 3rd client could be on a more distant network segment, for example.

To verify this, are you seeing the message pool dropping messages due to
load in the Cassandra logs, or does 'nodetool tpstats' on any of the
nodes show large backlogs of pending messages?

One thing I did notice is that the Thrift socket timeout is set pretty
low, which could cause issues on some cloud providers under load. Turn
it back up to 10000 (10 seconds) on the third client node and see if
that helps.
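
(A sketch, again assuming the 0.7.x setter name I remember; verify against your build:)

    // Assumed Hector 0.7.x setter.
    CassandraHostConfigurator conf = new CassandraHostConfigurator("host1:9160"); // placeholder host
    conf.setCassandraThriftSocketTimeout(10000); // 10s instead of the current 1000ms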
