Retry of downed hosts + auto discover question


Mike O'Neil

Jan 17, 2012, 4:04:00 PM
to hector...@googlegroups.com
How exactly does the retry of failed connections work? Am I right in saying that a host will only be retried once after "retryDownedHostsDelayInSeconds"?

Related to that, I am having an issue when I intentionally bring down a Cassandra server for a few minutes. Now that it's out of the ring, the CassandraHostRetryService removes it permanently and never retries it. I guess I should enable "autoDiscoverHosts" so that it is added to the pool again once it's back in the ring?


Nate McCall

Jan 17, 2012, 4:13:01 PM
to hector...@googlegroups.com
No. If you did not remove the host from the ring via the
cassandra-cli, then CHRS (CassandraHostRetryService) should keep trying
to reconnect to that host at the configured interval indefinitely. Is
this not the case? (What version of Hector, btw?)

Yes, autoDiscoverHosts must be enabled for new nodes to be discovered.
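
To make the intended behavior concrete, here is a minimal sketch of "retry at every interval until the host comes back". It is a simplified model and not Hector's actual code: the class RetryQueueSketch and its methods are hypothetical names, hosts are plain strings, and the connection attempt is replaced by a predicate so the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Simplified model of a downed-host retry queue; not Hector's actual code. */
final class RetryQueueSketch {
    private final Set<String> downedHosts = new LinkedHashSet<>();
    private final Predicate<String> canConnect;

    RetryQueueSketch(Predicate<String> canConnect) {
        this.canConnect = canConnect;
    }

    /** Called when a request fails and the host is marked down. */
    void markDown(String host) {
        downedHosts.add(host);
    }

    /** One retry pass; a scheduler would run this once per retry interval. */
    List<String> retryOnce() {
        List<String> restored = new ArrayList<>();
        for (Iterator<String> it = downedHosts.iterator(); it.hasNext();) {
            String host = it.next();
            if (canConnect.test(host)) {
                it.remove();
                restored.add(host);
            }
            // Otherwise the host stays queued and is tried again on the next pass.
        }
        return restored;
    }

    boolean isQueued(String host) {
        return downedHosts.contains(host);
    }

    public static void main(String[] args) {
        Set<String> reachable = new HashSet<>();
        RetryQueueSketch queue = new RetryQueueSketch(reachable::contains);
        queue.markDown("10.90.106.7:9161");
        for (int pass = 1; pass <= 3; pass++) {
            System.out.println("pass " + pass + " restored " + queue.retryOnce());
        }
        reachable.add("10.90.106.7:9161"); // simulate the node being restarted
        System.out.println("after restart restored " + queue.retryOnce());
    }
}
```

In this model a host leaves the queue only when a connection attempt succeeds, which is the behavior described above for a host that is still part of the ring.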

Mike O'Neil

Jan 17, 2012, 4:27:30 PM
to hector...@googlegroups.com
Hi Nate. I am using 1.0-2 with Cass 1.0.6. I see there is 1.0-3 out; I can give that a shot as well.

Right, I did not remove the host from the ring via cassandra-cli. I just killed the process. I see this in the log very quickly after the first failed request after killing Cassandra:

2012-01-17 19:59:30,698 INFO  connection.CassandraHostRetryService: Removing host 10.90.106.8(10.90.106.8):9161 - It does no longer exist in the ring.

So autoDiscoverHosts is not really what I want then, since I do not need to discover new nodes. I just want already configured nodes to be retried indefinitely. After I kill the second host, I do NOT see the same "no longer exist in the ring" message, though. Instead, at that point I get the "client burden" exception indefinitely (even after bringing both hosts back up). Is there anything obvious you think I should look into for troubleshooting?

Thanks,
Mike

Nate McCall

Jan 17, 2012, 4:34:42 PM
to hector...@googlegroups.com
1.0-2 is the latest released version. That sounds like a bug to me,
but I'm not sure I see how yet.

Couple of questions about your setup:
Are you using different port numbers for thrift on different hosts?
What other configuration settings have you modified from the default
and how many nodes are in the ring?

Mike O'Neil

Jan 17, 2012, 4:58:03 PM
to hector...@googlegroups.com
Hi,

There are 2 nodes in the ring (replication factor = 2). Here is the output from nodetool:

Address         DC          Rack        Status State   Load            Owns    Token                                      
                                                                               140637942640091053069842345145244564255    
10.90.106.8     datacenter1 rack1       Up     Normal  78.86 MB        65.21%  81437580563132525083742012705162878500     
10.90.106.7     datacenter1 rack1       Up     Normal  78.93 MB        34.79%  140637942640091053069842345145244564255

I am using 9161 on both servers. However, I should note that there is a 0.6.x installation running on the same servers using 9160 (which is why I needed to change the port). It goes without saying that I don't have the client configured to talk to 9160.

Here is how I have my cluster + keyspace initialized:

CassandraHostConfigurator config = new CassandraHostConfigurator();
config.setHosts(hosts);
config.setMaxActive(maxActive);
config.setClockResolution(ClockResolution.MICROSECONDS);
config.setCassandraThriftSocketTimeout(connectTimeout);
config.setExhaustedPolicy(ExhaustedPolicy.WHEN_EXHAUSTED_FAIL);
this.cluster = HFactory.getOrCreateCluster("myCluster", config);
Keyspace ks = HFactory.createKeyspace(keyspace, this.cluster,
    new AllOneConsistencyLevelPolicy(), FailoverPolicy.ON_FAIL_TRY_ALL_AVAILABLE);

I restarted the client and tried taking A down again (10.90.106.7). After doing so, I checked the ring info from cli just to be sure, and the output was identical to the above. Here is the full log output after I brought A down:

2012-01-17 21:44:00,204 ERROR client.HThriftClient: Could not flush transport (to be expected if the pool is shutting down) in close for client: CassandraClient<10.90.106.7:9161-1>
org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
    at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:98)
    at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:26)
    at me.prettyprint.cassandra.connection.HConnectionManager.closeClient(HConnectionManager.java:308)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:257)
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate.multigetSliceInternal(ThriftColumnFamilyTemplate.java:110)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate.doExecuteMultigetSlice(ThriftColumnFamilyTemplate.java:51)
    at me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.queryColumns(ColumnFamilyTemplate.java:117)
    ...
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
    ... 17 more
2012-01-17 21:44:00,205 ERROR connection.HConnectionManager: MARK HOST AS DOWN TRIGGERED for host 10.90.106.7(10.90.106.7):9161
2012-01-17 21:44:00,205 ERROR connection.HConnectionManager: Pool state on shutdown: <ConcurrentCassandraClientPoolByHost>:{10.90.106.7(10.90.106.7):9161}; IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
2012-01-17 21:44:00,206 INFO  connection.ConcurrentHClientPool: Shutdown triggered on <ConcurrentCassandraClientPoolByHost>:{10.90.106.7(10.90.106.7):9161}
2012-01-17 21:44:00,206 INFO  connection.ConcurrentHClientPool: Shutdown complete on <ConcurrentCassandraClientPoolByHost>:{10.90.106.7(10.90.106.7):9161}
2012-01-17 21:44:00,206 INFO  connection.CassandraHostRetryService: Host detected as down was added to retry queue: 10.90.106.7(10.90.106.7):9161
2012-01-17 21:44:00,207 WARN  connection.HConnectionManager: Could not fullfill request on this host CassandraClient<10.90.106.7:9161-1>
2012-01-17 21:44:00,207 WARN  connection.HConnectionManager: Exception:
me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException
    at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:39)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate$2.execute(ThriftColumnFamilyTemplate.java:120)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate$2.execute(ThriftColumnFamilyTemplate.java:110)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:99)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:243)
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate.multigetSliceInternal(ThriftColumnFamilyTemplate.java:110)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate.doExecuteMultigetSlice(ThriftColumnFamilyTemplate.java:51)
    at me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.queryColumns(ColumnFamilyTemplate.java:117)
    ...
Caused by: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_multiget_slice(Cassandra.java:656)
    at org.apache.cassandra.thrift.Cassandra$Client.multiget_slice(Cassandra.java:638)
    at me.prettyprint.cassandra.service.template.ThriftColumnFamilyTemplate$2.execute(ThriftColumnFamilyTemplate.java:116)
    ... 15 more
2012-01-17 21:44:00,207 INFO  connection.HConnectionManager: Client CassandraClient<10.90.106.7:9161-1> released to inactive or dead pool. Closing.
2012-01-17 21:44:00,207 WARN  connection.CassandraHostRetryService: Downed 10.90.106.7(10.90.106.7):9161 host still appears to be down: Unable to open transport to 10.90.106.7(10.90.106.7):9161 , java.net.ConnectException: Connection refused
2012-01-17 21:44:00,626 INFO  connection.CassandraHostRetryService: Removing host 10.90.106.7(10.90.106.7):9161 - It does no longer exist in the ring.



Then I brought down the B side (10.90.106.8). The log output is similar (though notice the lack of the "no longer exist in the ring" message):

2012-01-17 21:52:37,429 ERROR client.HThriftClient: Could not flush transport (to be expected if the pool is shutting down) in close for client: CassandraClient<10.90.106.8:9161-9>
org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
    at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:98)
    at me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:26)
    at me.prettyprint.cassandra.connection.HConnectionManager.closeClient(HConnectionManager.java:308)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:257)
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
    at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
    at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:115)
    at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:149)
    at me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:69)
    ...
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
    ... 18 more
2012-01-17 21:52:37,430 ERROR connection.HConnectionManager: MARK HOST AS DOWN TRIGGERED for host 10.90.106.8(10.90.106.8):9161
2012-01-17 21:52:37,430 ERROR connection.HConnectionManager: Pool state on shutdown: <ConcurrentCassandraClientPoolByHost>:{10.90.106.8(10.90.106.8):9161}; IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
2012-01-17 21:52:37,430 INFO  connection.ConcurrentHClientPool: Shutdown triggered on <ConcurrentCassandraClientPoolByHost>:{10.90.106.8(10.90.106.8):9161}
2012-01-17 21:52:37,430 INFO  connection.ConcurrentHClientPool: Shutdown complete on <ConcurrentCassandraClientPoolByHost>:{10.90.106.8(10.90.106.8):9161}
2012-01-17 21:52:37,430 INFO  connection.CassandraHostRetryService: Host detected as down was added to retry queue: 10.90.106.8(10.90.106.8):9161
2012-01-17 21:52:37,431 WARN  connection.CassandraHostRetryService: Downed 10.90.106.8(10.90.106.8):9161 host still appears to be down: Unable to open transport to 10.90.106.8(10.90.106.8):9161 , java.net.ConnectException: Connection refused
2012-01-17 21:52:37,431 WARN  connection.HConnectionManager: Could not fullfill request on this host CassandraClient<10.90.106.8:9161-9>
2012-01-17 21:52:37,431 WARN  connection.HConnectionManager: Exception:
me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
    at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:39)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:249)
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
    at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
    at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:115)
    at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:149)
    at me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:69)
    ...
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:157)
    at org.apache.cassandra.thrift.Cassandra$Client.send_batch_mutate(Cassandra.java:1020)
    at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:1008)
    at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
    at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:99)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:243)
    ... 13 more
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
    ... 20 more
2012-01-17 21:52:37,431 INFO  connection.HConnectionManager: Client CassandraClient<10.90.106.8:9161-9> released to inactive or dead pool. Closing.
2012-01-17 21:52:37,432 INFO  connection.HConnectionManager: Client CassandraClient<10.90.106.8:9161-9> released to inactive or dead pool. Closing.


Then the following exception bubbles up to the app:

me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
    at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:354)
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:234)
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
    at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
    at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:115)
    at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:149)
    at me.prettyprint.cassandra.service.template.ColumnFamilyTemplate.update(ColumnFamilyTemplate.java:69)


Hope this helps,
Mike

Nate McCall

Jan 17, 2012, 5:12:28 PM
to hector...@googlegroups.com
Just for completeness - what is the "hosts" string you are providing to CHC?

Mike O'Neil

Jan 17, 2012, 5:16:44 PM
to hector...@googlegroups.com

Nate McCall

Jan 17, 2012, 5:26:03 PM
to hector...@googlegroups.com
Thanks - I'll see if I can reproduce this.

Nate McCall

Jan 17, 2012, 5:38:45 PM
to hector...@googlegroups.com
Mike, can you try this again with CHC#setPort(9161) - I think I see
what the issue is.
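
For anyone reading along, the suggested change could look roughly like the sketch below. It assumes the Hector jar is on the classpath. The host string is an assumption built from the addresses in the logs, since the actual value was not shown in the thread, and the class name ClusterSetupSketch is hypothetical. Only setHosts, setPort, and getOrCreateCluster are used, all of which appear in this thread.

```java
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

/** Sketch of a cluster setup with an explicit Thrift port; names here are illustrative. */
final class ClusterSetupSketch {
    // Assumed host string; the actual value was not shown in the thread.
    static final String HOSTS = "10.90.106.7:9161,10.90.106.8:9161";

    static Cluster connect() {
        CassandraHostConfigurator config = new CassandraHostConfigurator();
        config.setHosts(HOSTS);
        // Suggested change: set the Thrift port explicitly because it differs from 9160.
        config.setPort(9161);
        return HFactory.getOrCreateCluster("myCluster", config);
    }
}
```

The other settings from the original configuration (maxActive, clock resolution, socket timeout, exhausted policy) would stay as they were.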

Mike O'Neil

Jan 18, 2012, 10:37:14 AM
to hector...@googlegroups.com
Nate, that worked perfectly. Thanks. Just curious, is this a "bug" with the library, or the expected thing to do if the port differs from 9160?

Thanks,
Mike

Nate McCall

Jan 18, 2012, 11:27:08 AM
to hector...@googlegroups.com
A bug. The port number did not use to be available from the describe_ring
API call, so we hacked it into CHC and never cleaned it up once the
thrift API changed.

Thibaut Britz

Jan 30, 2012, 4:01:07 AM
to hector...@googlegroups.com
Could you please also fix this in 0.8?

This bug is present there as well.

Thanks,
Thibaut

Nate McCall

Jan 30, 2012, 4:02:49 PM
to hector...@googlegroups.com
I'm misremembering the changes here on the cassandra side. There is
no port information returned from describe_ring, just the host
address.

This would mean I have to loosen the CassandraHost#equals contract or
add another work-around in NodeAutoDiscoverService and
CassandraHostRetryService to deal with just ip address matching.

I'm going to stick with saying that if you need this, use
CHC#setPort when you configure your host lists.
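
To illustrate why the port matters, here is a minimal sketch of the mismatch. It is a simplified model and not Hector's actual code: PortMismatchSketch, Host, hostsFromRing, and stillInRing are hypothetical names. The sketch assumes that host equality compares both address and port, and that hosts built from ring data get the client's configured port, which defaults to 9160.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Simplified model of host matching against ring data; not Hector's actual code. */
final class PortMismatchSketch {
    static final int DEFAULT_THRIFT_PORT = 9160;

    /** A host identified by address and port, so equality is port-sensitive. */
    static final class Host {
        final String address;
        final int port;

        Host(String address, int port) {
            this.address = address;
            this.port = port;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Host)) {
                return false;
            }
            Host that = (Host) other;
            return address.equals(that.address) && port == that.port;
        }

        @Override
        public int hashCode() {
            return Objects.hash(address, port);
        }
    }

    /** The ring reports addresses only, so the client has to supply a port. */
    static Set<Host> hostsFromRing(List<String> ringAddresses, int clientPort) {
        Set<Host> hosts = new HashSet<>();
        for (String address : ringAddresses) {
            hosts.add(new Host(address, clientPort));
        }
        return hosts;
    }

    /** A downed host is kept for retry only while it still matches a ring member. */
    static boolean stillInRing(Host downed, List<String> ringAddresses, int clientPort) {
        return hostsFromRing(ringAddresses, clientPort).contains(downed);
    }

    public static void main(String[] args) {
        List<String> ring = Arrays.asList("10.90.106.7", "10.90.106.8");
        Host downed = new Host("10.90.106.7", 9161);
        // With the default port the downed host looks like it left the ring.
        System.out.println("default port: " + stillInRing(downed, ring, DEFAULT_THRIFT_PORT)); // false
        // With the port set to 9161 the downed host matches a ring member again.
        System.out.println("port 9161:    " + stillInRing(downed, ring, 9161)); // true
    }
}
```

Under these assumptions, a host configured on 9161 never equals the ring-derived host on 9160, which would explain the "no longer exist in the ring" removal seen in the logs. Matching on address alone would avoid that, at the cost of loosening the equality contract.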

Open to other ideas if anyone else has them.

Mike O'Neil

Jan 31, 2012, 10:42:57 AM
to hector...@googlegroups.com
That sounds reasonable to me.

Thanks,
Mike

Víctor Hugo Oliveira Molinar

Feb 27, 2012, 7:36:07 AM
to hector...@googlegroups.com
So if I want to configure many hosts, do they all have to run on the same port, since it's necessary to call CHC#setPort(...)?