Failover on startup

David Boxenhorn

Dec 2, 2010, 10:43:37 AM
to hector...@googlegroups.com
Failover seems to be working fine for me (now that I've increased RF to 3...) when I take down a node while the client is running.

But, if a node is down when I start up I get an exception (not just a logged warning!):

WARN  [me.prettyprint.cassandra.service.CassandraClientPoolImpl] - <Unable to obtain client 192.168.80.14:9160 will try the next client>
me.prettyprint.hector.api.exceptions.HectorException: me.prettyprint.hector.api.exceptions.HectorTransportException: Unable to open transport to 192.168.80.14(192.168.80.14):9160 , java.net.ConnectException: Connection refused: connect
    at me.prettyprint.cassandra.service.CassandraClientFactory.create(CassandraClientFactory.java:74)
    at me.prettyprint.cassandra.service.CassandraClientFactory.makeObject(CassandraClientFactory.java:152)
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148)
    at me.prettyprint.cassandra.service.CassandraClientPoolByHostImpl.borrowClient(CassandraClientPoolByHostImpl.java:93)
    at me.prettyprint.cassandra.service.CassandraClientPoolImpl.borrowClient(CassandraClientPoolImpl.java:80)
    at me.prettyprint.cassandra.service.CassandraClientPoolImpl.borrowClient(CassandraClientPoolImpl.java:216)
    at me.prettyprint.cassandra.service.CassandraClientPoolImpl.borrowClient(CassandraClientPoolImpl.java:225)

Am I doing something wrong??? Do I have to implement my own failover for borrowClient?

Nate McCall

Dec 2, 2010, 10:54:00 AM
to hector...@googlegroups.com
0.6.x of hector still assumes all is well in the cluster on startup.
If the issue on the node is transient, it should go away after the
node comes back. Unfortunately, you will have some ugly log files in
the meantime.

David Boxenhorn

Dec 2, 2010, 11:07:02 AM
to hector...@googlegroups.com
It's worse than ugly log files. Hector doesn't work at all until the node comes back. Why can't failover work on startup the way it works after startup?

Nate McCall

Dec 2, 2010, 11:19:43 AM
to hector...@googlegroups.com
Honestly, I was under the impression it did. Admittedly, I have not
spent much time with the 0.6.x branch after last month's refactoring.
Does anybody else have some insight here?

David, can you do me a favor and create a Github issue for tracking
this? (https://github.com/rantav/hector/issues) It's pretty important
that we are able to start clean on a degraded cluster.

Utku Can Topçu

Dec 2, 2010, 6:54:35 PM
to hector...@googlegroups.com
I've tested and have the same problem on 0.7-20 with the 0.7.0-rc1 release of Cassandra.
I get this error after calling HFactory.getOrCreateCluster in a cluster that has a failing node, as David notes.

The client exits immediately, since the exception is not handled.

10/12/03 00:10:45 ERROR connection.HThriftClient: Unable to open transport to host(ip):9160
org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:85)
    at me.prettyprint.cassandra.connection.ConcurrentHClientPool.<init>(ConcurrentHClientPool.java:44)
    at me.prettyprint.cassandra.connection.HConnectionManager.<init>(HConnectionManager.java:54)
    at me.prettyprint.cassandra.service.AbstractCluster.<init>(AbstractCluster.java:60)
    at me.prettyprint.cassandra.service.AbstractCluster.<init>(AbstractCluster.java:56)
    at me.prettyprint.cassandra.service.ThriftCluster.<init>(ThriftCluster.java:17)
    at me.prettyprint.hector.api.factory.HFactory.createCluster(HFactory.java:107)
    at me.prettyprint.hector.api.factory.HFactory.getOrCreateCluster(HFactory.java:99)
...
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
    at java.net.Socket.connect(Socket.java:529)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 11 more
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at me.prettyprint.cassandra.connection.HThriftClient.open(HThriftClient.java:85)
    ... 9 more

Regards,
Utku

Nate McCall

Dec 2, 2010, 7:18:42 PM
to hector...@googlegroups.com
Ouch - I just checked in a fix for this on trunk. Thanks for bringing it up.

Now if a host is down on startup, it gets dumped into the retry
service immediately. I'll do some more testing on this tonight, but
Utku, if you have a chance to try this out please do and let me know
how it works.
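
For readers following along, the behaviour described above can be pictured with a minimal sketch. This is not Hector's actual code: `StartupHostPartition`, `probe`, `retryDownHosts`, and the `canConnect` predicate are hypothetical names introduced only for illustration. The one assumption is that a host which refuses a connection at startup is parked for a later retry instead of aborting startup.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration only; these are not Hector classes. */
final class StartupHostPartition {
    /** Hosts that accepted a connection. */
    final List<String> live = new ArrayList<>();
    /** Hosts that were down when probed and are waiting for a retry. */
    final List<String> retryQueue = new ArrayList<>();

    /**
     * Probes every configured host once. A host that fails the probe is
     * queued for retry; it does not abort startup.
     */
    static StartupHostPartition probe(List<String> hosts, Predicate<String> canConnect) {
        StartupHostPartition result = new StartupHostPartition();
        for (String host : hosts) {
            if (canConnect.test(host)) {
                result.live.add(host);
            } else {
                result.retryQueue.add(host);
            }
        }
        return result;
    }

    /** Re-probes the queued hosts and moves any that recovered back to live. */
    void retryDownHosts(Predicate<String> canConnect) {
        Iterator<String> it = retryQueue.iterator();
        while (it.hasNext()) {
            String host = it.next();
            if (canConnect.test(host)) {
                it.remove();
                live.add(host);
            }
        }
    }
}
```

In a real client the `canConnect` predicate would attempt to open a transport to the host, and `retryDownHosts` would be called periodically from a background thread. Whether startup should still fail when `live` ends up empty is a separate design decision that this sketch leaves to the caller.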

Utku Can Topçu

Dec 3, 2010, 3:31:19 AM
to hector...@googlegroups.com
Hi Nate,

I tried the patch just now (because of the time difference, it was my bedtime when you wrote the fix).

At first it worked fine. In the attached file, lines 238-254 and 325-328 confirm the writes.
Then it started to fail because of another host, at line 398.

Even though there were 2 more servers to try, it never recovered afterwards and kept printing the same exceptions over and over again.

I hope this helps,
Regards,
Utku
hector-failover-fail.txt

Nate McCall

Dec 3, 2010, 11:06:15 AM
to hector...@googlegroups.com
Can you tell me more about the state of the cluster when you started
up? Also, how many hosts are you providing to the
CassandraHostConfigurator constructor? Does everything work when all
the hosts are up?

Utku Can Topçu

Dec 3, 2010, 3:05:36 PM
to hector...@googlegroups.com
Hi Nate,

I'll try to answer your questions as clearly as possible.

* When I started up, 3 out of 4 nodes in the cluster were up and running.
* The CassandraHostConfigurator constructor takes all four nodes as an argument, so the system is aware of all the nodes.
* As I said, it worked for a while. Then, because of a connectivity failure between the client and the Cassandra node it had an active connection to at the time, the whole system went into an unstable state.

And finally, when all 4 nodes are up and running I don't see any problem at all.

I hope this was helpful.

Regards,
Utku

Nate McCall

Dec 3, 2010, 3:35:33 PM
to hector...@googlegroups.com
Oh, so for your third point - this may have been a function of your
consistency level and replication factor. What do you have for these
values?

Utku Can Topçu

Dec 4, 2010, 4:39:36 AM
to hector...@googlegroups.com
Hi Nate,

The replication factor is 3 on the cluster, and I'm using CL.ONE for both reading and writing.

David Boxenhorn

Dec 5, 2010, 5:20:43 AM
to hector...@googlegroups.com
Hi guys!

Sorry I disappeared, but very glad to see that things have progressed without me! I was away for the weekend, which for me begins Thursday night, 8 hours before central time...

I just looked at things a little more in depth. I have two lines of code that look like this:

         CassandraClient client = clientPool.borrowClient(domainPorts);
         return client.getKeyspace(name, consistency);

It seems that when a node is down on startup, clientPool.borrowClient(domainPorts) fails, but after startup it happily returns a bad url and later on

            List<SuperColumn> superColumnList = keyspace.getSuperSlice(rowKey, columnParent, slicePredicate);

does failover and (magically!) works.

So my question is: why can't things work on startup the way they work later on? clientPool.borrowClient could just return the bad URL, and keyspace.getSuperSlice could do the failover.
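
The idea of deferring failover to the operation itself can be sketched as follows. This is a hedged illustration, not Hector's implementation: `FailoverExecutor` and `execute` are hypothetical names, and the sketch assumes that any `RuntimeException` thrown by the operation means the host is unusable and the next one should be tried.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration only; this is not a Hector class. */
final class FailoverExecutor {
    /**
     * Runs the operation against each host in order and returns the first
     * successful result. If every host fails, throws IllegalStateException
     * with the last failure as its cause.
     */
    static <T> T execute(List<String> hosts, Function<String, T> operation) {
        RuntimeException lastFailure = null;
        for (String host : hosts) {
            try {
                return operation.apply(host);
            } catch (RuntimeException e) {
                // Assumption: any runtime failure counts as a host failure,
                // so remember it and move on to the next host.
                lastFailure = e;
            }
        }
        throw new IllegalStateException("All hosts failed", lastFailure);
    }
}
```

With this shape it does not matter whether the first host was already down at startup or went down later: the caller only sees an error when no configured host can serve the request. A production client would also distinguish connection errors from application errors, which this sketch deliberately does not.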

Tirthankar Barari

Sep 6, 2012, 7:05:19 PM
to hector...@googlegroups.com
I am using the latest Hector client, 1.0-1, and I see the same issue. I have three
nodes, and if the first one is down at startup, the Hector client does not
try connecting to the other nodes.

However, once startup has succeeded, the client fails over to the other nodes
correctly.

Any fix for this yet?


Nate McCall

Sep 7, 2012, 10:58:47 AM
to hector...@googlegroups.com
I believe there was - can you pull down the master and try this with
the snapshot? It should work.

(We'll be doing a release in the very near future).

Tirthankar Barari

Sep 20, 2012, 9:55:50 AM
to hector...@googlegroups.com
It is working now. I called setAutoDiscoveryAtStartup(true).

Thank you!

