KairosDB - Hector retryDownedHosts not working as expected in a two-node Cassandra cluster


Venkatt Guhesan

Apr 29, 2016, 6:44:55 PM
to KairosDB
Hello:

I am trying to test the high-availability option in KairosDB. Here is my setup:

KairosDB
# kairosdb-1.1.1-1.rpm
kairosdb-1.1.1-1.jar
hector-core-1.1-4.jar

Cassandra
# datastax-ddc-3.3.0-1.noarch.rpm
# datastax-ddc-tools-3.3.0-1.noarch.rpm
Cassandra version: 3.3.0
Thrift API version: 20.1.0
CQL supported versions: 3.4.0 (default: 3.4.0)

Setup
Node1: CASS1 : 10.48.108.9
.... Runs Cassandra
.... Runs KairosDB

Node2 : CASS2 : 10.48.108.19
.... Runs Cassandra

Cassandra is running as a two-node ring:

[root@cass1 ~]# nodetool status kairosdb
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.48.108.19  3.56 MB    256          100.0%            0c375459-5ec3-47c7-ac6d-d017ddebe2a2  rack1
UN  10.48.108.9   3.39 MB    256          100.0%            0b72f5f3-13c3-4aa5-95e1-048bbfa081b4  rack1

My KairosDB essentially has the following properties:

kairosdb.service.datastore=org.kairosdb.datastore.cassandra.CassandraModule
kairosdb.datastore.cassandra.host_list=10.48.108.19:9160,10.48.108.9:9160
kairosdb.datastore.cassandra.keyspace=kairosdb
kairosdb.datastore.cassandra.replication_factor=2
kairosdb.datastore.cassandra.read_consistency_level=ONE
kairosdb.datastore.cassandra.write_consistency_level=ONE
# Hector Configuration
kairosdb.datastore.cassandra.hector.retryDownedHosts=true
kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds=30
kairosdb.datastore.cassandra.hector.retryDownedHostsQueueSize=-1

Here is the test I'm conducting:
1) All nodes are up and operational.
2) Kill Cassandra on Node-1: "kill -s HUP <cassandra_pid>".
3) Life is good... I'm able to fetch data via the KairosDB REST API fine.
4) I bring Node-1 back and wait for over 30 minutes, and also ensure that "nodetool status kairosdb" shows both nodes in the ring as "Up-Normal".
5) After 30 minutes or so, I kill Cassandra on Node-2.
6) KairosDB is no longer able to return any data: "HectorException: All host pools marked down. Retry burden pushed out to client."
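
For step 4, "nodetool status" reports ring membership as the nodes see each other, whereas Hector connects over the Thrift port (9160 in the host_list above). A minimal sketch for checking that port directly from the KairosDB host, assuming Python 3 is available there; the function name thrift_port_open is made up for illustration and is not part of KairosDB or Hector:

```python
import socket


def thrift_port_open(host: str, port: int = 9160, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    This only shows that something is listening on the port. It does not
    speak Thrift, so it cannot prove that Cassandra is serving requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Calling thrift_port_open("10.48.108.9") after restarting Node-1 and before killing Node-2 would show whether the port Hector needs is accepting connections again.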

The expected result is that during the thirty minutes Node-1 was back up, Hector would have retried it and added it back into the host list.

Here are the relevant logs:
#=====
04-29|17:44:15.979 [main] INFO  [CassandraHostRetryService.java:48] - Downed Host Retry service started with queue size -1 and retry delay 600s
...
04-29|17:50:34.086 [qtp1082411691-29 - /api/v1/datapoints/query] INFO  [CassandraHostRetryService.java:68] - Host detected as down was added to retry queue: 10.48.108.9(10.48.108.9):9160
...
04-29|17:50:34.089 [qtp1082411691-29 - /api/v1/datapoints/query] INFO  [HConnectionManager.java:404] - Client CassandraClient<10.48.108.9:9160-19> released to inactive or dead pool. Closing.
04-29|17:50:34.089 [qtp1082411691-29 - /api/v1/datapoints/query] DEBUG [HThriftClient.java:126] - Closing client CassandraClient<10.48.108.9:9160-19>
04-29|17:50:34.091 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:152] - Creating a new thrift connection to 10.48.108.9(10.48.108.9):9160
04-29|17:50:34.092 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:183] - Unable to open transport to 10.48.108.9(10.48.108.9):9160
04-29|17:50:34.092 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] WARN  [CassandraHostRetryService.java:217] - Downed 10.48.108.9(10.48.108.9):9160 host still appears to be down: Unable to open transport to 10.48.108.9(10.48.108.9):9160 , java.net.ConnectException: Connection refused
...
04-29|17:54:15.983 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [CassandraHostRetryService.java:116] - Retry service fired, checking 1 downed hosts.
...
04-29|17:54:16.330 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] INFO  [CassandraHostRetryService.java:157] - Removing host 10.48.108.9(10.48.108.9):9160 - It does no longer exist in the ring.
...
04-29|17:59:44.322 [qtp1082411691-33 - /api/v1/datapoints/query] INFO  [CassandraHostRetryService.java:68] - Host detected as down was added to retry queue: 10.48.108.19(10.48.108.19):9160
04-29|17:59:44.322 [qtp1082411691-33 - /api/v1/datapoints/query] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.48.108.19:9160-7>
...
04-29|17:59:44.330 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:152] - Creating a new thrift connection to 10.48.108.19(10.48.108.19):9160
04-29|17:59:44.336 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:183] - Unable to open transport to 10.48.108.19(10.48.108.19):9160
04-29|17:59:44.337 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] WARN  [CassandraHostRetryService.java:217] - Downed 10.48.108.19(10.48.108.19):9160 host still appears to be down: Unable to open transport to 10.48.108.19(10.48.108.19):9160 , java.net.ConnectException: Connection refused
...
04-29|17:59:44.364 [qtp1082411691-30 - /api/v1/datapoints/query] INFO  [HConnectionManager.java:404] - Client CassandraClient<10.48.108.19:9160-12> released to inactive or dead pool. Closing.
04-29|17:59:44.364 [qtp1082411691-30 - /api/v1/datapoints/query] DEBUG [HThriftClient.java:126] - Closing client CassandraClient<10.48.108.19:9160-12>
04-29|17:59:44.365 [qtp1082411691-30 - /api/v1/datapoints/query] ERROR [MetricsResource.java:480] - Query failed.
org.kairosdb.core.exception.DatastoreException: me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
...
#=====
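
One detail in the log above: at startup the retry service reports "retry delay 600s", while the properties set retryDownedHostsDelayInSeconds=30. That may mean the value is not being picked up (for example, a different properties file being loaded), but that is a guess, not a confirmed cause. A small sketch, assuming Python 3, that compares the configured value with the one logged; the helper names are made up for illustration and the sample text is copied from this post:

```python
import re

DELAY_KEY = "kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds"
STARTUP_RE = re.compile(
    r"Downed Host Retry service started with queue size (-?\d+) "
    r"and retry delay (\d+)s"
)


def configured_delay(properties_text):
    """Return the retry delay set in the properties text, or None if absent."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == DELAY_KEY:
            return int(value.strip())
    return None


def logged_delay(log_text):
    """Return the retry delay reported by the startup log line, or None."""
    match = STARTUP_RE.search(log_text)
    return int(match.group(2)) if match else None


properties = """
kairosdb.datastore.cassandra.hector.retryDownedHosts=true
kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds=30
kairosdb.datastore.cassandra.hector.retryDownedHostsQueueSize=-1
"""

log = (
    "04-29|17:44:15.979 [main] INFO  [CassandraHostRetryService.java:48] - "
    "Downed Host Retry service started with queue size -1 and retry delay 600s"
)

print(f"configured={configured_delay(properties)}s, logged={logged_delay(log)}s")
# prints: configured=30s, logged=600s
```

If the two numbers differ on a real installation, it is worth confirming which kairosdb.properties file the running service actually reads.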


Question:
1. Why did Hector not keep retrying the downed host? I was emulating a node reboot by killing the Cassandra service and starting it back up after about one minute. Is this expected behavior?

2. Does anyone have a working KairosDB setup with a two-node Cassandra cluster? Can you share your Hector properties?

If anyone has any idea on how I can resolve this issue, let me know.

Thanks

Venkatt


Brian Hawkins

May 26, 2016, 12:05:17 AM
to KairosDB
I'm surprised by your results as well. There are a number of Hector configuration options that we have exposed at the bottom of the kairosdb.properties file. Some of those settings may help.

The issue is that Hector is kinda dead and we need to move off of it. I'm in the process of switching to CQL, which has better support for this kind of thing.

Brian