Hello:
I am trying to test the high-availability option in KairosDB. Here is my setup:
KairosDB:
  kairosdb-1.1.1-1.rpm
  kairosdb-1.1.1-1.jar
  hector-core-1.1-4.jar
Cassandra:
  datastax-ddc-3.3.0-1.noarch.rpm
  datastax-ddc-tools-3.3.0-1.noarch.rpm
  Cassandra version: 3.3.0
  Thrift API version: 20.1.0
  CQL supported versions: 3.4.0 (default: 3.4.0)
Setup:
Node1 : CASS1 : 10.48.108.9
.... Runs Cassandra
.... Runs KairosDB
Node2 : CASS2 : 10.48.108.19
.... Runs Cassandra
Cassandra is running as a two-node cluster ring:
[root@cass1 ~]# nodetool status kairosdb
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.48.108.19 3.56 MB 256 100.0% 0c375459-5ec3-47c7-ac6d-d017ddebe2a2 rack1
UN 10.48.108.9 3.39 MB 256 100.0% 0b72f5f3-13c3-4aa5-95e1-048bbfa081b4 rack1
My KairosDB essentially has the following properties:
kairosdb.service.datastore=org.kairosdb.datastore.cassandra.CassandraModule
kairosdb.datastore.cassandra.host_list=10.48.108.19:9160,10.48.108.9:9160
kairosdb.datastore.cassandra.keyspace=kairosdb
kairosdb.datastore.cassandra.replication_factor=2
kairosdb.datastore.cassandra.read_consistency_level=ONE
kairosdb.datastore.cassandra.write_consistency_level=ONE
# Hector Configuration
kairosdb.datastore.cassandra.hector.retryDownedHosts=true
kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds=30
kairosdb.datastore.cassandra.hector.retryDownedHostsQueueSize=-1
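One thing that may be worth checking: the startup log further down reports a retry delay of 600s, while the file above sets 30. A minimal sketch for confirming what a given properties file actually contains, assuming plain key=value lines with no backslash continuations (the parse_properties helper and the SAMPLE text are illustrative, not part of KairosDB):

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments.

    Assumption: no backslash line continuations or escapes are used.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Illustrative sample; paste the contents of the real file here instead.
SAMPLE = """\
kairosdb.datastore.cassandra.host_list=10.48.108.19:9160,10.48.108.9:9160
# Hector Configuration
kairosdb.datastore.cassandra.hector.retryDownedHosts=true
kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds=30
"""

PREFIX = "kairosdb.datastore.cassandra."
props = parse_properties(SAMPLE)

# Split the comma-separated host list into individual host:port entries.
hosts = [h.strip() for h in props[PREFIX + "host_list"].split(",") if h.strip()]
delay = int(props[PREFIX + "hector.retryDownedHostsDelayInSeconds"])

print("hosts:", hosts)
print("retry delay (s):", delay)
```

If the values printed for the real file differ from what the KairosDB startup log reports, the running instance is probably reading a different file or not applying these keys.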
Here is the test I'm conducting:
1) All Nodes are up and operational.
2) Kill Cassandra on Node-1 - "kill -s HUP <cassandra_pid>".
3) Life is Good... I'm able to fetch data via the KairosDB REST API fine.
4) I bring Node-1 back and wait for over 30 minutes. I also confirm that "nodetool status kairosdb" shows both nodes in the ring as "Up-Normal".
5) After 30-mins or so, I kill Cassandra on Node-2.
6) KairosDB is no longer able to present any data. "HectorException: All host pools marked down. Retry burden pushed out to client."
The expected result is that during the thirty minutes of uptime on Node-1, Hector would have retried and added Node-1 back into its host list.
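The REST check in steps 3 and 6 can be scripted so that it is repeatable. This is only a sketch: the /api/v1/datapoints/query path is the one shown in the logs below, while the base URL (port 8080), the metric name, and the build_query and probe helpers are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request


def build_query(metric_name, minutes=5):
    """Build a KairosDB query body for the last `minutes` of one metric."""
    return {
        "start_relative": {"value": minutes, "unit": "minutes"},
        "metrics": [{"name": metric_name}],
    }


def probe(base_url, metric_name, timeout=5.0):
    """POST the query and return (http_status_or_None, response_text)."""
    body = json.dumps(build_query(metric_name)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/api/v1/datapoints/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # KairosDB answered, but with an error status (e.g. datastore failure).
        return err.code, err.read().decode("utf-8", "replace")
    except urllib.error.URLError as err:
        # KairosDB itself was unreachable.
        return None, str(err.reason)


# Example usage (hypothetical host and metric name):
# status, text = probe("http://10.48.108.9:8080", "my.test.metric")
# print(status, text[:200])
```

Running the probe after each step of the test makes it easy to see exactly when queries start failing relative to the Hector retry messages in the log.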
Here are the relevant logs:
#=====
04-29|17:44:15.979 [main] INFO [CassandraHostRetryService.java:48] - Downed Host Retry service started with queue size -1 and retry delay 600s
...
04-29|17:50:34.086 [qtp1082411691-29 - /api/v1/datapoints/query] INFO [CassandraHostRetryService.java:68] - Host detected as down was added to retry queue: 10.48.108.9(10.48.108.9):9160
...
04-29|17:50:34.089 [qtp1082411691-29 - /api/v1/datapoints/query] INFO [HConnectionManager.java:404] - Client CassandraClient<10.48.108.9:9160-19> released to inactive or dead pool. Closing.
04-29|17:50:34.089 [qtp1082411691-29 - /api/v1/datapoints/query] DEBUG [HThriftClient.java:126] - Closing client CassandraClient<10.48.108.9:9160-19>
04-29|17:50:34.091 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:152] - Creating a new thrift connection to 10.48.108.9(10.48.108.9):9160
04-29|17:50:34.092 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:183] - Unable to open transport to 10.48.108.9(10.48.108.9):9160
04-29|17:50:34.092 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] WARN [CassandraHostRetryService.java:217] - Downed 10.48.108.9(10.48.108.9):9160 host still appears to be down: Unable to open transport to 10.48.108.9(10.48.108.9):9160 , java.net.ConnectException: Connection refused
...
04-29|17:54:15.983 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [CassandraHostRetryService.java:116] - Retry service fired, checking 1 downed hosts.
...
04-29|17:54:16.330 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] INFO [CassandraHostRetryService.java:157] - Removing host 10.48.108.9(10.48.108.9):9160 - It does no longer exist in the ring.
...
04-29|17:59:44.322 [qtp1082411691-33 - /api/v1/datapoints/query] INFO [CassandraHostRetryService.java:68] - Host detected as down was added to retry queue: 10.48.108.19(10.48.108.19):9160
04-29|17:59:44.322 [qtp1082411691-33 - /api/v1/datapoints/query] WARN [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.48.108.19:9160-7>
...
04-29|17:59:44.330 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:152] - Creating a new thrift connection to 10.48.108.19(10.48.108.19):9160
04-29|17:59:44.336 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [HThriftClient.java:183] - Unable to open transport to 10.48.108.19(10.48.108.19):9160
04-29|17:59:44.337 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] WARN [CassandraHostRetryService.java:217] - Downed 10.48.108.19(10.48.108.19):9160 host still appears to be down: Unable to open transport to 10.48.108.19(10.48.108.19):9160 , java.net.ConnectException: Connection refused
...
04-29|17:59:44.364 [qtp1082411691-30 - /api/v1/datapoints/query] INFO [HConnectionManager.java:404] - Client CassandraClient<10.48.108.19:9160-12> released to inactive or dead pool. Closing.
04-29|17:59:44.364 [qtp1082411691-30 - /api/v1/datapoints/query] DEBUG [HThriftClient.java:126] - Closing client CassandraClient<10.48.108.19:9160-12>
04-29|17:59:44.365 [qtp1082411691-30 - /api/v1/datapoints/query] ERROR [MetricsResource.java:480] - Query failed.
org.kairosdb.core.exception.DatastoreException: me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
...
#=====
Questions:
1. Why did Hector not keep retrying the downed host? I was emulating a node reboot by killing the Cassandra service and starting it back up after about one minute. Is this expected behavior?
2. Does anyone have a working KairosDB with a two-node Cassandra? Can you share your Hector properties?
If anyone has any idea on how I can resolve this issue, let me know.
Thanks
Venkatt