Frequent timeout exception


Noorul Islam K M

Feb 18, 2016, 1:47:34 PM
to KairosDB

I am seeing regular timeout exceptions. When I look at the Cassandra
cluster, the load is not significant, so I am not sure why it is
throwing timeout exceptions. Any help is appreciated!



02-18|12:00:27.888 [pool-5-thread-3] WARN [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.12.1.22:9160-219>
02-18|12:00:27.888 [pool-5-thread-3] WARN [HConnectionManager.java:303] - Exception:
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]
at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) [kairosdb-1.1.1-2.jar:1.1.1-2.20160216153340]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_75]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]
... 8 common frames omitted
02-18|12:00:27.889 [pool-5-thread-3] INFO [HThriftClientFactoryImpl.java:31] - SSL enabled for client<->server communications.
02-18|12:00:28.491 [pool-2-thread-2] WARN [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.12.1.135:9160-162>
02-18|12:00:28.491 [pool-2-thread-2] WARN [HConnectionManager.java:303] - Exception:
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]
at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) [kairosdb-1.1.1-2.jar:1.1.1-2.20160216153340]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_75]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

Brian Hawkins

Feb 25, 2016, 12:03:04 AM
to KairosDB
I was thinking this was during a read, but it looks like it is during a write.  What is the size of the buffers you are sending to C*?  Have a look at the metric kairosdb.datastore.write_size and see what it was at about the time of the error.  You may be trying to send too much through a single Kairos node and need to add another for load balancing.

How many Kairos nodes and C* nodes do you have?  How many metrics are you sending per second?
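Brian's kairosdb.datastore.write_size metric can be queried through KairosDB's own REST query endpoint. The sketch below builds such a query; the port, the relative-time window, and the `buffer` tag values are assumptions based on the config comments quoted later in this thread.

```python
import json

def build_write_size_query(hours=1, buffer_tag="data_points"):
    """Build a KairosDB query for the internal write_size metric,
    filtered on the 'buffer' tag (e.g. data_points, row_key_index)."""
    return {
        "start_relative": {"value": hours, "unit": "hours"},
        "metrics": [{
            "name": "kairosdb.datastore.write_size",
            "tags": {"buffer": [buffer_tag]},
        }],
    }

payload = json.dumps(build_write_size_query())
print(payload)
# POST this body to http://<kairos-host>:8080/api/v1/datapoints/query
# with Content-Type: application/json to see buffer sizes over time.
```

Graphing the result around the time of a timeout shows whether the write buffer spiked just before the error.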

Brian

Varun Chandra

Mar 4, 2016, 12:55:56 PM
to KairosDB

Hi Brian,

We see write timeouts when writing more than 500 metrics (all with different tag combinations) in a single HTTP POST, and if I divide the write into chunks, it takes a long time to write everything.

Our writes are spread over an hour:

We write some 10 million datapoints every 15 minutes and some 1 million at the start of each hour. The tag cardinality for a metric is not too high; in the worst case it may reach 10k. We have a lot of different metrics for each client, which adds up to the 10 million writes every 15 minutes.
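A minimal sketch of the chunking approach mentioned above: split the metric list into batches of at most 500 before POSTing each batch separately. The metric structure and batch size here are illustrative, not the poster's actual payload.

```python
def chunked(items, size=500):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Illustrative payload: 1200 metrics with different tag combinations.
metrics = [{"name": "client.metric", "tags": {"combo": str(i)},
            "datapoints": [[1455800000000, i]]} for i in range(1200)]

batches = list(chunked(metrics))
print(len(batches))  # each batch would be one HTTP POST to /api/v1/datapoints
```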


We are planning to scale to 100x the current volume, but are not sure if Kairos is the right fit.

We have 3 Cassandra nodes with replication factor 3 and 2 Kairos nodes behind an ELB.
Following is the configuration we are using.

#Cassandra properties

#host list is in the form> 1.1.1.1:9160,1.1.1.2:9160

kairosdb.datastore.cassandra.host_list=<list of cassandra nodes>

kairosdb.datastore.cassandra.keyspace=kairosdb

kairosdb.datastore.cassandra.replication_factor=3

kairosdb.datastore.cassandra.write_delay=500

kairosdb.datastore.cassandra.write_buffer_max_size=500000

#When reading one row read in 10k

kairosdb.datastore.cassandra.single_row_read_size=10240


#The number of rows to read when doing a multi get

kairosdb.datastore.cassandra.multi_row_size=1000

#The amount of data to read from each row when doing a multi get

kairosdb.datastore.cassandra.multi_row_read_size=1024


#Size of the row key cache size.  This can be monitored by querying

#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index

#Ideally the data written to the row_key_index should stabilize to zero except

#when data rolls to a new row

kairosdb.datastore.cassandra.row_key_cache_size=10240


kairosdb.datastore.cassandra.string_cache_size=5000



03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.1.31.101:9160-27204>

03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:303] - Exception:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)

        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer.run(WriteBuffer.java:237) [kairosdb-0.9.4-6.jar:0.9.4-6.20150330114205]

        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]

Caused by: org.apache.cassandra.thrift.TimedOutException: null

        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]

        ... 4 common frames omitted

03-04|17:48:00.000 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer data_points size to 37719

03-04|17:48:00.001 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer row_key_index size to 17764

Varun Chandra

Mar 4, 2016, 1:02:58 PM
to KairosDB

Full Configuration:


#===============================================================================

#Cassandra properties

#host list is in the form> 1.1.1.1:9160,1.1.1.2:9160

kairosdb.datastore.cassandra.host_list=<list of nodes>

kairosdb.datastore.cassandra.keyspace=kairosdb

kairosdb.datastore.cassandra.replication_factor=3

kairosdb.datastore.cassandra.write_delay=500

kairosdb.datastore.cassandra.write_buffer_max_size=500000

#When reading one row read in 10k

kairosdb.datastore.cassandra.single_row_read_size=10240


#The number of rows to read when doing a multi get

kairosdb.datastore.cassandra.multi_row_size=1000

#The amount of data to read from each row when doing a multi get

kairosdb.datastore.cassandra.multi_row_read_size=1024


#Size of the row key cache size.  This can be monitored by querying

#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index

#Ideally the data written to the row_key_index should stabilize to zero except

#when data rolls to a new row

kairosdb.datastore.cassandra.row_key_cache_size=10240


kairosdb.datastore.cassandra.string_cache_size=5000


# Uses Quartz Cron syntax - default is to run every five minutes

kairosdb.datastore.cassandra.increase_buffer_size_schedule=0 */1 * * * ?


#Control the required consistency for cassandra operations.

#Available settings are cassandra version dependent:

#http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/dml/dml_config_consistency_c.html

kairosdb.datastore.cassandra.read_consistency_level=QUORUM

kairosdb.datastore.cassandra.write_consistency_level=QUORUM


#for cassandra authentication use the following

#kairosdb.datastore.cassandra.auth.[prop name]=[prop value]

#example:

kairosdb.datastore.cassandra.auth.username=user

kairosdb.datastore.cassandra.auth.password=password


#the time to live in seconds for datapoints. After this period the data will be

#deleted automatically. If not set the data will live forever.

#TTLs are added to columns as they're inserted so setting this will not affect

#existing data, only new data.

#kairosdb.datastore.cassandra.datapoint_ttl=31536000



#===============================================================================

# Hector configuration


kairosdb.datastore.cassandra.hector.maxActive=64

#kairosdb.datastore.cassandra.hector.maxWaitTimeWhenExhausted=-1

#kairosdb.datastore.cassandra.hector.useSocketKeepalive=false

kairosdb.datastore.cassandra.hector.useSocketKeepalive=true

#kairosdb.datastore.cassandra.hector.cassandraThriftSocketTimeout=0

kairosdb.datastore.cassandra.hector.retryDownedHosts=true

kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds=10

#kairosdb.datastore.cassandra.hector.retryDownedHostsQueueSize=-1

#kairosdb.datastore.cassandra.hector.autoDiscoverHosts=false

#kairosdb.datastore.cassandra.hector.autoDiscoveryDelayInSeconds=30

#kairosdb.datastore.cassandra.hector.autoDiscoveryDataCenters=

#kairosdb.datastore.cassandra.hector.runAutoDiscoveryAtStartup=false

#kairosdb.datastore.cassandra.hector.useHostTimeoutTracker=false

#kairosdb.datastore.cassandra.hector.maxFrameSize=2147483647

#kairosdb.datastore.cassandra.hector.loadBalancingPolicy=roundRobin | leastActive | dynamic

#kairosdb.datastore.cassandra.hector.loadBalancingPolicy=dynamic

kairosdb.datastore.cassandra.hector.loadBalancingPolicy=leastActive

#kairosdb.datastore.cassandra.hector.hostTimeoutCounter=10

#kairosdb.datastore.cassandra.hector.hostTimeoutWindow=500

#kairosdb.datastore.cassandra.hector.hostTimeoutSuspensionDurationInSeconds=10

#kairosdb.datastore.cassandra.hector.hostTimeoutUnsuspendCheckDelay=10

#kairosdb.datastore.cassandra.hector.maxConnectTimeMillis=-1

#kairosdb.datastore.cassandra.hector.maxLastSuccessTimeMillis=-1






03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.1.31.101:9160-27204>

03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:303] - Exception:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)

        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer.run(WriteBuffer.java:237) [kairosdb-0.9.4-6.jar:0.9.4-6.20150330114205]

        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]

Caused by: org.apache.cassandra.thrift.TimedOutException: null

        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]

        ... 4 common frames omitted

03-04|17:48:00.000 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer data_points size to 37719

03-04|17:48:00.001 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer row_key_index size to 17764


Loic Coulet

Mar 5, 2016, 1:05:07 AM
to KairosDB
Hi Varun, we often write to KairosDB in single HTTP POST queries with gzipped payloads containing about 10 million datapoints and thousands of metrics.
We do this as batch inserts; they may arrive in sequence (inserted as fast as possible), but not for too long.
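The gzipped-payload approach Loic describes can be sketched with the standard library alone. The payload shape is illustrative, and the exact KairosDB endpoint and headers for gzipped POSTs depend on the server version, so treat those as assumptions to verify.

```python
import gzip
import json

# Build an illustrative datapoint payload and gzip it before POSTing.
points = [{"name": "sensor.load", "tags": {"host": "h1"},
           "datapoints": [[1457000000000 + i * 1000, 0.5]]}
          for i in range(1000)]
raw = json.dumps(points).encode("utf-8")
compressed = gzip.compress(raw)

assert gzip.decompress(compressed) == raw  # lossless round trip
print(len(raw), len(compressed))  # gzip shrinks the repetitive JSON a lot
# POST `compressed` to KairosDB's gzip ingest endpoint (version dependent).
```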

But this gives Cassandra quite a lot of activity in the cluster (we have a 6-node cluster with spinning disks).
I noticed that in such cases a single-node Cassandra host often performed better than a cluster.

Maybe by doing that every 15 minutes, all the time, Cassandra ends up overloaded at some point.

How big is your Cassandra cluster? What kind of hosts? Did you monitor Cassandra performance?

Loic

Brian Hawkins

Mar 6, 2016, 12:03:02 PM
to KairosDB
Thanks for reposting, the lines in your first post were befuddling my brain.

With the volume you are currently talking about, I would go with at least a 6-node C* cluster.  While inserting, watch the load average on both Kairos and C*.  It would be interesting to see load and CPU while you insert the data; I'm guessing you are maxing out C*.

A couple of other related thoughts.  A 3-node cluster with a replication factor of 3 is no better at performance than just 1 node: with full replication every node gets all the data, plus the added overhead of the nodes communicating with each other.  The replication does give you redundancy, so it isn't all bad.  If you increase the cluster size to 6 and leave replication at 3, you should get twice the performance, as each node will then hold only half the data in the ring.  You will probably want more nodes anyway just to hold all the data.
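Brian's replication arithmetic can be checked with a one-liner: with replication factor RF on an N-node ring, each node holds roughly total × RF / N of the data.

```python
def data_per_node(total_units, rf, nodes):
    """Approximate share of the dataset stored on each node in the ring."""
    return total_units * rf / nodes

# 3 nodes, RF 3: every node stores a full copy of the data.
print(data_per_node(100, 3, 3))  # 100.0
# 6 nodes, RF 3: each node stores half, so roughly double the headroom.
print(data_per_node(100, 3, 6))  # 50.0
```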

Also, I am in the process of adding a queue to Kairos for various reasons.  One benefit of the queue is that it can help smooth out bursts of data like the ones you are sending.

Brian

swetha kasireddy

Mar 6, 2016, 10:49:08 PM
to KairosDB
HI Brian,

I see the same timeout errors in my Kairos calls to Cassandra. Any idea how this can be fixed? We have a 6-node Cassandra cluster. In what order does Kairos make calls to the Cassandra cluster? Is it in the order of the IP addresses that are specified, and does it make calls to all of the addresses that are specified? Suppose I have DC1Cloud1 and DC1Cloud2 in one data center and DC2Cloud1 and DC2Cloud2 in another data center; do I have to specify all the IP addresses on all the Kairos nodes? Kairos nodes are also available in DC1Cloud1, DC1Cloud2, DC2Cloud1 and DC2Cloud2. I have six Cassandra nodes in each cloud.

Thanks,
Swetha 


03-07|03:03:02.493 [pool-4-thread-14] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<>

03-07|03:03:02.493 [pool-4-thread-14] WARN  [HConnectionManager.java:303] - Exception:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)

        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) [kairosdb-1.1.1-1.jar:1.1.1-1.20151207194217]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]

        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]

Caused by: org.apache.cassandra.thrift.TimedOutException: null

        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]

        ... 8 common frames omitted

Brian Hawkins

Mar 6, 2016, 11:12:03 PM
to KairosDB
Swetha,

What does the load average look like on the C* nodes?  And how many cores do each C* node have?

Brian

swetha kasireddy

Mar 6, 2016, 11:40:20 PM
to KairosDB
Hi Brian,

Each Cassandra node has 8 cores. I could have anywhere around 12,000 metrics per second. My QA cluster seems to be fine, as everything is in the same cloud and the same data center, whereas the prod one has 2 data centers with 2 clouds in each. The cloud that has Kairos running has the other data center's Cassandra IP addresses specified at the beginning of the list. So it could be a network issue as well. How does Kairos try to insert into Cassandra? Does it try to insert into each Cassandra node in the order specified in the properties file, or does it stop after inserting into one?

Thanks,
Swetha 


On Thursday, February 18, 2016 at 10:47:34 AM UTC-8, Noorul Islam Kamal Malmiyoda wrote:

Noorul Islam K M

Mar 7, 2016, 12:23:22 AM
to swetha kasireddy, KairosDB
swetha kasireddy <swethak...@gmail.com> writes:

> Hi Brian,
>
> Each Cassandra node has 8 cores. I could have anywhere around 12000 metrics
> per second. My qa cluster seems to be fine as everything is in the same
> cloud and same datacenter. Whereas the prod one has 2 data centers and 2
> clouds each in each data center. The cloud that has kairos running has a
> different datacenter cassandra ip addresses specified in the begining. So,
> it could be network issue as well. How does Kairos try to insert into
> Cassandra? Does it try to insert into each cassandra node as per the order
> of specification in the properties file? Or does it skip after inserting
> into one?
>

I would suggest using a tool like iperf to measure the bandwidth between Kairos and the C* DC, so that the problem can be isolated.

Thanks and Regards
Noorul

swetha kasireddy

Mar 7, 2016, 8:21:00 PM
to KairosDB
Hi Brian,

Any suggestions on how this can be handled?

Thanks,
Swetha



Varun Chandra

Mar 8, 2016, 1:14:45 AM
to KairosDB

Thanks Loic and Brian. I will try gzipping metrics before posting to Kairos and, as suggested, increase the number of Cassandra nodes and reduce replication. Also, can you please give your feedback on the following configuration:


kairosdb.datastore.cassandra.write_delay=500               
kairosdb.datastore.cassandra.write_buffer_max_size=500000  
kairosdb.datastore.cassandra.single_row_read_size=1024     
kairosdb.datastore.cassandra.multi_row_size=1000
kairosdb.datastore.cassandra.multi_row_read_size=1024      
kairosdb.datastore.cassandra.row_key_cache_size=10240      
kairosdb.datastore.cassandra.string_cache_size=5000


I still have a few more read-related questions:


1. During a query I set the cache time to 3600, but I don't see any content in the cache files created in /tmp. Also, these files are deleted even though the cleanup schedule is set to a longer interval.

2. In my existing setup, for a given metric I might in the worst case have a tag cardinality of around ~50k with a 5 million datapoint sample size; these types of queries take around 30-50 seconds.

    So what changes can I make to reduce the read time for such queries? Do I need to increase kairosdb.datastore.cassandra.hector.maxActive=64?


Thanks

Varun

swetha kasireddy

Mar 8, 2016, 1:40:11 AM
to KairosDB
Hi Brian,

I will be writing at most 400,000 metrics per second. We have a six-node Cassandra cluster with 8 cores each. Any idea what needs to be changed to avoid the frequent timeouts? Following are my Cassandra settings.

kairosdb.datastore.cassandra.keyspace=kairosdb

kairosdb.datastore.cassandra.replication_factor=3

kairosdb.datastore.cassandra.write_delay=1000

kairosdb.datastore.cassandra.write_buffer_max_size=2000000

#When reading one row read in 10k

kairosdb.datastore.cassandra.single_row_read_size=10240


#The number of rows to read when doing a multi get

kairosdb.datastore.cassandra.multi_row_size=1000

#The amount of data to read from each row when doing a multi get

kairosdb.datastore.cassandra.multi_row_read_size=1024


#Size of the row key cache size.  This can be monitored by querying

#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index

#Ideally the data written to the row_key_index should stabilize to zero except

#when data rolls to a new row

kairosdb.datastore.cassandra.row_key_cache_size=65536


kairosdb.datastore.cassandra.string_cache_size=20000


# Uses Quartz Cron syntax - default is to run every five minutes

kairosdb.datastore.cassandra.increase_buffer_size_schedule=0 */5 * * * ?


#Control the required consistency for cassandra operations.

#Available settings are cassandra version dependent:

#http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/dml/dml_config_consistency_c.html

kairosdb.datastore.cassandra.read_consistency_level=ONE

kairosdb.datastore.cassandra.write_consistency_level=LOCAL_QUORUM



Brian Hawkins

Mar 16, 2016, 3:39:15 PM
to KairosDB
Swetha,

I have a few thoughts.  6 nodes doesn't feel like enough for 400k/sec; I would at least double the cluster size.
Timeouts are likely caused by too much load on C* (see above) and by Kairos sending batches that are too large to C*.  You can reduce the batch size by increasing the number of Kairos nodes.

Also make sure kairos is only configured with the C* nodes that are in the local DC so they do not try to talk across to the other DC.

Brian

Brian Hawkins

Mar 16, 2016, 3:43:36 PM
to KairosDB
Varun,

The query cache is only used if you make the exact same query within the cache timeout of the previous query.  The query cache is quickly becoming less useful, so I wouldn't worry about it.  We have plans that will do a better job than the cache.

For query performance read https://github.com/kairosdb/kairosdb/wiki/Query-Performance to diagnose what part of the query is slow.  Then we can figure out how to speed it up.

Brian

swetha kasireddy

Mar 16, 2016, 4:08:12 PM
to KairosDB
OK. Does Kairos update data on each Cassandra node that is specified in the properties file, or does it check the write consistency (LOCAL_QUORUM in our case) and skip some nodes depending on it?

Thanks,
Swetha



Brian Hawkins

Mar 29, 2016, 9:19:28 PM
to KairosDB
Let me first explain how sending data to C* works.  You send an update to a single node.  That node then sends the data to all nodes that hold a replica of the partition to which the data belongs.  The node that originally got the message does not return until it gets responses matching the write consistency.

Kairos by default does a round robin over the nodes in the properties file.  So if you have nodes from data centers A and B, it will send the write to a node and wait until it gets a local quorum response from the data center that node is in.
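The round-robin selection described above can be sketched with itertools.cycle; the host addresses are placeholders, and the real client also handles failover and consistency waits.

```python
from itertools import cycle

# Placeholder host_list mixing two data centers, as in the question above.
host_list = ["10.0.1.1:9160", "10.0.1.2:9160", "10.0.2.1:9160"]
next_host = cycle(host_list)

# Each write picks the next coordinator in turn; that coordinator then
# waits for a response matching the configured write consistency.
writes = [next(next_host) for _ in range(6)]
print(writes)
```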

swetha kasireddy

Mar 30, 2016, 11:41:16 AM
to KairosDB
Hi Brian,

So, I currently have a write consistency of ONE and a replication factor of 3. Suppose my Kairos properties have nodes from data centers A and B. Does it try to insert into the LOCAL_QUORUM of both A and B in the same call, or does it insert into the LOCAL_QUORUM of A the first time and into the LOCAL_QUORUM of B the second time?

Thanks!



swetha kasireddy

Apr 5, 2016, 6:51:50 PM
to KairosDB
Hi Brian,

I have 6 nodes in my Cassandra cluster. So, if the first 2 nodes listed in Kairos are down, does it keep trying the next 4 nodes until it gets a response? When I look at the logs, it seems to be talking to only two nodes. The reason I am asking is that in my Cassandra cluster 2 of the nodes are down and 4 are up, yet Kairos behaves as though all the nodes are down and fails with the following error.

04-05|22:46:35.147 [pool-2-thread-10] ERROR [WriteBuffer.java:379] - Error sending data to Cassandra (data_points)

me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.

        at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:390) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:244) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) ~[kairosdb-1.1.1-1.jar:1.1.1-1.20151207194217]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]

        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]

04-05|22:46:35.148 [pool-2-thread-10] ERROR [WriteBuffer.java:383] - Reducing write buffer size to 0.  You need to increase your cassandra capacity or change the kairosdb.datastore.cassandra.write_buffer_max_size property.


Thanks,
Swetha

