Frequent timeout exception


Noorul Islam K M

Feb 18, 2016, 1:47:34 PM
to KairosDB

I am seeing regular timeout exceptions. When I look at the Cassandra
cluster, the load is not significant, so I am not sure why it is
throwing timeout exceptions. Any help is appreciated!



02-18|12:00:27.888 [pool-5-thread-3] WARN [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.12.1.22:9160-219>
02-18|12:00:27.888 [pool-5-thread-3] WARN [HConnectionManager.java:303] - Exception:
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]
at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) [kairosdb-1.1.1-2.jar:1.1.1-2.20160216153340]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_75]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]
... 8 common frames omitted
02-18|12:00:27.889 [pool-5-thread-3] INFO [HThriftClientFactoryImpl.java:31] - SSL enabled for client<->server communications.
02-18|12:00:28.491 [pool-2-thread-2] WARN [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.12.1.135:9160-162>
02-18|12:00:28.491 [pool-2-thread-2] WARN [HConnectionManager.java:303] - Exception:
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]
at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) [kairosdb-1.1.1-2.jar:1.1.1-2.20160216153340]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_75]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

Brian Hawkins

Feb 25, 2016, 12:03:04 AM
to KairosDB
I was thinking this was during a read, but it looks like it is during a write.  What is the size of the buffers you are sending to C*?  Have a look at the metric kairosdb.datastore.write_size and see what it was at about the time of the error.  You may be trying to send too much through a single Kairos node and need to add another for load balancing.

How many Kairos nodes and C* nodes do you have?  How many metrics are you sending per second?
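Brian's kairosdb.datastore.write_size metric can be queried through KairosDB's own REST query endpoint. The sketch below builds such a query; the port, the relative-time window, and the `buffer` tag values are assumptions based on the config comments quoted later in this thread.

```python
import json

def build_write_size_query(hours=1, buffer_tag="data_points"):
    """Build a KairosDB query for the internal write_size metric,
    filtered on the 'buffer' tag (e.g. data_points, row_key_index)."""
    return {
        "start_relative": {"value": hours, "unit": "hours"},
        "metrics": [{
            "name": "kairosdb.datastore.write_size",
            "tags": {"buffer": [buffer_tag]},
        }],
    }

payload = json.dumps(build_write_size_query())
print(payload)
# POST this body to http://<kairos-host>:8080/api/v1/datapoints/query
# with Content-Type: application/json to see buffer sizes over time.
```

Graphing the result around the time of a timeout shows whether the write buffer spiked just before the error.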

Brian

Varun Chandra

Mar 4, 2016, 12:55:56 PM
to KairosDB

Hi Brian,

We see write timeouts when writing more than 500 metrics (all with different tag combinations) in a single HTTP POST, and if I divide the write into chunks, it takes a long time to write everything.

Our writes are spread over an hour:

We write some 10 million datapoints every 15 minutes and some 1 million at the start of each hour. The tag cardinality for a metric is not too high; in the worst case it may reach 10k. We have a lot of different metrics for each client, which adds up to the 10 million writes every 15 minutes.
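A minimal sketch of the chunking approach mentioned above: split the metric list into batches of at most 500 before POSTing each batch separately. The metric structure and batch size here are illustrative, not the poster's actual payload.

```python
def chunked(items, size=500):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Illustrative payload: 1200 metrics with different tag combinations.
metrics = [{"name": "client.metric", "tags": {"combo": str(i)},
            "datapoints": [[1455800000000, i]]} for i in range(1200)]

batches = list(chunked(metrics))
print(len(batches))  # each batch would be one HTTP POST to /api/v1/datapoints
```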


We are planning to scale to 100x the current volume, but are not sure if Kairos is the right fit.

We have 3 Cassandra nodes with replication factor 3 and 2 Kairos nodes behind an ELB.
Following is the configuration we are using.

#Cassandra properties

#host list is in the form> 1.1.1.1:9160,1.1.1.2:9160

kairosdb.datastore.cassandra.host_list=<list of cassandra nodes>

kairosdb.datastore.cassandra.keyspace=kairosdb

kairosdb.datastore.cassandra.replication_factor=3

kairosdb.datastore.cassandra.write_delay=500

kairosdb.datastore.cassandra.write_buffer_max_size=500000

#When reading one row read in 10k

kairosdb.datastore.cassandra.single_row_read_size=10240


#The number of rows to read when doing a multi get

kairosdb.datastore.cassandra.multi_row_size=1000

#The amount of data to read from each row when doing a multi get

kairosdb.datastore.cassandra.multi_row_read_size=1024


#Size of the row key cache size.  This can be monitored by querying

#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index

#Ideally the data written to the row_key_index should stabilize to zero except

#when data rolls to a new row

kairosdb.datastore.cassandra.row_key_cache_size=10240


kairosdb.datastore.cassandra.string_cache_size=5000



03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.1.31.101:9160-27204>

03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:303] - Exception:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)

        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer.run(WriteBuffer.java:237) [kairosdb-0.9.4-6.jar:0.9.4-6.20150330114205]

        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]

Caused by: org.apache.cassandra.thrift.TimedOutException: null

        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]

        ... 4 common frames omitted

03-04|17:48:00.000 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer data_points size to 37719

03-04|17:48:00.001 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer row_key_index size to 17764

Varun Chandra

Mar 4, 2016, 1:02:58 PM
to KairosDB

Full Configuration:


#===============================================================================

#Cassandra properties

#host list is in the form> 1.1.1.1:9160,1.1.1.2:9160

kairosdb.datastore.cassandra.host_list=<list of nodes>

kairosdb.datastore.cassandra.keyspace=kairosdb

kairosdb.datastore.cassandra.replication_factor=3

kairosdb.datastore.cassandra.write_delay=500

kairosdb.datastore.cassandra.write_buffer_max_size=500000

#When reading one row read in 10k

kairosdb.datastore.cassandra.single_row_read_size=10240


#The number of rows to read when doing a multi get

kairosdb.datastore.cassandra.multi_row_size=1000

#The amount of data to read from each row when doing a multi get

kairosdb.datastore.cassandra.multi_row_read_size=1024


#Size of the row key cache size.  This can be monitored by querying

#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index

#Ideally the data written to the row_key_index should stabilize to zero except

#when data rolls to a new row

kairosdb.datastore.cassandra.row_key_cache_size=10240


kairosdb.datastore.cassandra.string_cache_size=5000


# Uses Quartz Cron syntax - default is to run every five minutes

kairosdb.datastore.cassandra.increase_buffer_size_schedule=0 */1 * * * ?


#Control the required consistency for cassandra operations.

#Available settings are cassandra version dependent:

#http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/dml/dml_config_consistency_c.html

kairosdb.datastore.cassandra.read_consistency_level=QUORUM

kairosdb.datastore.cassandra.write_consistency_level=QUORUM


#for cassandra authentication use the following

#kairosdb.datastore.cassandra.auth.[prop name]=[prop value]

#example:

kairosdb.datastore.cassandra.auth.username=user

kairosdb.datastore.cassandra.auth.password=password


#the time to live in seconds for datapoints. After this period the data will be

#deleted automatically. If not set the data will live forever.

#TTLs are added to columns as they're inserted so setting this will not affect

#existing data, only new data.

#kairosdb.datastore.cassandra.datapoint_ttl=31536000



#===============================================================================

# Hector configuration


kairosdb.datastore.cassandra.hector.maxActive=64

#kairosdb.datastore.cassandra.hector.maxWaitTimeWhenExhausted=-1

#kairosdb.datastore.cassandra.hector.useSocketKeepalive=false

kairosdb.datastore.cassandra.hector.useSocketKeepalive=true

#kairosdb.datastore.cassandra.hector.cassandraThriftSocketTimeout=0

kairosdb.datastore.cassandra.hector.retryDownedHosts=true

kairosdb.datastore.cassandra.hector.retryDownedHostsDelayInSeconds=10

#kairosdb.datastore.cassandra.hector.retryDownedHostsQueueSize=-1

#kairosdb.datastore.cassandra.hector.autoDiscoverHosts=false

#kairosdb.datastore.cassandra.hector.autoDiscoveryDelayInSeconds=30

#kairosdb.datastore.cassandra.hector.autoDiscoveryDataCenters=

#kairosdb.datastore.cassandra.hector.runAutoDiscoveryAtStartup=false

#kairosdb.datastore.cassandra.hector.useHostTimeoutTracker=false

#kairosdb.datastore.cassandra.hector.maxFrameSize=2147483647

#kairosdb.datastore.cassandra.hector.loadBalancingPolicy=roundRobin | leastActive | dynamic

#kairosdb.datastore.cassandra.hector.loadBalancingPolicy=dynamic

kairosdb.datastore.cassandra.hector.loadBalancingPolicy=leastActive

#kairosdb.datastore.cassandra.hector.hostTimeoutCounter=10

#kairosdb.datastore.cassandra.hector.hostTimeoutWindow=500

#kairosdb.datastore.cassandra.hector.hostTimeoutSuspensionDurationInSeconds=10

#kairosdb.datastore.cassandra.hector.hostTimeoutUnsuspendCheckDelay=10

#kairosdb.datastore.cassandra.hector.maxConnectTimeMillis=-1

#kairosdb.datastore.cassandra.hector.maxLastSuccessTimeMillis=-1






03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.1.31.101:9160-27204>

03-04|17:47:14.521 [Thread-5] WARN  [HConnectionManager.java:303] - Exception:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)

        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer.run(WriteBuffer.java:237) [kairosdb-0.9.4-6.jar:0.9.4-6.20150330114205]

        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]

Caused by: org.apache.cassandra.thrift.TimedOutException: null

        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]

        ... 4 common frames omitted

03-04|17:48:00.000 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer data_points size to 37719

03-04|17:48:00.001 [QuartzScheduler_Worker-3] INFO  [WriteBuffer.java:199] - Increasing write buffer row_key_index size to 17764


Loic Coulet

Mar 5, 2016, 1:05:07 AM
to KairosDB
Hi Varun, we often write to KairosDB in single HTTP POST queries with gzipped payloads containing about 10 million datapoints and thousands of metrics.
We do this as batch inserts; they may arrive in sequence (inserted as fast as possible), but not for too long.
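The gzipped-payload approach Loic describes can be sketched with the standard library alone. The payload shape is illustrative, and the exact KairosDB endpoint and headers for gzipped POSTs depend on the server version, so treat those as assumptions to verify.

```python
import gzip
import json

# Build an illustrative datapoint payload and gzip it before POSTing.
points = [{"name": "sensor.load", "tags": {"host": "h1"},
           "datapoints": [[1457000000000 + i * 1000, 0.5]]}
          for i in range(1000)]
raw = json.dumps(points).encode("utf-8")
compressed = gzip.compress(raw)

assert gzip.decompress(compressed) == raw  # lossless round trip
print(len(raw), len(compressed))  # gzip shrinks the repetitive JSON a lot
# POST `compressed` to KairosDB's gzip ingest endpoint (version dependent).
```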

But this gives Cassandra quite a lot of activity in the cluster (we have a 6-node cluster with spinning disks).
I noticed that in such cases a single-node Cassandra host often performed better than a cluster.

Maybe by doing that every 15 minutes, all the time, Cassandra ends up overloaded at some point.

How big is your Cassandra cluster? What kind of hosts? Did you monitor Cassandra performance?

Loic

Brian Hawkins

Mar 6, 2016, 12:03:02 PM
to KairosDB
Thanks for reposting, the lines in your first post were befuddling my brain.

With the volume you are currently talking about, I would go with at least a 6-node C* cluster.  While inserting, watch the load average on both Kairos and C*.  It would be interesting to see load and CPU while you insert the data; I'm guessing you are maxing out C*.

A couple of other related thoughts.  A 3-node cluster with a replication factor of 3 is no better at performance than just 1 node: with full replication every node gets all the data, plus the added overhead of the nodes communicating with each other.  The replication does give you redundancy, so it isn't all bad.  If you increase the cluster size to 6 and leave replication at 3, you should get twice the performance, as each node will then hold only half the data in the ring.  You will probably want more nodes anyway just to hold all the data.
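Brian's replication arithmetic can be checked with a one-liner: with replication factor RF on an N-node ring, each node holds roughly total × RF / N of the data.

```python
def data_per_node(total_units, rf, nodes):
    """Approximate share of the dataset stored on each node in the ring."""
    return total_units * rf / nodes

# 3 nodes, RF 3: every node stores a full copy of the data.
print(data_per_node(100, 3, 3))  # 100.0
# 6 nodes, RF 3: each node stores half, so roughly double the headroom.
print(data_per_node(100, 3, 6))  # 50.0
```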

Also, I am in the process of adding a queue to Kairos for various reasons.  One benefit of the queue is that it can help smooth out bursts of data like the ones you are sending.

Brian

swetha kasireddy

Mar 6, 2016, 10:49:08 PM
to KairosDB
HI Brian,

I see the same timeout errors in my Kairos calls to Cassandra. Any idea how this can be fixed? We have a 6-node Cassandra cluster. In what order does Kairos make calls to the Cassandra cluster? Is it in the order of the IP addresses that are specified, and does it make calls to all of the addresses that are specified? Suppose I have DC1Cloud1 and DC1Cloud2 in one data center and DC2Cloud1 and DC2Cloud2 in another data center; do I have to specify all the IP addresses on all the Kairos nodes? Kairos nodes are also available in DC1Cloud1, DC1Cloud2, DC2Cloud1 and DC2Cloud2. I have six Cassandra nodes in each cloud.

Thanks,
Swetha 


03-07|03:03:02.493 [pool-4-thread-14] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<>

03-07|03:03:02.493 [pool-4-thread-14] WARN  [HConnectionManager.java:303] - Exception:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException(acknowledged_by:1)

        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) [hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) [hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) [kairosdb-1.1.1-1.jar:1.1.1-1.20151207194217]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]

        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]

Caused by: org.apache.cassandra.thrift.TimedOutException: null

        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20849) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.7.0.jar:0.7.0]

        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) ~[cassandra-thrift-1.2.5.jar:1.2.5]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]

        ... 8 common frames omitted

Brian Hawkins

Mar 6, 2016, 11:12:03 PM
to KairosDB
Swetha,

What does the load average look like on the C* nodes?  And how many cores do each C* node have?

Brian

swetha kasireddy

Mar 6, 2016, 11:40:20 PM
to KairosDB
Hi Brian,

Each Cassandra node has 8 cores. I could have anywhere around 12,000 metrics per second. My QA cluster seems to be fine, as everything is in the same cloud and the same data center, whereas the prod one has 2 data centers with 2 clouds in each. The cloud that has Kairos running has the other data center's Cassandra IP addresses specified at the beginning of the list. So it could be a network issue as well. How does Kairos try to insert into Cassandra? Does it try to insert into each Cassandra node in the order specified in the properties file, or does it stop after inserting into one?

Thanks,
Swetha 


On Thursday, February 18, 2016 at 10:47:34 AM UTC-8, Noorul Islam Kamal Malmiyoda wrote:

Noorul Islam K M

Mar 7, 2016, 12:23:22 AM
to swetha kasireddy, KairosDB
swetha kasireddy <swethak...@gmail.com> writes:

> Hi Brian,
>
> Each Cassandra node has 8 cores. I could have anywhere around 12000 metrics
> per second. My qa cluster seems to be fine as everything is in the same
> cloud and same datacenter. Whereas the prod one has 2 data centers and 2
> clouds each in each data center. The cloud that has kairos running has a
> different datacenter cassandra ip addresses specified in the begining. So,
> it could be network issue as well. How does Kairos try to insert into
> Cassandra? Does it try to insert into each cassandra node as per the order
> of specification in the properties file? Or does it skip after inserting
> into one?
>

I would suggest using a tool like iperf to measure the bandwidth between Kairos and the C* DC, so that the problem can be isolated.

Thanks and Regards
Noorul

swetha kasireddy

Mar 7, 2016, 8:21:00 PM
to KairosDB
Hi Brian,

Any suggestions on how this can be handled?

Thanks,
Swetha



Varun Chandra

Mar 8, 2016, 1:14:45 AM
to KairosDB

Thanks Loic and Brian. I will try gzipping metrics before posting to Kairos and, as suggested, increase the number of Cassandra nodes and reduce replication. Also, can you please give your feedback on the following configuration:


kairosdb.datastore.cassandra.write_delay=500               
kairosdb.datastore.cassandra.write_buffer_max_size=500000  
kairosdb.datastore.cassandra.single_row_read_size=1024     
kairosdb.datastore.cassandra.multi_row_size=1000
kairosdb.datastore.cassandra.multi_row_read_size=1024      
kairosdb.datastore.cassandra.row_key_cache_size=10240      
kairosdb.datastore.cassandra.string_cache_size=5000


I still have a few more read-related questions:


1. During a query I set the cache time to 3600, but I don't see any content in the cache files created in /tmp. Also, these files are deleted even though the cleanup schedule is set to a longer interval.

2. In my existing setup, for a given metric I might in the worst case have a tag cardinality of around ~50k with a 5 million datapoint sample size; these types of queries take around 30-50 seconds.

    So what changes can I make to reduce the read time for such queries? Do I need to increase kairosdb.datastore.cassandra.hector.maxActive=64?


Thanks

Varun

swetha kasireddy

Mar 8, 2016, 1:40:11 AM
to KairosDB
Hi Brian,

I will be writing at most 400,000 metrics per second. We have a six-node Cassandra cluster with 8 cores each. Any idea what needs to be changed to avoid the frequent timeouts? Following are my Cassandra settings.

kairosdb.datastore.cassandra.keyspace=kairosdb

kairosdb.datastore.cassandra.replication_factor=3

kairosdb.datastore.cassandra.write_delay=1000

kairosdb.datastore.cassandra.write_buffer_max_size=2000000

#When reading one row read in 10k

kairosdb.datastore.cassandra.single_row_read_size=10240


#The number of rows to read when doing a multi get

kairosdb.datastore.cassandra.multi_row_size=1000

#The amount of data to read from each row when doing a multi get

kairosdb.datastore.cassandra.multi_row_read_size=1024


#Size of the row key cache size.  This can be monitored by querying

#kairosdb.datastore.write_size and filtering on the tag buffer = row_key_index

#Ideally the data written to the row_key_index should stabilize to zero except

#when data rolls to a new row

kairosdb.datastore.cassandra.row_key_cache_size=65536


kairosdb.datastore.cassandra.string_cache_size=20000


# Uses Quartz Cron syntax - default is to run every five minutes

kairosdb.datastore.cassandra.increase_buffer_size_schedule=0 */5 * * * ?


#Control the required consistency for cassandra operations.

#Available settings are cassandra version dependent:

#http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/dml/dml_config_consistency_c.html

kairosdb.datastore.cassandra.read_consistency_level=ONE

kairosdb.datastore.cassandra.write_consistency_level=LOCAL_QUORUM



Brian Hawkins

Mar 16, 2016, 3:39:15 PM
to KairosDB
Swetha,

I have a few thoughts.  6 nodes doesn't feel like enough for 400k/sec; I would at least double the cluster size.
Timeouts are likely caused by too much load on C* (see above) and by Kairos sending batches that are too large to C*.  You can reduce the batch size by increasing the number of Kairos nodes.

Also make sure kairos is only configured with the C* nodes that are in the local DC so they do not try to talk across to the other DC.

Brian

Brian Hawkins

Mar 16, 2016, 3:43:36 PM
to KairosDB
Varun,

The query cache is only used if you make the exact same query within the cache timeout of the previous query.  The query cache is quickly becoming less useful, so I wouldn't worry about it.  We have plans that will do a better job than the cache.

For query performance read https://github.com/kairosdb/kairosdb/wiki/Query-Performance to diagnose what part of the query is slow.  Then we can figure out how to speed it up.

Brian

swetha kasireddy

Mar 16, 2016, 4:08:12 PM
to KairosDB
OK. Does Kairos update data on each Cassandra node that is specified in the properties file, or does it check the write consistency (LOCAL_QUORUM in our case) and skip some nodes depending on it?

Thanks,
Swetha



Brian Hawkins

Mar 29, 2016, 9:19:28 PM
to KairosDB
Let me first explain how sending data to C* works.  You send an update to a single node.  That node then sends the data to all nodes that hold a replica of the partition to which the data belongs.  The node that originally got the message does not return until it gets responses matching the write consistency.

Kairos by default does a round robin over the nodes in the properties file.  So if you have nodes from data centers A and B, it will send the write to a node and wait until it gets a local quorum response from the data center that node is in.
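The round-robin selection described above can be sketched with itertools.cycle; the host addresses are placeholders, and the real client also handles failover and consistency waits.

```python
from itertools import cycle

# Placeholder host_list mixing two data centers, as in the question above.
host_list = ["10.0.1.1:9160", "10.0.1.2:9160", "10.0.2.1:9160"]
next_host = cycle(host_list)

# Each write picks the next coordinator in turn; that coordinator then
# waits for a response matching the configured write consistency.
writes = [next(next_host) for _ in range(6)]
print(writes)
```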

swetha kasireddy

Mar 30, 2016, 11:41:16 AM
to KairosDB
Hi Brian,

So, I currently have a write consistency of ONE and a replication factor of 3. Suppose my Kairos properties have nodes from data centers A and B. Does it try to insert into the LOCAL_QUORUM of both A and B in the same call, or does it insert into the LOCAL_QUORUM of A the first time and into the LOCAL_QUORUM of B the second time?

Thanks!



swetha kasireddy

Apr 5, 2016, 6:51:50 PM
to KairosDB
Hi Brian,

I have 6 nodes in my Cassandra cluster. So, if the first 2 nodes listed in Kairos are down, does it keep trying the next 4 nodes until it gets a response? When I look at the logs, it seems to be talking to only two nodes. The reason I am asking is that in my Cassandra cluster 2 of the nodes are down and 4 are up, yet Kairos behaves as though all the nodes are down and fails with the following error.

04-05|22:46:35.147 [pool-2-thread-10] ERROR [WriteBuffer.java:379] - Error sending data to Cassandra (data_points)

me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.

        at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:390) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:244) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113) ~[hector-core-1.1-4.jar:na]

        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243) ~[hector-core-1.1-4.jar:na]

        at org.kairosdb.datastore.cassandra.WriteBuffer$WriteDataJob.run(WriteBuffer.java:372) ~[kairosdb-1.1.1-1.jar:1.1.1-1.20151207194217]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_51]

        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]

04-05|22:46:35.148 [pool-2-thread-10] ERROR [WriteBuffer.java:383] - Reducing write buffer size to 0.  You need to increase your cassandra capacity or change the kairosdb.datastore.cassandra.write_buffer_max_size property.


Thanks,
Swetha

