Read performance in opposite use cases

Arthur Michaut

Jun 10, 2021, 12:10:40 PM

to KairosDB

Hello everyone,

We are trying to migrate our current time series data source to a more modern system. KairosDB is an obvious candidate since it is based on Cassandra (which we already use). However, we are not yet sure that we want to use Kairos, as we believe our use cases might be a poor fit for it performance-wise.

We have already developed a prototype to clarify some of our questions. The features provided by KairosDB satisfy all our needs, but we have issues with read performance in half of our cases.

Specifically, we would like fast reads in both of these cases:

A. Reading all values of a small number of time series.
Case A is well supported natively by Kairos: the read response time for all time steps (~10K) of a single time series is around 70 ms.
B. Reading only a few values of all time series.
Case B is less satisfactory: the read response time for a single value from each of N time series seems to grow linearly with N and is around 2 s for N = 1000.
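For concreteness, the two read patterns can be sketched as request bodies for the KairosDB REST query endpoint (`POST /api/v1/datapoints/query`). This is only an illustration, not the queries used in the prototype: the metric names (`sensor.0` ... `sensor.999`), the time values, and the one-second window in case B are all assumptions.

```python
import json

def case_a_query(metric, start_ms, end_ms):
    """Case A: all values of one time series over a wide range."""
    return {
        "start_absolute": start_ms,  # milliseconds since the epoch
        "end_absolute": end_ms,
        "metrics": [{"name": metric}],
    }

def case_b_query(metrics, instant_ms, window_ms=1000):
    """Case B: one value from each of many series around one instant."""
    return {
        "start_absolute": instant_ms,
        "end_absolute": instant_ms + window_ms,
        # "limit": 1 asks for at most one data point per metric
        "metrics": [{"name": m, "limit": 1} for m in metrics],
    }

# Hypothetical metric names, for illustration only.
names = [f"sensor.{i}" for i in range(1000)]
a = case_a_query(names[0], 0, 10_000_000)
b = case_b_query(names, 5_000_000)

print(len(a["metrics"]), "metric in the case A body")
print(len(b["metrics"]), "metrics in the case B body")
print(json.dumps(a))
```

Case A touches one series over a long range, while case B touches every series over a very short range, which is why the two cases stress the storage layout in opposite ways.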

(All our tests were performed on a single Cassandra node; unfortunately I can't provide its specs. I have included the actual response times, but I believe they matter less than the ratio between them.)

Given these results, some of us are starting to believe that KairosDB does not suit our needs. What do you think we should do to improve performance in case B?

We thought about storing the data in two ways:

In the Kairos style for case A.
With a column for each time series and a row for each time step for case B.

This solution would probably greatly improve the response time in case B, but it would obviously also degrade write performance. Additionally, I feel that doing so would move the solution away from its original philosophy.
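The trade-off of the double-storage idea can be sketched with two in-memory maps standing in for the two table layouts. This is a simplified model, not KairosDB's or Cassandra's actual schema: the names `series_major` and `time_major` are hypothetical, and each dictionary key plays the role of one partition.

```python
from collections import defaultdict

# Hypothetical stand-ins for two table layouts; each key models one partition.
series_major = defaultdict(dict)  # series name -> {timestamp: value}
time_major = defaultdict(dict)    # timestamp   -> {series name: value}

def write(series, ts, value):
    """Store one data point in both layouts; returns the number of writes."""
    series_major[series][ts] = value
    time_major[ts][series] = value
    return 2  # every data point is written twice

def read_case_a(series):
    """All values of one series: a single partition in the series-major layout."""
    return series_major[series]

def read_case_b(ts):
    """One value of every series: a single partition in the time-major layout."""
    return time_major[ts]

writes = 0
for s in range(5):
    for t in range(3):
        writes += write(f"s{s}", t, s * 10 + t)

print("writes issued:", writes)
print("case A:", read_case_a("s2"))
print("case B:", read_case_b(1))
```

Both reads become single-partition lookups, at the price of doubling the number of writes per data point.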

Please feel free to ask for any additional information!

Brian Hawkins

Jun 15, 2021, 10:42:18 PM

to KairosDB

It sounds like you have a grasp of how C* lays out its data, but you may want to take a look at this session I did for a Scylla conference, where I describe building a time series database on Cassandra-like systems: https://www.youtube.com/watch?v=F2ukOa1gGlo

Your results are exactly what I would expect. I would also expect the response time for scenario B to improve as your C* cluster grows in size. In scenario B you are fetching data from lots of different partitions, and in a larger cluster the work to get those can be done in parallel across the cluster. In Kairos you can tune how many simultaneous queries are run against the cluster to optimize your response time.
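A rough back-of-the-envelope model of that effect, under strong assumptions: each series costs one partition read, every partition read takes the same time, and reads run in perfectly parallel batches. The per-read cost is derived from the figures reported earlier in the thread (about 2 s for 1000 series, so about 2 ms each); the function name and the parallelism values are hypothetical.

```python
import math

def estimated_latency_ms(n_partitions, per_read_ms, parallelism):
    """Idealized model: reads are issued in batches of `parallelism`,
    and each batch takes one per-read time."""
    rounds = math.ceil(n_partitions / parallelism)
    return rounds * per_read_ms

# About 2 s for 1000 series on a single node, so about 2 ms per series.
per_read_ms = 2000 / 1000

for parallelism in (1, 4, 16):
    latency = estimated_latency_ms(1000, per_read_ms, parallelism)
    print(f"parallelism={parallelism:>2}: ~{latency:.0f} ms")
```

Real clusters will not scale this cleanly (coordination overhead, uneven partitions, network latency), so the model only shows the direction of the effect, not a prediction.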

Now, in scenario B, do you need to know which time series a value came from, or are you just looking for an aggregate of the values? I ask because there may be some tricks for writing the data so that it is all in one place to query.

Brian

Arthur Michaut

Jun 16, 2021, 12:49:22 PM

to KairosDB

Hi Brian,

Thank you very much for your response. We will retry with a larger cluster, since it seems that with N C* nodes and large enough systems, the case B response time should be roughly N times faster.

KairosDB was clearly designed to match the kind of use cases you described in the video. I believe our biggest issue is that we do not really have one single use case, and that our two main use cases would lead to opposite schema designs if taken individually. This is what led us to think about the double-storage solution.

About scenario B, we actually do care about which time series each value comes from (kind of a worst-case scenario, right?). In real life, we want to retrieve all the metrics describing our system at a given instant. Then we would perform intensive calculations and store the results in other metrics. Note that the calculations will probably take much longer than the read/write operations, so KairosDB could actually be good enough for our real-life scenario B.

Arthur

Brian Hawkins

Jun 18, 2021, 2:47:08 PM

to KairosDB

How many different time series are you thinking you will have?

Kairos really does optimize for write throughput; most time series are write once and read never ;). The way Kairos lays out the data, you can almost always grow your C* cluster to improve performance without creating hotspots in the cluster. In general, when dealing with big data it is a good idea to write it how you want to read it, even if that means writing it multiple ways, but I would be concerned that writing the data for scenario B would cause hotspots in the cluster.
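The hotspot concern can be illustrated with a toy placement model. It is a sketch under assumptions: an MD5 hash modulo the node count stands in for Cassandra's real partitioner and token ring, and the node count, series names, and timestamp are hypothetical.

```python
import hashlib
from collections import Counter

def node_for(partition_key, n_nodes):
    """Toy placement: hash the partition key and map it onto a node.
    Real Cassandra uses its own partitioner and token ring."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

n_nodes = 4
series = [f"sensor.{i}" for i in range(1000)]  # hypothetical series names
instant = "2021-06-18T14:47:00Z"               # one time step

# Series-keyed layout: each write at this instant goes to its own series partition.
by_series = Counter(node_for(s, n_nodes) for s in series)

# Time-keyed layout (scenario B): every write at this instant shares one partition.
by_time = Counter(node_for(instant, n_nodes) for _ in series)

print("series-keyed writes per node:", dict(by_series))
print("time-keyed writes per node:  ", dict(by_time))
```

With series as the partition key the 1000 writes spread over the nodes, while with the time step as the partition key all 1000 land on whichever node owns that single partition.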