Improving query performance with high tag cardinality


gab...@waylay.io

Oct 30, 2018, 12:17:52 PM
to KairosDB
Hi,

This is a bit of a long post, so the short version is that I've got a proposal for greatly improving query performance in KairosDB when a large number of distinct tag values/combinations are involved, and am interested in getting feedback on it.

The general use case is having a large number (i.e. millions or tens of millions) of devices (e.g. temperature sensors), each of which is identified by a tag, or a set of tags (e.g. sensor id, sensor_building_floor, sensor_building, sensor_building_region). The intention is to be able to request aggregated (and non-aggregated) data for individual sensors, as well as aggregated values for all sensors on a given floor, or in a given building.

As is explained in the Query Performance wiki page [1], query performance is currently very poor when so many unique tags are involved, due to the way that "phase 1" of the query works (looking up row keys for a query).

The most common workaround for this situation currently seems to be using the sensor id as the metric name, and the measurement name (e.g. "temperature") as a tag value. Although this resolves the query performance issue for an individual sensor's data, it makes it impossible to aggregate sensor values for a given floor, building, etc.

I've put together an initial implementation, available in a branch on my fork of KairosDB [2] that attempts to resolve this issue. The general idea is that the row_keys table has an entry per tag pair for a given row key, allowing much faster lookups during phase 1 of a query.

The main drawback that I can see with this approach is the related write amplification in the row_keys table. However, writes to the row_keys table are cached, meaning that if the number of distinct tag combinations (mostly) fits in the cache, the write amplification shouldn't be a factor. It will be a factor if the number of distinct tag combinations is very large, but I would consider that an acceptable trade-off for making KairosDB usable for queries on such a workload.

I've done some performance testing locally, using a single local Cassandra node; the results are outlined in this public Gist [3]. The test methodology is explained there, but the general idea is that this change makes phase 1 of a query run in constant time (i.e. the number of tags has no effect). In this specific test, it brought a query time from 5.5 seconds down to 15 milliseconds (i.e. ~350 times faster) when 1 million distinct tag combinations were used. As noted in the linked Gist, my expectation is that running this test on a more realistic workload with distributed Cassandra nodes would probably make the query performance difference quite a bit bigger.

The code as implemented is currently not backward-compatible with the existing KairosDB schema, and it should be seen as just a proof of concept for now. However, what I've seen so far in testing makes me think that this is a viable approach to greatly improving query performance. I'd be very interested in hearing what people think of this approach in general, and if there are any clear holes in my reasoning, etc, so if you've got any thoughts on this, please let me know.

- Gabriel

Brian Hawkins

Nov 1, 2018, 1:13:41 AM
to KairosDB
Oh, that is very clever. Let me try to restate what you have done: let's say I insert data with two tags, host and customer. For an insert where host=A and customer=Foo, you would generate a hash for each key/value pair. You would insert the entire tag set into the row_keys table twice: once for the hash of host=A and once for the hash of customer=Foo. Then, when you do a query, you use the CQL IN operator to look for tag sets that match any of the tag pair hashes provided with the query.

You still have to filter the results, but you have drastically reduced the set of data to search through. One thought on your implementation: you use the CQL IN operator for the query. If I specify the hash for host=A, won't the returned tag sets contain all values of customer, so that including the hash for a specified customer tag just adds to what I have to filter, without any performance gain? The real trick is to pick the tag with the highest cardinality to use the hash from, so you reduce the search set the most.
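As an illustration of the scheme being described here, the following is a toy in-memory sketch (not KairosDB code): on write, the full tag set is stored once per tag-pair hash; on read, the buckets for the queried pairs are unioned (much like the CQL IN lookup) and then post-filtered against the full tag predicate. The hash function and data structures are placeholders rather than what the branch actually uses.

```python
# Toy model of the per-tag-pair index -- illustrative only, not KairosDB code.
import zlib
from collections import defaultdict

def tag_pair_hash(key: str, value: str) -> int:
    # Placeholder hash; the real implementation may use something different.
    return zlib.crc32(f"{key}={value}".encode())

# (metric, tag_pair_hash) -> set of row keys (full tag combinations)
index = defaultdict(set)

def insert_row_key(metric: str, tags: dict) -> None:
    row_key = tuple(sorted(tags.items()))
    # One index entry per tag pair: the same row key is written N times for
    # N tags, which is the write amplification discussed earlier.
    for key, value in tags.items():
        index[(metric, tag_pair_hash(key, value))].add(row_key)

def lookup_row_keys(metric: str, query_tags: dict) -> list:
    # Union the buckets for the queried pairs (like the CQL IN), then
    # post-filter with the full predicate. Using only the most selective
    # (highest-cardinality) pair would shrink the candidate set the most.
    candidates = set()
    for key, value in query_tags.items():
        candidates |= index[(metric, tag_pair_hash(key, value))]
    return [rk for rk in candidates
            if all((k, v) in rk for k, v in query_tags.items())]

insert_row_key("temperature", {"host": "A", "customer": "Foo"})
insert_row_key("temperature", {"host": "B", "customer": "Foo"})
print(lookup_row_keys("temperature", {"host": "A", "customer": "Foo"}))
```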

I'll have to think on this some more.  Instead of creating full indexes on each tag you are creating hash buckets.

I'd like to make it a configurable option.

Thanks for the work on this, it is great. I'll reply back when I have more time to noodle this over.

Brian

Brian Hawkins

Nov 1, 2018, 11:03:09 AM
to KairosDB
OK, this kept me up for several hours last night. I like the idea, with the following changes:
1. Move the tag hash into the partition key - otherwise it may put too much data in the row, and there is already code in there to fetch keys from multiple partitions.
2. Make the indexed row keys table a separate table.
3. Add configuration to allow the index to work in three modes: off, on (index everything), or indexing only specified metrics and tags.
4. Add functionality to build an index for a metric after the fact.

If you want to make a pull request from your changes I'll manually pull them in with the above changes.

Thanks again, this is really well done.

Brian

Riley Zimmerman

Nov 1, 2018, 12:23:21 PM
to KairosDB
Very clever indeed, thanks Gabriel! I'm very eager to try it out and see how it performs on some of my test systems as soon as I get a chance.

Brian, are your proposals 2-4 aimed at backwards compatibility? I know you solved that issue very well with the big CQL change.

Brian Hawkins

Nov 1, 2018, 12:52:47 PM
to KairosDB
Partly for backwards compatibility. While this is a great feature, it has drawbacks. If you are using Kairos for IoT, I can see it being a big win. If you are using Kairos for system telemetry, where your tag cardinality isn't that big but you have a truckload of metrics, it may not be worth the overhead of turning this on. And in all cases I can see where you would want to be selective in turning it on.

gab...@waylay.io

Nov 2, 2018, 4:07:09 AM
to KairosDB
Thanks for the feedback Brian.

This sounds good -- I'll put together a pull request for this now (I'll include at least points 1 and 2 in it) and post it as soon as it's ready.

- Gabriel

gab...@waylay.io

Nov 2, 2018, 12:10:24 PM
to KairosDB
Hi,

I've just put up a pull request that covers points 1-3 (i.e. everything except the migration tooling); it can be found here: https://github.com/kairosdb/kairosdb/pull/510

I should be able to get some work done on the migration tooling at the start of next week if that doesn't get picked up before then.

- Gabriel

gab...@waylay.io

Nov 5, 2018, 9:09:51 AM
to KairosDB
Hi,

Some follow-up on the PR [1] to improve performance with high tag cardinality: I've just added another commit that resolves the potential issue Brian pointed out around combining a low-cardinality tag with a high-cardinality tag in a query (e.g. a query with tags sensor_id=123 and customer=customer1, where there are millions of sensor_ids but only a very small number of customers).

Before this commit, such a query would perform similarly to how it did before the tag-based indexing updates. With the new commit, the same query now exhibits the same kind of performance as if only the sensor_id tag were used for filtering (double-digit milliseconds when testing locally on my machine).

- Gabriel

Sebastian Spiekermann

Jan 15, 2019, 9:59:32 AM
to KairosDB
Hi Gabriel and Brian,

any news about this Pull Request?
This looks very promising, good work! I'm eager to do some benchmarks myself if this gets merged in a beta. :)

Sebastian

Gabriel Reid

Jan 15, 2019, 11:18:31 AM
to KairosDB, Sebastian Spiekermann
Hi Sebastian,

No news from my side on this PR -- I've done a fair bit of testing with it, and I'm quite confident in its ability to handle the cases described in my earlier post here (and haven't run into any real problems with it yet). I'm not actively working on it further at the moment, but from my perspective it's pretty much ready to go.

If there's an issue with getting it merged in and you want to test it, another option is to simply build it from my branch [1].

- Gabriel


Brian Hawkins

Jan 16, 2019, 10:33:50 PM
to KairosDB
I do want to merge it in.  I'll probably manually merge it as I want to make the feature optional.  I tagged it to be included in the next release.

Brian


Sebastian Spiekermann

Jan 17, 2019, 1:21:39 AM
to KairosDB
Hi Brian and Gabriel,

that's fantastic news! Looking forward to test it as soon as it is merged.

Couple of things, though:
  1. Gabriel already made this optional by adding the configuration parameter kairosdb.datastore.cassandra.tag_indexed_row_key_lookup_metrics, didn't he? That's how I'm interpreting the description...
  2. This is a question for both of you regarding the commit description: "Migration [...] is not included in this commit". Could you please make sure to provide us with proper instructions on what we need to do if we want to use that feature? I.e. create the new table using configurations X and Y, delete the old... that kind of stuff.
Thanks to you both!
Sebastian

Gabriel Reid

Jan 17, 2019, 4:07:09 AM
to KairosDB
Hi,

On your two points:

1. Yes, that's correct: this is already optional via the configuration parameter as you pointed out.
2. On the subject of migration, Brian had talked about implementing functionality to re-index data. For now though, if you want to migrate existing data to the new storage format, the steps would be as follows (a rough sketch of the export and re-import steps over the REST API follows the list):

  1. Export the metric (or metrics) that you want to migrate
  2. Delete the metric that is being migrated
  3. Change the tag_indexed_row_key_lookup_metrics config parameter for those metrics
  4. Re-import the metrics that are being migrated
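For anyone who wants to script those steps, a rough sketch using the standard REST endpoints (/api/v1/datapoints/query and /api/v1/datapoints) might look like the following. The host/port, time range, and tag names are placeholders, and this is not official migration tooling; for very large datasets the regular export/import tooling is probably the better route.

```python
# Rough sketch of the export / re-import steps over the REST API.
# Assumptions: KairosDB on localhost:8080, and you know the metric's tag names.
import requests

KAIROS = "http://localhost:8080"

def export_metric(name, tag_names, start_ms, end_ms):
    """Query the metric grouped by all of its tags, so each result is one series."""
    query = {
        "start_absolute": start_ms,
        "end_absolute": end_ms,
        "metrics": [{
            "name": name,
            "group_by": [{"name": "tag", "tags": tag_names}],
        }],
    }
    resp = requests.post(f"{KAIROS}/api/v1/datapoints/query", json=query)
    resp.raise_for_status()
    return resp.json()["queries"][0]["results"]

def reimport_metric(name, results):
    """Re-post the exported series. Run this after deleting the metric and
    updating tag_indexed_row_key_lookup_metrics to cover it (steps 2 and 3)."""
    payload = []
    for series in results:
        if not series["values"]:
            continue
        # Grouped by every tag, so each tag key maps to a single value here.
        tags = {k: v[0] for k, v in series["tags"].items()}
        payload.append({"name": name, "tags": tags, "datapoints": series["values"]})
    requests.post(f"{KAIROS}/api/v1/datapoints", json=payload).raise_for_status()
```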

I don't believe there are any table creation/dropping changes that need to be made in order to do the migration (the tag-indexed row key lookup table is created automatically if it doesn't exist).

- Gabriel

Sebastian Spiekermann

Jan 30, 2019, 3:08:54 AM
to KairosDB
Hi Gabriel,

I've decided to give your branch a try!

Setup:
There are 2 KairosDB instances now: Kairos-A (release 1.2.2) and Kairos-B (from your branch). Both are connected to the same Cassandra cluster, each having its own keyspace.
We're ingesting the same metrics into both KairosDBs. The time-to-live is configured as 6 hours.

The queries we've tried only show a slight performance boost. So my first question is: is setting the property tag_indexed_row_key_lookup_metrics: "*" sufficient? I had expected a more noticeable boost, as that metric has a high cardinality...
I'll add an example query at the end (*1).

There's a second thing I'm curious about, and that's the compaction strategy of the "data_points" table: Kairos-A uses TWCS with a compaction window of 30 minutes. Kairos-B uses STCS (it was created when I first started that KairosDB instance).
Is there any preferred compaction strategy? AFAIK the TWCS is great for time series...

Thanks,
Sebastian


(*1)

The following query is being run against both environments.
{
  "metrics": [
    {
      "tags": {
        "type": ["percent_inodes"],
        "type_instance": ["used"]
      },
      "name": "collectd.df",
      "group_by": [
        {
          "name": "tag",
          "tags": ["host", "plugin_instance"]
        }
      ]
    }
  ],
  "plugins": [],
  "cache_time": 0,
  "start_relative": {
    "value": "1",
    "unit": "hours"
  }
}

That metric has 4 tags with high cardinality, as the Grafana screenshot shows:

[Grafana screenshot: collectddf.png]


Here are the query results:

Kairos-A
Query Time: 10,974 ms
Sample Size: 1,259,863
Data Points: 1,259,863

Kairos-B
Query Time: 10,081 ms
Sample Size: 1,259,420
Data Points: 1,259,420

Gabriel Reid

Jan 30, 2019, 8:16:59 AM
to KairosDB, Sebastian Spiekermann
Hi Sebastian,

Thanks for giving this a try!

Setting the config parameter "tag_indexed_row_key_lookup_metrics" to "*" should indeed be enough to take advantage of this functionality.

As for the query performance, I'm not totally sure about the distribution of your data, but an important point to note about the implementation is that it is strictly for improving query performance when you are filtering out a large proportion of the data points. I saw from your results that the queries are using about 1.2 million data points for the metric "collectd.df", so the question is how many data points are being filtered out by the tag filter of type="percent_inodes" and type_instance="used".

If the majority of the entries for "collectd.df" have either type="percent_inodes" or type_instance="used", then it's expected that the query performance will be similar between both versions. However, if only a very small percentage of the total entries for "collectd.df" pass this tag filter, then I would expect a much bigger performance increase than what you're seeing.

As an extreme artificial example, if you had a tag value that changed for every single measurement that was ingested (e.g. a uuid), then filtering on this tag would be much faster in the new version than in the old version. However, if you're filtering on tags that don't allow a large number of records to be filtered out, then you won't get much of an improvement (if any).

Could you let me know if this aligns with what you've got in your data? I.e. is the tag filter of type="percent_inodes" and type_instance="used" filtering out a large proportion of the "collectd.df" measurements?

As for your question about compaction strategy, I'm sorry to say that I know almost nothing about the different compaction strategies in Cassandra. Maybe someone else on the list can shed some light on this?

- Gabriel


Sebastian Spiekermann

Jan 30, 2019, 9:25:53 AM
to KairosDB
Hi Gabriel,

thanks for looking into it.

Using these tag filters results in roughly 15% of the raw datapoints. Following your description, I'd expect a significant performance boost in this situation, but I'm not seeing that at all...

I did the queries once more for both KairosDB instances, with and without the tag filters. Please note that I had to reduce the relative start time from 1 hour to 15 minutes because the huge amount of raw data led to "insufficient memory" exceptions on the KairosDB side.
Here are the results:

Kairos-A (release 1.2.2)

query without filters
Query Time: 32,995 ms
Sample Size: 3,814,889
Data Points: 3,814,889

query with filters
Query Time: 5,979 ms
Sample Size: 313,100
Data Points: 313,100


Kairos-B (your branch)

query without filters
Query Time: 30,597 ms
Sample Size: 3,758,180
Data Points: 3,758,180

query with filters
Query Time: 6,376 ms
Sample Size: 306,853
Data Points: 306,853

I can't for the life of me imagine why your description and your own analyses are in complete contrast to my tests... Both KairosDBs use the same Cassandra cluster, both are running on the same Linux machine, using exactly the same system resources...

Unfortunately I'm not allowed to share the real data here. But it should be simple enough for you to create a similar metric with the following 4 tag keys and the appropriate number of different tag values, inserting dummy values for each combination every minute:
  1. host (970 tag values)
  2. plugin_instance (5204 tag values)
  3. type (4 tag values)
  4. type_instance (3 tag values)
Maybe you could create this data and try queries yourself?
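For what it's worth, a rough data generator along those lines is sketched below, with the tag counts scaled down so the example stays small. The host/port and most of the tag values are placeholders; only collectd.df, percent_inodes and used correspond to the metric described above.

```python
# Sketch of generating dummy data shaped like the metric described above
# (4 tag keys, one value per tag combination each minute) via the REST API.
import itertools
import random
import time

import requests

KAIROS = "http://localhost:8080"

hosts = [f"host-{i}" for i in range(20)]              # 970 values in the real data
plugin_instances = [f"disk-{i}" for i in range(50)]   # 5204 values in the real data
types = ["percent_inodes", "type-b", "type-c", "type-d"]   # 4 values (placeholders)
type_instances = ["used", "instance-b", "instance-c"]      # 3 values (placeholders)

def one_minute_batch(ts_ms):
    payload = []
    for host, pi, t, ti in itertools.product(hosts, plugin_instances, types, type_instances):
        payload.append({
            "name": "collectd.df",
            "tags": {"host": host, "plugin_instance": pi, "type": t, "type_instance": ti},
            "datapoints": [[ts_ms, random.random() * 100]],
        })
    return payload

now_ms = int(time.time() * 1000)
for minute in range(60):  # backfill one hour of per-minute points
    batch = one_minute_batch(now_ms - minute * 60_000)
    requests.post(f"{KAIROS}/api/v1/datapoints", json=batch).raise_for_status()
```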

Sebastian

Gabriel Reid

Jan 30, 2019, 11:20:23 AM
to KairosDB, Sebastian Spiekermann
Hi Sebastian,

Just to get some initial checks out of the way, could you verify the following:

1. you're running a version built from git hash 9aa6572 of this branch and repo: https://github.com/gabrielreid/kairosdb/tree/improved_performance_high_tag_cardinality
2. an INFO-level log line similar to "Using tag-indexed row key lookup for all metrics" is being output by KairosDB shortly after startup (or during the first query or ingest)
3. the "tag_indexed_row_keys" table exists and contains records in Cassandra

Assuming all of those things are true (I assume they are), then there must be something else going on. My first guess would be that this change doesn't make enough of a difference due to the relatively low cardinality of the tags that you're filtering on. To verify this, could you try benchmarking a query that only filters on a single selective tag value (e.g. query on a single host tag, without using any other filtering parameters). 

A second thing to try would be removing the group_by operator from your benchmark query, just to see if that changes the performance characteristics at all. There's a possibility that using a group-by is affecting the row key lookup logic, although I'm also kind of doubting that that's the case.

If you could check these things and then post back here, that would be great. If none of these lead you closer to a solution then I'll look into it further.

- Gabriel

Brian Conn

Jan 30, 2019, 4:14:00 PM
to KairosDB
Hi all,

We're planning on trying this branch out on our test cluster. We're having issues in our production cluster with high cardinality queries not being as fast as we'd like (high hundreds of ms). For metrics in our production cluster we can have up to ~100,000 unique values on a tag we frequently query by. We're hoping this improves performance for those queries. We're really looking forward to having this (and any automated migration as a bonus) in the next release.

Sebastian Spiekermann

Jan 31, 2019, 2:28:21 AM
to KairosDB
Hi Gabriel,

this is kind of embarrassing: indeed, I was NOT running your improvement branch. I checked out your github repository but forgot to switch to your branch...

I'm very sorry! I'll perform my test queries again in a few hours (purged the cassandra keyspace) and get back to you asap!

Sebastian

Sebastian Spiekermann

Jan 31, 2019, 7:54:30 AM
to KairosDB
Hi Gabriel,

good news! :) The query performance HAS IMPROVED significantly, I'm happy:

Kairos-A (1.2.2)

Query Time: 11,222 ms
Sample Size: 1,205,747
Data Points: 1,205,747

Kairos-B (your branch)

Query Time: 6,683 ms
Sample Size: 1,204,574
Data Points: 1,204,574

I did some more queries (heavy and light ones) and had no problems whatsoever. Furthermore, I can confirm that all (heavy) queries that use tag filters benefit the most; not using tag filters results in roughly the same query times as the 1.2.2 KairosDB, which is great, IMHO.

I hope Brian reads this, I'm really looking forward to using this feature in production with the upcoming release of KairosDB.

Thanks a lot, Gabriel!

Sebastian

Sebastian Spiekermann

Jan 31, 2019, 8:00:15 AM
to KairosDB
I'm adding one more real-life query performance test here, because it's just great. I've added another tag filter (the "host" tag in the previously described metric), so there's just one tag left to group by:

Kairos-A (1.2.2)

Query Time: 4,153 ms
Sample Size: 1,440
Data Points: 1,440

Kairos-B (your branch)

Query Time: 165 ms
Sample Size: 1,440
Data Points: 1,440

One word to Gabriel: awesome!

Sebastian

Gabriel Reid

Jan 31, 2019, 8:08:57 AM
to KairosDB, Sebastian Spiekermann
Hi Sebastian,

Great to hear you got it up and running, and that the results are in line with what I was getting! Thanks a lot for the detailed reporting back on this.

- Gabriel

Nick Hatfield

Feb 7, 2019, 11:03:07 AM
to KairosDB
Hi Gabriel,

Forgive my ignorance here, but for some reason I'm unable to query data that is older than a day. I have the original implementation running side by side with your improved branch. On the original I can query a metric all the way back a few months, while the same metric on your branch only goes back 24 hours. I'm sure that it is something I'm doing wrong; would you be able to give me some ideas? I have enabled both the write_cluster and read_cluster sections for the Cassandra backend. Everything connects as expected, without errors.

Relevant conf:
write_cluster: {
    # name of the cluster as it shows up in client specific metrics
    name: "write_cluster"
    keyspace: "kairosdb"
    replication: "{'class': 'NetworkTopologyStrategy','us-east' : '3'}"
    cql_host_list: ["cas01", "cas02", "cas03"]

    # Set this if this kairosdb node connects to cassandra nodes in multiple datacenters.
    # Not setting this will select cassandra hosts using the RoundRobinPolicy, while setting this will use DCAwareRoundRobinPolicy.
    #local_dc_name: "<local dc name>"

    # Control the required consistency for cassandra operations.
    # Available settings are cassandra version dependent:
    read_consistency_level: "ONE"
    write_consistency_level: "TWO"

    # The number of times to retry a request to C* in case of a failure.
    request_retry_count: 2

    connections_per_host: {
        local.core: 4
        local.max: 100

        remote.core: 4
        remote.max: 10
    }

    # If using cassandra 3.0 or later consider increasing this value
    max_requests_per_connection: {
        local: 128
        remote: 128
    }

    max_queue_size: 500

    # for cassandra authentication use the following
    #auth.[prop name]=[prop value]
    # example:
    #auth.user_name=admin
    #auth.password=eat_me

    # Set this property to true to enable SSL connections to your C* cluster.
    # Follow the instructions found here: http://docs.datastax.com/en/developer/java-driver/3.1/manual/ssl/
    # to create a keystore and pass the values into Kairos using the -D switches
    use_ssl: false
}

read_cluster: [
    {
        name: "read_cluster"
        keyspace: "kairosdb"
        replication: "{'class': 'NetworkTopologyStrategy','us-east' : '3'}"
        cql_host_list: ["cas01", "cas02", "cas03"]
        #local_dc_name: "<local dc name>"
        read_consistency_level: "ONE"
        write_consistency_level: "TWO"

        connections_per_host: {
            local.core: 4
            local.max: 100
            remote.core: 4
            remote.max: 10
        }

        max_requests_per_connection: {
            local: 128
            remote: 128
        }

        max_queue_size: 500
        use_ssl: false

        # Start and end date are optional configuration parameters
        # The start and end date set bounds on the data in this cluster
        # queries that do not include this time range will not be sent
        # to this cluster.
        #start_time: "2001-07-04T12:08-0700"
        #end_time: "2001-07-04T12:08-0700"
    }
]


Thanks for all your help

Gabriel Reid

Feb 8, 2019, 3:26:55 AM
to KairosDB, Nick Hatfield
Hi Nick,

If I'm understanding things correctly, you're running both Kairos instances (i.e. the original branch and the altered branch) pointing to the same backing Cassandra store. I'm assuming that you're also querying historical data that was in the system before you started using the altered branch. Is that correct?

In that case, I'd say that what you're seeing is expected, as the indexing works differently in my branch of Kairos, and the new indexing won't find the historical data that was ingested before you started using the altered branch. In order to query historical data, you'd need to migrate it to the new indexing method (which currently means exporting and then re-importing data).

That being said, I didn't notice any reference to a "tag_indexed_row_key_lookup_metrics" configuration parameter in the config that you posted, so it looks to me like the alternate indexing (i.e. the functionality in my altered branch) isn't actually activated in your case. 

Could you clarify the situation a bit? Specifically:

* are you using two versions of KairosDB on the same backing store?
* do you have "tag_indexed_row_key_lookup_metrics" configured somewhere in one of your configs?

- Gabriel


Nick Hatfield

Feb 8, 2019, 9:44:13 AM
to KairosDB
Thanks for getting back to me so quickly... You are correct that this is historical data. I am using the same backing store (Cassandra cluster) on both the original branch and your improved branch, and I do have `tag_indexed_row_key_lookup_metrics: "*"` set. I believe the problem I'm having is, as you stated, "In order to query historical data, you'd need to migrate it to the new indexing method (which currently means exporting and then re-importing data)."

I'll head down this path, as it shouldn't be too difficult to accomplish, just a long-running task. There are a few hundred TB of data.

Brian Hawkins

Feb 12, 2019, 11:23:53 AM
to KairosDB
Thanks everyone for testing this branch out.  It saves me a lot of time.

Here are some thoughts as I've read through this.

Compaction strategies: the data_points column family is the best candidate for TWCS. The row key indexes could also use TWCS, but the window would need to be made large enough to encompass the entire 3-week period; otherwise use leveled. Use leveled for everything else.

Building the index: You don't technically have to export everything. You definitely don't have to delete before importing it back, as it will just overwrite the existing data. All you need is one data point for every tag combination in every 3-week indexing period.
Here is how I would try doing it (a rough sketch in code follows the list):
  1. For the metric you want to re-index, I would query it one week at a time and set it to group by every tag in the metric.
  2. I'd set the limit to 1 on the results.
  3. Then use this post to turn the query results into an insert: https://github.com/kairosdb/kairosdb/wiki/Script-to-turn-a-Query-into-an-Insert and send it back to Kairos.
I'd also do a restart on the Kairos instance I was running this against, to make sure the metric cache was cleared out before indexing.
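A rough sketch of that recipe over the REST API (as an alternative to the wiki script linked above): the host/port, metric name, and tag names are placeholders, and it simply follows the recipe as written -- query grouped by every tag with limit 1, one week at a time, then post the returned points back so the new index entries get written.

```python
# Sketch of the re-indexing recipe: per one-week window, query the metric
# grouped by every tag with limit 1, then write the returned points back so
# the new index rows get created (the data points themselves are overwritten).
import requests

KAIROS = "http://localhost:8080"
WEEK_MS = 7 * 24 * 3600 * 1000

def reindex_window(metric, tag_names, start_ms, end_ms):
    query = {
        "start_absolute": start_ms,
        "end_absolute": end_ms,
        "metrics": [{
            "name": metric,
            "limit": 1,  # per the recipe above: limit the results to 1
            "group_by": [{"name": "tag", "tags": tag_names}],
        }],
    }
    resp = requests.post(f"{KAIROS}/api/v1/datapoints/query", json=query)
    resp.raise_for_status()

    payload = []
    for series in resp.json()["queries"][0]["results"]:
        if not series["values"]:
            continue
        # Grouped by every tag, so each tag key maps to one value per series.
        tags = {k: v[0] for k, v in series["tags"].items()}
        payload.append({"name": metric, "tags": tags, "datapoints": series["values"][:1]})
    if payload:
        requests.post(f"{KAIROS}/api/v1/datapoints", json=payload).raise_for_status()

def reindex(metric, tag_names, start_ms, end_ms):
    # One week at a time, as suggested above.
    for window_start in range(start_ms, end_ms, WEEK_MS):
        reindex_window(metric, tag_names, window_start, min(window_start + WEEK_MS, end_ms))
```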

Brian

Brian Hawkins

Feb 12, 2019, 2:38:12 PM
to KairosDB
I just went through the second half of the pull request trying to get my head around what you have done.  So did you read about that technique for estimating results or come up with it on your own?  It is absolutely brilliant! I love it.

Brian 

Gabriel Reid

Feb 13, 2019, 2:24:23 AM
to KairosDB
Thanks for the kind words Brian :-)

I don't think I specifically read about the technique that I used there, but I'm sure it was inspired by other things I've seen in the past. 

- Gabriel

Brian Hawkins

Feb 15, 2019, 9:28:49 AM
to KairosDB
I've merged this into develop. I've changed up the way you configure it a bit; it is now configured on a per-cluster basis.

I'm going to try and come up with an endpoint you can hit that will build the new index for existing metrics that you would like to switch over.

I just had a thought while I was typing this.  I may be able to extend the ROW_KEY_TIME_INDEX to indicate what type of indexing was done on the metric when it was inserted.  That way it will be dynamic and you can still query it after you switch.

Brian


Brian Conn

Feb 20, 2019, 5:25:37 PM
to KairosDB
Hi,

We've tried out the latest develop branch and see that the new index table gets created with all tags indexed for a datapoint. Is it possible to index only a single tag? We apply multiple tags to each datapoint, but always query using a single tag. We'd like the index to apply only to this tag. Is this possible, or should we modify the source to filter out other tags before inserting and querying data? Thanks,

Brian Hawkins

Feb 21, 2019, 8:17:57 AM
to KairosDB
So at the moment it indexes on all tags.  I don't think it would be hard to extend the configuration to allow you to limit the indexing to certain tags.

I'd also like to change the code so the "wildcard" index is just the old row key index. That way, if you turn the indexing off on a metric, queries will still work.

And I'd like to add an api that lets you build indexes.

Brian

Brian Conn

Feb 23, 2019, 4:47:32 PM
to KairosDB
Hey Brian,

We tested out a hard-coded branch where we index only the tag we want to query on, and had some great success. We're pushing more data into our test cluster over the weekend and expect the indexed queries to stay O(1) (we were at ~50ms) and the unindexed queries (we're pushing duplicate data in and only indexing one of the metrics) to grow as O(n) (we were at ~1000ms). Over the weekend we're pushing in 5x the data we had with our initial tests.

We'd really like an option to save only a single tag in the index. I can try submitting a PR if it would be helpful. We'll also need the API to build the index before going live with this, as we have a lot of data already in KDB which wouldn't be easy to push into a new cluster. Thanks again, this is looking very promising.

Brian Hawkins

Feb 27, 2019, 8:47:18 AM
to KairosDB
This is good news; this tag cardinality issue has plagued me for a long time.

I'm about halfway done with adding the ability to index a specific tag.

I'll work on the index api next.

Brian

Brian Conn

Feb 27, 2019, 9:51:55 AM
to KairosDB
Hey Brian,

I opened a PR yesterday which is a generalization of my hard-coded branch: https://github.com/kairosdb/kairosdb/pull/527. I don't think this is 100% there, but I wanted to get something out there. Hopefully it hasn't duplicated too much of the work you have done. Thanks again, and we're very excited to get these features in the next release!

Brian Hawkins

Feb 28, 2019, 8:57:20 AM
to KairosDB
I have a question about the tag pair hash you use in the partition key. I'm trying to identify the benefit of it over just using the tag key=value string. It doesn't save on inserts. It doesn't make lookups any faster; in fact, it may slow them down, as you could have hash collisions and so get more tags back than you wanted. It does save a little bit on space. I'm thinking I'll switch it unless someone thinks of a reason not to.

Brian

Gabriel Reid

Feb 28, 2019, 9:52:35 AM
to KairosDB, Brian Hawkins
Hi Brian,

From what I recall, the main reasoning for this approach was purely performance. As you mentioned, it does save a little bit on (storage) space. This same space saving is paid back repeatedly, seeing as this is a table that generally has a much higher read load than write load, so the general overhead of parsing tag pairs from strings instead of a 32-bit integer, sending this data over the wire, keeping it in various caches and key indexes within Cassandra, etc., could possibly add up to a somewhat significant impact over time.

On the other hand, only storing the hash of course makes debugging a bit more difficult, and it's additional logic to consider, so there's also certainly a case for just storing the plaintext tag pair.

- Gabriel

Brian Hawkins

Mar 6, 2019, 9:46:08 AM
to KairosDB
I've just committed per-tag indexing. You may have to drop your tag_indexed_row_keys table, as I've changed a column from an int to a string.

Brian


Brian Conn

Mar 6, 2019, 10:08:57 AM
to KairosDB
Awesome, thank you Brian. Anything I can help with on backfilling old data into the index? You mentioned you wanted to create an endpoint for it and that may be the only task remaining before this is releasable?

Brian Hawkins

Mar 7, 2019, 9:32:14 AM
to KairosDB
Yes, the help would be appreciated.

I'm thinking of a REST API that will index a metric for a given time period. I'd first verify that the metric in question is configured to be indexed.

A pull request for that would be awesome.

Brian

Brian Conn

Mar 10, 2019, 12:55:33 PM
to Brian Hawkins, KairosDB
Sounds good, I'll give it a shot this week.
--
- Brian Conn

Brian Conn

Mar 12, 2019, 10:46:51 AM
to Brian Hawkins, KairosDB
Hey Brian,

I've started work on this (I'll be working on https://github.com/kairosdb/kairosdb/compare/develop...metricly:feature/index-backfill). A couple of questions come to mind as I start this work:
- Do you see this endpoint as sync or async? As I bubble up the index statements, I see that your add statements are all batched. As these are reindex statements, are they worth performing synchronously as a sort of rate-limiting mechanism? I'd prefer that, as I'll be backfilling a lot of data and want to push KDB as fast as it can go without overloading it. Otherwise my backfill scripts outside of KDB will need to rate-limit for KDB. If performance is going to be way worse, though, then I should still batch them.
- I expect my endpoint to run a normal query for the data (turn a metric name and time period into a Kairos query object and call all the normal methods), then feed those datapoints directly into the createIndexStatements method as if they were new datapoints. This won't rewrite the datapoints and will only insert the necessary index rows. Does this seem reasonable to you?

Thanks,
--
- Brian Conn

Brian Hawkins

Mar 12, 2019, 6:22:55 PM
to KairosDB
So there is a CQLBatch class that you can get from the CQLBatchFactory that can be injected.  The CQLBatch has methods for adding row keys and sending them off to C*.

CassandraDatastore has a method queryMetricTags that you can model this API after. All it does is query the row keys for a given time range; for your case, it would compute the indexed version of those and send them back to C* using the CQLBatch.

I wouldn't over-engineer it for speed at first. Most of the time will be spent getting the old row keys out of C*. Once you get something working you can speed test it with data from the blast server.

Brian


Brian Conn

Mar 14, 2019, 4:12:54 PM
to Brian Hawkins, KairosDB
Ok, I've made a little progress and have more questions.

- When indexing a small number of samples this batch size gets enormous. I'm assuming I'm creating some duplicate statements. How could I deduplicate them to make this a more reasonable batch insert? https://github.com/kairosdb/kairosdb/compare/develop...metricly:feature/index-backfill#diff-5cc760fcd7185d42167a9ce3ff854de4R417
- Anything else I'm missing or is this the bones of it?
--
- Brian Conn

Brian Hawkins

Mar 14, 2019, 6:39:47 PM
to KairosDB
Inline below:

On Thursday, March 14, 2019 at 2:12:54 PM UTC-6, Brian Conn wrote:
> Ok, I've made a little progress and have more questions.

Spot on!

> - When indexing a small number of samples this batch size gets enormous. I'm assuming I'm creating some duplicate statements. How could I deduplicate them to make this a more reasonable batch insert?

I would submit the batch after x number of statements have been added to it, or it could explode for some metrics.

Just call into createIndexStatements, because you don't need to insert back into the row keys table. Then I think all inserts should be unique.

I guess all you need for this is a metric name and a time range. Let's do query parameters.

> - Anything else I'm missing or is this the bones of it?

Keep it up.

Brian Conn

Mar 14, 2019, 7:01:26 PM
to Brian Hawkins, KairosDB
Thanks for the quick feedback, I'm glad I'm on the right track. Sounds good on query params.

I was getting a huge number of insert statements. I started batch submitting when I hit a few thousand statements, but it seemed like far too many still. My test data only had a few unique tag values. I think it's because I call createIndexStatements on every single resulting row: https://github.com/kairosdb/kairosdb/compare/develop...metricly:feature/index-backfill#diff-5cc760fcd7185d42167a9ce3ff854de4R414.

I'm not sure I understand what you're suggesting. Don't I need to create insert statements on every matching row?

Lastly, would you prefer I open a PR and continue this conversation there? Thanks again,
--
- Brian Conn

Brian Hawkins

Mar 16, 2019, 9:46:15 AM
to KairosDB
Yes, start a PR and we can continue the conversation there with comments in the code.

Thanks


Brian Conn

Mar 16, 2019, 1:03:53 PM
to Brian Hawkins, KairosDB
Sounds good, I've cleaned up the branch and opened a PR with a few comments: https://github.com/kairosdb/kairosdb/pull/534
--
- Brian Conn

Riley Zimmerman

Jul 31, 2019, 9:31:55 AM
to KairosDB
Hi,

I was wondering if there was any update on how the enhancement is going? I see it was merged into develop, but there haven't been any official releases with it. If there is any testing or other checks that need help, please let me know and I'll try to help any way I can.