Question regarding sensible primary key / sampling expression choice


sol...@ziu.info

Oct 20, 2017, 9:02:44 AM
to ClickHouse
Hi,

We will be using ClickHouse to store data that arrives at, on average, 5k-6k requests/second (with peaks capped around 30k/s). The logs originate from nginx and will be processed by our logstash plugin before ending up partly in Elasticsearch and partly in ClickHouse. Initially we experimented with InfluxDB, but it essentially died under this kind of load in a "high cardinality" scenario.

So far our table definition (which works quite well; we also tested with 3 shards and 2 replicas, and it also works superbly) has looked like this:

        CREATE TABLE IF NOT EXISTS dbtest.access_log (
                daystamp                Date,
                timestamp               DateTime,
                node                    String,
                bytes                   UInt32,
                status                  UInt16,
                path_prefix             String,
                protocol                String,
                referrer                String,
                tenant_id               UInt32,
                vhost_id                UInt32,
                cache_status            String,
                country                 FixedString(2),
                asn                     UInt32,
                platform                String,
                platform_type           String
        ) ENGINE = MergeTree(daystamp, timestamp, 8192)

This is without any sampling key - with the required day-granularity daystamp, and second-granularity timestamp as the primary key. The questions:

  • is this primary key sensibly chosen? Most official and not-so-official examples I've seen use 'daystamp' at least as part of the primary key. 'timestamp' as above felt like a sensible choice, but maybe it's too fine-grained? Would something like
    MergeTree(daystamp, (daystamp, timestamp), 8192)
    even make sense (considering timestamp is more fine-grained than daystamp) and/or be a better choice?
  • with our expected load (5k on average, lengthy peaks at 30k) - does the suggested '8192' need any adjustment?
  • if the setup above is correct, would a sampling expression make any sense? As far as I understand it, its main role is to guard against memory issues - and with the primary key set as in the example above, it seems completely pointless (after a few days of tests, generating logs at 20k/s, ClickHouse never used more than a few percent of available memory - at least not with very simple selects).
  • alternatively, would something like:
    MergeTree(daystamp, intHash64(timestamp), (daystamp, intHash64(timestamp)), 8192)
    be more useful / better in practice? On a daily basis that's an estimated ~500M requests, so the sampling part would likely be essential - but perhaps a simple second-grained timestamp as the primary key is the correct choice?
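For concreteness, the sampled variant from the last bullet, spelled out in full (same columns as above, only the engine line changes; if I read the docs right, the old-style MergeTree parameters are: date column, sampling expression, primary key, index granularity):

```sql
CREATE TABLE IF NOT EXISTS dbtest.access_log (
        daystamp  Date,
        timestamp DateTime
        -- ... remaining columns as in the definition above ...
) ENGINE = MergeTree(daystamp, intHash64(timestamp),
                     (daystamp, intHash64(timestamp)), 8192)
```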
Any other remarks or suggestions appreciated =)

On a related note - is there any chance for the toDateTime() parser to understand milliseconds (even if they are meaningless)? From what I've seen it accepts ISO 8601 fine, but fails whenever e.g. .000 is encountered.

Vitaliy Lyudvichenko

Oct 24, 2017, 9:35:44 AM
to ClickHouse


On Friday, October 20, 2017 at 4:02:44 PM UTC+3, sol...@ziu.info wrote:

  • is this primary key sensibly chosen? Most official and not-so-official examples I've seen use 'daystamp' at least as part of the primary key. 'timestamp' as above felt like a sensible choice, but maybe it's too fine-grained? Would something like
    MergeTree(daystamp, (daystamp, timestamp), 8192)
    even make sense (considering timestamp is more fine-grained than daystamp) and/or be a better choice?
It is OK to use only timestamp as the primary key.
But if you need to filter data via the index on an additional column, for example tenant_id, then a PK of (daystamp, tenant_id) will be significantly better than (timestamp, tenant_id).
So, for queries like SELECT * FROM table WHERE tenant_id = ID, a table with the second PK will almost degrade to a full scan.

Consequently, if you want to use sampling effectively, it is better to use (daystamp, intHash64(smth)) instead of (timestamp, intHash64(smth)).
Both will consume approximately the same amount of RAM, but the second one will use IO more intensively due to full scans.
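For example (tenant_id is chosen here purely as an illustration; any column you commonly filter on works the same way; in the old syntax the sampling expression must also appear in the primary key):

```sql
-- Old-syntax parameters: date column, sampling expression, primary key, granularity
CREATE TABLE dbtest.access_log (
        -- ... same columns as in the question ...
) ENGINE = MergeTree(daystamp, intHash64(tenant_id),
                     (daystamp, tenant_id, intHash64(tenant_id)), 8192)

-- Such a PK lets this query use the index instead of scanning everything:
SELECT count() FROM dbtest.access_log WHERE tenant_id = 42
```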
  • with our expected load (5k on average, lengthy peaks at 30k) - does the suggested '8192' need any adjustment?
The smaller index_granularity is (8192 in your example), the fewer extra rows are read (in the worst case a block of 8192 rows is read when only 1 is required), but the size of the index grows.
For example, with index_granularity=1 no extra rows are read, but the size of the index (.idx + .mrk files) will be enormous.

So, it makes sense to decrease index_granularity if your queries read only a small range of the data and you have long strings that are expensive to read.
But if you have large range queries and a not too fine-grained PK, you can keep a big index_granularity.
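To get a feeling for the trade-off at your estimated ~500M rows/day, the number of index marks per day can be computed directly (assumed figures, rounded):

```sql
-- Approximate number of index marks per day for two granularities
SELECT
    intDiv(500000000, 8192) AS marks_at_8192,  -- ~61 thousand marks/day
    intDiv(500000000, 1024) AS marks_at_1024   -- ~488 thousand marks/day
```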

  • if the setup above is correct, would a sampling expression make any sense? As far as I understand it, its main role is to guard against memory issues - and with the primary key set as in the example above, it seems completely pointless (after a few days of tests, generating logs at 20k/s, ClickHouse never used more than a few percent of available memory - at least not with very simple selects).
Sampling reduces not only memory consumption but IO and CPU as well.
If you don't have performance problems, you might not need it.
But sampling is quite easy to maintain, so you could add it now for the future.
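With a sampling key in place, a query like the following reads roughly a tenth of the rows, and so roughly a tenth of the IO and CPU (a sketch against the table from the question):

```sql
-- Counts are scaled back up; results are approximate
SELECT status, count() * 10 AS estimated_requests
FROM dbtest.access_log SAMPLE 1/10
WHERE daystamp = today()
GROUP BY status
```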

  • alternatively, would something like:
    MergeTree(daystamp, intHash64(timestamp), (daystamp, intHash64(timestamp)), 8192)
    be more useful / better in practice? On a daily basis that's an estimated ~500M requests, so the sampling part would likely be essential - but perhaps a simple second-grained timestamp as the primary key is the correct choice?
I mentioned it above.  
Any other remarks or suggestions appreciated =)

On a related note - is there any chance for the toDateTime() parser to understand milliseconds (even if they are meaningless)? From what I've seen it accepts ISO 8601 fine, but fails whenever e.g. .000 is encountered.

Maybe a special function toDateTimeXXXX() would be better?
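In the meantime, one possible workaround is to strip the fractional seconds before parsing (assuming the millisecond part always follows a fixed 'YYYY-MM-DD HH:MM:SS' prefix):

```sql
-- Keep only the first 19 characters, dropping the '.000' suffix
SELECT toDateTime(substring('2017-10-20 09:02:44.000', 1, 19))
```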