Question regarding sensible primary key / sampling expression choice


sol...@ziu.info

Oct 20, 2017, 9:02:44 AM
to ClickHouse
Hi,

We will be using ClickHouse to store data that arrives at, on average, 5k-6k requests/second (with peaks capped around 30k/s). The logs originate from nginx and will be processed by our logstash plugin before ending up partly in Elasticsearch and partly in ClickHouse. Initially we experimented with InfluxDB, but it essentially died under this kind of load in a "high cardinality" scenario.

So far our table definition (which works quite well; we also tested with 3 shards and 2 replicas, and it also works superbly) has looked like this:

        CREATE TABLE IF NOT EXISTS dbtest.access_log (
                daystamp                Date,
                timestamp               DateTime,
                node                    String,
                bytes                   UInt32,
                status                  UInt16,
                path_prefix             String,
                protocol                String,
                referrer                String,
                tenant_id               UInt32,
                vhost_id                UInt32,
                cache_status            String,
                country                 FixedString(2),
                asn                     UInt32,
                platform                String,
                platform_type           String
        ) ENGINE = MergeTree(daystamp, timestamp, 8192)

This is without any sampling key - with the required day-granularity daystamp, and second-granularity timestamp as the primary key. The questions:

  • is this primary key sensibly chosen? Most official and not-so-official examples I've seen use 'daystamp' at least as part of the primary key. 'timestamp' as above felt like a sensible choice, but maybe it's too fine-grained? Would something like
    MergeTree(daystamp, (daystamp, timestamp), 8192)
    even make sense (considering timestamp is more fine-grained than daystamp) and/or be a better choice?
  • with our expected load (5k on average, lengthy peaks at 30k) - does the suggested '8192' need any adjustment?
  • if the setup above is correct, would a sampling expression make any sense? As far as I understand it, its main role is to guard against memory issues - and with the primary key set as in the example above, it seems completely pointless (after a few days of tests, generating logs at 20k/s, ClickHouse never used more than a few percent of available memory - at least not with very simple selects).
  • alternatively, would something like:
    MergeTree(daystamp, intHash64(timestamp), (daystamp, intHash64(timestamp)), 8192)
    be more useful / better in practice? On a daily basis that's an estimated ~500M requests, so the sampling part would likely be essential - but perhaps a simple second-grained timestamp as the primary key is the correct choice?
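For concreteness, the sampled variant from the last bullet, spelled out in full (same columns as above, only the engine line changes; if I read the docs right, the old-style MergeTree parameters are: date column, sampling expression, primary key, index granularity):

```sql
CREATE TABLE IF NOT EXISTS dbtest.access_log (
        daystamp  Date,
        timestamp DateTime
        -- ... remaining columns as in the definition above ...
) ENGINE = MergeTree(daystamp, intHash64(timestamp),
                     (daystamp, intHash64(timestamp)), 8192)
```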
Any other remarks or suggestions appreciated =)

On a related note - is there any chance for the toDateTime() parser to understand milliseconds (even if they are meaningless)? From what I've seen it accepts ISO 8601 fine, but fails whenever e.g. .000 is encountered.

Vitaliy Lyudvichenko

Oct 24, 2017, 9:35:44 AM
to ClickHouse


On Friday, October 20, 2017 at 4:02:44 PM UTC+3, sol...@ziu.info wrote:

  • is this primary key sensibly chosen? Most official and not-so-official examples I've seen use 'daystamp' at least as part of the primary key. 'timestamp' as above felt like a sensible choice, but maybe it's too fine-grained? Would something like
    MergeTree(daystamp, (daystamp, timestamp), 8192)
    even make sense (considering timestamp is more fine-grained than daystamp) and/or be a better choice?
It is OK to use only timestamp as the primary key.
But if you need to filter data via the index on an additional column, for example tenant_id, then a PK of (daystamp, tenant_id) will be significantly better than (timestamp, tenant_id).
So, for queries like SELECT * FROM table WHERE tenant_id = ID, a table with the second PK will almost degrade to a full scan.

Consequently, if you want to use sampling effectively, it is better to use (daystamp, intHash64(smth)) instead of (timestamp, intHash64(smth)).
Both will consume approximately the same amount of RAM, but the second one will use IO more intensively due to full scans.
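For example (tenant_id is chosen here purely as an illustration; any column you commonly filter on works the same way; in the old syntax the sampling expression must also appear in the primary key):

```sql
-- Old-syntax parameters: date column, sampling expression, primary key, granularity
CREATE TABLE dbtest.access_log (
        -- ... same columns as in the question ...
) ENGINE = MergeTree(daystamp, intHash64(tenant_id),
                     (daystamp, tenant_id, intHash64(tenant_id)), 8192)

-- Such a PK lets this query use the index instead of scanning everything:
SELECT count() FROM dbtest.access_log WHERE tenant_id = 42
```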
  • with our expected load (5k on average, lengthy peaks at 30k) - does the suggested '8192' need any adjustment?
The smaller index_granularity is (8192 in your example), the fewer extra rows are read (in the worst case a block of 8192 rows is read when only 1 is required), but the size of the index grows.
For example, with index_granularity=1 no extra rows are read, but the size of the index (.idx + .mrk files) will be enormous.

So, it makes sense to decrease index_granularity if your queries read only a small range of the data and you have long strings that are expensive to read.
But if you have large range queries and a not too fine-grained PK, you can keep a big index_granularity.
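To get a feeling for the trade-off at your estimated ~500M rows/day, the number of index marks per day can be computed directly (assumed figures, rounded):

```sql
-- Approximate number of index marks per day for two granularities
SELECT
    intDiv(500000000, 8192) AS marks_at_8192,  -- ~61 thousand marks/day
    intDiv(500000000, 1024) AS marks_at_1024   -- ~488 thousand marks/day
```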

  • if the setup above is correct, would a sampling expression make any sense? As far as I understand it, its main role is to guard against memory issues - and with the primary key set as in the example above, it seems completely pointless (after a few days of tests, generating logs at 20k/s, ClickHouse never used more than a few percent of available memory - at least not with very simple selects).
Sampling reduces not only memory consumption but IO and CPU as well.
If you don't have performance problems, you might not need it.
But sampling is quite easy to maintain, so you could add it now for the future.
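With a sampling key in place, a query like the following reads roughly a tenth of the rows, and so roughly a tenth of the IO and CPU (a sketch against the table from the question):

```sql
-- Counts are scaled back up; results are approximate
SELECT status, count() * 10 AS estimated_requests
FROM dbtest.access_log SAMPLE 1/10
WHERE daystamp = today()
GROUP BY status
```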

  • alternatively, would something like:
    MergeTree(daystamp, intHash64(timestamp), (daystamp, intHash64(timestamp)), 8192)
    be more useful / better in practice? On a daily basis that's an estimated ~500M requests, so the sampling part would likely be essential - but perhaps a simple second-grained timestamp as the primary key is the correct choice?
I mentioned it above.  
Any other remarks or suggestions appreciated =)

On a related note - is there any chance for the toDateTime() parser to understand milliseconds (even if they are meaningless)? From what I've seen it accepts ISO 8601 fine, but fails whenever e.g. .000 is encountered.

Maybe a special function toDateTimeXXXX() would be better?
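In the meantime, one possible workaround is to strip the fractional seconds before parsing (assuming the millisecond part always follows a fixed 'YYYY-MM-DD HH:MM:SS' prefix):

```sql
-- Keep only the first 19 characters, dropping the '.000' suffix
SELECT toDateTime(substring('2017-10-20 09:02:44.000', 1, 19))
```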