What else could you tune for maximum performance?
1. max_block_size.
For some queries, it may be better to lower this setting to be more CPU-cache friendly.
But it is not always a win, because sometimes there is a large per-block overhead.
By default, it is 65536. You could lower it to any value greater than or equal to the index granularity of the table. If you set a lower value, it will be rounded up internally.
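A minimal sketch of trying a smaller block size for one session. The table `hits` and its `URL` column are hypothetical, and the value 8192 assumes the table uses the default index granularity:

```sql
-- Assumption: `hits` is a hypothetical table with index granularity 8192.
-- Lower the block size from the default 65536 for this session only.
SET max_block_size = 8192;

-- Run the query you are tuning and compare its timing with the default setting.
SELECT count() FROM hits WHERE URL LIKE '%example%';
```

Whether this helps depends on the query, so it is worth timing the same query with both values before keeping the change.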
2. max_threads.
By default, ClickHouse automatically uses a number of threads equal to the number of CPU cores, not counting hyper-threading.
For example, on Intel CPUs with 2-way hyper-threading, if you have 32 logical cores, ClickHouse will use 16 threads.
But hyper-threading is not useless: it helps, for example, when your queries use large hash tables for aggregation.
But the performance gain from hyper-threading is less than proportional: for example, you may get a 1.5 times speedup when using all 32 logical cores instead of 16.
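A short sketch of overriding the thread count for the 32-logical-core example above. The table `hits` and the column `UserID` are hypothetical; only the setting name comes from the text:

```sql
-- Check the value currently in effect for this session.
SELECT name, value FROM system.settings WHERE name = 'max_threads';

-- Use all 32 logical cores instead of the 16 chosen by default.
SET max_threads = 32;

-- Assumption: a hypothetical aggregation with a large hash table,
-- the kind of query where hyper-threading is said to help.
SELECT UserID, count() FROM hits GROUP BY UserID;
```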
3. compile.
This option enables optimizing the inner loop of GROUP BY with runtime code generation. It is disabled by default because the performance impact is low on average. But for simple queries, it is sometimes a big win (a query may run several times faster). The documentation is only in Russian so far:
https://clickhouse.yandex/reference_ru.html#compile
To test this option, first set compile = 1, min_count_to_compile = 0. This makes compilation synchronous: the query pauses while the code is compiled. If compilation fails, you will get an error message in the client. For production usage, set min_count_to_compile to a higher value.
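The test procedure could look roughly like this, assuming a ClickHouse version that has both settings. The table `hits` and the column `RegionID` are hypothetical:

```sql
-- Enable runtime compilation of the GROUP BY inner loop.
SET compile = 1;

-- Compile on first use, so the query waits for compilation to finish
-- and any compilation error is reported to the client.
SET min_count_to_compile = 0;

-- Assumption: a simple hypothetical aggregation, the kind of query
-- the text says may benefit the most.
SELECT RegionID, count() FROM hits GROUP BY RegionID;
```

Running the same query with compile = 0 gives a baseline for comparison.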