Query performance vs. cluster size


Iris Shainsky

Mar 7, 2017, 9:07:44 AM
to ClickHouse
Hi,
We are storing our raw campaign data and tested ClickHouse. I had a distributed table of 500,000,000 rows, based on a local MergeTree table with an EventDate column and an Attribute column as the index. I didn't see any change in query performance when adding more nodes to the cluster (tested on 3-node, 5-node, and 10-node clusters, without replicas). My queries count unique users where Attribute = X. With uniq I saw a slight improvement, but with uniqExact there was no change.
Is there a reason for this? What is the best practice? Which AWS instance type do you recommend for best performance?
I also saw a degradation in query performance when running 10 parallel queries. I guess replicas can help here, but this is still very low parallelism.
Thank you,
Iris

man...@gmail.com

Mar 27, 2017, 3:19:20 AM
to ClickHouse
Hello. Sorry for the delay.

Scalability of query execution time depends on how much work is done on each shard in parallel, and how much work is not parallelized across shards.
Usually, what is not parallelized well is sending intermediate data over the network.

So query scalability depends on how much intermediate data must be sent over the network, and on network speed.
For simple GROUP BY queries, it depends on the number of result rows and the total size of the states of the aggregate functions
(states of aggregate functions need to be merged to get the final result).

For example, if you use an aggregate function as simple as 'sum' and your query returns just a few rows, you will, sure enough, get linear scalability.
For counting unique values it is more difficult. The uniqExact function keeps all distinct values in its state, so the state can be quite large.
The uniq function's state holds up to 64K values and does not grow further. But 64K values per state can still be large, especially if you have a large number of such states.
You could try the 'uniqCombined' function. It has the same or better accuracy than 'uniq', but a more compact state.
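As a rough side-by-side sketch (the table and column names here are made up for illustration):

```sql
-- uniqExact: exact, keeps every distinct value in its state (largest state);
-- uniq: approximate, state capped at ~64K values;
-- uniqCombined: approximate, accuracy same or better than uniq, more compact state.
SELECT
    uniqExact(UserID)    AS users_exact,
    uniq(UserID)         AS users_approx,
    uniqCombined(UserID) AS users_combined
FROM events
WHERE Attribute = 'X'
```

The smaller the per-group state, the less intermediate data each shard has to send to the initiator for merging.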

Also, you could try to use a better network: 10 GBit instead of 1 GBit if possible, and locate shards close to each other (ideally on the same switch, or at least in the same region).

Sometimes you can limit the amount of intermediate data for a query. Consider, for example, the following query:

SELECT key, uniq(x), count() FROM table GROUP BY key ORDER BY count() DESC LIMIT 10

It will calculate uniq for all keys, send the states over the network, merge the intermediate values, and then select the top 10.
You could rewrite it as follows:

SELECT key, uniq(x), count() FROM table
WHERE key GLOBAL IN (SELECT key FROM table GROUP BY key ORDER BY count() DESC LIMIT 10)
GROUP BY key ORDER BY count() DESC LIMIT 10

This query will calculate the top 10 keys inside the subquery and then calculate uniq values only for those top 10 keys.

There are also some trickier approaches:

SELECT key, uniq(x), count() FROM table
WHERE key IN (SELECT key FROM table WHERE rand() % 100 = 0)
GROUP BY key ORDER BY count() DESC LIMIT 10

This query calculates uniq only for keys that are likely to be frequent enough (those that show up in a roughly 1% random sample of rows).

You could also use the max_rows_to_group_by setting together with group_by_overflow_mode = 'any'.
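A hedged sketch of how those settings might be combined with the query above (the limit of 100000 is an arbitrary illustration):

```sql
-- After 100000 distinct GROUP BY keys have been seen, group_by_overflow_mode = 'any'
-- keeps aggregating rows for keys already in the set but ignores new keys,
-- instead of throwing an exception, bounding the size of the intermediate state.
SELECT key, uniq(x), count()
FROM table
GROUP BY key
ORDER BY count() DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 100000, group_by_overflow_mode = 'any'
```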

Also, if you shard your data by user_id, you can avoid merging intermediate aggregation states with the 'distributed_group_by_no_merge = 1' setting. This is also quite tricky; see here for details: https://github.com/yandex/ClickHouse/issues/332
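A rough sketch of the idea, assuming the data is sharded by UserID so that each user's rows live on exactly one shard (the table name is illustrative):

```sql
-- With distributed_group_by_no_merge = 1, each shard computes its own final result
-- and the initiator does not merge aggregation states. Because the sharding key
-- makes the per-shard user sets disjoint, the shard-local exact counts can simply
-- be summed, instead of merging large uniqExact states over the network.
SELECT sum(users) AS total_users
FROM
(
    SELECT uniqExact(UserID) AS users
    FROM distributed_table
    WHERE Attribute = 'X'
    SETTINGS distributed_group_by_no_merge = 1
)
```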

To check that your query is really bounded by network speed, run the dstat Linux tool (for example, dstat -cn 1 shows CPU and network usage once per second). If you see high network usage and low CPU usage during query execution, then you know performance is network-bound.

mvav...@cloudflare.com

Mar 28, 2017, 6:56:47 PM
to ClickHouse
That makes me curious. I've read that you're using bi-level sharding, but if I understand it right, that's only for insertion and replica selection.
How do you deal with query performance on large clusters? Say you have 500 servers; with a random sharding scheme you have to send the subquery to all 500 servers and then merge 500xN aggregation states, which possibly takes longer than with 10 servers. Do you have overlaying clusters, or do you implement some sort of domain-specific sharding? Say records for the same zone/UserID always live on the same servers; the query planner could leverage that to send the subquery only to certain servers.

tdingus

Oct 7, 2019, 2:50:20 PM
to ClickHouse
Is there a way to know how much or what intermediate data is sent?

Thanks!