I work at Cato Networks, and we are currently evaluating ClickHouse as the query engine for one of our new projects.
We have performed a small POC focusing mainly on query response time for complex queries over a dataset of about 3 TB (default compression, without any optimisation of the data).
The performance looked more than reasonable, and we think that, at least from that point of view, we will be OK.
We are more concerned about the concept of a cluster: how does one usually know when to start using a cluster instead of scaling the machine up further?
Is it when the amount of data processed in a single query becomes too large? If so, what is that number?
Or is it when there are too many concurrent requests?
We are trying to understand whether we should start from the beginning with a small cluster (2 shards with a distributed table), to make growing easier later, or start with a single shard and 2 replicas just for high availability.
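For context, this is roughly the two-shard layout we have in mind (a sketch only; the cluster, database, table, and column names are placeholders, not our actual schema):

```sql
-- Local MergeTree table created on each shard
CREATE TABLE events_local ON CLUSTER my_cluster
(
    ts DateTime,
    account_id UInt64,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (account_id, ts);

-- Distributed table that fans queries out to both shards
-- and spreads inserts across them by the sharding key
CREATE TABLE events ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());
```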
We also think we might be missing some basic understanding of how a ClickHouse cluster is intended to be used. When we reach the point where we decide to add an extra node, it sounds like the old data should not be redistributed equally between all nodes; rather, more of the new data should be written to the new node. But as we see it, this means that if most of our queries run over recent data, we will mainly be utilising the new node. Does that make sense? Won't it cause performance issues?
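As far as we understand, this skewing of new writes toward the new node would be done via the shard `<weight>` setting in `remote_servers`, which the Distributed engine uses to distribute inserts proportionally (a sketch only; hostnames and the weight values are placeholders):

```xml
<remote_servers>
    <my_cluster>
        <shard>
            <!-- existing node: receives a smaller share of new inserts -->
            <weight>1</weight>
            <replica><host>old-node</host><port>9000</port></replica>
        </shard>
        <shard>
            <!-- new node: weight 3 => roughly 3/4 of new rows land here -->
            <weight>3</weight>
            <replica><host>new-node</host><port>9000</port></replica>
        </shard>
    </my_cluster>
</remote_servers>
```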
Thanks in advance.