Hi,
I am using ClickHouse version 19.16.3.6-1 and Kafka version 2.4.0.
I was looking for explicit documentation/suggestions about the best way to define a Kafka engine table pointing at a given topic of a Kafka cluster, but what I found was not very helpful.
My situation is a fairly complex project: there are currently 30 ClickHouse Kafka tables, each connected to a different topic with a different partitioning setup.
At the moment the Kafka cluster consists of 12 nodes, but I am increasing it to 22 nodes.
ClickHouse is currently deployed on 12 nodes as well (6 shards with 2 replicas each), and, like Kafka, I am going to increase it to 22 nodes (11 shards with 2 replicas each).
My question is: what is the best setup for kafka_broker_list
to limit the number of TCP connections while keeping the message-consuming load distributed as evenly as possible?
-- current definition (with the 12 Kafka brokers)
CREATE TABLE pippo.pluto_queue (
`aa` UInt64,
`bb` String,
...
) ENGINE = Kafka SETTINGS kafka_broker_list = 'node01:9092,node02:9092,node03:9092,node04:9092,node05:9092,node06:9092,node07:9092,node08:9092,node09:9092,node10:9092,node11:9092,node12:9092',
kafka_topic_list = 'TOPICXX', kafka_group_name = 'mygroup.pluto_queue', kafka_format = 'JSONEachRow'
My understanding is that kafka_broker_list is only used for the initial contact with the Kafka cluster; after that, based on the topic topology and the consumer group name specified in the engine, the necessary consumers are created across all engines sharing that group name, according to the topic's partitioning.
If that is correct, kafka_broker_list should only matter for HA of the initial Kafka access: if "node01:9092" is not reachable, the engine will try the next broker in the list, and so on. Once the connection is established, the Kafka protocol itself takes care of connecting every active table engine in the ClickHouse cluster to the current partition layout of the topic.
Following that reasoning, the 12-node list could be reduced to 2 or 3 nodes.
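For illustration, the reduced-list variant of the same table would look like this (node01–node03 are arbitrary picks here; any 2–3 reachable brokers should work as the bootstrap set):

CREATE TABLE pippo.pluto_queue (
`aa` UInt64,
`bb` String,
...
) ENGINE = Kafka SETTINGS kafka_broker_list = 'node01:9092,node02:9092,node03:9092',
kafka_topic_list = 'TOPICXX', kafka_group_name = 'mygroup.pluto_queue', kafka_format = 'JSONEachRow'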
Another idea for the Kafka settings of the ClickHouse tables would be to point each table at its local Kafka instance (localhost:9092), so the initial Kafka contact is spread across the ClickHouse/Kafka processes (the topic topology and group state would still drive each consumer to the node that actually holds the data to be read). The downside of this setup is that a table's Kafka engine would stop working whenever its corresponding local Kafka instance is unavailable.
-- the alternative definition of the above Kafka table
CREATE TABLE pippo.pluto_queue (
`aa` UInt64,
`bb` String,
...
) ENGINE = Kafka SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'TOPICXX', kafka_group_name = 'mygroup.pluto_queue', kafka_format = 'JSONEachRow'
What do you think about it?
Regards