Kafka Connect design: source tasks - threads - partitions - workers


Lorenzo Sommaruga

Apr 6, 2017, 3:22:14 PM
to Confluent Platform
Hi,
I have a design question about Kafka Connect (SourceConnector): I want to read files with the same format from many different source machines.

I'm exploring the possibility of using a single target topic (same data domain and format) with 1..N partitions and 1..N source tasks so I can parallelize the work, but:
  • I'm a bit worried that in the future we could end up with too many threads running on the Kafka cluster (at least one per SourceTask):
    • it would probably be better if I could run the "task thread" on the source system itself, so I'd have 1 source machine = 1 thread instead of (for example) 3 Kafka nodes = N task threads
  • I'm a bit worried that in the future we could have too many partitions and run into performance problems on the topic
  • I read that workers rebalance the source tasks, but if I have, for example, 100 max tasks and 100 partitions, do I also need 100 workers?
  • finally, my question is: is it better to use many connector instances so we can manage scalability and configuration more easily? For example (imagine 100 sources): 5 connector instances with 20 source tasks each. What is the advantage of having multiple connector instances instead of one instance with many source tasks?
Thanks in advance.


Ewen Cheslack-Postava

Apr 7, 2017, 1:49:48 AM
to Confluent Platform
On Thu, Apr 6, 2017 at 12:22 PM, Lorenzo Sommaruga <lorenzo....@gmail.com> wrote:
Hi,
I have a design question about Kafka Connect (SourceConnector): I want to read files with the same format from many different source machines.

I'm exploring the possibility of using a single target topic (same data domain and format) with 1..N partitions and 1..N source tasks so I can parallelize the work, but:
  • I'm a bit worried that in the future we could end up with too many threads running on the Kafka cluster (at least one per SourceTask):
    • it would probably be better if I could run the "task thread" on the source system itself, so I'd have 1 source machine = 1 thread instead of (for example) 3 Kafka nodes = N task threads
Connect does not run on the broker. Both Connect and Streams are higher-level client applications that you should run on servers other than the broker.

Connect also supports two modes, standalone and distributed. See http://docs.confluent.io/3.2.0/connect/concepts.html#workers for a detailed explanation of these two modes. In your case, standalone mode will make sense because you *must* run the connectors/tasks where the files are located.
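Concretely, a standalone worker is just one process started with a worker properties file plus one or more connector properties files, so you would run one such process on each source machine. A rough sketch (host names and paths are only placeholders, here using the simple FileStreamSource connector that ships with Kafka):

    # worker-standalone.properties (minimal sketch, adjust to your environment)
    bootstrap.servers=kafka-1:9092,kafka-2:9092,kafka-3:9092
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    internal.key.converter=org.apache.kafka.connect.json.JsonConverter
    internal.value.converter=org.apache.kafka.connect.json.JsonConverter
    # standalone mode keeps source offsets in a local file
    offset.storage.file.filename=/var/lib/connect/offsets

    # my-file-source.properties (hypothetical connector config)
    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/myapp/app.log
    topic=machine-logs

    # start the worker on the source machine
    bin/connect-standalone worker-standalone.properties my-file-source.properties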
 
  • I'm a bit worried that in the future we could have too many partitions and run into performance problems on the topic
The number of files, tasks, and topic partitions do not need to be equal. All of these can be scaled as needed. However, for the # of partitions in the topic, you will want to plan ahead for scale since this affects which records/keys are routed to which partitions (e.g. if you use the log file name as the key, changing the number of topic partitions would mean all lines from a log file would not be in the same partition anymore).
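To make the keying point concrete: with the default partitioner the key determines the partition (roughly a hash of the key modulo the number of partitions), so if the file name is the key, adding partitions later changes where new lines from that file land. A hypothetical sketch of how a task could emit records keyed by file name (the class and field names are made up for illustration):

    import java.util.Collections;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceRecord;

    public class FileKeyedRecordSketch {

        // Builds one record for a single line read from a file. Keying by file name
        // means the default partitioner (hash of the key modulo the partition count)
        // keeps all lines of one file in one topic partition, as long as the
        // partition count stays stable.
        static SourceRecord recordFor(String fileName, long byteOffset, String line, String topic) {
            return new SourceRecord(
                    Collections.singletonMap("file", fileName),       // source partition, used by Connect to track offsets
                    Collections.singletonMap("position", byteOffset), // source offset within that file
                    topic,
                    null,                                             // topic partition: null lets the producer derive it from the key
                    Schema.STRING_SCHEMA, fileName,                   // record key = file name
                    Schema.STRING_SCHEMA, line);                      // record value = the log line
        }
    }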

For files & tasks, in Connect a task can handle any number of inputs (it is up to the connector) and tasks can publish to any topic partition, so the number of tasks & topic partitions are not tied to each other in any way.
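For example, a connector decides that split in taskConfigs(); a common pattern is to divide the known inputs (here, file paths) across however many tasks the framework asks for. A rough sketch, assuming a custom connector with a made-up "files" config key:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.connect.util.ConnectorUtils;

    public class TaskAssignmentSketch {

        // Splits N input files across at most maxTasks task configurations.
        // One task can own many files, and nothing ties the number of tasks
        // to the number of topic partitions.
        public static List<Map<String, String>> taskConfigsFor(List<String> files, int maxTasks) {
            int numGroups = Math.min(files.size(), maxTasks);
            List<List<String>> groups = ConnectorUtils.groupPartitions(files, numGroups);
            List<Map<String, String>> taskConfigs = new ArrayList<>(groups.size());
            for (List<String> group : groups) {
                Map<String, String> config = new HashMap<>();
                config.put("files", String.join(",", group)); // "files" is a hypothetical per-task config key
                taskConfigs.add(config);
            }
            return taskConfigs;
        }
    }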
 
  • I read that workers rebalance the source tasks, but if I have, for example, 100 max tasks and 100 partitions, do I also need 100 workers?
You do not need 100 workers. Workers simply balance the available tasks as equally as possible. If you have only 1 worker, it will handle all 100 tasks. If you have 5 workers, each will handle 20 tasks. If you have 100, each will handle 1 task. However, note that this only applies to distributed mode. In standalone mode (which seems applicable in your case since you need each task tied to local files), you'll need to manage distribution of work, but you don't really have much choice since the process handling each file *must* be on the same node as that file.
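For reference, in distributed mode the only thing that groups workers into a cluster is a shared group.id (plus the same internal topics) in the worker config; every worker started with that config joins the group and picks up its share of the tasks. Not what you'd use for purely local files, but to illustrate the balancing (hosts and topic names below are placeholders):

    # worker-distributed.properties (sketch; one copy per worker, same group.id everywhere)
    bootstrap.servers=kafka-1:9092,kafka-2:9092,kafka-3:9092
    group.id=file-ingest-connect-cluster
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    internal.key.converter=org.apache.kafka.connect.json.JsonConverter
    internal.value.converter=org.apache.kafka.connect.json.JsonConverter
    # internal topics shared by all workers in the group
    config.storage.topic=connect-configs
    offset.storage.topic=connect-offsets
    status.storage.topic=connect-status

    # start one worker per machine; a connector with tasks.max=100 gets its tasks
    # spread across however many of these workers are running
    bin/connect-distributed worker-distributed.properties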
 
  • finally, my question is: is it better to use many connector instances so we can manage scalability and configuration more easily? For example (imagine 100 sources): 5 connector instances with 20 source tasks each. What is the advantage of having multiple connector instances instead of one instance with many source tasks?
For most users, fewer tasks is better if they all share the same configuration. The framework already handles load balancing the work, so the main reason for using multiple connector instances is to apply different configurations. Additionally, you might want separate connectors if you need to be able to pause/resume them independently since these operations are applied to the entire connector.
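For example, pause/resume is exposed per connector through the worker's REST API (port 8083 by default), which is one concrete reason to split sources you may want to stop independently. The connector names below are made up:

    # pause only the connector reading from data center A; the others keep running
    curl -X PUT http://connect-worker:8083/connectors/file-source-dc-a/pause
    curl -X PUT http://connect-worker:8083/connectors/file-source-dc-a/resume

    # check a connector and its tasks
    curl http://connect-worker:8083/connectors/file-source-dc-a/status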

-Ewen
 
Thanks in advance.



Lorenzo Sommaruga

Apr 7, 2017, 4:12:46 AM
to Confluent Platform
Thanks, very clear.

The "last" issue is that i must copy all dependecies (most of them) of the connect-standalone script into the source machine classpath: it's not a big problem but when you want to upgrade you need to manually do it on all source machine . 

I'll made some test and try to find a "light connector script / dependencie".

Gwen Shapira

Apr 10, 2017, 12:35:02 AM
to confluent...@googlegroups.com
Agreed, it is a bit of a pain. If you are into Docker, Confluent has an image on Docker Hub that includes Connect and all of Confluent's connectors. You can probably use it as a base for further customization.

We will definitely work on improved packaging, but until then you can simply install all of Kafka on the source machine and only run standalone Connect. It is a bit wasteful, but it may be easier than copying lots of dependencies...
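If you go the Docker route, one possible shape is to use the Connect image as a base and add your own connector jar and configs on top. The image tag, paths, and standalone override below are assumptions, so check the cp-kafka-connect image docs before relying on them:

    # Dockerfile sketch; image tag, paths and the standalone command override are assumptions
    FROM confluentinc/cp-kafka-connect:3.2.0

    # add your own connector jar(s) and configs
    COPY my-file-connector.jar /usr/share/java/kafka-connect-my-file/
    COPY worker-standalone.properties my-file-source.properties /etc/my-connect/

    # run a standalone worker instead of the image's default distributed worker
    CMD ["connect-standalone", "/etc/my-connect/worker-standalone.properties", "/etc/my-connect/my-file-source.properties"]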

Gwen

--
Gwen Shapira
Product Manager | Confluent
650.450.2760 @gwenshap
Follow us: Twitter | blog
