Issue with Kafka connect workers as scalable cloud services


Dhananjay Patkar

Aug 16, 2017, 5:02:14 PM
to Confluent Platform
I am running Kafka Connect in distributed mode with a custom sink connector, and I am using cloud services to spin up the cluster worker nodes.

Whenever I add new nodes, the connector cluster frequently becomes unresponsive: only the leader worker continues to consume messages, while the other workers keep logging the INFO/WARN messages below:

[2017-08-16 20:58:04,263] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-b7c0d5a3-737f-4d7a-9eb2-5381e85d2fd3', leaderUrl='http://10.16.2.5:8083/', offset=3, connectorIds=[], taskIds=[DB_HBASE-GRP-1-0]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,264] WARN Catching up to assignment's config offset. (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,264] INFO Current config state offset 1 is behind group assignment 3, reading to end of config log (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,759] INFO Finished reading to end of log and updated config snapshot, new config log offset: 1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,760] INFO Current config state offset 1 does not match group assignment 3. Forcing rebalance. (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,760] INFO Rebalance started (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,761] INFO Wasn't unable to resume work after last rebalance, can skip stopping connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-08-16 20:58:04,761] INFO (Re-)joining group cqr-hbase-sink-group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2017-08-16 20:58:04,763] INFO Successfully joined group cqr-hbase-sink-group with generation 4 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2017-08-16 20:58:04,764] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-b7c0d5a3-737f-4d7a-9eb2-5381e85d2fd3', leaderUrl='http://10.16.2.5:8083/', offset=3, connectorIds=[], taskIds=[DB_HBASE-GRP-1-0]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder)

Is distributed mode production ready? Can I run it on dynamically scaling cloud services, or do I need a fixed number of nodes in the cluster?

Thanks in advance!!

mayank rathi

Aug 29, 2017, 1:16:19 PM
to confluent...@googlegroups.com
This could be due to re-balancing.

http://docs.confluent.io/current/connect/concepts.html#distributed-workers

"In distributed mode, you start many worker processes using the same group.id and they automatically coordinate to schedule execution of connectors and tasks across all available workers. If you add a worker, shut down a worker, or a worker fails unexpectedly, the rest of the workers detect this and automatically coordinate to redistribute connectors and tasks across the updated set of available workers. "


Ziliang Chen

Oct 6, 2017, 12:06:09 AM
to Confluent Platform
May I ask if this issue is still happening?

Dhananjay Patkar

Oct 6, 2017, 5:07:59 AM
to Confluent Platform

To make everyone aware: this issue is primarily caused by having multiple partitions on the topic used for config.storage.topic (the topic used for storing connector and task configurations).

In my case, this config topic was being created automatically with the default number of partitions (10 in my setup). To avoid this, the topic needs to be pre-created with exactly 1 partition; a sketch of the command is shown below.

After doing this, I no longer see the recurring rebalancing issue.
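For anyone hitting the same problem, here is a sketch of how the config topic can be pre-created with a single partition. The topic name, ZooKeeper address, and replication factor are placeholders; match them to your own worker config and cluster:

    kafka-topics --zookeeper zookeeper-host:2181 --create \
      --topic connect-configs \
      --partitions 1 \
      --replication-factor 3 \
      --config cleanup.policy=compact

The config storage topic should be a single-partition, compacted topic so that every worker reads the connector configurations in the same order; with multiple partitions the workers' config offsets can diverge, which matches the repeated "Forcing rebalance" messages in the log above.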


Thanks,
Dhananjay