What should be the ideal value of tasks.max?


Nishant Verma

Feb 17, 2017, 2:09:22 AM
to Confluent Platform
Hi 

I have a cluster with three data nodes and two name nodes. Cluster capacity is 500 GB, and each node has 32 GB of RAM.
My Kafka Connect workers run in distributed mode on 2 nodes.
My Kafka broker cluster has 3 nodes.

My source is generating small JSON records, which are dumped onto HDFS via the Kafka Connect HDFS sink.

I have changed the heap size in my hadoop-env.sh to 12G for the masters and 8G for the slaves, as I was getting a Java heap space OutOfMemoryError. I am yet to test after applying these heap size changes.
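For reference, this is roughly the hadoop-env.sh change (a sketch; the exact variable name depends on the Hadoop version, and on Hadoop 2.x HADOOP_HEAPSIZE is given in MB):

export HADOOP_HEAPSIZE=12288    # hadoop-env.sh on the master nodes (12G)
export HADOOP_HEAPSIZE=8192     # hadoop-env.sh on the slave/data nodes (8G)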


I have kept flush.size at 3000 to ensure that very small JSON files are not dumped into HDFS. Earlier, with flush.size at 50, files of only a few bytes were getting dumped. After increasing it to 3000, files of around 2 MB were getting dumped. We will further increase it to 25000 so that larger files go to HDFS. I have set the below properties in connect-distributed.properties:

max.poll.records=500
enable.auto.commit=true

What should be the ideal value of tasks.max for my scenario? Currently it is 14. My source generates 50,000 small JSON records per minute. I want to run Kafka Connect on these 2 nodes overnight, daily, for about 2 weeks.

Thanks
Nishant Verma  

Nishant Verma

Feb 17, 2017, 6:42:37 AM
to Confluent Platform
Edit1:

With flush.size as 25000:

My source generates records at an expected rate of ten million JSONs per hour. With flush.size at 25000, HADOOP_HEAPSIZE at 12000 MB for the masters and 8096 MB for the slaves, I am not seeing any errors in the logs, but no commits are happening to HDFS. There was no data in the /topics/topic1 path, although I could see records in /topics/+tmp/topic1; they never got flushed out of +tmp. A flush.size of 25000 should have been reached within minutes, given my data generation rate.
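Rough arithmetic behind that expectation (a sketch, assuming flush.size is counted per topic partition and a hypothetical 14 partitions):

# ~10,000,000 records/hour ≈ 2,778 records/second spread across the topic's partitions
echo $(( 25000 * 14 * 3600 / 10000000 ))   # ≈ 126 seconds for one partition to reach flush.size=25000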

With flush.size as 3000:

I changed the config of my connector and reduced flush.size to 3000 using the below command:
curl -H "Content-Type: application/json" -X PUT -d '{"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector", "format.class": "com.qubole.streamx.SourceFormat", "tasks.max": "14", "hdfs.url": "hdfs://ip-10-16-37-124:9000", "topics": "topic1,topic2", "partitioner.class": "io.confluent.connect.hdfs.partitioner.DailyPartitioner", "locale": " en.UTF-8", "flush.size": "3000", "timezone": "Asia/Calcutta" }' http://localhost:8083/connectors/run-1-hdfs-sink/config

The connector started, began reading some data, and immediately threw the Java heap space OutOfMemoryError below:

Exception in thread "kafka-coordinator-heartbeat-thread | connect-run-1-hdfs-sink" java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
        at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:93)
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:154)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:135)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:343)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:266)
        at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:865)

...........READ SOME MORE DATA from TOPICS......................

[2017-02-17 16:25:16,612] ERROR Task run-1-hdfs-sink-1 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
        at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:93)
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:154)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:135)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:343)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
        at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1031)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
        at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
        at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:744)


.............READ SOME MORE DATA from TOPICS................

I can see that data is getting pushed to HDFS for one topic as of now, but not for the second.

My queries:
- Why did this Java heap space error occur with a small value of flush.size and not with higher values, and why did the HDFS write not happen in the latter case?
- Because of this Java heap space issue, I would certainly have lost some records that were generated in the meantime. How do I overcome this Java heap space error?
- This Java heap space error occurred on both of the Kafka Connect nodes, so there must be some common configuration fix for it. How do I apply it?
- Is there a limit on the highest value I can use for flush.size that would not throw this error and would still push data to HDFS?

Thanks
Nishant Verma

Nishant Verma

Feb 20, 2017, 4:41:45 AM
to Confluent Platform
Hi

Can anyone suggest something here? I am getting the same Java heap space error. I changed HADOOP_CLIENT_OPTS to HADOOP_CLIENT_OPTS="-Xmx12000m $HADOOP_CLIENT_OPTS" in hadoop-env.sh, but the same error came this morning and the task subsequently failed.

Thanks
Nishant Verma


Nishant Verma

Feb 26, 2017, 1:19:02 AM
to Confluent Platform
I exported _JAVA_OPTIONS="-Xms12000m -Xmx12000m" in .bashrc on all the nodes. Now at least the Java heap space error is not coming.
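For reference, the change I made, plus a sketch of a more targeted alternative (as far as I know the Kafka start scripts read KAFKA_HEAP_OPTS, so the heap could be raised for the Connect worker JVM alone instead of for every Java process):

export _JAVA_OPTIONS="-Xms12000m -Xmx12000m"   # applies to every JVM started from this shell
export KAFKA_HEAP_OPTS="-Xms12g -Xmx12g"       # read by kafka-run-class.sh / connect-distributed.sh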

But the issue is that even though the Connect workers are continuously reading data from Kafka and no error is thrown, data is still not getting pushed to HDFS. There is no space crunch, and no error or exception is shown in nohup.out. I am not able to figure out what is wrong here. If I delete the <topic_name> directory from the /topics path, then Connect starts writing data to HDFS, but only for some time; it then stops writing to HDFS again.

I did a grep for "Starting commit and rotation for topic partition" in nohup.out and only a couple of occurrences of this string show up.
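The exact check, for reference (assuming the worker output goes to nohup.out as described):

grep -c "Starting commit and rotation for topic partition" nohup.out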

Why is data not written to HDFS despite no error or exception being thrown? This is my curl command:

curl -H "Content-Type: application/json" -X POST -d '{"name": "kafka-to-hdfs-2", "config": {"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector", "format.class": "com.qubole.streamx.SourceFormat", "tasks.max": "4", "hdfs.url": "hdfs://ip-10-16-37-124:9000", "topics": "Prd_IN_GeneralEvents,Prd_IN_Alerts,Prd_IN_TripAnalysis", "partitioner.class": "io.confluent.connect.hdfs.partitioner.DailyPartitioner", "partitioner.class": "io.confluent.connect.hdfs.partitioner.HourlyPartitioner", "locale": " en.UTF-8", "flush.size": "300", "timezone": "Asia/Calcutta" }}' http://localhost:8083/connectors

Data is present in Kafka.

Thanks
Nishant Verma


Dustin Cote

Feb 27, 2017, 3:39:41 PM
to confluent...@googlegroups.com
If data is being written for a while and then stops, it seems like maybe your data write rate to Kafka has changed and you aren't reaching the flush.size. Check the WAL directory to see whether data is being written there. This will be in your `logs.dir` directory on HDFS. It is possible that data is being written there but not yet moved to the final destination because the roll policy hasn't been triggered.
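Something like this should show whether the WAL and temp files are being written (a sketch, using placeholder topic names and the /logs and /topics/+tmp paths mentioned earlier in the thread):

hdfs dfs -ls -R /logs/<topic-name> | tail
hdfs dfs -ls -R /topics/+tmp/<topic-name> | tail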

--
Dustin Cote
Customer Operations Engineer | Confluent
Follow us: Twitter | blog

Nishant Verma

Feb 28, 2017, 3:59:32 AM
to Confluent Platform
My roll policy configurations are below:

max.poll.records=200

As of now, I am using a flush.size of around 200-500 with tasks.max set to 4.

Errors like "Error discarding temp file", "ERROR Exception on topic partition", and "WARN I/O error constructing remote block reader" are coming, as shown in the attached log file. I do have records getting flushed at times to /logs/<topic-name>, which points to the +tmp path.

Why are these errors coming, as seen in the attached file? Before this, there were times when no such error was seen in nohup.out, but no data was written to HDFS. When I deleted the /topics/<topic-name> directory, after some time that directory was recreated and data started getting written.

Did it acquire some kind of write lock on /topics/<topic-name> which was released when I deleted it? This has happened twice or thrice.

Nishant Verma

error-conn1.txt

Dustin Cote

Feb 28, 2017, 8:13:07 AM
to confluent...@googlegroups.com
It looks like there is some corruption of your HDFS filesystem:

Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073879832_144830 file=/logs/Prd_IN_GeneralEvents/227/log

I'd start looking at the integrity of HDFS, because if you have missing blocks, that can cause the problem you are seeing. If data is being written to the /logs directory then the connector is getting data to HDFS, but if you are losing blocks, rolling the WAL is going to be problematic.
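A check along these lines would be a reasonable starting point (a sketch, using the paths from your stack trace):

hdfs fsck /logs -files -blocks -locations | grep -i -E "missing|corrupt"
hdfs fsck / -list-corruptfileblocks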


Nishant Verma

Mar 3, 2017, 7:56:06 AM
to confluent...@googlegroups.com
I did check my HDFS integrity, but all I found was that the filesystem at '/' is HEALTHY.

I am getting logs like:

[2017-03-03 18:17:46,814] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter:229)
org.apache.kafka.connect.errors.ConnectException: Error creating writer for log file hdfs://ip-10-16-37-124:9000/logs/Prd_IN_GeneralEvents/349/log

org.apache.kafka.connect.errors.ConnectException: Error creating writer for log file hdfs://ip-10-16-37-124:9000/logs/Prd_IN_GeneralEvents/87/log

For all the partitions.

Nishant

sent from handheld device. please ignore typos.
