Record CRC validation failure, please help


Cheng Chen

Dec 17, 2016, 7:04:16 AM
to Confluent Platform
Hey Confluent folks,
We are running Confluent Platform 3.1.1 and Kafka 0.10.1.0-cp2 in our Docker Swarm cluster, and we keep seeing this error when trying to consume messages:
[2016-12-13 02:08:13,846] ERROR Task dp threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.common.KafkaException: Record for partition dp at offset 234 is invalid, cause: Record is corrupt (stored crc = 1133837813, computed crc = 2330297257)
	at org.apache.kafka.clients.consumer.internals.Fetcher.parseRecord(Fetcher.java:743)
	at org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:682)
	at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:425)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1045)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
At first we thought it might be related to https://github.com/confluentinc/kafka/commit/d2acd676c3eb0c11d0042bc3b9ae314165c68443,
but that change to the CRC update function only adds a simple check. Can anyone kindly shed some light on this?
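In case it helps with reproducing, a minimal standalone consumer along these lines should hit the same validation path as the Connect worker (the topic, offset, and partition are taken from the error above, with partition 0 assumed; the bootstrap address is a placeholder):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;

public class CrcRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder, point at your cluster
        props.put("group.id", "crc-debug");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("check.crcs", "true");          // the default; same validation the Connect consumer runs
        props.put("fetch.max.bytes", "52428800"); // 50 MB, matching the setting we were running with

        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
        try {
            TopicPartition tp = new TopicPartition("dp", 0); // partition 0 assumed
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 234L); // the offset reported as corrupt
            ConsumerRecords<byte[], byte[]> records = consumer.poll(10000);
            for (ConsumerRecord<byte[], byte[]> r : records) {
                System.out.println("offset " + r.offset() + " read OK");
            }
        } catch (KafkaException e) {
            // a corrupt record makes poll() throw, just like in the stack trace above
            System.out.println("validation failure reproduced: " + e.getMessage());
        } finally {
            consumer.close();
        }
    }
}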
Thanks in advance.
cheng

Cheng Chen

Dec 19, 2016, 4:03:27 AM
to Confluent Platform
Additional info: this happens quite consistently on Docker Swarm with multiple nodes.

Ewen Cheslack-Postava

Dec 20, 2016, 1:51:14 AM
to Confluent Platform
Cheng,

I don't have any immediate suggestions, but a CRC check failure is pretty unusual. It indicates data got corrupted somehow. Do you have an easily reproducible setup for this issue?
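One thing that might help narrow it down: check whether the corruption exists in the on-disk log on the broker, e.g. by running DumpLogSegments against the segment containing that offset (the path below is just a guess at your log dir and segment file):

bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/lib/kafka/data/dp-0/00000000000000000000.log \
  --deep-iteration

With --deep-iteration it validates each message and prints an isvalid flag. If the data on disk checks out, the corruption is being introduced somewhere between the broker and the consumer.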

-Ewen


Cheng Chen

Dec 21, 2016, 2:44:14 AM
to Confluent Platform
Thanks for your reply, Ewen. It turns out that once we changed consumer.fetch.max.bytes from 50 MB down to 5 KB, the problem went away. Any idea why?
Also, this happens in distributed mode (3 nodes).
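For reference, the change was in the Connect worker config (connect-distributed.properties); the exact byte values here are approximate:

# before (the 0.10.1.0 default, ~50 MB):
# consumer.fetch.max.bytes=52428800
# after (~5 KB):
consumer.fetch.max.bytes=5120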