Remedies for "Batch containing x record(s) expired due to timeout while requesting metadata..."


Chris Stromberger

Apr 13, 2018, 9:35:11 AM
to Confluent Platform
Seeing this in logs from my producer in a Java application. Just wondering if this is symptomatic of producer not being configured correctly. We have retries set (to 1000) and retry.backoff.ms set (to 1000). Not sure how to defend against this type of error and avoid losing messages. Any tips appreciated.

Thanks,
Chris

Andy Coates

Apr 13, 2018, 10:01:12 AM
to Confluent Platform
Hey Chris,

Normally, when you see batch timeouts 'while requesting metadata' it means either the producer is badly configured (e.g. bootstrap.servers is wrong), there's a network partition or ACL stopping the producer talking to the cluster, or there is an issue with the cluster (or just a single broker).

Is this a constant issue or intermittent?

Chris Stromberger

Apr 13, 2018, 11:49:58 AM
to Confluent Platform
It's intermittent, but we're seeing it every 12 hrs or so. However, there is some kind of maintenance going on in the cluster, so maybe that's the issue. I know the bootstrap server config is correct. The producer will be humming along for hours, then hit this for a few minutes, then will keep going.

If this is due to cluster issues, is there a way to configure the producer to defend against losing messages in cases like this of cluster hiccups?

Damian Guy

Apr 13, 2018, 12:29:45 PM
to confluent...@googlegroups.com
Hi,

You can set retries to Integer.MAX_VALUE and max.block.ms to Long.MAX_VALUE. This means the producer will effectively retry forever. Of course, you probably want to have some monitoring in place so that you know when it is blocked on this.
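For reference, a minimal sketch of those settings for the Java producer. The broker address and class name here are illustrative placeholders; in a real application you would pass this Properties object to a KafkaProducer constructor.

```java
import java.util.Properties;

// Sketch of "retry effectively forever" producer settings, using plain
// string config keys so the snippet does not depend on the Kafka client jar.
public class ProducerRetryConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        // Retry failed sends until they succeed instead of giving up.
        props.put("retries", String.valueOf(Integer.MAX_VALUE));
        // Let send() and metadata fetches block indefinitely.
        props.put("max.block.ms", String.valueOf(Long.MAX_VALUE));
        // Pause between retries (the poster already uses 1000 ms).
        props.put("retry.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```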

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/c212d921-9095-4e98-8098-5831ac4cddf5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Stromberger

Apr 13, 2018, 1:06:29 PM
to Confluent Platform
It looks like this issue was actually posted as a Kafka bug, but I'm not sure if this is up to date: https://issues.apache.org/jira/browse/KAFKA-3686

One person there mentions "We have a similar issue and even with a very very large retry, the producer decides to skip messages when there is a network issue.", so I'm not sure the retry config will fix it.



Andy Coates

Apr 14, 2018, 5:48:17 AM
to Confluent Platform
Hi Chris, 

Can you provide the full error message and stack trace, please?

Thanks,

Andy

Chris Stromberger

Apr 15, 2018, 1:22:44 PM
to Confluent Platform
Hi Andy,

Unfortunately I don't have a full stack trace in the logs. Here's what I see leading up to the error:

WARN - Sender - Got error produce response with correlation id 1414908 on topic-partition changelog-1, retrying (987 attempts left). Error: NOT_LEADER_FOR_PARTITION

Then eventually this:

org.apache.kafka.common.errors.TimeoutException: Batch containing 3 record(s) expired due to timeout while requesting metadata from brokers for changelog-1

I don't see the retry "attempts left" count reach 0 in the logs. I have retries set to 1000, and the lowest I see is around 980 attempts left. It seems like that number should count all the way down to 0, but maybe the timeout occurs before that can happen? Our producer has request.timeout.ms set to 30000, I believe.

I'm going to look at enhancing the app's logging around this, but in the meantime, that's all I have to go on from the logs.

Thanks,
Chris

Andy Coates

Apr 16, 2018, 1:08:18 PM
to confluent...@googlegroups.com
What looks to be happening here is that there is a leadership election in the Kafka cluster for changelog-1, which means your client has stale metadata. When the client receives NOT_LEADER_FOR_PARTITION it will request updated metadata from the cluster to work out where the partition has moved to. However, in your case it looks like the client fails to get a response to its metadata request within the request timeout.

The default timeout for a request is 30 seconds, which is what you say you're using. So if the client can't get a response within those 30 seconds, you'll see this timeout exception. If you want it to keep trying for longer, you'd need to increase request.timeout.ms.
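As an illustration, raising that timeout is just another producer property. The 60000 ms value below is an assumed example, not a recommendation, and the broker address is a placeholder.

```java
import java.util.Properties;

// Sketch: raising request.timeout.ms above the 30-second default so a
// metadata request can survive a longer leadership election.
public class ProducerTimeoutConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("request.timeout.ms", "60000"); // default is 30000
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("request.timeout.ms"));
    }
}
```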

What's happening in the cluster when you're seeing these errors? Does it make sense that the client can't get a response for 30 seconds? 


Chris Stromberger

Apr 16, 2018, 3:02:13 PM
Apr 16, 2018, 3:02:13 PM4/16/18
to Confluent Platform
Ok, thanks for the explanation, that makes sense. I'll try increasing request.timeout.ms. I'm not sure what is going on in the cluster, as it's maintained by a separate team, but I will check with them to see if there are any clues there.
Chris