Control Center will not start, continuously rebalancing, lock exceptions


elizabet...@stitchfix.com

Apr 4, 2017, 1:35:55 PM
to Confluent Platform
Using confluent platform 3.2.0, kafka 0.10.2.

This morning I was all set to demo the control center to my team when, to my dismay, it got itself into some kind of unstable state.

It had been running just fine for quite some time (about a week), but now it will not start up (the HTTP service, at least). I see a number of error messages, particularly in the broker and control center logs.

When I bounce brokers, I see many of these messages, with different topics and partitions in each message:

[2017-04-04 17:20:04,387] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/var/lib/kafka/_confluent-controlcenter-3-1-0-1-error-topic-3/00000000000000000000.index) has non-zero size but the last offset is 0 which is no larger than the base offset 0.}. deleting /var/lib/kafka/_confluent-controlcenter-3-1-0-1-error-topic-3/00000000000000000000.timeindex, /var/lib/kafka/_confluent-controlcenter-3-1-0-1-error-topic-3/00000000000000000000.index and rebuilding index... (kafka.log.Log)
[2017-04-04 17:20:04,387] INFO Recovering unflushed segment 0 in log _confluent-controlcenter-3-1-0-1-error-topic-3. (kafka.log.Log)
[2017-04-04 17:20:04,388] INFO Completed load of log _confluent-controlcenter-3-1-0-1-error-topic-3 with 1 log segments and log end offset 0 in 2 ms (kafka.log.Log)


When I bounce the control center, it will begin to start up, but then will just continuously rebalance:

[2017-04-04 17:22:20,607] INFO tocheck=[Store{name=KSTREAM-OUTEROTHER-0000000107-store, rollup=false}, Store{name=KSTREAM-OUTERTHIS-0000000106-store, rollup=false}, Store{name=Group, rollup=true}, Store{name=MonitoringStream, rollup=true}, Store{name=TriggerActionsStore, rollup=false}, Store{name=MonitoringMessageAggregatorWindows, rollup=true}, Store{name=MonitoringTriggerStore, rollup=false}, Store{name=aggregatedTopicPartitionTableWindows, rollup=true}, Store{name=MonitoringVerifierStore, rollup=false}, Store{name=TriggerEventsStore, rollup=false}, Store{name=AlertHistoryStore, rollup=false}, Store{name=MetricsAggregateStore, rollup=false}] (io.confluent.controlcenter.streams.KafkaStreamsManager:115)
[2017-04-04 17:22:20,607] INFO streams in state=REBALANCING (io.confluent.controlcenter.streams.KafkaStreamsManager:137)

It does this for a long time, and then starts to spew out these exceptions:

streams.processor.internals.StreamThread:828)
[2017-04-04 17:29:02,732] WARN Could not create task 2_1. Will retry. (org.apache.kafka.streams.processor.internals.StreamThread:1184)
org.apache.kafka.streams.errors.LockException: task [2_1] Failed to lock the state directory: /var/lib/confluent/control-center/1/kafka-streams/_confluent-controlcenter-3-2-0-1/2_1
    at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:102)
    at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:73)
    at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:108)
    at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
    at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
    at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
    at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
    at org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
    at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)

Then it goes back into the rebalancing loop.

I'm at a loss as to what to try next. I tried pointing the control-center data dir config to a different directory, and the same lock exception still happens. Nothing stands out in the ZooKeeper logs. The only other log exception of interest is this one, coming from the brokers:

[2017-04-04 17:30:46,129] WARN Failed to produce metrics message (io.confluent.metrics.reporter.ConfluentMetricsReporter)
org.apache.kafka.common.errors.RecordTooLargeException: The message is 1155172 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.

That doesn't seem relevant to this particular issue though.

If anyone has any tips for debugging this, that would be hugely appreciated.

Thank you!!

xav...@confluent.io

Apr 4, 2017, 4:02:15 PM
to Confluent Platform
Regarding the record size exception, did you override the confluent.metrics.reporter.whitelist setting by any chance?
That could lead to the record size growing larger than expected.
Even if you did not, you can set confluent.metrics.reporter.max.request.size to something like 10485760 (10MB) to avoid the issue on the broker side.
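
For example, on each broker that would be a one-line change in server.properties (the path to the file depends on your install; this is just a sketch):

    # allow larger metrics batches from the Confluent Metrics Reporter (10 MB)
    confluent.metrics.reporter.max.request.size=10485760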

The corrupt index issues would indicate your broker didn't shut down properly, so you might want to let it recover before starting control center again.

If that still doesn't fix the problem, you can try resetting your control center by giving it a different identifier using confluent.controlcenter.id=some-unique-string.
The default id is 1, so you can just increment the number to keep it simple.
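
As a sketch, in control-center.properties that would look something like this (the id value itself is arbitrary; anything you haven't used before works):

    # fresh identifier => Control Center creates new internal topics and a new
    # state directory instead of reusing the old (possibly corrupted) ones
    confluent.controlcenter.id=2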

Hope this helps!

elizabet...@stitchfix.com

Apr 4, 2017, 6:31:48 PM
to Confluent Platform
Thanks for the tips, Xavier!

I haven't touched the confluent.metrics.reporter.whitelist config. Setting the confluent.metrics.reporter.max.request.size to 10MB fixed the broker message size issue, so thanks for that!

The corrupt index issue does not seem to happen when I use the kafka-server-stop script, as opposed to kill -9, which is what I had previously set up in my /etc/init.d/kafka script. So that issue looks like it is solved as well.
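
For anyone else doing this with an init script, the change amounts to roughly the following (a sketch only; the script and properties paths depend on how Kafka was installed):

    #!/bin/sh
    # sketch of the "stop" handling in /etc/init.d/kafka -- use the bundled
    # kafka-server-stop script (sends SIGTERM) instead of kill -9, so the
    # broker can flush and close its log segments and index files cleanly
    case "$1" in
      stop)
        /usr/bin/kafka-server-stop
        ;;
    esac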

As for the control center, resetting the control center id worked. This resolved the lock issues and the control center was able to start up cleanly, getting the GUI back up and running. It still appears to be in a delicate state, though. I get the pulsing "Initializing" or "Loading data" dialog, or sometimes the grey loading bars. The UI hangs like this for a very long time, like 20 or 30 minutes. I suppose I'll chalk that up to the control center just taking a long time to initialize? It is spewing out tons of these warnings now:

[2017-04-04 22:27:25,651] WARN Missing sequence numbers=[46, 47] from client=producer-16, topic=XXX, partition=16, estimated timestamp=1490811907492, next timestamp=1490811907492 (io.confluent.controlcenter.streams.verify.MonitoringVerifier:331)

[2017-04-04 22:27:25,652] WARN missing monitoring data for client=producer-16 topic=sf.merch partition=16 session=49f22f37-9b40-4f0c-9e73-3ea372e61d0e window ranges=[[1490811240000?1490811960000]] ClientID=producer-16, topic=XXX, partition=16, session=49f22f37-9b40-4f0c-9e73-3ea372e61d0e, timeout=45000, windowHistorySize=10, shutdownReceived=true, maybeSessionEnded=false, sessionEnded=true, lastSequenceTime=1490811982698, expectedWindowDistance=135000, lastSequenceWithWindow={Sequence=53, Timestamp=1490811982698, minWindow=1490811825000, maxWindow=1490811960000}, Ranges=[] (io.confluent.controlcenter.streams.verify.VerifyTransformerSupplier:445)
[2017-04-04 22:27:25,652] WARN Missing sequence numbers=[1, 4] from client=producer-16, topic=XXX, partition=16, estimated timestamp=1490750343590, next timestamp=1490750343590 (io.confluent.controlcenter.streams.verify.MonitoringVerifier:331)
[2017-04-04 22:27:25,652] WARN No window info in the next non-missing sequence. Estimated window range for missing sequence=6 [1490811015000...1490811135000], lastSequenceWithWindow={Sequence=5, Timestamp=1490750343590, minWindow=1490811045000, maxWindow=1490811105000} for clientId=producer-16, topic=sf.merch, partition=16 (io.confluent.controlcenter.streams.verify.MonitoringVerifier:640)
[2017-04-04 22:27:25,652] WARN missing monitoring data for client=producer-16 topic=sf.merch partition=16 session=d5d8f144-530f-42c6-9eca-e0e82a7fa767 window ranges=[[1490811015000?1490811135000]] ClientID=producer-16, topic=XXX, partition=16, session=d5d8f144-530f-42c6-9eca-e0e82a7fa767, timeout=45000, windowHistorySize=10, shutdownReceived=true, maybeSessionEnded=false, sessionEnded=true, lastSequenceTime=1490750343590, expectedWindowDistance=60000, lastSequenceWithWindow={Sequence=5, Timestamp=1490750343590, minWindow=1490811045000, maxWindow=1490811105000}, Ranges=[] (io.confluent.controlcenter.streams.verify.VerifyTransformerSupplier:445)


Anyway, it's probably an unrelated issue. Thanks for the tips! They were very helpful.

elizabet...@stitchfix.com

Apr 10, 2017, 6:33:19 PM
to Confluent Platform
Update: the trick of resetting the control center id works partially, but I must also set confluent.controlcenter.streams.consumer.auto.offset.reset=latest. If I don't set this config, I presume the control center 'reconsumes' all the metrics data from the beginning of the topic. This would be good if it worked, but the reconsuming process either doesn't work or takes many hours to finish.
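
Concretely, the combination I'm setting in control-center.properties now looks like this (the id value is just an example):

    # new id to reset Control Center's internal topics/state, plus skip
    # reprocessing old monitoring data on startup
    confluent.controlcenter.id=3
    confluent.controlcenter.streams.consumer.auto.offset.reset=latest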

Last night my control center ran into the LockExceptions again and I had to reset its id again. I had accrued several days' worth of monitoring data that I was hoping to preserve, so I kept the auto offset reset config set to false. I let the control center start up and run for 3 or 4 hours, but still the UI would just be stuck on a loading screen.

So my question: is the control center actually recovering? And if so, how long does it typically take to reprocess, say, a week's worth of monitoring data?

The LockExceptions happen somewhat frequently (it seems like they happen anytime the control center doesn't shut down cleanly?), so if I have to completely reset the control center each time it happens, that will be a bit of a deal breaker. =/

Xavier Léauté

Apr 11, 2017, 2:03:49 PM
to confluent...@googlegroups.com
Update: the trick to reset the control center id works partially, but I must also set confluent.controlcenter.streams.consumer.auto.offset.reset=latest. 
If I don't set this config, I presume the control center 'reconsumes' all the metrics data since the beginning of the topic.

That is correct, we currently have it set that way to avoid gaps in monitoring, but I agree this can be a lengthy process on startup, and we could be smarter about how we reprocess older data.

I had accrued several days worth of monitoring data that I was hoping to preserve, so I kept the auto offset reset config set to false. I let the control center start up and run for 3 or 4 hours, but still the UI would just be stuck on a loading screen.

There are some known issues in the current release that could cause it to get stuck, especially if the underlying streams app is throwing exceptions. We'll have a bugfix release out for those very soon.
 
So my question: is the control center actually recovering? And if so, how long does it typically take to reprocess say, a weeks worth of monitoring data?

Depending on the number of topic partitions you are monitoring, it could take several hours to reprocess for a large cluster. It also depends on how many stream threads you are running, although increasing that number can be quite resource intensive. Do you mind sharing – privately if you prefer – the number of topic-partitions you are monitoring?

The LockExceptions happen somewhat frequently (it seems like they happen anytime the control center doesn't shut down cleanly?), so if I have to completely reset the control center each time it happens, that will be a bit of a deal breaker. =/

Typically the LockExceptions should be harmless, as they can sometimes happen when Kafka Streams initiates a rebalancing. We are making a change to the way Kafka Streams logs this error to reflect that.

The underlying problem appears to be Kafka Streams not getting out of the rebalancing loop. On startup, restoring state can take a long time, and it is possible you may be hitting some timeouts before that step completes. You might want to try setting `confluent.controlcenter.streams.max.poll.interval.ms` to something like 86400000 to see if that prevents streams from rebalancing too soon because it thinks one of its tasks is stuck.
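
For example, in control-center.properties:

    # allow up to 24 hours between poll() calls so long state-store restores
    # don't get the streams consumer kicked out of the group and trigger
    # yet another rebalance
    confluent.controlcenter.streams.max.poll.interval.ms=86400000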

Hope this helps,
Xavier

elizabet...@stitchfix.com

Apr 11, 2017, 7:13:59 PM
to Confluent Platform
Hi Xavier,
Thank you so much for your replies; they have been very helpful. I don't mind sharing ballpark numbers for our setup.

We're running a Kafka prototype cluster and we're ramping up the amount of data going in. At this point, it's just around 400 topics (most of those are C3 topics, actually) and we have around 3600 partitions across 3 brokers. The _confluent-monitoring topic has 12 partitions and 3 replicas and contains about 1.43GB of data. This doesn't seem like too much data, so I feel like the control center should not take as long as it does to consume it (I'm not actually sure how long it takes, since I've never had the patience to wait for it to finish, but it's at least 3+ hours). I'm only running with one stream thread, so I can try bumping up the number of stream threads.
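
If I do bump the thread count, I'm assuming the setting follows the same confluent.controlcenter.streams. prefix pattern as the other streams configs mentioned in this thread (property name assumed from that pattern, not something I've verified):

    # run more Kafka Streams threads inside Control Center
    confluent.controlcenter.streams.num.stream.threads=8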

"There are some known issues in the current release that could cause it to get stuck, especially if the underlying streams app is throwing exception. " I don't recall the streams app throwing exceptions during this reconsuming stage after I reset the control center id, but perhaps it is related. The logs are frustrating barren actually, which is why I'm never sure if it's reconsuming or if it has gotten stuck.

"Typically the LockExceptions should be harmless, as they can sometimes happen when Kafka Streams initiates a rebalancing" Ah, that is good to know. I was thinking the LockExceptions cause the rebalancing but it makes sense now that the rebalancing would cause the LockExceptions. In that case, I have no idea what causes the rebalancing to happen, but I will try bumping up the config that you suggested.

I think one more thing I will try is to move the control center onto its own machine with lots and lots of resources (currently it is co-located with a Kafka broker, ZooKeeper, and a Connect node, which it probably doesn't like very much). Hopefully this will help it crunch through data faster, and I can give it 8 stream threads. Perhaps the increased resources will mean it won't rebalance as much, since tasks won't time out and everything will be smoother.

Anyways, thanks again for the responses!
--Liz

Xavier Léauté

Apr 13, 2017, 2:23:12 PM
to confluent...@googlegroups.com
We're running a kafka prototype cluster and we're ramping up the amount of data going in. At this point, it's just around 400 topics (most of those are C3 topics actually) and we have around 3600 partitions across 3 brokers. The _confluent-monitoring topic has 12 partitions and 3 replicas and contains about 1.43GB of data. This doesn't seem like too much data

Thank you. The number that will predominantly dictate the amount of resources required by control center is the number of topic-partitions produced to/consumed from by clients with interceptors enabled. If you are running the intercepted clients against topics in a different cluster, that number may be very different from the number of partitions in the cluster running control center. Regardless, I agree that this seems a bit long to reprocess that amount of data.

I think one more thing I will try is to move the control center onto its own machine with lots and lots of resources (currently it is collocated with a kafka broker, zk and a connect node which it probably doesn't like very much).

Agreed, I wouldn't recommend co-locating all those services; control center can be quite resource intensive. I'd recommend a dedicated m4.2xl if you are running on AWS, and increasing the heap size to 6GB to give it a bit more headroom.
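
As a sketch, assuming the start script honors a CONTROL_CENTER_HEAP_OPTS environment variable the way the other Confluent scripts honor their *_HEAP_OPTS variables, and that you are using the packaged paths (check your install), bumping the heap would look like:

    # give Control Center a 6GB heap before starting it
    export CONTROL_CENTER_HEAP_OPTS="-Xms6g -Xmx6g"
    control-center-start /etc/confluent-control-center/control-center.properties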
 
Hopefully this will help it crunch through data faster and I can give it 8 stream threads. Perhaps the increased resources will mean it won't rebalance as much since tasks won't time out and everything will be smoother.

I would still change the max.poll.interval.ms regardless. We are actually changing the default in Kafka Streams 0.10.2.1 so streams apps don't run into this problem when restoring large state stores (see https://github.com/apache/kafka/pull/2770)

Xavier

Elizabeth Bennett

Apr 13, 2017, 8:56:10 PM
to confluent...@googlegroups.com
Okay, so I did two things: I started up an additional control center on an m4.2xl and gave it 6GB heap and 8 threads. I gave it a new id, and let it consume from the earliest offset. I let it run like this for 2 days, and still, I just get a "Loading Data" screen when I load the UI. I think it must definitely be stuck...

On my original control center, I bumped up the max.poll.interval.ms config to 86400000. I bounced it and it still rebalances in the same way. I let it run for a long time (>1 hour) and it eventually recovered. But something happened and now it no longer shows consumption metrics. It looks exactly like the issue that I posted about here: https://groups.google.com/forum/#!searchin/confluent-platform/control$20center|sort:date/confluent-platform/MbjUipxwRB4/0gSkVVYvDgAJ

Basically, I just get producer metrics showing up, but the control center is showing 0% for consumption. When I run the control-center-console-consumer, lots of consumer messages are pouring in, but the control center is refusing to acknowledge them. It seems like a separate issue, but it is frustrating because the control center will destabilize like this, and then the only way I know to fix it is to reset the control center id. The control center doesn't seem able to reconsume from the earliest offset, so I must also reset the offsets, causing it to lose all of its data. I hope this all has something to do with my deployment, because the control center is very cool when it is working well. Are there any other common causes of instability in the control center besides file limits, heap space, and RAM?


Xavier Léauté

Apr 17, 2017, 9:43:26 PM
to confluent...@googlegroups.com
Okay, so I did two things: I started up an additional control center on an m4.2xl and gave it 6GB heap and 8 threads. I gave it a new id, and let it consume from the earliest offset. I let it run like this for 2 days, and still, I just get a "Loading Data" screen when I load the UI. I think it must definitely be stuck...

Just to be sure, did you also set max.poll.interval.ms on the new control center instance? If you can send the logs for the new one, that would help trace down what is happening.

On my original control center, I bumped up the max.poll.interval.ms config to 86400000. I bounced it and it still rebalances in the same way. I let it run for a long time (>1 hour) and it eventually recovered.

So it looks like it recovered but is still rebalancing, or am I reading this wrong?

But something happened and now it no longer shows consuming metrics. It looks exactly like the issue that I posted about here: https://groups.google.com/forum/#!searchin/confluent-platform/control$20center|sort:date/confluent-platform/MbjUipxwRB4/0gSkVVYvDgAJ

That does look like a separate issue; it would be helpful to see the entire log, if you don't mind.
 
 Are there any other common causes of instability in the control center besides file limits, heap space, and RAM?

I can't think of much else at this point. Are you seeing any signs of excessive time spent garbage collecting? If that's the case, I would try reducing confluent.controlcenter.streams.max.poll.records to 100.

If you're seeing issues on the producer side, also set confluent.controlcenter.streams.retries to something like 10.
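
For example, both of those also go in control-center.properties:

    # smaller poll batches reduce memory pressure per poll during reprocessing
    confluent.controlcenter.streams.max.poll.records=100
    # retry transient produce failures in the internal streams app
    confluent.controlcenter.streams.retries=10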

If none of this helps, I'd probably want to increase the log level to debug and dig into the logs.
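
A minimal log4j sketch for that, assuming Control Center picks up a standard log4j.properties (how the file is wired in depends on your install):

    # keep the root at INFO but turn Control Center's own packages up to DEBUG
    log4j.rootLogger=INFO, stdout
    log4j.logger.io.confluent.controlcenter=DEBUG
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c:%L)%n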


Elizabeth Bennett

Apr 18, 2017, 8:30:48 PM
to confluent...@googlegroups.com
Just to be sure, did you make sure to also set max.poll.interval.ms on the new control center instance? If you can send the logs for the new one that would be helpful to help trace down what is happening.

I set the max.poll.interval.ms config to 86400000 on both instances.

On my original control center, I bumped up the max.poll.interval.ms config to 86400000. I bounced it and it still rebalances in the same way. I let it run for a long time (>1 hour) and it eventually recovered.

So it looks like it recovered but is still rebalancing, or am I reading this wrong?

It rebalanced for about an hour and then recovered. I think each time it rebalanced, a new task was starting up; that's what the logs seem to indicate. And then once all the tasks had started up, it stopped rebalancing. It's just that it takes about an hour for all the tasks to start up. It wouldn't be so bad were it not for the fact that the REST server doesn't start up until the rebalancing stops, so the C3 GUI is unreachable for a really long time.

I'd be happy to provide the logs of what the control center was doing. I'll send you a separate e-mail with the attached log files.

Thanks again for your responses and for taking the time to help troubleshoot.

Cheers,
Liz

