Topics will not rebalance after upgrade / out of sync replicas

Justin Desilets

Sep 28, 2018, 11:03:49 AM
to Confluent Platform
We recently performed a forklift upgrade from Apache Kafka 0.11 to Confluent Kafka 2.11 (4.1.2). This is a 3-node cluster where each broker host also runs a ZooKeeper node. Each server runs CentOS 7 with java-1.8.0-openjdk installed. The upgrade consisted of shutting down all brokers, then the ZooKeeper nodes, then removing the old Kafka binaries and installing the new Confluent packages. When bringing the hosts back up, the ZooKeeper nodes synced and the brokers were then brought back online. When checking the status of all partitions we see the following behavior.
kafka-topics --zookeeper localhost:2181 --describe --topic __consumer_offsets --under-replicated-partitions
Topic: __consumer_offsets Partition: 1 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 3 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 4 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 5 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 7 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 8 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 9 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 10 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 11 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 13 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 14 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 15 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 16 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 17 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 19 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 20 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 21 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 22 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 23 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 25 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 26 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 27 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 28 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 29 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 31 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 32 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 33 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 34 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 35 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 37 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 38 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 39 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 40 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 41 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 43 Leader: 1 Replicas: 3,1,2 Isr: 1
Topic: __consumer_offsets Partition: 44 Leader: 1 Replicas: 1,2,3 Isr: 1
Topic: __consumer_offsets Partition: 45 Leader: 1 Replicas: 2,1,3 Isr: 1
Topic: __consumer_offsets Partition: 46 Leader: 1 Replicas: 3,2,1 Isr: 1
Topic: __consumer_offsets Partition: 47 Leader: 1 Replicas: 1,3,2 Isr: 1
Topic: __consumer_offsets Partition: 49 Leader: 1 Replicas: 3,1,2 Isr: 1

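In case it is relevant, I also plan to double-check that all three broker ids are registered in ZooKeeper and which broker currently holds the controller role, with something like the following (run against the local ZooKeeper node; invocation is approximate):

# List the broker ids registered in ZooKeeper (all of 1, 2, 3 should appear)
zookeeper-shell localhost:2181 ls /brokers/ids
# Show which broker is currently the controller
zookeeper-shell localhost:2181 get /controller
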
Watching state-change.log, I see the following error logged repeatedly on the controller broker:
[2018-09-28 06:57:17,852] ERROR [Controller id=1 epoch=48] Controller 1 epoch 48 failed to change state for partition __consumer_offsets-25 from OnlinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-25 under strategy PreferredReplicaPartitionLeaderElectionStrategy
at kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:328)
at kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:326)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at kafka.controller.PartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:326)
at kafka.controller.PartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:254)
at kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:175)
at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:116)
at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:607)
at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$18.apply(KafkaController.scala:1003)
at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$18.apply(KafkaController.scala:996)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:996)
at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:983)
at scala.collection.immutable.Map$Map3.foreach(Map.scala:161)
at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance(KafkaController.scala:983)
at kafka.controller.KafkaController$AutoPreferredReplicaLeaderElection$.process(KafkaController.scala:1017)
at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69)
at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

I have tried restarting one of the brokers, and it looks like the server attempts to bring the replicas back in sync, but then the above errors are logged again. I have also tried both the `kafka-preferred-replica-election` script and the `kafka-reassign-partitions` script, but neither has done anything to fix the under-replicated partitions. I have been digging through all the logs for a specific reason why the state change is failing. If you need more information, please let me know.
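For reference, the two scripts were invoked roughly as below; the reassignment JSON is only an illustrative snippet covering a single partition, not the full file that was used:

# Ask the controller to move leadership back to the preferred (first-listed) replicas
kafka-preferred-replica-election --zookeeper localhost:2181

# reassign.json (illustrative, one partition shown):
# {
#   "version": 1,
#   "partitions": [
#     { "topic": "__consumer_offsets", "partition": 1, "replicas": [3, 1, 2] }
#   ]
# }
kafka-reassign-partitions --zookeeper localhost:2181 --reassignment-json-file reassign.json --execute
kafka-reassign-partitions --zookeeper localhost:2181 --reassignment-json-file reassign.json --verify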