Data loss while upgrading confluent 3.0.0 kafka cluster to confluent 3.2.2


Yogesh Sangvikar

Sep 18, 2017, 11:33:04 AM
to Confluent Platform
Hi Team,

Currently, we are using a Confluent 3.0.0 Kafka cluster in our production environment, and we are planning to upgrade the cluster to Confluent 3.2.2.
We have topics with millions of records, and data is being published to those topics continuously. We also use other Confluent services such as Schema Registry, Kafka Connect, and Kafka REST to process the data.

So, we can't afford any downtime for the upgrade.

We have tried a rolling Kafka upgrade in our development environment, as suggested in the documentation:

https://docs.confluent.io/3.2.2/upgrade.html


However, we are observing data loss on topics while doing the rolling upgrade/restart of the Kafka brokers to bump "inter.broker.protocol.version=0.10.2".
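For context, the change we rolled out per broker was essentially the following (a sketch of the server.properties edit as we applied it, restarting one broker at a time):

  # server.properties on each broker, changed and the broker then restarted one at a time
  inter.broker.protocol.version=0.10.2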

As per our observations, we suspect the following root cause for the data loss (explained for a topic partition with 3 replicas):
  • As the broker protocol version is updated from 0.10.0 to 0.10.2 in a rolling fashion, the in-sync replicas running the older version do not let the updated (0.10.2) replicas stay in sync until all brokers have been updated.
  • Also, we have explicitly disabled unclean leader election ("unclean.leader.election.enable=false"), so only in-sync replicas can be elected as leader for a given partition.
  • While updating in a rolling fashion, as mentioned above, the older-version leader does not keep the newer-version replicas in sync, so data produced through that older-version leader is not replicated to them. When that leader (old version) then goes down for its own upgrade, the already-updated replicas are shown in the in-sync column and one of them becomes the leader, but they lag the old leader in offsets and only expose the data up to the point they had replicated (see the commands sketched after this list).
  • And once the last replica comes back up with the updated version, it starts syncing data from the current leader.
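To see this during the roll, the partition state and end offsets can be checked with the standard tools, for example (a sketch; the ZooKeeper/broker addresses and topic name are placeholders for our environment):

  # Show leader, replicas and in-sync replicas per partition
  kafka-topics --describe --zookeeper localhost:2181 --topic my-topic

  # Show the latest end offset of each partition (--time -1 means "latest")
  kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -1

Comparing the end offsets reported just before and just after the old-version leader is restarted shows whether the newly elected leader is behind it.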

Please let us know your comments on our observations and suggest the proper way to do a rolling Kafka upgrade, as we can't afford downtime.

Thanks,
Yogesh

Yogesh Sangvikar

Sep 26, 2017, 11:25:57 AM
to Confluent Platform
Hi Team,

I got replies from the Kafka forum and was able to understand the issue and work out the steps for the upgrade.
Thanks a lot, Kafka forum.

We have retried the Kafka cluster rolling upgrade, this time making the version changes in the upgraded Confluent 3.2.2 package (pinning CURRENT_KAFKA_VERSION and CURRENT_MESSAGE_FORMAT_VERSION to 0.10.0 first, and then bumping them to the upgraded version 0.10.2, as sketched below). With this, we observed that the in-sync replicas come back immediately, and the preferred leaders are elected after the version bump, once replication has caught up.
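In server.properties terms, the procedure looks roughly like this (a sketch based on the linked upgrade guide; each phase is applied broker by broker with a rolling restart):

  # Phase 1: while replacing the binaries with Confluent 3.2.2, pin the current versions
  inter.broker.protocol.version=0.10.0
  log.message.format.version=0.10.0

  # Phase 2: once ALL brokers are running 3.2.2, bump the protocol version and do a second rolling restart
  inter.broker.protocol.version=0.10.2

  # Phase 3 (later, after clients are upgraded): bump the message format version and roll again
  log.message.format.version=0.10.2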

As per my understanding, the in-sync replicas and the leader election catch up quickly because the new data published during the upgrade is written and synced using the upgraded package libraries (0.10.2).

Also, we observed that some records failed to be produced due to the following error:

kafka-rest error response -

{"offsets":[{"partition":null,"offset":null,"error_code":50003,"error":"This server is not the leader for that topic-partition."}],"key_schema_id":1542,"value_schema_id":1541}

Exception in log file -
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.


To resolve the above error, we have overridden the properties acks=-1 (default: 1) and retries=3 (default: 0) in the Kafka REST producer config (kafka-rest.properties). With this we get some duplicate events in the topic, but it is better to have duplicate records than data loss.
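For reference, the override amounts to something like the following in kafka-rest.properties (a sketch; depending on the REST Proxy version these settings may need a "producer." prefix to be passed through to the embedded producer):

  # Wait for all in-sync replicas to acknowledge each write, not just the leader
  acks=-1
  # Retry transient errors such as NotLeaderForPartitionException instead of failing the request
  retries=3

With acks=-1 and retries enabled, a produce request that hits a broker during a leader change is retried against the new leader, which is why some duplicate events appear instead of lost records.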

  
Thanks,
Yogesh