Where does Kafka-connect-hdfs store offsets for checkpointing


Tariq Mohammad

May 3, 2016, 1:49:18 PM
to Confluent Platform
Dear fellow confluent users,

Could someone help with the above query? I had a standalone Kafka Connect process copying data from a topic into a Hive table. Because of an issue I had to stop Kafka Connect and drop both the Hive table and the underlying HDFS directory. But when I restarted Kafka Connect it did not pick up the data from the beginning; it resumed copying from where it had left off when I stopped the process.

Many thanks!

Liquan Pei

May 3, 2016, 5:02:32 PM
to confluent...@googlegroups.com
Hi Tariq,

The offset information is encoded in the filenames of the files in HDFS. When the HDFS connector restarts, it traverses the directory in HDFS to find the last committed offset and starts copying data from there. This is crucial for exactly-once delivery, as we rely on the offsets in HDFS to rewind the Kafka topics to the proper position.
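To illustrate, the connector's data files encode the Kafka offsets in their names (by convention `<topic>+<partition>+<startOffset>+<endOffset>.<format>`). A minimal sketch of recovering the last committed offset from a directory listing; the filenames below are invented examples:

```python
import re

# Filename convention used by the HDFS connector:
#   <topic>+<partition>+<startOffset>+<endOffset>.<format>
FILENAME_RE = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$"
)

def last_committed_offset(filenames):
    """Return the highest end offset encoded in the filenames, or None."""
    offsets = []
    for name in filenames:
        m = FILENAME_RE.match(name)
        if m:
            offsets.append(int(m.group("end")))
    return max(offsets, default=None)

# Hypothetical directory listing for one topic partition:
files = [
    "test-topic+0+0000000000+0000000999.avro",
    "test-topic+0+0000001000+0000001999.avro",
]
print(last_committed_offset(files))  # -> 1999
```

On restart, the connector would resume consuming from the offset after the highest end offset it finds, which is why deleting the Hive table alone does not reset its position.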

Thanks,
Liquan

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/296822a6-d58d-476b-ad46-ceda734738ab%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Liquan Pei | Software Engineer | Confluent | +1 413.230.6855
Download Apache Kafka and Confluent Platform: www.confluent.io/download

Mohammad Tariq

May 3, 2016, 5:07:40 PM
to confluent...@googlegroups.com
Hi Liquan,

As always thank you so much for the prompt response :-)

Well, this was my understanding too. But even after deleting the base HDFS directory (meaning the offset information is no longer available), Kafka Connect doesn't start copying data from the very beginning, as I mentioned above. Let me try to reproduce the issue I am facing and get back to you with more details.

Thanks again. Really appreciate it!

Liquan Pei

May 3, 2016, 5:24:29 PM
to confluent...@googlegroups.com
Hi Tariq,

Sure. Definitely. I think I know the issue. Although the HDFS connector manages its offsets in HDFS, we didn't disable the offset commit to Kafka. Even though the files are removed from HDFS, on restart we still start from the offsets committed in Kafka. One workaround is to use a different connector name when you restart the connector.
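A sink connector's consumer group is derived from its name, so renaming it makes the old committed Kafka offsets irrelevant. A hypothetical standalone config sketch (the names and values here are invented examples, not from the thread):

```properties
# hdfs-sink.properties -- hypothetical standalone sink config.
# Renaming the connector gives it a fresh consumer group,
# so the Kafka offsets committed under the old name are ignored.
name=hdfs-sink-v2          # was: hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=test-topic
hdfs.url=hdfs://localhost:9000
flush.size=1000
```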

Thanks,
Liquan



Mohammad Tariq

May 3, 2016, 5:40:58 PM
to confluent...@googlegroups.com
Ah, I see. Let me try this out. Will let you know how it goes.

Thanks for the suggestion!