Where does Kafka-connect-hdfs store offsets for checkpointing


Tariq Mohammad

May 3, 2016, 1:49:18 PM
to Confluent Platform
Dear fellow confluent users,

Could someone help with the above query? I had a standalone Kafka Connect process copying data from a topic into a Hive table. Because of an issue I had to stop Kafka Connect and drop both the Hive table and the underlying HDFS directory. But when I restarted Kafka Connect it did not pick up the data from the beginning; it resumed copying from where it had left off when I stopped the process.

Many thanks!

Liquan Pei

May 3, 2016, 5:02:32 PM
to confluent...@googlegroups.com
Hi Tariq,

The offset information is encoded in the filenames of the files in HDFS. When the HDFS connector restarts, it traverses the directory in HDFS to find the last committed offset and starts copying data from there. This is crucial for exactly-once delivery, as we rely on the offsets in HDFS to rewind the Kafka topics to the proper position.
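To illustrate, the connector's data files encode the Kafka offsets in their names (by convention `<topic>+<partition>+<startOffset>+<endOffset>.<format>`). A minimal sketch of recovering the last committed offset from a directory listing; the filenames below are invented examples:

```python
import re

# Filename convention used by the HDFS connector:
#   <topic>+<partition>+<startOffset>+<endOffset>.<format>
FILENAME_RE = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$"
)

def last_committed_offset(filenames):
    """Return the highest end offset encoded in the filenames, or None."""
    offsets = []
    for name in filenames:
        m = FILENAME_RE.match(name)
        if m:
            offsets.append(int(m.group("end")))
    return max(offsets, default=None)

# Hypothetical directory listing for one topic partition:
files = [
    "test-topic+0+0000000000+0000000999.avro",
    "test-topic+0+0000001000+0000001999.avro",
]
print(last_committed_offset(files))  # -> 1999
```

On restart, the connector would resume consuming from the offset after the highest end offset it finds, which is why deleting the Hive table alone does not reset its position.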

Thanks,
Liquan

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/296822a6-d58d-476b-ad46-ceda734738ab%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Liquan Pei | Software Engineer | Confluent | +1 413.230.6855
Download Apache Kafka and Confluent Platform: www.confluent.io/download

Mohammad Tariq

May 3, 2016, 5:07:40 PM
to confluent...@googlegroups.com
Hi Liquan,

As always thank you so much for the prompt response :-)

Well, this was my understanding too. But even after deleting the base HDFS directory (meaning the offset information is no longer available), Kafka Connect doesn't start copying data from the very beginning, as I mentioned above. Let me try to reproduce the issue I am facing and get back to you with more details.

Thanks again. Really appreciate it!

Liquan Pei

May 3, 2016, 5:24:29 PM
to confluent...@googlegroups.com
Hi Tariq,

Sure. Definitely. I think I know the issue. Although the HDFS connector manages its offsets in HDFS, we didn't disable the offset commit to Kafka. Even though the files are removed from HDFS, on restart we still start from the offsets committed in Kafka. One workaround is to use a different connector name when you restart the connector.
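A sink connector's consumer group is derived from its name, so renaming it makes the old committed Kafka offsets irrelevant. A hypothetical standalone config sketch (the names and values here are invented examples, not from the thread):

```properties
# hdfs-sink.properties -- hypothetical standalone sink config.
# Renaming the connector gives it a fresh consumer group,
# so the Kafka offsets committed under the old name are ignored.
name=hdfs-sink-v2          # was: hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=test-topic
hdfs.url=hdfs://localhost:9000
flush.size=1000
```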

Thanks,
Liquan



Mohammad Tariq

May 3, 2016, 5:40:58 PM
to confluent...@googlegroups.com
Ah, I see. Let me try this out. Will let you know how it goes.

Thanks for the suggestion!