Hi,
May I ask how to achieve at-least-once delivery for a Kafka Connect sink task without introducing external offset management?
It appears to me that topic/partition offsets are auto-managed by the Kafka Connect framework, and clients have no way to commit offsets manually.
For example, the current flow generally works like this:
1) Kafka Connect calls SinkTask.put
2) The SinkTask writes the records to the sink system (asynchronously)
3) Kafka Connect calls "flush(Map<TopicPartition, OffsetAndMetadata> meta)" with the offsets passed in, which gives the sink task a chance to commit the offsets to external offset-management storage. If there is no external offset store such as a KV store, the SinkTask usually doesn't override this method
4) Kafka Connect periodically commits the offsets to the offset topic
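To make the gap concrete, here is a minimal self-contained sketch of the flow above (plain Java with stand-in fields instead of the real Kafka Connect classes; the names are illustrative only, not Connect APIs): put() hands the record to an async writer and returns immediately, while the framework commits based on what put() consumed, not on what the sink system has acknowledged.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for a sink task; not the real Kafka Connect classes.
class CurrentFlowSketch {
    // Highest offset handed to put(), which the framework will happily commit.
    static final AtomicLong lastPutOffset = new AtomicLong(-1);
    // Highest offset actually acknowledged by the sink system.
    static final AtomicLong lastAckedOffset = new AtomicLong(-1);

    // Steps 1)-2): put() starts an async write and returns immediately.
    static void put(long offset) {
        lastPutOffset.set(offset);
        CompletableFuture.runAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { }
            lastAckedOffset.set(offset); // the write lands some time later
        });
    }

    // Step 4): the framework commits what put() has consumed so far.
    static long offsetTheFrameworkCommits() {
        return lastPutOffset.get();
    }

    public static void main(String[] args) {
        put(0); put(1); put(2);
        // The periodic commit can run before any async write has finished:
        System.out.println("committed=" + offsetTheFrameworkCommits()
                + " acked=" + lastAckedOffset.get());
    }
}
```

If the task crashes between the commit and the async writes completing, the committed-but-unwritten records are lost, which is exactly the at-least-once violation described below.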
The problem is that when the SinkTask writes records to the sink system, in many cases this is an asynchronous process: after step 2) succeeds, the records are not necessarily in the sink system yet, which in turn means the commit in step 4) is not safe. The key point is that we don't want to block until the async write has fully completed (via a callback notification) before moving forward, since that would hurt throughput a lot.
If Kafka Connect exposed an offset-commit "commitSync" API to the SinkTask, and provided an API to disable the framework's periodic commits, a sink task would have a way to achieve at-least-once delivery, I guess.
The new flow would look like this:
1) Kafka Connect calls SinkTask.put
2) The SinkTask writes the records to the sink system (asynchronously)
3) Kafka Connect calls "flush" (we don't need to care about this)
4) Some time later, a callback is invoked in the SinkTask indicating the records are completely in the sink system; it is now safe to commit those offsets
5) "commitSync" is called from the callback, which commits the offsets to the offset topic
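Steps 4)-5) above can be sketched as follows. This is a self-contained illustration, not a real API: commitSync() here is a stand-in for the hypothetical framework hook, and it just records what would be committed. Since async writes can complete out of order, the callback only commits the highest contiguous acknowledged offset, so it never commits past a gap left by a still-in-flight write.

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed flow for a single partition; illustrative only.
class ProposedFlowSketch {
    final ConcurrentSkipListSet<Long> acked = new ConcurrentSkipListSet<>();
    final AtomicLong committed = new AtomicLong(-1);

    // Stand-in for the hypothetical "commitSync" framework API.
    void commitSync(long offset) {
        committed.set(offset);
    }

    // Step 4): callback invoked when the sink system confirms one record.
    void onWriteComplete(long offset) {
        acked.add(offset);
        // Advance over the contiguous prefix of acknowledged offsets.
        long next = committed.get() + 1;
        while (acked.contains(next)) {
            acked.remove(next);
            next++;
        }
        if (next - 1 > committed.get()) {
            commitSync(next - 1); // step 5): commit from the callback
        }
    }

    public static void main(String[] args) {
        ProposedFlowSketch task = new ProposedFlowSketch();
        // Async writes may complete out of order:
        task.onWriteComplete(1); // gap at offset 0 -> nothing committed yet
        System.out.println("after ack 1: committed=" + task.committed.get());
        task.onWriteComplete(0); // gap filled -> commit up to 1
        System.out.println("after ack 0: committed=" + task.committed.get());
        task.onWriteComplete(2);
        System.out.println("after ack 2: committed=" + task.committed.get());
    }
}
```

With this bookkeeping, a crash before the callback fires only re-delivers records that were never committed, which is at-least-once rather than lost data.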
Thank you!