How does Debezium Postgres connector guarantee at-least-once delivery when publishing to multiple brokers?

7 views

Skip to first unread message

Yichao Zhao

unread,

Jun 24, 2024, 12:37:46 AM (9 days ago) Jun 24

to debezium

Hi folks,

I'm asking this question mainly to understand how we can apply short term fix for data loss issue in our in-house CDC equivalent service by following Debezium's best practice. I've personally operated Debezium before at my previous place and really liked it, and long term we're considering switching to Debezium too.

Here's the context of our in-house CDC service:

- Uses confluent-go-library (librdkafka-based)

- Uses idempotent producer

- For simplicity, let's say it connects to one PG primary's replication slot

- Based on the table of the change, it publishes to corresponding per-table topic (we use AWS MSK)

- There's a goroutine listening to the producer.Events() channel and ack higher LSN back to the replication slot

We have validation mechanism that surfaced the data loss behavior. They are always accompanied with broker connection issues. Our error-handling logic is written so that when there's publishing error (from Events chan), it'd restart the service (including reconnecting the replication slot etc.)

Our theory is that, since the service talks to multiple brokers, and the actual publishing is done asynchronously, if one broker is having trouble & publish gets stuck, the publishing channel to _other_ brokers are still flowing, thus we could've been committing higher LSN while some earlier change's message (that would be published to the stuck broker) are still in local queue. And if the stuck publish finally return error to Event chan, we'd error out and restart the service (with best effort flush, but can't guarantee because the broker could still be unreachable). This effectively skipped the message in the local queue, because we've committed a higher LSN to the replication slot, where we'd resume from after restart.

My understanding is that this (one DB to multiple topics) should be a fairly common use case for Debezium postgres, so I'm hoping to learn more how does it guarantee at-least-once delivery to assess if we can fix it in our system.