debezium and kafka topic design question

n99

Feb 21, 2022, 6:40:50 AM
to debezium
Hello

Not strictly a debezium question but one about debezium and kafka.

Apologies but I thought the easiest way to explain is to lead you down my thought process path... :)

So, for example I have a db with tables

customers
customer_emails
customer_addresses
customer_telephones
etc

Working towards a topic holding multiple customer events as described here https://www.confluent.io/blog/put-several-event-types-kafka-topic/

....I've created a debezium connector that puts all CDC events for each table into the same customer topic.
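
One way to do that routing is Kafka Connect's RegexRouter single message transform. Purely as an illustration (this may not be how my actual connector is set up, and the server/schema/topic names below are made up), the relevant part of the config could look like this, expressed as Java Properties:

    import java.util.Properties;

    public class CustomerTopicRouting {

        // Sketch only: collapse dbserver1.inventory.customers, customer_emails,
        // customer_addresses, etc. into a single "customers" topic.
        public static Properties routingConfig() {
            Properties props = new Properties();
            props.put("transforms", "route");
            props.put("transforms.route.type",
                    "org.apache.kafka.connect.transforms.RegexRouter");
            props.put("transforms.route.regex", "dbserver1\\.inventory\\.customer.*");
            props.put("transforms.route.replacement", "customers");
            return props;
        }
    }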

I'm wanting to use this topic for replay for any new consumers, i.e. allowing them to get a full load of all the customer data before they then get new messages coming through.

Given a topic contains these messages with CRUD events for one (customerId/table name) key:

- New customer (C)
- Change customer name (U)
- New customer_address (C)
- New customer_email (C)
- Change customer_email (U)

...any new consumer will have logic in it to deal with Creates and Updates (as well as Reads and Deletes) differently.
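
For illustration, that consumer logic might look something like the sketch below. This assumes the standard Debezium change event envelope (an "op" field with c/r/u/d plus "before"/"after" blocks), JSON-serialized with schemas disabled so those fields are top level; the handler methods are just stubs:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    public class CustomerEventDispatcher {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        public void dispatch(ConsumerRecord<String, String> record) throws Exception {
            if (record.value() == null) {
                handleTombstone(record.key());            // compaction tombstone
                return;
            }
            JsonNode envelope = MAPPER.readTree(record.value());
            switch (envelope.path("op").asText()) {
                case "c":                                 // create
                case "r":                                 // snapshot read
                    handleCreate(envelope.path("after"));
                    break;
                case "u":                                 // update
                    handleUpdate(envelope.path("after"));
                    break;
                case "d":                                 // delete
                    handleDelete(envelope.path("before"));
                    break;
            }
        }

        private void handleTombstone(String key) { /* remove downstream */ }
        private void handleCreate(JsonNode after) { /* insert downstream */ }
        private void handleUpdate(JsonNode after) { /* update downstream */ }
        private void handleDelete(JsonNode before) { /* delete downstream */ }
    }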

That all sounds fine.

However, now I add in topic compaction...

To avoid consumption of duplicates and deleted items, I'm setting the topic to use log compaction and tombstoning.
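
For reference, creating such a topic with compaction enabled via the AdminClient looks roughly like the sketch below (the topic name, partition count, replication factor and tombstone retention are placeholders, not my real values):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CompactedTopicSetup {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic customers = new NewTopic("customers", 3, (short) 1)
                        .configs(Map.of(
                                TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT,
                                // how long tombstones stick around before compaction may drop them
                                TopicConfig.DELETE_RETENTION_MS_CONFIG, "86400000"));
                admin.createTopics(List.of(customers)).all().get();
            }
        }
    }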

Log compaction will leave the latest message for each key in the topic:

- Change customer name (U)
- New customer_address (C)
- Change customer_email (U)

In this scenario a new consumer can't treat these replayed Update events as Updates but has to treat them as Creates? However, it will have to treat non-compacted, new Update messages as actual Updates (or whatever their actual CRUD event is)?

So a consumer will have to treat messages in the compacted tail differently to messages in the non-compacted head. 

This also sounds doable, but it's not anything I've read about on my kafka travels.

However, I'm currently stuck, as there does not seem to be a way for a consumer to tell where the head and tail meet on a topic, i.e. how far compaction has currently got.

This leads me to think I'm missing something here?

Grateful for any advice if people have figured this out already, or can see a flaw in my thinking?

best wishes

n99


Gunnar Morling

Feb 21, 2022, 7:58:39 AM
to debezium
> So a consumer will have to treat messages in the compacted tail differently to messages in the non-compacted head. 

I don't quite understand the reasoning. In general, an update event should contain the complete row state (there are some exceptions, e.g. when using Postgres without replica identity FULL), so you should apply the "after" part of update and insert events equally when constructing any kind of derived views downstream.
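
For example, if you read the topic into a Kafka Streams KTable, the latest value per key simply wins and tombstones delete the key, regardless of whether that value came from a "c" or a "u" event. A minimal sketch (topic and application names are placeholders; in practice you'd typically unwrap the Debezium envelope first, e.g. with the ExtractNewRecordState SMT or a mapValues step):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;

    import java.util.Properties;

    public class CustomerViewTopology {

        public static KafkaStreams build() {
            StreamsBuilder builder = new StreamsBuilder();

            // A KTable is an upsert view: the latest non-null value per key wins,
            // and null values (tombstones) delete the key.
            KTable<String, String> customers = builder.table(
                    "customers", Consumed.with(Serdes.String(), Serdes.String()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-view");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            return new KafkaStreams(builder.build(), props);
        }
    }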

Perhaps this post (and code example) about foreign key joins with Kafka Streams could be useful:


Hth,

--Gunnar

Chris Cranford

Feb 21, 2022, 8:01:24 AM
to debe...@googlegroups.com, n99
Hi

Event compaction comes with a variety of things you need to consider. 

First, as you've noted, consumers and sink connectors must be capable of managing the insert/update events using UPSERT semantics. This means that if you get an UPDATE for an event where you never had an INSERT, the UPDATE must be processed as though it's an INSERT.

Secondly, if there are any parent/child relationships between events of a given topic, log compaction may cause issues with the consumer/sink. Let's take the following example:

    INSERT - Parent ID 1
    INSERT - Child ID 2 (depends on Parent ID 1)
    UPDATE - Parent ID 1

After log compaction you get:

    INSERT - Child ID 2 (depends on Parent ID 1)
    UPDATE - Parent ID 1

When processing Child ID 2, there won't be a Parent ID 1 in the destination yet, and if there is an FK constraint on this relationship then you'll get a constraint violation.
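
One way a consumer could cope with that ordering problem is to buffer child events until their parent has been applied. This is just a sketch of the idea, not something Debezium provides out of the box; the event types and the way the parent id is obtained are made up:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ParentChildBuffer {

        private final Set<String> appliedParents = new HashSet<>();
        private final Map<String, List<ChildEvent>> pendingChildren = new HashMap<>();

        public void onParent(String parentId, ParentEvent event) {
            applyParent(event);
            appliedParents.add(parentId);
            // Flush any children that arrived before their parent.
            for (ChildEvent child : pendingChildren.getOrDefault(parentId, List.of())) {
                applyChild(child);
            }
            pendingChildren.remove(parentId);
        }

        public void onChild(String parentId, ChildEvent event) {
            if (appliedParents.contains(parentId)) {
                applyChild(event);
            }
            else {
                pendingChildren.computeIfAbsent(parentId, k -> new ArrayList<>()).add(event);
            }
        }

        private void applyParent(ParentEvent e) { /* write parent row downstream */ }
        private void applyChild(ChildEvent e) { /* write child row downstream */ }

        record ParentEvent(String id) {}
        record ChildEvent(String id, String parentId) {}
    }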

Lastly, when using log compaction it's important to understand the payload of each event, which depends on the connector you're using. For example, PostgreSQL and Oracle have the concepts of TOAST columns and unavailable column values, respectively. Large character or binary data is provided in the INSERT events on tables with such columns, but may not be present on UPDATE or DELETE events unless those columns were changed. This means that when Kafka compacts events, you can lose the values associated with such columns, since they are not provided in every event type.
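
On the consumer side you can at least avoid overwriting good data with the placeholder that Debezium puts in place of such columns. A rough sketch (the placeholder string below is the documented default but it is configurable, and the merge logic is only an illustration; it obviously cannot recover a value whose original event has already been compacted away):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    import java.util.Iterator;
    import java.util.Map;

    public class ToastAwareMerge {

        private static final String UNAVAILABLE = "__debezium_unavailable_value";

        // Merge a new "after" image over the stored row, skipping placeholder columns.
        public static ObjectNode merge(ObjectNode stored, JsonNode after) {
            ObjectNode merged = stored.deepCopy();
            for (Iterator<Map.Entry<String, JsonNode>> it = after.fields(); it.hasNext(); ) {
                Map.Entry<String, JsonNode> field = it.next();
                JsonNode value = field.getValue();
                boolean placeholder = value.isTextual() && UNAVAILABLE.equals(value.asText());
                if (!placeholder) {
                    merged.set(field.getKey(), value);
                }
            }
            return merged;
        }
    }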

HTH,
CC


n99

Feb 21, 2022, 10:42:49 AM
to debezium
Hi 
Thanks for that, both of you - very useful.

Gunnar, I guess it is as Chris has said: "consumers must be capable of managing the insert/update events using UPSERT semantics".

I guess the ease of this depends on the consumer and the downstream system. For instance, if the final endpoint is a REST API, I guess there is no automatic UPSERT semantic. Would the consumer have to do a GET first to determine whether the final call should be a POST, PUT or even PATCH?
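
(I suppose if the API happened to treat PUT /customers/{id} as create-or-replace, the consumer could skip the GET entirely and just PUT every upsert. A rough sketch with a made-up endpoint and payload shape, in case that's the situation:)

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestUpserter {

        private final HttpClient client = HttpClient.newHttpClient();

        public int upsert(String customerId, String customerJson) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.example.com/customers/" + customerId))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(customerJson))   // create-or-replace
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode();                                     // e.g. 200 or 201
        }
    }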

I had hoped that knowing where the head and tail met on the topic would allow the consumer to make that choice more easily...

I.e. for a message that was originally an UPDATE:

If I know a message is in the compacted tail - then it's now an INSERT.
If I know a message is in the un-compacted head - then it's an UPDATE.


Cheers

n99