Handling schema changes down stream

1,577 views
Skip to first unread message

amit...@venmo.com

unread,
Jan 7, 2019, 4:39:02 PM1/7/19
to debezium
Hi all,

I was wondering if anyone has experience with handling schema changes from MySQL (aka ALTER TABLE...) automatically while pushing the data into a target database/warehouse of sorts?
I'm wondering how one would "synchronize" that, since there's no guarantee that a client (consumer/s) would read the ALTER TABLE event before the "next" CRUD event after that, right?

Would you recommend waiting for a schema change event when there's an actual schema change on the CRUD topic? so that once a schema change is observed:
  1. load all the data up to the change
  2. wait for the DDL in the "schema topic"
  3. apply the change to the target DB
  4. keep loading to the db with the new (updated) schema
would something like that make sense?

Thanks,
Amit

Gunnar Morling

unread,
Jan 8, 2019, 7:23:01 AM1/8/19
to debezium
Hi,

What you suggest makes sense IMO, you'd have to figure out though how to decide whether you need to wait for a schema change event or not. Alternatively you could inspect the schema coming with the data change event itself (Avro or inline JSON schema) and take action based on that. That'd avoid the need for any sort of synchronization.

--Gunnar

Chris Riccomini

unread,
Jan 8, 2019, 12:15:28 PM1/8/19
to debezium
We handle this lazily in KCBQ. That is, KCBQ reads CDC messages (from DBZ) in Kafka, and attempts to insert them into the destination DB (BigQuery, in our case). If the insert fails due to invalid schema, we attempt to update the BQ schema. If the BQ schema change goes through, then we re-insert. If the change doesn't go through, we are stuck, and the pipeline is down until manual intervention is taken (this is due to an incompatible schema change, which we usually block before it occurs in the DB).

amit...@venmo.com

unread,
Jan 8, 2019, 4:29:05 PM1/8/19
to debezium
My target DW is Redshift so I was thinking of the following:
  1. Compatibility NONE - file rolls every time there's a schema change.
  2. A modified version of the S3 Sink Connector that 
    1. adds the schema version to the filename.
    2. has loader thread/s the periodically looks for new files ands sort them by table name, offset, for loading. Once a schema change was observed, look for the DDL in the "schema change" topic.
To map the MySQL DDL to Redshift DDL, I plan on using the MySqlAntlrDdlParser and map column types according to this by Amazon.

Only thing that worries me know is "rapid" schema updates, or how to avoid synchronizing the table "version" with the events version...
I could maintain a "counter" of sorts that tells me the number of schema changes the table topic "saw" and apply only that number of DDLs from the schema change topic but I'd rather avoid state and I'm not sure if the Confluent Schema guarantees a sequence.

Any suggestions?

Gunnar Morling

unread,
Jan 9, 2019, 10:00:12 AM1/9/19
to debezium
Still curious why you don't process the schema info from CDC events themselves?

Btw. there is an option to add the source column name and length as schema parameters (look for "column.propagate.source.type" in https://debezium.io/docs/connectors/mysql/). Perhaps that could be helpful to you?

--Gunnar

amit...@venmo.com

unread,
Jan 9, 2019, 12:17:59 PM1/9/19
to debezium
Sweet! missed that one. Any plans to support scale in DECIMAL types?

Considering that, I should be able to rely on the "data" topic alone.
I hope I'll have something to share here once I get it working, Thanks!

Andrew Ehrlich

unread,
Jan 9, 2019, 2:13:30 PM1/9/19
to debezium
Similar concerns here. However, I am using the Postgresql connector rather than the MySQL connector. Am I correct in understanding that ALTER statements will only come through the MySQL connector, and with Postgres they will not, meaning we need to infer the change from the difference in schema?

Gunnar Morling

unread,
Jan 9, 2019, 2:27:36 PM1/9/19
to debezium


Am Mittwoch, 9. Januar 2019 18:17:59 UTC+1 schrieb amit...@venmo.com:
Sweet! missed that one. Any plans to support scale in DECIMAL types?

Good point, was wondering about that, too, when digging out the name of the option earlier on. Can you file a JIRA issue for this? 

Considering that, I should be able to rely on the "data" topic alone.
I hope I'll have something to share here once I get it working, Thanks!

Cool; looking forward to that. Also, if you'd like to blog about your experiences on debezium.io, let us know and we can set something up.

Gunnar Morling

unread,
Jan 9, 2019, 2:28:52 PM1/9/19
to debezium
Yes, exactly. Postgres currently doesn't support events upon DDL changes.

amit...@venmo.com

unread,
Jan 9, 2019, 2:41:39 PM1/9/19
to debezium
Created https://issues.jboss.org/browse/DBZ-1073
I can take a look at it later this week.
Reply all
Reply to author
Forward
0 new messages