Hi all,
We've been using Debezium on a new project for the last few months and have had great success. It's a great project, and thanks for all the hard work.
Within the last month we deployed our new data processing system into production in a darkened state and started to see some real-world results. The most notable thing so far has been the growth of the DDL history topic that Debezium uses to store the state of the database schema. We are replacing an ETL system that issues a ton of CREATE TEMPORARY TABLE statements and runs hundreds of times a day, so within a month the history topic grew to over 40K events. Additionally, our three source databases all share the same DDL history topic; that was an oversight on our part, since we failed to look into the importance of this topic. We also over-partitioned it, not realizing Debezium would only ever use one partition.
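For anyone else who misses this, giving each connector its own history topic is just a matter of setting a distinct topic name per connector (topic and server names below are made up for illustration):

```
# connector for source database A (names are illustrative)
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=dbhistory.source-a

# connector for source database B gets its own topic
database.history.kafka.topic=dbhistory.source-b
```

The history topic should also be created with a single partition, since only one is ever used.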
Anyhow, the issue we started to see would happen during overnight server restarts. A connector service would cycle, causing a rebalance when it went down followed by another rebalance when it came back shortly thereafter. The Debezium connectors would take so long reading the history topic that they wouldn't be finished starting by the time they were asked to stop. So we'd end up with a timing issue where multiple instances of the same connector ran at overlapping times, causing a variety of issues:
- two shyiko slaves registering with the same slave ID
- two history consumers reading the same history topic in the same consumer group, so neither got the full data set
- a connector concluding it had read to the end of the history topic when it actually hadn't, because of the 4 retries at a 100ms poll timeout
The last one is the most confusing, and we still aren't sure exactly why it happens.
We've alleviated the problem by adjusting a few configurations.
Worker configs:
task.shutdown.graceful.timeout.ms=300000
rebalance.timeout.ms=120000
We jacked up the shutdown timeout to give each connector a chance to read through the full history topic and complete a clean shutdown before Kafka Connect starts it up again. We then found we needed to increase the rebalance timeout so that Kafka Connect would not prematurely kick members out of the group while they were waiting for connectors to cycle.
Connector configs:
database.history.kafka.recovery.poll.interval.ms=500
We increased this one to prevent the premature timeouts the history consumer was experiencing during recovery.
All of this resolves our issue for now, but I don't think it will scale given the sheer number of DDL statements we have to deal with. We have a few ideas going forward:
- Filtering DDL events that we don't care about
- Implement some sort of periodic snapshot/rebuild ability for the DDL topic so we can periodically prune it and start over
- Adjust the KafkaDatabaseHistory class to use offsets to determine when it has read the entire topic, i.e. fetch the end offset before reading from the beginning and don't stop until it has reached it
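To make the last idea concrete, here's a rough sketch of the end-of-topic check we have in mind. The class and method names (HistoryEndCheck, hasReachedEnd) are ours, not Debezium's; in the real patch the two maps would come from the consumer's position() and a snapshot of endOffsets() taken before recovery starts.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch, not actual Debezium code: decide whether the history
// consumer has caught up to the end offsets captured before recovery began.
// Keys stand in for topic-partitions (only one partition in practice).
public class HistoryEndCheck {

    // True once the consumer's position is at or past the pre-recovery end
    // offset for every partition; until then, keep polling instead of
    // trusting a fixed retry count to mean "done".
    static boolean hasReachedEnd(Map<String, Long> positions,
                                 Map<String, Long> endOffsets) {
        for (Map.Entry<String, Long> e : endOffsets.entrySet()) {
            Long pos = positions.get(e.getKey());
            if (pos == null || pos < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> end = new HashMap<>();
        end.put("dbhistory-0", 40000L);
        Map<String, Long> pos = new HashMap<>();
        pos.put("dbhistory-0", 39950L);
        System.out.println(hasReachedEnd(pos, end)); // prints false
        pos.put("dbhistory-0", 40000L);
        System.out.println(hasReachedEnd(pos, end)); // prints true
    }
}
```

This would replace the "4 retries at 100ms" heuristic with a hard condition, so a slow read simply takes longer rather than silently stopping short.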
Does anyone have any experience with issues like this or any guidance to offer on the patches we are proposing to contribute?
Thanks,
Rich