Debezium Development questions


Nancy Xu

Jan 26, 2022, 6:49:34 PM
to debezium
Hi all,

I was hoping to ask this group some questions related to developing a Debezium connector:

1. It seems that most Debezium connectors take an initial snapshot before streaming changes. Is it possible to build a Debezium connector that doesn't take an initial snapshot but only streams out changes between a fixed time period, e.g. T0 - T10?

2. Are there any limitations or differences between a Debezium connector and a regular connector that implements the Kafka connect API in terms of portability / latency / delivery guarantees / scalability / fault tolerance?

3. Why do most/all the Debezium connectors seem to only allow at maximum 1 task per connector?

4. Does DebeziumIO, Debezium server, or Debezium embedded mode persist any information in Kafka? Are they as scalable and fault-tolerant as the Debezium connector in Kafka mode?

Thanks,

Nancy

Chris Cranford

Jan 26, 2022, 8:32:36 PM
to debe...@googlegroups.com, Nancy Xu
Hi Nancy, see inline


On 1/26/22 18:43, Nancy Xu wrote:
Hi all,

I was hoping to ask this group some questions related to developing a Debezium connector:

1. It seems that most Debezium connectors take an initial snapshot before streaming changes. Is it possible to build a Debezium connector that doesn't take an initial snapshot but only streams out changes between a fixed time period, e.g. T0 - T10?
Most connectors have a "schema_only" snapshot mode that allows you to skip the data part of the snapshot so that you can transition into streaming changes ASAP.  In terms of streaming changes between T0 and T10, not to the extent of being able to specify a start time and cut-off time.  You can, however, stream changes for a given duration from the start time and then stop the connector.

2. It seems like a lot of Debezium connectors watch for changes in a database's binlog / transaction log. Is it possible for a debezium connector to get change events via an event stream from an API call?
Yes.  In fact, recent changes made to the MongoDB connector do exactly that: we support the new MongoDB change streams mechanism for capturing changes rather than tailing the database's oplog.  SQL Server also doesn't tail a given log but instead relies on the capture tables in the database to resolve changes that have happened against tables of interest.

3. Are there any limitations or differences between a Debezium connector and a regular connector that implements the Kafka connect API in terms of portability / latency / delivery guarantees / scalability / fault tolerance?
Could you elaborate on what you mean by a "regular connector"?  Do you have a concrete example connector in mind?

4. Why do most/all the Debezium connectors seem to only allow at maximum 1 task per connector?
That is by and large driven by how changes are consumed from the database.  For example, it generally makes little sense to have more than one task reading the transaction log of the database, since doing so would require some type of de-duplication or synchronization between tasks to prevent emission of duplicate events.

5. Does DebeziumIO, Debezium server, or Debezium embedded mode persist any information in Kafka? Are they as scalable and fault-tolerant as the Debezium connector in Kafka mode?
Yes, Debezium Server and the Debezium EmbeddedEngine can be configured to persist offset data in Kafka just like the connector does if it were running in the Kafka Connect runtime.  I would say in most situations users wouldn't do this, because if you're using Server or EmbeddedEngine, you're typically utilizing Debezium in a Kafka-less way.  If you have Kafka and Kafka Connect, then why not use that instead?

Furthermore, Debezium Server does include a Kafka sink, so you could spin up a Debezium Server that interacts with Kafka in a KC-less environment as well.

HTH,
Chris

Nancy Xu

Jan 26, 2022, 10:20:16 PM
to Chris Cranford, debe...@googlegroups.com
Hi Chris,

Thanks for your responses! I just have a couple more questions, sorry I am relatively new to both Kafka and Debezium.

It is interesting that you mentioned I can't specify a cut-off time for the connector. I looked online at the Kafka Connect API -> it seems that the only way to stop a Kafka connector is to utilize the REST API? It doesn't seem possible to specify an end time for a Kafka connector to stop running by? And I'm assuming this goes for Debezium connectors as well?

And by regular Kafka connector, I guess I'm thinking of the Kafka Connect API. The Confluent documentation mentions that Connect workers only require a set of Kafka brokers to run, and can operate on bare-metal hardware, containers, VMs, and managed environments such as Kubernetes. Does this assumption hold true for Debezium connectors?

Thanks,

Nancy


Chris Cranford

Jan 26, 2022, 10:41:43 PM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy,

With regards to the cut-off question, that's correct.  A connector starts and runs until the task is terminated, either by a graceful stop via the REST API in a KC-like environment, by stopping the Quarkus application in the case of Debezium Server, or by a non-retriable fault in the case of an exception being thrown.

With regards to the latter question, yes the same holds true.  Debezium can be run in any of those environments without an issue.

Chris

Nancy Xu

Jan 27, 2022, 12:32:26 AM
to Chris Cranford, debe...@googlegroups.com
Thanks so much for answering all my questions, Chris!

Gunnar Morling

Jan 27, 2022, 6:40:55 AM
to debezium
Hey Nancy,

Lots of interesting questions :) Out of curiosity, are you planning to build your own connector? Or something based on Debezium? In case of the latter, it could be an option perhaps to do it under the Debezium umbrella?

To add on two of the questions in addition to Chris' replies; there's work happening right now to bring multi-task support to the Debezium connector framework, and the SQL Server connector in particular. As this is reading from CDC tables, it benefits from parallelisation as enabled by multiple tasks.

Re stopping points, one could envision a connector property for that, and a connector would stop reading its source when it has reached that stopping point. But that's not something which currently is supported in Debezium.

--Gunnar

Nancy Xu

Jan 27, 2022, 12:53:28 PM
to debe...@googlegroups.com
Hi Gunnar,

Thanks for jumping in!

Not planning to build my own connector at the moment, but exploring different options and wanted to get a deeper understanding!

Does that mean right now there isn't multi-task support in the Debezium connector framework? As in, if I did build my own connector and I wanted to have several tasks, I couldn't do it using the Debezium framework?

I also have the same question for stopping points. If I did build my own connector and I wanted to have a stopping point where the connector stopped reading its source, could I technically do it using the Debezium framework?





Nancy Xu

Jan 27, 2022, 1:00:57 PM
to debe...@googlegroups.com
Also, if the Debezium framework only supports one task, that affects scalability of the connector, right?

Chris Cranford

Jan 27, 2022, 1:43:13 PM
to debe...@googlegroups.com, Nancy Xu
Hi Nancy,

So the multi-task concept is something that is part of Kafka Connect.  At a technical level, the connector is given the configuration and it can return multiple task configurations that Kafka Connect then distributes across multiple tasks.  We do have one source connector that does use multiple tasks, and that is the MongoDB connector.  We actually start a task per replica-set.  So if you did want to write your own, the architecture is there as part of Kafka Connect to explore.
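To illustrate the idea above, here is a minimal plain-Java sketch of how a connector's taskConfigs(maxTasks) hook might distribute replica sets across tasks. The class, the "replica.sets" key, and the round-robin strategy are all hypothetical stand-ins, not the actual MongoDB connector implementation or the Kafka Connect API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a connector returns one configuration map per task, and Kafka
// Connect distributes those maps across tasks. Here replica-set names are
// assigned round-robin to at most maxTasks task configurations.
public class ReplicaSetTaskPlanner {
    public static List<Map<String, String>> taskConfigs(List<String> replicaSets, int maxTasks) {
        int taskCount = Math.min(maxTasks, replicaSets.size());
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            buckets.add(new ArrayList<>());
        }
        // Round-robin assignment of replica sets to task buckets.
        for (int i = 0; i < replicaSets.size(); i++) {
            buckets.get(i % taskCount).add(replicaSets.get(i));
        }
        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Map<String, String> config = new HashMap<>();
            // "replica.sets" is an invented property name for illustration.
            config.put("replica.sets", String.join(",", bucket));
            configs.add(config);
        }
        return configs;
    }
}
```

With three replica sets and maxTasks of 2, this yields two task configs, one covering "rs0,rs2" and one covering "rs1".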

To your last question, yes you could.  The connector implementation itself is responsible for the streaming loop that is needed to capture and emit changes to the event dispatcher.  So if you developed your own full-fledged connector for a given source, then you could use debezium-core and have your own connector implementation that is based on it.  In fact, that's exactly how all the `debezium-connector-xxxxx` projects are designed today.  Where this might not be so easy would be a situation where you want to re-use an existing Debezium connector and extend it; in which case there may need to be some small changes in the streaming loop so that you could check your condition and gracefully stop the loop rather than relying on throwing an exception to break the loop. 
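The graceful-stop idea described above can be sketched in plain Java: a streaming loop that checks a user-supplied cut-off condition and exits cleanly rather than throwing an exception to break the loop. The event model (a bare long position per event) is invented for illustration and is not the Debezium streaming API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: stream change positions until a configured cut-off is reached,
// then leave the loop gracefully instead of throwing to terminate the task.
public class CutoffStreamingLoop {
    public static List<Long> streamUntil(Iterator<Long> positions, long cutoffPosition) {
        List<Long> emitted = new ArrayList<>();
        while (positions.hasNext()) {
            long position = positions.next();
            // The cut-off check: stop emitting once the boundary is reached.
            if (position >= cutoffPosition) {
                break;
            }
            emitted.add(position);
        }
        return emitted;
    }
}
```

For example, streaming positions 1..5 with a cut-off of 4 emits only 1, 2, and 3 before the loop exits.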

But ideally, in the interest of open-source and collaboration, having a cut-off point is merely a matter of the user supplying a configuration property prior to starting the connector.  It's certainly a great idea for an enhancement and having that built directly into all Debezium connectors just means that everyone could benefit from such a feature, including even a custom connector that is built using the common framework in `debezium-core`.

HTH,
Chris

Chris Cranford

Jan 27, 2022, 2:01:09 PM
to debe...@googlegroups.com, Nancy Xu
Hi Nancy,

Not necessarily.

You can still choose to deploy multiple connectors (although we generally recommend against it) where each captures changes for a subset of tables from a database.  This allows for some horizontal scaling if the need arises.  The big question when doing this is whether a business use case really warrants it, or whether the volume of changes dictates this would ultimately be faster than a single connector.  Obviously I think that will vary per database; some can certainly support that multi-connector topology more easily than others.

Chris

Nancy Xu

Jan 27, 2022, 2:23:07 PM
to Chris Cranford, debe...@googlegroups.com
Got it, thanks a lot!

Nancy Xu

Mar 9, 2022, 11:49:30 AM
to Chris Cranford, debe...@googlegroups.com
Hi!

I was hoping to ask some more questions regarding Debezium development.

1. Would it be possible to reuse Debezium code modules without building under the Debezium umbrella (i.e. simply import the debezium-core package from Maven and reuse code from there)?
2. What does the process to build a connector under the Debezium umbrella look like?
3. What does the process to offer the connector as a Debezium connector on the Confluent hub look like?

Thanks!

Nancy

Nancy Xu

Mar 9, 2022, 2:12:32 PM
to Chris Cranford, debe...@googlegroups.com
Also, some additional technical questions.

1. I see that it is possible in Debezium to push regular heartbeat records into a heartbeat topic. What are the use cases for this functionality?

2. I see that Kafka SourceTask provides two methods: commit() / commitRecord(). I assume that implementing commitRecord will allow you to run callback logic after a SourceRecord was successfully written to the topic? What functionality does implementing commit() provide?

Thanks so much!

Nancy

Nancy Xu

Mar 9, 2022, 3:20:45 PM
to Chris Cranford, debe...@googlegroups.com
And one additional question about Debezium use case:

I see that Debezium supports writing initial and incremental database snapshots as READ events into the primary topic. What are the use cases for having these records in the topic? Is it for perhaps watermarking?

Chris Cranford

Mar 11, 2022, 8:15:06 AM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy,

Yes, you can leverage the code inside debezium-core for the common connector framework as a basis for your own custom connector.  But there are many advantages of being a community-led connector under the Debezium umbrella, such as being distributed & released alongside the existing Debezium connectors, being able to collaborate and drive innovation in the open with the Debezium team, and so much more.  The DB2 and Vitess connectors are both examples of efforts driven by two different teams to deliver Debezium-based connectors from the community, and we'd welcome others. 

As for the process, I think we need to have a discussion about what the endeavor ultimately includes.  Are you looking to improve an existing connector or develop a brand new one?  If it is a brand new connector, what's the target source?  Then we look at setting up a repository under the Debezium umbrella and assigning the necessary roles so people are able to review/commit accordingly.  Gunnar or I can share more information once we know what the plan here is exactly.

As for Confluent Hub, I'll have to defer to Gunnar or Jiri on that.

Chris

Chris Cranford

Mar 11, 2022, 8:22:54 AM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy,

Heartbeat records are useful in situations where you may be connected to a source database that changes infrequently.  A source database has a transaction log position which the connector needs to track to know where to start reading from after a connector restart.  The heartbeat event guarantees that this position is periodically written to the offsets topic so that, in the event the connector is restarted after a long window of no activity from the source, we don't start from the position where the last change occurred but instead at the position included in the latest heartbeat event.
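The heartbeat behavior described above can be sketched as a small decision rule: if no change event is available but the heartbeat interval has elapsed, emit a synthetic record carrying the current log position so the stored offset keeps advancing. All names here are invented for illustration; in Debezium the behavior is driven by the heartbeat.interval.ms connector property.

```java
// Sketch of a heartbeat policy. Returns the record to emit, or null if
// there is nothing to emit yet. The "change@"/"heartbeat@" encoding is a
// stand-in for real source records.
public class HeartbeatPolicy {
    public static String nextRecord(boolean changeAvailable,
                                    long millisSinceLastEmit,
                                    long heartbeatIntervalMillis,
                                    long currentLogPosition) {
        if (changeAvailable) {
            return "change@" + currentLogPosition;
        }
        if (millisSinceLastEmit >= heartbeatIntervalMillis) {
            // No source activity: still record the current position via a
            // heartbeat, so a restart resumes from here rather than from the
            // position of the last real change.
            return "heartbeat@" + currentLogPosition;
        }
        return null; // nothing to emit yet
    }
}
```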

In terms of commit() / commitRecord(), if you're using debezium-core you won't need to worry about those.  You can simply extend from BaseSourceTask and get the base implementation that Debezium offers.  In BaseSourceTask and the ChangeEventSourceCoordinator implementations, we use commitRecord() to track the latest offset based on the most recent event and then utilize the commit() call to sync that offset.  Not all connectors implement StreamingChangeEventSource#commitOffset.
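The division of labor described above can be shown with a simplified plain-Java analog: commitRecord(...) tracks the offset of the most recently acknowledged record, and commit() syncs that value. This class is a stand-in for illustration only, not the actual Kafka Connect SourceTask or Debezium's BaseSourceTask.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: track the latest acknowledged offset on commitRecord(), and
// persist it on commit(). Offsets are plain longs for simplicity.
public class OffsetTrackingTask {
    private final AtomicLong lastAcknowledged = new AtomicLong(-1);
    private long committed = -1;

    // Called after a record has been successfully written to Kafka.
    public void commitRecord(long recordOffset) {
        lastAcknowledged.set(recordOffset);
    }

    // Called periodically; "persists" the latest acknowledged offset.
    public void commit() {
        committed = lastAcknowledged.get();
    }

    public long committedOffset() {
        return committed;
    }
}
```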

Chris

Chris Cranford

Mar 11, 2022, 8:27:02 AM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy,

No, the main purpose is to have data consistency.  A topic may be used for a variety of reasons, such as data replication across heterogeneous databases.  In those cases, you may wish to have an exact 1:1 ratio between source and target records, which is why READ events are critical.  Not every scenario may require READ events, and in that case you can control the exclusion of those by setting the snapshot.mode for the connector deployment accordingly.

Chris

Nancy Xu

Mar 16, 2022, 3:20:01 PM
to Chris Cranford, debe...@googlegroups.com
Thanks a lot for your responses, Chris!

So, the only purpose of heartbeat records is to make sure the offsets get periodically updated even if there are no output records. That way, when the connector gets restarted, we can restart from the most recent timestamp.

And to confirm, the main purpose of initial and incremental snapshots is for data replication.

You mentioned that an advantage of building under the Debezium umbrella is that any new connector would be distributed / released alongside existing Debezium connectors. What exactly does the distribution/release process for Debezium connectors look like?

Thanks a lot,

Nancy

Nancy Xu

Mar 16, 2022, 5:21:05 PM
to Chris Cranford, debe...@googlegroups.com
Also, another quick question.

I see that it is possible to write transaction records into a specified transaction topic. What are the use cases for a transaction topic? Can users join this topic against the output topics for useful post-processing?

Also, it seems like there are two types of event records. Is the below correct?

READ events
DATA CHANGE events: create, update, delete, tombstone records.

Thanks a lot!

Nancy

Chris Cranford

Mar 17, 2022, 2:09:49 PM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy -

You can get an idea of our release cadence from https://debezium.io/releases/.

In general, we strive to have a new Debezium release each quarter with preview releases on an approximate 3 week cadence within that quarter.  Using 1.9 as an example, we released Alpha1 toward the end of January, Alpha2 mid February, and Beta1 at the start of March.  We intend to do CR1 next week with 1.9.0.Final scheduled for the last week of March to wrap up this quarter.  During these releases, we include community-led connectors as artifacts under the io.debezium umbrella pushed to Maven Central (e.g. Db2 [1] or Vitess [2]).

Hope that helps,
CC

[1]: https://search.maven.org/search?q=debezium-connector-db2
[2]: https://search.maven.org/search?q=debezium-connector-vitess

Chris Cranford

Mar 17, 2022, 2:18:38 PM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy -

Yes, users use the transaction metadata topic for a variety of reasons.  You can use Kstreams to join multiple topics and do all sorts of post-processing things, including creating enriched events that are then passed into a new topic for consumption by your pipeline.  We have a great auditing example in our examples repository [1] that demonstrates doing this from an auditing perspective where you may want to capture who changed what records.

Lastly, yes, there are multiple types of events:

    * Change Events
    * Schema Events
    * Transaction Metadata Events
    * Tombstone Events

The 'R' (read), 'C' (create), 'U' (update), 'D' (delete), and 'T' (truncate) events are all considered change events.  These represent some type of data change on the source side.  Schema events are those that are sent to the schema history topic if you're using a connector that supports schema history (all connectors do except PostgreSQL).  Then there are the transaction metadata events, as you've found, and finally the tombstone events, which are a type of change event meant to indicate the end of life of a specific key in a topic.

Hope that clarifies your questions.
CC   

Chris Cranford

Mar 17, 2022, 2:20:14 PM
to Nancy Xu, debe...@googlegroups.com
My apologies, I forgot to include the foot-note reference to the examples repository [1]:

Chris

[1]: https://github.com/debezium/debezium-examples

Nancy Xu

Mar 17, 2022, 4:09:02 PM
to Chris Cranford, debe...@googlegroups.com
Thank you so much for your answers to all my questions so far! They are really helping me think through the development process of a connector.

Sorry, just last question:

1.  It is important that in my connector, I am able to retrieve the low watermark during Connect task runtime (i.e. when writing the SourceRecords into Kafka during poll()). The low watermark is defined as the lowest timestamp of all records processed from the source system and stored into Kafka. I believe that the low watermark would be equivalent to the lowest offset value among all the offsets for the source partitions stored by OffsetStorageWriter in Kafka Connect.

Is there a way for me to be able to retrieve the lowest offset value among all offsets during Connect task runtime? Or, is there an established pattern of other connectors doing this? If not, I might have to figure out a way to do it myself, but I just wanted to check if there were other established patterns of doing so.

Thanks,

Nancy

Chris Cranford

Mar 18, 2022, 8:59:03 AM
to Nancy Xu, debe...@googlegroups.com
Hi Nancy -

It is my understanding that KC does not offer a good way to do what you are asking.  Generally speaking, a Task is a self-contained element that operates on a subset of data.  Each task has its own offsets that are stored in the offset topic by an assigned partition.  But unfortunately, there isn't a good way to share the offset data as you described.  The only idea that comes to mind is that each Task would need to be provided with all the partitions and would need to re-query the offsets and make that determination.  But there is a latency concern to be aware of: a Task may have processed events that Kafka Connect has not yet polled, so the offsets in the Kafka topic can lag behind.
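The low-watermark computation Nancy describes (the minimum stored offset across all source partitions) reduces to a one-liner once the per-partition offsets have been re-queried. This sketch uses an in-memory map of partition name to offset as a stand-in for the offsets topic; as noted above, the result can lag behind what tasks have actually processed.

```java
import java.util.Collections;
import java.util.Map;

// Sketch: compute the low watermark as the minimum offset across all known
// source partitions. The map is an illustrative stand-in for the values
// that would be re-queried from the offsets topic.
public class LowWatermark {
    public static long compute(Map<String, Long> offsetsByPartition) {
        if (offsetsByPartition.isEmpty()) {
            throw new IllegalArgumentException("no partitions known");
        }
        return Collections.min(offsetsByPartition.values());
    }
}
```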

HTH,
Chris

Nancy Xu

Mar 18, 2022, 11:11:51 AM
to Chris Cranford, debe...@googlegroups.com
Hi Chris,

Yes it definitely makes sense that offsets in the Kafka topic would cause latency in the low watermark.

I was wondering, then: is it an anti-pattern to access offsets at task runtime during poll()? Or perhaps from a background thread that I kick off during the task's start method?

Thanks,

Nancy

Nancy Xu

Apr 2, 2022, 1:28:17 PM
to Chris Cranford, debe...@googlegroups.com
Hi Chris and Gunnar,

Thanks for answering my questions up to this point.

At this point, I'm trying to finalize the pros/cons of building a Debezium connector under the Debezium umbrella, vs. building my own connector while adding 'debezium-core' as a dependency.

1. How much extra time/effort would building a Debezium connector under the Debezium umbrella add? Gunnar mentioned earlier that it would include creating a repository under the Debezium repository, and assigning people to review/commit. How long would creating the repository take? How long have design reviews in the past approximately taken, as well as code reviews?

2. What kinds of development constraints would there be if I were to build a Debezium connector under the Debezium umbrella? So far, the two I can think of are: 1) Making sure my change records are compatible with the required Debezium format for change records and 2) define the proper format for Kafka source offsets.

3. If I were to build a connector using 'debezium-core' outside the umbrella, how likely would it be that future Debezium releases introduce breaking changes into my connector?

Thanks so much,

Nancy

jiri.p...@gmail.com

Apr 4, 2022, 3:25:24 AM
to debezium
Hi,

1. Here we are speaking about no more than one day for setting up the repo. The design review can potentially take longer depending on the size, and it is the same with the initial commit, but definitely less than one week. Follow-up commits are usually faster, let's say 1-2 days. Our policy, though, is to ideally pass review rights on such repos to additional non-core team members like you later, so you'd then be able to review and merge commits on your own.

2. I'd say that both limitations are present as soon as you are using Debezium core.

3. We try to avoid it if possible, but we cannot guarantee that unless a formal API is in place. If you stay out of the Debezium org, then I'd recommend you set up a GitHub CI job that builds Debezium core from main and uses it as a dependency. We do the same for the Db2, Cassandra, and Vitess repos to make sure any incompatibility is found fast.

J.

Gunnar Morling

Apr 4, 2022, 3:28:51 AM
to debe...@googlegroups.com
Hey Nancy,

Which database is supported by that connector you're working on (sorry
if it was mentioned in the thread somewhere and I missed it)?

Thanks,

--Gunnar


Nancy Xu

Apr 4, 2022, 12:32:26 PM
to debe...@googlegroups.com
Thanks a lot Jiri! 

And let me get back to you later on that, Gunnar. I am currently just trying to explore more about the Debezium community at the moment, and trying to see if it is a good fit for the connector that I am working on.

Thanks a lot,

Nancy

Gunnar Morling

Apr 4, 2022, 2:05:26 PM
to debezium
nancy...@gmail.com wrote on Monday, April 4, 2022 at 18:32:26 UTC+2:
Thanks a lot Jiri! 

And let me get back to you later on that, Gunnar. I am currently just trying to explore more about the Debezium community at the moment, and trying to see if it is a good fit for the connector that I am working on.

Sure thing. We may help with that decision too, as it may not make equal sense for each connector to be part of the Debezium umbrella. I.e. let's talk whenever you're ready. To give some background, there's currently three "kinds" of connectors:

* core (those in the debezium/debezium repo): maintained by the Debezium core team, part of the Debezium releases
* community-led (those under the debezium org, but in other repos): maintained by Debezium community members, e.g. from Stripe, WePay, and Instaclustr; part of the Debezium releases
* external community-led (I'm currently aware of the ScyllaDB and YugaByte connectors): maintained elsewhere, typically by DB vendors; may lag behind Debezium releases

Each kind has its pros and cons. Happy to chat more if you want.

Best,

--Gunnar

Nancy Xu

Apr 4, 2022, 4:19:28 PM
to debe...@googlegroups.com
Thank you so much for the information! Will follow up when ready!

Nancy

Nancy Xu

Apr 4, 2022, 8:02:52 PM
to debe...@googlegroups.com
For purely design related questions:

1. Are there any other development constraints I need to worry about when reusing debezium-core? The two I mentioned above are record format and Connect offsets.

2. So far, my understanding is that Kafka Connect tasks are assigned a static list of partitions. If I wanted to change the partitions assigned to a task, I'd have to request task reconfiguration.

Thanks,

Nancy

jiri.p...@gmail.com

Apr 14, 2022, 3:55:34 AM
to debezium
1. We still don't provide a fixed API for the connector framework, so reusing core can mean needing, from time to time, to update to a new version at the source-code level - like a new parameter added on a method call, etc. But it is pretty rare.
2. Yes, AFAIK that's true.
