Most connectors have a "schema_only" snapshot mode that allows you
to skip the data part of the snapshot so that you can transition
into streaming changes ASAP. As for streaming changes between T0
and T10, that is not supported to the extent of being able to
specify a start time and a cut-off time. You can, however, stream
changes for a given duration from the start time and then stop the
connector.
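As a rough illustration of the snapshot mode mentioned above, here is
a minimal sketch of a connector configuration built with plain
java.util.Properties. The choice of the MySQL connector, the connector
name, and the host are assumptions made only for the example; mode
names also vary by connector and release, so check the documentation
for the version you run.

```java
import java.util.Properties;

public class SchemaOnlySnapshotConfig {

    // Builds an illustrative connector configuration.
    // All values except the snapshot mode are placeholders (assumptions).
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("name", "example-connector");            // hypothetical name
        props.setProperty("connector.class",
                "io.debezium.connector.mysql.MySqlConnector");     // assumed connector
        props.setProperty("database.hostname", "localhost");       // placeholder host
        // Capture table structure only, skip the data snapshot,
        // then move straight on to streaming changes.
        props.setProperty("snapshot.mode", "schema_only");
        return props;
    }

    public static void main(String[] args) {
        // Print the configuration as key=value pairs.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A real deployment needs further settings (credentials, topic naming,
schema history storage) that are omitted here.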
2. It seems like a lot of Debezium connectors watch for
changes in a database's binlog / transaction log. Is it possible
for a Debezium connector to get change events via an event
stream from an API call?
Yes. In fact, recent changes made to the MongoDB connector do exactly
that: we now support the new MongoDB change streams mechanism for
capturing changes rather than tailing the database's oplog. SQL
Server also doesn't tail a given log but instead relies on the
capture tables in the database to resolve changes that have happened
against tables of interest.
3. Are there any limitations or differences between a
Debezium connector and a regular connector that implements the
Kafka Connect API in terms of portability / latency / delivery
guarantees / scalability / fault tolerance?
Could you elaborate on what you mean by a "regular connector"? Do
you have a concrete example connector in mind?
4. Why do most/all the Debezium connectors seem to only allow
at maximum 1 task per connector?
That is by and large driven by how changes are consumed from the
database. For example, it generally makes little sense to have more
than one task reading the transaction log of the database, since
doing so would require some type of de-duplication or
synchronization between tasks to prevent emission of duplicate
events.
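The duplication problem described above can be shown with a toy model.
This is not Debezium code: the log is modeled as a list of hypothetical
log positions, and each "task" is assumed to read the whole log
independently.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DuplicateEventsSketch {

    // Every task reads the same log from start to end, so each log
    // position is emitted once per task.
    public static List<Long> emitFromTasks(List<Long> log, int taskCount) {
        List<Long> emitted = new ArrayList<>();
        for (int task = 0; task < taskCount; task++) {
            emitted.addAll(log);
        }
        return emitted;
    }

    // The extra step multiple tasks would need: drop repeated log
    // positions while keeping first-seen order.
    public static List<Long> deduplicate(List<Long> emitted) {
        return new ArrayList<>(new LinkedHashSet<>(emitted));
    }

    public static void main(String[] args) {
        List<Long> log = List.of(100L, 101L, 102L);
        List<Long> emitted = emitFromTasks(log, 2);
        System.out.println("emitted=" + emitted.size()
                + " unique=" + deduplicate(emitted).size());
        // prints "emitted=6 unique=3"
    }
}
```

With a single task the emitted list already equals the log, so no
de-duplication or coordination step is needed.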
5. Does DebeziumIO, Debezium server, or Debezium embedded
mode persist any information in Kafka? Are they as scalable and
fault-tolerant as the Debezium connector in Kafka mode?
Yes, Debezium Server and Debezium EmbeddedEngine can be configured
to persist offset data in Kafka, just as the connector would if it
were running in the Kafka Connect runtime. I would say in most
situations, users wouldn't do this because if you're using Server or
EmbeddedEngine, you're typically utilizing Debezium in a Kafka-less
way. If you have Kafka and Kafka Connect, then why not use that
instead?
Furthermore, Debezium Server does include a Kafka sink, so you could
spin up a Debezium Server that interacts with Kafka in a KC-less
environment as well.
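The Kafka-backed offset storage described above can be sketched as a
set of engine properties. This is a minimal sketch: the topic name,
partition count, replication factor, and broker address are
assumptions for the example, and the property names should be verified
against the engine documentation for your version. In Debezium Server
the same settings would carry a source prefix in its own configuration
file.

```java
import java.util.Properties;

public class KafkaOffsetStorageConfig {

    // Builds illustrative offset-storage settings for an embedded engine.
    // The bootstrap address is supplied by the caller; other values are
    // placeholders (assumptions), not recommendations.
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        // Store offsets in a Kafka topic instead of a local file.
        props.setProperty("offset.storage",
                "org.apache.kafka.connect.storage.KafkaOffsetBackingStore");
        props.setProperty("offset.storage.topic", "example-offsets");   // hypothetical topic
        props.setProperty("offset.storage.partitions", "1");            // assumed value
        props.setProperty("offset.storage.replication.factor", "1");    // assumed value
        props.setProperty("bootstrap.servers", bootstrapServers);
        return props;
    }

    public static void main(String[] args) {
        // Print the settings as key=value pairs.
        build("localhost:9092").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would be merged with the connector settings before
being handed to the engine.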
HTH,
Chris