Hi Oren -
I think it's really a question of what is best for a given use
case.
An initial snapshot is an excellent tool for quickly populating
the topics with the existing data. It's extremely fast, with
minimal overhead from traversing the JDBC result set, generating
change events, and so on; however, it comes with the requirement
that the snapshot be consistent, which imposes limitations around
resumability. Additionally, if your initial snapshot consists of
millions and millions of rows, retaining transaction logs for the
duration of the snapshot (so that streaming can begin where the
consistent snapshot was taken) can impose a certain amount of
overhead on the database. For MySQL and Oracle, this means making
sure your transaction log retention is large enough; for Oracle,
additionally, that the undo retention period is large enough; for
PostgreSQL, that the WAL can grow quite large in the interim; and
so on.
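To make the retention side concrete, here is a rough sketch of the
kinds of settings involved. The values are purely illustrative
assumptions; the right numbers depend entirely on how long your
snapshot actually runs:

```sql
-- MySQL: keep binary logs long enough to cover the snapshot window
-- (binlog_expire_logs_seconds replaced expire_logs_days in MySQL 8.0)
SET GLOBAL binlog_expire_logs_seconds = 259200;  -- 3 days

-- Oracle: size undo retention to cover the snapshot window (seconds)
ALTER SYSTEM SET UNDO_RETENTION = 86400;  -- 24 hours

-- PostgreSQL 13+: allow a lagging replication slot to retain enough WAL
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
```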
Incremental snapshots are meant to address a number of those
initial snapshot requirements to make the process easier and
smoother. You gain resumability, you avoid the need to adjust
your transaction log retention policies, you avoid reconfiguring
Oracle's undo retention period, and you minimize the WAL growth on
PostgreSQL, to name a few. Incremental snapshots are generally
slightly slower than initial snapshots because of how the data is
queried, in key-ordered chunks interleaved with streaming, but
that's the cost of all the added benefits over the initial
snapshot counterpart.
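For reference, an incremental snapshot is typically triggered by
inserting a row into the connector's signal table. The signal
table and collection names below are just placeholders:

```sql
-- Assumes the connector's signal.data.collection points at debezium_signal
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'ad-hoc-1',
  'execute-snapshot',
  '{"data-collections": ["inventory.orders"], "type": "incremental"}'
);
```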
Probably the biggest limitation of incremental snapshots prior to
Debezium 2.4 was around key-less tables. If a table did not have
a primary key, a unique index that provided primary key semantics,
or a column that could act as a surrogate key for the chunk-based
processing, then you couldn't use incremental snapshots on that
table. This was most often an issue for users doing CDC on
databases outside their control, such as ERP or similar systems,
where changing table structures could violate licensing terms or
support agreements.
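When a table lacks a primary key but does have a suitable unique
column, the signal can name that column explicitly via the
surrogate-key field. The table and column names here are
assumptions for illustration:

```sql
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'ad-hoc-2',
  'execute-snapshot',
  '{"data-collections": ["inventory.order_lines"],
    "type": "incremental",
    "surrogate-key": "line_id"}'
);
```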
For such tables prior to Debezium 2.4, users had to run multiple
connectors serially, re-snapshotting the key-less table with a
temporary connector before resuming the original connector.
This was very cumbersome and quite prone to human error, but
Debezium 2.4 simplifies it tremendously with a new ad-hoc
blocking snapshot feature. It allows you to use the same signal
mechanism used for incremental snapshots to trigger the capture
of the tables using the initial snapshot style of queries. The
caveat is that streaming is paused while the snapshot runs, which
means you incur the same concerns as with initial snapshots or
the multiple-connector strategy, but it removes the human error
and complexity of the latter.
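A blocking snapshot uses the same signal shape, just with a
different type value (again, the table names are placeholders):

```sql
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'ad-hoc-3',
  'execute-snapshot',
  '{"data-collections": ["erp.keyless_table"], "type": "blocking"}'
);
```

Streaming pauses while this runs, so the retention considerations
from the initial snapshot discussion apply for its duration.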
Our goal here is to give users the flexibility to pick the
strategy that best fits their needs. In some cases, initial or
blocking snapshots are the ideal choice; in others, incremental
snapshots may be the better option.
Hope that helps.
Chris