Hi David -
Debezium's source connectors are inherently single threaded, so if
you deploy 4 connectors, each of them runs as its own independent,
single task. SQL Server is one of the only connectors that supports
multiple tasks, and even then only when the `database.names`
configuration lists multiple databases. If your SQL Server connector
is capturing changes from a single database, the most the connector
will use is 1 task.
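For example, here's a minimal sketch of a multi-task SQL Server
deployment, showing only the relevant properties; the hostnames,
credentials, and database names are placeholders:

```
# SQL Server connector capturing two databases; with database.names
# listing N databases, the connector can run up to N tasks.
connector.class=io.debezium.connector.sqlserver.SqlServerConnector
database.hostname=sqlserver.example.com   # placeholder
database.user=dbz_user                    # placeholder
database.password=********
database.names=appdb1,appdb2              # two databases -> up to 2 tasks
tasks.max=2
topic.prefix=sqlserver-prod               # placeholder
```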
So really this comes down to a question of horizontal scaling of
the pipeline across multiple connector deployments.
A database transaction log is predominantly singular, meaning that
regardless of which source we're talking about
(MySQL/MariaDB/Oracle/PG/SQL Server/MongoDB), deploying 2 or more
connectors against the same database results in each connector
reading the transaction log independently of the other deployments.
By extension, this means that in many cases you increase Disk IO and
CPU Utilization by the number of connectors.
The MySQL/MariaDB connectors do not have the ability to do early
pruning of changes, so every binlog event is serialized over the
wire and consumed by the connector, even if that event is ultimately
discarded. Even if you decide to split the connector deployment out
horizontally by tenant based on their activity, all connectors will
read the same stream of changes; each connector simply skips events
based on your include filters. You would want to measure the
difference, but as I indicated, it's very likely to be negligible.
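To make that concrete, here's a sketch of what one tenant-split
MySQL connector might look like; the names and hosts are illustrative:

```
# Tenant-split MySQL connector (illustrative names and hosts). A
# second connector for the remaining tenants would look the same
# except for a unique database.server.id, its own topic.prefix, and
# a different database.include.list; both still read the full binlog.
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=mysql.example.com    # placeholder
database.port=3306
database.user=dbz_user                 # placeholder
database.password=********
database.server.id=184001              # must be unique per connector
topic.prefix=mysql-tenant-a            # placeholder
database.include.list=tenant_a         # other schemas' events are read, then skipped
```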
When we look at other connectors like Oracle/PG/SQL Server/Db2, and
to a lesser degree MongoDB, you have the ability to specify
configurations that impact several key performance aspects.
For PostgreSQL, you can create the publication in filtered mode,
meaning PostgreSQL only serializes events over the network to
Debezium if they are for a table of interest. This removes the need
for Debezium to filter out unwanted events and reduces network
overhead. For Oracle, you can enable query filters so that we apply
extra predicates to limit the data returned by the LogMiner query to
only the tables of interest plus transaction markers. There is also
a new experimental CTE query feature that goes even further to
reduce the payloads sent over the network. For SQL Server and Db2,
we only read changes from the capture tables that match your include
filter setup, which minimizes the dataset read by each connector and
what is sent over the network.
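As a rough sketch, the relevant knobs on the Debezium side look like
this, showing only the filtering-related properties with placeholder
table names. For PostgreSQL, with `filtered` the connector creates
the publication FOR TABLE limited to the include list rather than
FOR ALL TABLES:

```
# PostgreSQL connector: server-side filtering via a filtered publication
connector.class=io.debezium.connector.postgresql.PostgresConnector
plugin.name=pgoutput
publication.autocreate.mode=filtered       # publication covers only included tables
table.include.list=public.orders,public.customers   # placeholder tables
```

And for Oracle, the query filter mode pushes the include list into
the LogMiner query as predicates:

```
# Oracle connector: filter inside the LogMiner query itself
connector.class=io.debezium.connector.oracle.OracleConnector
log.mining.query.filter.mode=in            # adds IN-clause predicates for included tables
table.include.list=DEBEZIUM.ORDERS         # placeholder table
```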
In all cases, you need to measure whether or not these really yield
better performance versus a single deployment. You most likely will
with SQL Server/Db2, but for the others, PostgreSQL has an extra
replication slot at play and Oracle an extra LogMiner process, each
of which will likely lead to higher memory consumption, Disk IO, and
CPU Utilization on your database. Then it's a question of whether
the performance gain is worth that extra overhead.
-cc