Debezium connector for PostgreSQL performance question


Javi Roman

unread,
Jan 12, 2022, 11:12:10 AM1/12/22
to debezium
Hello community!

I'm taking my first steps with Debezium and, as far as I understand, I'm getting very low performance.

I am using a PostgreSQL 14 instance (with pgoutput, the default logical decoding plugin), a Kafka Connect node, and a Kafka broker node.

The throughput I am getting on Kafka using bin/kafka-producer-perf-test.sh is about 20K messages/sec. However, with the combination PostgreSQL -> Debezium connector -> Kafka, I am getting at most 1.5K messages/sec (a single table, hence a single topic).
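For context, a broker-only baseline can be measured with the standard perf tool along these lines (the topic name, record size, and counts below are my illustrative assumptions, not the exact invocation used):

```shell
# Illustrative baseline run against the broker alone; adjust the topic,
# record size, and bootstrap server to match the environment.
bin/kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 1000000 \
  --record-size 512 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 acks=1
```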

Is this the expected performance, or is it very poor? I am using the default configurations in both Kafka Connect and the Debezium PostgreSQL connector.

Any help is welcome!

Javi Roman

unread,
Jan 13, 2022, 12:56:09 PM1/13/22
to debe...@googlegroups.com
I am probably misunderstanding something. 

The maximum throughput I can get when capturing changes from a table (only one table) is the maximum throughput of the one Kafka broker and the one Kafka Connect node where the connector task for that table is running.

I cannot increase the capture performance for this topic (one topic per database table) because I cannot increase the number of partitions of the topic, so I cannot gain performance by adding more Kafka brokers (or more Kafka Connect nodes, since there is only one task per connector) for a single database with a single table.

I'm trying the MongoDB connector, and I'm getting about 2.3k messages/sec in the topic for the MongoDB collection.

The Kafka broker (16 MB RAM and 4 cores) is able to process 20k messages/sec, probably limited by disk performance. Still, comparing the 20K of a simple producer with the numbers I'm getting from the Debezium connectors, I'm probably doing something wrong.

--
Javi Roman

Twitter: @javiromanrh
GitHub: github.com/javiroman
Linkedin: es.linkedin.com/in/javiroman
Big Data Blog: dataintensive.info



Javi Roman

unread,
Jan 13, 2022, 1:34:24 PM1/13/22
to debe...@googlegroups.com
I've just doubled the resources of the PostgreSQL instance (16 MB and 8 cores), and the results are the same, so the replication stream from PostgreSQL is probably not the problem.

I'm thinking that the bottleneck could be the internal Kafka Connect topic "connect-offset", which is created with 25 partitions (and I have only one Kafka broker).
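One way to confirm the partition count of that topic is to describe it; the topic name depends on the worker's offset.storage.topic setting ("connect-offsets" is a common default, so the name below is an assumption):

```shell
# Show partition count and replica placement for the Connect offsets topic.
bin/kafka-topics.sh --describe \
  --topic connect-offsets \
  --bootstrap-server localhost:9092
```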

I am not sure; some expert guidance on this line of thinking would be welcome.


Chris Cranford

unread,
Jan 13, 2022, 4:01:28 PM1/13/22
to debe...@googlegroups.com, Javi Roman
Hi Javi -

With the MongoDB connector, did you configure it to read the oplog or use change streams?

Javi Roman

unread,
Jan 14, 2022, 3:12:40 AM1/14/22
to Chris Cranford, debe...@googlegroups.com
I'm using the default capture.mode option in the Debezium MongoDB connector (the default is change_streams_update_full, which captures changes via the MongoDB Change Streams mechanism), so I guess the MongoDB server (one instance, with one replica set) is using the Change Streams method.

I am trying to understand how to check this configuration (oplog vs. change streams); the MongoDB instance was started with default settings.
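Note that capture.mode is a Debezium connector property rather than a MongoDB server setting, so one way to inspect it is through the Kafka Connect REST API (the host and the connector name "mongodb-cdc" below are assumptions):

```shell
# Prints the running connector's configuration as JSON; if capture.mode
# is absent, the connector is using its default value.
curl -s http://localhost:8083/connectors/mongodb-cdc/config
```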


Chris Cranford

unread,
Jan 14, 2022, 11:00:48 AM1/14/22
to Javi Roman, debe...@googlegroups.com
Hi Javi -

I'm wondering if the slowness is a product of the capture.mode being used.  Of all the capture modes, "change_streams_update_full" is the slowest, because MongoDB performs an out-of-band lookup of the associated MongoDB document before the event is handed to Debezium to process.  The other change-streams capture mode, aptly named "change_streams", should theoretically be faster; however, update events won't contain the full document view.  I'm not sure whether your CDC pipeline relies on the full document view, but if it doesn't, you could try changing the capture.mode to see if you get better performance.
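A sketch of that change through the Connect REST API, assuming a connector named "mongodb-cdc" (the host, the replica-set address, and the other properties are placeholders, not the actual configuration):

```shell
# Switch to the lighter "change_streams" capture mode; update events
# will no longer carry the full document view.
curl -s -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/mongodb-cdc/config \
  -d '{
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.hosts": "rs0/mongodb-standalone.node.keedio.cloud:27017",
        "capture.mode": "change_streams"
      }'
```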

Chris

Javi Roman

unread,
Jan 20, 2022, 1:51:45 AM1/20/22
to Chris Cranford, debe...@googlegroups.com
Many thanks for your reply Chris, I'm going to try that change with MongoDB.

Anyway, right now I'm getting a constant rate of messages/second in the topics for the PostgreSQL table and the MongoDB collection: 1.5k for PostgreSQL and 2.3k for MongoDB.

I can see that there are thousands of pending changes in PostgreSQL and MongoDB (I am using a tool that constantly inserts tuples, thousands of them), yet Debezium consumes the logs in chunks, or so it seems. I am seeing the following pattern in the Debezium logs:

[2022-01-17 15:53:49,832] INFO WorkerSourceTask{id=mongodb-cdc-0} flushing 8189 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:510)
[2022-01-17 15:53:59,000] INFO WorkerSourceTask{id=one-table-0} flushing 6145 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:510)
[2022-01-17 15:54:04,198] INFO WorkerSourceTask{id=mongodb-cdc-0} flushing 8193 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:510)
[2022-01-17 15:54:11,106] INFO Checking current members of replica set at mongodb-standalone.node.keedio.cloud (io.debezium.connector.mongodb.ReplicaSetDiscovery:90)
[2022-01-17 15:54:12,794] INFO WorkerSourceTask{id=one-table-0} flushing 6145 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:510)
[2022-01-17 15:54:18,348] INFO WorkerSourceTask{id=mongodb-cdc-0} flushing 8189 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:510)
[2022-01-17 15:54:26,347] INFO WorkerSourceTask{id=one-table-0} flushing 6145 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:510)

id=mongodb-cdc-0 is the MongoDB test collection
id=one-table-0 is the PostgreSQL test table

The Connect server (Debezium) is not under CPU/memory pressure, yet the changes from PostgreSQL and MongoDB are picked up at that cadence, slowly. Even when I stop the tool that inserts records, Debezium keeps picking up changes at that rate for many minutes.

Could you please explain the meaning of the message "flushing XXXX outstanding messages for offset commit"?

Many thanks!

Chris Cranford

unread,
Jan 21, 2022, 11:27:29 AM1/21/22
to Javi Roman, debe...@googlegroups.com
Hi Javi -

There are a number of pieces at work here, and I'll try to explain without getting too deep into the technical weeds.

Both MongoDB and PostgreSQL enqueue changes on the database side, and we poll for those changes.  The polling works asynchronously: we exhaust what is pending and then wait until the database provides us more changes.  There aren't many knobs in this context for improving performance; it's simply a matter of the database having to prepare the change events before we can react to them.

There is a configuration option called poll.interval.ms [1].  It controls how quickly this loop wakes up when no changes are detected.  You can try reducing it from its default (1000 ms) to 500 ms to see if you get any performance gain.  If you don't, then the issue is likely not that we aren't polling fast enough.


The next area is Debezium's internal change event queue.  Debezium uses a blocking queue to slow down reading from the source database when it provides change events faster than Kafka consumes them.  There are two critical configuration options available for all connectors: max.queue.size [2] and max.batch.size [3].  The first is the maximum number of records Debezium can enqueue before it blocks reading from the database.  The second is the number of records read from the database per round trip.  As a rule, the queue size should always be larger than the batch size.  You can look at the JMX metrics exposed by the connector to see whether the queue's remaining capacity is reaching 0.  If it is, you can consider increasing the queue size, but understand this also directly increases the amount of memory the connector consumes.
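As a sketch, the three options could be adjusted together like this (the connector name "one-table" and the values are illustrative assumptions, not recommendations; the queue size is kept larger than the batch size, per the rule above):

```shell
# Illustrative tuning of the PostgreSQL connector via the Connect REST API.
curl -s -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/one-table/config \
  -d '{
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "poll.interval.ms": "500",
        "max.batch.size": "4096",
        "max.queue.size": "16384"
      }'
```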

Lastly, the INFO message you see is just Kafka Connect asynchronously committing the offsets for the messages it has seen.  Offsets are committed on an interval in Kafka Connect, which is why the message appears periodically.
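For reference, that commit cadence is governed by a worker-level property, set in the Connect worker configuration rather than per connector; a fragment, with the Kafka Connect default value:

```properties
# connect-distributed.properties (Kafka Connect worker configuration)
# How often Connect commits source offsets for its tasks, in milliseconds.
offset.flush.interval.ms=60000
```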

HTH,
Chris

[1]: https://debezium.io/documentation/reference/1.8/connectors/postgresql.html#postgresql-property-poll-interval-ms
[2]: https://debezium.io/documentation/reference/1.8/connectors/postgresql.html#postgresql-property-max-queue-size
[3]: https://debezium.io/documentation/reference/1.8/connectors/postgresql.html#postgresql-property-max-batch-size

Javi Roman

unread,
Jan 25, 2022, 1:32:01 AM1/25/22
to Chris Cranford, debe...@googlegroups.com
Thank you for the detailed explanation, Chris!

