Re: [debezium] Postgres 16 with debezium 2.2.1 lag increasing

Chris Cranford

unread,

Mar 20, 2024, 6:22:57 PM3/20/24

to debe...@googlegroups.com

Hi Alan -

I assume by lag you're referring to WAL growth. Generally users handle this by setting both heartbeat.interval.ms and heartbeat.action.query.

What seems a bit unexpected is that this happened only with the bump to PG16 and not while you were on PG13. This type of problem happens quite often in situations where a connector's interested tables change infrequently compared to the remainder of the database. This leads to situations where we don't see any activity and therefore have no LSN to flush back to the replication slot to advance to WAL pointer, so you observe this WAL growth.

As for updating, I would definitely say it's worth considering; if nothing else for any performance optimizations and bug fixes :)

Thanks,
Chris

On 3/20/24 16:31, Alan Perius wrote:

Hi guys, We were running debezium 2.2.1 with postgres 13 without any issue... then we updated to postgres 16. All works fine, we have 6 connector in prod. But one connector, the table list is not having almost any inserts during the day and the lag increase, in the end of the day the lag is like 100gb(our max_slot_wal_keep_size = 600gb)

Once I add a new row there, the lag reduce to some mbs and is fine for some hours.. and then it start to increase again..

Do you know what I can do? Should I update the debezium to the version 2.4+ ?

Thanks

--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/0e5de274-1677-4fd2-acfb-025550791b7fn%40googlegroups.com.

Alan Perius

unread,

Mar 21, 2024, 9:51:33 AM3/21/24

to debezium

Thanks for the response, Chris. But I do have a heartbeat.

config here

config= {
"slot.name" : "slotName",
"plugin.name": "pgoutput",
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": db.hostname,
"database.port": db.port,
"database.user": db.username,
"database.password": db.password,
"database.dbname" : database,
"database.server.name": database,
"topic.prefix": "prefix",
"schema.include.list": "public",
"table.include.list": "public.outbox",
"tombstones.on.delete": "false",
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.by.field": "aggregateType",
"transforms.outbox.table.field.event.key": "aggregateId",
"transforms.outbox.table.field.event.payload.id": "aggregateId",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.table.field.event.type": "eventType",
"transforms.outbox.table.fields.additional.placement": "eventType:header,createdAt:header:outboxCreatedAt",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"sanitize.field.names": "false",
"publication.autocreate.mode": "filtered",
"publication.name": "outbox_publication",
"skipped.operations": "u,d",
"heartbeat.interval.ms": "60000",
"heartbeat.action.query": "CREATE TABLE IF NOT EXISTS _debezium_heartbeat (id SERIAL PRIMARY KEY, ts TIMESTAMP WITH TIME ZONE);INSERT INTO _debezium_heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT(id) DO UPDATE SET ts=EXCLUDED.ts;",
"slot.drop.on.stop": drop_on_stop,
"retriable.restart.connector.wait.ms": 60000
}

Alan Perius

unread,

Mar 21, 2024, 9:53:14 AM3/21/24

to debezium

At that DB we have 12 tables. But I'm only listening to one that receives a small number of events. Is Debezium not able to handle this scenario properly?

Alan Perius

unread,

Mar 21, 2024, 10:01:12 AM3/21/24

to debezium

Do we have a minimum amount of msg to make debezium start to read the lag?

I mean, once I run this query:

SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmedLag,
active
FROM pg_replication_slots;

replicationSlotLag = 60GB and confirmedLag = 15GB

I insert one row, confirmedLag is reduced to 1MB but replicationSlotLag remains.

Once I add 10+ rows Both are decreased to 1MB.

Chris Cranford

unread,

Mar 22, 2024, 8:03:06 AM3/22/24

to debe...@googlegroups.com

Hi Alan -

So what you're observing isn't related with Debezium, it's simply a product of how the WAL works in PostgreSQL. You will see an immediate change in the calculation of Confirmed Lag because Debezium does update the confirmed_flush_lsn periodically, as we receive events. This is what the heartbeat action query effectively does. But you have to consider the WAL segment itself is controlled directly by PostgreSQL and we do not adjust the restart_lsn. This is handled directly by PostgreSQL and the WAL segment is likely not going to shrink immediately based on a single record because of how PostgreSQL works. There is also the "archive_timeout" at the database that controls when WAL segments are rotated that could be influencing the behavior you see.

I would suggest you touch base with your DBA or the PostgreSQL community to see if there are further PostgreSQL parameters you could tweak to obtain the more reactive WAL behavior you're seeking.

Thanks,
Chris

To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/b30220c2-54a4-46fa-af23-e543d2a1f308n%40googlegroups.com.

Reply all

Reply to author

Forward