Postgres RDS disconnecting?


Nuria Ruiz

Jul 2, 2021, 11:21:45 AM
to debezium
Hello, 

We are in the process of deploying Debezium to production on AWS RDS (Postgres). The initial dump of data works well, and we have had to adjust some settings on the database, like idle_timeout, for things to work properly. Now, after three or four hours of running, the connection abruptly drops with the exception printed below. The issue seems to be one of Postgres rather than Debezium, so I am just writing to see if anyone has any ideas. The time at which the connector disconnects is about the same as the setting 'idle_in_transaction_session_timeout'.

Now, we are wondering what the path of action is here: is replication on Postgres subject to this idle timeout? (It seems the initial dump might be, but not the replication itself.)
Is the expectation that the connector needs to be restarted when this happens and will resume from the serialized heartbeat? Are disconnects a (frequent) fact of life we need to deal with?
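A quick way to see the relevant server-side timeouts, and whether the connector's session is the one sitting idle in a transaction, is to query the server directly (a sketch; what shows up in pg_stat_activity depends on your setup):

```sql
-- Current timeout settings (0 means disabled)
SHOW idle_in_transaction_session_timeout;
SHOW statement_timeout;

-- Sessions currently idle inside an open transaction; a Debezium
-- snapshot connection that stops doing work would appear here
SELECT pid, usename, state, xact_start, state_change
FROM pg_stat_activity
WHERE state = 'idle in transaction';
```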

As I said, any ideas/thoughts welcome. Thanks!

------------------------------------------------------------------------------------------------------------------
Context:
This is a self-managed connector on the Confluent 6.1 platform, which (per the docs) ships Debezium 1.4.1. Looking at Jira we see that several fixes regarding connectivity were added in 1.1, so we assume we already have those.

[2021-07-02 01:54:05,537] ERROR [default-connector|task-0] Producer failure (io.debezium.pipeline.ErrorHandler:31)
org.postgresql.util.PSQLException: ERROR: canceling statement due to user request
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2532)
    at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1200)
    at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1100)
    at org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:42)
    at org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:158)
    at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:123)
    at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:80)
    at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$1.readPending(PostgresReplicationConnection.java:467)
    at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.processMessages(PostgresStreamingChangeEventSource.java:185)
    at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:147)
    at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:140)
    at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:113)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)


This is the connector config:

"basic.auth.credentials.source": "USER_INFO",
"basic.auth.user.info": "$CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL",
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.dbname": "customers",
"database.hostname": "$DATABASE_HOST",
"database.password": "$DATABASE_PASSWORD",
"database.port": "5432",
"database.user": "$DATABASE_USER",
"database.server.name": "postgres",
"database.sslmode": "disable",
"event.processing.failure.handling.mode": "warn",
"json.output.decimal.format": "NUMERIC",
"heartbeat.interval.ms": "10000",
"kafka.api.key": "$KAFKA_API_KEY",
"kafka.api.secret": "$KAFKA_API_SECRET",
"output.data.format": "JSON_SR",
"plugin.name": "pgoutput",
"poll.interval.ms": "10000",
"max.batch.size": "1000",
"schema.registry.url": "$CONNECT_SCHEMA_REGISTRY_BASIC_AUTH_USER_INFO",
"schema.registry.basic.auth.credentials.source": "USER_INFO",
"schema.registry.basic.auth.user.info": "$CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL",
"slot.drop_on_stop": "false",
"slot.name": "${slot_name}",
"snapshot.mode": "exported",
"table.include.list": "${table_include_list}",
"tasks.max": "1",
"tombstones.on.delete": "false",
"topic.creation.default.partitions": "3",
"topic.creation.default.replication.factor": "3",
"topic.creation.enable": "true",
"transforms": "unwrap",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.unwrap.add.fields": "table,lsn",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"


Gunnar Morling

Jul 5, 2021, 10:03:08 AM
to debezium
Hi,

> The time at which the connector disconnects is about the same as the setting 'idle_in_transaction_session_timeout'.

This is interesting; have you set that parameter explicitly for the connector as a pass-through option (i.e., via "database.idle_in_transaction_session_timeout")? If so, what's the reasoning behind it? We certainly could have the connector automatically restart if it encounters this particular exception (see the PostgresErrorHandler class), but before going there, I'd like to better understand where this disconnect is coming from in the first place.

Best,

--Gunnar

Nuria Ruiz

Jul 5, 2021, 11:31:48 AM
to debe...@googlegroups.com
Hello,

In our experience the initial dump of the tables we wanted to publish
took >1 hr, which was the initial value of
idle_in_transaction_session_timeout. It is our understanding that
while the initial dump of tables is happening the WAL is not consumed,
so we think the value of this setting (for the Postgres user that
initiates the replication) has to be higher than the duration of the
initial dump. Otherwise *it seemed* the slot was reclaimed right after
the initial dump.

Now, it could very well have been other factors, but setting:

>ALTER USER debezium SET idle_in_transaction_session_timeout TO 14400000; -- 4 hrs

seemed to have removed the disconnects that happened soon after the
initial dump.
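To confirm the per-role override actually took effect (it only applies to sessions opened after the ALTER USER), the role's settings can be checked in pg_roles; a sketch, assuming the replication user is named debezium as above:

```sql
-- Role-level overrides are stored in rolconfig
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'debezium';
```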


>We certainly could have the connector automatically restart if it encounters this particular exception (see the PostgresErrorHandler class)
I think it is fine having the connector work as it does now
(restarting itself after retriable exceptions), but if others run into
this issue it might be worth documenting this as a setting worth
adjusting in RDS.

Thanks,

Nuria

P.S. A helpful post:
https://www.cybertec-postgresql.com/en/idle_in_transaction_session_timeout-terminating-idle-transactions-in-postgresql/
> --
> You received this message because you are subscribed to a topic in the Google Groups "debezium" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/debezium/ZY3E0t_2FLo/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to debezium+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/efe3080b-1512-4fff-ad6d-d12976da90f4n%40googlegroups.com.

Chris Cranford

Jul 6, 2021, 12:01:09 AM
to debe...@googlegroups.com, Nuria Ruiz
Nuria -

So if I understand correctly, the replication connection is being forcibly closed by PostgreSQL due to this idle_in_transaction_session_timeout, because we aren't doing anything with the connection while the snapshot is ongoing.

Gunnar, we create this connection up-front to secure the replication slot state before the snapshot begins.  I'm curious whether it would make sense, when we need to perform a snapshot, to close this replication connection and re-open it when streaming begins, or at the very least test the connection's validity at the onset of streaming and attempt to re-open it if we find it has ended before we enter the streaming loop.  Wdyt?  Do you prefer either of these, or would delegating to the error handler be the way?

CC

Nuria Ruiz

Jul 6, 2021, 1:19:53 AM
to Chris Cranford, debe...@googlegroups.com
Chris,

>So if I understand correctly, the replication connection is being forcibly closed by PostgreSQL due to this idle_in_transaction_session_timeout, because we aren't doing anything with the connection while the snapshot is ongoing.

Given how little control (and insight) we have when we spin things up
in AWS, it is hard to tell. It *seems* that
idle_in_transaction_session_timeout needs to be bigger than the
duration of the initial dump. Now, before changing any of the
behaviour Debezium has now (claiming the replication slot first thing
makes total sense), I would double-check whether you can reproduce
this with an RDS instance. I will try to do a quick test on my end
using our staging instance.

Documenting the changes needed to the
'idle_in_transaction_session_timeout' setting (if any) might be easier
than changing any of your existing code.

Thanks,

Nuria

Chris Cranford

Jul 6, 2021, 3:11:43 AM
to Nuria Ruiz, debe...@googlegroups.com
Hi Nuria -

I don't disagree; documenting this fact is definitely the path of least resistance by far, but I'm questioning whether that's the ideal long-term solution we should strive for.  In the meantime, keep us apprised of your tests and let us know.  I'm not sure whether we'll make any headway on testing against RDS in the near term, so any help you can give us is certainly most welcome & appreciated!

CC

Gunnar Morling

Jul 8, 2021, 9:19:11 AM
to debezium
Hey Nuria, Chris,

Chris Cranford wrote on Tuesday, July 6, 2021 at 06:01:09 UTC+2:
Nuria -

So if I understand correctly, the replication connection is being forcibly closed by PostgreSQL due to this idle_in_transaction_session_timeout, because we aren't doing anything with the connection while the snapshot is ongoing.

Gunnar, we create this connection up-front to secure the replication slot state before the snapshot begins.  I'm curious whether it would make sense, when we need to perform a snapshot, to close this replication connection and re-open it when streaming begins, or at the very least test the connection's validity at the onset of streaming and attempt to re-open it if we find it has ended before we enter the streaming loop.  Wdyt?  Do you prefer either of these, or would delegating to the error handler be the way?

Interesting idea, it may be worth trying. We may close the connection, and thus disconnect from the replication slot, only after having started the snapshot transaction, though. The reason is that we rely on exported snapshots (see https://www.postgresql.org/docs/13/logicaldecoding-explanation.html#id-1.8.14.8.5) for taking the snapshot right at the position of the slot. Worth logging a Jira issue to explore this option, for sure.

Re the error handler, I'm still a bit unsure whether I've fully understood the timing here. @Nuria, so when the exception occurs, has the initial snapshot already completed? So let's say

- idle_in_transaction_session_timeout is set to 4h
- the initial snapshot takes 5h

Will the exception occur after 4h (i.e. while the snapshot still is running), or after 5h (i.e. when the snapshot is complete and we actually try to read from the replication slot)? In case of the former, automating the restart via the error handler won't be very beneficial, as the snapshot will start from scratch. In case of the latter, adding that exception to the error handler would be useful indeed. In the meantime, a contribution for describing the recommended setting in the FAQ would be highly welcomed (https://github.com/debezium/debezium.github.io/blob/develop/documentation/faq.asciidoc).
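One way to answer the 4h-vs-5h question empirically would be to watch the connector's backends in pg_stat_activity while the snapshot runs; a sketch (filtering by usename = 'debezium' is an assumption based on the user mentioned earlier in the thread):

```sql
-- If the replication session disappears while the snapshot session
-- still shows 'idle in transaction', the timeout fired mid-snapshot
SELECT pid, state, backend_start, xact_start, state_change
FROM pg_stat_activity
WHERE usename = 'debezium';
```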

Thanks,

--Gunnar

Nuria Ruiz

Jul 8, 2021, 10:29:57 AM
to debe...@googlegroups.com
>Will the exception occur after 4h (i.e. while the snapshot still is running), or after 5h (i.e. when the snapshot is complete and we actually try to read from the replication slot)?
Need to do a bit more precise testing to answer this.

>In the meantime, a contribution for describing the recommended setting in the FAQ would be highly welcomed (https://github.com/debezium/debezium.github.io/blob/develop/documentation/faq.asciidoc).

Totally, will try to get to this in the next couple days.


Gunnar Morling

Jul 8, 2021, 4:42:57 PM
to debe...@googlegroups.com
On Thu, Jul 8, 2021 at 4:29 PM Nuria Ruiz <nu...@outschool.com> wrote:
>Will the exception occur after 4h (i.e. while the snapshot still is running), or after 5h (i.e. when the snapshot is complete and we actually try to read from the replication slot)?
Need to do a bit more precise testing to answer this.

>In the meantime, a contribution for describing the recommended setting in the FAQ would be highly welcomed (https://github.com/debezium/debezium.github.io/blob/develop/documentation/faq.asciidoc).

Totally, will try to get to this in the next couple days.

Excellent, thank you so much.

One other question, slightly off-topic; in the connector config you shared, you seemingly were referring to environment variables, e.g. here:

    "database.hostname": "$DATABASE_HOST",

I've never seen that before; is this actually supported by Kafka Connect, or was this just placeholder syntax you used so as not to share the actual values?

Thanks,

--Gunnar




Nuria Ruiz

Jul 9, 2021, 7:45:33 PM
to debe...@googlegroups.com
>One other question, slightly off-topic; in the connector config you shared, you seemingly were referring to environment variables, e.g. here:
>     "database.hostname": "$DATABASE_HOST",
Right, these are filled in at runtime by Terraform in our setup, so
Kafka Connect gets the "resolved" values. Because, as far as I know,
you need to supply those values explicitly.
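For what it's worth, Kafka Connect does not expand shell environment variables on its own, but since Kafka 2.0 (KIP-297) it supports pluggable config providers for resolving values at runtime; a sketch with the built-in FileConfigProvider, where the file path and property key are placeholders:

```properties
# In the Connect worker config
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider

# In the connector config, resolved by the worker at runtime
database.password=${file:/etc/kafka/secrets.properties:database.password}
```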