Connection timed out (Write failed) during initial snapshot of a MySQL database

Koleo

Feb 13, 2024, 12:38:47 PM
to debezium
Hello,

I have multiple PostgreSQL and MySQL Debezium connectors running on a distributed Kafka Connect cluster, Debezium v2.4.0.Final (Debian 11).

I have set up a connector for a MySQL database with a few big tables (the biggest one has 600 million rows).

When the connector is created, with a table include list of 7 tables, the initial snapshot runs for a few hours (about 5 or 6) but always fails with a connection timeout error (full stack trace at the end of the post):

Snapshotting of table mydb.bigTable1 failed
Connection timed out (Write failed)

After a failure like this, Debezium restarts the snapshot from scratch (from the first table)... then fails again after a few hours, and starts over again...

=> My workaround is to set up the connector with the small tables first, then update table.include.list to add the big tables and run incremental snapshots one by one (successfully).
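For reference, the incremental snapshots mentioned in the workaround are triggered through Debezium's signaling table. A sketch, assuming a signaling table named mydb.debezium_signal (the table name and signal id are placeholders; the connector's signal.data.collection property must point at the same table):

```sql
-- Ad-hoc incremental snapshot of one table via the Debezium signaling table.
-- Table name and id value are examples, not from the original post.
INSERT INTO mydb.debezium_signal (id, type, data)
VALUES ('adhoc-bigtable1-1', 'execute-snapshot',
        '{"data-collections": ["mydb.bigTable1"], "type": "incremental"}');
```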

So this is clearly related to a database connection timeout.

Now I am trying to determine which timeout should be increased on the MySQL side. Here are my settings for the most common timeouts:

mysql -e "SELECT @@connect_timeout, @@net_read_timeout, @@net_write_timeout, @@wait_timeout, @@interactive_timeout\G"
*************************** 1. row ***************************
    @@connect_timeout: 10
   @@net_read_timeout: 30
  @@net_write_timeout: 60
       @@wait_timeout: 28800
@@interactive_timeout: 28800
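
If the diagnosis does point at one of these (for example net_write_timeout, which the server applies when it is blocked writing results to a client that has stopped reading), it can be raised globally as a test. A sketch, with example values that are not a recommendation:

```sql
-- Takes effect for NEW connections only; value is in seconds.
SET GLOBAL net_write_timeout = 7200;

-- Verify from a freshly opened session:
SELECT @@global.net_write_timeout, @@session.net_write_timeout;
```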


Trying to reproduce the issue in a test environment, it does not seem related to wait_timeout or interactive_timeout, because those produce an explicit error:

Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table mysqldb.test failed
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: The client was disconnected by the server because of inactivity. See wait_timeout and interactive_timeout for configuring this behavior.

Based on the error message, can we determine which timeout should be tuned here? Does "Connection timed out (Write failed)" point to net_write_timeout, maybe? Or could it be some other setting?

Note that I also have HAProxy as an endpoint for the database connection, and it could also be the cause (or not):

  timeout  queue 1m
  timeout  connect 3s
  timeout  client 30m
  timeout  server 30m
  timeout  tunnel 30m
  timeout  client-fin 10s
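
If HAProxy turns out to be the culprit, the usual mitigations are to raise the client/server timeouts above the longest expected single-table snapshot, and to enable TCP keepalives on both sides so a long-idle connection is not silently dropped. A sketch only; section name and values are illustrative:

```
# Hypothetical listen section for the MySQL endpoint
listen mysql
    mode tcp
    timeout client 8h
    timeout server 8h
    # Send TCP keepalives toward client and server so idle
    # connections survive intermediate timeouts
    option clitcpka
    option srvtcpka
```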


Here is the full stack trace of the production problem:

[2024-02-09 14:57:41,046] INFO Exporting data from table 'mydb.bigTable2' (3 of 7 tables) (io.debezium.relational.RelationalSnapshotChangeEventSource)
[2024-02-09 14:57:41,047] ERROR Error during snapshot (io.debezium.relational.RelationalSnapshotChangeEventSource)
java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table mydb.bigTable1 failed
    at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:463)
    at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:165)
    at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:92)
    at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:253)
    at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:237)
    at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:189)
    at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table mydb.bigTable1 failed
    at io.debezium.relational.RelationalSnapshotChangeEventSource.doCreateDataEventsForTable(RelationalSnapshotChangeEventSource.java:587)
    at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$6(RelationalSnapshotChangeEventSource.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    ... 5 more
Caused by: java.sql.SQLException: Error retrieving record: Unexpected Exception: java.net.SocketException message given: Connection timed out (Write failed)

Nested Stack Trace:


** BEGIN NESTED EXCEPTION **

java.net.SocketException
MESSAGE: Connection timed out (Write failed)

STACKTRACE:

java.net.SocketException: Connection timed out (Write failed)
    at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
    at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
    at java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:271)
    at java.base/sun.security.ssl.OutputRecord.changeWriteCiphers(OutputRecord.java:224)
    at java.base/sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce(KeyUpdate.java:323)
    at java.base/sun.security.ssl.KeyUpdate$KeyUpdateKickstartProducer.produce(KeyUpdate.java:171)
    at java.base/sun.security.ssl.SSLHandshake.kickstart(SSLHandshake.java:538)
    at java.base/sun.security.ssl.PostHandshakeContext.kickstart(PostHandshakeContext.java:68)
    at java.base/sun.security.ssl.TransportContext.kickstart(TransportContext.java:240)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:449)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:427)
    at java.base/sun.security.ssl.SSLSocketImpl.tryKeyUpdate(SSLSocketImpl.java:1554)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1526)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1481)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1070)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:133)
    at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
    at com.mysql.cj.protocol.a.SimplePacketReader.readMessageLocal(SimplePacketReader.java:137)
    at com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:102)
    at com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:45)
    at com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:62)
    at com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:41)
    at com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:66)
    at com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:44)
    at com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:75)
    at com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:42)
    at com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1648)
    at com.mysql.cj.protocol.a.result.ResultsetRowsStreaming.next(ResultsetRowsStreaming.java:194)
    at com.mysql.cj.protocol.a.result.ResultsetRowsStreaming.next(ResultsetRowsStreaming.java:62)
    at com.mysql.cj.jdbc.result.ResultSetImpl.next(ResultSetImpl.java:1813)
    at io.debezium.jdbc.CancellableResultSet.next(CancellableResultSet.java:52)
    at io.debezium.relational.RelationalSnapshotChangeEventSource.doCreateDataEventsForTable(RelationalSnapshotChangeEventSource.java:571)
    at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$6(RelationalSnapshotChangeEventSource.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)


** END NESTED EXCEPTION **

cst labs

Feb 28, 2024, 9:56:15 PM
to debezium

I ran into some timeouts during initial snapshotting, and I tried both the cursor and streaming approaches. I ended up setting a very large timeout value (there is a good reason for that, given the size of the db). You may want to try that too:

config name: database.initial.statements
value: "SET SESSION wait_timeout=172800"

Koleo

Feb 29, 2024, 5:16:31 AM
to debezium
Thank you for your reply.

My wait_timeout was already set to a quite high value on the server side (initially 8 hours, later increased to 16 hours (57600)).

However, I tried setting it at the session level in the connector config, as you suggested:

  "database.initial.statements": "SET SESSION wait_timeout=172800",

But it still fails about 1 hour into the initial import (when the first "big table" snapshot finishes and it starts on the second table).

FYI, on a test connector where I can reproduce the problem:
- table1 has about 140 million rows
- table2 has about 100 rows
=> it always fails at the end of table1, when it starts to snapshot table2:

[2024-02-29 09:45:26,549] INFO   Exported 126994845 of 137457302 records for table 'ftest.table1' after 01:15:49.762 (io.debezium.relational.RelationalSnapshotChangeEventSource)
...
[2024-02-29 09:45:35,279] INFO Exporting data from table 'ftest.table2' (2 of 3 tables) (io.debezium.relational.RelationalSnapshotChangeEventSource)
[2024-02-29 09:45:35,279] ERROR Error during snapshot (io.debezium.relational.RelationalSnapshotChangeEventSource)
java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table ftest.table1 failed
...
Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table ftest.table1 failed
...

Caused by: java.sql.SQLException: Error retrieving record: Unexpected Exception: java.net.SocketException message given: Connection timed out (Write failed)

I also tried to raise some other timeout values:

  "database.initial.statements": "SET SESSION wait_timeout=172800;SET SESSION interactive_timeout=172800;SET SESSION net_write_timeout=7200;SET SESSION net_read_timeout=7200;SET SESSION lock_wait_timeout=7200;SET SESSION innodb_lock_wait_timeout=7200",

but it did not help either.
Note that I can confirm in the MySQL logs that the settings from database.initial.statements are applied as expected.
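
For anyone wanting to double-check this from the database itself rather than the logs: on MySQL 5.7+ the per-session values of another connection can be inspected via performance_schema. A sketch (assumes the Debezium user is named "debezium"):

```sql
-- Inspect the timeout variables of the Debezium connection's session.
-- The user name 'debezium' is an example.
SELECT t.PROCESSLIST_USER, v.VARIABLE_NAME, v.VARIABLE_VALUE
FROM performance_schema.variables_by_thread v
JOIN performance_schema.threads t USING (THREAD_ID)
WHERE t.PROCESSLIST_USER = 'debezium'
  AND v.VARIABLE_NAME IN ('wait_timeout', 'net_write_timeout', 'net_read_timeout');
```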


Other ideas are welcome, thanks.

Chris Cranford

Feb 29, 2024, 7:46:12 AM
to debe...@googlegroups.com
Hi Koleo -

What is your connector configuration? Are you setting any database.* or driver.* properties too?

Chris

Koleo

Feb 29, 2024, 9:33:38 AM
to debezium
No driver.* settings.
Here are the database.*, snapshot.* and table.* settings:

  "database.hostname": "${file:/path-to-db-creds/ftest:hostname}",
  "database.include.list": "ftest",

  "database.initial.statements": "SET SESSION wait_timeout=172800;SET SESSION interactive_timeout=172800;SET SESSION net_write_timeout=7200;SET SESSION net_read_timeout=7200;SET SESSION lock_wait_timeout=7200;SET SESSION innodb_lock_wait_timeout=7200",
  "database.password": "${file:/path-to-db-creds/ftest:password}",
  "database.port": "${file:/path-to-db-creds/ftest:port}",
  "database.server.id": "123456",
  "database.user": "${file:/path-to-db-creds/ftest:user}",

  "snapshot.locking.mode": "minimal_percona",
  "snapshot.mode": "when_needed",

  "table.include.list": "ftest\\.table1,ftest\\.table2",


FYI, the database "ftest" runs on Percona Server for MySQL 8.0.27 (targeting a replica member of a 3-node InnoDB cluster using Group Replication).

cst labs

Feb 29, 2024, 11:03:23 AM
to debezium
I had run into a few timeouts when running snapshots. Besides looking at the various socket and db-specific timeouts we played with, I also asked the DBAs to check the db logs to confirm whether the timeout came from the db side or from some intermediate layer (we use a ProxySQL that sits in the middle). I also tried snapshot.fetch.size with a value of -1. In this mode it uses a cursor fetch, which is expensive, but my rationale was that streaming mode may not keep the connection active, letting this timeout be triggered. I suggest taking a look at the db logs to see what kind of error appears there.
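
In connector-config terms, the suggestion above would look roughly like this (a sketch; the -1 value and its effect on cursor vs. streaming fetching should be checked against the Debezium MySQL connector docs for your version):

```json
{
  "snapshot.fetch.size": "-1"
}
```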

One more thing: I don't use the primary database for snapshotting; I have a dedicated replica for it.

Koleo

Mar 1, 2024, 11:17:46 AM
to debezium
The initial snapshot eventually completed successfully with the following additional config on my connector:

"database.initial.statements": "SET SESSION wait_timeout=172800;SET SESSION interactive_timeout=172800;SET SESSION net_write_timeout=7200;SET SESSION net_read_timeout=7200;SET SESSION lock_wait_timeout=7200;SET SESSION innodb_lock_wait_timeout=7200",
"snapshot.locking.mode": "none",
"min.row.count.to.stream.results": 0,
"snapshot.fetch.size": 10000,

=> Most probably thanks to the "snapshot.fetch.size" option.
Just note that it created a huge InnoDB temp file (/path-to-mysql-datadir/#innodb_temp, 161 GB) during the snapshot...

I'll check again with just snapshot.fetch.size and different values to make sure that it's the option that "fixes" the timeout issue.
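
If the temp-file growth is a concern when re-testing, on MySQL 8.0.13+ the per-session temporary tablespace usage can be watched during the snapshot. A sketch:

```sql
-- Shows which sessions are consuming temporary tablespace and how much
-- (available from MySQL 8.0.13).
SELECT ID, PATH, SIZE, STATE, PURPOSE
FROM INFORMATION_SCHEMA.INNODB_SESSION_TEMP_TABLESPACES;
```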
