Kafka connect JDBC sourcing, timzones issue

1,760 views
Skip to first unread message

Jeremy Nguyen

unread,
Jan 19, 2018, 6:12:43 AM1/19/18
to Confluent Platform
Hi there,

I'm seeking for ideas/workarounds on this specific problem, hoping that someone ran into this already.

We're sourcing a MySQL DB with the JDBC connector (shipped with Confluent 3.2.2). The connector runs in the timestamp+incrementing mode. Only inserts are done in the DB, the rows are never updated. We noticed that some records were missing, even with the parameter timestamp.delay.interval.ms set to to a high value.

2 issues were identified:
  1. The DB is configured to use the system timezone (CET), and that cannot be changed as we have no control on it.
  2. The ts field (typed as timestamp in the DB schema) is populated with a UTC timezoned value at insert time, but the DB interprets it as a timestamp in CET. So the effective stored value of the ts field is UTC-1.
    Note: it shouldn't influence the sourcing as the connector uses the last sourced value as offset, and it starts sourcing from there for the next cycle.
With the following connector config:
{
 
"name": "my-connector",
 
"config": {
   
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
   
"timestamp.column.name": "ts",
   
"incrementing.column.name": "id",
   
"transforms.insertKey.fields": "key",
   
"connection.password": "pwd",
   
"validate.non.null": "false",
   
"tasks.max": "1",
   
"transforms.insertKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
   
"query": "SELECT fields FROM table",
   
"transforms": "insertKey",
   
"batch.max.rows": "100",
   
"timestamp.delay.interval.ms": "5000",
   
"mode": "timestamp+incrementing",
   
"topic.prefix": "my-topic",
   
"connection.user": "user",
   
"poll.interval.ms": "10000",
   
"connection.url": "jdbc:mysql://host/db?useUnicode=true&characterEncoding=UTF-8"
 
}
}

The connector's debug message prints: (dates are supposedly in UTC)
[2017-12-19 12:00:17,589] DEBUG Executing prepared statement with start time value = 2017-12-19 12:00:01.000 end time = 2017-12-19 13:00:12.000 and incrementing value = 6499228341699480218 (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier:159)

The issue is the end time being in CET instead of UTC, due to the fact that the connector uses CURRENT_TIMESTAMP to retrieve this value and that the DB is in CET.

Our current workaround is to compensate the timezone through the parameter timestamp.delay.interval.ms.

With this we didn't notice any missing records anymore, but this solution is dependent on the Daylight saving time. So the connector's configuration needs to be changed and redeployed twice a year.

The other solutions we could think of are more heavy:
  • Write our own Kafka connector to handle this time shift.
  • Similarly, write our own Kafka producer.
  • Patch the JDBC connector to use `UTC_TIMESTAMP` instead of `CURRENT_TIMESTAMP`
But these will be exclusively for this DB.


We're looking for a more long term solution, if it exists ;-)


Thanks,
Jeremy




Robin Moffatt

unread,
Jan 19, 2018, 7:02:05 AM1/19/18
to confluent...@googlegroups.com
Have you considered a proper CDC approach, with Debezium

I'd also suggest raising these issues on the github page too https://github.com/confluentinc/kafka-connect-jdbc/issues


--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platform+unsub...@googlegroups.com.
To post to this group, send email to confluent-platform@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/3e7a124b-2ab4-46e6-b030-f5b448196725%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Jérémy

unread,
Jan 19, 2018, 7:47:49 AM1/19/18
to confluent...@googlegroups.com
We considered using Oracle's GoldenGate, but went first with Kafka connect as it was easier to setup.
I didn't know Debezium, looks promising.

Thanks for your suggestions!

I forgot to mention on my original post that we also tried many JDBC url parameters to manipulate the timezones, but the dates were always all shifting so that our problem persisted.

To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platform+unsubscribe@googlegroups.com.


--
You received this message because you are subscribed to a topic in the Google Groups "Confluent Platform" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/confluent-platform/wZ5yYjaN18E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to confluent-platform+unsub...@googlegroups.com.

To post to this group, send email to confluent-platform@googlegroups.com.

Philip Schmitt

unread,
Jan 20, 2018, 4:03:38 AM1/20/18
to Confluent Platform
For Oracle, we are using SYS_EXTRACT_UTC() in the query to transform the database time to the timezone the connector expects.
Should also work with MySQL: https://stackoverflow.com/questions/26037830/replace-sys-extract-utc-in-mysql

Jérémy

unread,
Jan 20, 2018, 4:33:16 AM1/20/18
to confluent...@googlegroups.com
Do you mind sharing an example? From my understanding this will only affect the fetched records, but won't affect the actual connector's query.

Thanks!

--
You received this message because you are subscribed to a topic in the Google Groups "Confluent Platform" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/confluent-platform/wZ5yYjaN18E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to confluent-platform+unsub...@googlegroups.com.
To post to this group, send email to confluent-platform@googlegroups.com.

Philip Schmitt

unread,
Jan 21, 2018, 4:17:42 AM1/21/18
to Confluent Platform
Right, the connector will still use UTC timestamps when it adds the WHERE conditions to the query.
But that's okay, when you transform the timestamp you query to UTC.

So for Oracle you have something like SELECT SYS_EXTRACT_UTC(TIMESTAMP_COLUMN) timestamp, OTHER_ATTRIBUTE FROM...


For performance reason, we added a function-based index to the table on SYS_EXTRACT_USC(TIMESTAMP_COLUMN).

Jérémy

unread,
Apr 25, 2018, 9:21:45 AM4/25/18
to confluent...@googlegroups.com
Hi there (sorry for the delay),

The issue you posted sums up quite well our situation. We have no control on the database we source hence can't modify the timezone of the queried timestamp field.
We've decided to stick with the solution of manually compensate the timezone, and adjust the connector config at each DST.

Thanks for the help and ideas!

--
You received this message because you are subscribed to a topic in the Google Groups "Confluent Platform" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/confluent-platform/wZ5yYjaN18E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to confluent-platform+unsub...@googlegroups.com.
To post to this group, send email to confluent-platform@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages