How does streamsets jdbc connector handle deletes and updates?

903 views
Skip to first unread message

Oleg Berfirer

unread,
Aug 25, 2016, 11:21:16 AM8/25/16
to sdc-user
So far was able to set up a number of flows using JDBC connector and a technique to pull the records based on the offset of the ID field in the table. It works, but  what happens if there is a deletion or an update that needs to propagate? Is there a way to handle those use cases using JDBC connector?


Pat Patterson

unread,
Aug 25, 2016, 1:01:32 PM8/25/16
to Oleg Berfirer, sdc-user
Hi Oleg,

In the general case, there are two options with the JDBC Origin: incremental and full. Incremental mode uses the offset technique, as you have already discovered. Full mode periodically pulls all the data specified in the query.

However... We do support 'change data capture' (CDC) for MS SQL Server, and, coming soon, Oracle. CDC uses database-specific techniques to obtain a feed of transactions. With CDC you will get 'insert', 'update' and 'delete' records.

I'm interested in your use case... What destination(s) are you planning to write the data to? Feel free to reply to me privately if need be.

Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson

On Thu, Aug 25, 2016 at 8:21 AM, Oleg Berfirer <oleg.b...@gmail.com> wrote:
So far was able to set up a number of flows using JDBC connector and a technique to pull the records based on the offset of the ID field in the table. It works, but  what happens if there is a deletion or an update that needs to propagate? Is there a way to handle those use cases using JDBC connector?


--
You received this message because you are subscribed to the Google Groups "sdc-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+unsubscribe@streamsets.com.
Visit this group at https://groups.google.com/a/streamsets.com/group/sdc-user/.

Oleg Berfirer

unread,
Aug 26, 2016, 3:47:49 PM8/26/16
to sdc-user, oleg.b...@gmail.com
Hi Pat, 
I responded to your private email yesterday. Did you see my message?
In short, we are targeting MySql, MSSql, Postgres, Mongo, S3 and Kinesis as sources, and would be writing to Kafka as a primary destination.
I'd be interested to learn how to implement "Change Data Capture" option with MsSql, but also Mysql and Postgres. I saw the CDC tab on the JDBC configuration screen, but not sure what the Transaction ID Column Name would be for SQL server. 
I found this on MSDN (https://msdn.microsoft.com/en-us/library/ms186327.aspx), so will give it a try using "transaction_id".

Regards,
Oleg
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+u...@streamsets.com.

Pat Patterson

unread,
Aug 26, 2016, 4:31:44 PM8/26/16
to Oleg Berfirer, sdc-user
Hi Oleg,

Apologies - I was out most of yesterday afternoon and this morning. Just catching up with email now.

The MS SQL Server CDC functionality is documented at https://streamsets.com/documentation/datacollector/latest/help/#Origins/JDBCConsumer.html - it looks like you're on the right lines with transaction_id. I haven't worked with this part of the product yet; unfortunately the engineer who developed it is out today, but I'll point him to this thread and he can maybe add some detail on Monday.

Cheers,

Pat

On Fri, Aug 26, 2016 at 12:47 PM, Oleg Berfirer <oleg.b...@gmail.com> wrote:
Hi Pat, 
I responded to your private email yesterday. Did you see my message?
In short, we are targeting MySql, MSSql, Postgres, Mongo, S3 and Kinesis as sources, and would be writing to Kafka as a primary destination.
I'd be interested to learn how to implement "Change Data Capture" option with MsSql, but also Mysql and Postgres. I saw the CDC tab on the JDBC configuration screen, but not sure what the Transaction ID Column Name would be for SQL server. 
I found this on MSDN (https://msdn.microsoft.com/en-us/library/ms186327.aspx), so will give it a try using "transaction_id".

Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson
Regards,
Oleg
 

On Thursday, August 25, 2016 at 1:01:32 PM UTC-4, Pat Patterson wrote:
Hi Oleg,

In the general case, there are two options with the JDBC Origin: incremental and full. Incremental mode uses the offset technique, as you have already discovered. Full mode periodically pulls all the data specified in the query.

However... We do support 'change data capture' (CDC) for MS SQL Server, and, coming soon, Oracle. CDC uses database-specific techniques to obtain a feed of transactions. With CDC you will get 'insert', 'update' and 'delete' records.

I'm interested in your use case... What destination(s) are you planning to write the data to? Feel free to reply to me privately if need be.

Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson

On Thu, Aug 25, 2016 at 8:21 AM, Oleg Berfirer <oleg.b...@gmail.com> wrote:
So far was able to set up a number of flows using JDBC connector and a technique to pull the records based on the offset of the ID field in the table. It works, but  what happens if there is a deletion or an update that needs to propagate? Is there a way to handle those use cases using JDBC connector?


--
You received this message because you are subscribed to the Google Groups "sdc-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+u...@streamsets.com.
Visit this group at https://groups.google.com/a/streamsets.com/group/sdc-user/.

--
You received this message because you are subscribed to the Google Groups "sdc-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+unsubscribe@streamsets.com.

Pat Patterson

unread,
Aug 26, 2016, 4:37:48 PM8/26/16
to Oleg Berfirer, sdc-user

Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson

Oleg Berfirer

unread,
Aug 26, 2016, 4:50:24 PM8/26/16
to Pat Patterson, sdc-user
No worries! 
Thanks, Pat. I tried to point the transaction column, but the documentation is rather sparse, so it's not clear, for example, whether I still need to define a main query on the JDBC tab. It seems like it's still necessary, in which case I probably don't understand the purpose of CDC. Wouldn't it track all CRUD events on the table based on the events from the transaction log? If so, what's the purpose of the query in the "SQL Query" window? You're right, it would be great if you could connect me with your engineer.

thank you,
Oleg

Adam Kunicki

unread,
Aug 30, 2016, 1:55:39 PM8/30/16
to Oleg Berfirer, Pat Patterson, sdc-user
Hi Oleg,

The current support for CDC is for Microsoft SQL Server's CDC functionality. Enabling CDC in SQL Server requires SQL Server Enterprise.

this functionality in SQL Server 2008 (the same process also applies to 2012). It also explains how the feature works.

The JDBC origin in StreamSets can be used to query the change tables created by the CDC feature and the JDBC Destination can then apply these changes
to a destination database (e.g. postgres). The "Change Log Format" option in the JDBC destination will understand how to handle records output by the
type of example query below and apply inserts, updates, and deletes. 
Currently there is only automatic handling of the change log format in the JDBC destination and not other destinations.

For example, if you had a table called dbo.test_table and enabled CDC for that table you will end up with one such as cdc.dbo_test_table_CT

For the query in the JDBC origin you can use something like:

SELECT * FROM cdc.dbo_test_table_CT
WHERE __$start_lsn > CAST(0x${OFFSET} AS BINARY(10))
ORDER BY __$start_lsn ASC

Keep in mind that CDC is not the same feature as SQL Server's Change Tracking (CT) feature (which is also available in Azure SQL).

Hope this helps to answer some of your questions. We're working on bringing more CDC features to the product soon and improving the usability of CDC in data collector as well.

-Adam


Adam Kunicki
StreamSets | Field Engineer
mobile: 415.890.DATA (3282) | linkedin

GanWei Shi

unread,
Jan 30, 2017, 8:10:02 AM1/30/17
to sdc-user, oleg.b...@gmail.com, p...@streamsets.com
Hi adam, what should I fill in the Offset Column in JDBC tab and Transaction ID Column in the CDC tab?

Pat Patterson

unread,
Jan 30, 2017, 11:21:24 AM1/30/17
to GanWei Shi, sdc-user, Oleg Berfirer
Hi Gan Wei,

From Adam on another thread here:

SQL Server CDC (only part of Enterprise editions) works by writing the change log to a system table that can be queried. 
Here's some more information on that feature and how to enable it on a table: https://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-(cdc)-in-sql-server-2008/

This table can then be queried using SDC's JDBC origin. In the CDC configuration tab you can also specify a transaction id column which will
bypass the max batch size to ensure that all changes in a single transaction are honored. 

In the JDBC Destination you can choose the MSSQL Server CDC format which
looks for the __$start_lsn, __$operation, etc fields to determine what kind of operation to perform.

Note: this is different from Change Tracking (CT), which is a similar feature in SQL Server

For CDC a query may look like this:

SELECT * FROM cdc.dbo_test_table_CT
WHERE __$start_lsn > CAST(0x${OFFSET} AS BINARY(10))
ORDER BY __$start_lsn ASC
Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson

Reply all
Reply to author
Forward
0 new messages