Debezium Oracle connector when paused while taking a large data snapshot


FairyTail1279

Feb 21, 2022, 4:43:29 AM
to debezium
Hi.
If the Debezium Oracle connector is paused while taking a large data snapshot, will there still be a connection to the source database, and does it put a load on the source database?

Thanks
FairyTail

Chris Cranford

Feb 21, 2022, 7:44:43 AM
to debe...@googlegroups.com, FairyTail1279
Hi Fairy -

The short answer: yes, there will still be a connection from the connector to the source database.

The longer answer: PAUSE works in Kafka Connect by simply pausing the internal polling that Kafka Connect does against the connector's event queue.  The connector's event queue is bounded, meaning that it can only hold up to max.queue.size elements before Debezium's event loop pauses until space becomes available in the queue.  This is partly a way to simulate the pause in Debezium when a connector is paused, but more importantly, it ensures that if we're reading from the source faster than Kafka Connect can emit to the broker, we avoid a potential OOM situation.  So pausing the connector will eventually lead to the Debezium event loop being blocked, but that will occur some time after the connector has been PAUSED and depends entirely on the volume of changes from the source; however, at no point do we close the database connection.  Most connectors keep these connections persistent for performance reasons.
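A toy Python sketch of the backpressure described above (the queue bound, row counts, and timing are illustrative stand-ins, not Debezium internals):

```python
import queue
import threading
import time

# MAX_QUEUE_SIZE stands in for the connector's max.queue.size setting;
# the integer rows stand in for snapshot records.
MAX_QUEUE_SIZE = 4
event_queue = queue.Queue(maxsize=MAX_QUEUE_SIZE)
produced = []

def snapshot_reader():
    # put() blocks once the queue is full -- the same way Debezium's event
    # loop blocks after a PAUSED connector stops being polled by Kafka Connect.
    for row in range(10):
        event_queue.put(row)
        produced.append(row)

reader = threading.Thread(target=snapshot_reader, daemon=True)
reader.start()

# "Paused": nothing polls the queue, so the reader stalls after filling it.
# Note the (simulated) source connection would remain open the whole time.
time.sleep(0.2)
paused_count = len(produced)  # only MAX_QUEUE_SIZE rows got through

# "Resumed": polling drains the queue and the reader finishes its work.
drained = [event_queue.get() for _ in range(10)]
reader.join()
```

The key point the sketch illustrates: the pause is felt by the reader only once the bounded queue fills, which is why the block happens some time after the PAUSE, proportional to change volume.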

HTH,
CC
--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/b43e6d87-53f0-48ff-b651-d39e608c750an%40googlegroups.com.

FairyTail1279

Feb 21, 2022, 8:29:38 AM
to debezium
My question is whether instructing Kafka Connect to pause the Debezium connector while it is taking a snapshot will create a load on the source database side. The snapshot couldn't finish in time, so I had to pause it and resume the snapshot later, but I'm worried about the load on the source database while the Debezium connector is paused.

On Monday, February 21, 2022 at 19:44:43 UTC+7, Chris Cranford wrote:

Chris Cranford

Feb 21, 2022, 8:54:36 AM
to debe...@googlegroups.com, FairyTail1279
Hi Fairy

The short answer is that eventually there won't be a load on the source database after you pause the connector, once the event queue fills.  The Debezium connector will simply block waiting to put the next record from the JDBC result set into the event queue.  Now, what you need to be careful about is leaving the connector in this state for an extended period of time.  The connector may throw an error after you unpause if the connection to the database has been broken, if the SCN at which the snapshot was running is no longer in the undo retention space, etc.

My advice is to use PAUSE if you need to halt the connector's operations for a very short window of time, i.e. a few minutes; otherwise you should stop the connector and restart it at some point in the future.
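For reference, pause, resume, and delete are all issued through the Kafka Connect REST API (PUT /connectors/&lt;name&gt;/pause, PUT /connectors/&lt;name&gt;/resume, DELETE /connectors/&lt;name&gt;). A small Python sketch of the request shapes; the host and connector name are assumptions for illustration:

```python
# Hypothetical Connect host and connector name -- adjust to your deployment.
CONNECT_URL = "http://localhost:8083"

def connect_request(action, connector="oracle-source"):
    """Return the (HTTP method, URL) pair for a Kafka Connect lifecycle call."""
    if action in ("pause", "resume"):
        return ("PUT", f"{CONNECT_URL}/connectors/{connector}/{action}")
    if action == "delete":
        # A full stop: delete the connector, re-create it later to restart.
        return ("DELETE", f"{CONNECT_URL}/connectors/{connector}")
    raise ValueError(f"unsupported action: {action}")
```

For example, `connect_request("pause")` yields the PUT against `/connectors/oracle-source/pause` that a short-window pause would use.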

HTH,
CC

FairyTail1279

Feb 21, 2022, 10:47:44 PM
to debezium
Hi.
Can you give me instructions for how to do CDC with 2 TB of Oracle data, given that the connector can only run 2-3 hours per day and data flows into the source database all the time? The concern is that snapshots of tables with a large amount of data may not be able to complete in time.

Thanks.
FairyTail

On Monday, February 21, 2022 at 20:54:36 UTC+7, Chris Cranford wrote:

Chris Cranford

Feb 22, 2022, 9:43:21 AM
to debe...@googlegroups.com, FairyTail1279
Hi Fairy

May I ask why the connector can only run for 2 to 3 hours per day?  How exactly do you envision this working?  I have some concerns about the archive logs remaining available for the connector to do its work if you're constrained to running for only about 12% of the day.

Chris

FairyTail1279

Feb 22, 2022, 8:24:44 PM
to debezium
Hi Chris

I think the same as you, but my team is worried that taking snapshots that query data from tables will create a burden on the customer's database, which the client wouldn't allow us to do during peak time. I can't tell the team how much snapshotting tables with a lot of data will reduce the performance of the customer's database.

Thanks
FairyTail
On Tuesday, February 22, 2022 at 21:43:21 UTC+7, Chris Cranford wrote:

Chris Cranford

Feb 23, 2022, 9:46:45 AM
to debe...@googlegroups.com, FairyTail1279
Hi Fairy

So then it would seem that your concern is mostly about the performance impact of the snapshot rather than the streaming, do I understand correctly?

If we step back, Debezium offers two types of snapshot modes:

    1. Traditional snapshot that happens before streaming begins
    2. Incremental snapshot that happens concurrently with streaming

As you know, the traditional snapshot must be completed in one execution or else it must be restarted.  I'm not sure whether 2 TB of data can be extracted in a 2-3 hour window; you'd likely need to run a test in a non-production environment to confirm you can meet that requirement.  What I can say is that this mode uses Oracle flashback queries.  It also requires that your UNDO_RETENTION on the database be longer than the runtime of the snapshot, or the snapshot will fail when the SCN used for the flashback query ages out of the retention period.  The default (as of Oracle 19) is 15 minutes, but generally your DBA will have increased this in production environments.  Increased retention here is mostly a storage concern, since the undo tablespace that stores the undo log has to grow to store the volume of changes for the UNDO_RETENTION period.  This mode will also require that your archive log retention period be longer than your snapshot duration, since logs for streamed events won't be read until after the snapshot concludes.  Lastly, do note that we don't apply any locks to the tables when we generate a snapshot; there is only ever a small lock to generate table schema structures for the connector, and this lock can be disabled if you can yourself guarantee no schema changes during the snapshot.
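As a back-of-the-envelope feasibility check of the UNDO_RETENTION constraint above (the figures below are illustrative assumptions, not measurements):

```python
# A flashback-query-based snapshot fails once its runtime exceeds the
# database's UNDO_RETENTION (the classic ORA-01555 "snapshot too old").
ORACLE_19_DEFAULT_UNDO_RETENTION_S = 900  # 15-minute default

def retention_covers_snapshot(undo_retention_s, snapshot_duration_s):
    """True if the undo retention window outlasts the planned snapshot."""
    return undo_retention_s > snapshot_duration_s

# With the default retention, a 3-hour snapshot window cannot succeed;
# the DBA would need to raise UNDO_RETENTION (and size the undo
# tablespace accordingly) beyond the expected snapshot runtime.
three_hour_window_s = 3 * 3600
feasible = retention_covers_snapshot(ORACLE_19_DEFAULT_UNDO_RETENTION_S,
                                     three_hour_window_s)
```

The check is trivial, but it makes the sizing rule concrete: retention must exceed snapshot runtime, not merely approach it.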

For incremental snapshots, we shift gears quite a bit.  In this mode, the snapshot can be resumed from where it left off.  In addition, we stream changes concurrently with the snapshot.  We do not rely on flashback queries for a consistent snapshot, so the UNDO_RETENTION configuration does not come into play.  Additionally, since we perform the snapshot concurrently with streaming, you have less of a concern about archive log retention, because we begin streaming changes much sooner without having to take a consistent snapshot first.  But with this mode, there is a caveat that schema changes are *NOT* permitted during the incremental snapshot.
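For context, an ad-hoc incremental snapshot is typically triggered by inserting a row into the connector's signaling table (an id/type/data row with type execute-snapshot). The schema and table names below are assumptions; a small Python helper that builds such a signal statement:

```python
import json
import uuid

def build_snapshot_signal(signal_table, data_collections):
    """Build the INSERT that triggers a Debezium ad-hoc incremental
    snapshot via the signaling table's (id, type, data) columns."""
    payload = json.dumps({"data-collections": data_collections})
    return (
        f"INSERT INTO {signal_table} (id, type, data) "
        f"VALUES ('{uuid.uuid4()}', 'execute-snapshot', '{payload}')"
    )

# Hypothetical signal table and capture table names:
stmt = build_snapshot_signal("DEBEZIUM.SIGNAL", ["INVENTORY.CUSTOMERS"])
```

The signal table itself must be listed in the connector's `signal.data.collection` property so the connector watches it for these rows.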

Now, if you can only run the connector for 2 to 3 hours per day even when streaming changes, then I think you may have some additional concerns.  The connector always mines from where it left off, so you're going to need a pretty significant archive log retention period in place to support this.  Furthermore, you're likely going to be perpetually behind in capturing change events, so you're unlikely ever to reach near real-time.

Chris

FairyTail1279

Feb 23, 2022, 10:51:32 AM
to debezium
Hi Chris.
Can you explain how incremental snapshots work? Does the event stream include just READ events, or also UPDATE and DELETE events, and are they sent to Kafka in the order the events occurred? And is it much slower than a traditional snapshot?

Thanks for your kindness.
FairyTail

On Wednesday, February 23, 2022 at 21:46:45 UTC+7, Chris Cranford wrote:

Chris Cranford

Feb 23, 2022, 2:32:37 PM
to debe...@googlegroups.com, FairyTail1279
Hi Fairy -

I would highly recommend taking a look at this section in the documentation:
https://debezium.io/documentation/reference/stable/connectors/oracle.html#oracle-incremental-snapshots

In short, you receive READ events interwoven with UPDATE and DELETE events.  If a snapshot event and a stream event with the same primary key occur within the snapshot window, the stream event (INSERT/UPDATE/DELETE) has priority and will be emitted, and the snapshot event is discarded.  Order will be maintained in the topic by primary key, but as I mentioned, you will get READ events interwoven with INSERT/UPDATE/DELETE events, unlike the traditional approach of all READ events followed by all INSERT/UPDATE/DELETE events afterward.
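A simplified sketch of that priority rule within a snapshot window (not Debezium's actual implementation, just the deduplication behavior described above):

```python
def merge_window(snapshot_reads, streamed_events):
    """Merge one snapshot chunk with the stream events seen in its window.

    snapshot_reads: {primary_key: row} pending READ events for the chunk.
    streamed_events: list of (primary_key, op, row) captured concurrently.
    A streamed change wins over the snapshot READ for the same key; the
    stale READ is discarded rather than emitted.
    """
    window = dict(snapshot_reads)  # pending READ events, keyed by PK
    emitted = []
    for pk, op, row in streamed_events:
        window.pop(pk, None)        # drop the superseded snapshot READ
        emitted.append((pk, op, row))
    # Remaining snapshot rows go out as READ ("r") events at window close.
    emitted.extend((pk, "r", row) for pk, row in window.items())
    return emitted
```

So a key updated mid-window yields only the UPDATE event, while untouched keys still surface as interwoven READ events.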

As for speed, I don't have any specific metrics handy, but in general it will be slightly slower than the traditional snapshot at emitting events for all records.  This is because the snapshot is processed in chunks rather than as a single row-by-row iteration over a complete result set.  The snapshot is also handled concurrently with capturing changes from the transaction logs, whereas in the traditional approach you don't have that extra work happening during the capture loop.  So yes, I would expect incremental snapshots to take slightly longer, but you gain the ability to restart the snapshot and to capture changes concurrently; there is a trade-off here you need to consider.

Chris

FairyTail1279

Feb 25, 2022, 6:09:04 AM
to debezium
Hi. Chris

    Due to time constraints and the huge data size, my team decided to create a dummy table first and use Debezium to snapshot that table to get an SCN, then use that SCN to snapshot the other tables by querying "select * from ... as of scn ..." for each table into Kafka.
The previous Debezium connector gradually added tables to capture events. What do you think of this method, and is there a better option?

Thanks 
Fairy

On Thursday, February 24, 2022 at 2:32:37 UTC+7, Chris Cranford wrote:

Chris Cranford

Feb 25, 2022, 9:23:34 AM
to debe...@googlegroups.com, FairyTail1279
Fairy -

I assume you supplied the SELECT statement via the "snapshot.select.statement.overrides" configuration property on the connector.  If so, that's entirely unnecessary: what you did is precisely what the traditional snapshot process already does using flashback queries.

As to which is better, try both and see.  Both traditional and incremental snapshots have benefits and constraints, and I've outlined those earlier in the thread.  We have never had a user express a time requirement like yours, nor did we design Debezium with such a requirement in mind, so some experimentation is needed.  However, given the volume of data and the limited access time to the database, you may well find that incremental snapshots prevail.  Since incremental snapshots can be restarted, and since you stream changes concurrently with snapshot rows, you get the best of both worlds without many of the constraints a traditional snapshot would impose at this data volume.  But if you cannot lock down and prevent schema changes during the incremental snapshot phase, then unfortunately a traditional snapshot is the only viable option at this time, and you'll need to see whether you can snapshot 2 TB of data within your access window.

Chris