Debezium 2.4 XStream (Oracle ORA-21560)


Kuanysh Tastandiyev

Dec 30, 2023, 12:41:15 AM
to debezium

We have Oracle 12.2.0 and Debezium (latest) working together through XStream. Our Kafka is not very stable, so sometimes brokers can be lost temporarily (i.e., they restart). After some of those restarts we cannot continue to capture data: we either face ORA-21560, or the process starts correctly but Oracle does not seem to register our Debezium connector (the outbound server does not seem to have anyone "attached" to it).


So, here are our questions:

1. How do we deal with ORA-21560? Right now we have to recreate the outbound server (and lose data) to continue the capture process; is there any other way to resume capture after a not-so-graceful connector restart?

2. What could be the reason behind the odd behavior of Debezium? I.e., the connector runs but does not get new data ("no new data" in the logs), and XStream doesn't seem to register it as attached.


Our current best guess is that Debezium asks XStream for a wrong SCN/LCR position, and that causes ORA-21560; if we are correct, can anyone tell us what can be fixed (and where) to change the SCN/LCR position for our connector?

Chris Cranford

Jan 2, 2024, 7:37:06 AM
to debe...@googlegroups.com
Hi,

Are you using Debezium for Oracle with the Embedded Engine or Debezium Server, or are you using the Kafka Connect runtime?

From Kafka Connect's point of view, the watermark is only flushed to Oracle during the WorkerSourceTask's commit phase, and this is done either (a) when there is no data pending commit to Kafka or (b) after data that was pending has been successfully committed. One way to confirm there isn't some out-of-order situation happening is to enable debugging for the following two classes:

    org.apache.kafka.connect.runtime.WorkerSourceTask
    io.debezium.connector.oracle.xstream.XstreamStreamingChangeEventSource
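
For example, with the stock Kafka Connect log4j configuration this would be (a sketch; adapt it to whatever logging framework your deployment uses):

    # connect-log4j.properties
    log4j.logger.org.apache.kafka.connect.runtime.WorkerSourceTask=DEBUG
    log4j.logger.io.debezium.connector.oracle.xstream.XstreamStreamingChangeEventSource=DEBUG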

You should notice that during the commit flush, the order of log messages is:

    Finished committing offsets in XXX ms
    Sending message to request recording of offsets to Oracle

If you see "Finished offset commitOffsets successfully in XXX ms" instead of the first message, that indicates that there were no offsets to send to Kafka.

But generally speaking, Oracle should never flush before Kafka, so there really shouldn't be a situation (at least on Kafka Connect) where the watermark in Oracle is advanced beyond the offsets stored in the Kafka offset topics; hence why we need to know whether this is a Kafka Connect, Debezium Server, or Embedded Engine deployment.

Nevertheless, to avoid having to re-create the outbound server, you could manually adjust the offset in the Kafka offset topic to compensate, using an SCN that is available; but ideally I would suggest understanding the instability and checking the order of commit operations mentioned above.

As to your last question, the connector should fail in that situation. If you're not observing this, please open a Jira issue and attach debug logs illustrating the behavior observed.

Thanks,
Chris


Vladimir Yurev

Jan 3, 2024, 6:11:01 AM
to debezium
Hi, Chris.

We are using the Kafka Connect runtime (and version 2.4, as the topic title says, not the latest one). Currently our Kafka cluster is a bit shoddy due to hardware problems; we'll fix it in some time, but for the time being we'll have to cope with those problems on the Debezium side.

We'll try to enable debugging on the classes you mentioned, thanks for the tip; as for this comment:

> Nevertheless, to avoid having to re-create the outbound server, you could manually adjust the offset in the Kafka offset topic to compensate, using an SCN that is available; but ideally I would suggest understanding the instability and checking the order of commit operations mentioned above.

Do you mean the topic that's configured in offset.storage.topic? We tried doing that, but to no avail; any tips on where we can get the SCN to populate this topic? And does Debezium really use this topic for XStream? We've seen another one, where LCRs are stored; should we do anything about that one as well, or should a manual reset of the SCN be enough?

Another strange thing we've observed is that creating a fresh connector (i.e., same config, but different topic names) does not seem to fix the problem: we still get ORA-21560 (or problems attaching to the outbound server while the connector seems to be up and running, with no problems on the Oracle side). Can it be that some unintentional interruptions on the Kafka side leave the outbound server in a troubled state?

Chris Cranford

Jan 3, 2024, 8:23:26 AM
to debe...@googlegroups.com
Hi Vlad,

To adjust the offset topic, you need to follow the instructions in our FAQ [1].  Specifically for Oracle, you're going to locate the entry for the topic.prefix, and you want to explicitly adjust the value related to the "lcr_position".  This is the position detail we pass to XStream when attaching, to dictate where the XStream client wants to restart receiving changes from.  Fwiw, we hope to have a better system in place soon to avoid having to manually dabble in the offset topic, but for now that's the only way.
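
For illustration, the record in the offset topic has roughly this shape (the key is the standard Kafka Connect offset key, a JSON array of the connector name and the source partition; the connector name and topic prefix below are placeholders):

    key:   ["my-connector",{"server":"my_topic_prefix"}]
    value: {"transaction_id":null,"lcr_position":"<hex-encoded position>","snapshot_scn":"<scn>"}

You would republish the value with the adjusted "lcr_position" under exactly the same key.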

As to your last question, anything is certainly possible in an unstable environment, but we prefer to guard against that.  If you can provide debug logs when this happens, and perhaps even consider enabling debug for the entire io.debezium.connector.oracle package, that might help identify what's going on.

Thanks,
Chris

[1]: https://debezium.io/documentation/faq/#how_to_change_the_offsets_of_the_source_database

Vladimir Yurev

Jan 3, 2024, 9:10:59 AM
to debezium
Thank you; we'll try to do that. The only remaining question is which of the different SCN/LCR values from the outbound server we should use (i.e., PROCESSED_LOW_SCN, COMMITSCN, or some other value, like CAPTURED_SCN).

If we encounter this problem again, we'll try to send debug logs, thanks.

Chris Cranford

Jan 3, 2024, 9:22:20 AM
to debe...@googlegroups.com
Hi Vlad -

I would suggest the LOW_PROCESSED_POSITION.

Thanks
Chris

Chris Cranford

Jan 3, 2024, 9:22:44 AM
to debe...@googlegroups.com
Oops, sorry, small typo: PROCESSED_LOW_POSITION.

Vladimir Yurev

Jan 11, 2024, 3:19:43 AM
to debezium
Hello again.

Now we are facing another problem. During the night, traffic from one of our outbound servers suddenly stopped: no errors on the Oracle or Debezium side, just no more messages sent into Kafka, and the connector did not stop. Connector/outbound restarts had no effect; the Debezium process attached to the outbound server successfully, but no new data came through it. We are sure that capture on the Oracle side continues without problems, and there definitely were data changes in Oracle; we are also sure that this time it is not Kafka-related, because the Kafka logs show no problems. Restarting the connector in debug mode spams the logs with "No records available or batch size not reached yet".

What could be the cause of such a problem? Where should we start debugging? And how does batch size interact with XStream? I.e., if there were very few changes on our source, could this be the cause (does Debezium wait for changes to reach some threshold before requesting data from the source)? If so, is there a way to force Debezium to send this data (however small) so that it doesn't lock the logfiles?



By the way, we considered switching to LogMiner due to it being more transparent and easier to maintain, but:
  1. XStream capture seems to provide much faster transfer to Kafka (lag is measured in seconds vs. hours with LogMiner).
  2. LogMiner generates a lot of pressure on the source system (25-35 percent of total server load); XStream seems to be much more efficient in our case.
Both observations are probably explained by the fact that we do not capture all of the tables on the source system (thus Debezium has to filter out a lot of unnecessary events), and that the source-system load is generated in short bursts (the 6-hour nighttime processing generates 95% of the changes; during this window LogMiner chokes on our load, but XStream seems to be fine: the worst-case lag was measured in minutes and resolved itself quite quickly, while LogMiner sometimes lagged behind for 6 to 8 hours).

Vladimir Yurev

Jan 11, 2024, 3:33:21 AM
to debezium
And another question about XStream.

Let's say we have 30 tables in our outbound server. All of them are listed in the connector's include list, but the initial load shows that only 25 of them shared DDL with Debezium (due to the Debezium user not having access to those tables). Can this be the cause of the problems? We understand that we'll miss data for those 5 tables, but can there be anything else happening behind the scenes?

Vladimir Yurev

Jan 11, 2024, 10:35:19 AM
to debezium
We debugged a bit more, and it turns out our admins didn't set up the outbound server properly (only the first table out of many was registered as a captured table, and it had a very low number of updates). Sorry for all the ruckus; we really didn't expect to fail at something this obvious and thus didn't check it at first.

Chris Cranford

Jan 11, 2024, 10:40:29 AM
to debe...@googlegroups.com
Hi Vlad -

So XStream works quite a bit differently in Debezium than, say, LogMiner.  In short, we are basically a consumer of the LCRs (Logical Change Records) that GoldenGate/XStream outputs as it processes the redo logs.  When we connect to the stream, an infinite loop is started that constantly polls for changes.  If Oracle has data available, that poll returns immediately. If Oracle has no data, the poll waits exactly 1 second before returning and the loop continues.  So we don't control when the data becomes available; that's entirely within the boundary of GoldenGate and XStream.
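
A minimal sketch of that loop in Java (not the actual Debezium code; receiveNext() is a hypothetical stand-in for the XStream client call):

    import java.util.function.Consumer;

    interface ChangeSource {
        // Hypothetical stand-in for the XStream receive call: returns the next
        // LCR immediately if one is available, otherwise idles ~1 second inside
        // the Oracle client library and returns null.
        Object receiveNext();
    }

    class StreamLoop {
        static void run(ChangeSource source, Consumer<Object> dispatcher) {
            while (!Thread.currentThread().isInterrupted()) {
                Object lcr = source.receiveNext();
                if (lcr != null) {
                    dispatcher.accept(lcr); // hand the LCR to the event pipeline
                }
                // No explicit sleep here: the 1-second idle wait happens inside receiveNext().
            }
        }
    }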

What I would suggest is enabling DEBUG or TRACE logging. Then start the connector and monitor the connector logs.  If you never see the TRACE log entry "Received LCR ..." or the DEBUG log entry "Processing DDL event ..." in the connector logs, then we are simply not getting data from Oracle.  In that case, your DBA will need to investigate GoldenGate/XStream to understand why no data is available to the outbound server client.

With LogMiner, were you using "log.mining.strategy" set to "online_catalog" or using the default?  If you used the default, this can create additional load on the system if the redo logs are not appropriately sized for that setting.  Additionally, if you have a high-volume situation where your redo logs are switching at a rate higher than recommended by Oracle (which is 5-6 switches per hour), then the default will most likely not be ideal because of the overhead of recalculating the data dictionary on each log switch.  In such cases, I'd suggest using "online_catalog" instead, as you'll likely notice a substantial performance increase.
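
In connector-config terms (same format as the configs posted elsewhere in this thread):

    log.mining.strategy: online_catalog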

Chris

Chris Cranford

Jan 11, 2024, 10:47:28 AM
to debe...@googlegroups.com
Hi -

That shouldn't be an issue. If we get an LCR for a table we don't yet know about, we make every attempt to process that LCR. We do this by verifying that the table name is allowed based on your include/exclude filters, and then attempting to resolve the table's CREATE TABLE statement using the database's metadata function. If we get a CREATE TABLE statement, we generate a schema change event to satisfy the connector upon restart, and then we attempt to process the LCR like any other table change.  If the user does not have permission on the table, there is a high probability that the metadata function call will fail; in this case the event is discarded and a warning is written to the connector logs, i.e. "Table ... is not a relational table and will be skipped.".

Chris

Vladimir Yurev

Jan 11, 2024, 11:02:28 AM
to debezium
Thanks for clarifying everything; great advice as always.

As for LogMiner, we tried it both with the default settings and with online_catalog. online_catalog proved to be more efficient. We also tried your other recommendations from the blog posts, and they did help, but not enough for our load. Are we right to assume that switching log.mining.query.filter.mode to "in" can help with speed but will most likely generate more load on the source system, or will it help lower the load since we do not capture most of the tables at all? We tried it once but faced a bug with spaces and line breaks (discussed in another conversation), so we didn't really test it thoroughly.

Fakrul Islam

Jan 11, 2024, 1:57:38 PM
to debe...@googlegroups.com
Can anyone help me with how to deploy a Debezium SQL Server connector in Kubernetes?



Vladimir Yurev

Jan 12, 2024, 2:03:22 AM
to debezium
Hi.

I believe it would be more productive if you created a separate conversation for this topic. Anyway, try following the instructions from the documentation (link); Strimzi is a viable Kubernetes operator for this purpose.

Chris Cranford

Jan 15, 2024, 5:31:03 PM
to debe...@googlegroups.com
Hi -

The query filter mode set to "in" could actually help improve performance and reduce the overhead on the database.  For every log mining step that is performed, the data that is read is populated into a view called V$LOGMNR_CONTENTS, which is backed by the SGA buffers.  So anything that reduces the volume of data that needs to be pushed into this dynamic system view reduces the consumption of the SGA.  This in turn means that larger amounts of those buffers stay available for other processes in the database.  Additionally, this "in" mode also reduces the network bandwidth consumed when transferring the data set from Oracle to Debezium via JDBC, so it can actually improve the speed of consuming the data set, with smaller payloads going over the network wire.
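
In connector-config terms that would be:

    log.mining.query.filter.mode: in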

Chris

Vladimir Yurev

Jan 16, 2024, 2:54:26 PM
to debezium
Hm. Thanks; we'll take that into consideration!

Vladimir Yurev

Jul 12, 2024, 7:58:35 AM
to debezium

Hi, everyone.

Facing the problem again.

Last time we couldn't fix it for good, so we just recreated the outbound server on the Oracle side, and it worked for quite some time. But now it's broken again.

We have the following setup:
  • Debezium 2.4.0
  • Oracle 12.2.0.1.220118
  • Debezium config:
          config.action.reload: restart
          database.out.server.name: #####
          snapshot.mode: schema_only
          database.oracle.version: 12
          database.hostname: #####
          database.port: #####
          database.user: #####
          database.password: #####
          database.dbname: #####
          table.include.list: #####
          message.key.columns: #####
          database.connection.adapter: xstream
          database.tablename.case.insensitive: true
          errors.retry.timeout: 0
          errors.retry.delay.max.ms: 60000
          errors.tolerance: none
          errors.log.enable: true
          errors.log.include.messages: true
          time.precision.mode: connect
          snapshot.locking.mode: none
          poll.interval.ms: 50
          query.fetch.size: 10000
When we reconnect to XStream, we get ORA-21560: argument last_position is null, invalid, or out of range:
Producer failure (io.debezium.pipeline.ErrorHandler) [debezium-oracleconnector-#####]
oracle.streams.StreamsException: ORA-21560: argument last_position is null, invalid, or out of range

at oracle.streams.XStreamOut.XStreamOutAttachNative(Native Method)
at oracle.streams.XStreamOut.attachInternal(XStreamOut.java:373)
at oracle.streams.XStreamOut.attach(XStreamOut.java:343)
at io.debezium.connector.oracle.xstream.XstreamStreamingChangeEventSource.execute(XstreamStreamingChangeEventSource.java:119)
at io.debezium.connector.oracle.xstream.XstreamStreamingChangeEventSource.execute(XstreamStreamingChangeEventSource.java:45)
at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

We've queried ALL_XSTREAM_OUTBOUND_PROGRESS for our outbound server and got the following parameters:
[attachment: oracle_outbound.PNG, a screenshot of the view's position columns]

In Debezium, we have those parameters in topic:

{
  "transaction_id": null,
  "lcr_position": "000000dd5e3be4fd0000000100000001000000dd5e3be4f6000000010000000102",
  "snapshot_scn": "950101097352"
}

As we can see, the lcr_position in Debezium and the processed_low_position in Oracle do not match (the Debezium one is probably lower than the Oracle one, hence the error); we're the only consumer of this outbound server, so no other capture processes run on the database. On the Oracle side we see no issues with the outbound server (it was restarted once since the issue occurred); on the Debezium side, also nothing noteworthy.

Last time we tried setting lcr_position to the processed_low_position from Oracle, but the error persisted. Should we try it again? If we should, which parameters should we change (and where)?

Any suggestions on possible solutions to the problem?

Chris Cranford

Jul 12, 2024, 11:02:31 AM
to debe...@googlegroups.com
Hi -

Are the changes from XStream very infrequent compared to the changes in the database? 

I would suggest setting up `heartbeat.action.query` and `heartbeat.interval.ms` to see if that helps rectify the problem, as it could be related to stale LCR positions in the Kafka offsets.
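
A minimal sketch (the table name is a placeholder; the assumption is a dedicated heartbeat table the connector user can insert into, and which is captured so its changes flow through the outbound server):

    heartbeat.interval.ms: 60000
    heartbeat.action.query: INSERT INTO debezium_heartbeat (ts) VALUES (SYSTIMESTAMP)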

Thanks,
Chris

Vladimir Yurev

Jul 14, 2024, 8:21:45 PM
to debe...@googlegroups.com
Hi, and thanks for your insight.

This might help prevent some of the problems, although probably not in this case: at least one of the tables in the outbound server updates frequently.

But let's say it has already occurred (it can be provoked by human error; simply restarting the outbound server moves the capture positions on the Oracle side a bit and breaks the capture process; it can also happen due to hardware problems on either side). How do we recover from it? Is there a way to attach to the outbound server with Debezium again, without recreating it?


Chris Cranford

Jul 15, 2024, 9:01:35 AM
to debe...@googlegroups.com
Hi -

So if restarting the outbound server has such effects, there isn't much we can do there.  In this case you would follow the procedure for manually setting the offsets, assuming you have a way to identify the LCR position from within Oracle.  This is an encoded value, stored as a series of bytes, that identifies the read position on the outbound server we expect to start from.  If you are unable to determine the LCR position, your recovery options are limited to restarting the connector and retaking the snapshot, or skipping the snapshot and trying to resolve the data loss using incremental snapshots.
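
For example, a minimal JDBC sketch for reading the outbound server's processed low position and printing it as hex (connection details and the server name are placeholders; requires the Oracle JDBC driver on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OutboundPosition {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//db-host:1521/ORCLPDB", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT PROCESSED_LOW_POSITION FROM ALL_XSTREAM_OUTBOUND_PROGRESS "
                         + "WHERE SERVER_NAME = ?")) {
                ps.setString(1, "MY_OUTBOUND"); // placeholder outbound server name
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        // The position is a RAW value; render it as a hex string
                        // comparable to the "lcr_position" field in the offsets.
                        byte[] position = rs.getBytes(1);
                        StringBuilder hex = new StringBuilder();
                        for (byte b : position) {
                            hex.append(String.format("%02x", b & 0xff));
                        }
                        System.out.println("processed_low_position = " + hex);
                    }
                }
            }
        }
    }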

In the case of restarting Debezium, this should never be an issue as long as the state on the outbound server is perpetually preserved.  If Debezium is stopped for any reason (manual or failure), it should restart where it left off.  If that is not the case, please report a bug with the steps to reproduce, as that's not expected behavior.

Thanks,
Chris

Vladimir Yurev

Jul 15, 2024, 10:11:07 AM
to debe...@googlegroups.com
Hi.

> In this case you would follow the procedure for manually setting the offsets, assuming you have a way to identify the LCR position from within Oracle.

Currently we are trying to determine this position by querying ALL_XSTREAM_OUTBOUND_PROGRESS. Is this correct? And if it is, what might be wrong with manually setting this in the Debezium topics? As I've mentioned before, we tried updating the offset topic (in our case, prod-debezium-#####-offsets) by publishing a new message into it with the new LCR position and SCN from ALL_XSTREAM_OUTBOUND_PROGRESS (btw, is the LCR position case-sensitive?); is this wrong? Do we need to do anything else?


Daniel F

Jul 18, 2024, 11:10:56 AM
to debezium
Adding to this thread, we're in a very similar place.
We upgraded from 1.9.3.Final to 2.x (we're currently on 2.5.0.Final but have tried most of the 2.x releases to see if any of them work for us) and have started to see this issue pop up since then.
 
Some slight differences:
  • Oracle 19.19 (instead of 12)
  • Debezium 2.5.0.Final (instead of 2.4.0)
 
Our current workarounds are:
1) Using the offset editor.  This doesn't always work but is the simplest workaround.
2) Recreating the xout server.  This is a bit more involved as we don't have direct access to our DBs and instead have to go through an internal ticket system.
 
What can cause this for us:
1) As others mentioned, restarting the project that uses Debezium.  Our workaround for this has been to stop, wait some amount of time, and restart.  Sometimes it breaks the connection when reconnecting regardless.
2) A "snapshot too old" error.  When we do a snapshot, sometimes the size is too large and can kill the job.  This could be related to #1, where a restart is required and in the process the offset file gets messed up.
As a sidebar: when setting the SCN for the offset file (using the offset editor), should we use the "PROCESSED_LOW_SCN" value to restart the server?