Hi,
We're using Kafka (Strimzi) in k8s with Debezium connectors using XStream to capture data changes. We only capture data changes, not any initial snapshots. We have multiple deployments; in some of them we are processing over 50 DBs with hundreds of schemas each.
XStream Out is running on the source DBs, with a configuration that mostly follows the Debezium documentation. We only use Debezium connectors with XStream.
The issue we observed happens mostly when the Kafka Connect pods crash. For some time the most common trigger was OOM situations (if anyone is interested, I can go into more detail about what we observed regarding JVM heap size vs. native memory with XStream), but sometimes even a pod restart initiated by Connect config changes in Strimzi triggers it.
The issue is that some of the Debezium connectors are not able to restart on the new Connect pod. They fail with the error: "ORA-21560 argument at position ... is null, invalid, or out of range"
We have found two workarounds for this error situation:
1. drop the existing connector and create a new one with a different name
2. clear out the LCR offset from the offset topic and restart the connector
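For reference, here is a minimal sketch of what clearing a stored offset could look like, assuming the whole offset record is removed with a tombstone (a record with a null value) rather than only the LCR field being edited. It also assumes that Kafka Connect keys source offsets as a JSON array of the connector name and the source partition, and that the Debezium source partition is {"server": <logical name>}. The connector and server names are made up for illustration.

```python
import json


def offset_tombstone(connector_name, server_name):
    """Build the (key, value) pair that would clear a connector's stored offset.

    Assumption: the offset key is the compact JSON array
    [connector name, {"server": logical server name}].
    """
    key = json.dumps(
        [connector_name, {"server": server_name}],
        separators=(",", ":"),  # compact form, no spaces between tokens
    ).encode("utf-8")
    # A null value marks the key as deleted (tombstone).
    return key, None


# Hypothetical names, for illustration only.
key, value = offset_tombstone("oracle-xstream-db01", "db01")
print(key.decode("utf-8"), value)
```

The key has to match the bytes of the existing offset record exactly, and the tombstone has to be produced to the same partition of the offset topic, so it is safer to copy the key from the existing record than to build it by hand.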
My first concern is that either of these workarounds might result in data change gaps in our streaming. Is that correct? We can deal with duplicated data change events, but recovering from gaps is much more expensive for us. Specifically, for the second workaround: is it safe to clear the value in the offset topic without causing gaps in the data change events?
My other concern is that we haven't automated these workarounds, so our slow reaction time makes the XStream Out capture lose the database logs ...
The reason I'm writing to you is to see if there are better solutions, other than us automating one of these two workarounds.
Our understanding of the cause is this: the connector advances the watermark on XStream Out as it processes LCRs, but the offset topic is only updated/flushed periodically (by default on a 1-minute interval).
If the process is interrupted before the connector is able to update the topic, the restarted connector will try to reconnect to XStream Out using an LCR position lower than the last one set as the watermark. That is an error situation in XStream Out, as explained here:
"The client application can pass a processed low position to the outbound
server that is less than the outbound server's processed low position.
In this case, the outbound server raises an error."
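To make the failure condition concrete, here is a small sketch of the comparison as we understand it. It assumes positions can be compared as raw bytes, and the hex values are invented for illustration; they are not real LCR positions.

```python
def can_attach(stored_position_hex, server_low_position_hex):
    """Return True if the stored position is not behind the server's
    processed low position (assumption: positions compare as raw bytes)."""
    stored = bytes.fromhex(stored_position_hex)
    server_low = bytes.fromhex(server_low_position_hex)
    return stored >= server_low


# Illustrative values only.
last_flushed = "000000000010"    # last position written to the offset topic
last_watermark = "000000000025"  # position confirmed to the server, never flushed

# After a crash the connector restarts from last_flushed, which is behind
# the watermark, so the attach is rejected.
print(can_attach(last_flushed, last_watermark))
```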
Are there any configuration options we could use to fix this behavior? Or could there be a configuration option that makes the connector discard the bad starting LCR position and use a null value instead if the error ORA-21560 is raised when attaching to XStream Out?
Or some other mechanism to avoid manual intervention in the offset topic?
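To illustrate the kind of fallback we have in mind, here is a rough sketch. Nothing in it is an existing Debezium or Oracle API: AttachError, attach_with_fallback and fake_attach are hypothetical stand-ins, and the sketch assumes that attaching with a null position lets the outbound server pick its own starting position.

```python
class AttachError(Exception):
    """Hypothetical stand-in for the error raised when attaching fails."""

    def __init__(self, code):
        super().__init__(f"ORA-{code}")
        self.code = code


def attach_with_fallback(attach, stored_position):
    """Try the stored position first; on ORA-21560 retry with None so the
    outbound server chooses the starting position itself."""
    try:
        return attach(stored_position)
    except AttachError as err:
        if err.code != 21560 or stored_position is None:
            raise  # unrelated error, or nothing left to fall back to
        return attach(None)


def fake_attach(position, server_low=b"\x25"):
    """Toy outbound server: rejects positions behind its low position."""
    if position is not None and position < server_low:
        raise AttachError(21560)
    return server_low if position is None else position


# A stale stored position (0x10) is rejected, then the retry with None
# resumes from the server's own low position (0x25).
start = attach_with_fallback(fake_attach, b"\x10")
print(start.hex())
```

With such a fallback we would expect some duplicated events after a restart but, if our understanding is right, no gaps.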
Thank you!