Hi all -
The Oracle connector added support for capturing changes for CLOB / BLOB columns in Debezium 1.6. While this support is incubating, we've been looking at some ways to improve and mature this feature. In DBZ-4366 [1], some gaps have been identified in how the connector currently attempts to reconcile a transaction that operates on LOB columns and how we merge the synthetic LOB events with the logical DML events performed on the same row. This reconciliation step's goal is to emit a single logical change event that consists of 1 or many redo entries mined from LogMiner for a single row when LOB columns are involved.
Unfortunately, one of the shortcomings identified in DBZ-4366 is that the reconcile step makes some assumptions about how the LOB data is manipulated. In particular, it does not take into account situations where the DBMS_LOB package is used to manipulate the LOB-based columns using WRITEAPPEND, WRITE, or ERASE in certain scenarios which leads to the LOB data being inconsistently represented in the change event. We'd like to solve this problem with your help / feedback.
So far we've identified 4 ideas and we'd like to know if perhaps we've overlooked some other obvious choices.
(a) Emit LOB changes from the redo logs to a separate topic from DML events
(b) Emit LOB changes from the redo logs to the same topic as DML events but as different event types
(c) Provide an SMT that allows the re-selection of a row out of bands if a specific indicator is detected on the emitted event
(d) Implement (c) baked directly into the Oracle connector, no SMT required.
Option (a) is really the least ideal of them all. This puts the ownership of combining multiple topics and guaranteeing the chronological order of events across those topics on the consumer. While this certainly makes it easy for us, we certainly prefer a more suitable approach because we know we can do better; and we should :).
Option (b) is an improved version of (a). By using the same topic and emitting separate events (insert, update, delete, lob-delta), the chronological order of events is maintained automatically and consumption of these events become significantly easier to manage in the consumer pipeline.
In either (a) or (b), the emitted event's value might consist of something like:
{
"op": "ld",
"action": "w",
"offset": 10,
"length": 3,
"bytes": "..."
}
The op-code "ld" stands for LOB delta. The action would consist of "w" for WRITE and "e" for ERASE. The offset/length represent the position/size in the LOB column that is being changed with the bytes simply being the raw character/bytes for the delta.
Option (c) takes a slightly different approach to the reconciliation and instead relies on using some marker indicator on the incoming change event to determine whether we need to re-select the row or allow the incoming event to be passed through as is. For changes to tables that don't have LOB columns, the events would simply be passed through without modification. For changes to tables that do have LOB columns, instead of mining the LOB data from the transaction logs we would instead simply mark the emitted event to be re-selected and the SMT would issue an out of bands query on the row, building a new read event that would be emitted in its place. The benefit is that the Oracle connector wouldn't need to buffer any LOB data, keeping its memory footprint lower and would allow for users to use JDBC or the DBMS_LOB package to manipulate the LOB columns in any desired way; while emitting consistent events for those rows. Such an SMT could have cross-connector benefits by even allowing a PostgreSQL user to re-select toasted columns in the same way we would LOB columns for Oracle.
Option (d) builds on top of (c) by implementing the SMT logic directly in the connector. The idea is that there could be a "lob.handling.mode" connector option that has a value-set of {"merged", "reselect", ...}. The "merged" option would tell the connector to use its existing functionality. We would still need to address any potential limitations of this mode either by fixing them (if possible) or documenting them. The "reselect" option would the baking of the SMT option (c) idea directly into the connector. The benefit to baking this directly into the connector is that we could make some assumptions about how we mine redo entries, most importantly we could skip doing the re-mining entirely. We basically would not care about the LOB changes as they're materialized by Oracle LogMiner and we would rely on our relational model to know if a table has a LOB column and if so, trigger the re-selection of the row like we do for a snapshot.
--
You received this message because you are subscribed to the Google Groups "debezium" group.
To unsubscribe from this group and stop receiving emails from it, send an email to debezium+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/d29c4db3-8c7d-4c3a-88ef-af63b9ffbdbfn%40googlegroups.com.