event replay

Scott Ferguson

Jan 9, 2017, 9:25:28 AM
to debezium
Is there any ability to replay SQL rows from a specific point in time? E.g., everything newer than <date>

Thanks,
Scott

Randall Hauch

Jan 9, 2017, 9:47:39 AM
to debezium
First of all, there's really no way to start capturing events from the database starting at <date>. The connector can either begin capturing CDC events by starting with a consistent snapshot of the database (which could contain state modified before or after <date>), or by simply starting at the current timestamp.

On the other hand, if you're talking about consumers reading CDC events that represent changes newer than <date>, there are a couple of potential options as long as <date> is after the time at which the connector completed its snapshot.

If the connector had been capturing individual events prior to <date>, then it might be possible with Kafka 0.10 or later to use Kafka's new feature that records, for each message, the timestamp at which it was created (captured, in Debezium's case) or logged. In this case, a Kafka 0.10 consumer can ask to start consuming events at <date>, and the broker efficiently finds that starting point. Note that this is obviously not the same thing as the timestamp used by the database to perform operations, so some filtering would have to be done by the consumer until it sees the desired timestamp in the CDC event messages.
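A sketch of what that looks like with a 0.10+ client. The broker-side seek (shown in comments) assumes the kafka-python client and an illustrative topic name; the `ts_ms` field is the database-side timestamp Debezium places in the event payload, so the consumer still filters on it after the seek:

```python
from datetime import datetime, timezone

def to_epoch_ms(dt):
    """Kafka's timestamp-based lookup APIs take milliseconds since the epoch."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)

def after_cutoff(event, cutoff_ms):
    """True if the database performed this change at or after the cutoff.
    Assumes the event carries Debezium's source timestamp as `ts_ms`."""
    return event.get("ts_ms", 0) >= cutoff_ms

# With kafka-python (assumed), the broker-side seek would look roughly like:
#   from kafka import KafkaConsumer, TopicPartition
#   consumer = KafkaConsumer(bootstrap_servers="...")
#   tp = TopicPartition("dbserver.inventory.customers", 0)  # illustrative topic
#   consumer.assign([tp])
#   offsets = consumer.offsets_for_times({tp: to_epoch_ms(cutoff)})
#   consumer.seek(tp, offsets[tp].offset)
#   # ...then keep applying after_cutoff() to each decoded message.
```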

Alternatively, if the connector had been capturing individual events prior to <date>, then the consumer could just start at the beginning and ignore all CDC events that the database performed prior to <date>. This is easier, but could be time-consuming just to find the first CDC event at or later than <date>. This would work with all versions of Kafka and Kafka consumer clients.
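This second approach is just a client-side filter over the whole stream; a minimal sketch, again assuming the database timestamp is available as `ts_ms` in each decoded message:

```python
def replay_since(events, cutoff_ms):
    """Yield only the CDC events the database performed at or after the cutoff.

    `events` is any iterable of decoded Debezium messages. The entire topic is
    scanned from the beginning, which is what makes this slow but compatible
    with any Kafka broker or client version."""
    for event in events:
        if event.get("ts_ms", 0) >= cutoff_ms:
            yield event
```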

I hope this helps.

Randall

Scott Ferguson

Jan 9, 2017, 10:11:07 AM
to debezium
I probably should've mentioned the motivation up front, and that this may be AWS RDS-specific.

AWS RDS aggressively cleans up its binlogs. You can set binlog retention to at most one week. If Debezium is unavailable for longer than the retention period, you begin to lose data. Having the ability to replay from a point in time would allow recovery of that data. Having said that, it sounds like there isn't an easy answer.

Btw, Debezium being unavailable for more than a week sounds extremely unlikely, but having a backup plan ensures it'll never happen.

Thanks again,
Scott

Randall Hauch

Jan 9, 2017, 10:15:34 AM
to debezium
Yeah, relying upon the binlogs means that Debezium has to keep up and process all events in the binlogs before those events are purged. Once purging happens, you pretty much have to perform another snapshot, but the connector does not do this automatically. I don't have a good solution, but would be interested in any ideas people have. Pull requests are always welcome! :-)

Scott Ferguson

Jan 9, 2017, 1:58:47 PM
to debe...@googlegroups.com
One approach:
* debezium starts up
* note the binlog position
* capture current schema
* perform a "SELECT * FROM <important table(s)>"
* push the resulting rows into the normal path of data changes
* resume normal processing from the previously noted binlog position

This could also allow replay from a point-in-time with the addition of a "WHERE date_field >= '<awesome date>'" clause.
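The snapshot query in the steps above might be built like this (a sketch only; the table and column names are illustrative, and real code would need proper identifier quoting and parameter binding rather than string interpolation):

```python
def snapshot_query(table, date_column=None, since=None):
    """Build the snapshot SELECT described above.

    With no date filter this is a full-table snapshot; supplying a date
    column and cutoff turns it into the point-in-time replay variant.
    No SQL escaping is done here, so this is illustrative only."""
    query = f"SELECT * FROM {table}"
    if date_column is not None and since is not None:
        query += f" WHERE {date_column} >= '{since}'"
    return query
```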

I'd love to have time to do a PR (and also help with the AWS RDS support). It's unlikely to happen in the next month though :(

--
You received this message because you are subscribed to a topic in the Google Groups "debezium" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/debezium/kBDtV8Z24E4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to debezium+unsubscribe@googlegroups.com.
To post to this group, send email to debe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/6e65de1c-78be-48ba-b0e2-25493cf8f1c7%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Emmanuel Bernard

Jan 10, 2017, 4:19:04 AM
to debe...@googlegroups.com
Thinking out loud on the subject at hand.

I think the filter by a specific date should be considered as a partial snapshot mechanism.
Same logic as the full snapshot as far as capturing state and then resuming change events from the binlog.
The big difference would be that you cannot ignore the state before the partial snapshot: you have to go back to the initial snapshot to do a history replay.

It also arguably makes a cross table stream merge pretty hellish as you might have lost change events on one or more tables. Likely triggering a full rebuild.

Does Debezium send a global snapshot begin/end event, and would it matter in a partial snapshot?

From a user perspective, we could declare normal tables as they are (full snapshot only), and tables with update-date columns as allowing partial snapshots. Upon restart, read the log for the last binlog position and last date recorded, and automate a snapshot (partial or full) if there is a gap.
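The per-table policy above could be reduced to a small decision function (a sketch with hypothetical names; no such hook exists in the connector today):

```python
def recovery_plan(binlog_position_available, has_update_date_column):
    """Decide how a table should be recovered after a restart.

    - binlog still has our position: just resume streaming;
    - gap, but the table has an update-date column: partial snapshot,
      re-selecting rows modified since the last date recorded;
    - gap and no such column: only a full snapshot can close it."""
    if binlog_position_available:
        return "resume"
    if has_update_date_column:
        return "partial_snapshot"
    return "full_snapshot"
```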

It feels a bit fragile.

Emmanuel

Randall Hauch

Jan 11, 2017, 9:45:22 AM
to debezium
I should clarify one thing that I forgot to mention earlier.

The connector is currently able to automatically perform a consistent snapshot when it restarts, should MySQL no longer have the binlog position the connector last used (e.g., Debezium was down for longer than MySQL retained the binlog), but you do need to change the connector's snapshot.mode configuration property from the default of initial to when_needed. This snapshot is consistent and includes all data, but is perhaps not the most intelligent behavior if we knew more about the schema, such as which tables have last-modified columns, or if we knew only a subset of the tables needed to be included.
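For example, a minimal fragment of the MySQL connector's registration config with that property changed (connection details elided; only snapshot.mode differs from a typical setup):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "snapshot.mode": "when_needed"
  }
}
```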

Randall