Debezium Server - Support for non messaging infrastructure-type sinks

Nathan Smit

Feb 16, 2022, 4:39:35 AM
to debezium
Hi there,

This is a bit of a big-picture question about Debezium Server. I was wondering what the view is on expanding the Debezium Server sinks in future to include targets that wouldn't really be considered messaging infrastructure but that do support streaming ingestion.

Examples I was thinking of would be sinks like Google BigQuery and Amazon Redshift. Would adding sinks like these be considered outside the scope of what you're trying to do with Debezium Server? I guess the closest thing to this currently would be the Redis sink, but I'd probably still put that in a different category.

Gunnar Morling

Feb 16, 2022, 5:25:13 AM
to debezium
Hey Nathan,

That's a great question. So indeed the current focus of Debezium Server is on messaging infrastructure, like Kinesis, Pulsar, Pub/Sub, etc. The Redis sink is in the same camp btw., as this targets Redis *Streams*, which is an append-only log structure, and not Redis, the K/V store.

The reasoning for this is that pushing change events to such messaging infra provides a great degree of flexibility and optionality, as it allows for 1:n integration of one source (Debezium) with multiple sinks (e.g. BigQuery and Elasticsearch), as opposed to 1:1 integrations of one source and one sink. There's one exception I'm aware of, and that's the community-maintained sink for Apache Iceberg [1]. This definitely is more similar to what you have in mind with BigQuery or Redshift.

I wouldn't say we rule out widening the scope of Debezium Server in that direction, if there's reasonable interest. My question would be why you'd prefer such a 1:1 integration of one source and one sink? One project to consider in this space is Apache Camel [2], which comes with Debezium-based source connectors and sink connectors for a large number of systems, including BQ and Redshift. The target audience is somewhat different from that of Kafka Connect / Debezium Server though; setting up an end-to-end pipeline will typically require a small amount of coding.

Hth,

--Gunnar

Nathan Smit

Feb 17, 2022, 6:50:38 AM
to debezium
Thanks for the response! Currently, we're doing a pilot project with Debezium where our pipeline is based heavily on Google's CDC parent Dataflow project, which I'm sure you're aware of.

Our source is Oracle, though, so we are doing Oracle -> Debezium Server -> Pub/Sub (all messages pushed to a single topic) -> Dataflow -> BigQuery.

We're also not interested in having all of the transactions, so, similar to Google's project, our Dataflow process takes all the CDC output and merges it into a target table so that we only have the most recent version of a particular record.
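
To make the merge step concrete, here is a minimal Python sketch of the idea, not the actual Dataflow job. It assumes events shaped like the Debezium change event envelope (`op`, `before`, `after`, `ts_ms`). The function name `merge_latest`, the `id` key column, and the sample rows are hypothetical, and ordering by `ts_ms` is a simplification; a real pipeline would more likely order by a source position such as the Oracle SCN.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Row = Dict[str, Any]


def merge_latest(events: Iterable[dict], key_field: str = "id") -> Dict[Any, Row]:
    """Collapse change events into the most recent row per key.

    Assumes each event carries Debezium-style "op", "before", "after"
    and "ts_ms" fields; key_field is a hypothetical primary key column.
    """
    # key -> (timestamp of last applied event, current row or None if deleted)
    state: Dict[Any, Tuple[int, Optional[Row]]] = {}
    for ev in events:
        is_delete = ev["op"] == "d"
        # For deletes the row image (and so the key) is only in "before".
        row = ev["before"] if is_delete else ev["after"]
        key = row[key_field]
        ts = ev["ts_ms"]
        seen = state.get(key)
        if seen is not None and seen[0] > ts:
            continue  # an older event arrived late; keep the newer state
        state[key] = (ts, None if is_delete else ev["after"])
    # Drop keys whose latest event was a delete.
    return {k: r for k, (_, r) in state.items() if r is not None}


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}, "ts_ms": 100},
    {"op": "c", "before": None, "after": {"id": 2, "name": "x"}, "ts_ms": 110},
    {"op": "u", "before": {"id": 1, "name": "a"},
     "after": {"id": 1, "name": "b"}, "ts_ms": 120},
    {"op": "d", "before": {"id": 2, "name": "x"}, "after": None, "ts_ms": 130},
    # Late, out-of-order event that must not overwrite the newer update.
    {"op": "u", "before": {"id": 1, "name": "a"},
     "after": {"id": 1, "name": "stale"}, "ts_ms": 105},
]

current = merge_latest(events)
print(current)
```

Only the surviving row for key 1 remains in `current`; key 2 disappears because its latest event is a delete.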

This works fine! However, we've discussed internally that it'd be useful for near real-time analytics use cases to have a process which is just Oracle -> Debezium Server -> BigQuery (via streaming inserts). We can then clean up the CDC output in views for end users and not have all of the overhead of running Dataflow.

We could do something like Oracle -> Debezium Oracle connector -> Google BigQuery Kafka Connect plugin, but Kafka is not widely used in our organisation, so it's cool to have a Kafka-less solution.

Apache Camel looks cool, but it looks like there is no Oracle component currently. The Apache Iceberg use case is also interesting. I think we'd be willing to do a similar thing for BigQuery to see if it'd gain any traction in the community.