Resource management on Kubernetes - split snapshot into a task?


Adil Karim

Nov 17, 2023, 7:51:13 AM
to debezium
Hi! 

First off, I love Debezium, it's really the only open-source solution that has been able to ship our 800GB-ish database into Kafka and then into BigQuery, so thank you for that.

I'm running Debezium on Kafka Connect and Redpanda, all on Kubernetes. Now that the system is set up and stable and we're not running so many snapshots, I've noticed that CPU usage drops from 3-4 cores during snapshotting to 70m-80m during streaming. We keep the 3-core request on our nodes to make sure we always have enough resources to run a snapshot when we need to, but the streaming CPU usage is a fraction of the snapshot usage. Memory usage stays roughly the same, although it does spike a few GB when we're running a snapshot.

This is a tremendous waste of CPU, and as we scale we're worried it will only get worse. Are there any existing strategies to deal with this? If not, are there any plans to split the streaming and snapshotting processes, perhaps into separate tasks or even separate connectors, so we can scale up and down as needed? It's already possible to listen on Kafka topics and spin up K8s resources using the HPA, so we could spin up additional resources on demand when a snapshot signal is received.
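As a concrete sketch of the HPA idea above (assuming some external-metrics adapter already exposes a Kafka-derived metric; the metric, deployment, and object names here are hypothetical), an `autoscaling/v2` HPA keyed on pending snapshot signals might look like:

```yaml
# Hypothetical: assumes an external-metrics adapter already exposes
# a "snapshot_signals_pending" metric derived from a Kafka topic.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-connect-snapshot
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-connect        # placeholder name
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: External
    external:
      metric:
        name: snapshot_signals_pending
      target:
        type: AverageValue
        averageValue: "1"      # scale out while any signal is pending
```

Note that the HPA only changes the replica count; resizing the CPU requests of an already-running pod would still need the VPA or a manual patch.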

Thanks!

Adil

jiri.p...@gmail.com

Nov 20, 2023, 8:52:12 AM
to debezium
Hi,

did you think about Burstable QoS? IMHO it would fit this use case. WDYT?
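For context, a pod gets the Burstable QoS class when at least one of its containers sets requests lower than its limits. A minimal Connect container spec along those lines (the numbers are illustrative, loosely based on the usage figures quoted earlier in the thread) could be:

```yaml
# Illustrative only: low requests keep the scheduling footprint small
# during streaming, while high limits allow bursting during snapshots.
resources:
  requests:
    cpu: "250m"      # streaming usage plus some headroom
    memory: "2Gi"
  limits:
    cpu: "4"         # roughly the snapshot peak mentioned above
    memory: "8Gi"
```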

J.

Adil Karim

Jan 2, 2024, 1:06:06 PM
to debe...@googlegroups.com
Hi J,

QoS is a bit of a mess in Kubernetes: you can't specify the QoS class manually, it's inferred from your resource requests/limits. If I could set it manually, I would set the memory requests very low and the limits very high, so that when a snapshot kicks in and memory pressure triggers on the node, the non-Connect pods get evicted. But that won't happen here: the Connect worker, being the pod furthest over its requests, is the one most likely to get evicted first.

Right now my solution is to manually set the resource requests/limits before/after snapshots - not ideal!
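For anyone wanting to script that manual resize, a sketch with `kubectl patch` (the deployment and container names are placeholders; one caveat is that patching the pod template triggers a rolling restart of the worker) could look like:

```shell
# Hypothetical deployment name; adjust to your cluster.
# Before starting a snapshot: raise the CPU request.
kubectl patch deployment kafka-connect --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/requests/cpu",
   "value": "3"}
]'

# After the snapshot completes: drop it back down.
kubectl patch deployment kafka-connect --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/requests/cpu",
   "value": "100m"}
]'
```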

Adil


Adil Karim
Co-Founder
LIX
t: +44 113 868 3463
e: ad...@lix-it.com | w: lix-it.com
a: 98 Bramley Rd, London, N14 4HS, UK

--
You received this message because you are subscribed to a topic in the Google Groups "debezium" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/debezium/NiM5ib4rAbY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to debezium+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/debezium/9baab822-ecb0-4cc8-ba68-63ed87eeac19n%40googlegroups.com.

jiri.p...@gmail.com

Jan 3, 2024, 1:21:39 AM
to debezium
Hi,

that's a bit unfortunate. We don't have any plans to split the connectors into separate tasks for streaming and snapshotting, so your best bet for now is to listen for the snapshot start and end signals and have a tool that modifies the limits according to the phase.

J.
