Hi everyone,
We have a Kafka cluster on AWS EC2 machines with
- 12 brokers, each with
  - 12 cores
  - 93 GB RAM
  - a single ephemeral 6.2 TB disk
We started with 10 brokers, which reached about 80% total CPU utilisation (50% us, 20% sy, 10% si) and a load average of ~20. Most of the CPU usage comes from the Kafka broker process.
So we added 2 more brokers.
The 2 new brokers receive the same amount of data per broker as the 10 old brokers, but they are much less utilised in terms of CPU (the total of us, sy and si is only 25%). Here too, most of the CPU usage comes from the Kafka broker process.
The only difference between the old and new brokers is that
- the old brokers have a total of 250 leader partitions (and 500 partitions in total)
- the new brokers have 125 leaders and 250 partitions (due to a replication factor of 2).
Old and new brokers receive the same amount of data because the topics assigned to the new brokers contain almost all the messages arriving at the cluster. The other partitions, which aren't assigned to the new brokers, carry very little traffic.
dstat on both new and old brokers shows ~140 MB network recv and ~200 MB network send.
However, it also shows that the new brokers have 100k interrupts and 70k context switches, while the old brokers have 200k interrupts and 140k context switches.
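As a rough sanity check on the figures in this post, here is a small sketch that divides each metric by the number of partitions on a broker. It assumes the partition counts (500 and 250) are per-broker figures, and the dictionary keys and the helper name `per_partition` are just illustrative. It only compares the readings quoted in this post and does not prove a cause.

```python
# Readings quoted in this post, treated as per-broker values (an assumption).
old = {"partitions": 500, "interrupts": 200_000, "ctx_switches": 140_000, "cpu_pct": 80}
new = {"partitions": 250, "interrupts": 100_000, "ctx_switches": 70_000, "cpu_pct": 25}


def per_partition(broker):
    """Divide every metric by the number of partitions hosted on the broker."""
    n = broker["partitions"]
    return {k: v / n for k, v in broker.items() if k != "partitions"}


old_pp = per_partition(old)
new_pp = per_partition(new)

for metric in old_pp:
    print(f"{metric}: old={old_pp[metric]:.2f} new={new_pp[metric]:.2f} per partition")
```

With these numbers, interrupts and context switches per partition come out the same on old and new brokers (400 and 280 respectively), while CPU per partition is still higher on the old brokers (0.16% vs 0.10%).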
Has anyone encountered this kind of issue, where new brokers receive almost the same amount of events but their CPU usage differs significantly? Do you think it's related to the difference in the number of partitions assigned to them?
Thanks in advance,
Elad