Flink parallelism causing uneven TaskManager utilization


Arjun Sivakumar

Sep 4, 2025, 1:56:29 AM
to Nussknacker

I am deploying a scenario from Nussknacker to Flink where the source is Kafka and the sink sends requests to an HTTP endpoint. The setup has 2 Kafka broker nodes and 2 TaskManagers, one with 6 slots and the other with 5 slots.

I ran the job with different parallelism configurations, and I noticed unexpected load distribution across TaskManagers:

  • Case 1: Job Parallelism = 5

Observation: Only one TaskManager slot is utilized, and the entire load runs on a single TaskManager.

  • Case 2: Job Parallelism = 8

Observation: Two TaskManager slots are utilized, but the entire load is still handled by a single TaskManager.

  • Case 3:

Observation: Again, only two TaskManager slots are utilized, and the load is concentrated on a single TaskManager.

However, when I set job parallelism = 6 with 3 slots in each TaskManager, the load distributes properly across both TaskManagers.

Question:
Why is the load not evenly distributed across TaskManagers? Is this related to the number of Kafka partitions, operator chaining in Flink/Nussknacker, or some scheduling limitation? What’s the recommended configuration to ensure even distribution and better throughput in this setup?

Thanks in Advance,
Arjun S

Arjun Sivakumar

Sep 8, 2025, 1:26:58 AM
to Nussknacker
Hi team,

  Could anyone please assist with this? Your support would be greatly appreciated.

Thanks and Regards,
Arjun S  

Arkadiusz Burdach

Sep 8, 2025, 10:11:08 PM
to Arjun Sivakumar, Nussknacker, enter...@nussknacker.io
Hi,

This looks like a typical problem with Flink's slot distribution.
Try changing the Flink configuration option "taskmanager.load-balance.mode" from the default NONE to SLOTS. In older Flink versions, this option was called "cluster.evenly-spread-out-slots".
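For reference, a minimal flink-conf.yaml sketch of this change (assuming Flink 1.19+ for the new option name; this is a cluster-level setting, so it typically requires a cluster restart to take effect):

```yaml
# flink-conf.yaml -- spread slots across TaskManagers instead of filling one TM up first
taskmanager.load-balance.mode: SLOTS

# On older Flink versions, the legacy boolean option is used instead:
# cluster.evenly-spread-out-slots: true
```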


> Is this related to the number of Kafka partitions, operator chaining in Flink/Nussknacker, or some scheduling limitation?
The number of partitions should be greater than or equal to the job parallelism.
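(A quick way to verify the partition count is the standard kafka-topics CLI; the topic name and broker address below are placeholders for your own setup:)

```sh
# Hypothetical topic/broker names -- substitute your own.
# "PartitionCount" in the output should be >= the job parallelism.
kafka-topics.sh --describe \
  --topic my-input-topic \
  --bootstrap-server broker1:9092
```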

Arek Burdach
--


Arjun Sivakumar

Sep 9, 2025, 4:03:13 PM
to Nussknacker
Hi team,

Thanks for your response.

I’m facing an issue with load distribution in my Flink setup.  

- I have 2 TaskManagers (TM1 and TM2).  
  - TM1: 5 slots  
  - TM2: 6 slots  
  - Total: 11 slots  
- Kafka cluster: 2 brokers  
- I have configured the property `taskmanager.load-balance.mode`.  

Observations:
- When the job parallelism is 5, the slots are divided, and during load testing, the load is distributed equally across both TaskManagers.
- When the job parallelism is 10, the slots are divided, and during load testing, the load is distributed equally across both TaskManagers.  
- When the job parallelism is 8 (4 slots allocated to each TM), the load goes **only to one TaskManager** (either TM1 or TM2).  
- I also tried changing the slot configuration to make it **10 slots total (5 per TM)** and also **8 slots total (4 per TM)**, but with parallelism 8, the same issue persists—load is directed only to one TaskManager.  

Question:
Why is the load not being balanced across both TaskManagers when job parallelism is 8?  
Is there any configuration I might be missing?  

Thanks in advance for your help!


Arjun Sivakumar

Sep 10, 2025, 12:34:39 AM
to Nussknacker
Hi team,

  Could anyone please assist with this? Your support would be greatly appreciated.

Thanks and Regards,
Arjun S  

Arjun Sivakumar

Sep 11, 2025, 2:04:28 AM
to Nussknacker
Hi team,

  Any assistance on this would be greatly appreciated.

Thanks in Advance.  

Arkadiusz Burdach

Sep 11, 2025, 4:26:21 AM
to Joice Jacob, Nussknacker, enter...@nussknacker.io
Hi,

On 9.09.2025 17:12, Joice Jacob wrote:
> Hi all, I’m facing an issue with load distribution in my Flink setup.
> - I have **2 TaskManagers** (TM1 and TM2): TM1: 5 slots, TM2: 6 slots, Total: 11 slots
> - Kafka cluster: **2 brokers**
> - I have configured the property `taskmanager.load-balance.mode`.
> Observations:
> - When the job parallelism is 5, the slots are divided, and during load testing, the load is distributed equally across both TaskManagers.
> - When the job parallelism is 10, the slots are divided, and during load testing, the load is distributed equally across both TaskManagers.
> - When the job parallelism is 8 (4 slots allocated to each TM), the load goes **only to one TaskManager** (either TM1 or TM2).
> - I also tried changing the slot configuration to make it **10 slots total (5 per TM)** and also **8 slots total (4 per TM)**, but with parallelism 8, the same issue persists—load is directed only to one TaskManager.
> Question: Why is the load not being balanced across both TaskManagers when job parallelism is 8? Is there any configuration I might be missing?
TBH I don't understand the difference between the setups with parallelism 10 and 8. Both setups look similar, and I don't think the behaviour you observed is related to this.
Ideas that come to my mind are:
1. Double-check that the "taskmanager.load-balance.mode" configuration option was picked up by Flink. You can check it in Flink's web console under TaskManager > Configuration.
2. Double-check the number of Kafka topic partitions and verify that every partition has a leader assigned and is in-sync, e.g. using the kafka-topics cmd tool or some web UI.
3. Ensure that load is evenly distributed across all Kafka partitions. It might not be if the key selection strategy is somehow biased.
4. Double-check that the ZooKeeper configuration for Flink is correct: the number of nodes should be odd and every node should see the others. You can check this with zk-cli and verify that the zk epoch is the same on each node.
5. Check that there is no lag in the consumption of the partitions (see the sketch after this list).
6. Check that no back-pressure is reported in the Flink console.
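(For point 5, a sketch with the standard kafka-consumer-groups CLI; the group id and broker address are placeholders, use the consumer group configured for your Kafka source. Note that Flink typically commits offsets only on checkpoints, so the reported lag is only as fresh as the last checkpoint:)

```sh
# Hypothetical group/broker names -- substitute your own.
# The LAG column should stay close to zero for every partition if consumption keeps up.
kafka-consumer-groups.sh --describe \
  --group my-scenario-group \
  --bootstrap-server broker1:9092
```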

Arek Burdach


Arjun Sivakumar

Sep 11, 2025, 2:50:06 PM
to Nussknacker
Hi Team,

Thanks for your response.

I’ve been troubleshooting a load distribution issue in my Flink job and verified the following:

1. taskmanager.load-balance.mode is configured as slots (confirmed in the TaskManager configuration).
2. Kafka topic has 2 partitions, both with assigned leaders and fully in-sync.
3. Load is evenly distributed across Kafka partitions.
4. I’m using only one ZooKeeper, so cluster coordination is straightforward.
5. No lag observed in partition consumption.
6. No backpressure is reported in the Flink console.

📌 Job setup:
- Source: Kafka
- Sink: a custom HTTP enricher + a dead-end operator

The issue occurs when running with parallelism = 4. Load distribution is uneven in this case. However, when I run a simple Kafka → Kafka job (without the custom HTTP enricher) at the same parallelism, the load is equally distributed.

Has anyone experienced similar issues with uneven load distribution when introducing custom operators? Any suggestions on what I might be missing?

Thanks in advance!

Arjun Sivakumar

Sep 12, 2025, 1:35:51 AM
to Nussknacker
Hi team,

  Any assistance on this would be greatly appreciated.

Thanks in Advance.  

Arkadiusz Burdach

Sep 12, 2025, 2:47:52 AM
to Arjun Sivakumar, Nussknacker, enter...@nussknacker.io
Hi,


> 2. Kafka topic has 2 partitions, both with assigned leaders and fully in-sync.
A few posts above, I wrote that:

> The number of partitions should be greater than or equal to the job parallelism.
If you have, let's say, parallelism = 4 and 2 partitions, 2 slots will be allocated to partitions and the other 2 will wait for a new partition to appear. Distribution of work is random, so both busy partitions can end up on one TM or be split evenly between both.
Flink has a feature allowing you to rescale such a stream: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/overview/#rescaling but at the moment we don't have a publicly available Nussknacker component that uses this feature (contributions are welcome).

Arek