I am submitting a scenario to a Flink cluster with 2 TaskManagers and a scenario parallelism of 2.
Under normal conditions, the scenario runs correctly across both TaskManagers.
However, the problem occurs when one TaskManager goes down.
In this situation, the scenario fails after a few restart attempts, even though one TaskManager is still running and free slots are available. Please find the attached error log.
In the scenario properties, the IO mode is set to Asynchronous.
If I switch the IO mode to Synchronous, the issue does not occur — the Flink job automatically restarts and continues running on the remaining TaskManager.
Because Synchronous mode has significantly lower performance, I would like to continue using Asynchronous IO mode.
Question: Is there any configuration or recommended approach in Nussknacker or Flink that can ensure stable job recovery when using Asynchronous IO mode?
Specifically, how can we prevent recovery failures or state-restore errors when one TaskManager stops?