Supervisor status changes to UNHEALTHY_TASKS


shailendra kumar

Apr 8, 2021, 6:15:01 AM
to Druid User
Hi, 

In the Supervisors view, the datasource status keeps changing to UNHEALTHY_TASKS. The Master server logs show:

******************coordinator-overlord.log**********************************************
2021-04-08T09:25:21,918 WARN [KafkaSupervisor-mriprodstream-Reporting-0] org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor - Lag metric: Kafka partitions [0, 1, 2, 3, 4, 5, 6, 7] do not match task partitions []
2021-04-08T09:25:51,474 WARN [KafkaSupervisor-mriprodstream] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - All tasks in group [0] failed to publish, killing all tasks for these partitions
2021-04-08T09:25:51,475 ERROR [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Ignoring null work item from pending task queue: {class=org.apache.druid.indexing.overlord.RemoteTaskRunner, taskId=index_kafka_mriprodstream_0c15d19ea8071c7_lapjmdme}
2021-04-08T09:25:51,489 WARN [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[mridruiddata.redbus.com:8091] announced a status for a task I didn't know about, adding to runningTasks: index_kafka_mriprodstream_0c15d19ea8071c7_lapjmdme
2021-04-08T09:25:56,507 WARN [KafkaSupervisor-mriprodstream-Worker-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Ignoring task [index_kafka_mriprodstream_0c15d19ea8071c7_hkoohpkk], as probably it is not started running yet
2021-04-08T09:35:32,872 WARN [KafkaSupervisor-filterDataStream-Worker-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Ignoring task [index_kafka_filterDataStream_7b88d192779a8ff_cdbjhgdi], as probably it is not started running yet

************************************************************************************


[Attached screenshot: 2021-04-08.png]

How does this happen? Every time it happens, I fix it with HARD_RESET in the action column.
What is the permanent fix here?


Vaibhav Vaibhav

Apr 8, 2021, 7:55:31 AM
to druid...@googlegroups.com
Hi Shailendra,

> What is the permanent fix here?

We need to understand the root cause of this issue to get to a permanent solution.

> How does this happen? Every time it happens, I fix it with HARD_RESET in the action column.

The supervisor state UNHEALTHY_TASKS means that the last druid.supervisor.taskUnhealthinessThreshold tasks have all failed. "druid.supervisor.taskUnhealthinessThreshold" is the number of consecutive task failures before the supervisor is considered unhealthy. The default value for this parameter is 3, so three consecutive task failures lead to this state.
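As a rough illustration of the threshold rule (a sketch of the idea only, not Druid's actual implementation), the check can be modeled as "are the most recent N task outcomes all failures?". The function name and the "SUCCESS"/"FAILED" strings are hypothetical and chosen just for this example.

```python
def is_unhealthy_tasks(task_outcomes, threshold=3):
    """Return True when the most recent `threshold` outcomes are all failures.

    task_outcomes: list of strings, oldest first, each "SUCCESS" or "FAILED"
                   (hypothetical labels for this sketch).
    threshold:     stands in for druid.supervisor.taskUnhealthinessThreshold.
    """
    if len(task_outcomes) < threshold:
        # Not enough history yet to reach the threshold.
        return False
    recent = task_outcomes[-threshold:]
    return all(outcome == "FAILED" for outcome in recent)


# Three consecutive failures at the end of the history trip the check.
history = ["SUCCESS", "FAILED", "FAILED", "FAILED"]
unhealthy = is_unhealthy_tasks(history)
```

A single successful task at the end of the history breaks the run of failures, so the same check would no longer report the unhealthy state.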

Why are the tasks failing?

Looking at the log excerpts you have posted, it seems the Kafka tasks are failing while trying to publish their segments:


2021-04-08T09:25:51,474 WARN [KafkaSupervisor-mriprodstream] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - All tasks in group [0] failed to publish, killing all tasks for these partitions

  • Looking at the overlord log carefully could give you more detail on exactly why the tasks are failing while publishing.
  • One possible reason is that the task is timing out on completionTimeout, the length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after taskDuration elapses. The default value for completionTimeout is 30 minutes. If that is the case, you may see a message like the following in the overlord log:
    No task in [[]]  succeeded before the completion timeout elapsed [PT1800S]!
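To check which of these is happening, something like the following sketch could scan overlord log lines for the two messages quoted in this thread. The regular expressions are based only on the message text shown here; the function name is hypothetical, and the exact wording may differ between Druid versions.

```python
import re

# Patterns taken from the log messages quoted in this thread.
PUBLISH_FAILURE = re.compile(r"All tasks in group \[(\d+)\] failed to publish")
COMPLETION_TIMEOUT = re.compile(r"completion timeout elapsed \[([^\]]+)\]")


def scan_publish_failures(lines):
    """Return (task groups that failed to publish, completion timeouts that elapsed)."""
    failed_groups = []
    timeouts = []
    for line in lines:
        match = PUBLISH_FAILURE.search(line)
        if match:
            failed_groups.append(int(match.group(1)))
        match = COMPLETION_TIMEOUT.search(line)
        if match:
            timeouts.append(match.group(1))
    return failed_groups, timeouts


sample = [
    "2021-04-08T09:25:51,474 WARN [KafkaSupervisor-mriprodstream] "
    "org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - "
    "All tasks in group [0] failed to publish, killing all tasks for these partitions",
    "No task in [[]]  succeeded before the completion timeout elapsed [PT1800S]!",
]
failed_groups, timeouts = scan_publish_failures(sample)
```

If the second list is non-empty, the completionTimeout explanation is the likely one; if only the first list has entries, the cause is probably elsewhere in the task logs.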


You can grep for the failed task IDs in the overlord log to learn more about the failure, which will help you identify the next steps.
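If plain grep gets tedious with many tasks, a small script can group the log lines per task ID instead. This is only a sketch: it assumes task IDs start with index_kafka_ as in the logs posted above, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: Kafka indexing task IDs look like the ones in the posted logs,
# e.g. index_kafka_mriprodstream_0c15d19ea8071c7_lapjmdme
TASK_ID = re.compile(r"index_kafka_\w+")


def group_lines_by_task(lines):
    """Map each task ID found in the log to the lines that mention it."""
    by_task = defaultdict(list)
    for line in lines:
        # set() avoids adding the same line twice when an ID repeats in it.
        for task_id in set(TASK_ID.findall(line)):
            by_task[task_id].append(line)
    return dict(by_task)


sample = [
    "ERROR ... Ignoring null work item from pending task queue: "
    "{taskId=index_kafka_mriprodstream_0c15d19ea8071c7_lapjmdme}",
    "WARN ... announced a status for a task I didn't know about, adding to "
    "runningTasks: index_kafka_mriprodstream_0c15d19ea8071c7_lapjmdme",
    "WARN ... Ignoring task [index_kafka_mriprodstream_0c15d19ea8071c7_hkoohpkk], "
    "as probably it is not started running yet",
]
grouped = group_lines_by_task(sample)
```

Reading the lines for one failed task in order usually makes it easier to see at which step it stopped.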


Thanks and Regards,
Vaibhav

