Ingestion tasks failing instantly on a single MM

Kayla Oliva

Nov 3, 2025, 10:06:59 PM
to Druid User
Hello,
For most of the day we have a single MiddleManager up (that's the lower bound we set in our autoscaler). We've hit an issue twice in the six weeks since we enabled autoscaling.

When our daily batch ingestions kick off, tasks usually start on the single MM without issue. On the days the problem occurs, though, the ingestion tasks start and then fail almost immediately, so quickly that no metrics are scraped for them. As a result, when the autoscaler queries the coordinator for a task breakdown it gets back nothing, detects no pending work, and therefore never scales up the MMs.
[Attachment: fast_fail_succeed.png]
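
For context, the scale-up decision boils down to a check like the sketch below. This is simplified and illustrative rather than our actual code; the Overlord address is a placeholder, and it assumes the standard Overlord task endpoints /druid/indexer/v1/pendingTasks and /druid/indexer/v1/runningTasks (in our case the query goes through the coordinator, but the idea is the same).

  import requests

  OVERLORD = "http://overlord.example.com:8081"  # placeholder address

  def pending_work() -> int:
      # Count tasks the cluster is still waiting on or currently running.
      pending = requests.get(f"{OVERLORD}/druid/indexer/v1/pendingTasks", timeout=10).json()
      running = requests.get(f"{OVERLORD}/druid/indexer/v1/runningTasks", timeout=10).json()
      return len(pending) + len(running)

  # When tasks fail within milliseconds of starting, both lists are already
  # empty by the time this poll runs, so no scale-up is triggered and we stay
  # at the single-MM lower bound.
  if pending_work() == 0:
      print("no pending work detected; keeping MiddleManager count at the lower bound")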

We don't see anything suspicious in the cluster logs around the times this has occurred, aside from a spike in "Task is not in knownTaskIds" errors. Restarting the coordinators resolves the issue, but it's concerning that our current observability setup doesn't tell us why it happens in the first place.

Has anyone encountered something similar or have ideas on what might be causing it?

Ben Krug

Nov 5, 2025, 5:13:20 PM
to druid...@googlegroups.com
I haven't seen this, but I wanted to ask. You mentioned that there's not much in the cluster logs. Does that include the (failed) task logs? Anything in there?

Kayla Oliva

Nov 8, 2025, 3:03:19 PM
to Druid User
Great point. Here's the full picture. It's unclear why the task is "unknown" in the first place:

Task Log (from GCS):

  2025-10-19T08:36:56,342 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Attempting to lock file[/work/tasks/slot9/query-1f033f63-814f-4e38-8534-89c8684568de/lock].
  2025-10-19T08:36:56,343 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Acquired lock file[/work/tasks/slot9/query-1f033f63-814f-4e38-8534-89c8684568de/lock] in 1ms.
  2025-10-19T08:36:56,345 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown. Check overlord logs to see why the task is being shut down.

Overlord/Coordinator Logs:

  2025-10-19T08:36:51.582 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: query-1f033f63-814f-4e38-8534-89c8684568de
  2025-10-19T08:36:51.582 INFO [hrtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigning task [query-1f033f63-814f-4e38-8534-89c8684568de] to worker[241.32.69.18:8088]
  2025-10-19T08:36:51.594 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task[query-1f033f63-814f-4e38-8534-89c8684568de] started.
  2025-10-19T08:36:51.594 INFO [HttpRemoteTaskRunner-worker-sync-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[query-1f033f63-814f-4e38-8534-89c8684568de] started RUNNING on worker[241.32.69.18:8088].
  2025-10-19T08:36:51.595 WARN [HttpRemoteTaskRunner-worker-sync-1] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Worker[241.32.69.18:8088] reported status[RUNNING] for unknown task[query-1f033f63-814f-4e38-8534-89c8684568de]. Ignored.
  2025-10-19T08:36:51.595 WARN [HttpRemoteTaskRunner-worker-sync-1] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Killing task[query-1f033f63-814f-4e38-8534-89c8684568de] on worker[241.32.69.18:8088].
  2025-10-19T08:36:51.595 INFO [qtp844008362-171] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [query-1f033f63-814f-4e38-8534-89c8684568de] because: [shut down request via HTTP endpoint]
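
Next time this happens we can also check what the Overlord thinks the task's state is at that moment via the task status endpoint. A minimal sketch, assuming the standard Indexer API (GET /druid/indexer/v1/task/{taskId}/status) and a placeholder Overlord address:

  import requests

  OVERLORD = "http://overlord.example.com:8081"  # placeholder address
  TASK_ID = "query-1f033f63-814f-4e38-8534-89c8684568de"

  # Ask the Overlord for its view of the task. If the task has fallen out of
  # its bookkeeping (the "unknown task" warning above), that should show here.
  resp = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{TASK_ID}/status", timeout=10)
  if resp.status_code == 200:
      print(resp.json().get("status"))
  else:
      print(f"Overlord returned HTTP {resp.status_code} for task {TASK_ID}")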
