un-killable "zombie" tranq indexing tasks

Jason Chen

Feb 19, 2016, 7:11:31 PM
to Druid User
Hi all,

We're using tranquility 0.5.0 / druid 0.7.3.
We recently experienced some issues with our ZK cluster, and had to restart our druid cluster as well.

After bringing everything back up, we noticed something pretty strange.  All in-flight tasks were killed (as expected), and all middlemanagers reported no active tasks, except for ONE middlemanager, which still reported 17 active tasks.  The tasks were all from the same time range (02-19T06:00), around the time we restarted things.  I hopped on that middlemanager to verify whether the tasks were running, and discovered that none of the peons associated w/ those tasks were actually running anymore.  Both the overlord & middlemanager know about the tasks, and both try to shut them down:

On overlord startup, we see this.  It looks like the overlord knows these tasks are old, but it can't shut them down.
-------------------------------------------------------------------------------------------------------------------------------------------------
...
2016-02-19T08:58:00,024 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[metrics-druid-18:8085] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T08:58:00,024 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[metrics-druid-18:8085] wrote RUNNING status for task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T08:58:00,024 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[metrics-druid-18:8085] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T08:58:00,024 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[metrics-druid-18:8085] wrote RUNNING status for task: index_realtime_*****_2016-02-19T06:00:00.000Z_2_0
2016-02-19T08:58:00,024 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[metrics-druid-18:8085] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_*****_2016-02-19T06:00:00.000Z_2_0
...
2016-02-19T09:01:25,118 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 0 tasks from storage (0 tasks added, 0 tasks removed).
2016-02-19T09:01:25,118 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to clean up 17 tasks.
2016-02-19T09:01:25,124 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: metrics-druid-18:8085, status 200 OK, response: {"task":"index_realtime_*****_2016-02-19T06:00:00.000Z_4_0"}
2016-02-19T09:01:25,124 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_*****_2016-02-19T06:00:00.000Z_4_0! Are you sure the task was running?
2016-02-19T09:01:25,128 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: metrics-druid-18:8085, status 200 OK, response: {"task":"index_realtime_*****_2016-02-19T06:00:00.000Z_2_0"}
2016-02-19T09:01:25,129 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_*****_2016-02-19T06:00:00.000Z_2_0! Are you sure the task was running?
2016-02-19T09:01:25,133 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: metrics-druid-18:8085, status 200 OK, response: {"task":"index_realtime_*****_2016-02-19T06:00:00.000Z_0_0"}
...
2016-02-19T09:02:25,118 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 0 tasks from storage (0 tasks added, 0 tasks removed).
2016-02-19T09:02:25,119 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to clean up 17 tasks.
2016-02-19T09:02:25,124 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: metrics-druid-18:8085, status 200 OK, response: {"task":"index_realtime_*****_2016-02-19T06:00:00.000Z_4_0"}
2016-02-19T09:02:25,124 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_*****_2016-02-19T06:00:00.000Z_4_0! Are you sure the task was running?
2016-02-19T09:02:25,129 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: metrics-druid-18:8085, status 200 OK, response: {"task":"index_realtime_*****_2016-02-19T06:00:00.000Z_2_0"}
2016-02-19T09:02:25,129 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_*****_2016-02-19T06:00:00.000Z_2_0! Are you sure the task was running?
2016-02-19T09:02:25,133 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: metrics-druid-18:8085, status 200 OK, response: {"task":"index_realtime_*****_2016-02-19T06:00:00.000Z_0_0"}
...
(and this keeps repeating)
-------------------------------------------------------------------------------------------------------------------------------------------------
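My read of the "announced a status for a task I didn't know about" warnings is that the overlord rebuilds its runningTasks list from the status znodes each worker publishes in ZK, so stale znodes left behind by the ZK outage would explain the resurrection.  If that's right, this is where I plan to look — a sketch only, assuming the default druid.zk.paths.base of /druid (the worker name and commands are just the shape of it):

```shell
# status_znode builds the ZK path where a worker publishes a task's status.
status_znode() {
  printf '/druid/indexer/status/%s/%s' "$1" "$2"
}

# Inspection with zkCli.sh (read-only first; delete only once confirmed stale):
#   zkCli.sh -server zk-host:2181
#   ls /druid/indexer/status/metrics-druid-18:8085
#   delete /druid/indexer/status/metrics-druid-18:8085/<taskId>
```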

On the troublesome middlemanager, we see this: 
-------------------------------------------------------------------------------------------------------------------------------------------------
2016-02-19T06:10:48,848 INFO [qtp647052308-24] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_1_0
2016-02-19T06:10:48,876 INFO [qtp647052308-25] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T06:10:48,882 INFO [qtp647052308-26] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_2_0
2016-02-19T06:10:48,889 INFO [qtp647052308-27] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T06:10:48,896 INFO [qtp647052308-28] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_1_0
2016-02-19T06:10:48,904 INFO [qtp647052308-29] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
...
2016-02-19T23:22:25,254 INFO [qtp647052308-50] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_2_0
2016-02-19T23:22:25,256 INFO [qtp647052308-54] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T23:22:25,257 INFO [qtp647052308-55] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_1_0
2016-02-19T23:22:25,259 INFO [qtp647052308-42] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_3_0
2016-02-19T23:22:25,261 INFO [qtp647052308-57] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
2016-02-19T23:22:25,262 INFO [qtp647052308-63] io.druid.indexing.overlord.ForkingTaskRunner - Ignoring request to cancel unknown task: index_realtime_*****_2016-02-19T06:00:00.000Z_0_0
(keeps repeating!)
-------------------------------------------------------------------------------------------------------------------------------------------------
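For completeness, this is roughly how I confirmed no peons were left on that box.  The pattern match is my own heuristic — peon command lines on our install contain "io.druid.cli.Main internal peon":

```shell
# is_peon: heuristic check of whether a process command line is a Druid peon.
is_peon() {
  case "$1" in
    *io.druid.cli.Main*peon*) echo yes ;;
    *)                        echo no  ;;
  esac
}

# Usage against live processes:
#   ps -eo args= | while IFS= read -r cmd; do
#     [ "$(is_peon "$cmd")" = yes ] && echo "peon: $cmd"
#   done
```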

I've tried to shut the tasks down via the overlord API, to no avail.  I get a 200 OK, but the task remains visible on the overlord page.
-------------------------------------------------------------------------------------------------------------------------------------------------
*   Trying ***.***.***.***...
* Connected to metrics-druid-05 (***.***.***.***) port 8084 (#0)
> POST /druid/indexer/v1/task/index_realtime_****_2016-02-19T06:00:00.000Z_4_0/shutdown HTTP/1.1
> Host: metrics-druid-05:8084
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 19 Feb 2016 20:11:03 GMT
< Content-Type: application/json
< Vary: Accept-Encoding, User-Agent
< Transfer-Encoding: chunked
< Server: Jetty(9.2.z-SNAPSHOT)
<
* Connection #0 to host metrics-druid-05 left intact
{"task":"index_realtime_***_2016-02-19T06:00:00.000Z_4_0"}%
-------------------------------------------------------------------------------------------------------------------------------------------------
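In case it helps anyone script the same thing, the call above can be looped over every zombie id — the endpoint path is the standard overlord shutdown route, and "xxx" below just stands in for our (redacted) datasource:

```shell
# shutdown_url builds the overlord's shutdown endpoint for a given task id.
shutdown_url() {
  printf 'http://%s/druid/indexer/v1/task/%s/shutdown' "$1" "$2"
}

# Usage (one POST per zombie id):
#   for task in index_realtime_xxx_2016-02-19T06:00:00.000Z_0_0 \
#               index_realtime_xxx_2016-02-19T06:00:00.000Z_4_0; do
#     curl -s -XPOST "$(shutdown_url metrics-druid-05:8084 "$task")"
#   done
```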

Since the "zombie" tasks appear as "RUNNING" in the overlord console, we can check their status via the status link.

A regular task shows:
-------------------------------------------------------------------------------------------------------------------------------------------------
{
  "status": {
    "id": "index_realtime_*****_2016-02-19T23:15:00.000Z_0_0",
    "status": "RUNNING",
    "duration": -1
  },
  "task": "index_realtime_*****_2016-02-19T23:15:00.000Z_0_0"
}
-------------------------------------------------------------------------------------------------------------------------------------------------

But the "zombie" tasks return the following.  No status, just the task id.
-------------------------------------------------------------------------------------------------------------------------------------------------
{
  "task": "index_realtime_*****_2016-02-19T06:00:00.000Z_0_0"
}
-------------------------------------------------------------------------------------------------------------------------------------------------
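So a crude way to enumerate the zombies is to fetch each task's status document and flag the ones with no "status" object — grep is rough, but the payloads are tiny:

```shell
# is_zombie: reads a task status JSON document on stdin.  A healthy task's
# payload contains a "status" object; a zombie's contains only the task id.
is_zombie() {
  if grep -q '"status"'; then echo no; else echo yes; fi
}

# Usage:
#   curl -s http://metrics-druid-05:8084/druid/indexer/v1/task/<taskId>/status | is_zombie
```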


Here's the last breath of one of those zombie tasks:
-------------------------------------------------------------------------------------------------------------------------------------------------
2016-02-19T06:03:13,265 INFO [parent-monitor-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2016-02-19T06:03:13,267 INFO [Thread-61] io.druid.cli.CliPeon - Running shutdown hook
2016-02-19T06:03:13,285 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.stop()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@7795aa3e].
2016-02-19T06:03:13,285 INFO [Thread-61] io.druid.server.coordination.AbstractDataSegmentAnnouncer - Stopping class io.druid.server.coordination.BatchDataSegmentAnnouncer with config[io.druid.server.initialization.ZkPathsConfig@58d3f4be]
2016-02-19T06:03:13,285 INFO [Thread-61] io.druid.curator.announcement.Announcer - unannouncing [/druid/announcements/metrics-druid-18:8108]
2016-02-19T06:03:13,403 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[io.druid.indexing.worker.executor.ExecutorLifecycle@fc8948f].
2016-02-19T06:03:13,413 INFO [Thread-61] org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@10d90b94{HTTP/1.1}{0.0.0.0:8108}
2016-02-19T06:03:13,415 INFO [Thread-61] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@5b3ba30c{/,null,UNAVAILABLE}
2016-02-19T06:03:13,423 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.log.FileRequestLogger.stop()] on object[io.druid.server.log.FileRequestLogger@3db18a6e].
2016-02-19T06:03:13,424 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.overlord.ThreadPoolTaskRunner.stop()] on object[io.druid.indexing.overlord.ThreadPoolTaskRunner@6437c127].
2016-02-19T06:03:13,425 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@434f7f5c].
2016-02-19T06:03:13,435 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.announcement.Announcer.stop()] on object[io.druid.curator.announcement.Announcer@41360c06].
2016-02-19T06:03:13,435 INFO [Thread-61] io.druid.curator.announcement.Announcer - unannouncing [/druid/segments/metrics-druid-18:8108/metrics-druid-18:8108_realtime__default_tier_2016-02-19T05:59:03.704Z_9ad06efff9984b789bed9e025e7f1b130]
2016-02-19T06:03:13,447 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@4e60571].
2016-02-19T06:03:13,439 ERROR [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Exception aborted realtime processing[*****]: {class=io.druid.indexing.common.task.RealtimeIndexTask, exceptionType=class java.lang.RuntimeException, exceptionMessage=java.lang.InterruptedException}
java.lang.RuntimeException: java.lang.InterruptedException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.segment.realtime.firehose.EventReceiverFirehoseFactory$EventReceiverFirehose.hasMore(EventReceiverFirehoseFactory.java:209) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.segment.realtime.firehose.TimedShutoffFirehoseFactory$TimedShutoffFirehose.hasMore(TimedShutoffFirehoseFactory.java:99) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.segment.realtime.firehose.PredicateFirehose.hasMore(PredicateFirehose.java:50) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.indexing.common.task.RealtimeIndexTask.run(RealtimeIndexTask.java:314) [druid-standalone.jar:0.8.3.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:285) [druid-standalone.jar:0.8.3.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:265) [druid-standalone.jar:0.8.3.2]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [?:1.7.0_75]
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) ~[?:1.7.0_75]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) ~[?:1.7.0_75]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) ~[?:1.7.0_75]
        at io.druid.segment.realtime.firehose.EventReceiverFirehoseFactory$EventReceiverFirehose.hasMore(EventReceiverFirehoseFactory.java:201) ~[druid-standalone.jar:0.8.3.2]
        ... 9 more
2016-02-19T06:03:13,454 ERROR [task-runner-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[RealtimeIndexTask{id=index_realtime_*****_2016-02-19T06:00:00.000Z_5_0, type=index_realtime, dataSource=*****}]
java.lang.RuntimeException: java.lang.InterruptedException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.segment.realtime.firehose.EventReceiverFirehoseFactory$EventReceiverFirehose.hasMore(EventReceiverFirehoseFactory.java:209) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.segment.realtime.firehose.TimedShutoffFirehoseFactory$TimedShutoffFirehose.hasMore(TimedShutoffFirehoseFactory.java:99) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.segment.realtime.firehose.PredicateFirehose.hasMore(PredicateFirehose.java:50) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.indexing.common.task.RealtimeIndexTask.run(RealtimeIndexTask.java:314) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:285) [druid-standalone.jar:0.8.3.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:265) [druid-standalone.jar:0.8.3.2]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [?:1.7.0_75]
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) ~[?:1.7.0_75]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) ~[?:1.7.0_75]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) ~[?:1.7.0_75]
        at io.druid.segment.realtime.firehose.EventReceiverFirehoseFactory$EventReceiverFirehose.hasMore(EventReceiverFirehoseFactory.java:201) ~[druid-standalone.jar:0.8.3.2]
        ... 9 more
2016-02-19T06:03:13,468 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_realtime_*****_2016-02-19T06:00:00.000Z_5_0",
  "status" : "FAILED",
  "duration" : 258172
}
2016-02-19T06:03:13,489 INFO [Thread-61] io.druid.curator.CuratorModule - Stopping Curator
2016-02-19T06:03:13,504 INFO [Thread-61] org.apache.zookeeper.ZooKeeper - Session: 0x54c332e4dc2cf85 closed
2016-02-19T06:03:13,504 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down
2016-02-19T06:03:13,506 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.http.client.NettyHttpClient.stop()] on object[com.metamx.http.client.NettyHttpClient@4465a4b2].
2016-02-19T06:03:13,611 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.metrics.MonitorScheduler.stop()] on object[com.metamx.metrics.MonitorScheduler@68a56bcd].
2016-02-19T06:03:13,612 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[com.metamx.emitter.service.ServiceEmitter@5b8a47e].
2016-02-19T06:03:13,613 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@20a0726e].
2016-02-19T06:03:13,613 INFO [Thread-61] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.initialization.OpenTsdbEmitter.close() throws java.io.IOException] on object[io.druid.server.initialization.OpenTsdbEmitter@36f3ec92].
2016-02-19T06:03:13,613 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.stop()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@7795aa3e].
2016-02-19T06:03:13,613 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[io.druid.indexing.worker.executor.ExecutorLifecycle@fc8948f].
2016-02-19T06:03:13,614 ERROR [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void io.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[io.druid.indexing.worker.executor.ExecutorLifecycle@fc8948f]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_75]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_75]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_75]
        at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_75]
        at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:337) [druid-standalone.jar:0.8.3.2]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.Main.main(Main.java:99) [druid-standalone.jar:0.8.3.2]
Caused by: java.nio.channels.ClosedChannelException
        at sun.nio.ch.FileLockImpl.release(FileLockImpl.java:58) ~[?:1.7.0_75]
        at io.druid.indexing.worker.executor.ExecutorLifecycle.stop(ExecutorLifecycle.java:220) ~[druid-standalone.jar:0.8.3.2]
        ... 8 more
2016-02-19T06:03:13,614 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.log.FileRequestLogger.stop()] on object[io.druid.server.log.FileRequestLogger@3db18a6e].
2016-02-19T06:03:13,615 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.overlord.ThreadPoolTaskRunner.stop()] on object[io.druid.indexing.overlord.ThreadPoolTaskRunner@6437c127].
2016-02-19T06:03:13,615 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@434f7f5c].
2016-02-19T06:03:13,615 ERROR [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@434f7f5c]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_75]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_75]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_75]
        at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_75]
        at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:337) [druid-standalone.jar:0.8.3.2]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.Main.main(Main.java:99) [druid-standalone.jar:0.8.3.2]
Caused by: java.lang.IllegalStateException: Already closed or has not been started
        at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl.close(ServiceCacheImpl.java:104) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceProviderImpl.close(ServiceProviderImpl.java:78) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.curator.discovery.ServerDiscoverySelector.stop(ServerDiscoverySelector.java:122) ~[druid-standalone.jar:0.8.3.2]
        ... 8 more
2016-02-19T06:03:13,616 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.announcement.Announcer.stop()] on object[io.druid.curator.announcement.Announcer@41360c06].
2016-02-19T06:03:13,616 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@4e60571].
2016-02-19T06:03:13,616 ERROR [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@4e60571]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_75]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_75]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_75]
        at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_75]
        at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:337) [druid-standalone.jar:0.8.3.2]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.Main.main(Main.java:99) [druid-standalone.jar:0.8.3.2]
Caused by: java.lang.IllegalStateException: Already closed or has not been started
        at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl.close(ServiceCacheImpl.java:104) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceProviderImpl.close(ServiceProviderImpl.java:78) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.curator.discovery.ServerDiscoverySelector.stop(ServerDiscoverySelector.java:122) ~[druid-standalone.jar:0.8.3.2]
        ... 8 more
2016-02-19T06:03:13,616 ERROR [main] org.apache.curator.x.discovery.details.ServiceDiscoveryImpl - Could not unregister instance: druid:prod:firehose:*****-00-0005-0000
java.lang.IllegalStateException: instance must be started before calling this method
        at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:357) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:505) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:155) [druid-standalone.jar:0.8.3.2]
        at io.druid.curator.discovery.DiscoveryModule$5.stop(DiscoveryModule.java:224) [druid-standalone.jar:0.8.3.2]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.Main.main(Main.java:99) [druid-standalone.jar:0.8.3.2]
2016-02-19T06:03:13,617 ERROR [main] org.apache.curator.x.discovery.details.ServiceDiscoveryImpl - Could not unregister instance: *****-00-0005-0000
java.lang.IllegalStateException: instance must be started before calling this method
        at com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:357) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:505) ~[druid-standalone.jar:0.8.3.2]
        at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:155) [druid-standalone.jar:0.8.3.2]
        at io.druid.curator.discovery.DiscoveryModule$5.stop(DiscoveryModule.java:224) [druid-standalone.jar:0.8.3.2]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.Main.main(Main.java:99) [druid-standalone.jar:0.8.3.2]
2016-02-19T06:03:13,617 INFO [main] io.druid.curator.CuratorModule - Stopping Curator
2016-02-19T06:03:13,617 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.http.client.NettyHttpClient.stop()] on object[com.metamx.http.client.NettyHttpClient@4465a4b2].
2016-02-19T06:03:13,617 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.metrics.MonitorScheduler.stop()] on object[com.metamx.metrics.MonitorScheduler@68a56bcd].
2016-02-19T06:03:13,618 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[com.metamx.emitter.service.ServiceEmitter@5b8a47e].
2016-02-19T06:03:13,618 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@20a0726e].
2016-02-19T06:03:13,618 ERROR [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@20a0726e]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_75]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_75]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_75]
        at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_75]
        at com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:337) [druid-standalone.jar:0.8.3.2]
        at com.metamx.common.lifecycle.Lifecycle.stop(Lifecycle.java:261) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.CliPeon.run(CliPeon.java:248) [druid-standalone.jar:0.8.3.2]
        at io.druid.cli.Main.main(Main.java:99) [druid-standalone.jar:0.8.3.2]
Caused by: java.lang.IllegalStateException: Expected state [STARTED] found [STOPPED]
        at io.druid.common.config.Log4jShutdown.stop(Log4jShutdown.java:102) ~[druid-standalone.jar:0.8.3.2]
        at io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop(Log4jShutterDownerModule.java:109) ~[druid-standalone.jar:0.8.3.2]
        ... 8 more
2016-02-19T06:03:13,618 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.initialization.OpenTsdbEmitter.close() throws java.io.IOException] on object[io.druid.server.initialization.OpenTsdbEmitter@36f3ec92].
2016-02-19T06:03:13,618 INFO [main] io.druid.cli.CliPeon - Finished peon task
-------------------------------------------------------------------------------------------------------------------------------------------------

After our restart, tranq indexing tasks started coming through again, and data appears to be making it into druid.  However, we have these 17 tasks that just won't die.  They're taking up middlemanager capacity and preventing real tasks from using those slots, so we're getting really close to maxing out MM capacity.  Did some metadata get corrupted somewhere?  Any idea how we can kill / delete these tasks?
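For reference, this is roughly what we've been trying so far: asking the overlord to shut each task down by ID via its HTTP API. The overlord host/port and the datasource name below are placeholders for our real ones, and the curl is commented out so the commands are just printed for review:

```shell
# Sketch: ask the overlord to shut down each zombie task by ID.
# "overlord-host:8090" and "mydatasource" are placeholders.
OVERLORD="http://overlord-host:8090"
for TASK_ID in \
    "index_realtime_mydatasource_2016-02-19T06:00:00.000Z_0_0" \
    "index_realtime_mydatasource_2016-02-19T06:00:00.000Z_2_0"; do
  # Print the request we would issue; uncomment the curl to actually send it.
  echo "POST ${OVERLORD}/druid/indexer/v1/task/${TASK_ID}/shutdown"
  # curl -s -X POST "${OVERLORD}/druid/indexer/v1/task/${TASK_ID}/shutdown"
done
```

The shutdown requests go through without errors, but the tasks still show as running.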

Thanks in advance!
-Jason

Gian Merlino

Feb 19, 2016, 7:40:35 PM
to druid...@googlegroups.com
Hey Jason,

Tranquility 0.5.0 and Druid 0.7.3 are quite old and there have been a few fixes since then related to task announcing and assignment. Could you try upgrading?

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/d4364cad-9d7f-4d4e-8408-5966b2b0a228%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jason Chen

Feb 19, 2016, 7:54:36 PM
to Druid User
Hi Gian, thanks for the reply.

Upgrading to druid 0.8.3 is on our roadmap, and we're currently testing it out in our staging environment.  Unfortunately, it will still take some time for us to upgrade our prod druid cluster.  Is there anything we can do, at least right now, to remove these tasks?

Thanks again,
-Jason
...

Gian Merlino

Feb 19, 2016, 8:08:09 PM
to druid...@googlegroups.com
Does restarting the middleManager help?

Gian

Jason Chen

Feb 19, 2016, 8:57:47 PM
to Druid User
It doesn't appear to help.  When I shut down that middleManager, the zombie tasks no longer appear in the overlord.  When I bring that middleManager back up, the zombie tasks show up again.  Also, right after the middleManager starts, we see the "Ignoring request to cancel unknown task" message repeated multiple times in the logs.
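Since the tasks reappear whenever that middleManager restarts, I'm guessing it's re-announcing stale state from ZooKeeper. This is a sketch of the znode paths I'm planning to inspect; it assumes the default base of /druid (adjust for your druid.zk.paths.* settings), and the worker name is the one from our logs. The commands are printed here rather than run so they can be reviewed in zkCli.sh first:

```shell
# Sketch: look for stale per-task znodes announced for this worker.
# Assumes the default ZK base path of /druid.
ZK_BASE="/druid/indexer"
WORKER="metrics-druid-18:8085"
# Commands to run in zkCli.sh; printed rather than executed here.
echo "ls ${ZK_BASE}/status/${WORKER}"   # per-task status znodes for this worker
echo "ls ${ZK_BASE}/tasks/${WORKER}"    # task assignment znodes for this worker
```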

Thanks,
-Jason
...

Fangjin Yang

Feb 29, 2016, 12:10:35 PM
to Druid User
The version is so old that I think it'll be difficult for us to debug anything without setting aside dedicated time. You can try wiping the task entries from the metadata store and see if that helps.
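Roughly like this, assuming a MySQL metadata store with the default "druid_" table prefix; the task ID and connection details are placeholders, and the DELETEs are printed rather than executed so you can review them. Back up the tables and stop the overlord before actually running anything:

```shell
# Sketch: remove a stuck task's rows from the metadata store.
# Assumes MySQL and the default "druid_" table prefix; placeholders throughout.
TASK_ID="index_realtime_mydatasource_2016-02-19T06:00:00.000Z_0_0"  # placeholder
SQL="DELETE FROM druid_tasks WHERE id = '${TASK_ID}';"
echo "$SQL"
echo "DELETE FROM druid_tasklocks WHERE task_id = '${TASK_ID}';"
# mysql -h metadata-host -u druid -p druid -e "$SQL"
```

Then restart the overlord and the middleManager so neither re-reads the old entries.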

If you need dedicated help, you can try http://imply.io/
...