the tasks don't stop


jianr...@alibaba-inc.com

Jul 18, 2016, 8:56:07 PM
to Druid User
How long should the middleManager realtime task Java processes run before they stop? Once the number of these processes goes over 3, the Spark Tranquility realtime task fails with "No hosts are available for disco". How can I solve this problem? (My current workaround is to kill the processes myself from bash.)



The Spark worker log:



16/07/18 00:04:32 INFO LoggingEmitter: Event [{"feed":"alerts","timestamp":"2016-07-18T00:04:32.920+08:00","service":"tranquility","host":"localhost","severity":"anomaly","description":"Failed to propagate events: druid:overlord/openOrder","data":{"exceptionType":"com.twitter.finagle.NoBrokersAvailableException","exceptionStackTrace":"com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]\n\tat com.twitter.finagle.NoStacktrace(Unknown Source)\n","timestamp":"2016-07-17T16:00:00.000Z","beams":"MergingPartitioningBeam(DruidBeam(interval = 2016-07-17T16:00:00.000Z/2016-07-17T17:00:00.000Z, partition = 0, tasks = [index_realtime_openOrder_2016-07-17T16:00:00.000Z_0_0/openOrder-016-0000-0000]))","eventCount":1,"exceptionMessage":"No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]"}}]

com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]

16/07/18 00:05:36 INFO LoggingEmitter: Event [{"feed":"alerts","timestamp":"2016-07-18T00:05:36.720+08:00","service":"tranquility","host":"localhost","severity":"anomaly","description":"Failed to propagate events: druid:overlord/openOrder","data":{"exceptionType":"com.twitter.finagle.NoBrokersAvailableException","exceptionStackTrace":"com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]\n\tat com.twitter.finagle.NoStacktrace(Unknown Source)\n","timestamp":"2016-07-17T16:00:00.000Z","beams":"MergingPartitioningBeam(DruidBeam(interval = 2016-07-17T16:00:00.000Z/2016-07-17T17:00:00.000Z, partition = 0, tasks = [index_realtime_openOrder_2016-07-17T16:00:00.000Z_0_0/openOrder-016-0000-0000]))","eventCount":1,"exceptionMessage":"No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]"}}]




The overlord.log:


2016-07-18T05:39:54,064 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: xxx.xxx.xxx.xxx:8091, status 200 OK, response: {"task":"index_realtime_openOrder_2016-07-17T05:00:00.000Z_0_0"}

2016-07-18T05:39:54,064 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_openOrder_2016-07-17T05:00:00.000Z_0_0! Are you sure the task was running?




The middleManager/runtime.properties:


druid.worker.capacity=9

The DruidBeams builder:

DruidBeams
  .builder((openOrderDO: OpenOrderDO) => openOrderDO.timestamp)
  .curator(curator)
  .discoveryPath(discoveryPath)
  .location(DruidLocation(DruidEnvironment(indexService), dataSource))
  .rollup(DruidRollup(SpecificDruidDimensions(dimensions), aggregators, QueryGranularity.MINUTE))
  .tuning(
    ClusteredBeamTuning(
      segmentGranularity = Granularity.HOUR,  // each realtime task covers one hour of data
      windowPeriod = new Period("PT10M"),     // events may arrive up to 10 minutes late
      partitions = 1,
      replicants = 1
    )
  )
  .buildBeam()
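
For reference, the beam is driven from the Spark job roughly like this (a minimal sketch following the Tranquility Spark docs; OpenOrderBeamFactory, buildOpenOrderBeam and the stream variable are placeholders of mine, not the real job code):

import com.metamx.tranquility.beam.Beam
import com.metamx.tranquility.spark.BeamFactory
import com.metamx.tranquility.spark.BeamRDD._

class OpenOrderBeamFactory extends BeamFactory[OpenOrderDO] {
  // Return a shared instance so all tasks in the same JVM reuse one Tranquility connection.
  def makeBeam: Beam[OpenOrderDO] = OpenOrderBeamFactory.BeamInstance
}

object OpenOrderBeamFactory {
  // buildOpenOrderBeam() stands for the DruidBeams.builder(...) chain shown above, ending in .buildBeam().
  lazy val BeamInstance: Beam[OpenOrderDO] = buildOpenOrderBeam()
}

// In the streaming job, each micro-batch is pushed to the realtime index tasks:
openOrderStream.foreachRDD(rdd => rdd.propagate(new OpenOrderBeamFactory))

As far as I can tell, the NoBrokersAvailableException above is thrown from this propagate call when the beam cannot find the task's firehose endpoint in service discovery.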

jianr...@alibaba-inc.com

Jul 19, 2016, 3:01:16 AM
to Druid User

The coordinator.log:


2016-07-19T06:50:07,875 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:51:07,886 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:52:07,896 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:52:07,905 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:53:07,907 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]



The task log:


2016-07-19T06:15:00,034 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Shutting down...
2016-07-19T06:15:00,034 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-07-19T06:15:00,035 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_openOrder_2016-07-19T05:00:00.000Z_0_0] status changed to [SUCCESS].
2016-07-19T06:15:00,038 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_realtime_openOrder_2016-07-19T05:00:00.000Z_0_0",
  "status" : "SUCCESS",
  "duration" : 4256032
}
2016-07-19T06:15:00,046 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.stop()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@552c0b19].
2016-07-19T06:15:00,046 INFO [main] io.druid.server.coordination.AbstractDataSegmentAnnouncer - Stopping class io.druid.server.coordination.BatchDataSegmentAnnouncer with config[io.druid.server.initialization.ZkPathsConfig@e59eda19]
2016-07-19T06:15:00,046 INFO [main] io.druid.curator.announcement.Announcer - unannouncing [/trip/druid/announcements/xxxx.xxx.xxx.xxx:8102]
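
(A side note on timing, my own arithmetic rather than anything in the logs: the reported duration of 4256032 ms is about 71 minutes, which is roughly segmentGranularity (one hour) plus the PT10M windowPeriod, so this particular task appears to have shut down on schedule once its window closed and handoff finished.)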


On Tuesday, July 19, 2016 at 8:56:07 AM UTC+8, jianr...@alibaba-inc.com wrote:

jianr...@alibaba-inc.com

Jul 19, 2016, 3:45:21 AM
to Druid User
The historical.log:

2016-07-19T07:28:08,313 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[openOrder_2016-07-19T04:00:00.000Z_2016-07-19T05:00:00.000Z_2016-07-19T13:01:06.498+08:00], segment=DataSegment{size=1209429, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[count, rt, user_unique], dimensions=[open_order_id, app_version, trip_type], version='2016-07-19T13:01:06.498+08:00', loadSpec={type=hdfs, path=/druid/segments/openOrder/20160719T040000.000Z_20160719T050000.000Z/2016-07-19T13_01_06.498+08_00/0/index.zip}, interval=2016-07-19T04:00:00.000Z/2016-07-19T05:00:00.000Z, dataSource='openOrder', binaryVersion='9'}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[openOrder_2016-07-19T04:00:00.000Z_2016-07-19T05:00:00.000Z_2016-07-19T13:01:06.498+08:00]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.jar:0.9.1]

at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:350) [druid-server-0.9.1.jar:0.9.1]

at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.1.jar:0.9.1]

at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:152) [druid-server-0.9.1.jar:0.9.1]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:?]

at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]

at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:772) [curator-recipes-2.10.0.jar:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_51]

at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_51]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_51]

at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_51]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_51]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_51]

at java.lang.Thread.run(Thread.java:744) [?:1.7.0_51]

Caused by: io.druid.segment.loading.SegmentLoadingException: var/druid/task/zk_druid/openOrder/2016-07-19T04:00:00.000Z_2016-07-19T05:00:00.000Z/2016-07-19T13:01:06.498+08:00/0/index.drd (No such file or directory)

at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:52) ~[druid-server-0.9.1.jar:0.9.1]

at




On Tuesday, July 19, 2016 at 3:01:16 PM UTC+8, jianr...@alibaba-inc.com wrote:

Fangjin Yang

Jul 25, 2016, 9:22:08 PM
to Druid User
Hi,

Your historical log contains the problem. How was this cluster set up? Was it set up to be distributed? The error suggests there are configuration problems across your cluster.
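
For example, the kind of settings worth double-checking on the historical and in the common configuration look like this (paths and sizes below are placeholders of mine, not values from this thread); the segment cache should be an absolute path and the deep storage settings should match where the tasks publish segments:

druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":130000000000}]

The relative-looking path var/druid/task/zk_druid in the historical error is the kind of thing a non-absolute segment cache location might produce.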