the tasks don't stop


jianr...@alibaba-inc.com

Jul 18, 2016, 8:56:07 PM
to Druid User
How long should the middleManager realtime task Java processes run before they stop? Once the number of these processes goes over 3, the Spark Tranquility realtime task fails with "No hosts are available for disco". How can I solve this problem? (My current workaround is to kill the processes myself from bash.)



The Spark worker log:



16/07/18 00:04:32 INFO LoggingEmitter: Event [{"feed":"alerts","timestamp":"2016-07-18T00:04:32.920+08:00","service":"tranquility","host":"localhost","severity":"anomaly","description":"Failed to propagate events: druid:overlord/openOrder","data":{"exceptionType":"com.twitter.finagle.NoBrokersAvailableException","exceptionStackTrace":"com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]\n\tat com.twitter.finagle.NoStacktrace(Unknown Source)\n","timestamp":"2016-07-17T16:00:00.000Z","beams":"MergingPartitioningBeam(DruidBeam(interval = 2016-07-17T16:00:00.000Z/2016-07-17T17:00:00.000Z, partition = 0, tasks = [index_realtime_openOrder_2016-07-17T16:00:00.000Z_0_0/openOrder-016-0000-0000]))","eventCount":1,"exceptionMessage":"No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]"}}]

com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]

16/07/18 00:05:36 INFO LoggingEmitter: Event [{"feed":"alerts","timestamp":"2016-07-18T00:05:36.720+08:00","service":"tranquility","host":"localhost","severity":"anomaly","description":"Failed to propagate events: druid:overlord/openOrder","data":{"exceptionType":"com.twitter.finagle.NoBrokersAvailableException","exceptionStackTrace":"com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]\n\tat com.twitter.finagle.NoStacktrace(Unknown Source)\n","timestamp":"2016-07-17T16:00:00.000Z","beams":"MergingPartitioningBeam(DruidBeam(interval = 2016-07-17T16:00:00.000Z/2016-07-17T17:00:00.000Z, partition = 0, tasks = [index_realtime_openOrder_2016-07-17T16:00:00.000Z_0_0/openOrder-016-0000-0000]))","eventCount":1,"exceptionMessage":"No hosts are available for disco!firehose:druid:overlord:openOrder-016-0000-0000, Dtab.base=[], Dtab.local=[]"}}]




The overlord.log:


2016-07-18T05:39:54,064 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: xxx.xxx.xxx.xxx:8091, status 200 OK, response: {"task":"index_realtime_openOrder_2016-07-17T05:00:00.000Z_0_0"}

2016-07-18T05:39:54,064 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_openOrder_2016-07-17T05:00:00.000Z_0_0! Are you sure the task was running?




The middleManager/runtime.properties:


druid.worker.capacity=9

The DruidBeams builder:

DruidBeams
  .builder((openOrderDO: OpenOrderDO) => openOrderDO.timestamp)
  .curator(curator)
  .discoveryPath(discoveryPath)
  .location(DruidLocation(DruidEnvironment(indexService), dataSource))
  .rollup(DruidRollup(SpecificDruidDimensions(dimensions), aggregators, QueryGranularity.MINUTE))
  .tuning(
    ClusteredBeamTuning(
      segmentGranularity = Granularity.HOUR,  // each realtime task covers one hour of data
      windowPeriod = new Period("PT10M"),     // events may arrive up to 10 minutes late
      partitions = 1,
      replicants = 1
    )
  )
  .buildBeam()
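
For reference, the beam is driven from the Spark job roughly like this (a minimal sketch following the Tranquility Spark docs; OpenOrderBeamFactory, buildOpenOrderBeam and the stream variable are placeholders of mine, not the real job code):

import com.metamx.tranquility.beam.Beam
import com.metamx.tranquility.spark.BeamFactory
import com.metamx.tranquility.spark.BeamRDD._

class OpenOrderBeamFactory extends BeamFactory[OpenOrderDO] {
  // Return a shared instance so all tasks in the same JVM reuse one Tranquility connection.
  def makeBeam: Beam[OpenOrderDO] = OpenOrderBeamFactory.BeamInstance
}

object OpenOrderBeamFactory {
  // buildOpenOrderBeam() stands for the DruidBeams.builder(...) chain shown above, ending in .buildBeam().
  lazy val BeamInstance: Beam[OpenOrderDO] = buildOpenOrderBeam()
}

// In the streaming job, each micro-batch is pushed to the realtime index tasks:
openOrderStream.foreachRDD(rdd => rdd.propagate(new OpenOrderBeamFactory))

As far as I can tell, the NoBrokersAvailableException above is thrown from this propagate call when the beam cannot find the task's firehose endpoint in service discovery.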

jianr...@alibaba-inc.com

Jul 19, 2016, 3:01:16 AM
to Druid User

The coordinator.log:


2016-07-19T06:50:07,875 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:51:07,886 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:52:07,896 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:52:07,905 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]

2016-07-19T06:53:07,907 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083] done processing [/trip/druid/loadQueue/xxx.xxx.xxx.xxx:8083/openOrder_2016-07-19T05:00:00.000Z_2016-07-19T06:00:00.000Z_2016-07-19T13:03:59.918+08:00]



The task log:


2016-07-19T06:15:00,034 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Shutting down...
2016-07-19T06:15:00,034 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-07-19T06:15:00,035 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_openOrder_2016-07-19T05:00:00.000Z_0_0] status changed to [SUCCESS].
2016-07-19T06:15:00,038 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_realtime_openOrder_2016-07-19T05:00:00.000Z_0_0",
  "status" : "SUCCESS",
  "duration" : 4256032
}
2016-07-19T06:15:00,046 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.stop()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@552c0b19].
2016-07-19T06:15:00,046 INFO [main] io.druid.server.coordination.AbstractDataSegmentAnnouncer - Stopping class io.druid.server.coordination.BatchDataSegmentAnnouncer with config[io.druid.server.initialization.ZkPathsConfig@e59eda19]
2016-07-19T06:15:00,046 INFO [main] io.druid.curator.announcement.Announcer - unannouncing [/trip/druid/announcements/xxxx.xxx.xxx.xxx:8102]
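
(A side note on timing, my own arithmetic rather than anything in the logs: the reported duration of 4256032 ms is about 71 minutes, which is roughly segmentGranularity (one hour) plus the PT10M windowPeriod, so this particular task appears to have shut down on schedule once its window closed and handoff finished.)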


On Tuesday, July 19, 2016 at 8:56:07 AM UTC+8, jianr...@alibaba-inc.com wrote:

jianr...@alibaba-inc.com

Jul 19, 2016, 3:45:21 AM
to Druid User
The historical.log:

2016-07-19T07:28:08,313 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[openOrder_2016-07-19T04:00:00.000Z_2016-07-19T05:00:00.000Z_2016-07-19T13:01:06.498+08:00], segment=DataSegment{size=1209429, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[count, rt, user_unique], dimensions=[open_order_id, app_version, trip_type], version='2016-07-19T13:01:06.498+08:00', loadSpec={type=hdfs, path=/druid/segments/openOrder/20160719T040000.000Z_20160719T050000.000Z/2016-07-19T13_01_06.498+08_00/0/index.zip}, interval=2016-07-19T04:00:00.000Z/2016-07-19T05:00:00.000Z, dataSource='openOrder', binaryVersion='9'}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[openOrder_2016-07-19T04:00:00.000Z_2016-07-19T05:00:00.000Z_2016-07-19T13:01:06.498+08:00]

at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.jar:0.9.1]

at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:350) [druid-server-0.9.1.jar:0.9.1]

at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.1.jar:0.9.1]

at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:152) [druid-server-0.9.1.jar:0.9.1]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:?]

at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]

at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.10.0.jar:?]

at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:772) [curator-recipes-2.10.0.jar:?]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_51]

at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_51]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_51]

at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_51]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_51]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_51]

at java.lang.Thread.run(Thread.java:744) [?:1.7.0_51]

Caused by: io.druid.segment.loading.SegmentLoadingException: var/druid/task/zk_druid/openOrder/2016-07-19T04:00:00.000Z_2016-07-19T05:00:00.000Z/2016-07-19T13:01:06.498+08:00/0/index.drd (No such file or directory)

at io.druid.segment.loading.MMappedQueryableIndexFactory.factorize(MMappedQueryableIndexFactory.java:52) ~[druid-server-0.9.1.jar:0.9.1]

at




On Tuesday, July 19, 2016 at 3:01:16 PM UTC+8, jianr...@alibaba-inc.com wrote:

Fangjin Yang

Jul 25, 2016, 9:22:08 PM
to Druid User
Hi,

Your historical log contains the problem. How was this cluster set up? Was it set up to be distributed? The error suggests there are configuration problems across your cluster.
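
For example, the kind of settings worth double-checking on the historical and in the common configuration look like this (paths and sizes below are placeholders of mine, not values from this thread); the segment cache should be an absolute path and the deep storage settings should match where the tasks publish segments:

druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":130000000000}]

The relative-looking path var/druid/task/zk_druid in the historical error is the kind of thing a non-absolute segment cache location might produce.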