Hi All,
I have setup a druid cluster with latest RC - 0.6.160 and using HDFS (hadoop-2.4.x) as deep storage. I am running my hourly index jobs using local firehose.
The indexed segments for about 5 days were successfully uploaded to HDFS and compute nodes downloaded them without any issues.
However compute nodes seem to be stuck now as I don't see them downloading new segments from HDFS even though I see corresponding entries in coordinator logs. I don't see any relevant entries in zookeeper logs either.
I didn't configure any custom rules - default rule is active :
id dataSource version payload
_default_2014-10-31T21:48:25.308Z _default 2014-10-31T21:48:25.308Z [{"tieredReplicants":{"_default_tier":2},"type":"loadForever"}]
Here are the messages I see from logs. Not sure why compute nodes suddenly stopped downloading new segments. Can someone please help me understand the issue.
---- coordinator log requesting compute node to download segment data. I don't see any response in compute node and ZK logs.
2014-11-04 19:33:39,991 INFO [Coordinator-Exec--0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/prod/loadQueue/druid-historical4-413974.XXXX:8080] to load segment[mors_2014-10-29T10:00:00.000Z_2014-10-29T11:00:00.000Z_2014-11-04T19:18:52.733Z]
---- queued jobs which doesn't get loaded
2014-11-04 19:32:39,982 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[druid-historical4-413974.XXXX:8080, historical, _default_tier] has 20 left to load, 5 left to drop, 428,516,430 bytes queued, 234,046,722 bytes served.
2014-11-04 19:32:39,982 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[druid-historical3-413973. XXXX:8080, historical, _default_tier] has 37 left to load, 4 left to drop, 616,042,313 bytes queued, 60,321,960 bytes served.
2014-11-04 19:32:39,982 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[druid-historical2-413396. XXXX:8080, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 687,011,066 bytes served.
2014-11-04 19:32:39,982 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[druid-historical1-410557. XXXX:8080, historical, _default_tier] has 8 left to load, 3 left to drop, 174,584,976 bytes queued, 501,216,378 bytes served.
I also see the below errors :
2014-11-04 19:33:39,988 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue stuck after 15+ runs!: {class=io.druid.server.coordinator.ReplicationThrottler, segments=[mors_2014-10-31T22:00:00.000Z_2014-10-31T23:00:00.000Z_2014-11-03T05:04:17.356Z ON druid-historical3-413973.XXXX:8080]}
2014-11-04 19:33:39,988 INFO [Coordinator-Exec--0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2014-11-04T19:33:39.988Z","service":"druid/prod/coordinator","host":"druid-coordinator1-410556.XXXX:8080","severity":"component-failure","description":"[_default_tier]: Replicant create queue stuck after 15+ runs!","data":{"class":"io.druid.server.coordinator.ReplicationThrottler","segments":["mors_2014-10-31T22:00:00.000Z_2014-10-31T23:00:00.000Z_2014-11-03T05:04:17.356Z ON druid-historical3-413973.XXXX:8080"]}}]
2014-11-04 19:34:40,404 ERROR [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Balancer move segments queue has a segment stuck: {class=io.druid.server.coordinator.helper.DruidCoordinatorBalancer, segment=mors_2014-10-30T18:00:00.000Z_2014-10-30T19:00:00.000Z_2014-11-04T04:35:05.122Z, server=DruidServerMetadata{name='druid-historical2-413396.XXXX:8080', host='druid-historical2-413396.XXXX:8080', maxSize=20000000000, tier='_default_tier', type='historical', priority='0'}}
2014-11-04 19:44:17,672 ERROR [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/prod/loadQueue/druid-historical4-413974.XXXX:8080], throwable caught when submitting [SegmentChangeRequestLoad{segment=DataSegment{size=30151395, shardSpec=NoneShardSpec, metrics=[rowcount], dimensions=[channel, e, incc, mpid, oid, page, ppid], version='2014-11-04T04:53:12.006Z', loadSpec={type=hdfs, path=hdfs://XXXX:8020/user/dgdeploy/druid/segments/mors/20141031T160000.000Z_20141031T170000.000Z/2014-11-04T04_53_12.006Z/0/index.zip}, interval=2014-10-31T16:00:00.000Z/2014-10-31T17:00:00.000Z, dataSource='mors', binaryVersion='9'}}].
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /druid/prod/loadQueue/druid-historical4-413974.XXXX:8080/mors_2014-10-31T16:00:00.000Z_2014-10-31T17:00:00.000Z_2014-11-04T04:53:12.006Z
at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688)
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668)
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
at io.druid.server.coordinator.LoadQueuePeon$1.run(LoadQueuePeon.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
In zookeeper logs
10557.XXXX:8080/mors_2014-11-01T16:00:00.000Z_2014-11-01T17:00:00.000Z_2014-11-03T06:04:55.648Z
2014-11-04 12:01:23,995 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1497b91f74d0010 type:create cxid:0x279 zxid:0x4b3 txntype:-1 reqpath:n/a Error Path:/druid/prod/loadQueue/druid-historical4-413974.XXXX:8080/mors_2014-10-30T15:00:00.000Z_2014-10-30T16:00:00.000Z_2014-11-03T01:17:18.413Z Error:KeeperErrorCode = NodeExists for /druid/prod/loadQueue/druid-historical4-413974.XXXX:8080/mors_2014-10-30T15:00:00.000Z_2014-10-30T16:00:00.000Z_2014-11-03T01:17:18.413Z
2014-11-04 12:01:24,009 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1497b91f74d0010 type:create cxid:0x27a zxid:0x4b4 txntype:-1 reqpath:n/a Error Path:/druid/prod/loadQueue/druid-historical1-410557.XXXX:8080/mors_2014-11-01T09:00:00.000Z_2014-11-01T10:00:00.000Z_2014-11-03T05:40:07.589Z Error:KeeperErrorCode = NodeExists for /druid/prod/loadQueue/druid-historical1-410557.XXXX:8080/mors_2014-11-01T09:00:00.000Z_2014-11-01T10:00:00.000Z_2014-11-03T05:40:07.589Z
2014-11-04 12:01:24,022 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1497b91f74d0010 type:create cxid:0x27b zxid:0x4b5 txntype:-1 reqpath:n/a Error Path:/druid/prod/loadQueue/druid-historical4-413974.XXXX:8080/mors_2014-10-30T07:00:00.000Z_2014-10-30T08:00:00.000Z_2014-11-02T19:11:15.806Z Error:KeeperErrorCode = NodeExists for /druid/prod/loadQueue/druid-historical4-413974.XXXX:8080/mors_2014-10-30T07:00:00.000Z_2014-10-30T08:00:00.000Z_2014-11-02T19:11:15.806Z
2014-11-04 12:01:24,035 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x1497b91f74d0010 type:create cxid:0x27c zxid:0x4b6 txntype:-1 reqpath:n/a Error Path:/druid/prod/loadQueue/druid-historical1-410557.XXXX:8080/mors_2014-10-30T17:00:00.000Z_2014-10-30T18:00:00.000Z_2014-11-03T01:27:38.793Z Error:KeeperErrorCode = NodeExists for /druid/prod/loadQueue/druid-historical1-410557.XXXX:8080/mors_2014-10-30T17:00:00.000Z_2014-10-30T18:00:00.000Z_2014-11-03T01:27:38.793Z