Indexing service: task not allotted to worker


nikhil sharma

Oct 28, 2015, 9:05:16 PM
to Druid User
Hi,
I'm trying to ingest some test data to understand how the indexing service works.
I used this command to post a task to the overlord:
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8090/druid/indexer/v1/task
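(Side note: the POST response returns the task ID; assuming the standard indexer API paths, the status of that single task can also be checked with something like the following, where <taskId> is whatever the POST returned:)

curl localhost:8090/druid/indexer/v1/task/<taskId>/status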

The overlord log shows that the task is received and added to the pending list, but it's not reaching my worker process.
2015-10-28T22:37:57,468 INFO [qtp2016593579-93] io.druid.indexing.overlord.HeapMemoryTaskStorage - Inserting task index_wikipedia_2015-10-28T22:37:57.465Z with status: TaskStatus{id=index_wikipedia_2015-10-28T22:37:57.465Z, status=RUNNING, duration=-1}
2015-10-28T22:37:57,472 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-10-28T22:37:57.465Z]: LockTryAcquireAction{interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z}
2015-10-28T22:37:57,475 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_wikipedia_2015-10-28T22:37:57.465Z, dataSource=wikipedia, interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z, version=2015-10-28T22:37:57.474Z}, taskIds=[]}
2015-10-28T22:37:57,476 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Added task[index_wikipedia_2015-10-28T22:37:57.465Z] to TaskLock[index_wikipedia_2015-10-28T22:37:57.465Z]
2015-10-28T22:37:57,476 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_wikipedia_2015-10-28T22:37:57.465Z
2015-10-28T22:37:57,476 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_wikipedia_2015-10-28T22:37:57.465Z
2015-10-28T22:37:57,479 WARN [pool-12-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.
2015-10-28T22:38:52,543 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).
2015-10-28T22:39:52,543 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

This is the middle manager config:

cat config/middleManager/runtime.properties

#druid.host=xxxxxx3en0003.xxx.xxxxx.com
druid.port=8080
druid.service=druid/middlemanager

# Store task logs in deep storage
druid.indexer.logs.type=file
druid.indexer.logs.directory=/var/log/druid/indexerlogs

# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
druid.indexer.task.baseTaskDir=/druid01/peon/tasks

# Peon properties
druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/druid01/zk_druid", "maxSize": 0},{"path": "/druid02/zk_druid", "maxSize": 0},{"path": "/druid03/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.server.http.numThreads=50
druid.indexer.fork.property.druid.storage.archiveBaseKey=prod
druid.indexer.fork.property.druid.storage.archiveBucket=
druid.indexer.fork.property.druid.storage.baseKey=prod/v1
druid.indexer.fork.property.druid.storage.bucket=druid
druid.indexer.fork.property.druid.storage.type=local

druid.worker.capacity=9
druid.worker.ip=#{IP_ADDR}
druid.worker.version=#{WORKER_VERSION}

Overlord config:

druid.host=xxxxxx3en0003.xxx.xxxxxx.com
druid.port=8090
druid.service=overlord

# Run the overlord in local mode with a single peon to execute tasks
# This is not recommended for production.
druid.indexer.queue.startDelay=PT0M

# This setting is too small for real production workloads
druid.indexer.runner.javaOpts=-server -Xmx1g

# These settings are also too small for real production workloads
# Please see our recommended production settings in the docs (http://druid.io/docs/latest/Production-Cluster-Configuration.html)
druid.indexer.fork.property.druid.processing.numThreads=6
druid.indexer.fork.property.druid.computation.buffer.size=100000000

# Uncomment following property if you are running Middle Manager
druid.indexer.runner.type=remote


When I started the worker process, the overlord log showed this:

2015-10-28T22:31:23,620 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[xxxxxx3en0003.xxx.xxxxxx.com:8080] reportin' for duty!


It looks like the worker identified itself to the overlord correctly, but somehow the overlord is not able to hand the task over to the worker.

Thanks for your help,

Nikhil.

Fangjin

Oct 28, 2015, 9:36:05 PM
to Druid User
Hi Nikhil, are there any other tasks running in the cluster or is it only the one? If the task is stuck in pending state, it should be waiting on a lock. Do you have additional logs from your overlord or perhaps a screenshot of the overlord console?

-- FJ


nikhil sharma

Oct 28, 2015, 10:16:13 PM
to Druid User
No Fangjin, there is only one task submitted to run. In fact, when I submitted the same task again, it showed that it is still trying to acquire the lock:

2015-10-29T01:32:52,573 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 2 tasks from storage (0 tasks added, 0 tasks removed).

2015-10-29T01:32:52,574 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-10-29T01:24:08.099Z]: LockTryAcquireAction{interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z}

2015-10-29T01:33:52,573 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 2 tasks from storage (0 tasks added, 0 tasks removed).

2015-10-29T01:33:52,574 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-10-29T01:24:08.099Z]: LockTryAcquireAction{interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z}





Although I eventually want to run the overlord in remote mode, to try things out I copied the middle manager config into the overlord config and commented out druid.indexer.runner.type=remote. The task made it to a peon and threw this error:

2015-10-29T01:40:07,594 INFO [task-runner-0] io.druid.segment.realtime.firehose.LocalFirehoseFactory - Found files: [/usr/lib/druid/examples/indexing/wikipedia_data.json]

2015-10-29T01:40:07,656 ERROR [task-runner-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Uncaught Throwable while running task[IndexTask{id=index_wikipedia_2015-10-29T01:40:00.155Z, type=index, dataSource=wikipedia}]

java.lang.ExceptionInInitializerError

        at io.druid.indexing.common.task.IndexTask.convertTuningConfig(IndexTask.java:122) ~[druid-indexing-service-0.8.1.jar:0.8.1]

        at io.druid.indexing.common.task.IndexTask.generateSegment(IndexTask.java:344) ~[druid-indexing-service-0.8.1.jar:0.8.1]

        at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:205) ~[druid-indexing-service-0.8.1.jar:0.8.1]

        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:235) [druid-indexing-service-0.8.1.jar:0.8.1]

        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:214) [druid-indexing-service-0.8.1.jar:0.8.1]

        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_79]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_79]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_79]

        at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]

Caused by: java.lang.IllegalStateException: Failed to create directory within 10000 attempts (tried 1446082807605-0 to 1446082807605-9999)

        at com.google.common.io.Files.createTempDir(Files.java:600) ~[guava-16.0.1.jar:?]

        at io.druid.segment.indexing.RealtimeTuningConfig.<clinit>(RealtimeTuningConfig.java:43) ~[druid-server-0.8.1.jar:0.8.1]

        ... 9 more

2015-10-29T01:40:07,667 ERROR [main] io.druid.cli.CliPeon - Error when starting up.  Failing.

java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError

        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

        at io.druid.indexing.worker.executor.ExecutorLifecycle.join(ExecutorLifecycle.java:162) ~[druid-indexing-service-0.8.1.jar:0.8.1]

        at io.druid.cli.CliPeon.run(CliPeon.java:225) [druid-services-0.8.1.jar:0.8.1]

        at io.druid.cli.Main.main(Main.java:91) [druid-services-0.8.1.jar:0.8.1]



overlord command:

java -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Xms10g -Xmx10g -XX:MaxNewSize=1g -XX:NewSize=1g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSCompactAtFullCollection -Xloggc:/var/log/druid/druid-overlord-gc.log.201510281838 -XX:ErrorFile=/var/log/druid/druid-overlord-hs_err_pid.log -Dlog4j.configuration=/usr/lib/druid/config/_common/log4j.properties -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/hadoop/druid/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Ddruid.extensions.localRepository=./extensions-repo -classpath /usr/lib/druid/config/_common:/usr/lib/druid/config/overlord:/usr/lib/druid/lib/* io.druid.cli.Main server overlord


Thanks,
Nikhil.

Himanshu

Oct 29, 2015, 1:41:06 AM
to Druid User
Hi,

For the "Caused by: java.lang.IllegalStateException: Failed to create directory within 10000 attempts (tried 1446082807605-0 to 1446082807605-9999)"
does "/hadoop/druid/tmp" exist on your disk and writable by the user starting the druid process?

For the original problem, as FJ said, a screenshot of the overlord console would be useful.

-- Himanshu

Message has been deleted

Jonathan Wei

Oct 30, 2015, 5:22:06 PM
to druid...@googlegroups.com
Hi Nikhil,

I think Himanshu is referring to this parameter in the overlord startup command:

-Djava.io.tmpdir=/hadoop/druid/tmp

The exception you see here:

Caused by: java.lang.IllegalStateException: Failed to create directory within 10000 attempts (tried 1446082807605-0 to 1446082807605-9999)

 at io.druid.segment.indexing.RealtimeTuningConfig.<clinit>(RealtimeTuningConfig.java:43) ~[druid-server-0.8.1.jar:0.8.1]

is generated by a function that tries to create a temporary directory under the location specified by java.io.tmpdir.
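So either make sure that directory exists and is writable, or point java.io.tmpdir at one that does. Roughly (the path below is only an example), on the overlord command line and, if the peons should use it as well, in druid.indexer.runner.javaOpts:

mkdir -p /druid01/tmp
# then add to the relevant JVM options:
-Djava.io.tmpdir=/druid01/tmp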

- Jon


On Fri, Oct 30, 2015 at 11:18 AM, nikhil sharma <nikhi...@gmail.com> wrote:
Hi Himanshu,
I am not using Hadoop in my setup currently. Are you referring to the value of the peon config key 'druid.indexer.task.baseTaskDir'?
It's configured as:

druid.indexer.task.baseTaskDir=/druid01/peon/tasks



Thanks,
Nikhil.

Himanshu

Oct 30, 2015, 5:33:23 PM
to Druid User
Thanks Jonathan Wei.

Yes, I mean the tmp dir specified in the command.

-- Himanshu

nikhil sharma

Nov 3, 2015, 7:54:08 PM
to Druid User
Thanks, guys.
Yes, I was able to change the -Djava.io.tmpdir param to point to a different directory and got it working. That's the reason I removed my response from this thread.
So the overlord is working fine in local mode now. I'm going to switch back and test it in remote mode soon.

For remote mode I'm going to use an NFS mount as deep storage for now. I have a folder on the overlord machine which I've exported to the historical and coordinator nodes in the cluster.
Please let me know if it's OK to proceed with NFS (I plan to switch to HDFS after some time).
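For reference, this is roughly what I plan to put in the common properties for the NFS-backed deep storage (the mount path is just my local choice, and I'm assuming the standard local deep-storage properties from the docs):

druid.storage.type=local
# NFS export from the overlord machine, mounted at the same path on every node
druid.storage.storageDirectory=/druid01/deepstorage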

nikhil sharma

Nov 4, 2015, 8:47:08 PM
to Druid User
Hi FJ,
To make sure that no tasks were already running, I restarted the overlord, middle managers, historical and coordinator.
When I started the middle managers I could see this message in the overlord log:

2015-11-05T01:05:52,571 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[xxxxxen0003.xxx.xxxxxx.com:8080] reportin' for duty!
2015-11-05T01:05:52,571 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cancelling Worker[xxxxxx3en0003.xxx.xxxxxx.com:8080] scheduled task cleanup
2015-11-05T01:06:21,121 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 0 tasks from storage (0 tasks added, 0 tasks removed).


2015-11-05T01:06:58,734 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[xxxxx3en0004.xxx.xxxxxx.com:8080] reportin' for duty!
2015-11-05T01:06:58,734 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cancelling Worker[xxxxx3en0004.xxx.xxxxxx.com:8080] scheduled task cleanup

Not sure what the 'Cancelling Worker' message indicates.

Again I tried submitting the wikipedia index task to the overlord via a POST request. I got the same log messages as a few days before and did not see anything in the worker logs:

2015-11-05T01:31:23,007 INFO [TaskQueue-Manager] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2015-11-05T01:31:23.000Z]: LockTryAcquireAction{interval=2015-08-31T00:00:00.000Z/2015-09-01T00:00:00.000Z}

2015-11-05T01:31:23,010 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_wikipedia_2015-11-05T01:31:23.000Z, dataSource=wikipedia, interval=2015-08-31T00:00:00.000Z/2015-09-01T00:00:00.000Z, version=2015-11-05T01:31:23.008Z}, taskIds=[]}

2015-11-05T01:31:23,010 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskLockbox - Added task[index_wikipedia_2015-11-05T01:31:23.000Z] to TaskLock[index_wikipedia_2015-11-05T01:31:23.000Z]

2015-11-05T01:31:23,011 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_wikipedia_2015-11-05T01:31:23.000Z

2015-11-05T01:31:23,011 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_wikipedia_2015-11-05T01:31:23.000Z

2015-11-05T01:31:23,013 WARN [pool-12-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.


2015-11-05T01:32:21,126 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

2015-11-05T01:33:21,126 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).

2015-11-05T01:34:21,126 INFO [TaskQueue-StorageSync] io.druid.indexing.overlord.TaskQueue - Synced 1 tasks from storage (0 tasks added, 0 tasks removed).


The coordinator console is not accessible to me as of now. Is there a REST API I can use to get a list of all tasks?
I'm running the overlord in remote mode on port 8090 and the middle managers on 8080.

Thanks for your help,
Nikhil.


Himanshu

Nov 5, 2015, 1:15:00 PM
to Druid User
Hi,

If you use NFS as deep storage, it will be either very difficult or impossible to use Hadoop indexing, depending on how your Hadoop cluster is set up. So you would have to use the index task, which has to process all the input data on a single node, and you will be limited in how far you can scale your batch ingestion.
If you don't have batch ingestion, then it should be fine, I guess.

-- Himanshu

nikhil sharma

Nov 5, 2015, 5:45:01 PM
to Druid User
Thanks, Himanshu.
Could you please give me some pointers about the 'overlord not delegating the task to workers' problem too (please check my last comment on this thread)?
Do I need to have the middle manager and overlord on separate machines? Right now I have one middle manager on the same machine as the overlord and another on a separate machine.

Thanks,
Nikhil.

Himanshu Gupta

Nov 5, 2015, 11:49:31 PM
to Druid User
Hi,

Technically, it is OK to run the overlord and middle manager processes on the same machine. The overlord is Druid's poor man's task scheduling/management framework, and the middle managers act as the workers. You can open "http://overlord_host:overlord_port/console.html" to see what tasks are running and which middle managers are in the system (i.e. recognized by the overlord). There is a REST API for everything the overlord console shows (the console works by making those calls); see http://druid.io/docs/0.8.1/design/indexing-service.html#overlord-node .
I would look at the overlord console to see whether all the middle managers are recognized (maybe not, if there was a config issue), whether the task runs at all, and what the task log says.
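For example, something along these lines should work against your overlord (same host/port you use to submit tasks; double-check the exact endpoints against the docs for your version):

curl http://localhost:8090/druid/indexer/v1/workers        # middle managers the overlord has recognized
curl http://localhost:8090/druid/indexer/v1/pendingTasks   # tasks waiting to be assigned to a worker
curl http://localhost:8090/druid/indexer/v1/runningTasks   # tasks currently assigned and running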

-- Himanshu

nikhil sharma

Nov 6, 2015, 6:05:28 PM
to Druid User
Yes, you are right. I enabled 'debug' logging for the overlord process and found out that there was a configuration issue with the workers.
I fixed that and got it to work fine.

Thanks for the help.
Nikhil.