Druid ingestion jobs fail after Kerberizing the Hadoop cluster.


Mohanraj Anandan

Jan 24, 2017, 5:39:59 AM
to Druid Development
Before the Hadoop cluster was Kerberized, Druid's Hadoop ingestion worked fine. The jobs started failing after the cluster was Kerberized.
All Druid nodes (MiddleManager, Overlord, Historical, Broker, Coordinator) have been verified to talk to the KDC server, and normal Hadoop commands work fine from these machines.
I see the error below in the MiddleManager log after the ingestion is kicked off. The MiddleManager talks to HDFS and obtains the Kerberos delegation token needed to launch the job; the job then goes through two attempts and fails.
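
For reference, here is a minimal sketch of the Kerberos-related settings in each node's common.runtime.properties (the principal and keytab path below are placeholders for illustration, not our actual values):

# Identity Druid uses when authenticating to the Kerberized Hadoop cluster.
# Principal and keytab path are placeholders, not real values.
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.keytab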

Can anyone help with this issue?


[INFO ] 2017-01-24 04:39:47.589 [task-runner-0-priority-0] FileInputFormat - Total input paths to process : 50
[INFO ] 2017-01-24 04:39:47.959 [task-runner-0-priority-0] JobSubmitter - number of splits:50
[INFO ] 2017-01-24 04:39:48.076 [task-runner-0-priority-0] JobSubmitter - Submitting tokens for job: job_1484691336524_131839
[INFO ] 2017-01-24 04:39:48.076 [task-runner-0-priority-0] JobSubmitter - Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:stampy, Ident: (HDFS_DELEGATION_TOKEN token 1028122 for pp_dt_fpti_batch)
[INFO ] 2017-01-24 04:39:48.213 [task-runner-0-priority-0] YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
[INFO ] 2017-01-24 04:39:50.185 [task-runner-0-priority-0] YarnClientImpl - Application submission is not finished, submitted application application_1484691336524_131839 is still in NEW
[INFO ] 2017-01-24 04:39:56.209 [task-runner-0-priority-0] YarnClientImpl - Application submission is not finished, submitted application application_1484691336524_131839 is still in NEW
[INFO ] 2017-01-24 04:39:57.817 [task-runner-0-priority-0] YarnClientImpl - Submitted application application_1484691336524_131839
[INFO ] 2017-01-24 04:39:57.854 [task-runner-0-priority-0] Job - The url to track the job: http://lvshdc5en0012.lvs.paypal.com:8088/proxy/application_1484691336524_131839/
[INFO ] 2017-01-24 04:39:57.854 [task-runner-0-priority-0] DetermineHashedPartitionsJob - Job druid_ingest-determine_partitions_hashed-Optional.of([2017-01-17T00:00:00.000Z/2017-01-17T01:00:00.000Z]) submitted, status available at: http://lvshdc5en0012.lvs.paypal.com:8088/proxy/application_1484691336524_131839/
[INFO ] 2017-01-24 04:39:57.855 [task-runner-0-priority-0] Job - Running job: job_1484691336524_131839
[INFO ] 2017-01-24 04:40:05.878 [task-runner-0-priority-0] Job - Job job_1484691336524_131839 running in uber mode : false
[INFO ] 2017-01-24 04:40:05.879 [task-runner-0-priority-0] Job -  map 0% reduce 0%
[INFO ] 2017-01-24 04:40:05.890 [task-runner-0-priority-0] Job - Job job_1484691336524_131839 failed with state FAILED due to: Application application_1484691336524_131839 failed 2 times due to AM Container for appattempt_1484691336524_131839_000002 exited with  exitCode: 1
For more detailed output, check application tracking page:http://lvshdc5en0012.lvs.paypal.com:8088/proxy/application_1484691336524_131839/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1484691336524_131839_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : user is pp_dt_fpti_batch
main : requested yarn user is pp_dt_fpti_batch

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
[INFO ] 2017-01-24 04:40:05.910 [task-runner-0-priority-0] Job - Counters: 0
[ERROR] 2017-01-24 04:40:05.911 [task-runner-0-priority-0] DetermineHashedPartitionsJob - Job failed: job_1484691336524_131839
[INFO ] 2017-01-24 04:40:05.912 [task-runner-0-priority-0] JobHelper - Deleting path[hdfs://stampy/sys/dt/fpti/druid/workingPath/druid_ingest/2017-01-24T043937.904Z_dd58b1f1d85f48b4a1c8f5e51e65d8ee]
[ERROR] 2017-01-24 04:40:05.935 [task-runner-0-priority-0] ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=druid_ingest-17-01-16-16-10zN9oKA, type=index_hadoop, dataSource=druid_ingest}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2.jar:0.9.2]
        at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:175) ~[druid-indexing-service-0.9.2.jar:0.9.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2.jar:0.9.2]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2.jar:0.9.2]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_73]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_73]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_73]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_73]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_73]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_73]
        at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_73]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]
        ... 7 more
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.DetermineHashedPartitionsJob] failed!
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
        at io.druid.indexer.HadoopDruidDetermineConfigurationJob.run(HadoopDruidDetermineConfigurationJob.java:91) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:291) ~[druid-indexing-service-0.9.2.jar:0.9.2]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_73]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_73]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_73]
        at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_73]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]
        ... 7 more
[INFO ] 2017-01-24 04:40:05.944 [task-runner-0-priority-0] TaskRunnerUtils - Task [druid_ingest-17-01-16-16-10zN9oKA] status changed to [FAILED].
[INFO ] 2017-01-24 04:40:05.947 [task-runner-0-priority-0] ExecutorLifecycle - Task completed with status: {
  "id" : "druid_ingest-17-01-16-16-10zN9oKA",
  "status" : "FAILED",
  "duration" : 22621
}
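
To dig into the AM failure beyond the tracking page, the aggregated container logs can be fetched with the YARN CLI (this assumes log aggregation is enabled on the cluster; the application ID is the one from the log above):

# Fetch the aggregated logs for the failed application,
# including the AM container's stdout/stderr
yarn logs -applicationId application_1484691336524_131839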
