S3-MapReduce not working


laksh...@tokbox.com

Mar 23, 2016, 1:04:00 AM
to gobblin-users
Hello Folks,

I have gotten Gobblin to work in the following modes:

1) Kafka -----> HDFS (MapReduce mode)
2) Kafka -----> S3 (standalone mode)

I am using Hadoop 2.6.0 and running it on EMR. The challenge I have is running Kafka to S3 in MapReduce mode. I have spent a few hours on it and read the Kafka-to-S3 publisher guide multiple times; it has an example of running Kafka to S3 in standalone mode, but not in MapReduce mode.

Here is my configuration file:

fs.uri=file:///
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://testbucket/
data.publisher.fs.uri=s3a://testbucket/
fs.s3a.access.key=XXX
fs.s3a.secret.key=XXX
fs.s3a.buffer.dir=/tmp
# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output
# Data publisher related configuration properties
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output
data.publisher.replace.final.dir=false
# Directory where job/task state files are stored
state.store.dir=${env:GOBBLIN_WORK_DIR}/state-store
# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err
# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks
# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics
# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000
# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working

I have set GOBBLIN_WORK_DIR=/home/gobblinoutput.

I am getting the following exception:

 Failed to launch and run job job_GobblinKafkaQuickStart_1458708541198: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://10.0.1.227:8020/home/gobblinoutput/working/GobblinKafkaQuickStart/input/job_GobblinKafkaQuickStart_1458708541198.wulist
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://10.0.1.227:8020/home/gobblinoutput/working/GobblinKafkaQuickStart/input/job_GobblinKafkaQuickStart_1458708541198.wulist
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
        at org.apache.hadoop.mapreduce.lib.input.NLineInputFormat.getSplits(NLineInputFormat.java:82)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
        at gobblin.runtime.mapreduce.MRJobLauncher.runWorkUnits(MRJobLauncher.java:200)
        at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:285)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:87)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:64)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:110)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Appreciate any help.

Sahil Takiar

Mar 24, 2016, 4:50:28 PM
to Lakshmanan Muthuraman, gobblin-users
You should set the "fs.uri" key to point to the NameNode URI of your MR cluster, e.g. hdfs://namenode-name:9000
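
For illustration, here is a minimal sketch of how the relevant properties from the job file above might look after that change. The host and port in namenode-host:8020 are placeholders for your cluster's actual NameNode URI (the stack trace suggests 10.0.1.227:8020 in this case); the S3A state-store and publisher settings stay as originally posted:

# Default file system now points at the MR cluster's NameNode (placeholder host/port)
fs.uri=hdfs://namenode-host:8020
writer.fs.uri=${fs.uri}
# State store and published data still go to S3A, as in the original job file
state.store.fs.uri=s3a://testbucket/
data.publisher.fs.uri=s3a://testbucket/

With fs.uri=file:///, the .wulist work-unit files appear to be written to the local file system while the MR job looks for them on HDFS, hence the "Input path does not exist" error above; pointing fs.uri at the NameNode keeps both sides on the same file system.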

laksh...@tokbox.com

Mar 24, 2016, 11:03:51 PM
to gobblin-users, laksh...@tokbox.com
Thanks, Sahil. It works.