Cannot export data to HDFS


何文斌

Apr 20, 2016, 12:45:25 PM
to gobblin-users
I am trying to pull a file from HDFS and write it back to HDFS.

Below is my job configuration:

job.name=GobblinDemo
job.group=demo
job.description=A Gobblin job for demo purpose

source.class=gobblin.example.hdfs.SimpleHdfsTextSource
#converter.classes=gobblin.example.simplejson.SimpleJsonConverter
extract.namespace=gobblin.example.simplejson

# source configuration properties
# comma-separated list of file URIs (supporting different schemes, e.g., file://, ftp://, sftp://, http://, etc)
#source.filebased.files.to.pull=hdfs://10.45.41.172:9000/user/hadoop/test.json
source.hadoop.file.input.paths=hdfs://10.45.41.172:9000/user/hadoop/test.json
# whether to use authentication or not (default is false)
source.conn.use.authentication=
# credential for authentication purpose (optional)
source.conn.domain=
source.conn.username=
source.conn.password=
# source data schema
source.schema={"namespace":"example.avro", "type":"record", "name":"User", "fields":[{"name":"name", "type":"string"}, {"name":"favorite_number", "type":"int"}, {"name":"favorite_color", "type":"string"}]}

# quality checker configuration properties
#qualitychecker.task.policies=gobblin.policies.count.RowCountPolicy,gobblin.policies.schema.SchemaCompatibilityPolicy
#qualitychecker.task.policy.types=OPTIONAL,OPTIONAL
#qualitychecker.row.policies=gobblin.policies.schema.SchemaRowCheckPolicy
#qualitychecker.row.policy.types=OPTIONAL
#qualitychecker.row.err.file=test/jobOutput


writer.fs.uri=hdfs://10.45.41.172:9000
fs.uri=hdfs://10.45.41.172:9000


data.publisher.type=gobblin.publisher.BaseDataPublisher
# Data publisher related configuration properties
# #data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/data/gobblin-yarn/job-output
data.publisher.replace.final.dir=false
#
# # Directory where job/task state files are stored
state.store.dir=/data/gobblin-yarn/state-store
state.store.fs.uri=hdfs://10.45.41.172:9000

# writer configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=/data/gobblin-yarn/task-staging
writer.output.dir=/data/gobblin-yarn/task-output


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

But I get an error; the stack trace is below.

On the app master:

2016-04-20 23:32:15 CST ERROR [JobScheduler-0] gobblin.runtime.AbstractJobLauncher  - Failed to clean leftover staging data
java.lang.IllegalArgumentException: Wrong FS: hdfs://_append, expected: hdfs://10.45.41.172:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
        at gobblin.util.JobLauncherUtils.cleanTaskStagingData(JobLauncherUtils.java:212)
        at gobblin.runtime.AbstractJobLauncher.cleanLeftoverStagingData(AbstractJobLauncher.java:704)
        at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:259)
        at gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:335)
        at gobblin.yarn.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:102)
        at gobblin.yarn.GobblinHelixJobScheduler$NonScheduledJobRunner.run(GobblinHelixJobScheduler.java:148)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


On the work units:

2016-04-21 00:33:06 CST ERROR [JobScheduler-0] gobblin.yarn.GobblinHelixJobScheduler$NonScheduledJobRunner  - Failed to run job GobblinDemo
gobblin.runtime.JobException: Failed to run job GobblinDemo
        at gobblin.yarn.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:104)
        at gobblin.yarn.GobblinHelixJobScheduler$NonScheduledJobRunner.run(GobblinHelixJobScheduler.java:148)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: gobblin.runtime.JobException: Failed to launch and run job GobblinDemo
        at gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:341)
        at gobblin.yarn.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:102)
        ... 4 more
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://_append, expected: hdfs://10.45.41.172:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
        at gobblin.util.JobLauncherUtils.cleanTaskStagingData(JobLauncherUtils.java:212)
        at gobblin.runtime.AbstractJobLauncher.cleanupStagingDataPerTask(AbstractJobLauncher.java:766)
        at gobblin.runtime.AbstractJobLauncher.cleanupStagingData(AbstractJobLauncher.java:743)
        at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:318)
        at gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:335)


Has anyone else hit this problem, or can someone help?

I built Gobblin from the master branch.
 

Sahil Takiar

Apr 20, 2016, 8:23:18 PM
to 何文斌, gobblin-users
What version of Hadoop are you running on?


何文斌

Apr 20, 2016, 9:57:13 PM
to gobblin-users, tianh...@gmail.com
2.7.2, and I built Gobblin against this version.

On Thursday, April 21, 2016 at 8:23:18 AM UTC+8, Sahil Takiar wrote:

何文斌

Apr 21, 2016, 4:10:29 AM
to gobblin-users, tianh...@gmail.com

I think I found something. In the class JobLauncherUtils there is a method getWriterStagingDir.

The method returns a path composed of a parent and a child:

String parent = state.getProp(
    ForkOperatorUtils.getPropertyNameForBranch(ConfigurationKeys.WRITER_STAGING_DIR, numBranches, branchId));
Path child = WriterUtils.getWriterFilePath(state, numBranches, branchId);


In my case the child returns "//_append" and the parent is "hdfs://10.45.41.172:9000//user/hadoop/gobblin-yarn/task-staging".

Path path = new Path(parent, child);
return path;

and the returned path is just "hdfs://_append".

This is the problem. I am not sure whether I used a wrong config or whether this is a bug.
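
For illustration, here is a minimal standalone sketch (plain Hadoop Path outside Gobblin; the class name is just for the demo) that reproduces the resolution behavior:

import org.apache.hadoop.fs.Path;

public class PathResolutionDemo {
    public static void main(String[] args) {
        Path parent = new Path("hdfs://10.45.41.172:9000/user/hadoop/gobblin-yarn/task-staging");

        // A child starting with "//" is parsed as a scheme-relative URI:
        // "_append" becomes the URI authority, so resolving it against the
        // parent keeps only the scheme and drops the parent's authority
        // and path.
        System.out.println(new Path(parent, "//_append"));
        // prints: hdfs://_append

        // Without the leading slashes, the child resolves under the parent.
        System.out.println(new Path(parent, "_append"));
        // prints: hdfs://10.45.41.172:9000/user/hadoop/gobblin-yarn/task-staging/_append
    }
}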

Thanks.


On Thursday, April 21, 2016 at 8:23:18 AM UTC+8, Sahil Takiar wrote:
What version of Hadoop are you running on?

Sahil Takiar

Apr 21, 2016, 3:04:39 PM
to 何文斌, gobblin-users
This looks like a bug, but I think there may be a way around it. Can you set "writer.file.path=output" and see if that works?
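
For reference, the property would sit with the other writer properties in the job file. A sketch of the suggested workaround (untested on my side):

# writer configuration properties
writer.staging.dir=/data/gobblin-yarn/task-staging
writer.output.dir=/data/gobblin-yarn/task-output
# proposed workaround: an explicit writer file path replaces the derived
# child path, so it should no longer begin with "//"
writer.file.path=output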

--Sahil
