gobblin on dataproc


Henrique G. Abreu

May 19, 2017, 3:02:11 PM5/19/17
to gobblin-users
I'm trying to run Gobblin on Google Dataproc, but I'm facing quite a few issues. The latest one I'm stuck on is:

WARN [JobContext] Property task.data.root.dir is missing. 
WARN [KafkaSource] Previous offset for partition mytopic:0 does not exist. This partition will start from the earliest offset: 0 
WARN [KafkaSource] Previous offset for partition mytopic:1 does not exist. This partition will start from the earliest offset: 0 
WARN [TaskStateCollectorService] No output task state files found in gobblin/kafka2gcs/job_kafka2gcs_1495211213990/output/job_kafka2gcs_1495211213990 
WARN [SafeDatasetCommit] Not committing dataset  of job job_kafka2gcs_1495211213990 with commit policy COMMIT_ON_FULL_SUCCESS and state FAILED 
ERROR [JobContext] Iterator executor failure. 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Not committing dataset  of job job_kafka2gcs_1495211213990 with commit policy COMMIT_ON_FULL_SUCCESS and state FAILED
... stacktrace continues 

I have no idea why the "Iterator executor" failed. If anyone has any insights, please help.

Here is my gobblin job:

job.name=kafka2gcs
job.group=gkafka2gcs
job.description=Gobblin job to read messages from Kafka and save as is on GCS
job.lock.enabled=false

kafka.brokers=mykafka:9092
topic.whitelist=mytopic
bootstrap.with.offset=earliest

source.class=gobblin.source.extractor.extract.kafka.KafkaDeserializerSource
kafka.deserializer.type=BYTE_ARRAY
extract.namespace=nskafka2gcs

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
mr.job.max.mappers=2
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher
metrics.enabled=false

fs.uri=file:///.
writer.fs.uri=${fs.uri}
mr.job.root.dir=gobblin
writer.output.dir=${mr.job.root.dir}/out
writer.staging.dir=${mr.job.root.dir}/stg

fs.gs.project.id=my-test-project
data.publisher.fs.uri=gs://my-bucket
state.store.fs.uri=${data.publisher.fs.uri}
data.publisher.final.dir=gobblin/pub
state.store.dir=gobblin/state

These are the commands I issue for dataproc:

gcloud dataproc clusters create myspark \
  --image-version 1.1 \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 10 \
  --num-workers 2 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 10 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --initialization-actions gs://my-bucket/gobblin-setup.sh

gcloud dataproc jobs submit hadoop \
  --cluster=myspark \
  --class gobblin.runtime.mapreduce.CliMRJobLauncher \
  --properties mapreduce.job.user.classpath.first=true \
  -- -sysconfig /opt/gobblin-dist/conf/gobblin-mapreduce.properties -jobconfig gs://my-bucket/gobblin-kafka-gcs.job

To set up Dataproc with the Gobblin jars on the path, I pass this "gobblin-setup.sh" (below) as the cluster initialization action:

#!/bin/bash
sed -i 's%hdfs://myspark-m%file:///.%g' /usr/lib/hadoop/etc/hadoop/core-site.xml

cd /opt
gsutil cp gs://my-bucket/gobblin-distribution-0.10.0.tar.gz .
tar xvf gobblin-distribution-0.10.0.tar.gz

BASE="$PWD/gobblin-dist"
HADOOP_LIB="/usr/lib/hadoop/lib"

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  rm $HADOOP_LIB/commons-cli-1*.jar $HADOOP_LIB/guava-1*.jar
  cp $BASE/lib/* $HADOOP_LIB
  #*/ fix syntax highlight
else
  rm $HADOOP_LIB/guava-1*.jar
  GOBBLIN_VERSION="0.10.0"
  FWDIR_LIB=$BASE/lib

  # copied this from gobblin-dist/bin/gobblin-mapreduce.sh
  cp \
    $FWDIR_LIB/gobblin-api-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-avro-json-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-codecs-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-core-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-core-base-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-crypto-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-crypto-provider-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-data-management-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-metastore-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-metrics-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-metrics-base-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-metadata-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/gobblin-utility-$GOBBLIN_VERSION.jar \
    $FWDIR_LIB/avro-1.8.1.jar \
    $FWDIR_LIB/avro-mapred-1.8.1.jar \
    $FWDIR_LIB/commons-lang3-3.4.jar \
    $FWDIR_LIB/config-1.2.1.jar \
    $FWDIR_LIB/data-2.6.0.jar \
    $FWDIR_LIB/gson-2.6.2.jar \
    $FWDIR_LIB/guava-15.0.jar \
    $FWDIR_LIB/guava-retrying-2.0.0.jar \
    $FWDIR_LIB/joda-time-2.9.3.jar \
    $FWDIR_LIB/javassist-3.18.2-GA.jar \
    $FWDIR_LIB/kafka_2.11-0.8.2.2.jar \
    $FWDIR_LIB/kafka-clients-0.8.2.2.jar \
    $FWDIR_LIB/metrics-core-2.2.0.jar \
    $FWDIR_LIB/metrics-core-3.1.0.jar \
    $FWDIR_LIB/metrics-graphite-3.1.0.jar \
    $FWDIR_LIB/scala-library-2.11.8.jar \
    $FWDIR_LIB/influxdb-java-2.1.jar \
    $FWDIR_LIB/okhttp-2.4.0.jar \
    $FWDIR_LIB/okio-1.4.0.jar \
    $FWDIR_LIB/retrofit-1.9.0.jar \
    $FWDIR_LIB/reflections-0.9.10.jar \
    $HADOOP_LIB
fi

BTW, I'm working with gobblin 0.10.0 on dataproc 1.1 (hadoop 2.7.3).

Any ideas?

Issac Buenrostro

May 19, 2017, 3:53:48 PM5/19/17
to Henrique G. Abreu, gobblin-users
Can you provide the full log of the application?


Henrique Abreu

May 19, 2017, 5:24:50 PM5/19/17
to Issac Buenrostro, gobblin-users
Sure, this is what I got:

Job [9f744870-e283-42c3-a416-55605263199f] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARN [AbstractJobLauncher] Creating a job specific SharedResourcesBroker. Objects will only be shared at the job level.
WARN [JobContext] Property task.data.root.dir is missing.
WARN [KafkaSource] Previous offset for partition of_sample:0 does not exist. This partition will start from the earliest offset: 0
WARN [KafkaSource] Previous offset for partition of_sample:1 does not exist. This partition will start from the earliest offset: 0
WARN [TaskStateCollectorService] No output task state files found in gobblin/kafka2gcs/job_kafka2gcs_1495220348066/output/job_kafka2gcs_1495220348066
WARN [SafeDatasetCommit] Not committing dataset  of job job_kafka2gcs_1495220348066 with commit policy COMMIT_ON_FULL_SUCCESS and state FAILED
ERROR [JobContext] Iterator executor failure.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Not committing dataset  of job job_kafka2gcs_1495220348066 with commit policy COMMIT_ON_FULL_SUCCESS and state FAILED
       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
       at gobblin.util.executors.IteratorExecutor.executeAndGetResults(IteratorExecutor.java:128)
       at gobblin.runtime.JobContext.commit(JobContext.java:432)
       at gobblin.runtime.JobContext.commit(JobContext.java:398)
       at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:431)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:89)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:66)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:111)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: java.lang.RuntimeException: Not committing dataset  of job job_kafka2gcs_1495220348066 with commit policy COMMIT_ON_FULL_SUCCESS and state FAILED
       at gobblin.runtime.SafeDatasetCommit.call(SafeDatasetCommit.java:86)
       at gobblin.runtime.SafeDatasetCommit.call(SafeDatasetCommit.java:54)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at gobblin.util.executors.MDCPropagatingRunnable.run(MDCPropagatingRunnable.java:35)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       at java.lang.Thread.run(Thread.java:745)
ERROR [AbstractJobLauncher] Failed to launch and run job job_kafka2gcs_1495220348066: java.io.IOException: Failed to commit dataset state for some dataset(s) of job job_kafka2gcs_1495220348066
java.io.IOException: Failed to commit dataset state for some dataset(s) of job job_kafka2gcs_1495220348066
       at gobblin.runtime.JobContext.commit(JobContext.java:438)
       at gobblin.runtime.JobContext.commit(JobContext.java:398)
       at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:431)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:89)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:66)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:111)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Exception in thread "main" java.lang.reflect.InvocationTargetException
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: gobblin.runtime.JobException: Job job_kafka2gcs_1495220348066 failed
       at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:480)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:89)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:66)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
       at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:111)
       ... 5 more
WARN [ServiceBasedAppLauncher] ApplicationLauncher has already stopped
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [9f744870-e283-42c3-a416-55605263199f] entered state [ERROR] while waiting for [DONE].

Thanks,

Henrique G. Abreu

Issac Buenrostro

May 19, 2017, 5:28:40 PM5/19/17
to Henrique G. Abreu, gobblin-users
That is not the entire log unfortunately. Are you running in MR mode? If so, can you get the log of a failed mapper? If running in local mode, there should be other exceptions before this point.

Henrique Abreu

May 19, 2017, 6:02:14 PM5/19/17
to Issac Buenrostro, gobblin-users
I'm sorry, I don't really know how to get such a log. I did set up a `logdir`, though, and got this `gobblin-current.log`, attached. I hope this is it.

Yes, I'm trying to run in MR mode; Google Dataproc is a managed Hadoop cluster.
I "learned" how to invoke it in MR mode by reading `bin/gobblin-mapreduce.sh` and tried to pass the same parameters and settings to Dataproc.

Thanks a lot for taking a look.

Henrique G. Abreu
gobblin-current.log

Issac Buenrostro

May 19, 2017, 6:07:00 PM5/19/17
to Henrique G. Abreu, gobblin-users
This is the error it throws:

2017-05-19 21:46:13 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1380 - Job job_1495230148327_0001 failed with state FAILED due to: Application application_1495230148327_0001 failed 2 times due to AM Container for appattempt_1495230148327_0001_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://myspark-m:8088/cluster/app/application_1495230148327_0001Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/tmp/ffe94f06-2b51-44e2-b943-268cc58ee099/gobblin/kafka2gcs/job_kafka2gcs_1495230367575/job.state does not exist

Can you try to open http://myspark-m:8088/cluster/app/application_1495230148327_0001 and see if there are any logs in there?
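
If the web UI isn't reachable, and assuming YARN log aggregation is enabled on the cluster, you can usually pull the same container logs from the master node with something like:

# application id copied from the log above
yarn logs -applicationId application_1495230148327_0001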

Henrique Abreu

May 19, 2017, 6:57:59 PM5/19/17
to Issac Buenrostro, gobblin-users
Unfortunately there isn't. It just says:
"No logs available for container container_1495230148327_0001_01_000001"

I'm changing my settings to see if I get any different or more verbose error,  without much luck :-/

The best I got was this: when I comment out the `sed` line I use to edit some settings Dataproc puts in etc/hadoop/core-site.xml (settings I don't have on my local Hadoop setup, where Gobblin does work fine in MR mode), I get this other error :-/

2017-05-19 22:37:26 UTC INFO  [main] gobblin.kafka.client.Kafka08ConsumerClient  143 - Fetching topic metadata from broker mykafka:9092
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.KafkaSource  161 - Discovered topic of_sample
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  186 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@7ab1127c[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
2017-05-19 22:37:27 UTC WARN  [pool-8-thread-1] gobblin.source.extractor.extract.kafka.KafkaSource  342 - Previous offset for partition of_sample:0 does not exist. This partition will start from the earliest offset: 0
2017-05-19 22:37:27 UTC INFO  [pool-8-thread-1] gobblin.source.extractor.extract.kafka.KafkaSource  465 - Created workunit for partition of_sample:0: lowWatermark=0, highWatermark=0, range=0
2017-05-19 22:37:27 UTC WARN  [pool-8-thread-1] gobblin.source.extractor.extract.kafka.KafkaSource  342 - Previous offset for partition of_sample:1 does not exist. This partition will start from the earliest offset: 0
2017-05-19 22:37:27 UTC INFO  [pool-8-thread-1] gobblin.source.extractor.extract.kafka.KafkaSource  465 - Created workunit for partition of_sample:1: lowWatermark=0, highWatermark=0, range=0
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  205 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@7ab1127c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.KafkaSource  185 - Created workunits for 1 topics in 0 seconds
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaAvgRecordTimeBasedWorkUnitSizeEstimator  149 - For all topics not pulled in the previous run, estimated avg time to pull a record is 1.0 milliseconds
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  237 - Created MultiWorkUnit for partitions [of_sample:0, of_sample:1]
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  311 - MultiWorkUnit 0: estimated load=0.003010, partitions=[]
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  311 - MultiWorkUnit 1: estimated load=0.003010, partitions=[[of_sample:0, of_sample:1]]
2017-05-19 22:37:27 UTC INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  297 - Min load of multiWorkUnit = 0.003010; Max load of multiWorkUnit = 0.003010; Diff = 0.000000%
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  186 - Attempting to shutdown ExecutorService: com.google.common.util.concurrent.MoreExecutors$ListeningDecorator@3d299393
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  205 - Successfully shutdown ExecutorService: com.google.common.util.concurrent.MoreExecutors$ListeningDecorator@3d299393
2017-05-19 22:37:27 UTC INFO  [main] gobblin.runtime.AbstractJobLauncher  386 - Starting job job_kafka2gcs_1495233445721
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  186 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@5dc769f9[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  205 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@5dc769f9[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  186 - Attempting to shutdown ExecutorService: com.google.common.util.concurrent.MoreExecutors$ListeningDecorator@17fede14
2017-05-19 22:37:27 UTC INFO  [main] gobblin.util.ExecutorsUtils  205 - Successfully shutdown ExecutorService: com.google.common.util.concurrent.MoreExecutors$ListeningDecorator@17fede14
2017-05-19 22:37:27 UTC INFO  [TaskStateCollectorService STARTING] gobblin.runtime.TaskStateCollectorService  97 - Starting the TaskStateCollectorService
2017-05-19 22:37:27 UTC INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  229 - Launching Hadoop MR job Gobblin-kafka2gcs
2017-05-19 22:37:28 UTC INFO  [main] org.apache.hadoop.yarn.client.RMProxy  98 - Connecting to ResourceManager at myspark-m/10.128.0.2:8032
2017-05-19 22:37:29 UTC INFO  [main] org.apache.hadoop.mapreduce.JobSubmitter  249 - Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1495233380584_0001
2017-05-19 22:37:29 UTC INFO  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  103 - Stopping the TaskStateCollectorService
2017-05-19 22:37:29 UTC WARN  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  131 - No output task state files found in gobblin/kafka2gcs/job_kafka2gcs_1495233445721/output/job_kafka2gcs_1495233445721
2017-05-19 22:37:29 UTC INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  505 - Deleted working directory gobblin/kafka2gcs/job_kafka2gcs_1495233445721
2017-05-19 22:37:29 UTC ERROR [main] gobblin.runtime.AbstractJobLauncher  442 - Failed to launch and run job job_kafka2gcs_1495233445721: java.io.FileNotFoundException: File does not exist: hdfs://myspark-m/user/root/gobblin/kafka2gcs/job_kafka2gcs_1495233445721/job.state
java.io.FileNotFoundException: File does not exist: hdfs://myspark-m/user/root/gobblin/kafka2gcs/job_kafka2gcs_1495233445721/job.state
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)

Henrique G. Abreu

Dennis Huo

May 19, 2017, 7:14:24 PM5/19/17
to gobblin-users, ibue...@linkedin.com, hga...@gmail.com
Whenever there are filesystem-related errors, it's often because a framework makes assumptions about fs.defaultFS: you might specify an input/output path in GCS, and the framework stages some files based on that output location, but then later looks for them under an explicit hdfs:// path.

To help isolate whether the error is related to this, can you try re-running your job after first placing your data into the cluster's HDFS and using pure HDFS paths for input/output in your job?

You can either use "hadoop distcp gs://yourlocation/data/ hdfs:///some/hdfs/location" if the data is large or just do a "hadoop fs -cp" if it's small.
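
Concretely, with the bucket name from your job config (the exact source/destination paths below are just illustrative):

# large data: run a distributed copy
hadoop distcp gs://my-bucket/some/data hdfs:///user/root/some/data
# small data: a plain filesystem copy is enough
hadoop fs -cp gs://my-bucket/some/data hdfs:///user/root/some/data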

Also, is it any different if you SSH into your cluster and launch the job purely from the command line without using the Dataproc job submission API?


Dennis Huo

May 19, 2017, 7:27:22 PM5/19/17
to gobblin-users, ibue...@linkedin.com, hga...@gmail.com
It also looks like your job configuration references "file:///" in some places; "file:///" means Hadoop's LocalFileSystem interface, so the master node will interpret it as the master's local filesystem, while workers will interpret it as their own local filesystems.

Generally file:/// only works when you run in a single-machine test mode; any distributed cluster where workers are actually on separate nodes will require specifying either an hdfs:/// path for all the fs.uri settings or a gs:// directory.

Henrique G. Abreu

May 21, 2017, 12:23:50 PM5/21/17
to gobblin-users, ibue...@linkedin.com, hga...@gmail.com
Thanks a lot for the help guys.

That was it: using file:/// worked fine on my laptop because it was a single node, but failed on Dataproc.
Using only hdfs (for staging) and gs (for state and final output) instead fixed this issue.
That led me to a few other classpath issues, but I was finally able to run the job successfully after sorting those out.

I'll send a last email tomorrow with my final working configuration, just for reference in case a future novice like myself stumbles on this thread :-)
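
Roughly, the change was along these lines (an illustrative sketch based on the properties in my first message, not my exact final config):

# illustrative: staging/working dirs on the cluster's HDFS
fs.uri=hdfs://myspark-m
writer.fs.uri=${fs.uri}
mr.job.root.dir=/user/root/gobblin
writer.staging.dir=${mr.job.root.dir}/stg
writer.output.dir=${mr.job.root.dir}/out

# illustrative: state store and published output on GCS
state.store.fs.uri=gs://my-bucket
state.store.dir=gobblin/state
data.publisher.fs.uri=gs://my-bucket
data.publisher.final.dir=gobblin/pub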

Again, thank you very much.

Shirshanka Das

Jun 16, 2017, 6:29:20 PM6/16/17
to Henrique G. Abreu, gobblin-users, Issac Buenrostro
Hi Henrique,
   Could you post your final working configuration here for reference? 

thanks!
Shirshanka

