Kafka to remote HDFS


joanne

Feb 9, 2016, 7:45:59 AM
to gobblin-users
Hi, all.
I'm trying to ingest data from Kafka to a remote HDFS cluster.
Is it possible to write output directly to a remote HDFS?

I looked at standalone mode and MR mode, and both seem to require HADOOP_HOME or HADOOP_BIN_DIR.
In standalone mode, the state-store data are written to the remote HDFS, but the data from Kafka are not.
(Honestly, I don't understand why some data can be written and other data cannot.)
In MR mode, I can't execute gobblin-mapreduce.sh because the script needs $HADOOP_BIN_DIR set in the local machine's environment.
Does this mean I can only run Gobblin on a server with HDFS installed?
If so, that seems inefficient for ingesting to a remote HDFS, since the data would cross the network twice.

Any advice would be appreciated.

Ziyang Liu

Feb 9, 2016, 10:47:15 AM
to gobblin-users
Hi Joanne

In order to run MR mode you need to run it on a Hadoop cluster because it needs to launch a Hadoop MR job.
You can write to any remote HDFS by specifying writer.fs.uri.
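For example, a minimal sketch (the host and port below are placeholders for your own namenode, not values from this thread):

```
# Write job output to a remote HDFS, independent of where Gobblin runs
writer.fs.uri=hdfs://remote-namenode:8020
# Optionally keep the state store on the same remote HDFS
state.store.fs.uri=hdfs://remote-namenode:8020
```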

I don't think standalone mode requires HADOOP_HOME or HADOOP_BIN_DIR. Where does it require them?

-Ziyang

joanne

Feb 9, 2016, 12:20:31 PM
to gobblin-users
I'm surprised you responded so quickly!! Thank you.

This is part of my gobblin-current.log in standalone mode with the DEBUG option. On second thought, it may not be the reason... but why does Gobblin need HADOOP_HOME, and where do I set it?

Anyway, after executing with this job configuration, I get some output in the state-store, but no other output appears.
Any suggestions about my Gobblin job?

```
2016-02-10 02:01:52 KST DEBUG [JobScheduler-0] org.apache.hadoop.util.Shell  321 - Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:303)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:328)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2807)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2802)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2668)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at gobblin.runtime.JobContext.<init>(JobContext.java:104)
at gobblin.runtime.AbstractJobLauncher.<init>(AbstractJobLauncher.java:114)
at gobblin.runtime.local.LocalJobLauncher.<init>(LocalJobLauncher.java:66)
at gobblin.runtime.JobLauncherFactory.newJobLauncher(JobLauncherFactory.java:63)
at gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:283)
at gobblin.scheduler.JobScheduler$NonScheduledJobRunner.run(JobScheduler.java:531)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-02-10 02:01:52 KST DEBUG [JobScheduler-0] org.apache.hadoop.util.Shell  393 - setsid is not available on this machine. So not using it.
2016-02-10 02:01:52 KST DEBUG [JobScheduler-0] org.apache.hadoop.util.Shell  397 - setsid exited with exit code 0

```

And here is the job config file.
```
job.disabled=true

job.name=GobblinKafkaQuickStart
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers=kafka-joanne:9092
bootstrap.with.offset=earliest

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=1

metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt

fs.uri=hdfs://hdfs-uri:9000
writer.fs.uri=hdfs://hdfs-uri:9000
state.store.fs.uri=hdfs://hdfs-uri:9000

state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output
```



Ziyang Liu

Feb 9, 2016, 1:32:51 PM
to gobblin-users
Hi Joanne, which Hadoop version did you use to build gobblin?

Sahil Takiar

Feb 9, 2016, 10:53:33 PM
to Ziyang Liu, gobblin-users
Can you confirm that either (1) the environment variable HADOOP_HOME is set on the cluster you are running on, or (2) hadoop.home.dir is set somewhere in your cluster config?

--
You received this message because you are subscribed to the Google Groups "gobblin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gobblin-user...@googlegroups.com.
To post to this group, send email to gobbli...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gobblin-users/3f8906b0-594d-4577-9dfe-352129094cbc%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


joanne

Feb 10, 2016, 10:39:57 AM
to gobblin-users, zli...@asu.edu
Hi, Ziyang Liu and Sahil Takiar. 

Before going through the checklist you suggested, I'm focusing on Gobblin's standalone mode, because I'd like to do this step by step :)

First, my Hadoop version is 2.7.1 (for testing), and it's installed as a Single Node Setup (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html).
The build command line was:

```
./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.7.1
```

Second, about the environment on my Hadoop node:
(1) HADOOP_HOME and HADOOP_BIN_HOME are set on the Hadoop node.

```
[hdfs-joanne]$ env | grep HADOOP
HADOOP_HOME=/home/deploy/hadoop-2.7.1
HADOOP_BIN_HOME=/home/deploy/hadoop-2.7.1/bin
```

but not on the local machine that executes Gobblin :)

(2) hadoop.home.dir is not set.
The Single Node Setup docs contain no instructions about hadoop.home.dir.
But you mean I should set hadoop.home.dir on my Hadoop node when trying to execute MR mode, right?

Sahil Takiar

Feb 10, 2016, 11:53:41 AM
to joanne, gobblin-users, Ziyang Liu
Can you try using version 2.6.0? Looking through the code for FileSystem, this may be a bug in 2.7.1. A detailed explanation of the bug is below if you're interested; otherwise, just try changing the version and see if that works.

* Line 2807 of FileSystem uses a class called org.apache.hadoop.util.StringUtils
* Line 80 of the StringUtils class creates a static field called ENV_VAR_PATTERN, in order to create this variable it uses a class called org.apache.hadoop.util.Shell
* Line 328 of the Shell class invokes the method checkHadoopHome(), which throws an IOException if HADOOP_HOME or hadoop.home.dir are not set
* Version 2.6.0 of the FileSystem class does not use StringUtils and thus does not hit this problem
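If downgrading is not an option, one possible workaround (an untested sketch, not something verified in this thread) is to satisfy checkHadoopHome() by passing hadoop.home.dir to the JVM via the launcher's --jvmflags option; the path below is a placeholder:

```shell
# Untested workaround sketch: give the JVM a hadoop.home.dir so that
# org.apache.hadoop.util.Shell.checkHadoopHome() does not throw.
# The path is a placeholder for wherever your Hadoop distribution lives.
HADOOP_HOME=/home/deploy/hadoop-2.7.1
bin/gobblin-standalone.sh start --jvmflags "-Dhadoop.home.dir=$HADOOP_HOME"
```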

--Sahil


joanne

Feb 11, 2016, 12:02:34 AM
to gobblin-users, zli...@asu.edu
Wow, it works with Hadoop v2.6.0.
I really appreciate that you found the bug even though it's not a Gobblin issue.
I didn't expect the version to be the problem. I'll check the code you pointed out so I can work with Hadoop v2.7.1.
Have a nice day, Sahil!

Sahil Takiar

Feb 11, 2016, 1:39:24 AM
to joanne, gobblin-users, Ziyang Liu
No problem Joanne, glad things are working for you. Hope Gobblin is fulfilling all your data ingestion needs :)


joanne

Feb 11, 2016, 1:56:39 AM
to gobblin-users
Oh, I made a mistake: I forgot to mention and thank you as well.
Thanks a lot, Ziyang! 

Kotesh Banoth

Apr 28, 2016, 2:28:59 AM
to gobblin-users
Hi, when I start in standalone mode, it reports that GOBBLIN_JOB_CONFIG_DIR is not set. I need help.

hadoop@slave1:~/Desktop/kafkHive/gobblin/gobblin-dist$ bin/gobblin-standalone.sh start

Error: Environment variable GOBBLIN_JOB_CONFIG_DIR not set!

gobblin-standalone.sh <start | status | restart | stop> [OPTION]
Where OPTION can be:
  --workdir <job work dir>                       Gobblin's base work directory: if not set, taken from ${GOBBLIN_WORK_DIR}
  --fwdir <fwd dir>                              Gobblin's dist directory: if not set, taken from ${GOBBLIN_FWDIR}
  --logdir <log dir>                             Gobblin's log directory: if not set, taken from ${GOBBLIN_LOG_DIR}
  --jars <comma-separated list of job jars>      Job jar(s): if not set, /home/hadoop/Desktop/kafkHive/gobblin/gobblin-dist/lib is examined
  --conf <directory of job configuration files>  Directory of job configuration files: if not set, taken from
  --conffile <custom config file>                Custom config file: if not set, is ignored. Overwrites properties in /home/hadoop/Desktop/kafkHive/gobblin/gobblin-dist/conf/gobblin-standalone.properties
  --jvmflags <string of jvm flags>               String containing any additional JVM flags to include
  --help                                         Display this help and exit
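From the usage text above, the launcher expects the job-config directory as an environment variable. A minimal sketch (the directory paths are placeholders for your own layout):

```shell
# Sketch: set the environment variables the standalone launcher reads
# before starting it. Paths below are placeholders.
export GOBBLIN_JOB_CONFIG_DIR=/home/hadoop/gobblin-jobs   # directory holding your job config files
export GOBBLIN_WORK_DIR=/home/hadoop/gobblin-work         # Gobblin's base work directory
bin/gobblin-standalone.sh start
```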


My question to the group is: has anyone had any luck using Gobblin to write to HDFS in a SIMPLE example?

This might be helpful for folks getting started with this really nice piece of software.

Much appreciated in advance.
