Gobblin's use of the Hadoop environment. Help please.


Chris Neal

Jul 21, 2016, 5:27:08 PM
to gobblin-users
Hi all.

I've been running Gobblin in Dev against Cloudera's Hadoop on 5.5.1-1.cdh5.5.1.p0.11, which is built on hadoop 2.6.0. I've been trying to get things up and running in production for several days, and I keep hitting strange issues that, to me, point to a basic configuration problem between Gobblin and Hadoop.

My production system is running 5.7.0-1.cdh5.7.0.p0.45, which is also built on hadoop 2.6.0, so I did not rebuild Gobblin for this new environment.

My first strange issue was related to the HADOOP_HOME environment variable.  I pointed it to the directory with the configs, but no matter how I did it, Gobblin would use the ones supplied in /etc/hadoop.   In my job script, I have this:

export HADOOP_HOME=/app/gobblin/1.0/hadoop/

with these files:

gobblin@bdprodm10:[228]:/app/gobblin/1.0/hadoop/conf> ls -l
total 56
-rwxr-xr-x 1 gobblin gobblin 3850 Jul 15 20:24 core-site.xml
-rwxr-xr-x 1 gobblin gobblin  557 Jul 15 20:24 hadoop-env.sh
-rwxr-xr-x 1 gobblin gobblin 3086 Jul 15 20:24 hdfs-site.xml
-rwxr-xr-x 1 gobblin gobblin 1132 Jul 15 20:34 hive-env.sh
-rwxr-xr-x 1 gobblin gobblin 5501 Jul 15 20:34 hive-site.xml
-rwxr-xr-x 1 gobblin gobblin  310 Jul 15 20:34 log4j.properties
-rwxr-xr-x 1 gobblin gobblin 4581 Jul 15 20:24 mapred-site.xml
-rwxr-xr-x 1 gobblin gobblin    0 Jul 15 20:34 redaction-rules.json
-rwxr-xr-x 1 gobblin gobblin  315 Jul 15 20:24 ssl-client.xml
-rwxr-xr-x 1 gobblin gobblin 1116 Jul 15 20:24 topology.map
-rwxr-xr-x 1 gobblin gobblin 1510 Jul 15 20:24 topology.py
-rwxr-xr-x 1 gobblin gobblin 3854 Jul 15 20:24 yarn-site.xml

With this setting, every job would fail with an UnknownHostException("nameservice1").
When I found this post:  https://sskaje.me/2014/02/fix-hadoop-conf-alternatives-cdh5/, I tried changing the link in /etc/hadoop that Cloudera Manager made from:

root@bdprodm10:[422]:/etc/hadoop> ls -lrt
total 8
drwxr-xr-x 2 root root 4096 May  7 19:10 conf.cloudera.yarn
lrwxrwxrwx 1 root root   29 Jul 12 14:34 conf -> /etc/alternatives/hadoop-conf
drwxr-xr-x 2 root root 4096 Jul 15 19:00 conf.cloudera.hdfs

to:

root@bdprodm10:[423]:/etc/hadoop> ls -lrt
total 8
drwxr-xr-x 2 root root 4096 May  7 19:10 conf.cloudera.yarn
drwxr-xr-x 2 root root 4096 Jul 15 19:00 conf.cloudera.hdfs
lrwxrwxrwx 1 root root   19 Jul 20 16:50 conf -> conf.cloudera.yarn/

Then the UnknownHostException was resolved.  So, question 1 is:  Why is Gobblin not using what I supply as HADOOP_HOME for its Hadoop configuration?
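
To see which directory a conf symlink like the one above actually resolves to, a small check along these lines may help. This is only a sketch: show_conf_target is a hypothetical helper name, and it assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: print the fully resolved target of a config symlink.
# Defaults to /etc/hadoop/conf, the link discussed in this thread.
show_conf_target() {
    link="${1:-/etc/hadoop/conf}"
    # readlink -f follows every symlink in the chain (assumes GNU readlink)
    target=$(readlink -f "$link") || return 1
    printf '%s\n' "$target"
}
```

Running show_conf_target with no argument before and after changing the link would show whether /etc/hadoop/conf ends up at the alternatives entry or at conf.cloudera.yarn.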

The next issue again points to some sort of configuration problem. Now that I can connect to HDFS, when I submit the Gobblin MapReduce job, I'm getting NoClassDefFoundErrors for classes that are most definitely present in the lib directory. For example, the lib dir shown below has all the JAR files from the Gobblin distribution, plus our custom jars:

gobblin@bdprodm10:[233]:/app/gobblin/1.0/lib> ls
activation-1.1.1.jar                   derbysoft-logging-1.0.5.jar                             j3daudio.jar                   maven-scm-provider-svn-commons-1.4.jar
ant-1.9.1.jar                          dnsns.jar                                               j3dcore.jar                    maven-scm-provider-svnexe-1.4.jar
ant-launcher-1.9.1.jar                 dns_sd.jar                                              j3dutils.jar                   metrics-core-2.2.0.jar
antlr-runtime-3.5.2.jar                eigenbase-properties-1.1.5.jar                          jaccess.jar                    metrics-core-3.1.0.jar
aopalliance-1.0.jar                    generator-1.15.9.jar                                    jackson-annotations-2.2.2.jar  metrics-graphite-3.1.0.jar
apache-curator-2.6.0.pom               geronimo-annotation_1.0_spec-1.1.1.jar                  jackson-core-2.2.2.jar         metrics-jvm-3.1.0.jar
apacheds-i18n-2.0.0-M15.jar            geronimo-jaspic_1.0_spec-1.0.jar                        jackson-core-asl-1.9.13.jar    mina-core-1.1.7.jar
apacheds-kerberos-codec-2.0.0-M15.jar  geronimo-jpa_3.0_spec-1.0.jar                           jackson-databind-2.2.2.jar     mlibwrapper_jai.jar
apache-log4j-extras-1.2.17.jar         geronimo-jta_1.1_spec-1.1.1.jar                         jackson-jaxrs-1.9.13.jar       mockito-core-1.10.19.jar
api-asn1-api-1.0.0-M20.jar             gobblin-api.jar                                         jackson-mapper-asl-1.9.13.jar  MRJToolkit.jar
api-util-1.0.0-M20.jar                 gobblin-azkaban.jar                                     jackson-xc-1.9.13.jar          nashorn.jar
AppleScriptEngine.jar                  gobblin-compaction.jar                                  jai_codec.jar                  netty-3.2.3.Final.jar
asm-3.1.jar                            gobblin-core.jar                                        jai_core.jar                   netty-3.7.0.Final.jar
asm-commons-3.1.jar                    gobblin-data-management.jar                             janino-2.7.6.jar               objenesis-2.1.jar
asm-tree-3.1.jar                       gobblin-example.jar                                     jasper-compiler-5.5.23.jar     okhttp-2.0.0.jar
avro-1.7.7.jar                         gobblin-metastore.jar                                   jasper-runtime-5.5.23.jar      okhttp-urlconnection-2.0.0.jar
avro-ipc-1.7.7.jar                     gobblin-metrics.jar                                     jasypt-1.9.2.jar               okio-1.0.0.jar
avro-ipc-1.7.7-tests.jar               gobblin-rest-api-data-template.jar                      javax.inject-1.jar             opencsv-2.3.jar
avro-mapred-1.7.7-hadoop2.jar          gobblin-rest-api.jar                                    javax.mail-1.5.2.jar           paranamer-2.3.jar
azkaban-2.5.0.jar                      gobblin-rest-api-rest-client.jar                        java-xmlbuilder-0.4.jar        parquet-avro-1.8.1.jar
bcpg-jdk15on-1.52.jar                  gobblin-rest-client.jar                                 javax.servlet-api-3.0.1.jar    parquet-hadoop-bundle-1.6.0.jar
bcprov-jdk15on-1.52.jar                gobblin-rest-server.jar                                 jaxb-api-2.2.2.jar             parquet-hadoop-bundle-1.8.1.jar
bonecp-0.8.0.RELEASE.jar               gobblin-runtime.jar                                     jaxb-impl-2.2.3-1.jar          parseq-1.3.6.jar
c3p0-0.9.1.1.jar                       gobblin-scheduler.jar                                   jdo2-api-2.1.jar               pegasus-common-1.15.9.jar
calcite-avatica-1.2.0-incubating.jar   gobblin-test-harness.jar                                jdo-api-3.0.1.jar              pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar
calcite-core-1.2.0-incubating.jar      gobblin-utility.jar                                     jersey-client-1.9.jar          plexus-utils-1.5.6.jar
calcite-linq4j-1.2.0-incubating.jar    gobblin-yarn.jar                                        jersey-core-1.9.jar            protobuf-java-2.5.0.jar
cglib-2.2.1-v20090111.jar              groovy-all-2.1.6.jar                                    jersey-guice-1.9.jar           quartz-2.2.1.jar
cglib-nodep-2.2.jar                    gson-2.3.1.jar                                          jersey-json-1.9.jar            r2-1.15.9.jar
cldrdata.jar                           guava-15.0.jar                                          jersey-server-1.9.jar          regexp-1.3.jar
codemodel-2.2.jar                      guice-3.0.jar                                           jets3t-0.9.0.jar               restli-client-1.15.9.jar
commons-cli-1.3.1.jar                  guice-servlet-3.0.jar                                   jettison-1.1.jar               restli-common-1.15.9.jar
commons-codec-1.10.jar                 hadoop-annotations-2.6.0.jar                            jetty-6.1.26.jar               restli-docgen-1.15.9.jar
commons-collections-3.2.1.jar          hadoop-auth-2.6.0.jar                                   jetty-all-7.6.0.v20120127.jar  restli-netty-standalone-1.15.9.jar
commons-compiler-2.7.6.jar             hadoop-common-2.6.0.jar                                 jetty-util-6.1.26.jar          restli-server-1.15.9.jar
commons-compress-1.10.jar              hadoop-hdfs-2.6.0.jar                                   jfxrt.jar                      restli-tools-1.15.9.jar
commons-configuration-1.10.jar         hadoop-mapreduce-client-common-2.6.0.jar                jline-2.12.jar                 retrofit-1.7.1.jar
commons-daemon-1.0.13.jar              hadoop-mapreduce-client-core-2.6.0.jar                  joda-time-2.9.jar              scala-library-2.11.6.jar
commons-dbcp-1.4.jar                   hadoop-yarn-api-2.6.0.jar                               jopt-simple-3.2.jar            scala-parser-combinators_2.11-1.0.2.jar
commons-el-1.0.jar                     hadoop-yarn-client-2.6.0.jar                            jpam-1.1.jar                   scala-xml_2.11-1.0.2.jar
commons-email-1.4.jar                  hadoop-yarn-common-2.6.0.jar                            jsch-0.1.53.jar                servlet-api-2.5-20081211.jar
commons-httpclient-3.1.jar             hadoop-yarn-server-applicationhistoryservice-2.6.0.jar  json-20090211.jar              servlet-api-2.5.jar
commons-io-2.4.jar                     hadoop-yarn-server-common-2.6.0.jar                     jsp-api-2.1.jar                slf4j-api-1.7.12.jar
commons-lang-2.6.jar                   hadoop-yarn-server-resourcemanager-2.6.0.jar            jsr305-3.0.0.jar               slf4j-log4j12-1.7.5.jar
commons-lang3-3.4.jar                  hadoop-yarn-server-web-proxy-2.6.0.jar                  jta-1.1.jar                    snappy-0.3.jar
commons-logging-1.2.jar                hamcrest-core-1.1.jar                                   junit-3.8.1.jar                snappy-java-1.1.1.6.jar
commons-math3-3.5.jar                  helix-core-0.6.6-SNAPSHOT.jar                           kafka_2.11-0.8.2.1.jar         ST4-4.0.4.jar
commons-net-3.1.jar                    hive-ant-1.2.1.jar                                      kafka-clients-0.8.2.1.jar      stax-api-1.0.1.jar
commons-pool-1.5.4.jar                 hive-common-1.2.1.jar                                   leveldbjni-all-1.8.jar         stax-api-1.0-2.jar
commons-vfs2-2.0.jar                   hive-exec-1.2.1.jar                                     libAppleScriptEngine.jnilib    sunec.jar
config-1.2.1.jar                       hive-jdbc-1.2.1.jar                                     libfb303-0.9.2.jar             sunjce_provider.jar
curator-client-2.6.0.jar               hive-metastore-1.2.1.jar                                libJ3DAudio.jnilib             sunpkcs11.jar
curator-framework-2.6.0.jar            hive-serde-1.2.1.jar                                    libJ3D.jnilib                  tools.jar
curator-recipes-2.6.0.jar              hive-service-1.2.1.jar                                  libJ3DUtils.jnilib             transaction-api-1.1.jar
d2-1.15.9.jar                          hive-shims-0.20S-1.2.1.jar                              libjdns_sd.jnilib              vecmath.jar
data-1.15.9.jar                        hive-shims-0.23-1.2.1.jar                               libmlib_jai.jnilib             velocity-1.7.jar
datanucleus-api-jdo-3.2.6.jar          hive-shims-1.2.1.jar                                    libthrift-0.9.2.jar            xercesImpl-2.9.1.jar
datanucleus-core-4.1.2.jar             hive-shims-common-1.2.1.jar                             li-jersey-uri-1.15.9.jar       xml-apis-1.3.04.jar
datanucleus-rdbms-4.1.2.jar            hive-shims-scheduler-1.2.1.jar                          localedata.jar                 xmlenc-0.52.jar
data-transform-1.15.9.jar              htrace-core-3.0.4.jar                                   log4j-1.2.17.jar               zipfs.jar
degrader-1.15.9.jar                    httpclient-4.5.jar                                      lombok-1.16.4.jar              zkclient-0.3.jar
derby-10.11.1.1.jar                    httpcore-4.4.1.jar                                      lz4-1.2.0.jar                  zookeeper-3.4.6.jar
derbysoft-avro-1.0.0-SNAPSHOT.jar      influxdb-java-1.5.jar                                   mail-1.4.1.jar
derbysoft-gobblin-1.0.0-SNAPSHOT.jar   ivy-2.4.0.jar                                           maven-scm-api-1.4.jar
gobblin@bdprodm10:[234]:/app/gobblin/1.0/lib> 

When I run a job with this script:

#!/bin/sh

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/app/gobblin/1.0/hadoop/
export HADOOP_BIN_DIR=/opt/cloudera/parcels/CDH/bin
export CUSTOMER=X
export DATA_TYPE=
export LOG_TYPE=ari
export DATA_DIR=/data/connectivity/gobblin/$DATA_TYPE
export TOPIC_WHITELIST=X_raw

/app/gobblin/1.0/bin/gobblin-mapreduce.sh --fs hdfs://nameservice1 --workdir hdfs://nameservice1/etl/connectivity/gobblin --jars /app/gobblin/1.0/lib/derbysoft-gobblin-1.0.0-SNAPSHOT.jar,/app/gobblin/1.0/lib/derbysoft-avro-1.0.0-SNAPSHOT.jar,/app/gobblin/1.0/lib/derbysoft-logging-1.0.5.jar,/app/gobblin/1.0/lib/parquet-hadoop-bundle-1.8.1.jar,/app/gobblin/1.0/lib/parquet-avro-1.8.1.jar --conf /app/gobblin/1.0/job_conf/X.pull

After about 30 seconds, I get Exceptions to the console about "Not all tasks running completed successfully", and within the Gobblin metrics logs, I get this:

"taskWorkingState":"FAILED","taskFailureContext":"java.lang.NoClassDefFoundError: org/apache/commons/lang3/StringUtils

StringUtils is here:

chris.neal@bdprodm10:[30]:/app/gobblin/1.0/lib> jar tvf commons-lang3-3.4.jar | grep StringUtils
  3167 Fri Apr 03 14:30:26 UTC 2015 org/apache/commons/lang3/RandomStringUtils.class
 51000 Fri Apr 03 14:30:26 UTC 2015 org/apache/commons/lang3/StringUtils.class

Ok, weird.  What I found in gobblin-mapreduce.sh shows that all these jars in this lib directory get included in the environment variable:  HADOOP_CLASSPATH
BUT, they don't seem to be found at runtime.

Just for grins, I added this specific jar to my bin script for the job as such:

/app/gobblin/1.0/bin/gobblin-mapreduce.sh --fs hdfs://nameservice1 --workdir hdfs://nameservice1/etl/connectivity/gobblin --jars /app/gobblin/1.0/lib/commons-lang3-3.4.jar,/app/gobblin/1.0/lib/derbysoft-gobblin-1.0.0-SNAPSHOT.jar,/app/gobblin/1.0/lib/derbysoft-avro-1.0.0-SNAPSHOT.jar,/app/gobblin/1.0/lib/derbysoft-logging-1.0.5.jar,/app/gobblin/1.0/lib/parquet-hadoop-bundle-1.8.1.jar,/app/gobblin/1.0/lib/parquet-avro-1.8.1.jar --conf /app/gobblin/1.0/job_conf/X.pull

This time the job ran for about 30 seconds, and the metrics logs moved on to complain about a NoClassDefFoundError for a different class:

"taskWorkingState":"FAILED","taskFailureContext":"java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition\n\tat gobblin.source.extractor.extract.kafka.KafkaWrapper$KafkaOldAPI.createFetchRequest(KafkaWrapper.java:401)

So it seems to me that something is fundamentally off with the environment somehow, but I'm not sure how.  All three of these issues, IMO, should not happen if environments and CLASSPATHs are being properly set up and passed along.

I'm hoping someone out there can point out something I might have missed?  I've been scratching my head for a couple of days now with no breakthrough.

Many thanks.
Chris

Sahil Takiar

Aug 2, 2016, 4:04:49 PM
to Chris Neal, gobblin-users
For your first issue, it's possible that the configuration is being specified by HADOOP_CONF_DIR rather than $HADOOP_HOME/conf (https://wiki.apache.org/hadoop/HowToConfigure).
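
A minimal sketch of that idea, assuming the job script is free to set HADOOP_CONF_DIR itself and that the hadoop launcher in use honors it (worth verifying on CDH, where wrapper scripts may set their own value). The helper name use_hadoop_conf is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: export HADOOP_CONF_DIR only if the directory
# really looks like a Hadoop client config (contains core-site.xml).
use_hadoop_conf() {
    conf_dir="${1%/}"   # drop a trailing slash, if any
    if [ ! -f "$conf_dir/core-site.xml" ]; then
        echo "no core-site.xml in $conf_dir" >&2
        return 1
    fi
    HADOOP_CONF_DIR="$conf_dir"
    export HADOOP_CONF_DIR
}
```

In the job script this could be called as `use_hadoop_conf /app/gobblin/1.0/hadoop/conf || exit 1` before invoking gobblin-mapreduce.sh, using the conf directory from the listing in the original post.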

For your second issue, the classpath of the driver process is different from the classpath of your map tasks. Hadoop requires you to explicitly express additional classpath entries for any map tasks spawned. Gobblin has had issues with this in the past. What version of Gobblin are you running?
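
One way to act on this, sketched under assumptions: the job script in the original post already passes jars through --jars, and adding commons-lang3 there moved the failure on to the next missing class. The list could therefore be generated from the lib directory instead of typed by hand. jars_csv is a hypothetical helper, not part of Gobblin, and whether shipping every jar is appropriate depends on the Gobblin version and on possible conflicts with the cluster's own Hadoop jars.

```shell
#!/bin/sh
# Hypothetical helper: print every *.jar in a directory as one
# comma-separated list, the format the --jars option is given above.
jars_csv() {
    lib_dir="${1%/}"   # drop a trailing slash, if any
    csv=""
    for jar in "$lib_dir"/*.jar; do
        [ -f "$jar" ] || continue    # skips the literal pattern when nothing matches
        csv="${csv:+$csv,}$jar"      # add a comma only between entries
    done
    printf '%s\n' "$csv"
}
```

The launcher call would then use `--jars "$(jars_csv /app/gobblin/1.0/lib)"` in place of the hand-written list. Files such as the .jnilib and .pom entries in that directory are left out because only *.jar is matched.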

--Sahil

--
Sahil Takiar
Senior Software Engineer at LinkedIn
takiar...@gmail.com | (510) 673-0309