Cannot set up Spark on YARN when doing OLAP on a remote machine with Kerberos.


Gabriele Ran

Mar 26, 2018, 2:21:38 AM
to JanusGraph users
Dear all,

I set up a cluster of three servers and loaded data into it.
Now I want to run OLAP from a portal machine with Hadoop and YARN installed.

The YARN host is set to another machine in a compute cluster.

My run.sh looks like this:
#!/usr/bin/env bash

export HADOOP_CONF_DIR=/opt/yarn-conf
export YARN_CONF_DIR=/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf
export CLASSPATH=$HADOOP_CONF_DIR:$YARN_CONF_DIR
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export JAVA_OPTIONS="$JAVA_OPTIONS -Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"

java \
    -cp graph_analyzer-1.0-SNAPSHOT.jar:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/*:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/lib/* \
    com.fosun.graph_analyzer.GraphAnalyzer

And my configuration file looks like this:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
#
# JanusGraph Cassandra InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=fonova-app-jan01,fonova-app-jan02,fonova-app-jan03
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
janusgraphmr.ioformat.conf.storage.cassandra.frame-size-mb=128
janusgraphmr.ioformat.conf.storage.cassandra.astyanax.frame-size=128
storage.cassandra.thrift.frame-size=128
storage.cassandra.thrift.max_message_size_mb=128
storage.cassandra.frame-size-mb=128
storage.cassandra.astyanax.frame-size=128

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=janusgraph
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647


#
# SparkGraphComputer Configuration
#
spark.master=yarn
spark.submit.deployMode=client
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.security.authentication=kerberos
spark.keytab=/data/graph/graph@FONOVA_AHZ.COM.keytab
spark.principal=graph/GRAPHCOMPUTE@FONOVA_AHZ.COM
spark.queue=root.ahz_batch.dev
spark.driver.memory=1g
spark.driverEnv.HADOOP_CONF_DIR=/etc/hadoop/conf
spark.executorEnv.HADOOP_CONF_DIR=/etc/hadoop/conf
spark.files=/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/core-site.xml,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/hadoop-env.sh,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/hbase-env.sh,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/hbase-site.xml,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/hdfs-site.xml,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/hive-site.xml,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/jaas.conf,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/log4j.properties,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/mapred-site.xml,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/read-cassandra-3.properties,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/ssl-client.xml,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/topology.map,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/topology.py,/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf/yarn-site.xml
spark.executor.cores=1
spark.executor.memory=2g
spark.num.executors=25
spark.driver.class.path=/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/lib
spark.yarn.appMasterEnv.CLASSPATH=/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/lib/*:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf
spark.yarn.dist.jars=/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/lib/*
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hadoop/native:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf:/data/fosundb/ran/graph/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/lib
spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hadoop/native:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf:/data/fosundb/ran/graph/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/lib
spark.history.kerberos.enabled=true
spark.history.kerberos.principal=graph/GRAPHCOMPUTE@FONOVA_AHZ.COM
spark.history.kerberos.keytab=/data/graph/graph@FONOVA_AHZ.COM.keytab

I put yarn-site.xml under /home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf

Now the problems are:
1. The log shows that authentication is disabled:
11:10:44,907  INFO SecurityManager:58 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(graph); users with modify permissions: Set(graph)

2. The log shows that yarn-site.xml has no effect on the ResourceManager address:
11:10:50,833  INFO RMProxy:56 - Connecting to ResourceManager at /0.0.0.0:8032



HadoopMarc

Mar 26, 2018, 3:18:20 AM
to JanusGraph users
Hi Gabriele,

The line

export CLASSPATH=$HADOOP_CONF_DIR:$YARN_CONF_DIR 

looks suspicious. Probably the HADOOP_CONF_DIR also contains a yarn-site.xml file, which gets priority as it is first in the CLASSPATH.
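Marc's point can be sketched with plain Java: the class loader resolves a resource from the first classpath entry that contains it, so a yarn-site.xml earlier on the CLASSPATH shadows any later copy. A minimal, self-contained sketch (directory names and contents are made up for the demo):

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClasspathOrderDemo {
    public static void main(String[] args) throws Exception {
        // Two conf directories, each with its own yarn-site.xml
        Path hadoopConf = Files.createTempDirectory("hadoop-conf");
        Path yarnConf = Files.createTempDirectory("yarn-conf");
        Files.writeString(hadoopConf.resolve("yarn-site.xml"),
                "<configuration>FROM-HADOOP-CONF</configuration>");
        Files.writeString(yarnConf.resolve("yarn-site.xml"),
                "<configuration>FROM-YARN-CONF</configuration>");

        // Classpath order mirrors CLASSPATH=$HADOOP_CONF_DIR:$YARN_CONF_DIR
        try (URLClassLoader cl = new URLClassLoader(
                new URL[]{hadoopConf.toUri().toURL(), yarnConf.toUri().toURL()}, null)) {
            try (InputStream in = cl.getResourceAsStream("yarn-site.xml")) {
                // The first classpath entry wins: this prints the HADOOP_CONF_DIR copy
                System.out.println(new String(in.readAllBytes()));
            }
        }
    }
}
```

So whichever yarn-site.xml sits in the first conf directory on the classpath is the one Hadoop's Configuration machinery will load.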

Note that hadoop 2.x includes yarn, among other services; you may be confusing things.


Cheers,     Marc



Gabriele Ran

Mar 26, 2018, 5:07:09 AM
to JanusGraph users
Hi HadoopMarc,

Thank you for your reply! I also have to thank you for your blog, which has helped me a lot.

I removed $HADOOP_CONF_DIR from the CLASSPATH and made sure that the yarn-site.xml in YARN_CONF_DIR is correct.
I still get the same exception.

I tried the following code, and all three lookups print null.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopConfiguration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GraphAnalyzer {

    private static final Logger LOGGER = LoggerFactory.getLogger(GraphAnalyzer.class);

    public static void main(String[] args) {
        Configuration hadoopConf = new Configuration();
        HadoopConfiguration hadoopConfiguration = new HadoopConfiguration();
        YarnConfiguration yarnConfiguration = new YarnConfiguration();
        System.out.println("------------------------------------------------------------------");
        System.out.println(hadoopConf.get("yarn.resourcemanager.address"));
        System.out.println("------------------------------------------------------------------");
        System.out.println(yarnConfiguration.get("yarn.resourcemanager.address"));
        System.out.println("------------------------------------------------------------------");
        System.out.println(hadoopConfiguration.getProperty("yarn.resourcemanager.address"));
    }
}


HadoopMarc

Mar 26, 2018, 6:40:09 AM
to JanusGraph users
Hi Gabriele,

You probably just instantiated empty Configuration objects.
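Hadoop's Configuration classes load their *-site.xml files from the JVM classpath; if the conf directory is not on it, the objects stay at their defaults. A stdlib-only check you could drop into the application (class name is made up) to see whether yarn-site.xml is visible to the running JVM:

```java
import java.net.URL;

public class ConfVisibilityCheck {
    public static void main(String[] args) {
        // null means the running JVM cannot see yarn-site.xml anywhere on its
        // classpath, so a new YarnConfiguration() would have nothing to load
        // and yarn.resourcemanager.address falls back to the 0.0.0.0 default.
        URL url = ClassLoader.getSystemResource("yarn-site.xml");
        System.out.println("yarn-site.xml resolved to: " + url);
    }
}
```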

Let's get back to the start. On the machine where your GraphAnalyzer runs, can you run the following from the console:
      $ hadoop fs -ls
      $ yarn queue -status root.ahz_batch.dev

This should work first. From there you can see which conf directories the hadoop and yarn commands use, and add those to your project's CLASSPATH.

HTH,    Marc


Gabriele Ran

Mar 26, 2018, 6:55:55 AM
to JanusGraph users
Hi HadoopMarc,

I think you are right. My app can read the so-called read-cassandra-3.properties file, but nothing else.

I tried both commands and they work fine:
$ hadoop fs -ls
$ yarn queue -status root.ahz_batch.dev

Do you mean that I should put my configuration files on HDFS?

marc.de...@gmail.com

Mar 26, 2018, 8:06:11 AM
to JanusGraph users
Hi Gabriele,

In your original post you had the following conf locations:

export HADOOP_CONF_DIR=/opt/yarn-conf
export YARN_CONF_DIR=/home/myuser/graph/graph_analyzer-1.0-SNAPSHOT/conf


I expect that the hadoop and yarn commands (which work fine) do not use these conf directories; rather, they get their configs from some standard location like /etc/hadoop/conf, or from a HADOOP_CONF_DIR set in your .bashrc. Those are the conf directories that contain your actual cluster configs, and they should be on your project's CLASSPATH. You do not need to define any additional hadoop/yarn configs in your project.

HTH,    Marc



Gabriele Ran

Mar 26, 2018, 8:44:34 AM
to JanusGraph users
Hi Marc,

When we submit Spark jobs on our server, a spark-env.sh containing the following is sourced:

export HADOOP_CONF_DIR=/opt/yarn-conf
export YARN_CONF_DIR=/opt/spark-2.2.0/conf/yarn-conf
export SPARK_HOME=/opt/spark-2.2.0
export MASTER=yarn
export DEPLOY_MODE=client

So I added these lines to my calling shell script.

Gabriele Ran

Mar 30, 2018, 3:27:27 AM
to JanusGraph users
Hi Marc,

I solved the YARN problem. It was because I use the java -cp option: the classpath is then fixed to exactly what I supply on the java -cp command line, and the exported CLASSPATH environment variable is ignored.
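For anyone hitting the same thing, this behaviour is easy to make visible: when the JVM is started with -cp, that value alone becomes the effective classpath, and the CLASSPATH environment variable is not consulted at all (a two-line sketch; the class name is made up):

```java
public class EffectiveClasspath {
    public static void main(String[] args) {
        // What the JVM actually searches: the -cp value if one was given,
        // otherwise the CLASSPATH environment variable (or "." if neither).
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println("CLASSPATH env   = " + System.getenv("CLASSPATH"));
    }
}
```

Running this with `java -cp foo.jar EffectiveClasspath` would show `foo.jar` as the effective classpath regardless of what CLASSPATH is exported to.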

But now I have run into another problem. Our queue is secured with Kerberos. Is there any way to supply a Kerberos principal when submitting the graph job to the queue?



HadoopMarc

Mar 30, 2018, 6:45:37 AM
to JanusGraph users
Hi Gabriele,

Thanks for reporting back.

You have spark Kerberos configs set, but these are not required for Spark on YARN and could get in the way (so remove them).

If you can do a kinit as user gabriele, hadoop will find your ticket cache and you can simply run the TinkerPop Spark-on-YARN job as user gabriele, like any other Spark-on-YARN job.

If you want to use a keytab to authenticate to yarn, make sure your application can find the keytab (specify its location in the KRB5_KTNAME env variable or in a jaas.conf file on your classpath).
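For the jaas.conf route, a minimal sketch of what such a file could look like, reusing the keytab path and principal from the config above (the entry name your client library expects varies; "Client" here is just an example):

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/data/graph/graph@FONOVA_AHZ.COM.keytab"
  principal="graph/GRAPHCOMPUTE@FONOVA_AHZ.COM";
};
```

You would then point the JVM at it with -Djava.security.auth.login.config=/path/to/jaas.conf.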

Cheers,    Marc

