I want to run a Spark SQL query against my Hive tables through Oozie from Hue, so I created a workflow that runs this PySpark script:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("local")
sc = SparkContext(conf=sconf)
print("\n\nSpark is %s" % sc.version)

# HiveContext (not a plain SQLContext) is needed to reach the Hive metastore
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
and with these properties set on the workflow's Spark action:
Spark Master: yarn
Mode: Client
App name: MySpark
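For reference, the Spark action in my workflow.xml looks roughly like this (the script name and the node-manager/name-node parameters are placeholders, not my real values):

```xml
<workflow-app name="MySpark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- matches the properties above: yarn master, client mode -->
            <master>yarn</master>
            <mode>client</mode>
            <name>MySpark</name>
            <!-- for PySpark, the <jar> element holds the .py script -->
            <jar>myspark.py</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```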
When I run this, the job does not finish, and the stdout log shows:
...
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3399)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3418)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3643)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:231)
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:215)
at org.apache.hadoop.hive.ql.metadata.Hive.&lt;init&gt;(Hive.java:338)
...
What am I missing? Everything works fine when I don't use Spark SQL. Do I need to add hive-site.xml, and if so, where should it be placed?
Do I need to pass any Spark arguments?
Thanks in advance for any suggestions.