Hive-on-Spark Support

Nithin Kumar Kollu

Nov 30, 2022, 3:30:47 AM
to Google Cloud Dataproc Discussions
Hi,

Is Hive on Spark (hive.execution.engine=spark) supported in Dataproc?

If it's supported, can you provide steps? 

Regards,
Nithin

Matias Coca

Nov 30, 2022, 10:17:08 AM
to Google Cloud Dataproc Discussions
Hi Nithin, 

It is supported; Spark is installed by default on a Dataproc cluster.

Here is a simple example, if you want to run a command from the command line of the main node of a Dataproc cluster:

$> hive -e "set hive.execution.engine=spark;\
show databases;"
Regards.

Matias Coca

Nithin Kumar Kollu

Dec 2, 2022, 9:13:51 AM
to Google Cloud Dataproc Discussions
Hi, 

It isn't working that way. 

Regards,
Nithin

Matias Coca

Dec 2, 2022, 9:18:31 AM
to Google Cloud Dataproc Discussions
In that case, you need to provide more information about your Dataproc cluster installation, and also the error that you are receiving. The default Dataproc cluster creation includes Spark. If you connect to the main node of the Dataproc cluster and run the commands I put in my previous message, they will work. With only the information you provided, it is very difficult to give you any help.
Regards

Matias

Nithin Kumar Kollu

Dec 2, 2022, 10:16:52 AM
to Google Cloud Dataproc Discussions
I am running all of these on the master node.

When I set the execution engine to spark, it fails with the error below:

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
Hive Session ID = 7c8beab1-9ce3-4f12-aa6b-03e8c311e873
hive> set hive.execution.engine=spark
    > ;
hive> insert into test values(1,"abc") ;
Query ID = root_20221202151329_f47b306f-17f9-4434-afe1-4e7ba42ad5f4
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session 5ede6416-72d6-4f85-9e1d-3ec25a01a68d)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 5ede6416-72d6-4f85-9e1d-3ec25a01a68d

hive> 

Hive Version : Hive 3.1.2

Spark Version : 3.1.3

-Nithin

Matias Coca

Dec 2, 2022, 12:23:18 PM
to Google Cloud Dataproc Discussions
Hi Nithin,

The `set hive.execution.engine=spark;` statement did not fail; the error that you received is related to the next instruction, `insert into test values(1,"abc");`.
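
Before changing anything, it may also help to look at the Hive client log, which usually records the underlying cause of the "Failed to create Spark client" error. With a stock hive-log4j2 configuration that log is typically at /tmp/<user>/hive.log, but the exact path depends on your hive-log4j2.properties:

# Check the Hive client log for the root cause (path is the hive-log4j2 default and may differ)
$> tail -n 100 /tmp/$(whoami)/hive.log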

Try to set these before the insert instruction (see the sketch after this list for one way to apply them):

"hive:hive.exec.reducers.bytes.per.reducer": "67108864"
"hive:hive.exec.reducers.max": "1000" 
"mapred:mapreduce.job.reduces": "1000" 

If this doesn't work, please go to the Dataproc UI, click on the Dataproc cluster name, and then click on the "CONFIGURATION" tab. Go to the bottom and click on "EQUIVALENT REST". A window will open; click on "COPY TO CLIPBOARD" and paste the result here. This is the Dataproc cluster configuration, and I need to see it in detail.
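
Alternatively, if you have the gcloud CLI available, the same configuration can be printed from the command line; the cluster name and region below are placeholders:

# Print the cluster configuration as JSON (my-cluster / us-central1 are placeholders)
$> gcloud dataproc clusters describe my-cluster --region=us-central1 --format=json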

Regards


Matias Coca

