Blog: How to Schedule Spark jobs with Spark on YARN and Oozie

Romain Rigaux, Aug 23, 2016
Initially published on http://gethue.com/how-to-schedule-spark-jobs-with-spark-on-yarn-and-oozie/

How do you run Spark jobs with Spark on YARN? Getting this to work often requires some trial and error.

Hue leverages Apache Oozie to submit the jobs. This post focuses on yarn-client mode, since Oozie is already running the spark-submit command inside a MapReduce2 task on the cluster. You can read more about the Spark modes here.

Here is how to get started successfully:

PySpark

A simple script with no dependencies (a minimal sketch follows the screenshot).

[Screenshot: oozie-pyspark-simple]
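
For reference, here is a minimal sketch of such a script, assuming the Spark 1.x-era SparkContext API that was current at the time of the post (the file name and app name are only illustrative):

    # simple_job.py (illustrative name): a PySpark script with no extra dependencies.
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="SimpleOozieExample")
        # Count the even numbers in a small RDD; the result shows up in the task logs.
        rdd = sc.parallelize(list(range(100)))
        print(rdd.filter(lambda x: x % 2 == 0).count())
        sc.stop()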

A script with a dependency on another script (e.g. hello imports hello2); a sketch of both files follows the screenshot.

[Screenshot: oozie-pyspark-dependencies]
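
To make the idea concrete, here is one way the two files could look; hello2.py has to be attached as a File dependency so it ends up next to hello.py in the container's working directory and can be imported (the function and app name are only illustrative):

    # hello2.py: helper module imported by the main script.
    def greeting(name):
        return "Hello, {}!".format(name)

    # hello.py: the main script referenced by the Oozie Spark action.
    # hello2.py is added as a File dependency so it is present at import time.
    from pyspark import SparkContext
    from hello2 import greeting

    if __name__ == "__main__":
        sc = SparkContext(appName="HelloWithDependency")
        names = sc.parallelize(["Hue", "Oozie", "Spark"])
        for line in names.map(greeting).collect():
            print(line)
        sc.stop()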

For more complex dependencies, like Pandas, have a look at this documentation.
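
As an illustration only (the packaging steps themselves are covered in that documentation), a script using Pandas could look like the sketch below; the pandas package must already be available to the job, for example installed on the cluster nodes or shipped along as described there:

    # pandas_job.py (illustrative name): a PySpark script with a third-party dependency.
    import pandas as pd
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="PandasDependencyExample")
        rows = sc.parallelize([("a", 1), ("b", 2), ("c", 3)]).collect()
        # Build a small pandas DataFrame on the driver from the collected rows.
        df = pd.DataFrame(rows, columns=["key", "value"])
        print(df.describe())
        sc.stop()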

 

Jars (Java or Scala)

Add the jars as a File dependency and specify the name of the main jar:

[Screenshot: spark-action-jar]

Another solution is to put your jars in the ‘lib’ directory in the workspace (‘Folder’ icon on the top right of the editor).

[Screenshot: oozie-spark-lib2]

 

The latest Hue keeps improving this experience, and Hue 4 will provide an even simpler solution.

If you have any questions, feel free to comment here or on the hue-user list or @gethue!


 Hue Team

