I have both Spark 1.6 and 2.0 installed on my cluster. I see in the docs how to manually run a spark-submit job and choose 2.0 here. However, I launch my jobs using Oozie. Is there a way to specify for a given Oozie workflow spark action that I want to use the 2.0 engine vs 1.6?
I've tried removing multiple files, but there are so many (and even some duplicated in oozie sharelib and spark2 sharelib) that I'm afraid of removing them all and breaking 1.6 (thus removing ability to run any existing jobs under 1.6).
After removing all duplicate files found between the sharelib for oozie and spark2, I still could not run a Spark2 job from Oozie 4.2. Was getting ImportError for a custom python file I was trying to import from the main application py file. Seems that Oozie wasn't setting --py-files correctly (again, worked fine with Spark 1.6).
Thank you dsun! I'm working on these steps today. It seems from the instructions that once the sharelib for spark2 is setup, I can switch a given workflow to point to spark2 by specifying in job.properties:
Thanks @phil_hummel. I saw that ticket. The release notes for sparklyr 0.9.3 say it supports Spark 2.4.0 and spark_available_versions listed it so I thought it was worth a try. I did not do anything to clean up between versions, it looks like they are separate installs. Is there something you recommend? What version of Spark worked for you?
You can install Spark on an Amazon EMR cluster along with other Hadoop applications, and it can also leverage the EMR file system (EMRFS) to directly access data in Amazon S3. Hive is also integrated with Spark so that you can use a HiveContext object to run Hive scripts using Spark. A Hive context is included in the spark-shell as sqlContext.
aws-sagemaker-spark-sdk, delta, emrfs, emr-goodies, emr-ddb, emr-s3-select, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, hudi, hudi-spark, iceberg, livy-server, nginx, r, spark-client, spark-history-server, spark-on-yarn, spark-yarn-slave
Amazon EMR release 6.8.0 comes with Apache Spark 3.3.0. This Spark release uses Apache Log4j 2 and the log4j2.properties file to configure Log4j in Spark processes. If you use Spark in the cluster or create EMR clusters with custom configuration parameters, and you want to upgrade to Amazon EMR release 6.8.0, you must migrate to the new spark-log4j2 configuration classification and key format for Apache Log4j 2. For more information, see Migrating from Apache Log4j 1.x to Log4j 2.x.
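As a rough illustration of the new classification, the sketch below builds one configuration entry in the usual EMR Classification/Properties JSON shape. The helper name and the sample property (rootLogger.level set to warn) are assumptions chosen for illustration, not values taken from the release notes.

```python
import json


def spark_log4j2_classification(properties):
    """Build one EMR configuration entry for the spark-log4j2 classification.

    `properties` maps Log4j 2 property keys to values; both are stored as
    strings, as the configuration JSON expects.
    """
    return {
        "Classification": "spark-log4j2",
        "Properties": {str(k): str(v) for k, v in properties.items()},
    }


# Hypothetical override: the key below uses the Log4j 2 properties format.
configurations = [spark_log4j2_classification({"rootLogger.level": "warn"})]

# JSON that could be supplied as the cluster's configuration list.
print(json.dumps(configurations, indent=2))
```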
aws-sagemaker-spark-sdk, emrfs, emr-goodies, emr-ddb, emr-s3-select, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, hudi, hudi-spark, livy-server, nginx, r, spark-client, spark-history-server, spark-on-yarn, spark-yarn-slave
If you are using CDH or MapR, copy spark-env.sh.template as a new executable file conf/spark-env.sh and set HADOOP_CONF_DIR to the location of your Hadoop configuration directory (typically /etc/hadoop/conf).
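The copy-and-edit step can be sketched as a small Python helper. The function name create_spark_env is hypothetical, and the sketch assumes the template sits in the given conf directory; it appends an export line rather than editing the template's commented examples.

```python
import os
import shutil
import stat


def create_spark_env(conf_dir, hadoop_conf_dir="/etc/hadoop/conf"):
    """Copy spark-env.sh.template to spark-env.sh, set HADOOP_CONF_DIR,
    and mark the new file executable. Returns the path of the new file."""
    template = os.path.join(conf_dir, "spark-env.sh.template")
    target = os.path.join(conf_dir, "spark-env.sh")
    shutil.copyfile(template, target)

    # Append the export so the template's original content stays intact.
    with open(target, "a") as f:
        f.write(f"\nexport HADOOP_CONF_DIR={hadoop_conf_dir}\n")

    # Add execute permission for user, group and others.
    mode = os.stat(target).st_mode
    os.chmod(target, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target
```

Calling create_spark_env("conf") from the Spark installation directory would produce conf/spark-env.sh with the default Hadoop configuration path.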
Updates with spark-datasource are feasible only when the source dataframe contains Hudi's meta fields or a key field is configured. Notice that the save mode is now Append. In general, always use append mode unless you are trying to create the table for the first time.
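A minimal sketch of such an append-mode update from PySpark follows. The table name trips and the field names uuid and ts are hypothetical, and the helper functions are illustrative; the df argument is assumed to be a pyspark.sql.DataFrame created elsewhere, with the Hudi Spark bundle on the classpath.

```python
def hudi_upsert_options(table_name, record_key, precombine_field):
    """Write options for an upsert; the key field lets Hudi match
    incoming rows against existing records."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def write_updates(df, base_path, options):
    """Write df to the Hudi table at base_path using append save mode."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table and field names, for illustration only.
options = hudi_upsert_options("trips", "uuid", "ts")
```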
For advanced usage of Spark SQL, please refer to the Spark SQL DDL and Spark SQL DML reference guides. Alter table commands have their own reference as well. Stored procedures, available through Hudi Spark SQL, provide powerful capabilities for monitoring, managing and operating Hudi tables.
For beginners, we suggest trying Spark in the Zeppelin docker image. The image already has miniconda and many useful Python and R libraries installed, including the IPython and IRkernel prerequisites, so %spark.pyspark uses IPython and %spark.ir is enabled. Without any extra configuration, you can run most of the tutorial notes under the Spark Tutorial folder directly.
First you need to download Spark, because no Spark binary distribution is shipped with Zeppelin. For example, here we download Spark 3.1.2 to /mnt/disk1/spark-3.1.2, mount it into the Zeppelin docker container, and run the following command to start the container.
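A minimal sketch of what such a command might look like, assembled in Python so the pieces are easy to inspect. The image tag apache/zeppelin:0.10.0, the container name, and the container-side path /opt/spark are assumptions; only the host path and the 8080 and 4040 port mappings come from the text.

```python
import shlex


def zeppelin_docker_command(spark_dir,
                            image="apache/zeppelin:0.10.0",
                            container_spark_home="/opt/spark"):
    """Return the docker argument list that mounts a local Spark
    distribution into the container and points SPARK_HOME at it."""
    return [
        "docker", "run", "--rm",
        "-p", "8080:8080",   # Zeppelin web UI
        "-p", "4040:4040",   # Spark web UI
        "-v", f"{spark_dir}:{container_spark_home}",
        "-e", f"SPARK_HOME={container_spark_home}",
        "--name", "zeppelin",
        image,
    ]


cmd = zeppelin_docker_command("/mnt/disk1/spark-3.1.2")
# Print a shell-ready form of the command for copy and paste.
print(shlex.join(cmd))
```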
After running the above command, you can open :8080 to play Spark in Zeppelin. Only Spark local mode is verified in Zeppelin docker; other modes may not work due to network issues. The -p 4040:4040 option exposes the Spark web UI, so that you can access it via :4040.
The Spark interpreter can be configured with properties provided by Zeppelin. You can also set other Spark properties which are not listed here. For a list of additional properties, refer to Spark Available Properties.

Property (default): Description

SPARK_HOME (no default): Location of the Spark distribution.
spark.master (default: local[*]): Spark master URI, e.g. spark://masterhost:7077.
spark.submit.deployMode (no default): The deploy mode of the Spark driver program, either "client" or "cluster", which means to launch the driver program locally ("client") or remotely ("cluster") on one of the nodes inside the cluster.
spark.app.name (default: Zeppelin): The name of the Spark application.
spark.driver.cores (default: 1): Number of cores to use for the driver process, only in cluster mode.
spark.driver.memory (default: 1g): Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).
spark.executor.cores (default: 1): The number of cores to use on each executor.
spark.executor.memory (default: 1g): Executor memory per worker instance, e.g. 512m, 32g.
spark.executor.instances (default: 2): The number of executors for static allocation.
spark.files (no default): Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed.
spark.jars (no default): Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
spark.jars.packages (no default): Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given, artifacts will be resolved according to the configuration in the file; otherwise artifacts will be searched for in the local Maven repo, then Maven Central, and finally any additional remote repositories given by the command-line option --repositories.
PYSPARK_PYTHON (default: python): Python binary executable to use for PySpark in both driver and executors. The property spark.pyspark.python takes precedence if it is set.
PYSPARK_DRIVER_PYTHON (default: python): Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON). The property spark.pyspark.driver.python takes precedence if it is set.
zeppelin.pyspark.useIPython (default: false): Whether to use IPython when the IPython prerequisites are met in %spark.pyspark.
zeppelin.R.cmd (default: R): R binary executable path.
zeppelin.spark.concurrentSQL (default: false): Execute multiple SQL statements concurrently if set to true.
zeppelin.spark.concurrentSQL.max (default: 10): Maximum number of SQL statements executed concurrently.
zeppelin.spark.maxResult (default: 1000): Maximum number of rows of a Spark SQL result to display.
zeppelin.spark.run.asLoginUser (default: true): Whether to run the Spark job as the Zeppelin login user. It is only applied when running a Spark job in a Hadoop YARN cluster with Shiro enabled.
zeppelin.spark.printREPLOutput (default: true): Print Scala REPL output.
zeppelin.spark.useHiveContext (default: true): Use HiveContext instead of SQLContext if it is true. Enables Hive for SparkSession.
zeppelin.spark.enableSupportedVersionCheck (default: true): Do not change. Developer-only setting, not for production use.
zeppelin.spark.sql.interpolation (default: false): Enable ZeppelinContext variable interpolation into Spark SQL.
zeppelin.spark.uiWebUrl (no default): Overrides the Spark UI default URL. The value should be a full URL (ex: http://hostName/uniquePath). In Kubernetes mode, the value can be a Jinja template string with 3 template variables PORT, SERVICENAME and SERVICEDOMAIN (e.g.: http://PORT-SERVICENAME.SERVICEDOMAIN). In yarn mode, the value could be a Knox URL with applicationId as placeholder (e.g.: -server:8443/gateway/yarnui/yarn/proxy/applicationId/).
spark.webui.yarn.useProxy (default: false): Whether to use the YARN proxy URL as the Spark web URL, e.g. :8088/proxy/application1583396598068_0004.
spark.repl.target (default: jvm-1.6): Manually specifies the Java version of the Spark interpreter Scala REPL. Available options:
  scala-compile v2.10.7 to v2.11.12 supports "jvm-1.5, jvm-1.6, jvm-1.7 and jvm-1.8", and the default value is jvm-1.6.
  scala-compile v2.10.1 to v2.10.6 supports "jvm-1.5, jvm-1.6, jvm-1.7", and the default value is jvm-1.6.
  scala-compile v2.12.x defaults to jvm-1.8, and only supports jvm-1.8.
If you want to use multiple versions of Spark, you need to create multiple Spark interpreters and set SPARK_HOME separately for each. For example, create a new Spark interpreter spark24 for Spark 2.4 and set its SPARK_HOME in the interpreter setting page.
After setting SPARK_HOME, you need to set the spark.master property in either the interpreter setting page or inline configuration. The value may vary depending on your Spark cluster deployment type.
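For the inline route, Zeppelin's generic configuration paragraph (for example %spark.conf) takes one property name and value per line. The sketch below renders such a paragraph from a dictionary; the helper name, the /opt/spark-2.4 path, the yarn master value, and the assumption that an interpreter named spark24 is addressed as %spark24.conf are all illustrative.

```python
def render_spark_conf(properties, interpreter="spark"):
    """Render the text of an inline configuration paragraph:
    a %<interpreter>.conf header followed by 'name value' lines."""
    lines = [f"%{interpreter}.conf"]
    for name, value in properties.items():
        lines.append(f"{name} {value}")
    return "\n".join(lines)


# Hypothetical settings for the spark24 interpreter described above.
paragraph = render_spark_conf(
    {"SPARK_HOME": "/opt/spark-2.4", "spark.master": "yarn"},
    interpreter="spark24",
)
print(paragraph)
```

Such a paragraph would typically need to run before the interpreter process starts, since the properties are read at launch.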