Submitting jobs directly to YARN


Ali Anwar

Mar 7, 2018, 6:04:31 PM
to Google Cloud Dataproc Discussions
I'm trying to run a command like so from the Cloud SSH Shell:
yarn jar YARNAPP.jar com/gpiskas/yarn/AppMaster

I am getting an error, for which I pasted a screenshot below. Any help is appreciated.
For instance, how can my command-line client kinit to authenticate against the YARN ResourceManager?

Karthik Palaniappan

Mar 7, 2018, 6:37:55 PM
to Google Cloud Dataproc Discussions
Hi Ali,

Did you try to kerberize your Dataproc cluster? I'm not convinced that it worked -- I would have expected that error message to be "SIMPLE authentication is not enabled. Available: [TOKEN, KERBEROS]."

Also, you would need to run `kinit <user>` before running `yarn jar`.

To be honest, Kerberos is uncharted territory for us too. While we might be able to provide some insight, you're better off following the documentation.
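For reference, on a kerberized cluster the flow from a shell would look roughly like the sketch below; the principal name and realm are placeholders, not values from this thread.

```shell
# Obtain a Kerberos ticket before talking to the YARN ResourceManager.
# The principal and realm are placeholders -- use your cluster's realm.
kinit alice@EXAMPLE.COM

# Verify that a ticket was granted to the local credential cache.
klist

# YARN CLI commands now authenticate using the ticket cache.
yarn jar YARNAPP.jar com.gpiskas.yarn.Client
```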

Karthik Palaniappan

Mar 7, 2018, 7:53:31 PM
to Google Cloud Dataproc Discussions
Backing up -- can you tell us more about why you're interested in Kerberos? Feel free to send us a private email to dataproc-feedback {at} google.com if you don't want to share that publicly.

Here are some high-level pointers on Dataproc security:

1) We generally recommend storing your data outside of Dataproc clusters, e.g. in Google Cloud Storage or BigQuery. That way, you can configure access control with IAM across all of your cloud usage. If your data is in Cloud Storage or BigQuery, the data transfer to Dataproc clusters is secured. Also, all communication with Dataproc's control plane (e.g. creation of clusters, submission of jobs) is secured. In fact, *all* communication with Google services is secured.

2) We recommend "ephemeral" clusters -- clusters used by a single workflow, user, or set of users. For batch jobs, consider checking out Dataproc workflows. You can create each Dataproc cluster with a particular service account, which any jobs on that cluster will use to authenticate with Google services. These short-lived clusters are made possible because your data lives outside of the cluster itself.

3) While communication between Hadoop daemons in the cluster is not secured by default, the cluster is essentially airgapped. Node-to-node communication happens over internal IPs on your isolated VPC network. Dataproc has guidance on how to configure firewall rules. Also note that any traffic that leaves a Google datacenter (e.g. communication between regions) is always encrypted.

4) You can also use Dataproc private IP clusters to entirely avoid having external IP addresses on the VMs.
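As a rough sketch of points 2 and 4, a short-lived cluster with a dedicated service account and no external IPs could be created like this (the cluster name, region, project, and service account are all placeholders):

```shell
# Hypothetical example: create an ephemeral cluster whose jobs authenticate
# to Google services as a dedicated service account, with internal IPs only.
gcloud dataproc clusters create my-ephemeral-cluster \
    --region=us-central1 \
    --service-account=my-job-runner@my-project.iam.gserviceaccount.com \
    --no-address

# Tear the cluster down once the workflow finishes.
gcloud dataproc clusters delete my-ephemeral-cluster --region=us-central1
```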

Ali Anwar

Mar 8, 2018, 1:46:10 PM
to Google Cloud Dataproc Discussions
Hi Karthik.

I didn't specifically configure Kerberos. I just went with the default Dataproc configuration, besides reducing the memory and CPU size of each node in my cluster.
I also didn't notice any Kerberos-related configuration in the *-site.xml files.

For this use case, I am not super concerned about securing my cluster, but rather about running a YARN application. However, I was unable to submit jobs directly to YARN from the command line.
Is this not supported on Dataproc? If it is, can someone who has tried this before advise me on what I should do differently?

Regards,
Ali Anwar

Karthik Palaniappan

Mar 8, 2018, 2:15:51 PM
to Google Cloud Dataproc Discussions
Ah, yes, submitting YARN jobs through the command line is supported.

For example, I created a cluster and ran the command `yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10` and it ran successfully.

Questions:

1) Can you tell us more about why you want to reduce memory/CPU? Are you trying to resize the VMs (unsupported), or reduce the memory each YARN NodeManager has?
  a) Though you can't resize VMs, you can create a cluster with different machine types or add/remove nodes from the cluster.
  b) If you tried to change YARN configuration in the XML files: did you change them the same way on all nodes? What properties did you set, and to what values? (Note that you can create a cluster with `--properties` to make this easier.)

2) Can you run `yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10`? If you get the same error message, you've likely changed configuration properties incorrectly. If not, it's probably something in your YARNAPP.jar. Does that job set any properties?

3) You can send us a diagnostic tarball to dataproc-feedback (privately) and we might be able to help more. If you could send us the code for YARNAPP.jar (or some minimal repro) that would also be helpful.
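On the `--properties` point above: YARN settings can be applied uniformly at cluster creation instead of hand-editing XML on every node. A sketch, with illustrative values only:

```shell
# Hypothetical example: the yarn: prefix writes the property into
# yarn-site.xml on every node at cluster creation time.
gcloud dataproc clusters create my-cluster \
    --properties=yarn:yarn.nodemanager.resource.memory-mb=2048,yarn:yarn.nodemanager.resource.cpu-vcores=2
```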

Ali Anwar

Mar 8, 2018, 7:59:05 PM
to Google Cloud Dataproc Discussions
1) I didn't modify any yarn configurations. I was simply trying to reduce the cost of my experimentation/development cluster.
2) I was able to run this MapReduce job successfully. What fails for me is running a generic Yarn application.
3) I got the yarn app from the following GitHub repo: https://github.com/gpiskas/Simple_YARN_App_Skeleton/blob/master/YARNAPP.jar

Thanks,
Ali Anwar

Patrick Clay

Mar 9, 2018, 2:37:15 PM
to Google Cloud Dataproc Discussions
Hi Ali,

If I understand correctly, you tried to run a YARN AppMaster locally instead of as a YARN container, which YARN does not support.

If you want to shrink the footprint of your AppMaster, you could create a small single-node cluster and set the property `yarn:yarn.scheduler.minimum-allocation-mb=1`, because by default we force all containers to be at least 1GB (on sufficiently large clusters).
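A minimal sketch of that suggestion; the cluster name and machine type are placeholders:

```shell
# Hypothetical example: a single-node cluster with a small machine type
# and a 1 MB minimum YARN container allocation.
gcloud dataproc clusters create tiny-cluster \
    --single-node \
    --master-machine-type=n1-standard-2 \
    --properties=yarn:yarn.scheduler.minimum-allocation-mb=1
```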

For more on Clients and AppMasters, I refer you to Hadoop's documentation.

Hope that helps,
-Patrick

Ali Anwar

Mar 9, 2018, 2:39:54 PM
to Google Cloud Dataproc Discussions
Hey Patrick.

Just to clarify - my goal wasn't to run the YARN AppMaster locally, but indeed as a YARN container.

Patrick Clay

Mar 9, 2018, 3:18:16 PM
to Google Cloud Dataproc Discussions
Sorry if I misunderstood.

I said that because you ran `yarn jar YARNAPP.jar com.gpiskas.yarn.AppMaster` instead of `yarn jar YARNAPP.jar com.gpiskas.yarn.Client`, which uses the Client to submit the AppMaster to YARN.

This is what launch.sh runs, and it is correct. The only issue with launch.sh is its assumptions about local logs at the end. I don't understand how you were trying to reduce costs, but there should be no difference between `yarn jar`, `hadoop jar`, or submitting as a Dataproc Hadoop job, for both MapReduce and direct YARN applications.
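To spell out the difference between the two invocations:

```shell
# Runs the AppMaster directly in the local JVM -- not supported:
# yarn jar YARNAPP.jar com.gpiskas.yarn.AppMaster

# Runs the Client, which asks the ResourceManager to launch the
# AppMaster inside a YARN container -- this is what launch.sh does:
yarn jar YARNAPP.jar com.gpiskas.yarn.Client
```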

-Patrick

Ali Anwar

Mar 17, 2018, 7:18:12 PM
to Google Cloud Dataproc Discussions
Hey Patrick.

Thanks for clarifying the commands. With that, I was able to launch the YARN application.

Regards,
Ali Anwar