Dataproc Spark and mount configurations with SSDs


AndreiK

Oct 6, 2016, 1:06:58 PM10/6/16
to Google Cloud Dataproc Discussions
I use Dataproc to create Spark clusters with two SSDs per node for storing /tmp files. I need these SSDs combined into a single mount point with RAID0. In the past I used mdadm and mkfs to do this, then updated spark.local.dir to use that mount point as the /tmp location. Now Dataproc (bdutil) updates the yarn/spark/hadoop configs and fstab with /mnt/1 and /mnt/2 pointing at sdb/sdc, and Spark dies. Can you confirm this is a recent change? How can I keep using Spark with two SSDs per worker in RAID0 for /tmp files? Thank you
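For context, the Spark side of that setup is a single property; a minimal sketch, assuming the RAID0 array is mounted at /mnt/md0 (an assumed path for illustration, not a Dataproc default):

```
# spark-defaults.conf -- point Spark's scratch/shuffle space at the RAID0 mount
# (/mnt/md0/tmp is an assumed path, not a Dataproc default)
spark.local.dir /mnt/md0/tmp
```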

Dennis Huo

Oct 6, 2016, 1:22:50 PM10/6/16
to Google Cloud Dataproc Discussions
First, to clarify in case there's future confusion for other folks coming across this thread: while some of the same engineers are involved with both products, it's important to keep in mind that Cloud Dataproc is an entirely different product from bdutil. Dataproc is a managed service with SLAs that provides REST APIs for cluster and job management and charges an additional $0.01 per vCPU-hour. bdutil, by contrast, is a "do-it-yourself" tool, no different from running a bunch of "gcloud compute instances create" commands and SSH'ing into each one to install Hadoop and Spark from tarballs; bdutil is thus more flexible/customizable but less optimized and less user-friendly.

That said, neither one has changed the way local SSDs are mounted recently, though there was indeed a recent update inheriting some newer guest-environment settings which didn't change the semantics of how Dataproc sets up the separate mount points, but may have changed the way any customizations would interact with the base Debian configuration: https://github.com/GoogleCloudPlatform/compute-image-packages#instance-setup

Do you have a project id you can share with dataproc...@google.com, and do you have details about how you're running your custom RAID0 setup commands?

AndreiK

Oct 6, 2016, 9:57:40 PM10/6/16
to Google Cloud Dataproc Discussions
Here's why I brought up bdutil. On a worker node, cat /etc/fstab shows:

UUID=1234 /mnt/1 ext4 defaults,discard 0 2    # added by bdutil

I am using Dataproc, though. Moving on: these are the settings present in the XMLs of nodes with SSDs.
yarn-site.xml:

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/1/hadoop/yarn/nm-local-dir</value>
</property>

Other files:

/etc/hadoop/conf.empty/hdfs-site.xml:    <value>/mnt/1/hadoop/dfs/data</value>
/etc/hadoop/conf.empty/mapred-site.xml:    <value>/mnt/1/hadoop/mapred/local</value>

This is an example of a yarn-site.xml setting for a worker without SSDs that I was hoping would still hold:

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/hadoop/yarn/nm-local-dir</value>
</property>


I RAID0 via:

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sd[bc]

mkfs.ext4 -F /dev/md0


I need the SSDs for something else, not for Spark's tmp files. It is possible to remount, update the XMLs, etc., but I was hoping there is a better way.
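The remount-and-reconfigure workaround mentioned above could be sketched roughly as follows; the array device, mount point, and config paths are assumptions for illustration, not Dataproc defaults:

```shell
# Assemble the two local SSDs into a RAID0 array (assumes /dev/sdb and
# /dev/sdc are the local SSDs), format it, and mount it.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 -F /dev/md0
mkdir -p /mnt/md0
mount -o defaults,discard /dev/md0 /mnt/md0
# Persist the mount across reboots.
echo "UUID=$(blkid -s UUID -o value /dev/md0) /mnt/md0 ext4 defaults,discard 0 2" >> /etc/fstab
# Then point yarn.nodemanager.local-dirs (and the hdfs-site/mapred-site
# equivalents) at directories under /mnt/md0 and restart the daemons.
```

All of the above requires root and destroys any data on the SSDs, so it belongs in an initialization action run before the daemons start.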

Dennis Huo

Oct 6, 2016, 10:25:09 PM10/6/16
to Google Cloud Dataproc Discussions
Ah, thanks for the clarification; there are indeed some vestiges of shared/forked code, though the two have diverged enough at this point that they'll have different considerations when doing your own RAID0 setup.

When you say "in the past I used...", did you mean you had commands that used to work on Dataproc and then stopped working in a recent release? Do you remember approximately when Dataproc last worked the way you originally used it?

AndreiK

Oct 6, 2016, 10:40:49 PM10/6/16
to Google Cloud Dataproc Discussions
I had it working last week, but then I realized I only had a single SSD at that point and was able to simply mount it. It is trickier with more than one SSD, since I need to RAID them. Bottom line: it would be nice if Dataproc let me decide which disk (PD or SSD) to allocate for Spark.

Dennis Huo

Oct 6, 2016, 11:30:44 PM10/6/16
to Google Cloud Dataproc Discussions
Thanks for clarifying. At the moment, unfortunately, it will require the remount/XML-update/daemon-restart steps you mention.

However, I agree it could be a useful feature to be able to specify disabling of the normal disk mounting logic. Note that the mounting logic also points /hadoop at one of the mounted disks and not the boot disk, so it might not be enough to change the configs to point at the /hadoop directory after remounting.

It sounds like in your case you'd prefer that Dataproc just ignore the local SSDs entirely, so that you can add your own initialization actions to deal with the SSDs from scratch; is that right?

It'd be easier to provide a configuration setting for wholesale disabling of the disk-mounting logic than to try more fine-grained settings for what Spark uses, what YARN uses, what HDFS uses, etc.

If you send your project-id to dataproc...@google.com we might be able to discuss having you test an early cut of a candidate image via our Trusted Testers images arrangement.

Dennis Huo

Oct 20, 2016, 3:15:23 PM10/20/16
to Google Cloud Dataproc Discussions
By the way, I forgot to mention it here at the time, but since managing local SSDs as separate block devices for custom usage is a known use case that comes up from time to time, on Oct 11th we added the ability to specify "--properties dataproc:dataproc.localssd.mount.enable=false" when creating a Dataproc cluster. This makes Dataproc ignore local SSDs so that you can configure them yourself as needed:
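For example, a cluster-creation command using that property might look like the following; the cluster name, bucket, and init-action script are placeholders, not real resources:

```shell
# Create a cluster with two local SSDs per worker, tell Dataproc not to
# mount them, and run a custom init action to RAID them yourself.
# (my-cluster, my-bucket, and raid-local-ssds.sh are placeholders.)
gcloud dataproc clusters create my-cluster \
    --num-worker-local-ssds=2 \
    --properties dataproc:dataproc.localssd.mount.enable=false \
    --initialization-actions gs://my-bucket/raid-local-ssds.sh
```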


I'd love to hear if you get a chance to try it and see if it works for your case.

AndreiK

Oct 27, 2016, 6:20:10 PM10/27/16
to Google Cloud Dataproc Discussions
Yes, it worked great. Thanks!