Install and manage software on a Cloud Dataproc cluster

geoffry...@hedronanalytics.com

Jan 29, 2018, 3:03:44 PM
to Google Cloud Dataproc Discussions
All,

According to the Dataproc FAQ, I should be able to "...install and manage software on a Cloud Dataproc cluster." I have been scouring the docs and cannot see exactly how. (Yes, I could have missed it.) I need to install Zookeeper and Accumulo. Accumulo uses HDFS.

Does anyone know? Can I get there from here?

I thank you.

Karthik Palaniappan

Jan 29, 2018, 3:54:43 PM
to Google Cloud Dataproc Discussions
Hi there,

You can install any additional software using "initialization actions" -- arbitrary scripts to be run on every node when the cluster gets created. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions.

For example:

```
gcloud dataproc clusters create <cluster-name> --initialization-actions gs://path/to/my/init-action.sh
```

We have some example init actions on github, and zookeeper is one of them: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/zookeeper. Note that if you deploy a high availability cluster, zookeeper comes pre-installed, so you wouldn't need this init action.

If you do write an Accumulo init action, consider adding it to our examples!
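
In case it helps as a starting point, here is a very rough, untested sketch of what an Accumulo init action might look like. The version number, download URL, and install path are my assumptions, and the actual Accumulo configuration (accumulo-site.xml, running "accumulo init") is only noted in a comment:

```
#!/bin/bash
# Hypothetical sketch only -- the Accumulo version, download URL, and install
# path are assumptions, not a tested init action.
set -euxo pipefail

ACCUMULO_VERSION="1.8.1"
ACCUMULO_HOME="/opt/accumulo"

# Download and unpack the binary tarball from the Apache archive.
wget -q "https://archive.apache.org/dist/accumulo/${ACCUMULO_VERSION}/accumulo-${ACCUMULO_VERSION}-bin.tar.gz" -P /tmp/
mkdir -p "${ACCUMULO_HOME}"
tar xzf "/tmp/accumulo-${ACCUMULO_VERSION}-bin.tar.gz" -C "${ACCUMULO_HOME}" --strip-components=1

# Remaining steps (not shown): point accumulo-site.xml at HDFS and the
# ZooKeeper quorum, then run "accumulo init" once, on the master node only.
```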

geoffry...@hedronanalytics.com

Jan 30, 2018, 4:36:33 PM
to Google Cloud Dataproc Discussions
Thanks Karthik,  

I gave your suggestions a try and they mostly worked. Truth be told, I don't quite understand gcloud yet.

I have another question: Why can't I run more than one initialization action?

I tried to create a simple cluster--single-node--installing zookeeper and accumulo. It didn't work.
I ran this:
gcloud dataproc clusters create haz00 \
--initialization-actions gs://zookeeper.sh,gs://accumulo.sh \
--region=us-east4 \
--zone=us-east4-a \
--single-node

And got this:
ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Google Cloud Storage object does not exist 'gs://accumulo/accumulo.sh'.

If I remove the offending bit, and run this:
gcloud dataproc clusters create haz00 \
--initialization-actions gs://zookeeper.sh \
--region=us-east4 \
--zone=us-east4-a \
--single-node

Notice, I am running only one script. I get a running instance but sans Accumulo.

Karthik Palaniappan

Jan 30, 2018, 5:21:11 PM
to Google Cloud Dataproc Discussions
Cloud Storage paths should be in the form gs://<bucketname>/path/to/object -- I'm surprised Dataproc accepted gs://zookeeper.sh.

If you have not created a GCS bucket (https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-gsutil), create one like this:

```
gsutil mb gs://<bucketname>  # e.g. gs://geoffryroberts
```

Then, copy the scripts into your bucket (assuming you have accumulo.sh on your machine):

```
gsutil cp gs://dataproc-initialization-actions/zookeeper/zookeeper.sh gs://<bucketname>/zookeeper.sh
gsutil cp ./accumulo.sh gs://<bucketname>/accumulo.sh
```

And finally, create the cluster (yes, you can use multiple init actions):

```
gcloud dataproc clusters create haz00 \
--initialization-actions gs://<bucketname>/zookeeper.sh,gs://<bucketname>/accumulo.sh \
--region=us-east4 \
--zone=us-east4-a \
--single-node
```

You may also need to set --initialization-action-timeout if your accumulo init action takes more than 10 minutes.
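
For example, something like this (the 30-minute value is just an illustration):

```
gcloud dataproc clusters create haz00 \
--initialization-actions gs://<bucketname>/zookeeper.sh,gs://<bucketname>/accumulo.sh \
--initialization-action-timeout 30m \
--region=us-east4 \
--zone=us-east4-a \
--single-node
```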

geoffry...@hedronanalytics.com

Jan 31, 2018, 1:10:51 PM
to Google Cloud Dataproc Discussions

Thanks again, Karthik. I had no idea I needed to upload a script to a bucket. Still, on day three of my Google Cloud experience, some issues persist.


I create a bucket:

$ gsutil mb -p bold-rain-193317 -c regional -l us-east4-a gs://haz-bucket

It creates successfully.

I try to upload:

$ cd ~/dataproc-initialization-actions
$ gsutil cp gs://dataproc-initialization-actions/zookeeper/zookeeper.sh gs://haz-bucket/zookeeper.sh
$ gsutil cp gs://dataproc-initialization-actions/accumulo/accumulo.sh gs://haz-bucket/accumulo.sh


zookeeper.sh succeeds, but accumulo.sh does not.  The paths are correct.

Question: Why does zookeeper.sh succeed when I run the command from within the dataproc-initialization-actions directory?


Also, I get the following error on accumulo.sh:


CommandException: No URLs matched: gs://dataproc-initialization-actions/accumulo/accumulo.sh


If I use a relative path like this (NOTE: I removed the super directory): 


$ gsutil cp gs://zookeeper/zookeeper.sh gs://haz-bucket/zookeeper.sh
$ gsutil cp gs://accumulo/accumulo.sh gs://haz-bucket/accumulo.sh


I get for zookeeper.sh:


AccessDeniedException: 403 geoffry.roberts@hedronanalytics.com does not have storage.objects.list access to zookeeper.


But for accumulo.sh I get:


BucketNotFoundException: 404 gs://accumulo bucket does not exist.


Why? Why the two different errors?


Finally, is there a way to execute these scripts from the command line on my compute instance? How do I access my bucket from there?


There is no apt-get for Accumulo, so I must download the tarball. Where does the create command download it to? No download appears in my bucket or on my compute instance.


This is my create command:



$ gcloud dataproc clusters create haz00 \
--initialization-actions gs://haz-bucket/zookeeper.sh,gs://haz-bucket/accumulo.sh \
--bucket=haz-bucket \
--region=us-east4 \
--zone=us-east4-a \
--single-node



You've been a lot of help already. Thanks.





Karthik Palaniappan

Jan 31, 2018, 5:37:46 PM
to Google Cloud Dataproc Discussions
> Question: Why does zookeeper.sh succeed when I run the command from within the dataproc-initialization-actions directory?

The directory does not matter. That zookeeper command should always succeed, and that accumulo command will always fail. That command is copying scripts from gs://dataproc-initialization-actions (a Dataproc-maintained mirror of https://github.com/GoogleCloudPlatform/dataproc-initialization-actions). If you run gsutil ls gs://dataproc-initialization-actions/zookeeper/zookeeper.sh, you'll see that that script exists. Accumulo, however, is not one of our init actions, so the path you tried (gs://dataproc-initialization-actions/accumulo/accumulo.sh) does not exist.

You will instead need to write an accumulo script yourself (on your local machine), then upload it to GCS with gsutil cp ./my-accumulo-script.sh gs://haz-bucket/accumulo.sh.
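
For example (my-accumulo-script.sh is just a placeholder name for whatever you write):

```
gsutil cp ./my-accumulo-script.sh gs://haz-bucket/accumulo.sh
gsutil ls gs://haz-bucket/   # verify both scripts are in place before creating the cluster
```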

> Why? Why the two different errors?

As I mentioned earlier, GCS paths are in the form gs://<bucketname>/path. Bucket names are actually in a global namespace -- not per-project. So if I own gs://my-bucket, you will get permission denied trying to list it. It sounds like gs://zookeeper exists but is owned by another project (so you got a 403), and gs://accumulo does not exist (so a 404). Most GCS paths you use will be under your own buckets (e.g. gs://haz-bucket). One exception is gs://dataproc-initialization-actions, which is a publicly readable bucket.
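
You can see the distinction for yourself by listing a few buckets (the comments describe the outcomes I'd expect, given the errors you saw):

```
gsutil ls gs://dataproc-initialization-actions/   # public Dataproc bucket: listing works for anyone
gsutil ls gs://haz-bucket/                        # your own bucket: listing works for you
gsutil ls gs://zookeeper/                         # exists but belongs to someone else: 403
gsutil ls gs://accumulo/                          # no such bucket: 404
```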

You can take this tutorial to learn more about GCS: https://cloud.google.com/storage/docs/quickstart-gsutil

> Is there a way to execute these scripts from the command line on my compute instance? How do I access my bucket from there?

Yes! Once a cluster has been created, you can SSH into any of its Compute Engine hosts. If your single-node cluster is called haz00, the VM will be called haz00-m. Run `gcloud compute ssh haz00-m` to SSH into that VM.

gsutil comes pre-installed on Compute Engine VMs, so you can run any gsutil commands there, like gsutil ls gs://haz-bucket.
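
For example, roughly (using the cluster and bucket names from your messages):

```
gcloud compute ssh haz00-m --zone=us-east4-a

# Then, on the VM:
gsutil ls gs://haz-bucket
gsutil cp gs://haz-bucket/accumulo.sh .
sudo bash ./accumulo.sh
```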

You can take this Dataproc tutorial to get a little more familiar with init actions: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook