Install and manage software on a Cloud Dataproc cluster

geoffry...@hedronanalytics.com

Jan 29, 2018, 3:03:44 PM
to Google Cloud Dataproc Discussions
All,

According to the Dataproc FAQ, I should be able to "...install and manage software on a Cloud Dataproc cluster." I have been scouring the docs and cannot see exactly how. (Yes, I could have missed it.) I need to install Zookeeper and Accumulo. Accumulo uses HDFS.

Does anyone know? Can I get there from here?

I thank you.

Karthik Palaniappan

Jan 29, 2018, 3:54:43 PM
to Google Cloud Dataproc Discussions
Hi there,

You can install any additional software using "initialization actions" -- arbitrary scripts to be run on every node when the cluster gets created. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions.

For example:

```
gcloud dataproc clusters create <cluster-name> --initialization-actions gs://path/to/my/init-action.sh
```

We have some example init actions on github, and zookeeper is one of them: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/zookeeper. Note that if you deploy a high availability cluster, zookeeper comes pre-installed, so you wouldn't need this init action.

If you do write an Accumulo init action, consider adding it to our examples!
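
In case it helps as a starting point, here is a very rough, untested sketch of what an Accumulo init action might look like. The version number, download URL, and install path are my assumptions, and the actual Accumulo configuration (accumulo-site.xml, running "accumulo init") is only noted in a comment:

```
#!/bin/bash
# Hypothetical sketch only -- the Accumulo version, download URL, and install
# path are assumptions, not a tested init action.
set -euxo pipefail

ACCUMULO_VERSION="1.8.1"
ACCUMULO_HOME="/opt/accumulo"

# Download and unpack the binary tarball from the Apache archive.
wget -q "https://archive.apache.org/dist/accumulo/${ACCUMULO_VERSION}/accumulo-${ACCUMULO_VERSION}-bin.tar.gz" -P /tmp/
mkdir -p "${ACCUMULO_HOME}"
tar xzf "/tmp/accumulo-${ACCUMULO_VERSION}-bin.tar.gz" -C "${ACCUMULO_HOME}" --strip-components=1

# Remaining steps (not shown): point accumulo-site.xml at HDFS and the
# ZooKeeper quorum, then run "accumulo init" once, on the master node only.
```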

geoffry...@hedronanalytics.com

Jan 30, 2018, 4:36:33 PM
to Google Cloud Dataproc Discussions
Thanks Karthik,  

I gave your suggestions a try and they mostly worked. Truth be told, I don't quite understand gcloud yet.

I have another question: Why can't I run more than one initialization action?

I tried to create a simple cluster--single-node--installing zookeeper and accumulo. It didn't work.
I ran this:
gcloud dataproc clusters create haz00 \
--initialization-actions gs://zookeeper.sh,gs://accumulo.sh \
--region=us-east4 \
--zone=us-east4-a \
--single-node

And got this:
ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Google Cloud Storage object does not exist 'gs://accumulo/accumulo.sh'.

If I remove the offending bit, and run this:
gcloud dataproc clusters create haz00 \
--initialization-actions gs://zookeeper.sh \
--region=us-east4 \
--zone=us-east4-a \
--single-node

Notice, I am running only one script. I get a running instance but sans Accumulo.

Karthik Palaniappan

Jan 30, 2018, 5:21:11 PM
to Google Cloud Dataproc Discussions
Cloud Storage paths should be in the form gs://<bucketname>/path/to/object -- I'm surprised Dataproc accepted gs://zookeeper.sh.

If you have not created a GCS bucket (https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-gsutil), create one like this:

```
gsutil mb gs://<bucketname>  # e.g. gs://geoffryroberts
```

Then, copy the scripts into your bucket (assuming you have accumulo.sh on your machine):

```
gsutil cp gs://dataproc-initialization-actions/zookeeper/zookeeper.sh gs://<bucketname>/zookeeper.sh
gsutil cp ./accumulo.sh gs://<bucketname>/accumulo.sh
```

And finally, create the cluster (yes, you can use multiple init actions):

```
gcloud dataproc clusters create haz00 \
--initialization-actions gs://<bucketname>/zookeeper.sh,gs://<bucketname>/accumulo.sh \
--region=us-east4 \
--zone=us-east4-a \
--single-node
```

You may also need to set --initialization-action-timeout if your accumulo init action takes more than 10 minutes.
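
For example, something like this (the 30-minute value is just an illustration):

```
gcloud dataproc clusters create haz00 \
--initialization-actions gs://<bucketname>/zookeeper.sh,gs://<bucketname>/accumulo.sh \
--initialization-action-timeout 30m \
--region=us-east4 \
--zone=us-east4-a \
--single-node
```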

geoffry...@hedronanalytics.com

Jan 31, 2018, 1:10:51 PM
to Google Cloud Dataproc Discussions

Thanks again, Karthik. I had no idea I needed to upload a script to a bucket. Still, on day three of my Google Cloud experience, some issues persist.


I create a bucket:

$ gsutil mb -p bold-rain-193317 -c regional -l us-east4-a gs://haz-bucket

It creates successfully.

I try to upload:

$ cd ~/dataproc-initialization-actions
$ gsutil cp gs://dataproc-initialization-actions/zookeeper/zookeeper.sh gs://haz-bucket/zookeeper.sh
$ gsutil cp gs://dataproc-initialization-actions/accumulo/accumulo.sh gs://haz-bucket/accumulo.sh


zookeeper.sh succeeds, but accumulo.sh does not.  The paths are correct.

Question: Why does zookeeper.sh succeed when I run the command from within the dataproc-initialization-actions directory?


Also, I get the following error on accumulo.sh:


CommandException: No URLs matched: gs://dataproc-initialization-actions/accumulo/accumulo.sh


If I use a relative path like this (NOTE: I removed the super directory): 


$ gsutil cp gs://zookeeper/zookeeper.sh gs://haz-bucket/zookeeper.sh
$ gsutil cp gs://accumulo/accumulo.sh gs://haz-bucket/accumulo.sh


I get for zookeeper.sh:


AccessDeniedException: 403 geoffry.roberts@hedronanalytics.com does not have storage.objects.list access to zookeeper.


But for accumulo.sh I get:


BucketNotFoundException: 404 gs://accumulo bucket does not exist.


Why? Why the two different errors?


Finally, is there a way to execute these scripts from the command line on my compute instance? How do I access my bucket from there?


There is no apt-get for Accumulo, so I must download the tarball. Where does the create command download it to? No download appears in my bucket or on my compute instance.


This is my create command:



$ gcloud dataproc clusters create haz00 \
--initialization-actions gs://haz-bucket/zookeeper.sh,gs://haz-bucket/accumulo.sh \
--bucket=haz-bucket \
--region=us-east4 \
--zone=us-east4-a \
--single-node



You've been a lot of help already. Thanks.





Karthik Palaniappan

Jan 31, 2018, 5:37:46 PM
to Google Cloud Dataproc Discussions
> Question: Why does zookeeper.sh succeed when I run the command from within the dataproc-initialization-actions directory?

The directory does not matter. That zookeeper command should always succeed, and that accumulo command will always fail. That command is copying scripts from gs://dataproc-initialization-actions (a Dataproc-maintained mirror of https://github.com/GoogleCloudPlatform/dataproc-initialization-actions). If you run gsutil ls gs://dataproc-initialization-actions/zookeeper/zookeeper.sh, you'll see that that script exists. Accumulo, however, is not one of our init actions, so the path you tried (gs://dataproc-initialization-actions/accumulo/accumulo.sh) does not exist.

You will instead need to write an accumulo script yourself (on your local machine), then upload it to GCS with gsutil cp ./my-accumulo-script.sh gs://haz-bucket/accumulo.sh.
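
For example (my-accumulo-script.sh is just a placeholder name for whatever you write):

```
gsutil cp ./my-accumulo-script.sh gs://haz-bucket/accumulo.sh
gsutil ls gs://haz-bucket/   # verify both scripts are in place before creating the cluster
```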

> Why? Why the two different errors?

As I mentioned earlier, GCS paths are in the form gs://<bucketname>/path. Bucket names are actually in a global namespace -- not per-project. So if I own gs://my-bucket, you will get permission denied trying to list it. It sounds like gs://zookeeper exists but is owned by another project (so you got a 403), and gs://accumulo does not exist (so a 404). Most GCS paths you use will be under your own buckets (e.g. gs://haz-bucket). One exception is gs://dataproc-initialization-actions, which is a publicly readable bucket.
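
You can see the distinction for yourself by listing a few buckets (the comments describe the outcomes I'd expect, given the errors you saw):

```
gsutil ls gs://dataproc-initialization-actions/   # public Dataproc bucket: listing works for anyone
gsutil ls gs://haz-bucket/                        # your own bucket: listing works for you
gsutil ls gs://zookeeper/                         # exists but belongs to someone else: 403
gsutil ls gs://accumulo/                          # no such bucket: 404
```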

You can take this tutorial to learn more about GCS: https://cloud.google.com/storage/docs/quickstart-gsutil

> Is there a way to execute these scripts from the command line on my compute instance? How do I access my bucket from there?

Yes! Once a cluster has been created, you can SSH into any of its Compute Engine hosts. If your single-node cluster is called haz00, the VM will be called haz00-m. Run `gcloud compute ssh haz00-m` to SSH into that VM.

gsutil comes pre-installed on Compute Engine VMs, so you can run any gsutil commands there, like gsutil ls gs://haz-bucket.
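
For example, roughly (using the cluster and bucket names from your messages):

```
gcloud compute ssh haz00-m --zone=us-east4-a

# Then, on the VM:
gsutil ls gs://haz-bucket
gsutil cp gs://haz-bucket/accumulo.sh .
sudo bash ./accumulo.sh
```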

You can take this Dataproc tutorial to get a little more familiar with init actions: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook