cluster creation fails with no further information


Louis Bergelson

Mar 18, 2016, 3:57:51 PM
to Google Cloud Dataproc Discussions
I'm having trouble trying to create a new cluster.  I get a failure with no further information and no logs.

🐳  ~  gcloud dataproc clusters create --project broad-gatk-test --bucket broad-gatk-test-cluster test-cluster2 --zone us-central1-c
Waiting on operation [projects/broad-gatk-test/regions/global/operations/fc6bdfe0-3a93-4a0a-a105-4d3e5977a3f7].
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/broad-gatk-test/regions/global/operations/fc6bdfe0-3a93-4a0a-a105-4d3e5977a3f7] failed: Google Cloud Dataproc Agent reports failure. If logs are available, they can be found in 'gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/test-cluster2-m'..
🐳  ~  gsutil cat gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/test-cluster2-m
CommandException: No URLs matched: gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/test-cluster2-m

I ran diagnose, the output is here ( gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/tasks/0e433402-04d9-4a4e-a135-3eced8bcc5a4/diagnostic.tar)  and should be readable by "viewers-24135231894".  
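For reference, I ran it with something along the lines of the following (the beta clusters diagnose subcommand, pointed at the failed cluster):

    gcloud beta dataproc clusters diagnose test-cluster2 --project broad-gatk-test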

Browsing through the diagnose results, the end of 'node_startup/test-cluster2-w-0/dataproc-startup-script_output' has this line, which seems suspicious:
google-dataproc-startup: E: Could not open file /var/lib/apt/lists/http.debian.net_debian_dists_jessie_main_binary-amd64_Packages - open (2: No such file or directory)

This exact command worked earlier in the week, and I don't know what changed to break it.

Submitting a job fails:

    gcloud beta dataproc jobs submit spark --project broad-gatk-test --cluster cluster-1 --jar gs://hellbender/test/staging/lb_staging/gatk-all-4.alpha-191-gcacec92-SNAPSHOT-spark_d83f0056fb986bf07efb16e4fb2298cb.jar PrintReadsSpark -I gs://broad-gatk-test-cluster/src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam --sparkMaster yarn-client
ERROR: (gcloud.beta.dataproc.jobs.submit.spark) Unable to submit job, cluster 'cluster-1' is not in a helthy state.

(note: slight typo in "helthy state")

Going through the console UI also creates a broken cluster.

Patrick Clay

Mar 18, 2016, 4:57:14 PM
to Google Cloud Dataproc Discussions
Hi Louis,

You mentioned "going through the console UI also creates a broken cluster"; does that mean that you had successive failures?
This looks like a rare flake to me.

You were correct as to the culprit. There was an ephemeral Apt issue while cleaning up unused packages during boot. Our agents were somewhat overzealous in not continuing to boot that worker, which caused the cluster creation to fail. This is a bug, which we will fix.

As for logs, we should clarify that message, but gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/test-cluster2-m is a directory, equivalent to node_startup/test-cluster2-m in the diagnostic tar. 
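Since that path is a directory prefix rather than a single object, gsutil can list it and then cat individual files. For example (exact object names may vary):

    gsutil ls gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/test-cluster2-m/
    gsutil cat gs://broad-gatk-test-cluster/google-cloud-dataproc-metainfo/462a8c1b-b0ce-4eb2-9532-830365f79dc1/test-cluster2-m/dataproc-startup-script_output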

It is intended behavior to block job submission to clusters in an error state. In general you have to recreate broken clusters (after debugging the issue).

If this is a recurring problem could I see another diagnostic tarball?

Thanks,
-Patrick

Louis Bergelson

unread,
Mar 22, 2016, 1:45:49 PM
to Google Cloud Dataproc Discussions
Hi Patrick,

Thank you for your reply.  I should have figured out that the logs were a folder, not a single log.  Sorry about that.

This is indeed a recurring issue.  I seem to be completely unable to bring up a cluster.  Is it possible that there is some global setting in my project that is breaking things?  IT has made some changes to the project since I last successfully created a cluster, but I'm not sure what changed or where to look for potentially breaking changes.

I'm unable to provide another diagnostic tarball, because diagnose has stopped returning results.  It seems to run forever (> 20 minutes) without completing.

The output of my current cluster creation is here (it should be public, let me know if it is inaccessible) gs://dataproc-a61734cc-254d-4bb6-a176-b89c156819be-us/google-cloud-dataproc-metainfo/bcbc382c-0009-4249-8591-a68119495b3d/

Thanks for looking into this.

Louis

Louis Bergelson

Mar 22, 2016, 1:49:03 PM
to Google Cloud Dataproc Discussions
I should add that I've tried creating clusters on several different days now, so if it's a transient error, it's somehow become a persistent one.  (At least 5 failures; it's a bit of a pain to try a lot because I have a core limit and can only create one cluster at a time before I have to delete the old one.)


Dennis Huo

Mar 24, 2016, 2:22:45 PM
to Google Cloud Dataproc Discussions
If it's a consistent failure, most likely your project has some networking configured in a way that breaks some of the necessary communication from VMs. Do you have significant modifications to networking rules and/or do you have any project-level GCE startup scripts?
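If it helps, both are quick to check from the CLI (the project name below just mirrors what's already in this thread):

    # list all firewall rules, their networks, and source ranges
    gcloud compute firewall-rules list --project broad-gatk-test
    # look for startup-script entries under commonInstanceMetadata
    gcloud compute project-info describe --project broad-gatk-test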

Louis Bergelson

Mar 25, 2016, 11:34:04 AM
to Google Cloud Dataproc Discussions
We have some firewall rules defined, but I thought they only applied to traffic coming from outside of the Google internal network?  What ports need to be available for which IP addresses?

Patrick Clay

Mar 25, 2016, 4:31:48 PM
to Google Cloud Dataproc Discussions
Firewall rules apply to all traffic. The 'default' network comes preconfigured to allow internal traffic; if you still use the default network and it has the 'default-allow-internal' firewall rule, then you should be fine as far as networking goes.

If you want to whitelist internal traffic on a different network, you can allow all TCP/UDP traffic from the IP range of your internal network. Look at 'allow-internal' on https://cloud.google.com/compute/docs/networking. We can't enumerate a finite list of internal ports to open, because some services like Spark have their own ad hoc networking.
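As a rough sketch, such a rule could be created like this (adjust the rule name, network, and source range for your setup):

    gcloud compute firewall-rules create allow-internal \
        --project broad-gatk-test \
        --network default \
        --source-ranges 10.240.0.0/16 \
        --allow tcp:1-65535,udp:1-65535,icmp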

Hope that helps,
-Patrick

Louis Bergelson

Mar 25, 2016, 5:50:34 PM
to Google Cloud Dataproc Discussions
We have default-allow-internal set, and I believe I'm using the default network when I create a cluster (I assume it's the default if you don't specify something else).  I did notice that our default-allow-internal is set to allow 10.128.0.0/9, instead of 10.240.0.0/16, which is what seems to be indicated on the page you referenced.

I've tried changing the rule to allow 10.240.0.0/16 instead, but it still results in failure (after a very long wait).  Diagnose worked this time, though; the logs are here: gs://dataproc-a61734cc-254d-4bb6-a176-b89c156819be-us/google-cloud-dataproc-metainfo/ba61809b-833e-441a-bd91-459f5154bd79/tasks/941a844a-03ad-4fdb-8a91-2e8d21cd02cb/diagnostic.tar

I checked our deployment scripts that set up the firewall rules, and while we do override the default-allow-https, tcp, and icmp rules, we didn't modify the default-allow-internal rule at all, so I don't know why the range would have been changed.  
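In case it's useful, the current state of the rule can be checked with the describe subcommand (it prints the sourceRanges and allowed fields):

    gcloud compute firewall-rules describe default-allow-internal --project broad-gatk-test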


I can't find any obvious smoking guns in the logs. They end with an innocuous-looking echo statement.

 I do see 

google-dataproc-startup: /************************************************************
google-dataproc-startup: SHUTDOWN_MSG: Shutting down NameNode at test-cluster4-m.c.broad-gatk-test.internal/10.128.0.5
google-dataproc-startup: ************************************************************/

in node_startup/test-cluster4-m/dataproc-startup-script_output

logs/google-dataproc-agent.0.log has this in it:
   Mar 25, 2016 9:13:16 PM com.google.cloud.hadoop.services.agent.hdfs.HdfsAdminClientImpl getStorageReport
   INFO: Fetching Datanode storage report
   Mar 25, 2016 9:13:17 PM com.google.cloud.hadoop.services.agent.protocol.MetadataGcsClient updateAgent
   INFO: New node status: detail: "Insufficient number of data nodes reporting to start cluster"
   state: SETUP_FAILED
  
   Mar 25, 2016 9:24:42 PM com.google.cloud.hadoop.services.agent.MasterRequestReceiver$NormalWorkReceiver receivedSystemTask
   INFO: Received new taskId '941a844a-03ad-4fdb-8a91-2e8d21cd02cb'
   Mar 25, 2016 9:24:42 PM com.google.cloud.hadoop.services.agent.task.AbstractTaskHandler$1 call
   INFO: Running EXECUTE_COMMAND task...

And the cluster-ping log shows that both of the packets that were sent failed...

Combined, these make me think there's some Hadoop initialization problem.  I don't have any idea what it is, though, or how to fix it.

Any help you can give would be great.

Thank you,
Louis

Dennis Huo

Mar 25, 2016, 7:36:34 PM
to Google Cloud Dataproc Discussions
If it's still running, any chance you can SSH into one or both of the workers (test-cluster4-w-0 and test-cluster4-w-1) and fetch /var/log/hadoop-hdfs/*.log and /var/log/hadoop-yarn/*.log?

Perhaps also fetch /var/log/google-dataproc-agent.log from the worker nodes.
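Something along these lines should do it (I'm assuming the cluster is in the same zone as your earlier create command):

    # bundle the logs on the worker
    gcloud compute ssh test-cluster4-w-0 --zone us-central1-c \
        --command 'sudo tar czf /tmp/w0-logs.tgz /var/log/hadoop-hdfs /var/log/hadoop-yarn /var/log/google-dataproc-agent.log'
    # pull the bundle back to your machine
    gcloud compute copy-files test-cluster4-w-0:/tmp/w0-logs.tgz . --zone us-central1-c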

Louis Bergelson

Mar 28, 2016, 11:54:43 AM
to Google Cloud Dataproc Discussions
I've uploaded those logs to:

gs://broad-gatk-test/debug-logs/w-0 and gs://broad-gatk-test/debug-logs/w-1

They seem to be trying to connect to the server and failing, which makes sense, since I changed the default-allow-internal rule to allow 10.240.0.0/16 instead of its initial 10.128.0.0/9. I think it was initially configured correctly and I broke it trying to follow the example rules on the docs page. I'll add another firewall rule with my initial settings, delete my cluster, and recreate it. I expect that it will be back to the situation I was in before I made the change, which was the cluster failing to start up and diagnose on the cluster failing to ever return.

I'll let you know what happens...

Louis Bergelson

Mar 29, 2016, 4:27:30 PM
to Google Cloud Dataproc Discussions


So I've figured out the solution.  I added the following firewall rule, and now everything works fine.  Something was misconfigured; I'm not sure how it got that way, but this fixed it.

Thank you for your help.

Dennis Huo

Mar 31, 2016, 2:26:05 PM
to Google Cloud Dataproc Discussions
Ah, thanks for the update, glad to hear it worked out!

Louis Bergelson

Mar 31, 2016, 3:35:35 PM
to Google Cloud Dataproc Discussions
Ack.  I spoke too soon!  It was a transient success in a continual field of errors.  I've been unable to reproduce my success...  New clusters are failing with the same symptoms despite the new firewall rule that I thought had fixed things.  

Diagnose continues to fail.

gs://broad-gatk-test/debug-logs/itcontinues/logs-w-0
gs://broad-gatk-test/debug-logs/itcontinues/logs-w-1
 
Any help would be appreciated.

Dennis Huo

Mar 31, 2016, 4:02:38 PM
to Google Cloud Dataproc Discussions
If the diagnose command is failing, could you fetch /var/log/google-dataproc-agent.log from the master manually?
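For example, something like this should dump it to a local file (substitute your current cluster's master name and zone):

    gcloud compute ssh CLUSTER-NAME-m --zone us-central1-c \
        --command 'sudo cat /var/log/google-dataproc-agent.log' > master-agent.log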

Also, perhaps we can continue in more detail if you email dataproc...@google.com where you can more easily share details that you can't post here on a public forum.