unhealthy cluster


Amina [Admin] Jackson

Oct 4, 2023, 2:31:23 PM10/4/23
to Google Cloud Dataproc Discussions
I was submitting jobs fine on Dataproc clusters until I changed VMs to get better CPU performance. Now I get this error and I am stuck. I can't stop the cluster because it has a local disk; stopping and resetting the VMs manually does resolve the issue, but creating a new cluster is not a good option.

ERROR: (gcloud.dataproc.jobs.submit.pyspark) HttpError accessing .... response: <{'x-debug-tracking-id': '9633867706168606723;o=0', 'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Wed, 04 Oct 2023 18:06:34 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': 429}>, content <{
  "error": {
    "code": 429,
    "message": "No agent on master node(s) found to be active in the past 300 seconds.\nLast reported times: [gca-dev-cluster-m seconds: 1696241723\nnanos: 125792000\n]. This may indicate high memory usage in Dataproc master or an unhealthy Dataproc master node",
    "status": "RESOURCE_EXHAUSTED"
  }
}
>
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach. 
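For reference, here is a sketch of how one might check the cluster and agent status from the CLI before digging deeper. The cluster name, region, and zone below are placeholders taken from the error message, and the agent service name may vary by Dataproc image version:

```shell
# Check the cluster's reported status (look for RUNNING vs ERROR/UNHEALTHY):
gcloud dataproc clusters describe gca-dev-cluster \
    --region=REGION --format="value(status.state,status.detail)"

# SSH to the master and check whether the Dataproc agent service is running
# (service name is an assumption; it may differ on older images):
gcloud compute ssh gca-dev-cluster-m --zone=ZONE \
    --command="sudo systemctl status google-dataproc-agent --no-pager"

# Optionally collect a diagnostic bundle for support/inspection:
gcloud dataproc clusters diagnose gca-dev-cluster --region=REGION
```

If the `describe` call reports the cluster as unhealthy, or the agent service is dead on the master, that points at the master node itself rather than networking.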

Mich Talebzadeh

Oct 4, 2023, 3:41:41 PM10/4/23
to Google Cloud Dataproc Discussions
Hi
  1. What new VM machine types have you chosen, and what were you using before?
  2. Is the master node configured with the same memory as the worker nodes?

You have three factors to consider

  • High memory usage on the Dataproc master node.
  • An unhealthy Dataproc master node.
  • Network connectivity issues.

Potential things to check

  • Check the Dataproc master node's memory usage. You can do this using the Google Cloud Console or the gcloud CLI. If memory usage is too high, it is possible that you are still under-allocating memory because the new machines are under-specced.
  • Monitor your cluster's overall resource usage, again via the Google Cloud Console or the gcloud CLI.
  • Are you using autoscaling on Dataproc nodes?

Bottom line: look for any signs of high memory usage or other resource constraints on the master node.
HTH

Amina [Admin] Jackson

Oct 5, 2023, 11:38:12 AM10/5/23
to Google Cloud Dataproc Discussions
Hi Mich, 
Thanks! Nope, I am not autoscaling. I created a new cluster with similar configurations and got the same behavior, so it seems to have the same issue. Master and workers have the same machine type and configuration, with the workers on n2-highcpu-16. I have another cluster with exactly the same selections, VPC settings, zone, and machine configuration that was set up some time back and is working. I am just stuck on what the issue might be, unless "resources being exhausted" applies to all Dataproc clusters, not just a specific one.