Dataproc spark cluster - No available nodes reported, please check Resource Manager

karan alang

Aug 30, 2022, 1:26:43 AM8/30/22
to Google Cloud Dataproc Discussions
Hello All,

I have a GCP Dataproc Spark cluster, and I'm running a Spark Structured Streaming program which reads from Kafka, does some processing, and finally writes the data to multiple sinks (Kafka, Mongo).

The Dataproc cluster has 1 master and 3 worker nodes (n1-highmem-16).

I'm getting the error "No available nodes reported", shown below:

```

22/08/30 01:56:48 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:56:51 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:56:54 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:56:57 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:57:00 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:57:03 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:57:07 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30

```
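
For context, this warning is raised when YARN reports zero healthy NodeManagers back to the Spark allocator. To narrow down whether the workers are unhealthy or simply not registered, these are the checks I can run on the cluster nodes (a rough sketch; the systemd service names are the standard Hadoop/Dataproc ones, which I'm assuming apply to this image version):

```

# List all YARN nodes (including UNHEALTHY / LOST) with their state and health report
yarn node -list -all

# Show only the nodes YARN considers unhealthy
yarn node -list -states UNHEALTHY

# Verify the YARN daemons are actually running
sudo systemctl status hadoop-yarn-resourcemanager   # on the master node
sudo systemctl status hadoop-yarn-nodemanager       # on each worker node

# ResourceManager / NodeManager logs live under this directory on Dataproc nodes
ls /var/log/hadoop-yarn/

```

The health report column usually says why a node was marked unhealthy (in my experience, often local disk utilization crossing YARN's threshold).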

Here is the command used to create the Dataproc cluster:

```

gcloud compute routers nats create dataproc-nat-spark-kafka \
  --nat-all-subnet-ip-ranges \
  --router=dataproc-router \
  --auto-allocate-nat-external-ips \
  --region=us-east1   # in versa-sml-googl

gcloud beta dataproc clusters create $CNAME \
  --enable-component-gateway \
  --bucket $BUCKET \
  --region $REGION \
  --zone $ZONE \
  --no-address \
  --master-machine-type $TYPE \
  --master-boot-disk-size 500 \
  --master-boot-disk-type pd-ssd \
  --num-workers $NUM_WORKER \
  --worker-machine-type $TYPE \
  --worker-boot-disk-type pd-ssd \
  --worker-boot-disk-size 1000 \
  --image-version $IMG_VERSION \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --project $PROJECT \
  --initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh','gs://dataproc-spark-configs/prometheus.sh' \
  --metadata 'gcs-connector-version=2.0.0' \
  --metadata 'bigquery-connector-version=1.2.0' \
  --properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=true,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs'

```
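
Since the cluster is created with --no-address and relies on the Cloud NAT above for outbound access, one thing worth confirming is that all three workers actually came up and registered. A quick sanity check (sketch only, reusing the same variables as above):

```

# Cluster details, including worker instance names and status
gcloud dataproc clusters describe $CNAME --region $REGION

# Confirm the underlying worker VMs are RUNNING
gcloud compute instances list --filter="name ~ ^$CNAME" --project $PROJECT

```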

Command to launch the Structured Streaming job:

```

gcloud dataproc jobs submit pyspark main.py \
  --cluster $CLUSTER \
  --properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.sql.autoBroadcastJoinThreshold=150m#spark.ui.prometheus.enabled=true#spark.kubernetes.driver.annotation.prometheus.io/scrape=true#spark.kubernetes.driver.annotation.prometheus.io/path=/metrics/executors/prometheus/#spark.kubernetes.driver.annotation.prometheus.io/port=4040#spark.app.name=structuredstreaming-versa \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.3.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar,gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar,gs://dataproc-spark-jars/bson-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar \
  --files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani-noacl.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/reloadpred-chkpoint-user.p12,gs://kafka-certs/reloadpred-user.p12,gs://dataproc-spark-configs/metrics.properties,gs://dataproc-spark-configs/params.cfg,gs://kafka-certs/appstat-anomaly-user.p12,gs://kafka-certs/appstat-agg-user.p12,gs://kafka-certs/alarmblock-user.p12 \
  --region us-east1 \
  --py-files streams.zip,utils.zip

```

Status of the disk space on the master node:

```

~$ df -h 

Filesystem      Size  Used Avail Use% Mounted on
udev             52G     0   52G   0% /dev
tmpfs            11G  1.2M   11G   1% /run
/dev/sda1       485G   14G  472G   3% /
tmpfs            52G     0   52G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            52G     0   52G   0% /sys/fs/cgroup
/dev/loop0      303M  303M     0 100% /snap/google-cloud-cli/56
/dev/loop1       47M   47M     0 100% /snap/snapd/16292
/dev/loop2       56M   56M     0 100% /snap/core18/2538
/dev/sda15      105M  4.4M  100M   5% /boot/efi
tmpfs            11G     0   11G   0% /run/user/113
tmpfs            11G     0   11G   0% /run/user/114
tmpfs            11G     0   11G   0% /run/user/116
tmpfs            11G     0   11G   0% /run/user/112
tmpfs            11G     0   11G   0% /run/user/117
/dev/loop3      304M  304M     0 100% /snap/google-cloud-cli/62
tmpfs            11G     0   11G   0% /run/user/1008

```

/dev/loop0, /dev/loop1, etc. are showing 100% use. What kind of data is stored on these?
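
As far as I understand, those /dev/loopN devices are just the read-only squashfs images that snap packages (google-cloud-cli, snapd, core18) are mounted from, so 100% use there is expected and shouldn't indicate real disk pressure. Something like the following can confirm what they map to:

```

# Show what backs each loop device and where it is mounted
losetup -a
findmnt -S /dev/loop0

# The snap packages behind these read-only squashfs mounts
snap list

```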

I'm trying to understand what is causing the issue and how to fix it. Any ideas on this?

TIA!

Please note: in terms of volume, it is ~4.5M messages every 10 minutes (syslog messages are received every 10 minutes), and it takes ~2.5-3 minutes to process the data.


Here is the Stack Overflow question, with additional details:

https://stackoverflow.com/questions/73537370/dataproc-spark-cluster-no-available-nodes-reported-please-check-resource-mana
