I have a GCP Dataproc Spark cluster, and I'm running a Spark (Structured Streaming) program which reads from Kafka, does processing, and finally writes data to multiple sinks (Kafka, Mongo).
The Dataproc cluster has 1 master and 3 worker nodes (n1-highmem-16).
I'm getting a "No available nodes reported" error, shown below:
```
22/08/30 01:56:48 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:56:51 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:56:54 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:56:57 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:57:00 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:57:03 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
22/08/30 01:57:07 WARN org.apache.spark.deploy.yarn.YarnAllocatorNodeHealthTracker: No available nodes reported, please check Resource Manager.
```
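Since the warning points at the Resource Manager, a first diagnostic step (a sketch, assuming SSH access to the cluster nodes; these are standard YARN/Dataproc commands, not something I've already run) is to check which NodeManagers YARN actually sees and whether the relevant services are up:

```shell
# On the master node: list all nodes YARN knows about, including unhealthy ones
yarn node -list -all

# On the master node: check the ResourceManager service
sudo systemctl status hadoop-yarn-resourcemanager

# On each worker node: check the NodeManager service
sudo systemctl status hadoop-yarn-nodemanager
```

If `yarn node -list -all` shows zero RUNNING nodes, the NodeManager logs on the workers (`/var/log/hadoop-yarn/`) are the next place to look.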
Here is the command used to create the Dataproc cluster:
```
gcloud compute routers nats create dataproc-nat-spark-kafka \
    --nat-all-subnet-ip-ranges \
    --router=dataproc-router \
    --auto-allocate-nat-external-ips \
    --region=us-east1   # in versa-sml-googl

gcloud beta dataproc clusters create $CNAME \
    --enable-component-gateway \
    --bucket $BUCKET \
    --region $REGION \
    --zone $ZONE \
    --no-address \
    --master-machine-type $TYPE \
    --master-boot-disk-size 500 \
    --master-boot-disk-type pd-ssd \
    --num-workers $NUM_WORKER \
    --worker-machine-type $TYPE \
    --worker-boot-disk-type pd-ssd \
    --worker-boot-disk-size 1000 \
    --image-version $IMG_VERSION \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project $PROJECT \
    --initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh','gs://dataproc-spark-configs/prometheus.sh' \
    --metadata 'gcs-connector-version=2.0.0' \
    --metadata 'bigquery-connector-version=1.2.0' \
    --properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=true,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs'
```
Command used to launch the Structured Streaming job:
```
gcloud dataproc jobs submit pyspark main.py \
--cluster $CLUSTER \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.sql.autoBroadcastJoinThreshold=150m#spark.ui.prometheus.enabled=true#spark.kubernetes.driver.annotation.prometheus.io/scrape=true#spark.kubernetes.driver.annotation.prometheus.io/path=/metrics/executors/prometheus/#spark.kubernetes.driver.annotation.prometheus.io/port=4040#spark.app.name=structuredstreaming-versa \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.3.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar,gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar,gs://dataproc-spark-jars/bson-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar \
--files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani-noacl.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/reloadpred-chkpoint-user.p12,gs://kafka-certs/reloadpred-user.p12,gs://dataproc-spark-configs/metrics.properties,gs://dataproc-spark-configs/params.cfg,gs://kafka-certs/appstat-anomaly-user.p12,gs://kafka-certs/appstat-agg-user.p12,gs://kafka-certs/alarmblock-user.p12 \
--region us-east1 \
--py-files streams.zip,utils.zip
```
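For context on the `--properties` value above: the leading `^#^` switches gcloud's list delimiter from the default comma to `#`, which is needed here because the `spark.jars.packages` value itself contains commas. As a hypothetical illustration (the `build_properties` helper below is not part of gcloud; it just shows the encoding):

```python
def build_properties(props: dict, delim: str = "#") -> str:
    """Build a gcloud --properties value using an alternate delimiter.

    gcloud interprets a leading ^<delim>^ as "split the remainder of the
    string on <delim> instead of on commas".
    """
    return f"^{delim}^" + delim.join(f"{k}={v}" for k, v in props.items())

props = {
    "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,"
                           "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2",
    "spark.dynamicAllocation.enabled": "true",
}
print(build_properties(props))
```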
Disk space status on the master node:
```
~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 52G 0 52G 0% /dev
tmpfs 11G 1.2M 11G 1% /run
/dev/sda1 485G 14G 472G 3% /
tmpfs 52G 0 52G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 52G 0 52G 0% /sys/fs/cgroup
/dev/loop0 303M 303M 0 100% /snap/google-cloud-cli/56
/dev/loop1 47M 47M 0 100% /snap/snapd/16292
/dev/loop2 56M 56M 0 100% /snap/core18/2538
/dev/sda15 105M 4.4M 100M 5% /boot/efi
tmpfs 11G 0 11G 0% /run/user/113
tmpfs 11G 0 11G 0% /run/user/114
tmpfs 11G 0 11G 0% /run/user/116
tmpfs 11G 0 11G 0% /run/user/112
tmpfs 11G 0 11G 0% /run/user/117
/dev/loop3 304M 304M 0 100% /snap/google-cloud-cli/62
tmpfs 11G 0 11G 0% /run/user/1008
```
/dev/loop0, /dev/loop1, etc. are showing 100% use; what kind of data is stored in these?
I'm trying to understand what is causing the issue and how to fix it. Any ideas?
TIA!
Please note: in terms of volume, ~4.5M messages are processed every 10 minutes (syslog messages are received every 10 minutes), and processing takes ~2.5-3 minutes.
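As a back-of-envelope check of the numbers above (assuming the ~4.5M messages are processed within the stated ~2.5-3 minute window):

```python
messages = 4_500_000              # messages per 10-minute batch
processing_seconds = (150, 180)   # ~2.5-3 minutes of active processing

for secs in processing_seconds:
    rate = messages / secs
    print(f"{secs}s window -> ~{rate:,.0f} msgs/sec")
# i.e. roughly 25,000-30,000 msgs/sec while a batch is active, with the
# cluster mostly idle for the remaining ~7 minutes of each 10-minute interval.
```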
Here is the Stack Overflow post with additional details: