Question 1:
I have a Structured Streaming job running on Dataproc, and a Prometheus installation on GKE (Kubernetes). Prometheus scrapes metrics from Kafka, kube-state-metrics, etc., which are installed on the same GKE cluster but in different namespaces.
Is there a way to monitor Spark jobs running on Dataproc using Prometheus on GKE?
Should the Dataproc nodes and the GKE pods be on the same VPC?
Any pointers on this are appreciated!
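For context, here is the kind of scrape job I imagine adding under `scrape_configs` in the GKE Prometheus config (ConfigMap or Helm values), federating the `spark_*` series from a Prometheus running on the Dataproc master (see question 2 below). This is only a sketch: the job name, the master's internal IP, port 9090, and network reachability from the GKE pods are all assumptions.
```
scrape_configs:
  - job_name: 'dataproc-spark-federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"spark_.*"}'
    static_configs:
      # Placeholder: internal IP of the Dataproc master and the default
      # Prometheus port; requires VPC-level reachability plus a firewall rule.
      - targets: ['10.128.0.10:9090']
```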
Question 2:
Here is a link that describes the steps to monitor Spark jobs on Dataproc using Prometheus installed on the same Dataproc cluster:
https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/prometheus/prometheus.sh
I was able to set this up on Dataproc, but I see only the following metrics:
```
spark_app_name
spark_application_<app_id>_applicationMaster_numContainersPendingAllocate
spark_application_<app_id>_applicationMaster_numExecutorsFailed
spark_application_<app_id>_applicationMaster_numExecutorsRunning
spark_application_<app_id>_applicationMaster_numLocalityAwareTasks
spark_application_<app_id>_applicationMaster_numReleasedContainers
```
Note: this uses StatsD, not PrometheusServlet, which can also be used to monitor Apache Spark and seems to expose many more master/driver/executor metrics.
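For reference, my understanding is that PrometheusServlet (Spark 3.0+) is enabled through `metrics.properties` rather than StatsD. Here is a minimal sketch of what that could look like inside an init action such as prometheus.sh; the sink names and paths follow Spark's metrics.properties.template, and the conf location is an assumption about the Dataproc image:
```
# Assumption: /etc/spark/conf is the Spark conf dir on the Dataproc image.
cat >> /etc/spark/conf/metrics.properties <<'EOF'
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
EOF
```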
I've made the following changes to prometheus.sh and re-created the Dataproc cluster; both changes are in the modified prometheus.sh (attached) included in the initialization action.
Here is the script to create the Dataproc cluster:
```
gcloud beta dataproc clusters create $CNAME \
    --enable-component-gateway \
    --bucket $BUCKET \
    --region $REGION \
    --zone $ZONE \
    --no-address \
    --master-machine-type $TYPE \
    --master-boot-disk-size 500 \
    --master-boot-disk-type pd-ssd \
    --num-workers $NUM_WORKER \
    --worker-machine-type $TYPE \
    --worker-boot-disk-type pd-ssd \
    --worker-boot-disk-size 1000 \
    --image-version $IMG_VERSION \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project $PROJECT \
    --initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh','gs://dataproc-spark-configs/prometheus.sh' \
    --metadata 'gcs-connector-version=2.0.0' \
    --metadata 'bigquery-connector-version=1.2.0' \
    --properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=true,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs'
```
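Since the cluster is created with `--no-address`, the GKE Prometheus would have to reach the master over its internal IP, so I assume a firewall rule along these lines is also needed (a sketch only: the rule name, network, GKE pod CIDR, and network tag are all placeholders, and it assumes the cluster is created with a matching `--tags` value):
```
# All values below are placeholders; adjust the network, the GKE pod address
# range, and the tag. Assumes the Dataproc cluster was created with
# "--tags dataproc-prometheus" so the rule matches its VMs.
gcloud compute firewall-rules create allow-gke-pods-to-dataproc-prometheus \
    --project $PROJECT \
    --network default \
    --allow tcp:9090 \
    --source-ranges 10.8.0.0/14 \
    --target-tags dataproc-prometheus
```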
Command to run the Apache Spark streaming job:
```
gcloud dataproc jobs submit pyspark main.py \
    --cluster $CLUSTER \
    --properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.sql.autoBroadcastJoinThreshold=150m#spark.ui.prometheus.enabled=true#spark.kubernetes.driver.annotation.prometheus.io/scrape=true#spark.kubernetes.driver.annotation.prometheus.io/path=/metrics/executors/prometheus/#spark.kubernetes.driver.annotation.prometheus.io/port=4040#spark.app.name=structuredstreaming-versa \
    --region us-east1
```
The following are set in the SparkConf (as part of the command above) when running the Spark Structured Streaming job on Dataproc:
```
spark.ui.prometheus.enabled=true
spark.kubernetes.driver.annotation.prometheus.io/scrape=true
spark.kubernetes.driver.annotation.prometheus.io/path=/metrics/executors/prometheus/
spark.kubernetes.driver.annotation.prometheus.io/port=4040
```
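As a sanity check, this is the kind of probe I would expect to work once the job is running, since `spark.ui.prometheus.enabled=true` exposes executor metrics on the driver UI (the driver host is a placeholder, and 4040 is only the default UI port; Spark increments it if that port is already taken):
```
# Hypothetical check, run from a host that can reach the driver
# (e.g., the Dataproc master node):
curl -s http://<driver-host>:4040/metrics/executors/prometheus/
```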
Any inputs on this?
What needs to be done to debug/enable these metrics?
tia!
P.S.: attached is the modified prometheus.sh used to create the Dataproc cluster.