Hi guys,
I'm facing a weird problem.
We are running hourly Spark jobs on our Dataproc cluster. At first, everything runs fine, but each time, after a few days (or job iterations), all the batch jobs start failing with the same error: "Task was not acquired".
Each failed job runs for about 10 minutes, and when I try to access its log in the UI, it loads indefinitely without printing anything.
Also, if I check the YARN UI, I can see that no Spark application has been submitted at all.
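For reference, the same check can be done from the master node with something like this (the state filter is just to be thorough):

    yarn application -list -appStates ALL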
There's no error in the last task that succeeded, so it's difficult for us to reproduce the problem on purpose.
The only fix I have found so far is to delete the cluster and recreate it, which is not a good option...
Please see the attached screenshots of the problem.
Apart from that, we use the Dataproc cluster to submit Druid tasks on YARN, and there is no problem with those. So the Dataproc cluster itself seems healthy.
Here's the configuration of our cluster (the full create command is sketched below the list):
- --master-machine-type n1-standard-2
- --master-boot-disk-size 50
- --num-workers 2
- --worker-machine-type n1-standard-2
- --worker-boot-disk-size 150
- --num-preemptible-workers 50
- --preemptible-worker-boot-disk-size 15GB
- --image-version 1.2
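Put together, the cluster is created with a command along these lines (the cluster name and region here are placeholders, not our real ones):

    gcloud dataproc clusters create my-cluster \
        --region europe-west1 \
        --master-machine-type n1-standard-2 \
        --master-boot-disk-size 50 \
        --num-workers 2 \
        --worker-machine-type n1-standard-2 \
        --worker-boot-disk-size 150 \
        --num-preemptible-workers 50 \
        --preemptible-worker-boot-disk-size 15GB \
        --image-version 1.2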
Thanks for your time!
Regards,
Jean.