[Spark-Job] All Jobs fail with "Task was not acquired"


Jean Mouvilliat

Nov 10, 2017, 4:26:20 AM
to Google Cloud Dataproc Discussions
Hi guys,

I'm facing a weird problem. 

We are running some hourly Spark jobs on our Dataproc cluster. At the beginning everything runs well, but each time, after a few days (or job iterations), all the batch jobs start failing with the same error: "Task was not acquired".
Each failed job has a duration of about 10 minutes, and when I try to access its log in the UI, it loads indefinitely without printing anything.
Also, if I check the YARN UI, I can see that no Spark application has been submitted.
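For reference, the same checks can also be done from the command line rather than the UI; in the sketch below the job ID and cluster name are placeholders:

  # Inspect a failed job's status and driver output location (JOB_ID is a placeholder)
  gcloud dataproc jobs describe JOB_ID

  # List the YARN applications seen by the ResourceManager on the master node
  gcloud compute ssh my-cluster-m --command="yarn application -list -appStates ALL"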

There's no error in the last succeeded task, so it's difficult for us to reproduce the problem on purpose.

The only way I have found so far to resolve the problem is to delete the cluster and recreate it, which is not a good option...

Please see attached some screenshots of the problem. 

Apart from that, we use the Dataproc cluster for submitting Druid tasks on YARN, and there is no problem with those, so the Dataproc cluster seems healthy.
Here's the configuration of our cluster (the full create command is sketched after the list):
  • --master-machine-type n1-standard-2 
  • --master-boot-disk-size 50 --num-workers 2 
  • --worker-machine-type n1-standard-2 
  • --worker-boot-disk-size 150 
  • --num-preemptible-workers 50  
  • --preemptible-worker-boot-disk-size 15GB 
  • --image-version 1.2
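Roughly, the create command looks like this (the cluster name is a placeholder; the flags are the ones listed above):

  gcloud dataproc clusters create my-cluster \
      --master-machine-type n1-standard-2 \
      --master-boot-disk-size 50 \
      --num-workers 2 \
      --worker-machine-type n1-standard-2 \
      --worker-boot-disk-size 150 \
      --num-preemptible-workers 50 \
      --preemptible-worker-boot-disk-size 15GB \
      --image-version 1.2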
Thanks for your time !

Regards, 
Jean.
Attachments: noAcquired.png, sparkJobs.png

Dan Sedov

Nov 10, 2017, 12:55:34 PM
to Google Cloud Dataproc Discussions
Hi Jean,

Tracing through the flow of that job, this appears to be an instance of a recently discovered bug where communication between the Dataproc service and the agent is interrupted. We're in the process of rolling out a fix.

Please delete and recreate your cluster with the latest image any time on or after November 11th.

In the meantime, it may help to restart the master node (gcloud compute instances reset <cluster-name>-m).
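Roughly, the two workarounds look like this (cluster name and zone are placeholders; reuse your own create flags for the recreate step):

  # Short-term workaround: restart the master VM
  gcloud compute instances reset <cluster-name>-m --zone=<zone>

  # On or after November 11th: delete and recreate the cluster
  # (recreating with --image-version 1.2 should pick up the latest 1.2 image)
  gcloud dataproc clusters delete <cluster-name>
  gcloud dataproc clusters create <cluster-name> --image-version 1.2 <your usual flags>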

Please let us know if this does not help.

Jean Mouvilliat

Nov 13, 2017, 4:07:28 AM
to Google Cloud Dataproc Discussions
Hi Dan,

Thanks for the answer. I will do that then while waiting for a fix.

Have a good day,
Jean.