We have a Dataproc cluster with a master node and two worker nodes (custom machine type: 4 vCPUs, 10 GB memory).
I submitted 8 Spark jobs simultaneously to test how it performs under load.
The Spark configuration for these jobs is:
spark.executor.memory=1024m
spark.executor.cores=40
spark.executor.instances=2
spark.default.parallelism=80
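For context, each of the test jobs is essentially equivalent to the following minimal PySpark sketch (the app name and the toy workload are placeholders, not our real job; in practice the properties above are passed at submission time rather than hard-coded):

```python
from pyspark.sql import SparkSession

# Stand-in for one of the 8 test jobs; the settings mirror the configuration
# listed above (normally supplied at submit time, hard-coded here only for
# illustration).
spark = (
    SparkSession.builder
    .appName("load-test-job")                     # placeholder name
    .config("spark.executor.memory", "1024m")
    .config("spark.executor.cores", "40")
    .config("spark.executor.instances", "2")
    .config("spark.default.parallelism", "80")
    .getOrCreate()
)

# Toy workload that just keeps the executors busy for a while.
total = (
    spark.sparkContext
    .parallelize(range(10000000))
    .map(lambda x: x * x)
    .sum()
)
print(total)

spark.stop()
```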
This cluster configuration allows 4 jobs to run in parallel. Each job uses 3 YARN containers:
Cont 1 - driver: 1 GB memory
Cont 2 - executor #1: 1.5 GB memory
Cont 3 - executor #2: 1.5 GB memory
so each job takes 4 GB of memory in total. We expected all 8 jobs to be scheduled and to start running as soon as containers became available, with some jobs simply taking longer to finish.
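The 1.5 GB and 4 GB figures come from the usual Spark-on-YARN sizing, sketched below (my back-of-the-envelope math; the 512 MB YARN minimum allocation and the ~8 GB of NodeManager memory per worker are assumptions about the Dataproc defaults, not measured values):

```python
# Spark on YARN requests executor memory plus an overhead of
# max(384 MB, 10% of spark.executor.memory), and YARN rounds each request up
# to a multiple of yarn.scheduler.minimum-allocation-mb.

def round_up(mb, step):
    return ((mb + step - 1) // step) * step

EXECUTOR_MEMORY_MB = 1024          # spark.executor.memory from the config above
MIN_ALLOCATION_MB = 512            # assumed yarn.scheduler.minimum-allocation-mb
OVERHEAD_MB = max(384, EXECUTOR_MEMORY_MB // 10)

executor_container_mb = round_up(EXECUTOR_MEMORY_MB + OVERHEAD_MB, MIN_ALLOCATION_MB)
driver_container_mb = 1024         # as observed in YARN (container 1 above)
per_job_mb = driver_container_mb + 2 * executor_container_mb

NODEMANAGER_MB = 8192              # assumed YARN memory per 10 GB worker
cluster_mb = 2 * NODEMANAGER_MB

print(executor_container_mb)       # 1536 -> the 1.5 GB executor containers
print(per_job_mb)                  # 4096 -> ~4 GB per job
print(cluster_mb // per_job_mb)    # 4    -> why only 4 jobs fit at once
```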
In reality, only one job finished successfully. The other jobs failed because a worker node was restarted:
=========== Cloud Dataproc Agent Error ===========
com.google.cloud.hadoop.services.agent.AgentException: Node was restarted while executing a job. This could be user-initiated or caused by Compute Engine maintenance event. (TASK_FAILED)
at com.google.cloud.hadoop.services.agent.AgentException$Builder.build(AgentException.java:83)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.lambda$kill$0(AbstractJobHandler.java:211)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:205)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:194)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:110)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1029)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:675)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:105)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture.create(AbstractTransformFuture.java:39)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.Futures.transformAsync(Futures.java:459)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.kill(AbstractJobHandler.java:202)
at com.google.cloud.hadoop.services.agent.job.JobManagerImpl.recoverAndKill(JobManagerImpl.java:153)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver$NormalWorkReceiver.receivedJob(MasterRequestReceiver.java:141)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.pollForJobsAndTasks(MasterRequestReceiver.java:105)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.pollForWork(MasterRequestReceiver.java:77)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.lambda$doStart$0(MasterRequestReceiver.java:67)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:630)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
======== End of Cloud Dataproc Agent Error ========

I found two posts on the web describing the same problem. One suggests that
- there was "a bug in image version 1.1.34. Downgrade to image 1.1.29 and that fixes the issue."
and the other that
- Since Compute Engine VMs don't configure a swap partition, when you run out of RAM all daemons will crash and restart.
These posts are more than a year old, so I assume the bug has already been fixed. Our cluster was created on Sep 11, 2018, and its image is:
- imageVersion: 1.2.47-deb8
Any ideas?
Thanks,
Victor