Opening the Spark UI prevents a PySpark job from finishing


bob.des...@vente-exclusive.com

Oct 12, 2016, 8:43:11 AM
to Google Cloud Dataproc Discussions
Hi,

I have the impression that whenever I use an SSH tunnel through Google Chrome to connect to the Spark UI on my Dataproc cluster to track job progress (as described here), the Dataproc job doesn't finish, even though all tasks are done according to the Spark UI. When I run the same job without accessing the Spark UI, it finishes just fine...
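For context, the tunnel setup described in that guide looks roughly like the sketch below. `CLUSTER-m`, `ZONE`, and SOCKS port 1080 are placeholders, and the Chrome binary name varies by OS:

```shell
# Open a SOCKS proxy over SSH to the cluster's master node
# (CLUSTER-m and ZONE are placeholders for the real names).
gcloud compute ssh CLUSTER-m --zone=ZONE -- -D 1080 -N

# In a second terminal, launch Chrome with its traffic routed
# through the proxy, pointed at the YARN ResourceManager UI:
google-chrome \
  --proxy-server="socks5://localhost:1080" \
  --user-data-dir=/tmp/CLUSTER-m \
  http://CLUSTER-m:8088
```

From the YARN ResourceManager page you can click through to a running application's Spark UI.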

Has anyone experienced anything similar before?

Thanks,
Bob

Dennis Huo

Oct 12, 2016, 2:21:34 PM
to Google Cloud Dataproc Discussions
That certainly isn't expected behavior; we routinely use the UI during day-to-day work without it interfering with running jobs.

If it only happened a couple of times I'd say it's likely a coincidence, but theoretically there could be failure modes if the UI is overloaded by heavy access and somehow OOMs or otherwise taxes the service daemons enough to cause a job failure.

What kind of machine types were you using for your master and worker nodes when it seemed to happen? Does it happen consistently?

AndreiK

Oct 12, 2016, 2:33:49 PM
to Google Cloud Dataproc Discussions
I see this behavior all the time and have to wait for the job to finish before looking at the UI. I've seen it on n1-standard-16; I haven't tried other machine types.

Dennis Huo

Oct 12, 2016, 5:22:27 PM
to Google Cloud Dataproc Discussions
Keep in mind that it's normal for the Dataproc job to take some additional time past the Spark UI's completion to finish: first for the driver program to exit after the YARN application is done, and then for the status to propagate out to Dataproc's side and for backend bookkeeping to complete. This is typically on the order of 5-20 seconds.

Next time this happens, please take screenshots, if possible, of the Spark UI page showing the completion time, the YARN page showing YARN's application completion time, and your console.cloud.google.com page showing the Dataproc job still not completed, along with its job id.

If you don't want to share those more broadly on this list, you can send them privately to dataproc...@google.com for Dataproc engineers to take a look.

Also, you should check "gcloud dataproc jobs describe <your jobid>" to see if that reported completion, in case it's just a browser refresh issue where the Dataproc UI just didn't update with the job completion.
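That check can be scripted; `JOB_ID` and `REGION` below are placeholders for your actual job id and cluster region:

```shell
# Print just the job's state (e.g. PENDING, RUNNING, DONE, ERROR).
gcloud dataproc jobs describe JOB_ID \
  --region=REGION \
  --format='value(status.state)'
```

If this prints DONE while the Cloud Console still shows the job running, it's a stale-UI issue rather than a genuinely stuck job.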

bob.des...@vente-exclusive.com

Oct 13, 2016, 2:39:53 AM
to Google Cloud Dataproc Discussions
Hi,

Thanks for the replies. It's definitely not a matter of 5-20 seconds; the job just stays in the running state. If I run the same job without looking at the Spark UI, it finishes just fine. I'm sure it's not a browser refresh issue in the Dataproc UI, because I have also run the job using the Python API, and the REST API's response likewise shows it stuck in the running state.

Anyway, for the moment I just don't look at the spark UI so my jobs go through, but it is annoying to say the least.

A little more info on the cluster(s): I've had the problem with both the n1-standard-4 and n1-standard-8 machine types. Worker and master machines were always the same type. I use the Dataproc initialization actions for Jupyter (and Conda) to install some extra packages on the clusters at creation time.

Dennis Huo

Oct 13, 2016, 2:50:38 AM
to Google Cloud Dataproc Discussions
Maybe Andrei can also confirm whether any initialization actions were used and whether it seems specific to pyspark.

In your case, does it seem to only be a pyspark issue, or are other spark job types also affected?

bob.des...@vente-exclusive.com

Oct 13, 2016, 2:51:56 AM
to Google Cloud Dataproc Discussions
I can't answer that, since I'm only using PySpark at the moment and don't have experience with Scala- or Java-based Spark apps. (I'm still quite new to Spark.)

Constantijn Visinescu

Oct 13, 2016, 10:38:39 AM
to Google Cloud Dataproc Discussions
Hi, just adding my $0.02 in case it helps.

We've been having the same issue since Spark 2.0.0, with Scala jobs.

I've started adding System.exit(0) to the end of all my jobs (after all useful processing is finished, of course), and that seems to have fixed the issue. It's a bit ugly, but it works.
Note that this seems to be a Spark issue rather than a Dataproc issue; it also happens when running the jobs locally during development.
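The equivalent workaround in PySpark would look something like the sketch below (job and app names are illustrative, and this assumes a cluster to run against). Stopping the SparkSession first lets Spark shut down cleanly before the hard exit:

```python
import sys

from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("example-job").getOrCreate()
    # ... all useful processing happens here ...
    spark.stop()   # shut the SparkContext down cleanly first
    sys.exit(0)    # then force the driver process to exit
```

if __name__ == "__main__":
    main()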
