Spark with Kubernetes connecting to pod id, not address


Pat Ferrel

Feb 13, 2019, 3:55:47 AM
to Kubernetes developer/contributor discussion
We have a k8s deployment of several services including Apache Spark. All services seem to be operational. Our application connects to the Spark master to submit a job using the cluster's k8s DNS; the master's service is called spark-api, so we use master=spark://spark-api:7077 and spark.submit.deployMode=cluster. We submit the job through the API, not the spark-submit script.
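For anyone who wants the concrete detail, here is a sketch of the kind of programmatic submission we mean. SparkLauncher here stands in for our actual launch path, and the jar path, class name, and memory setting are placeholders:

    import org.apache.spark.launcher.SparkLauncher

    // Sketch only: submit against the standalone master reachable through the
    // k8s service DNS name, asking for the driver to run on the cluster.
    val handle = new SparkLauncher()
      .setMaster("spark://spark-api:7077")             // k8s DNS name of the Spark master service
      .setDeployMode("cluster")                        // same effect as spark.submit.deployMode=cluster
      .setAppResource("/opt/harness/harness-jobs.jar") // placeholder jar
      .setMainClass("com.example.TrainingJob")         // placeholder main class
      .setConf("spark.executor.memory", "1g")          // placeholder resource setting
      .startApplication()                              // returns a SparkAppHandle we can poll for job state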

This runs the "driver" and all "executors" on the cluster, and that part seems to work, but some Spark process makes a callback to the launching code in our app. For some reason it is trying to connect to harness-64d97d6d6-4r4d8, which is the pod ID, not the k8s cluster IP or DNS name.

How could this pod ID be getting into the system? Spark somehow seems to think it is the address of the service that called it. Needless to say, any connection to the k8s pod ID fails, and so does the job.

Any idea how Spark could think the pod ID is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same job executed with the above config tries to connect to the spurious pod ID.

Pat Ferrel

Feb 13, 2019, 8:42:22 PM
to kuberne...@googlegroups.com



From: Pat Ferrel <p...@occamsmachete.com>
Date: February 13, 2019 at 5:22:12 PM
To: Erik Erlandson <eerl...@redhat.com>, Pat Ferrel <p...@actionml.com>, kuberne...@googlegroups.com <kuberne...@googlegroups.com>, us...@spark.apache.org <us...@spark.apache.org>
Subject:  Re: Spark with Kubernetes connecting to pod ID, not address

Hmm, I’m not asking about using k8s to control Spark as a job manager or scheduler like YARN. We use the built-in standalone Spark job manager and spark://spark-api:7077 as the master, not k8s.

We use k8s to manage a cluster consisting of our app, some databases, and Spark (one master, one driver, several executors). The problem is that some kind of callback from Spark is trying to use the pod ID and failing to connect because of it. We have tried deployMode “client” and “cluster” but get the same error.

The full trace is below but the important bit is:

    Failed to connect to harness-64d97d6d6-6n7nh:46337

This came from deployMode = “client”, and the port is the driver port, which should be on the launching pod. For some reason it is using a pod ID instead of a real address. Doesn’t the driver run in the launching app’s process? The launching app is on the pod ID harness-64d97d6d6-6n7nh, but it has the k8s DNS address of harness-api. I can see the correct address for the launching pod with "kubectl get services".
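Our working theory, still unconfirmed, is that the driver advertises whatever spark.driver.host is set to, and that it defaults to the local hostname, which inside a pod is the bare pod name. Here is a sketch of the override we have been experimenting with, using the harness-api service name mentioned above; the port numbers are arbitrary examples:

    import org.apache.spark.SparkConf

    // Sketch: advertise a name the executors can actually resolve while still
    // binding locally inside the pod.
    val conf = new SparkConf()
      .setMaster("spark://spark-api:7077")
      .set("spark.driver.host", "harness-api")     // resolvable k8s service name instead of the pod name
      .set("spark.driver.bindAddress", "0.0.0.0")  // address the driver binds to inside the pod
      .set("spark.driver.port", "46337")           // pin the driver port so a service can expose it
      .set("spark.blockManager.port", "45001")     // pin the block manager port for the same reason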


The error is:

Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "/spark/conf/:/spark/jars/*:/etc/hadoop/" "-Xmx1024M" "-Dspark.driver.port=46337" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@harness-64d97d6d6-6n7nh:46337" "--executor-id" "138" "--hostname" "10.31.31.174" "--cores" "8" "--app-id" "app-20190213210105-0000" "--worker-url" "spark://Wor...@10.31.31.174:37609"
========================================

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to harness-64d97d6d6-6n7nh:46337
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: harness-64d97d6d6-6n7nh
at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more



From: Erik Erlandson <eerl...@redhat.com>
Date: February 13, 2019 at 4:57:30 AM
To: Pat Ferrel <p...@actionml.com>
Subject:  Re: Spark with Kubernetes connecting to pod id, not address

Hi Pat,

I'd suggest visiting the big data Slack channel; it's a more Spark-oriented forum than kube-dev.

Tentatively, I think you may want to submit in client mode (unless you are initiating your application from outside the kube cluster). When in client mode, you need to set up a headless service for the application driver pod that the executors can use to talk back to the driver.

Cheers,
Erik



Pat Ferrel

Feb 15, 2019, 2:19:11 PM
to Kubernetes developer/contributor discussion
A clue has emerged. We run harness and spark as stateless Deployments. If we make harness a StatefulSet, it gets a permanent k8s-managed DNS name, and we can make the hostname/pod ID match that DNS name. This allows Spark to resolve the DNS name of the pod that is running the Spark driver. So we now have spark.submit.deployMode=client working. I suppose cluster mode needs spark to be stateful too.
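A sketch of what the StatefulSet buys us, as we currently understand it; the service and namespace names below are illustrative. Each pod gets a stable name like harness-0 plus a per-pod DNS record under the StatefulSet's governing headless service, so the hostname the driver picks up becomes something the executors can resolve:

    import java.net.InetAddress

    // Sketch: build the fully qualified, resolvable name of this pod.
    // "harness" is the headless service governing the StatefulSet and "default"
    // is the namespace -- both illustrative.
    val podName    = InetAddress.getLocalHost.getHostName           // e.g. "harness-0"
    val driverHost = s"$podName.harness.default.svc.cluster.local"  // per-pod DNS record for StatefulSet pods

    // We then pass driverHost as spark.driver.host when building the SparkConf.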

Has anyone used Spark with k8s? We are using published helm charts and containers. We'd love to compare notes on what works best.

Erik Erlandson

Feb 15, 2019, 4:00:04 PM
to Pat Ferrel, Kubernetes developer/contributor discussion

Hi Pat,

I have not seen the behavior you are describing reported anywhere else.

Many people are running Spark on Kubernetes, and there are multiple community resources for deploying Spark and Spark applications on Kubernetes. If the Spark upstream documentation for its Kubernetes back-end is not working as described, there are some community options available:
1. If you believe there is a Spark or documentation bug, file a JIRA: https://issues.apache.org/jira/projects/SPARK/issues/
2. Visit the k8s big data slack channel: https://kubernetes.slack.com/messages/C0ELB338T/
3. If you are interested in deploying standalone spark clusters, this spark operator project may be useful: https://github.com/radanalyticsio/spark-operator

Cheers,
Erik






Pat Ferrel

Feb 15, 2019, 4:33:27 PM
to Erik Erlandson, Kubernetes developer/contributor discussion
Thanks Erik 

I’ll try these again, but the docs below are mostly for running k8s as a Spark scheduler/job manager. We do not use it like this; we use k8s to manage our cluster, which includes Spark but several other services as well. We use the Spark standalone job manager. To put it succinctly, we use master=spark://spark-api:7077, not YARN or k8s.

k8s still deploys Spark and the other service pods; it just has nothing to do with Spark internals. So this is a question of how k8s manages Spark as a generic service, not how it manages the internals of Spark.

In fact, the problem seems to be in the k8s DNS. How do we set it up so Spark executors can connect to the “cluster” mode Spark driver? In “cluster” mode the driver runs on one of the workers, like the executors do.

We found that in “client” mode, where the Spark driver runs in the process that launches the job, we could get successful connections if we made the driver process/pod/service a StatefulSet so that hostname == pod ID == k8s DNS name. This allowed a Spark executor to connect to the “client” mode driver. Now we are moving on to “cluster” mode, which we really need. We now suspect we need to do the same for all of Spark, i.e. give every Spark pod a stable k8s DNS name.
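One sanity check we have been using along the way: the failure in the trace earlier is literally an UnknownHostException, so before a run we verify, from inside a worker/executor pod, that the name the driver will advertise actually resolves. The host names below are just examples:

    import java.net.InetAddress
    import scala.util.Try

    // Quick check: can this pod resolve the address the driver will advertise?
    def resolvable(host: String): Boolean =
      Try(InetAddress.getByName(host)).isSuccess

    println(resolvable("harness-64d97d6d6-6n7nh"))                      // bare pod name: false, the UnknownHostException case
    println(resolvable("harness-0.harness.default.svc.cluster.local"))  // StatefulSet pod DNS: true, if the headless service exists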

Is anyone running in this mode? Using the Spark standalone master inside a k8s cluster?



From: Erik Erlandson <eerl...@redhat.com>
Reply: Erik Erlandson <eerl...@redhat.com>
Date: February 15, 2019 at 12:59:59 PM
To: Pat Ferrel <p...@actionml.com>
Cc: Kubernetes developer/contributor discussion <kuberne...@googlegroups.com>
Subject:  Re: Spark with Kubernetes connecting to pod id, not address
