EOF error with Spark cluster


bspr...@gmail.com

Apr 9, 2020, 11:15:35 AM
to actionml-user
We have a two-machine cluster: a Master and one Worker, both configured with 64 GB.
The Harness log shows a successful connection.

We were unable to get Harness and the Spark cluster to connect until we added these to our Engine Spark configuration and modified the compose .yml file with the same property values.
"spark.driver.host": "<some host>",
"spark.driver.port": "45678",

Both the Master and Worker Spark UIs show things executing, but the Worker stderr has the output below and stdout has nothing. The same events train successfully when Spark is configured for localhost.



Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "///conf:/spark/jars/*" "-Xmx61440M" "-Dspark.driver.port=45678" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@harness.vm:45678" "--executor-id" "0" "--hostname" "worker.spark.vm" "--cores" "4" "--app-id" "app-20200409134745-0003" "--worker-url" "spark://Wor...@worker.spark.vm:37145"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/04/09 13:48:56 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 1...@worker.spark.vm
20/04/09 13:48:56 INFO SignalUtils: Registered signal handler for TERM
20/04/09 13:48:56 INFO SignalUtils: Registered signal handler for HUP
20/04/09 13:48:56 INFO SignalUtils: Registered signal handler for INT
20/04/09 13:48:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/09 13:48:57 INFO SecurityManager: Changing view acls to: root
20/04/09 13:48:57 INFO SecurityManager: Changing modify acls to: root
20/04/09 13:48:57 INFO SecurityManager: Changing view acls groups to: 
20/04/09 13:48:57 INFO SecurityManager: Changing modify acls groups to: 
20/04/09 13:48:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/04/09 13:48:57 INFO TransportClientFactory: Successfully created connection to harness.vm/10.145.94.199:45678 after 83 ms (0 ms spent in bootstraps)
20/04/09 13:48:58 INFO SecurityManager: Changing view acls to: root
20/04/09 13:48:58 INFO SecurityManager: Changing modify acls to: root
20/04/09 13:48:58 INFO SecurityManager: Changing view acls groups to: 
20/04/09 13:48:58 INFO SecurityManager: Changing modify acls groups to: 
20/04/09 13:48:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/04/09 13:48:58 INFO TransportClientFactory: Successfully created connection to harness.vm/10.145.94.199:45678 after 2 ms (0 ms spent in bootstraps)
20/04/09 13:48:58 INFO DiskBlockManager: Created local directory at /tmp/spark-ecacf44e-92b8-4596-9917-1df157adef40/executor-c95689b4-5891-4852-9cc8-27e30385f4f1/blockmgr-22da189f-793d-466c-a836-37fb7b5bbafd
20/04/09 13:48:58 INFO MemoryStore: MemoryStore started with capacity 31.8 GB
20/04/09 13:48:58 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@harness.vm:45678
20/04/09 13:48:58 INFO WorkerWatcher: Connecting to worker spark://Wor...@worker.spark.vm:37145
20/04/09 13:48:58 INFO TransportClientFactory: Successfully created connection to worker.spark.vm/192.168.16.2:37145 after 4 ms (0 ms spent in bootstraps)
20/04/09 13:48:58 INFO WorkerWatcher: Successfully connected to spark://Wor...@worker.spark.vm:37145
20/04/09 13:48:58 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
20/04/09 13:48:58 INFO Executor: Starting executor ID 0 on host worker.spark.vm
20/04/09 13:48:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45913.
20/04/09 13:48:58 INFO NettyBlockTransferService: Server created on worker.spark.vm:45913
20/04/09 13:48:58 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/04/09 13:48:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, worker.spark.vm, 45913, None)
20/04/09 13:48:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, worker.spark.vm, 45913, None)
20/04/09 13:48:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, worker.spark.vm, 45913, None)
20/04/09 13:48:58 ERROR Inbox: Ignoring error
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:134)
at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:133)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:133)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:96)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Pat Ferrel

Apr 9, 2020, 1:11:55 PM
to bspr...@gmail.com, actionml-user
Don't put the driver port and address in the config. This is best left for Spark to detect.

In general, don't add Spark config unless you are sure you understand all of the system ramifications. We have not seen a reason to set the driver port, and only obscure Kubernetes reasons for setting the address.

Spark puts the driver on a random port that it chooses, so setting it comes with side effects. Also, we are moving the driver into the Spark cluster so it runs on a Worker, not inside Harness. Mucking with the driver address will mess up this future change. Anyway, minimal change to Spark config is usually best.
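
As a rough illustration of the "minimal config" advice, here is a small Python sketch that flags and strips the two driver overrides from a Spark config mapping before it is used. The helper names and the sample keys and values (including the `master` entry) are assumptions made for this example, not part of Harness or Spark.

```python
from typing import Dict, List

# The two settings discussed in this thread; Spark picks both on its own
# when they are left unset.
DRIVER_OVERRIDES = ("spark.driver.host", "spark.driver.port")


def find_driver_overrides(spark_conf: Dict[str, str]) -> List[str]:
    """Return the driver override keys present in the config, in a fixed order."""
    return [key for key in DRIVER_OVERRIDES if key in spark_conf]


def strip_driver_overrides(spark_conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the config without the driver overrides."""
    return {k: v for k, v in spark_conf.items() if k not in DRIVER_OVERRIDES}
```

For example, passing a mapping that contains `spark.driver.host` and `spark.driver.port` to `find_driver_overrides` lists both keys, and `strip_driver_overrides` returns everything else untouched.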
--
You received this message because you are subscribed to the Google Groups "actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to actionml-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/actionml-user/dcf94173-4de3-440f-a046-f8094c746292%40googlegroups.com.

bspr...@gmail.com

Apr 10, 2020, 10:03:53 AM
to actionml-user
We only put those configs in because Harness wasn't opening (or listening) on the port the worker was trying to connect to. We got a connection refused error on that port.
So maybe we have a different underlying issue?

Thanks.
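
One way to narrow down a "connection refused" like the one described above is to probe the port from the worker machine. The following is a minimal Python sketch using only the standard library; the function name is made up for this example, and the host and port to test are whatever the worker log reports.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Run from the worker, a call such as `port_is_open("harness.vm", 45678)` would show whether the driver side is listening at all. It only tests TCP reachability and says nothing about whether the two sides speak compatible Spark versions.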


Pat Ferrel

Apr 10, 2020, 1:51:12 PM
to bspr...@gmail.com, actionml-user
I’d read the Spark docs to understand how to create a cluster. Spark is not client/server. You must have Spark itself configured via its XML config files, and the config on the driver machine and the master must match.

If you set the master and driver using the same config you will have no trouble. We have several clients running in this mode.

BTW you should have exactly the same version of Spark running on the cluster as is linked into Harness. We usually guarantee this by using our own containers for both driver/Harness and master/workers.
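
A quick way to compare the two sides is to look at the `spark-core` jar each one ships, since Spark's jar names carry both the Scala binary version and the Spark version (`spark-core_<scala>-<spark>.jar`). The Python sketch below parses those names and compares them. The function names are hypothetical, and the version numbers in the usage note are examples only, not the versions involved in this thread.

```python
import re
from typing import NamedTuple, Optional


class SparkBuild(NamedTuple):
    scala: str  # Scala binary version, e.g. "2.11"
    spark: str  # Spark version, e.g. "2.3.3"


_JAR_RE = re.compile(r"^spark-core_(?P<scala>\d+\.\d+)-(?P<spark>.+)\.jar$")


def parse_spark_core_jar(filename: str) -> Optional[SparkBuild]:
    """Extract Scala and Spark versions from a spark-core jar file name."""
    match = _JAR_RE.match(filename)
    if match is None:
        return None
    return SparkBuild(match.group("scala"), match.group("spark"))


def builds_match(driver_jar: str, cluster_jar: str) -> bool:
    """True only if both jars carry the same Scala and Spark versions."""
    driver = parse_spark_core_jar(driver_jar)
    cluster = parse_spark_core_jar(cluster_jar)
    if driver is None or cluster is None:
        raise ValueError("expected spark-core_<scala>-<spark>.jar file names")
    return driver == cluster
```

With example names, `builds_match("spark-core_2.11-2.3.3.jar", "spark-core_2.11-2.4.5.jar")` returns `False`. The jar on the cluster side would be found in the directory on the executor classpath (`/spark/jars` in the executor command earlier in this thread); where the driver-side jar lives depends on how Harness is packaged.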



bspr...@gmail.com

Apr 15, 2020, 11:23:28 AM
to actionml-user
It was mismatched Spark versions.

Thanks!
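
For anyone wondering why a version mismatch shows up as an `EOFException` inside `TaskDescription.decode` rather than as a clear version error: when the sender and receiver disagree on the binary layout of a message, the reader misinterprets bytes as lengths and counts and eventually runs off the end of the buffer. The Python sketch below reproduces that failure mode with a toy format. The layouts `v1` and `v2` are invented for illustration and are not Spark's actual wire format.

```python
import io
import struct


def write_utf(out, text):
    """Write a string as a 2-byte big-endian length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    out.write(struct.pack(">H", len(data)))
    out.write(data)


def read_exact(src, n):
    """Read exactly n bytes or raise EOFError, like a 'read fully' call."""
    data = src.read(n)
    if len(data) < n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def read_utf(src):
    (length,) = struct.unpack(">H", read_exact(src, 2))
    return read_exact(src, length).decode("utf-8")


def encode_v1(name, props):
    """Toy 'old' layout: name, property count, then key/value pairs."""
    out = io.BytesIO()
    write_utf(out, name)
    out.write(struct.pack(">i", len(props)))
    for key, value in props.items():
        write_utf(out, key)
        write_utf(out, value)
    return out.getvalue()


def decode_v1(payload):
    src = io.BytesIO(payload)
    name = read_utf(src)
    (count,) = struct.unpack(">i", read_exact(src, 4))
    props = {}
    for _ in range(count):
        key = read_utf(src)
        props[key] = read_utf(src)
    return name, props


def decode_v2(payload):
    """Toy 'new' layout: expects an extra string field that v1 never wrote."""
    src = io.BytesIO(payload)
    name = read_utf(src)
    extra = read_utf(src)  # misreads part of the v1 property count
    (count,) = struct.unpack(">i", read_exact(src, 4))
    props = {}
    for _ in range(count):  # count is now garbage, so the loop overruns
        key = read_utf(src)
        props[key] = read_utf(src)
    return name, extra, props
```

Decoding a `v1` payload with `decode_v1` round-trips cleanly, while feeding the same bytes to `decode_v2` raises `EOFError` partway through the property loop, which is the same shape as the stack trace in the original post.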


Pat Ferrel

Apr 15, 2020, 1:57:41 PM
to bspr...@gmail.com, actionml-user
Also make sure the config for the “client” code is the same as the config for the “cluster”; here I’m talking about the XML files for both sides.

BTW I strongly recommend you use our containers for Harness and Spark (as well as the other services). They will match. You can find the images on Docker Hub.
