Spark Standalone Mode is leaving a java process "spark.executor.StandaloneExecutorBackend" open

grothaus....@gmail.com

Feb 12, 2013, 5:32:48 AM
to spark...@googlegroups.com
Hello spark-users,

I just downloaded Spark 0.6.2 and tried the Standalone mode with the provided SparkPi example. After the example job has finished, a third java process is still running besides the master and the worker: spark.executor.StandaloneExecutorBackend. When I submit the example job again, a fourth java process remains after the job has finished, and so on. The StandaloneExecutorBackend processes do not seem to terminate correctly.

This is what I did:
  • opened three Windows command prompts (I am running Win 7 64-bit, Java 1.6 64-bit, Scala 2.9.2, and Spark 0.6.2 on a Lenovo ThinkPad with an Intel Core i5-2540M CPU (4 cores) and 8 GB RAM)
  • cd'ed into %SPARK_HOME%
  • ran these three commands (each in one command prompt window):
    • run spark.deploy.master.Master
    • run spark.deploy.worker.Worker --cores 3 --memory 1G spark://S-270:7077
    • run spark.examples.SparkPi spark://S-270:7077 8
The detailed output of the console windows is attached below. Please note the error message "java.io.IOException:Eine vorhandene Verbindung wurde vom Remotehost geschlossen" (An existing connection was closed by the remote host) in the master's output.

Is this a bug?


Kind regards

Christoph Grothaus



######################################################################################################################
# Master
######################################################################################################################

C:\DEV\PROGRAMME\spark\0.6.2>run spark.deploy.master.Master
13/02/12 11:02:52 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
13/02/12 11:02:52 INFO actor.ActorSystemImpl: RemoteServerStarted@akka://spark@S-270:7077
13/02/12 11:02:52 INFO master.Master: Starting Spark master at spark://S-270:7077
13/02/12 11:02:52 INFO io.IoWorker: IoWorker thread 'spray-io-worker-0' started
13/02/12 11:02:53 INFO server.HttpServer: akka://spark/user/HttpServer started on /0.0.0.0:8080
13/02/12 11:03:02 INFO actor.ActorSystemImpl: RemoteClientStarted@akka://spark@S-270:51561
13/02/12 11:03:02 INFO master.Master: Registering worker S-270:51561 with 3 cores, 1024.0 MB RAM
13/02/12 11:03:12 INFO actor.ActorSystemImpl: RemoteClientStarted@akka://spar...@10.1.112.102:51573
13/02/12 11:03:12 INFO master.Master: Registering job SparkPi
13/02/12 11:03:12 INFO master.Master: Registered job SparkPi with ID job-20130212110312-0000
13/02/12 11:03:12 INFO master.Master: Launching executor job-20130212110312-0000/0 on worker worker-20130212110302-S-270-51561
13/02/12 11:03:14 ERROR actor.ActorSystemImpl: RemoteServerError@akka://spark@S-270:7077] Error[java.io.IOException:Eine vorhandene Verbindung wurde vom Remotehost geschlossen
        at sun.nio.ch.SocketDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:25)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
        at sun.nio.ch.IOUtil.read(IOUtil.java:166)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:63)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:385)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:256)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
]
13/02/12 11:03:14 INFO actor.ActorSystemImpl: RemoteClientShutdown@akka://spa...@10.1.112.102:51573
13/02/12 11:03:14 INFO master.Master: Removing job job-20130212110312-0000
13/02/12 11:03:14 WARN master.Master: Got status update for unknown executor job-20130212110312-0000/0



######################################################################################################################
# Worker
######################################################################################################################

C:\DEV\PROGRAMME\spark\0.6.2>run spark.deploy.worker.Worker --cores 3 --memory 1G spark://S-270:7077
13/02/12 11:03:02 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
13/02/12 11:03:02 INFO actor.ActorSystemImpl: RemoteServerStarted@akka://spark@S-270:51561
13/02/12 11:03:02 INFO worker.Worker: Starting Spark worker S-270:51561 with 3 cores, 1024.0 MB RAM
13/02/12 11:03:02 INFO worker.Worker: Spark home: C:\DEV\PROGRAMME\spark\0.6.2
13/02/12 11:03:02 INFO worker.Worker: Connecting to master spark://S-270:7077
13/02/12 11:03:02 INFO actor.ActorSystemImpl: RemoteClientStarted@akka://spark@S-270:7077
13/02/12 11:03:02 INFO io.IoWorker: IoWorker thread 'spray-io-worker-0' started
13/02/12 11:03:02 INFO server.HttpServer: akka://spark/user/HttpServer started on /0.0.0.0:8081
13/02/12 11:03:02 INFO worker.Worker: Successfully registered with master
13/02/12 11:03:12 INFO worker.Worker: Asked to launch executor job-20130212110312-0000/0 for SparkPi
13/02/12 11:03:14 INFO worker.Worker: Asked to kill executor job-20130212110312-0000/0
13/02/12 11:03:14 INFO worker.ExecutorRunner: Killing process!
13/02/12 11:03:14 INFO worker.ExecutorRunner: Runner thread for executor job-20130212110312-0000/0 interrupted
13/02/12 11:03:14 INFO worker.Worker: Executor job-20130212110312-0000/0 finished with state KILLED



######################################################################################################################
# Driver program
######################################################################################################################

C:\DEV\PROGRAMME\spark\0.6.2>run spark.examples.SparkPi spark://S-270:7077 8
13/02/12 11:03:11 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
13/02/12 11:03:11 INFO actor.ActorSystemImpl: RemoteServerStarted@akka://spar...@10.1.112.102:51573
13/02/12 11:03:11 INFO storage.BlockManagerMaster: Registered BlockManagerMaster Actor
13/02/12 11:03:11 INFO storage.MemoryStore: MemoryStore started with capacity 323.9 MB.
13/02/12 11:03:11 INFO storage.DiskStore: Created local directory at J:\TEMP\spark-local-20130212110311-5415
13/02/12 11:03:11 INFO network.ConnectionManager: Bound socket to port 51578 with id = ConnectionManagerId(S-270,51578)
13/02/12 11:03:11 INFO storage.BlockManagerMaster: Trying to register BlockManager
13/02/12 11:03:11 INFO storage.BlockManagerMasterActor: Got Register Msg from master node, don't register it
13/02/12 11:03:11 INFO storage.BlockManagerMaster: BlockManager registered successfully @ syncRegisterBlockManager
13/02/12 11:03:11 INFO storage.BlockManagerMaster: Done registering BlockManager
13/02/12 11:03:11 INFO server.Server: jetty-7.5.3.v20111011
13/02/12 11:03:11 INFO server.AbstractConnector: Started SocketC...@0.0.0.0:51579 STARTING
13/02/12 11:03:11 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.1.112.102:51579
13/02/12 11:03:11 INFO spark.CacheTracker: Registered CacheTrackerActor actor
13/02/12 11:03:11 INFO spark.MapOutputTracker: Registered MapOutputTrackerActor actor
13/02/12 11:03:11 INFO spark.HttpFileServer: HTTP File server directory is J:\TEMP\spark-bbdcbec3-d987-469d-ace2-1aab895ea729
13/02/12 11:03:11 INFO server.Server: jetty-7.5.3.v20111011
13/02/12 11:03:11 INFO server.AbstractConnector: Started SocketC...@0.0.0.0:51580 STARTING
13/02/12 11:03:11 INFO client.Client$ClientActor: Connecting to master spark://S-270:7077
13/02/12 11:03:12 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:20
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Registering RDD 1 (map at SparkPi.scala:16)
13/02/12 11:03:12 INFO spark.CacheTracker: Registering RDD ID 1 with cache
13/02/12 11:03:12 INFO spark.CacheTrackerActor: Registering RDD 1 with 8 partitions
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Registering parent RDD 1 (map at SparkPi.scala:16)
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Registering parent RDD 0 (parallelize at SparkPi.scala:16)
13/02/12 11:03:12 INFO spark.CacheTracker: Registering RDD ID 0 with cache
13/02/12 11:03:12 INFO spark.CacheTrackerActor: Registering RDD 0 with 8 partitions
13/02/12 11:03:12 INFO spark.CacheTrackerActor: Asked for current cache locations
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:20) with 8 output partitions
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Final stage: Stage 0 (map at SparkPi.scala:16)
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Parents of final stage: List()
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Missing parents: List()
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Submitting Stage 0 (map at SparkPi.scala:16), which has no missing parents
13/02/12 11:03:12 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from Stage 0
13/02/12 11:03:12 INFO actor.ActorSystemImpl: RemoteClientStarted@akka://spark@S-270:7077
13/02/12 11:03:12 INFO cluster.ClusterScheduler: Adding task set 0.0 with 8 tasks
13/02/12 11:03:12 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with job ID job-20130212110312-0000
13/02/12 11:03:12 INFO client.Client$ClientActor: Executor added: job-20130212110312-0000/0 on worker-20130212110302-S-270-51561 (S-270) with 3 cores
13/02/12 11:03:12 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID job-20130212110312-0000/0 on host S-270 with 3 cores, 512.0 MB RAM
13/02/12 11:03:12 INFO client.Client$ClientActor: Executor updated: job-20130212110312-0000/0 is now RUNNING
13/02/12 11:03:13 INFO actor.ActorSystemImpl: RemoteClientStarted@akka://spark@S-270:51588
13/02/12 11:03:13 INFO cluster.SparkDeploySchedulerBackend: Registered slave: Actor[akka://spark@S-270:51588/user/Executor] with ID worker-20130212110302-S-270-51561
13/02/12 11:03:13 INFO cluster.TaskSetManager: Starting task 0.0:0 as TID 0 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:13 INFO cluster.TaskSetManager: Serialized task 0.0:0 as 1748 bytes in 16 ms
13/02/12 11:03:13 INFO cluster.TaskSetManager: Starting task 0.0:1 as TID 1 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:13 INFO cluster.TaskSetManager: Serialized task 0.0:1 as 1748 bytes in 0 ms
13/02/12 11:03:13 INFO cluster.TaskSetManager: Starting task 0.0:2 as TID 2 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:13 INFO cluster.TaskSetManager: Serialized task 0.0:2 as 1748 bytes in 0 ms
13/02/12 11:03:13 INFO actor.ActorSystemImpl: RemoteClientStarted@akka://spark@S-270:51598
13/02/12 11:03:13 INFO storage.BlockManagerMasterActor: Got Register Msg from master node, don't register it
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 2 in 421 ms (progress: 1/8)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Starting task 0.0:3 as TID 3 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 2)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Serialized task 0.0:3 as 1748 bytes in 0 ms
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 0 in 468 ms (progress: 2/8)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Starting task 0.0:4 as TID 4 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Serialized task 0.0:4 as 1748 bytes in 0 ms
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 1 in 436 ms (progress: 3/8)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Starting task 0.0:5 as TID 5 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 1)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Serialized task 0.0:5 as 1748 bytes in 16 ms
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 3 in 32 ms (progress: 4/8)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 3)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Starting task 0.0:6 as TID 6 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Serialized task 0.0:6 as 1748 bytes in 0 ms
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 4 in 63 ms (progress: 5/8)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 4)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Starting task 0.0:7 as TID 7 on slave worker-20130212110302-S-270-51561: S-270 (preferred)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Serialized task 0.0:7 as 1748 bytes in 0 ms
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 5 in 63 ms (progress: 6/8)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 5)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 6 in 46 ms (progress: 7/8)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 6)
13/02/12 11:03:14 INFO cluster.TaskSetManager: Finished TID 7 in 15 ms (progress: 8/8)
13/02/12 11:03:14 INFO scheduler.DAGScheduler: Completed ResultTask(0, 7)
13/02/12 11:03:14 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:20, took 2.186205 s
Pi is roughly 3.142615

C:\DEV\PROGRAMME\spark\0.6.2>



######################################################################################################################
# JVisualVM Overview Data for the StandaloneExecutorBackend java process that remains running
######################################################################################################################

PID: 8740
Host: localhost
Main class: spark.executor.StandaloneExecutorBackend
Arguments: akka://sp...@10.1.112.102:51573/user/StandaloneScheduler worker-20130212110302-S-270-51561 S-270 3

JVM: Java HotSpot(TM) 64-Bit Server VM (20.7-b02, mixed mode)
Java: version 1.6.0_32, vendor Sun Microsystems Inc.
Java Home: J:\jdks\jdk1.6.0_32\jre
JVM Flags: <none>

grothaus....@gmail.com

Feb 12, 2013, 11:14:39 AM
to spark...@googlegroups.com
After some investigation, I think that the call to process.destroy() in spark.deploy.worker.ExecutorRunner.kill() (line 60) does not kill the process (at least in my environment, Win 7). Maybe it would be better to send an appropriate termination message to the actor that is started in spark.executor.StandaloneExecutorBackend.run() (line 67), so that the executor process can shut itself down properly?
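
Something along these lines, as a rough, untested sketch (using the Akka 2.0-style API that Spark 0.6 builds against; the ShutdownExecutor message and the class names are only placeholders here, not Spark's actual worker/executor protocol):

import akka.actor.{Actor, ActorSystem, Props}

// Placeholder shutdown message, not part of Spark's real protocol.
case object ShutdownExecutor

// Stand-in for the actor that StandaloneExecutorBackend.run() starts.
class ExecutorBackendSketch extends Actor {
  def receive = {
    case ShutdownExecutor =>
      // Clean up, then stop the whole actor system; once no non-daemon
      // threads remain, the executor JVM exits on its own.
      context.system.shutdown()
  }
}

object ShutdownDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("executorSketch")
    val backend = system.actorOf(Props[ExecutorBackendSketch], name = "Executor")

    // The worker would send this message instead of calling process.destroy():
    backend ! ShutdownExecutor

    system.awaitTermination()
  }
}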

Kind regards

Christoph Grothaus

grothaus....@gmail.com

Feb 13, 2013, 3:27:13 AM
to spark...@googlegroups.com
I found the cause: under Mac OS X and Unix, there is a single shell script "run" that starts the Spark master, worker, and executor. Under Windows, there is a cascade: "run.cmd" calls "run2.cmd", which in turn calls java. So when the ExecutorRunner (which runs inside the worker process) wants to kill the executor process via process.destroy(), it actually only kills the process of "run.cmd", and the process of "run2.cmd" (i.e. the java process running the executor) stays alive.
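
To illustrate, here is a minimal, self-contained sketch (not the actual ExecutorRunner code, and with a simplified command line) of why the destroy() call falls short on Windows:

import java.io.File

object DestroyDemo {
  def main(args: Array[String]): Unit = {
    // Simplified stand-in for what the worker's ExecutorRunner does on Windows:
    // launch run.cmd, which calls run2.cmd, which finally starts java.
    val builder = new ProcessBuilder("cmd", "/c", "run.cmd",
      "spark.executor.StandaloneExecutorBackend")
    builder.directory(new File("C:\\DEV\\PROGRAMME\\spark\\0.6.2"))
    val process = builder.start()

    Thread.sleep(5000)

    // destroy() terminates only the cmd.exe that runs run.cmd; the java
    // process started further down the cascade is a separate OS process
    // and keeps running, which is exactly the leftover
    // StandaloneExecutorBackend visible in JVisualVM.
    process.destroy()
  }
}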

I would like to report this as an issue, but the GitHub issues page linked from http://spark-project.org/docs/latest/contributing-to-spark.html does not exist (GitHub returns a 404).

Kind regards

Christoph

Matei Zaharia

Feb 13, 2013, 3:28:56 AM
to spark...@googlegroups.com
Ah, got it. Our issue tracker is actually at https://spark-project.atlassian.net. Please report it there and we'll deal with it. Basically we want to make run.cmd "exec" run2.cmd instead of just spawning it as a subprocess... Not sure how to do that on Windows but we'll figure it out.

Matei



grothaus....@gmail.com

Feb 13, 2013, 5:12:51 AM
to spark...@googlegroups.com
Created issue SPARK-698.

Regards, Christoph