NOTE: Usually the Mesos master runs on port 5050 and the slaves on port 5051. Port 5050 was already taken by another app on my machine, so the master runs on 5051 and the slave on 5052.
Through Marathon, I then run Spark's MesosClusterDispatcher in a Docker container using this JSON:
{
  "id": "spark-cluster-dispatcher",
  "instances": 1,
  "container": {
    "docker": {
      "image": "eurobd/spark-cluster-dispatcher",
      "network": "HOST"
    },
    "type": "DOCKER"
  },
  "args": ["--master", "mesos://<mesos_ip>:5051"]
}
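For reference, this is how the app definition can be checked locally and submitted to Marathon's REST API. The file name, the `<marathon_ip>` placeholder, and port 8080 (Marathon's default) are assumptions about a default setup:

```shell
# Save the app definition shown above into a file.
cat > spark-cluster-dispatcher.json <<'EOF'
{
  "id": "spark-cluster-dispatcher",
  "instances": 1,
  "container": {
    "docker": {
      "image": "eurobd/spark-cluster-dispatcher",
      "network": "HOST"
    },
    "type": "DOCKER"
  },
  "args": ["--master", "mesos://<mesos_ip>:5051"]
}
EOF

# Sanity-check that the file is valid JSON before submitting it.
python3 -m json.tool spark-cluster-dispatcher.json > /dev/null && echo "JSON OK"

# Submit it to Marathon (commented out here; adjust host/port as needed):
# curl -X POST -H "Content-Type: application/json" \
#      http://<marathon_ip>:8080/v2/apps -d @spark-cluster-dispatcher.json
```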
And here is the content of the Dockerfile:
FROM java:8-jdk

## DEPENDENCIES ##
# Install the Mesos native libraries from the Mesosphere repository.
RUN echo "deb http://repos.mesosphere.io/ubuntu/ trusty main" > /etc/apt/sources.list.d/mesosphere.list
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
RUN apt-get update && apt-get install --assume-yes mesos

## SPARK ##
# Alternatively, download the Spark distribution instead of copying it in:
#ADD http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz /tmp/spark.tgz
#RUN mkdir -p /opt/spark && tar xzf /tmp/spark.tgz -C /opt/spark --strip=1 && rm -f /tmp/spark.tgz
COPY spark-*.tgz /tmp/spark.tgz
RUN mkdir -p /opt/spark && tar xzf /tmp/spark.tgz -C /opt/spark --strip=1 && rm -f /tmp/spark.tgz

EXPOSE 7077
WORKDIR /opt/spark
ENTRYPOINT ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.mesos.MesosClusterDispatcher"]
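As a side note, the `--strip=1` in the extraction step assumes the tarball has a single top-level directory, as the official Spark distributions do. A tiny local demo of what it does, using a dummy archive (all names below are made up for illustration):

```shell
# Build a dummy tarball that mimics the Spark distribution layout.
mkdir -p demo/spark-1.6.1-bin-hadoop2.6/bin
echo '#!/bin/sh' > demo/spark-1.6.1-bin-hadoop2.6/bin/spark-class
tar czf demo/spark.tgz -C demo spark-1.6.1-bin-hadoop2.6

# Extract with --strip=1: the top-level directory is dropped, so bin/
# lands directly under the target, just like /opt/spark/bin in the image.
mkdir -p demo/opt-spark
tar xzf demo/spark.tgz -C demo/opt-spark --strip=1
ls demo/opt-spark
```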
It appears to be running perfectly on Marathon and Mesos. However, whenever I try to submit a job, it fails. Here is the command I use to start the job:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://MASTER_IP:7077 \
  --deploy-mode cluster \
  --supervise \
  --driver-memory 8G \
  http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar
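For context on the jar URI: the Mesos fetcher downloads it over HTTP, so the jar only needs to be reachable from the slaves. Any static file server works; a minimal sketch with Python's built-in server (the directory name is an assumption, only the port matches my setup):

```shell
# Serve the directory containing the examples jar on port 1337 so the
# Mesos fetcher can download it.
mkdir -p jars        # put spark-examples-1.6.1-hadoop2.6.0.jar in here
(cd jars && exec python3 -m http.server 1337) &
SERVER_PID=$!
sleep 2
# Verify the server answers before pointing spark-submit at it.
curl -sI http://127.0.0.1:1337/ | head -n 1
kill "$SERVER_PID"
```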
Am I doing something wrong? Isn't Mesos supposed to allocate memory for the executors itself?
Here are the various logs I looked through, which didn't help me:
Executor stderr:
I0511 12:37:09.953196 3133 logging.cpp:188] INFO level logging started!
I0511 12:37:09.953541 3133 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/6aa06342-0200-4f28-9e34-ba6b070f1071-S0\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"http:\/\/MASTER_IP:1337\/spark-examples-1.6.1-hadoop2.6.0.jar"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/6aa06342-0200-4f28-9e34-ba6b070f1071-S0\/frameworks\/6aa06342-0200-4f28-9e34-ba6b070f1071-0002\/executors\/driver-20160511123705-0002\/runs\/d882577f-46ef-45e1-8968-1350820410a5","user":"root"}
I0511 12:37:09.955236 3133 fetcher.cpp:379] Fetching URI 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar'
I0511 12:37:09.955263 3133 fetcher.cpp:250] Fetching directly into the sandbox directory
I0511 12:37:09.955289 3133 fetcher.cpp:187] Fetching URI 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar'
I0511 12:37:09.955312 3133 fetcher.cpp:134] Downloading resource from 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar' to '/tmp/mesos/slaves/6aa06342-0200-4f28-9e34-ba6b070f1071-S0/frameworks/6aa06342-0200-4f28-9e34-ba6b070f1071-0002/executors/driver-20160511123705-0002/runs/d882577f-46ef-45e1-8968-1350820410a5/spark-examples-1.6.1-hadoop2.6.0.jar'
W0511 12:37:10.099078 3133 fetcher.cpp:272] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar
I0511 12:37:10.099280 3133 fetcher.cpp:456] Fetched 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar' to '/tmp/mesos/slaves/6aa06342-0200-4f28-9e34-ba6b070f1071-S0/frameworks/6aa06342-0200-4f28-9e34-ba6b070f1071-0002/executors/driver-20160511123705-0002/runs/d882577f-46ef-45e1-8968-1350820410a5/spark-examples-1.6.1-hadoop2.6.0.jar'
I0511 12:37:10.258157 3136 logging.cpp:188] INFO level logging started!
I0511 12:37:10.260020 3136 exec.cpp:143] Version: 0.28.0
I0511 12:37:10.261939 3143 exec.cpp:472] Slave exited ... shutting down
Executor stdout:
Shutting down
(nothing more)
MesosClusterUI driver "last failed status":
task_id { value: "driver-20160511123705-0002" } state: TASK_FAILED message: "Executor terminated" slave_id { value: "6aa06342-0200-4f28-9e34-ba6b070f1071-S0" } timestamp: 1.462970414349151E9 executor_id { value: "driver-20160511123705-0002" } source: SOURCE_SLAVE reason: REASON_EXECUTOR_TERMINATED 11: "\034\360p\031\332)D\037\217u$\315\300\032\236\033" 13: ""
Slave logs (only a part): http://pastie.org/private/k62ilh2zqgoscoxtifpydg. They are not very readable, but there is a recurring error worth noting (I can't figure out why it appears, or whether it could cause the failures):
Failed to get resource statistics for executor 'spark-cluster-dispatcher.a44015d3-1773-11e6-8f0f-0242ac110003' of framework 6aa06342-0200-4f28-9e34-ba6b070f1071-0000: Failed to collect cgroup stats: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/19760/cgroup: Failed to open file '/proc/19760/cgroup': No such file or directory
I can probably give more information if this isn't enough. Thanks for your help!