Spark job fails on Mesos cluster via mesos dispatcher


Ellande Barthélémy

May 17, 2016, 5:36:06 AM
to mesos-user

I am running - currently on a single machine - one Mesos master, one slave, one ZooKeeper instance, and Marathon. Each runs inside its own Docker container, and they all seem to be communicating with each other correctly.


NOTE: Usually the Mesos master runs on port 5050 and slaves on port 5051. Port 5050 was already used by another app on my machine, so the master runs on port 5051 and the slave on port 5052.
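
For reference, the nonstandard ports are simply passed through Mesos's --port flag; the containers are started with something along these lines (ZooKeeper address and paths simplified, the exact docker run wrapping omitted):

    # master moved from the default 5050 to 5051
    mesos-master --port=5051 --zk=zk://<zk_ip>:2181/mesos --quorum=1 --work_dir=/var/lib/mesos
    # slave moved from the default 5051 to 5052
    mesos-slave --port=5052 --master=zk://<zk_ip>:2181/mesos --work_dir=/var/lib/mesos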

Through Marathon, I then run Spark's MesosClusterDispatcher in a Docker container using this JSON:

    {
      "id": "spark-cluster-dispatcher",
      "instances": 1,
      "container": {
        "docker": {
          "image": "eurobd/spark-cluster-dispatcher",
          "network": "HOST"
        },
        "type": "DOCKER"
      },
      "args": ["--master", "mesos://<mesos_ip>:5051"]
    }
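
The app definition is posted to Marathon's REST API with something like the following (the file name and Marathon's default port 8080 are assumptions):

    curl -X POST http://<marathon_ip>:8080/v2/apps \
      -H "Content-Type: application/json" \
      -d @spark-cluster-dispatcher.json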
 

And here is the content of the Dockerfile:

FROM java:8-jdk

## DEPENDENCIES ##
RUN echo "deb http://repos.mesosphere.io/ubuntu/ trusty main" > /etc/apt/sources.list.d/mesosphere.list
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
RUN apt-get update
RUN apt-get install --assume-yes mesos

## SPARK ##
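# Alternative kept for reference: download Spark at build time
# instead of copying in a local tarball (the COPY below is what is used):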
#ADD http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz /tmp/spark.tgz
#RUN mkdir -p /opt/spark && tar xzf /tmp/spark.tgz -C /opt/spark --strip=1 && rm -f /tmp/spark.tgz

COPY spark-*.tgz /tmp/spark.tgz
RUN mkdir -p /opt/spark && tar xzf /tmp/spark.tgz -C /opt/spark --strip=1 && rm -f /tmp/spark.tgz

EXPOSE 7077
WORKDIR /opt/spark
ENTRYPOINT ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.mesos.MesosClusterDispatcher"]
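
Putting the Marathon args and the image's ENTRYPOINT together, the dispatcher container effectively runs:

    /opt/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --master mesos://<mesos_ip>:5051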

It appears to be running perfectly on Marathon and Mesos. However, when I try to submit a job, it keeps failing. Here is the command I use to start the job:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master mesos://MASTER_IP:7077 \
      --deploy-mode cluster \
      --supervise \
      --driver-memory 8G \
      http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar
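
(The examples jar is served to the cluster over plain HTTP on port 1337; any simple file server works, e.g. something like the following, run from the directory containing the jar:)

    python -m SimpleHTTPServer 1337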


As you can see in the capture below, it looks like Mesos never allocates more than 32MB of memory (the minimum), even though I ask for more. The memory shown in the left-hand menu grows, then the task fails.

[screenshot of the Mesos web UI]

Am I doing something wrong? Isn't Mesos supposed to allocate memory for the executors itself?
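
For reference, the resources the slave advertises can be inspected through the master's state endpoint (the master runs on 5051 in my setup, see the NOTE above):

    curl http://MASTER_IP:5051/state.json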


Here are the various logs that I read through, but they didn't help me:

Executor stderr:

I0511 12:37:09.953196  3133 logging.cpp:188] INFO level logging started!
I0511 12:37:09.953541  3133 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/6aa06342-0200-4f28-9e34-ba6b070f1071-S0\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"http:\/\/MASTER_IP:1337\/spark-examples-1.6.1-hadoop2.6.0.jar"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/6aa06342-0200-4f28-9e34-ba6b070f1071-S0\/frameworks\/6aa06342-0200-4f28-9e34-ba6b070f1071-0002\/executors\/driver-20160511123705-0002\/runs\/d882577f-46ef-45e1-8968-1350820410a5","user":"root"}
I0511 12:37:09.955236  3133 fetcher.cpp:379] Fetching URI 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar'
I0511 12:37:09.955263  3133 fetcher.cpp:250] Fetching directly into the sandbox directory
I0511 12:37:09.955289  3133 fetcher.cpp:187] Fetching URI 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar'
I0511 12:37:09.955312  3133 fetcher.cpp:134] Downloading resource from 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar' to '/tmp/mesos/slaves/6aa06342-0200-4f28-9e34-ba6b070f1071-S0/frameworks/6aa06342-0200-4f28-9e34-ba6b070f1071-0002/executors/driver-20160511123705-0002/runs/d882577f-46ef-45e1-8968-1350820410a5/spark-examples-1.6.1-hadoop2.6.0.jar'
W0511 12:37:10.099078  3133 fetcher.cpp:272] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar
I0511 12:37:10.099280  3133 fetcher.cpp:456] Fetched 'http://MASTER_IP:1337/spark-examples-1.6.1-hadoop2.6.0.jar' to '/tmp/mesos/slaves/6aa06342-0200-4f28-9e34-ba6b070f1071-S0/frameworks/6aa06342-0200-4f28-9e34-ba6b070f1071-0002/executors/driver-20160511123705-0002/runs/d882577f-46ef-45e1-8968-1350820410a5/spark-examples-1.6.1-hadoop2.6.0.jar'
I0511 12:37:10.258157  3136 logging.cpp:188] INFO level logging started!
I0511 12:37:10.260020  3136 exec.cpp:143] Version: 0.28.0
I0511 12:37:10.261939  3143 exec.cpp:472] Slave exited ... shutting down

Executor stdout only says "Shutting down" (nothing more).

MesosClusterUI driver "last failed status":

task_id { value: "driver-20160511123705-0002" } state: TASK_FAILED message: "Executor terminated" slave_id { value: "6aa06342-0200-4f28-9e34-ba6b070f1071-S0" } timestamp: 1.462970414349151E9 executor_id { value: "driver-20160511123705-0002" } source: SOURCE_SLAVE reason: REASON_EXECUTOR_TERMINATED 11: "\034\360p\031\332)D\037\217u$\315\300\032\236\033" 13: ""

Slave logs (only a part): http://pastie.org/private/k62ilh2zqgoscoxtifpydg. They are not really readable, but there is a recurring error worth noting (I can't figure out why it's there, or whether it could be the cause of the failures):

Failed to get resource statistics for executor 'spark-cluster-dispatcher.a44015d3-1773-11e6-8f0f-0242ac110003' of framework 6aa06342-0200-4f28-9e34-ba6b070f1071-0000: Failed to collect cgroup stats: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/19760/cgroup: Failed to open file '/proc/19760/cgroup': No such file or directory

I can provide more information if this isn't enough. Thanks for your help!

