Spark Connection Refused Exception


Karthik Thiyagarajan

Nov 18, 2011, 3:36:26 PM
to Spark Users

Hi,

We were trying to run our code on our small 3-node test cluster, but
are having issues (see stack trace below).
The hosts have no firewall configured and are on the same subnet. I
also verified that name resolution works properly among the nodes,
binding to eth0 and not localhost.
We're running Spark 0.3 with Mesos pre-protobuf.
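
For anyone hitting the same thing: the hostname check described above can also be done from inside the JVM. A small standalone sketch (plain JDK, nothing Spark-specific; the loopback warning mirrors the failure mode suspected in this thread):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Prints what this machine's hostname resolves to. If it resolves to a
// loopback address, remote executors cannot connect back to it.
public class HostCheck {
    public static String resolvedAddress() {
        try {
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            return "unresolvable";
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("hostname:    " + local.getHostName());
        System.out.println("resolves to: " + local.getHostAddress());
        if (local.getHostAddress().startsWith("127.")) {
            System.out.println("WARNING: hostname resolves to loopback, not eth0");
        }
    }
}
```

Running this on each node should print the eth0 address, matching what `hostname` and `/etc/hosts` report.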

Matei's response in an email conversation was:

"This could mean one of two things -- either one machine isn't
reporting its hostname correctly (so others can't connect to it), or
the executor on it actually crashed. Look at the master too to see
whether it has any "task lost" messages indicating that one of the
nodes crashed, or look at the executor logs to see whether multiple
ones ran on one of the nodes.
By the way, it would be better if you post this on the user group,
because it will make it to more of the Spark team's mailboxes and it
will undoubtedly help other users in the future. We only switched to
Google Groups from a Berkeley mailing list a week ago, so that's why
there aren't other posts."

Extending that conversation:

To make debugging easier, I'm now running a Mesos cluster with a
master and one slave (on the same machine).
I can still reproduce the problem (the exception with the stack trace
appears in the stderr logs for the task on the slave).

From the executor logs, it looks like the master loses all the tasks
and retries them.
(11/11/18 12:14:18 INFO spark.SimpleJob: Lost TID 196 (task 33:1)
11/11/18 12:14:18 INFO spark.SimpleJob: Loss was due to fetch failure
from http://10.10.5.152:58854)

Looking at the logs from the Mesos slave, it goes into an endless
cycle of starting the executor, getting assigned tasks, and then
disconnecting/killing the framework. The master keeps trying to
launch the task on the slave without ever giving up.

I don't know if it's related (maybe it gives some clues about the
state of the Spark/Mesos code I'm using), but I tried running all the
examples and have issues specifically with BroadcastTest and
MultiBroadcastTest.
When the master and slave are on the same machine, the tests run fine.
But when they are on different machines, I get a "Broadcast variable
not found exception" on the remote slave.

I'd be happy to dive deeper into the details of what's going on, and
I'd appreciate it if someone could point me in the right direction.

Thanks,
Karthik


Stack trace (logs from stderr on the slave):

SLF4J: The following loggers will not work because they were created
SLF4J: during the default configuration phase of the underlying
SLF4J: logging system.
SLF4J: See also http://www.slf4j.org/codes.html#substituteLogger
SLF4J: spark.Executor
SLF4J: spark.Executor
11/11/17 23:19:21 INFO spark.Executor: Running task ID 270
11/11/17 23:19:21 INFO spark.MapOutputTracker: Updating generation to 168 and clearing cache
11/11/17 23:19:21 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 0
11/11/17 23:19:21 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 1
11/11/17 23:19:21 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 2
11/11/17 23:19:21 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/17 23:19:21 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/17 23:19:21 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/17 23:19:21 INFO spark.MapOutputTracker: Doing the fetch; tracker actor = 'MapOutputTracker@Node(phd10.quantifind.com,50501)
11/11/17 23:19:21 ERROR spark.SimpleShuffleFetcher: Fetch failed
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient.java:323)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:975)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:916)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:841)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1177)
at java.net.URL.openStream(URL.java:1010)
at spark.SimpleShuffleFetcher$$anonfun$fetch$5$$anonfun$apply$1.apply$mcVI$sp(SimpleShuffleFetcher.scala:25)
at spark.SimpleShuffleFetcher$$anonfun$fetch$5$$anonfun$apply$1.apply(SimpleShuffleFetcher.scala:20)
at spark.SimpleShuffleFetcher$$anonfun$fetch$5$$anonfun$apply$1.apply(SimpleShuffleFetcher.scala:20)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:44)
at spark.SimpleShuffleFetcher$$anonfun$fetch$5.apply(SimpleShuffleFetcher.scala:20)
at spark.SimpleShuffleFetcher$$anonfun$fetch$5.apply(SimpleShuffleFetcher.scala:19)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:44)
at spark.SimpleShuffleFetcher.fetch(SimpleShuffleFetcher.scala:19)
at spark.ShuffledRDD.compute(ShuffledRDD.scala:38)
at spark.RDD.iterator(RDD.scala:79)
at spark.FilteredRDD.compute(RDD.scala:218)
at spark.RDD.iterator(RDD.scala:79)
at spark.MappedRDD.compute(RDD.scala:202)
at spark.RDD.iterator(RDD.scala:79)
at spark.MappedRDD.compute(RDD.scala:202)
at spark.RDD.iterator(RDD.scala:79)
at spark.FlatMappedRDD.compute(RDD.scala:210)
at spark.RDD.iterator(RDD.scala:79)
at spark.FilteredRDD.compute(RDD.scala:218)
at spark.RDD.iterator(RDD.scala:79)
at spark.ResultTask.run(ResultTask.scala:10)
at spark.Executor$TaskRunner.run(Executor.scala:66)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[the same "Fetch failed" stack trace is repeated twice more]

Matei Zaharia

Nov 18, 2011, 4:41:12 PM
to spark...@googlegroups.com
Hey Karthik,

This really looks like the machine might be reporting the wrong hostname for itself, or that hostname might be resolving to the wrong IP. Can you check that 10.10.5.152 is actually the IP of the machine? Also, does the machine happen to have multiple network interfaces with different IP addresses?

It's unfortunate that you get that SLF4J message at the top, because that means we might be missing some log messages. That's a bug that was fixed in the master branch of Spark. You can fix it in 0.3 by adding the following line in core/src/main/scala/spark/Executor.scala (right after the var env: SparkEnv = ...):

initLogging()
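
(Aside for later readers: the underlying bug pattern is a logger being created while the logging system is still in its configuration phase; initLogging() forces creation at a safe point. A generic, standalone Java illustration of eager vs. deferred initialization, not the actual Spark/SLF4J code:)

```java
import java.util.function.Supplier;

// A field initialized at class-load time observes the pre-configuration
// state; one created on demand observes the configured state.
public class LazyInitDemo {
    static boolean configured = false;

    // Built eagerly during class initialization: too early.
    static String eagerLogger = makeLogger();

    // Built on first use (re-evaluated per call in this toy example).
    static Supplier<String> lazyLogger = LazyInitDemo::makeLogger;

    static String makeLogger() {
        return configured ? "real logger" : "substitute logger";
    }

    public static void main(String[] args) {
        configured = true; // the logging system finishes configuring here
        System.out.println("eager: " + eagerLogger);      // substitute logger
        System.out.println("lazy:  " + lazyLogger.get()); // real logger
    }
}
```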

Matei

Erich Nachbar

Nov 18, 2011, 5:34:24 PM
to Spark Users
Comments inline (working with Karthik on this issue):

> This really looks like the machine might be reporting the wrong hostname for itself, or that hostname might be resolving to the wrong IP. Can you check that 10.10.5.152 is actually the IP of the machine? Also, does the machine happen to have multiple network interfaces with different IP addresses?

Machine has only one host interface (with ipv6 disabled):
eth0 Link encap:Ethernet HWaddr 00:25:90:2D:0D:5E
inet addr:10.10.5.152 Bcast:10.10.5.255 Mask:255.255.255.0

Also 'hostname' shows phd10 as the output and it resolves to said IP.
It also looks like the processes are listening on the correct
interface:

[root@phd10 ~]# netstat -an | grep 45668
tcp        0      0 10.10.5.152:45668    0.0.0.0:*            LISTEN
tcp        0      0 10.10.5.152:45668    10.10.5.152:56553    ESTABLISHED
tcp        0      0 10.10.5.152:56553    10.10.5.152:45668    ESTABLISHED

[root@phd10 ~]# netstat -an | grep 5050
tcp        0      0 10.10.5.152:5050     0.0.0.0:*            LISTEN
tcp        0      0 10.10.5.152:58072    10.10.5.152:5050     ESTABLISHED
tcp        0      0 10.10.5.152:5050     10.10.5.152:58072    ESTABLISHED
tcp        0      0 10.10.5.152:5050     10.10.5.153:49849    ESTABLISHED

I also checked that the firewall is down:
[root@phd10 ~]# /etc/init.d/iptables status
Firewall is stopped.

We are next trying to apply your suggested logging fix.

Karthik Thiyagarajan

Nov 18, 2011, 6:01:21 PM
to Spark Users

I added the initLogging() line to Executor.scala (in Spark 0.3) to fix
the logging problem.

The updated stack trace looks like this (it also logs the other task
IDs running on the slave):

11/11/18 14:53:18 INFO spark.Executor: Running task ID 46
11/11/18 14:53:18 INFO spark.Executor: Running task ID 50
11/11/18 14:53:18 INFO spark.Executor: Running task ID 48
11/11/18 14:53:18 INFO spark.Executor: Running task ID 52
11/11/18 14:53:18 INFO spark.MapOutputTracker: Updating generation to 24 and clearing cache
11/11/18 14:53:18 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 2
11/11/18 14:53:18 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 7
11/11/18 14:53:18 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 3
11/11/18 14:53:18 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 4
11/11/18 14:53:18 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:53:18 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:53:18 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:53:18 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:53:18 INFO spark.MapOutputTracker: Doing the fetch; tracker actor = 'MapOutputTracker@Node(phd10.quantifind.com,50501)
11/11/18 14:53:19 ERROR spark.SimpleShuffleFetcher: Fetch failed
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
.....


As opposed to (before the logging fix) :

SLF4J: The following loggers will not work because they were created
SLF4J: during the default configuration phase of the underlying
SLF4J: logging system.
SLF4J: See also http://www.slf4j.org/codes.html#substituteLogger
SLF4J: spark.Executor
SLF4J: spark.Executor
SLF4J: spark.Executor
11/11/18 14:47:50 INFO spark.Executor: Running task ID 11
11/11/18 14:47:50 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 1
11/11/18 14:47:50 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 0
11/11/18 14:47:50 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 6
11/11/18 14:47:50 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 5
11/11/18 14:47:50 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:47:50 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:47:50 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:47:50 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 14:47:50 INFO spark.MapOutputTracker: Doing the fetch; tracker actor = 'MapOutputTracker@Node(phd10.quantifind.com,50501)
11/11/18 14:47:50 ERROR spark.SimpleShuffleFetcher: Fetch failed
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)

...


I also see some instances of a MalformedURLException (perhaps related):

11/11/18 14:52:49 ERROR spark.SimpleShuffleFetcher: Fetch failed
java.net.MalformedURLException: no protocol: null/shuffle/0/1/2
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
...
at spark.Executor$TaskRunner.run(Executor.scala:67)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


-- Karthik

Matei Zaharia

Nov 18, 2011, 6:20:01 PM
to spark...@googlegroups.com
That null stuff is really weird. One possibility is that port 50501, which the master listens on to talk to slaves, is taken. Unfortunately, because we're using Scala actors for communication, there's no way to get an error if this is the case (the library doesn't tell you if you try to bind to a port that's taken!). Try passing the following JVM option to your master to set a different port (choose a number lower than 30K so it's outside the ephemeral port range):

-Dspark.master.port=1234
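
Since the actor library won't report a failed bind, one thing you can do is probe the port yourself before starting the master. A standalone JDK sketch (not part of Spark; note the check is inherently racy, since another process can grab the port between the probe and the real bind):

```java
import java.io.IOException;
import java.net.ServerSocket;

// Reports whether a TCP port can currently be bound on this machine.
public class PortCheck {
    public static boolean isFree(int port) {
        try (ServerSocket ss = new ServerSocket(port)) {
            return true;  // we could bind it, so it was free just now
        } catch (IOException e) {
            return false; // bind failed; something already holds the port
        }
    }

    public static void main(String[] args) {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 50501;
        System.out.println("port " + port + " free: " + isFree(port));
    }
}
```

If it prints false for the port you chose, pick another value for spark.master.port.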

It might also help to try a really short program which should give the same kind of error, because it also does a shuffle. Try something like this:

import spark.SparkContext
import spark.SparkContext._

object TestProg {
  def main(args: Array[String]) {
    val sc = new SparkContext("your-master", "test")
    sc.parallelize(1 to 10).groupBy(_ % 2).collect().foreach(println)
  }
}

For this smaller program, can you attach the full logs?

Matei

Matei Zaharia

Nov 18, 2011, 6:24:23 PM
to spark...@googlegroups.com
By the way, this "port being taken" thing would also happen, rather reliably, if you're running two driver programs on the same machine. I have an open issue to switch to a different actor library that will let us bind to a free port and see which port that was: https://github.com/mesos/spark/issues/90.

Matei

Erich Nachbar

Nov 18, 2011, 6:43:38 PM
to Spark Users
OK, so I verified that the port was not taken beforehand and that
Mesos bound to it:

COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 1301 erich 62u IPv4 60535337 TCP *:50501 (LISTEN)
java 1301 erich 63u IPv4 60535339 TCP phd10.quantifind.com:50501->phd10.quantifind.com:60824 (CLOSE_WAIT)
....

We are running the short example next.


Karthik Thiyagarajan

Nov 18, 2011, 7:15:00 PM
to Spark Users
So I ran the test program with the shuffle. It ran successfully.

Logs from the slave node:

11/11/18 15:52:43 INFO spark.Executor: Running task ID 1
11/11/18 15:52:43 INFO spark.Executor: Running task ID 6
11/11/18 15:52:43 INFO spark.Executor: Running task ID 7
11/11/18 15:52:43 INFO spark.Executor: Running task ID 5
11/11/18 15:52:43 INFO spark.Executor: Running task ID 2
11/11/18 15:52:43 INFO spark.Executor: Running task ID 0
11/11/18 15:52:43 INFO spark.Executor: Running task ID 3
11/11/18 15:52:43 INFO spark.Executor: Running task ID 4
11/11/18 15:52:43 INFO spark.LocalFileShuffle: Shuffle dir: /tmp/spark-local-3fe2df23-4ce5-4da6-bcc5-99c1463ace06/shuffle
11/11/18 15:52:44 INFO util.log: jetty-7.4.2.v20110526
11/11/18 15:52:44 INFO util.log: Started SelectChann...@0.0.0.0:41648 STARTING
11/11/18 15:52:44 INFO spark.LocalFileShuffle: Local URI: http://10.10.5.152:41648
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 4
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 2
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 5
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 7
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 1
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 3
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 6
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 0
11/11/18 15:52:44 INFO spark.Executor: Running task ID 8
11/11/18 15:52:44 INFO spark.Executor: Running task ID 9
11/11/18 15:52:44 INFO spark.Executor: Running task ID 10
11/11/18 15:52:44 INFO spark.Executor: Running task ID 11
11/11/18 15:52:44 INFO spark.Executor: Running task ID 12
11/11/18 15:52:44 INFO spark.Executor: Running task ID 13
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 0
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 3
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 2
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 5
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 4
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 1
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.MapOutputTracker: Doing the fetch; tracker actor = 'MapOutputTracker@Node(phd10.quantifind.com,50501)
11/11/18 15:52:44 INFO spark.Executor: Running task ID 14
11/11/18 15:52:44 INFO spark.Executor: Running task ID 15
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 6
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 7
11/11/18 15:52:44 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 14
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 10
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 11
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 12
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 15
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 9
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 13
11/11/18 15:52:44 INFO spark.Executor: Finished task ID 8

FYI, my program runs fine in local mode (the problem is when it runs
on the Mesos cluster).
Let me start simplifying my program to the point where it doesn't
break. That ought to help zero in on the issue.

-- Karthik


Matei Zaharia

Nov 19, 2011, 6:01:41 PM
to spark...@googlegroups.com
Ah, got it. There is definitely a limit on how much data can be returned from a task right now -- it should not be more than a few MB. This is mostly because we send it through the Mesos library for convenience, but that's not optimized for large messages. I've opened an issue to fix this, but in the meantime, there are two possibilities:

1) If your application allows it, you could run smaller tasks (at a higher level of parallelism).

2) Instead of returning the data with collect(), you can call save() to write it to HDFS or some other storage system, and load it manually from there. This is less convenient but it should work.

Thanks for tracking this down! It looks like it fails in a pretty weird way, and I'm still not sure how the null URLs got there, but now I have a way to reproduce it. Here's the issue to track it: https://github.com/mesos/spark/issues/94.

Matei

On Nov 19, 2011, at 2:53 PM, Karthik Thiyagarajan wrote:

> After investigating my code, I found out that there were issues in the
> way I was calling collect() on the RDD (something to do with either
> the impl or my understanding of it). I understand that collect is
> usually called on a sufficiently "small" subset of data after a filter
> operation. Although I do a filter operation, collect (in my case)
> returns a pretty "considerable" amount of data to the driver program.
> (I haven't estimated the data sizes involved, but "considerable" here
> is on the order of a few hundred megabytes.)
> I found that when a task needs to return big amounts of data to the
> driver program, for some reason the master loses the tasks. When there
> is a groupBy (shuffle) involved, in addition to the tasks being lost,
> the logs on the slave have ConnectionRefused and MalformedURL
> exceptions.
> Here are some of the results from experimenting with different test
> programs. I haven't gone through the Spark source to find out why they
> are happening, but hopefully someone who knows the source will be able
> to interpret them better.
>
> Configuration: Mesos cluster with one slave; master and slave on the
> same physical machine.
> Program 1 : Simple program with a map and a collect. Works fine.
>
> import spark.SparkContext
> import spark.SparkContext._
>
> object TestProgram {
>   def main(args: Array[String]) {
>     if (args.length == 0) {
>       System.err.println("Usage: TestProgram <host> [<slices>]")
>       System.exit(1)
>     }
>     val sc = new SparkContext(args(0), "Testprogram")
>     sc.parallelize((1 to 10000), 2).map((_, Array.fill(10){"data"})).collect()
>   }
> }
> ===
> Program 2 : Increased the amount of data the slave has to return to
> the driver program (in the order of a few gigabytes). Runs fine in
> local mode but fails on the cluster with the master losing tasks. The
> master keeps trying to restart the failed tasks in an endless cycle
> without giving up.
>
> import spark.SparkContext
> import spark.SparkContext._
>
> object TestProgram {
>   def main(args: Array[String]) {
>     if (args.length == 0) {
>       System.err.println("Usage: TestProgram <host> [<slices>]")
>       System.exit(1)
>     }
>     val sc = new SparkContext(args(0), "Testprogram")
>     sc.parallelize((1 to 10000), 2).map((_, Array.fill(100000){"data"})).collect()
>   }
> }
>
> Logs at executor :
> 11/11/19 14:14:39 INFO spark.MesosScheduler: Submitting Stage 0, which has no missing parents
> 11/11/19 14:14:39 INFO spark.MesosScheduler: Got a job with 2 tasks
> 11/11/19 14:14:39 INFO spark.MesosScheduler: Adding job with ID 0
> 11/11/19 14:14:39 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:14:39 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:14:51 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
> 11/11/19 14:14:51 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
> ...
>
> Logs at slave :
> 11/11/19 14:15:07 INFO spark.Executor: Running task ID 0
> 11/11/19 14:15:07 INFO spark.Executor: Running task ID 1
> ...
> ===
>
> Program 3 : Same as Program 2 but added a groupBy statement. Exhibits
> the same problem as Program 2, but the slave has a few exceptions (below).
>
> import spark.SparkContext
> import spark.SparkContext._
>
> object TestProgram {
>   def main(args: Array[String]) {
>     if (args.length == 0) {
>       System.err.println("Usage: TestProgram <host> [<slices>]")
>       System.exit(1)
>     }
>     val sc = new SparkContext(args(0), "Testprogram")
>     sc.parallelize((1 to 10000), 2).map(num => (num.toString, num))
>       .groupByKey().map((_, Array.fill(100000){"data"})).collect()
>   }
> }
>
> Logs at executor :
> 11/11/19 14:24:28 INFO spark.MesosScheduler: Adding job with ID 1
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:0 as TID 2 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:1 as TID 3 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:2 as TID 4 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:3 as TID 5 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:4 as TID 6 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:5 as TID 7 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:6 as TID 8 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:7 as TID 9 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:28 INFO spark.MapOutputTrackerActor: Asked to get map output locations for shuffle 0
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 3 (task 1:1)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 5 (task 1:3)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 9 (task 1:7)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 2 (task 1:0)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 4 (task 1:2)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 7 (task 1:5)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 6 (task 1:4)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 8 (task 1:6)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:6 as TID 10 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:4 as TID 11 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:5 as TID 12 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:2 as TID 13 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:0 as TID 14 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:7 as TID 15 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:3 as TID 16 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:1 as TID 17 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
> 11/11/19 14:24:39 INFO spark.MapOutputTrackerActor: Asked to get map output locations for shuffle 0
> 11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 13 (task 1:2)
> 11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
> 11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 14 (task 1:0)
> 11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
> 11/11/19 14:24:39 INFO spark.MesosScheduler: Marking Stage 0 for resubmision due to a fetch failure
> 11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 17 (task 1:1)
> 11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
> 11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 12 (task 1:5)
> 11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
> 11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 11 (task 1:4)
> 11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
> 11/11/19 14:24:39 INFO spark.MesosScheduler: The failed fetch was from Stage 1; marking it for resubmission
> 11/11/19 14:24:39 INFO spark.MesosScheduler: Marking Stage 0 for resubmision due to a fetch failure
> 11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 16 (task 1:3)
> ...
>
> Logs at slave :
> 11/11/19 14:24:39 INFO spark.Executor: Running task ID 1211/11/19
> 14:24:39 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle
> 0, reduce 1...11/11/19 14:24:39 INFO spark.MapOutputTracker: Don't
> have map outputs for 0, fetching them...11/11/19 14:24:39 INFO


> spark.MapOutputTracker: Doing the fetch; tracker actor =

> 'MapOutputTracker@Node(phd10.quantifind.com,50501)11/11/19 14:24:39
> ERROR spark.SimpleShuffleFetcher: Fetch
> failedjava.net.ConnectException: Connection refused at
> java.net.PlainSocketImpl.socketConnect(Native Method) at
> java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333) at


> java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182) at
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at
> java.net.Socket.connect(Socket.java:529) at
> java.net.Socket.connect(Socket.java:478) at
> sun.net.NetworkClient.doConnect(NetworkClient.java:163) at
> sun.net.www.http.HttpClient.openServer(HttpClient.java:394) at
> sun.net.www.http.HttpClient.openServer(HttpClient.java:529) at
> sun.net.www.http.HttpClient.<init>(HttpClient.java:233) at
> sun.net.www.http.HttpClient.New(HttpClient.java:306) at
> sun.net.www.http.HttpClient.New(HttpClient.java:323) at
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:
> 975) at
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:
> 916) at
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:
> 841) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:

> 1177) at java.net.URL.openStream(URL.java:1010)...
>
> Although I'm still in the prototyping phase for my application and
> will eventually make sure that the tasks don't have to return large
> amounts of data to the driver program, it'll be good to know what the
> limits are (and why they exist), and why the slave gets the
> ConnectException / MalformedURLException errors.
> Thanks
> Karthik
> On Nov 18, 4:15 pm, Karthik Thiyagarajan <karthik....@gmail.com> wrote:
>> So I ran the test program with the shuffle. It ran successfully.
>>
>> Logs from the slave node :
>>
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 1
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 6
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 7
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 5
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 2
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 0
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 3
>> 11/11/18 15:52:43 INFO spark.Executor: Running task ID 4
>> 11/11/18 15:52:43 INFO spark.LocalFileShuffle: Shuffle dir: /tmp/spark-
>> local-3fe2df23-4ce5-4da6-bcc5-99c1463ace06/shuffle
>> 11/11/18 15:52:44 INFO util.log: jetty-7.4.2.v20110526
>> 11/11/18 15:52:44 INFO util.log: Started SelectChannelConnec...@0.0.0.0:41648 STARTING
>> ...

Karthik Thiyagarajan

unread,
Nov 19, 2011, 6:19:16 PM11/19/11
to Spark Users
Thanks a lot Matei. I should be able to get around the limits pretty
easily.

Here's my post again - removed and re-posted to fix the formatting
(sorry if it still looks screwed up).

===

The master keeps trying to restart the failed tasks in an endless
cycle without giving up.

Program 2 :  Tasks return a large amount of data to the driver (no
shuffle).

import spark.SparkContext
import spark.SparkContext._

object TestProgram {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: TestProgram <host> [<slices>]")
      System.exit(1)
    }
    val sc = new SparkContext(args(0), "Testprogram")
    // Pair each of the 10,000 numbers with a 100,000-element string array
    // and ship everything back to the driver with collect().
    sc.parallelize((1 to 10000), 2).map((_, Array.fill(100000){"data"})).collect()
  }
}
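For a sense of scale (my own back-of-the-envelope arithmetic, not figures from the thread), the collect() above asks the cluster to ship every (Int, Array[String]) pair back to one driver JVM:

```scala
object ResultSizeEstimate {
  def main(args: Array[String]): Unit = {
    val records  = 10000L   // elements in the parallelized range
    val arrayLen = 100000L  // strings per record's Array.fill
    val stringRefs = records * arrayLen
    // Lower bound of 8 bytes per object reference on a 64-bit JVM,
    // ignoring the String objects themselves and serialization overhead.
    val refGiB = stringRefs * 8L / (1L << 30)
    println(stringRefs) // 1000000000
    println(refGiB)     // 7
  }
}
```

Even counting only the references, that is several GiB headed for a single JVM, which would be consistent with tasks being lost while returning their results.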


Logs at executor :

11/11/19 14:14:39 INFO spark.MesosScheduler: Submitting Stage 0, which has no missing parents
11/11/19 14:14:39 INFO spark.MesosScheduler: Got a job with 2 tasks
11/11/19 14:14:39 INFO spark.MesosScheduler: Adding job with ID 0
11/11/19 14:14:39 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:14:39 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:14:51 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
11/11/19 14:14:51 INFO spark.SimpleJob: Lost TID 1 (task 0:1)

...

Logs at slave :

11/11/19 14:15:07 INFO spark.Executor: Running task ID 0
11/11/19 14:15:07 INFO spark.Executor: Running task ID 1
...

===


Program 3 :  Same as Program 2 but with a groupBy step added. It
exhibits the same problem as Program 2, but the slave also logs a few
exceptions (below).

import spark.SparkContext
import spark.SparkContext._

object TestProgram {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: TestProgram <host> [<slices>]")
      System.exit(1)
    }
    val sc = new SparkContext(args(0), "Testprogram")
    // groupByKey() adds a shuffle: map-side tasks write shuffle output
    // that the reduce-side tasks must later fetch over HTTP.
    sc.parallelize((1 to 10000), 2).map(num => (num.toString, num))
      .groupByKey().map((_, Array.fill(100000){"data"})).collect()
  }
}
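The groupByKey() call is what sets Program 3 apart: it introduces the shuffle whose map outputs the reduce-side tasks then have to fetch. As a sketch of the data flow only (plain Scala collections standing in for the RDD operations; names and sizes are mine, not the Spark API), the pipeline's shape is:

```scala
object GroupByKeyLocally {
  def main(args: Array[String]): Unit = {
    // Same pipeline as Program 3, shrunk and run on local collections.
    val pairs = (1 to 10).map(num => (num.toString, num))
    // Local stand-in for groupByKey(): group pairs, keep only the values.
    val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
    val result = grouped.toSeq.map(kv => (kv, Array.fill(100){"data"}))
    println(result.length) // 10: one group per distinct key
  }
}
```

In Spark, that groupBy boundary splits the job into two stages: the map side writes shuffle files, and the reduce side fetches them over HTTP, which is the fetch that fails with "Connection refused" in the slave logs.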


Logs at executor :

11/11/19 14:24:28 INFO spark.MesosScheduler: Adding job with ID 1
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:0 as TID 2 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:1 as TID 3 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:2 as TID 4 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:3 as TID 5 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:4 as TID 6 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:5 as TID 7 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:6 as TID 8 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.SimpleJob: Starting task 1:7 as TID 9 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:28 INFO spark.MapOutputTrackerActor: Asked to get map output locations for shuffle 0
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 3 (task 1:1)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 5 (task 1:3)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 9 (task 1:7)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 2 (task 1:0)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 4 (task 1:2)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 7 (task 1:5)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 6 (task 1:4)
11/11/19 14:24:38 INFO spark.SimpleJob: Lost TID 8 (task 1:6)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:6 as TID 10 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:4 as TID 11 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:5 as TID 12 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:2 as TID 13 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:0 as TID 14 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:7 as TID 15 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:3 as TID 16 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:38 INFO spark.SimpleJob: Starting task 1:1 as TID 17 on slave 201111180014-0-0: phd10.quantifind.com (preferred)
11/11/19 14:24:39 INFO spark.MapOutputTrackerActor: Asked to get map output locations for shuffle 0
11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 13 (task 1:2)
11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 14 (task 1:0)
11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
11/11/19 14:24:39 INFO spark.MesosScheduler: Marking Stage 0 for resubmision due to a fetch failure
11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 17 (task 1:1)
11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 12 (task 1:5)
11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 11 (task 1:4)
11/11/19 14:24:39 INFO spark.SimpleJob: Loss was due to fetch failure from http://10.10.5.152:47010
11/11/19 14:24:39 INFO spark.MesosScheduler: The failed fetch was from Stage 1; marking it for resubmission
11/11/19 14:24:39 INFO spark.MesosScheduler: Marking Stage 0 for resubmision due to a fetch failure
11/11/19 14:24:39 INFO spark.SimpleJob: Lost TID 16 (task 1:3)

...

Logs at slave :

11/11/19 14:24:39 INFO spark.Executor: Running task ID 12
11/11/19 14:24:39 INFO spark.SimpleShuffleFetcher: Fetching outputs for shuffle 0, reduce 1
...
11/11/19 14:24:39 INFO spark.MapOutputTracker: Don't have map outputs for 0, fetching them
...
11/11/19 14:24:39 INFO spark.MapOutputTracker: Doing the fetch; tracker actor = 'MapOutputTracker@Node(phd10.quantifind.com,50501)
11/11/19 14:24:39 ERROR spark.SimpleShuffleFetcher: Fetch failed
java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
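The refused connection can be reproduced in isolation. Per the stack trace, SimpleShuffleFetcher fetches shuffle output by opening an HTTP URL with URL.openStream; when nothing is listening on the target port (here because we deliberately release it first; in the thread, presumably because the process serving the shuffle files was gone), the JDK raises exactly this ConnectException. A minimal sketch (the /shuffle/... path is illustrative, not the real file layout):

```scala
import java.net.{ConnectException, ServerSocket, URL}

object ConnectRefusedDemo {
  def main(args: Array[String]): Unit = {
    // Pick a free port, then close the listener so nothing serves on it.
    val probe = new ServerSocket(0)
    val port = probe.getLocalPort
    probe.close()

    try {
      // Same call shape the fetcher uses: an HTTP GET via URL.openStream.
      new URL("http://127.0.0.1:" + port + "/shuffle/0/0/1").openStream()
      println("unexpectedly connected")
    } catch {
      case e: ConnectException => println("Fetch failed: " + e.getMessage)
    }
  }
}
```

Running this prints "Fetch failed: Connection refused" (message wording varies by platform), matching the error the slave logs for every reduce-side task.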

Matei Zaharia

unread,
Jan 14, 2012, 2:09:54 PM1/14/12
to Spark Users
For future reference, I believe I've fixed the "fetch failed from
null" bug, as well as a related error you can hit, "head of empty
list". The fix is in this commit on the master branch:
https://github.com/mesos/spark/commit/fd5581a0d3a3be45e02adf67dff1da0d26833a4f

