Job hanging with ConnectionManager exceptions


Karthik Thiyagarajan

Feb 2, 2013, 12:52:05 PM2/2/13
to spark...@googlegroups.com
I'm on the tip of branch-0.6. I have jobs that hang sometimes with exceptions from the ConnectionManager.
I've pasted a small fragment of the logs on the executor that are representative of the exceptions I see.

13/02/02 05:05:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(shd4.quantifind.com,56609)
java.io.IOException: Bad address
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
	at sun.nio.ch.IOUtil.write(IOUtil.java:28)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at spark.network.SendingConnection.write(Connection.scala:246)
	at spark.network.ConnectionManager.run(ConnectionManager.scala:138)
	at spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:73)
13/02/02 05:05:49 INFO storage.MemoryStore: ensureFreeSpace(380777323) called with curMem=50751548973, maxMem=50935627776
13/02/02 05:05:49 INFO storage.MemoryStore: 1 blocks selected for dropping
13/02/02 05:05:49 INFO storage.BlockManager: Dropping block rdd_34_66 from memory
13/02/02 05:05:49 INFO storage.MemoryStore: Block rdd_34_66 of size 298907130 dropped from memory (free 482985933)
13/02/02 05:05:49 INFO storage.MemoryStore: Block rdd_12_228 stored as values to memory (estimated size 363.1 MB, free 97.5 MB)
13/02/02 05:05:49 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(shd4.quantifind.com,56609)
13/02/02 05:05:49 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(shd4.quantifind.com,56609)
13/02/02 05:05:49 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(shd4.quantifind.com,56609)
13/02/02 05:05:49 INFO storage.MemoryStore: ensureFreeSpace(237125032) called with curMem=50833419166, maxMem=50935627776
13/02/02 05:05:49 INFO storage.MemoryStore: 1 blocks selected for dropping
13/02/02 05:05:49 INFO storage.BlockManager: Dropping block rdd_34_46 from memory
13/02/02 05:05:49 INFO storage.MemoryStore: Block rdd_34_46 of size 279686732 dropped from memory (free 381895342)
13/02/02 05:05:49 INFO storage.MemoryStore: Block rdd_12_231 stored as values to memory (estimated size 226.1 MB, free 138.1 MB)
13/02/02 05:05:49 ERROR network.ConnectionManager: Error in select loop
java.util.NoSuchElementException: key not found: sun.nio.ch.SelectionKeyImpl@303a169f
	at scala.collection.MapLike$class.default(MapLike.scala:224)
	at scala.collection.mutable.HashMap.default(HashMap.scala:43)
	at scala.collection.MapLike$class.apply(MapLike.scala:135)
	at spark.network.ConnectionManager$$anon$1.scala$collection$mutable$SynchronizedMap$$super$apply(ConnectionManager.scala:48)
	at scala.collection.mutable.SynchronizedMap$class.apply(SynchronizedMap.scala:48)
	at spark.network.ConnectionManager$$anon$1.apply(ConnectionManager.scala:48)
	at spark.network.ConnectionManager.run(ConnectionManager.scala:96)
	at spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:73)
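[Editor's note: the `NoSuchElementException` in the select loop comes from calling `apply` on a Scala mutable map for a key that is no longer present. `SynchronizedMap` locks each individual call, but not a check-then-act sequence, so another thread can remove the key between the check and the `apply`. A minimal sketch of the failure mode and a defensive alternative (not Spark's actual code, just an illustration):]

```scala
import scala.collection.mutable

object MapRace {
  def main(args: Array[String]): Unit = {
    val connections = new mutable.HashMap[String, Int]()
    connections.put("conn-1", 42)

    // Simulate another thread removing the entry between a
    // containsKey check and the apply() call:
    connections.remove("conn-1")

    // apply() on a missing key throws NoSuchElementException,
    // which is what the "Error in select loop" trace above shows:
    try {
      val v = connections("conn-1")
      println(v)
    } catch {
      case e: NoSuchElementException =>
        println("select loop would die here: " + e.getMessage)
    }

    // get() returns an Option, letting the caller handle absence:
    connections.get("conn-1") match {
      case Some(v) => println("found " + v)
      case None    => println("connection already removed; skipping")
    }
  }
}
```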

I haven't been able to find any pattern in when these exceptions occur. Any insights into this would be useful.

Thanks,
Karthik

Karthik Thiyagarajan

Feb 2, 2013, 12:56:26 PM2/2/13
to spark...@googlegroups.com
Clarification:
I'm running this on a 4 machine Spark Standalone cluster.

Eduardo Alfaia

May 16, 2013, 12:52:59 PM5/16/13
to spark...@googlegroups.com
Hi all, 

I am having the same problem. What can I do so that the ConnectionManager doesn't use the result of "hostname -f", but instead the IP address of an interface, for example eth0?

Best Regards

Mridul Muralidharan

May 16, 2013, 2:38:30 PM5/16/13
to spark...@googlegroups.com

There were a few bugs in Connection and ConnectionManager which were fixed and committed recently ... and they are in spark master right now.
Are you seeing same issue with master ?

Regards
Mridul


Eduardo Alfaia

May 16, 2013, 3:49:50 PM5/16/13
to spark...@googlegroups.com
Yes Mridul, the same issue with master. I'm using Spark 0.7.0 on a cluster of 4 machines.

Regards
Eduardo

Mridul Muralidharan

May 16, 2013, 4:09:00 PM5/16/13
to spark...@googlegroups.com
Just to get it clarified - are you seeing the issue with spark 0.7.0 ?
Or with the latest spark github master ?
Also, are you seeing the exact same exceptions ?

The stacktraces on spark master might help debug the issue.


Regards,
Mridul

Eduardo Alfaia

May 16, 2013, 4:29:11 PM5/16/13
to spark...@googlegroups.com
Answering your questions:

Just to get it clarified - are you seeing the issue with spark 0.7.0 ?
yes, I've seen this issue in spark 0.7.0 

Or with the latest spark github master ? 
I'm not working with spark-master

Also, are you seeing the exact same exceptions ? 
yes, below is part of the log file:
13/05/16 18:41:01 INFO storage.BlockManager: maxBytesInFlight: 50331648, minRequest: 10066329
13/05/16 18:41:01 INFO storage.BlockManager: maxBytesInFlight: 50331648, minRequest: 10066329
13/05/16 18:41:01 INFO storage.BlockManager: maxBytesInFlight: 50331648, minRequest: 10066329
13/05/16 18:41:01 INFO storage.BlockManager: Started 166 remote gets in  8 ms
13/05/16 18:41:01 WARN network.SendingConnection: Error finishing connection to achab3/127.0.1.1:37345
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
	at spark.network.SendingConnection.finishConnect(Connection.scala:221)
	at spark.network.ConnectionManager.spark$network$ConnectionManager$$run(ConnectionManager.scala:127)
	at spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:70)
13/05/16 18:41:01 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(achab3,37345)
13/05/16 18:41:01 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(achab3,37345)
13/05/16 18:41:01 INFO network.SendingConnection: Initiating connection to [achab3/127.0.1.1:35226
13/05/16 18:41:01 WARN network.SendingConnection: Error finishing connection to achab3/127.0.1.1:35226 
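[Editor's note: the telling detail in this log is `achab3/127.0.1.1` — the hostname is resolving to the loopback-range address 127.0.1.1 (a Debian/Ubuntu convention in /etc/hosts), so other nodes that receive this address try to connect to themselves and get "Connection refused". A quick sketch to check for this misconfiguration on a node (assumes the JVM resolves the hostname the same way Spark does):]

```scala
import java.net.InetAddress

object CheckResolution {
  def main(args: Array[String]): Unit = {
    // Resolve this machine's own hostname, as Spark's networking layer
    // effectively does when advertising its address to the cluster:
    val addr = InetAddress.getLocalHost
    println("hostname: " + addr.getHostName +
      " resolves to: " + addr.getHostAddress)

    // 127.x.x.x is only reachable from the local machine itself, so
    // remote executors will fail with "Connection refused":
    if (addr.getHostAddress.startsWith("127.")) {
      println("WARNING: hostname resolves to a loopback address; " +
        "remote nodes cannot connect to it (as in the log above)")
    }
  }
}
```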

Mridul Muralidharan

May 16, 2013, 4:33:14 PM5/16/13
to spark...@googlegroups.com
On Fri, May 17, 2013 at 1:59 AM, Eduardo Alfaia
<eduardo...@gmail.com> wrote:
> Answering your questions:
>
> Just to get it clarified - are you seeing the issue with spark 0.7.0 ?
> yes, I've seen this issue in spark 0.7.0
>
> Or with the latest spark github master ?
> I'm not working with spark-master


The changes I referred to were made to spark master about a month or
so ago - well after 0.7 was released.
The fixes were pretty involved - I don't think they have been ported to
0.7 (not sure if there is a plan to do that).

Regards,
Mridul

Eduardo Alfaia

May 16, 2013, 4:57:18 PM5/16/13
to spark...@googlegroups.com
OK, I understand, but are you sure this issue in the ConnectionManager was solved in spark master?

Regards 

Mridul Muralidharan

May 16, 2013, 5:04:56 PM5/16/13
to spark...@googlegroups.com
I fixed all the problems we faced while running spark on a large
cluster, and I am reasonably confident there are no remaining issues.
It is worth giving spark master a shot - just to verify that you are
not facing the issues with it. If you are, please do let me know and
I will try to resolve it while I can!


Regards,
Mridul


Eduardo Alfaia

May 16, 2013, 5:13:08 PM5/16/13
to spark...@googlegroups.com
Oh, sorry Mridul.
My problem is that my Linux user doesn't have privileges to change /etc/hosts. I think the ConnectionManager gets the name of the host either from the hosts file in /etc or from the "hostname -f"
command. Is that right?

Mridul Muralidharan

May 16, 2013, 5:21:41 PM5/16/13
to spark...@googlegroups.com
I don't recall how the hostname resolution happens - Matei or someone
else might be able to clarify that!
But I think we assume that proper name -> ip and ip -> name resolution
works consistently on the various cluster nodes.
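[Editor's note: if you can't edit /etc/hosts, one common workaround (assuming your Spark version honors it - check conf/spark-env.sh.template in your distribution) is to set SPARK_LOCAL_IP on each node, so Spark binds to an explicit interface address instead of whatever the hostname resolves to. A sketch, with hypothetical values to adjust for your cluster:]

```shell
# conf/spark-env.sh on each node.
# Bind Spark's network services to eth0's IPv4 address rather than
# the (possibly loopback) address that `hostname -f` resolves to:
export SPARK_LOCAL_IP=$(ip -4 addr show eth0 | grep -oP '(?<=inet\s)[\d.]+')
```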


Regards,
Mridul
