Thrift connection timed out when downloading code on supervisors


Filip de Waard

Oct 24, 2013, 8:47:48 AM
to storm...@googlegroups.com
Hello,

I have a problem with a newly set up Storm cluster. This is what is happening: the Nimbus tries to send assignments to the supervisor nodes every two minutes (see the log entry below for an example). The supervisors pick the assignment up and start downloading the jar, but eventually fail with a connection timeout. At that point the Nimbus determines the executors are not alive, sets up a new assignment, and the cycle repeats.

Here is what keeps happening on the Nimbus:

2013-10-24 11:25:19 nimbus [INFO] Setting new assignment for topology id wctest2-2-1382613431: #backtype.storm.daemon.common.Assignment{:master-code-dir "/mnt/storm/nimbus/stormdist/wctest2-2-1382613431", :node->host {"f6059693-50d7-4ff8-91db-8df6d95bb384" "obfuscated.ec2.internal", "c594b0df-67ee-4378-95c1-a31df2966757" "obfuscated.ec2.internal"}, :executor->node+port {[2 2] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [3 3] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [4 4] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [5 5] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [6 6] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [7 7] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [8 8] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [9 9] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [10 10] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [11 11] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [12 12] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [13 13] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [14 14] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [15 15] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [1 1] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701]}, :executor->start-time-secs {[2 2] 1382613919, [3 3] 1382613919, [4 4] 1382613919, [5 5] 1382613919, [6 6] 1382613919, [7 7] 1382613919, [8 8] 1382613919, [9 9] 1382613919, [10 10] 1382613919, [11 11] 1382613919, [12 12] 1382613919, [13 13] 1382613919, [14 14] 1382613919, [15 15] 1382613919, [1 1] 1382613919}}
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[2 2] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[3 3] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[4 4] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[5 5] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[6 6] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[7 7] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[8 8] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[9 9] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[10 10] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[11 11] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[12 12] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[13 13] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[14 14] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[15 15] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[1 1] not alive

After determining that the executors are not alive, it sets up a new assignment, repeating the cycle.

This is what happens on the supervisor end:

2013-10-24 11:43:48 supervisor [INFO] Downloading code for storm id wctest2-2-1382613431 from /mnt/storm/nimbus/stormdist/wctest2-2-1382613431
2013-10-24 11:44:51 event [ERROR] Error when processing event
java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
        at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:36)
        at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:17)
        at backtype.storm.utils.Utils.downloadFromMaster(Utils.java:224)
        at backtype.storm.daemon.supervisor$fn__4791.invoke(supervisor.clj:401)
        at clojure.lang.MultiFn.invoke(MultiFn.java:172)
        at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this__4716.invoke(supervisor.clj:295)
        at backtype.storm.event$event_manager$fn__2507.invoke(event.clj:24)
        at clojure.lang.AFn.run(AFn.java:24)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
        at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
        at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:34)
        ... 8 more
Caused by: java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
        ... 10 more
2013-10-24 11:44:51 util [INFO] Halting process: ("Error when processing an event")

This repeats on the same cycle as on the Nimbus side, which keeps re-sending the assignment to the supervisors.

Possible causes would be:

- There is no /mnt/storm/nimbus directory on the supervisor nodes. Is this a problem?
- The firewall allows the Nimbus to send commands to the supervisors, but doesn't allow the supervisors to read from the Nimbus. I don't have the firewall configuration at the moment, but I'm looking into this possibility.
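One quick way to test the firewall hypothesis is to try opening a plain TCP connection from a supervisor node to the Nimbus Thrift port. A minimal sketch, assuming the default nimbus.thrift.port of 6627 and a placeholder hostname:

```shell
# From a supervisor node, check that the Nimbus Thrift port is reachable.
# NIMBUS_HOST is a placeholder; 6627 is Storm's default nimbus.thrift.port.
NIMBUS_HOST=nimbus.example.internal
if timeout 5 bash -c "</dev/tcp/$NIMBUS_HOST/6627"; then
  echo "Thrift port reachable"
else
  echo "Thrift port blocked or host unreachable"
fi
```

If this hangs or fails while Nimbus-to-supervisor traffic works, a one-directional firewall rule is the likely culprit.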

Does anyone have any other suggestions? I'd appreciate your feedback!

With kind regards,

Filip de Waard
Vixu.com

Derek

Oct 24, 2013, 10:21:11 AM
to storm...@googlegroups.com
I have seen this happen when nimbus is configured to use SASL but the supervisors are not.

Because of the way SASL and thrift interact, a client not using SASL to connect to a server that expects SASL will result in the server timing out while waiting for a handshake (or some extra part of the protocol).

The relevant setting is storm.thrift.transport in the configuration.  Just make sure there is not a mismatch.
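For example, a matching storm.yaml fragment on every node might look like the following; the plugin class name here is illustrative only and depends on your Storm build (a SASL-enabled build would use its own transport plugin class):

```yaml
# storm.yaml -- this value must be identical on Nimbus and all supervisors.
# The class name below is an assumption for illustration; check your
# distribution's default before setting it explicitly.
storm.thrift.transport: "backtype.storm.security.auth.SimpleTransportPlugin"
```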
--
Derek

Filip de Waard

Oct 24, 2013, 10:41:38 AM
to storm...@googlegroups.com
Thanks for your response, Derek!

On Thursday, October 24, 2013 4:21:11 PM UTC+2, Derek wrote:
I have seen this happen when nimbus is configured to use SASL but the supervisors are not.

Because of the way SASL and thrift interact, a client not using SASL to connect to a server that expects SASL will result in the server timing out while waiting for a handshake (or some extra part of the protocol).

The configuration file is identical on the different nodes. 

The relevant setting is storm.thrift.transport in the configuration.  Just make sure there is not a mismatch.

The configuration doesn't set storm.thrift.transport at all currently. Here is the configuration file: https://gist.github.com/fmw/e5ae3410f8f8af21dd33 (with IP addresses obfuscated).

-Filip

Filip de Waard

Nov 1, 2013, 10:10:06 AM
to storm...@googlegroups.com
It did turn out to be a firewall issue.
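For anyone hitting the same symptoms: the fix amounts to letting the supervisors reach the Nimbus Thrift port. A sketch of an iptables rule on the Nimbus host, assuming the default nimbus.thrift.port of 6627 and a placeholder subnet for the supervisor nodes:

```shell
# Allow supervisor nodes (placeholder subnet) to reach the Nimbus Thrift port.
# 6627 is Storm's default nimbus.thrift.port; adjust subnet/port as needed.
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 6627 -j ACCEPT
```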