Thrift connection timed out when downloading code on supervisors


Filip de Waard

Oct 24, 2013, 8:47:48 AM
to storm...@googlegroups.com
Hello,

I have a problem with a newly set up Storm cluster. This is what is happening: the Nimbus tries to send assignments to the supervisor nodes every two minutes (see the log entry below for an example). The supervisors pick the assignment up and start downloading the jar, but eventually fail with a connection timeout. At that point the Nimbus determines the executors are not alive, sets up a new assignment, and the cycle repeats.

Here is what keeps happening on the Nimbus:

2013-10-24 11:25:19 nimbus [INFO] Setting new assignment for topology id wctest2-2-1382613431: #backtype.storm.daemon.common.Assignment{:master-code-dir "/mnt/storm/nimbus/stormdist/wctest2-2-1382613431", :node->host {"f6059693-50d7-4ff8-91db-8df6d95bb384" "obfuscated.ec2.internal", "c594b0df-67ee-4378-95c1-a31df2966757" "obfuscated.ec2.internal"}, :executor->node+port {[2 2] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [3 3] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [4 4] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [5 5] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [6 6] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [7 7] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [8 8] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [9 9] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [10 10] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [11 11] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [12 12] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [13 13] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [14 14] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [15 15] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [1 1] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701]}, :executor->start-time-secs {[2 2] 1382613919, [3 3] 1382613919, [4 4] 1382613919, [5 5] 1382613919, [6 6] 1382613919, [7 7] 1382613919, [8 8] 1382613919, [9 9] 1382613919, [10 10] 1382613919, [11 11] 1382613919, [12 12] 1382613919, [13 13] 1382613919, [14 14] 1382613919, [15 15] 1382613919, [1 1] 1382613919}}
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[2 2] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[3 3] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[4 4] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[5 5] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[6 6] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[7 7] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[8 8] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[9 9] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[10 10] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[11 11] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[12 12] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[13 13] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[14 14] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[15 15] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[1 1] not alive

After determining that the executors are not alive, it sets up a new assignment, repeating the cycle.

This is what happens on the supervisor end:

2013-10-24 11:43:48 supervisor [INFO] Downloading code for storm id wctest2-2-1382613431 from /mnt/storm/nimbus/stormdist/wctest2-2-1382613431
2013-10-24 11:44:51 event [ERROR] Error when processing event
java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
        at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:36)
        at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:17)
        at backtype.storm.utils.Utils.downloadFromMaster(Utils.java:224)
        at backtype.storm.daemon.supervisor$fn__4791.invoke(supervisor.clj:401)
        at clojure.lang.MultiFn.invoke(MultiFn.java:172)
        at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this__4716.invoke(supervisor.clj:295)
        at backtype.storm.event$event_manager$fn__2507.invoke(event.clj:24)
        at clojure.lang.AFn.run(AFn.java:24)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
        at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
        at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
        at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:34)
        ... 8 more
Caused by: java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
        ... 10 more
2013-10-24 11:44:51 util [INFO] Halting process: ("Error when processing an event")

This repeats on the same cycle as on the Nimbus side, which keeps re-sending the assignment to the supervisors.

Possible causes would be:

- There is no /mnt/storm/nimbus directory on the supervisor nodes. Is this a problem?
- The firewall allows the Nimbus to send commands to the supervisors, but doesn't allow the supervisors to read from the Nimbus. I don't have the firewall configuration at the moment, but I'm looking into this possibility.
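One quick way to test the firewall hypothesis is to try opening a plain TCP connection from a supervisor node to the Nimbus Thrift port. A minimal sketch, assuming the default nimbus.thrift.port of 6627 and a placeholder hostname:

```shell
# From a supervisor node, check that the Nimbus Thrift port is reachable.
# NIMBUS_HOST is a placeholder; 6627 is Storm's default nimbus.thrift.port.
NIMBUS_HOST=nimbus.example.internal
if timeout 5 bash -c "</dev/tcp/$NIMBUS_HOST/6627"; then
  echo "Thrift port reachable"
else
  echo "Thrift port blocked or host unreachable"
fi
```

If this hangs or fails while Nimbus-to-supervisor traffic works, a one-directional firewall rule is the likely culprit.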

Does anyone have any other suggestions? I'd appreciate your feedback!

With kind regards,

Filip de Waard
Vixu.com

Derek

Oct 24, 2013, 10:21:11 AM
to storm...@googlegroups.com
I have seen this happen when nimbus is configured to use SASL but the supervisors are not.

Because of the way SASL and thrift interact, a client not using SASL to connect to a server that expects SASL will result in the server timing out while waiting for a handshake (or some extra part of the protocol).

The relevant setting is storm.thrift.transport in the configuration.  Just make sure there is not a mismatch.
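For example, a matching storm.yaml fragment on every node might look like the following; the plugin class name here is illustrative only and depends on your Storm build (a SASL-enabled build would use its own transport plugin class):

```yaml
# storm.yaml -- this value must be identical on Nimbus and all supervisors.
# The class name below is an assumption for illustration; check your
# distribution's default before setting it explicitly.
storm.thrift.transport: "backtype.storm.security.auth.SimpleTransportPlugin"
```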
--
Derek

Filip de Waard

Oct 24, 2013, 10:41:38 AM
to storm...@googlegroups.com
Thanks for your response, Derek!

On Thursday, October 24, 2013 4:21:11 PM UTC+2, Derek wrote:
I have seen this happen when nimbus is configured to use SASL but the supervisors are not.

Because of the way SASL and thrift interact, a client not using SASL to connect to a server that expects SASL will result in the server timing out while waiting for a handshake (or some extra part of the protocol).

The configuration file is identical on the different nodes. 

The relevant setting is storm.thrift.transport in the configuration.  Just make sure there is not a mismatch.

The configuration doesn't set storm.thrift.transport at all currently. Here is the configuration file: https://gist.github.com/fmw/e5ae3410f8f8af21dd33 (with IP addresses obfuscated).

-Filip

Filip de Waard

Nov 1, 2013, 10:10:06 AM
to storm...@googlegroups.com
It did turn out to be a firewall issue.
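For anyone hitting the same symptoms: the fix amounts to letting the supervisors reach the Nimbus Thrift port. A sketch of an iptables rule on the Nimbus host, assuming the default nimbus.thrift.port of 6627 and a placeholder subnet for the supervisor nodes:

```shell
# Allow supervisor nodes (placeholder subnet) to reach the Nimbus Thrift port.
# 6627 is Storm's default nimbus.thrift.port; adjust subnet/port as needed.
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 6627 -j ACCEPT
```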