Hello,
I have a problem with a newly set up Storm cluster. This is what is happening: the Nimbus keeps trying to send assignments to the supervisor nodes every two minutes (see the log entry below for an example). The supervisors are picking it up, start downloading the jar, but eventually fail with a timeout. At that point the Nimbus determines the executor is not alive, sets up a new assignment and the cycle repeats.
Here is what keeps happening on the Nimbus:
2013-10-24 11:25:19 nimbus [INFO] Setting new assignment for topology id wctest2-2-1382613431: #backtype.storm.daemon.common.Assignment{:master-code-dir "/mnt/storm/nimbus/stormdist/wctest2-2-1382613431", :node->host {"f6059693-50d7-4ff8-91db-8df6d95bb384" "obfuscated.ec2.internal", "c594b0df-67ee-4378-95c1-a31df2966757" "obfuscated.ec2.internal"}, :executor->node+port {[2 2] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [3 3] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [4 4] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [5 5] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [6 6] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [7 7] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [8 8] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [9 9] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [10 10] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [11 11] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [12 12] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [13 13] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701], [14 14] ["c594b0df-67ee-4378-95c1-a31df2966757" 6702], [15 15] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6700], [1 1] ["f6059693-50d7-4ff8-91db-8df6d95bb384" 6701]}, :executor->start-time-secs {[2 2] 1382613919, [3 3] 1382613919, [4 4] 1382613919, [5 5] 1382613919, [6 6] 1382613919, [7 7] 1382613919, [8 8] 1382613919, [9 9] 1382613919, [10 10] 1382613919, [11 11] 1382613919, [12 12] 1382613919, [13 13] 1382613919, [14 14] 1382613919, [15 15] 1382613919, [1 1] 1382613919}}
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[2 2] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[3 3] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[4 4] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[5 5] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[6 6] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[7 7] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[8 8] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[9 9] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[10 10] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[11 11] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[12 12] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[13 13] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[14 14] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[15 15] not alive
2013-10-24 11:27:20 nimbus [INFO] Executor wctest2-2-1382613431:[1 1] not alive
After determining that the nodes are not alive it tries to set up the new assignment again, repeating the cycle.
This is what happens on the supervisor end:
2013-10-24 11:43:48 supervisor [INFO] Downloading code for storm id wctest2-2-1382613431 from /mnt/storm/nimbus/stormdist/wctest2-2-1382613431
2013-10-24 11:44:51 event [ERROR] Error when processing event
java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:36)
at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:17)
at backtype.storm.utils.Utils.downloadFromMaster(Utils.java:224)
at backtype.storm.daemon.supervisor$fn__4791.invoke(supervisor.clj:401)
at clojure.lang.MultiFn.invoke(MultiFn.java:172)
at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this__4716.invoke(supervisor.clj:295)
at backtype.storm.event$event_manager$fn__2507.invoke(event.clj:24)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:34)
... 8 more
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
... 10 more
2013-10-24 11:44:51 util [INFO] Halting process: ("Error when processing an event")
This keeps repeating in the same cycles as with the Nimbus, which keeps sending the assignment to the supervisors.
Possible causes would be:
- There is no /mnt/storm/nimbus directory on the supervisor nodes. Is this a problem?
- that the firewall allows the Nimbus to send commands to the supervisors, but doesn't allow the supervisors to read from the Nimbus. I don't have the firewall configuration at the moment, but I'm looking into this possibility.
Does anyone have any other suggestions? I'd appreciate your feedback!
With kind regards,
Filip de Waard
Vixu.com