Supervisor startup failure, "stormconf.ser not found", storm 0.9.0-wip17

Sven Stark

Oct 7, 2013, 10:30:25 PM
to storm...@googlegroups.com
Hi All,

we're running Storm 0.9.0-wip17 together with Kafka 0.8.0-beta1 and ZooKeeper 3.4.5.

Our biggest issue is the intermittent failure of supervisors to start up with the well-known "stormconf.ser not found" message. From what I read in the changelogs and on this group, the issue was supposed to be fixed in 0.8.3 and 0.9.0-wip16. Nonetheless, we still see it quite often.

Below is the end of a supervisor log at debug level. Notice how the supervisor first deletes the data that it later requires (and therefore fails to start up).

The question is simply: is this still the "old" bug, not yet fixed for all circumstances, or are we doing something wrong? If the latter, what can we do to overcome it? I can provide more information if necessary.

Thanks in advance,
Sven

=================================


2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] Synchronizing supervisor
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] Assigned executors: {6706 #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "CommentTopology-56-1380865400", :executors ([2 2] [5 5] [8 8] [11 11] [14 14] [17 17])}}
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] Allocated: {"ed296253-4e4f-498a-93f7-e68e1a8ce654" [:timed-out #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1381193529, :storm-id "CommentTopology-56-1380865400", :executors #{[2 2] [5 5] [8 8] [11 11] [14 14] [17 17] [-1 -1]}, :port 6706}]}
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] Storm code map: {"NotificationTopology-58-1380865401" "/var/storm/nimbus/stormdist/NotificationTopology-58-1380865401", "CommentTopology-56-1380865400" "/var/storm/nimbus/stormdist/CommentTopology-56-1380865400", "ExportTopology-60-1380865402" "/var/storm/nimbus/stormdist/ExportTopology-60-1380865402", "ImportTopology-57-1380865401" "/var/storm/nimbus/stormdist/ImportTopology-57-1380865401", "CommentVoteTopology-59-1380865402" "/var/storm/nimbus/stormdist/CommentVoteTopology-59-1380865402"}
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] Downloaded storm ids: #{"CommentTopology-56-1380865400"}
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] All assignment: {}
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] New assignment: {}
2013-10-08 00:55:56 b.s.d.supervisor [DEBUG] Writing new assignment {}
2013-10-08 00:55:56 b.s.d.supervisor [INFO] Shutting down and clearing state for id ed296253-4e4f-498a-93f7-e68e1a8ce654. Current supervisor time: 1381193756. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1381193529, :storm-id "CommentTopology-56-1380865400", :executors #{[2 2] [5 5] [8 8] [11 11] [14 14] [17 17] [-1 -1]}, :port 6706}
2013-10-08 00:55:56 b.s.d.supervisor [INFO] Shutting down 5538ef5c-73df-4926-9ce9-6255e020b02e:ed296253-4e4f-498a-93f7-e68e1a8ce654
2013-10-08 00:55:56 b.s.d.supervisor [INFO] Removing code for storm id CommentTopology-56-1380865400
2013-10-08 00:55:56 b.s.util [DEBUG] Rmr path /var/storm/supervisor/stormdist/CommentTopology-56-1380865400
2013-10-08 00:55:56 b.s.util [INFO] Error when trying to kill 10896. Process is probably already dead.
2013-10-08 00:55:56 b.s.util [DEBUG] Removing path /var/storm/workers/ed296253-4e4f-498a-93f7-e68e1a8ce654/pids/10896
2013-10-08 00:55:56 b.s.util [DEBUG] Rmr path /var/storm/workers/ed296253-4e4f-498a-93f7-e68e1a8ce654/heartbeats
2013-10-08 00:55:56 b.s.util [DEBUG] Removing path /var/storm/workers/ed296253-4e4f-498a-93f7-e68e1a8ce654/pids
2013-10-08 00:55:56 b.s.util [DEBUG] Removing path /var/storm/workers/ed296253-4e4f-498a-93f7-e68e1a8ce654
2013-10-08 00:55:56 b.s.d.supervisor [INFO] Shut down 5538ef5c-73df-4926-9ce9-6255e020b02e:ed296253-4e4f-498a-93f7-e68e1a8ce654
2013-10-08 00:55:56 b.s.util [DEBUG] Making dirs at /var/storm/workers/ff64b203-2242-4564-9b22-3fefcbf5f4af/pids
2013-10-08 00:55:56 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "CommentTopology-56-1380865400", :executors ([2 2] [5 5] [8 8] [11 11] [14 14] [17 17])} for this supervisor 5538ef5c-73df-4926-9ce9-6255e020b02e on port 6706 with id ff64b203-2242-4564-9b22-3fefcbf5f4af
2013-10-08 00:55:56 b.s.event [ERROR] Error when processing event
java.io.FileNotFoundException: File '/var/storm/supervisor/stormdist/CommentTopology-56-1380865400/stormconf.ser' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137) ~[commons-io-1.4.jar:1.4]
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135) ~[commons-io-1.4.jar:1.4]
at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138) ~[storm-core-0.9.0-wip17.jar:na]
at backtype.storm.daemon.supervisor$fn__5063.invoke(supervisor.clj:410) ~[storm-core-0.9.0-wip17.jar:na]
at clojure.lang.MultiFn.invoke(MultiFn.java:177) ~[clojure-1.4.0.jar:na]
at backtype.storm.daemon.supervisor$sync_processes$iter__4954__4958$fn__4959.invoke(supervisor.clj:244) ~[storm-core-0.9.0-wip17.jar:na]
at clojure.lang.LazySeq.sval(LazySeq.java:42) ~[clojure-1.4.0.jar:na]
at clojure.lang.LazySeq.seq(LazySeq.java:60) ~[clojure-1.4.0.jar:na]
at clojure.lang.RT.seq(RT.java:473) ~[clojure-1.4.0.jar:na]
at clojure.core$seq.invoke(core.clj:133) ~[clojure-1.4.0.jar:na]
at clojure.core$dorun.invoke(core.clj:2725) ~[clojure-1.4.0.jar:na]
at clojure.core$doall.invoke(core.clj:2741) ~[clojure-1.4.0.jar:na]
at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:232) ~[storm-core-0.9.0-wip17.jar:na]
at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
at backtype.storm.event$event_manager$fn__3027.invoke(event.clj:24) ~[storm-core-0.9.0-wip17.jar:na]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_39]
2013-10-08 00:55:56 b.s.util [INFO] Halting process: ("Error when processing an event")

Jacek Dabrowski

Oct 16, 2013, 5:36:32 AM
to storm...@googlegroups.com
Hello,

We have a very similar problem. A topology was working flawlessly, then one day it hung.
Nimbus and the supervisor start up OK, but when we try to deploy the topology there is an error and the deployment does not succeed.

We were initially running 0.8.2 + zk 3.3.3, but after reading similar threads we changed the Storm version to 0.9.0-rc2 ==> did not help.

  1. Restarting ==> does not help
  2. Cleaning storm data directories + restarting (what exactly we delete is sketched after this list) ==> does not help
  3. Cleaning zookeeper + cleaning storm data directories + restarting ==> does not help.
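
By "cleaning storm data directories" I mean wiping storm.local.dir on every node while all daemons are stopped. Roughly this, as a sketch only (/data/storm is the storm.local.dir from our storm.yaml, and the class name is just illustrative):

    import java.io.File;
    import org.apache.commons.io.FileUtils;

    // Sketch only: wipe storm.local.dir on a node, with nimbus/supervisor/
    // workers stopped first. "/data/storm" is from our storm.yaml.
    public class CleanStormLocalDir {
        public static void main(String[] args) throws Exception {
            FileUtils.cleanDirectory(new File("/data/storm"));
        }
    }
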
Nimbus log (fragments):
2013-10-16 11:14:12 b.s.d.nimbus [DEBUG] Assignment for EsperTopology-1-1381914842 hasn't changed
2013-10-16 11:14:12 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 82,12  replyHeader:: 82,22,0  request:: '/storm/workerbeats,F  response:: v{'EsperTopology-1-1381914842},s{13,13,1381914838566,1381914838566,0,1,0,0,1,1,18}
2013-10-16 11:14:12 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 83,12  replyHeader:: 83,22,0  request:: '/storm/errors,F  response:: v{},s{14,14,1381914838574,1381914838574,0,0,0,0,1,0,14}
2013-10-16 11:14:12 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 84,12  replyHeader:: 84,22,0  request:: '/storm/storms,F  response:: v{'EsperTopology-1-1381914842},s{10,10,1381914838538,1381914838538,0,1,0,0,1,1,19}
2013-10-16 11:14:19 o.a.z.ClientCnxn [DEBUG] Got ping response for sessionid: 0x141c08aef2e0003 after 1ms
2013-10-16 11:14:23 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 85,12  replyHeader:: 85,40,0  request:: '/storm/storms,F  response:: v{},s{10,10,1381914838538,1381914838538,0,2,0,0,1,0,31}
2013-10-16 11:14:23 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 86,12  replyHeader:: 86,40,0  request:: '/storm/assignments,F  response:: v{},s{9,9,1381914838531,1381914838531,0,2,0,0,1,0,30}
2013-10-16 11:14:23 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 87,12  replyHeader:: 87,40,0  request:: '/storm/supervisors,F  response:: v{'83464703-06e0-4697-b8d6-08f34bfce926},s{11,11,1381914838549,1381914838549,0,1,0,0,1,1,16}
2013-10-16 11:14:23 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0003, packet:: clientPath:null serverPath:null finished:false header:: 88,3  replyHeader:: 88,40,0  request:: '/storm/supervisors/83464703-06e0-4697-b8d6-08f34bfce926,F  response:: s{16,32,1381914838648,1381914861412,5,0,0,90565170474909698,957,0,16}



Supervisor log:

2013-10-16 11:14:19 b.s.d.supervisor [INFO] 7aab2452-23d3-4084-9e81-d9c79a07b016 still hasn't started
2013-10-16 11:14:20 b.s.d.supervisor [INFO] 7aab2452-23d3-4084-9e81-d9c79a07b016 still hasn't started
2013-10-16 11:14:20 b.s.d.supervisor [INFO] 7aab2452-23d3-4084-9e81-d9c79a07b016 still hasn't started
2013-10-16 11:14:21 o.a.z.ClientCnxn [DEBUG] Got notification sessionid:0x141c08aef2e0002
2013-10-16 11:14:21 o.a.z.ClientCnxn [DEBUG] Got WatchedEvent state:SyncConnected type:NodeDeleted path:/assignments/EsperTopology-1-1381914842 for sessionid 0x141c08aef2e0002
2013-10-16 11:14:21 o.a.z.ClientCnxn [DEBUG] Got notification sessionid:0x141c08aef2e0002
2013-10-16 11:14:21 o.a.z.ClientCnxn [DEBUG] Got WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/assignments for sessionid 0x141c08aef2e0002
2013-10-16 11:14:21 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e0002, packet:: clientPath:null serverPath:null finished:false header:: 36,12  replyHeader:: 36,30,0  request:: '/storm/assignments,T  response:: v{},s{9,9,1381914838531,1381914838531,0,2,0,0,1,0,30}
2013-10-16 11:14:21 b.s.d.supervisor [DEBUG] Synchronizing supervisor
2013-10-16 11:14:21 b.s.d.supervisor [DEBUG] Storm code map: {}
2013-10-16 11:14:21 b.s.d.supervisor [DEBUG] Downloaded storm ids: #{"EsperTopology-1-1381914842"}
2013-10-16 11:14:21 b.s.d.supervisor [DEBUG] All assignment: 
2013-10-16 11:14:21 b.s.d.supervisor [DEBUG] New assignment: {}
2013-10-16 11:14:21 b.s.d.supervisor [DEBUG] Writing new assignment {}
2013-10-16 11:14:21 b.s.d.supervisor [INFO] Removing code for storm id EsperTopology-1-1381914842
2013-10-16 11:14:21 b.s.util [DEBUG] Rmr path /data/storm/supervisor/stormdist/EsperTopology-1-1381914842


Worker log (similar for all workers):
2013-10-16 11:14:22 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e000a, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,38,0  request:: '/storm/assignments,F  response:: s{9,9,1381914838531,1381914838531,0,2,0,0,1,0,30} 
2013-10-16 11:14:22 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e000a, packet:: clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 2,38,0  request:: '/storm/storms,F  response:: s{10,10,1381914838538,1381914838538,0,2,0,0,1,0,31} 
2013-10-16 11:14:22 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e000a, packet:: clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 3,39,0  request:: '/storm/supervisors,F  response:: s{11,11,1381914838549,1381914838549,0,1,0,0,1,1,16} 
2013-10-16 11:14:22 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e000a, packet:: clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 4,39,0  request:: '/storm/workerbeats,F  response:: s{13,13,1381914838566,1381914838566,0,2,0,0,1,0,33} 
2013-10-16 11:14:22 o.a.z.ClientCnxn [DEBUG] Reading reply sessionid:0x141c08aef2e000a, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,39,0  request:: '/storm/errors,F  response:: s{14,14,1381914838574,1381914838574,0,0,0,0,1,0,14} 
2013-10-16 11:14:23 b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/data/storm/supervisor/stormdist/EsperTopology-1-1381914842/stormconf.ser' does not exist
        at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137) ~[commons-io-1.4.jar:1.4]
        at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135) ~[commons-io-1.4.jar:1.4]
        at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:177) ~[storm-core-0.9.0-rc2.jar:na]
        at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:155) ~[storm-core-0.9.0-rc2.jar:na]
        at backtype.storm.daemon.worker$fn__5874$exec_fn__1227__auto____5875.invoke(worker.clj:338) ~[storm-core-0.9.0-rc2.jar:na]
        at clojure.lang.AFn.applyToHelper(AFn.java:185) [clojure-1.4.0.jar:na]
        at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
        at clojure.core$apply.invoke(core.clj:601) ~[clojure-1.4.0.jar:na]
        at backtype.storm.daemon.worker$fn__5874$mk_worker__5930.doInvoke(worker.clj:329) [storm-core-0.9.0-rc2.jar:na]
        at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.4.0.jar:na]
        at backtype.storm.daemon.worker$_main.invoke(worker.clj:439) [storm-core-0.9.0-rc2.jar:na]
        at clojure.lang.AFn.applyToHelper(AFn.java:172) [clojure-1.4.0.jar:na]
        at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
        at backtype.storm.daemon.worker.main(Unknown Source) [storm-core-0.9.0-rc2.jar:na]
2013-10-16 11:14:23 b.s.util [INFO] Halting process: ("Error on initialization")


We would be very grateful for any help - our topology is in production and we're currently stuck. It works on our other clusters, which have an identical setup (!).


Kind regards,

--
Jacek

James Xu

Oct 16, 2013, 5:47:54 AM
to storm...@googlegroups.com, James Xu
A workaround here: copy the missing stormconf.ser from another machine and put it where the supervisor is reading from; the supervisor can then be started.
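
The copy itself is trivial; something like this sketch (both paths are examples - the target is the one from Sven's log, so use whatever path your own FileNotFoundException reports; commons-io is already on Storm's classpath):

    import java.io.File;
    import org.apache.commons.io.FileUtils;

    // Sketch only: put a stormconf.ser taken from a healthy machine at the
    // path the failing supervisor reads from. Both paths are illustrative.
    public class RestoreStormConf {
        public static void main(String[] args) throws Exception {
            File copiedOver = new File("/tmp/stormconf.ser");
            File missing = new File(
                "/var/storm/supervisor/stormdist/CommentTopology-56-1380865400/stormconf.ser");
            FileUtils.copyFile(copiedOver, missing);
        }
    }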


Jacek Dabrowski

Oct 16, 2013, 6:37:47 AM
to storm...@googlegroups.com, James Xu
Hello James,

Do you possibly know what the content of this file is?

Our topology jar is identical on multiple clusters, but the startup arguments are different. Of course storm.yaml is different too.

Wouldn't that be a problem when copying the file?


Kind regards,

--
Jacek

James Xu

Oct 16, 2013, 8:37:54 AM
to Jacek Dabrowski, James Xu, storm...@googlegroups.com
Since your supervisor can't start right now, and the corresponding worker can't start either, the topology in question should no longer be assigned to this supervisor. So the actual content of stormconf.ser does not matter, as long as it is in the right format. After the supervisor (and the corresponding worker) is started, the worker will be killed anyway; you can confirm this through the Storm UI.
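
If you want to inspect the content: in the 0.9.x line, stormconf.ser appears to be a Java-serialized map of the topology configuration - that is what read_supervisor_storm_conf in the stack traces above deserializes. Assuming that holds, a sketch like this dumps it (run with storm-core and its dependencies on the classpath, since the serialized object is a Clojure map class):

    import java.io.FileInputStream;
    import java.io.ObjectInputStream;
    import java.util.Map;

    // Sketch, assuming stormconf.ser is a Java-serialized topology config map.
    public class DumpStormConf {
        public static void main(String[] args) throws Exception {
            ObjectInputStream in = new ObjectInputStream(
                new FileInputStream(args[0]));
            try {
                Map<?, ?> conf = (Map<?, ?>) in.readObject();
                for (Map.Entry<?, ?> e : conf.entrySet()) {
                    System.out.println(e.getKey() + " = " + e.getValue());
                }
            } finally {
                in.close();
            }
        }
    }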

Jacek Dabrowski

Oct 17, 2013, 4:48:01 AM
to storm...@googlegroups.com, Jacek Dabrowski, James Xu
Well James, 

thank you for your reply - it does seem to help... a bit. Nonetheless, after the workers start, they are unexpectedly shut down by the supervisor. In the UI I can see the topology being submitted; then it disappears after several seconds.

I think it is identical to the behavior described by Tong Jin in this thread: https://groups.google.com/forum/#!topic/storm-user/40l788uOoW0
My log output is very similar.


I experience this behavior on only one of our clusters. For the time being, I have moved our production topology to a backup cluster - there it works with no problems.

What is strange is that the setup is very similar (it's basically the same KVM image installed on a different host). I have also tried copying the exact configuration from the working cluster to the failing one - it did not help.

I could of course scrap this KVM image and install a new one, but I'm hesitant to do that, since then I wouldn't be able to reproduce the problem. It seems that clearing the Storm and Zookeeper data and restoring the configuration should restore the original state - but it does not.

Any ideas?


Kind regards,

--
Jacek

James Xu

Oct 17, 2013, 5:01:06 AM
to Jacek Dabrowski, James Xu, storm...@googlegroups.com
The workers being killed by the supervisor is actually expected: the supervisor tries to start these workers because its *local-state* says so -- and that state is outdated. After the supervisor is up, it synchronizes topology assignments from ZooKeeper, finds that all of its workers have been de-assigned, and so the workers are killed.

I think it IS a bug in Storm -- the local assignment is not updated, but stormconf.ser is already gone. I'm just not sure how to reliably reproduce and fix it right now. @Tong Jin
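
If anyone wants to experiment with a fix, the missing guard presumably looks something like this (idea only, not actual Storm source):

    import java.io.File;

    // Idea only, not Storm source: local state can say "launch a worker for
    // topology X" after the assignment sync has already deleted X's code
    // directory (see Sven's log at the top of the thread). Re-checking for
    // stormconf.ser just before launching would turn the crash into a
    // skipped launch that the next sync cleans up.
    public class StaleAssignmentGuard {
        static boolean topologyCodePresent(String stormLocalDir, String stormId) {
            return new File(stormLocalDir + "/supervisor/stormdist/" + stormId
                + "/stormconf.ser").exists();
        }

        public static void main(String[] args) {
            if (!topologyCodePresent("/var/storm", "CommentTopology-56-1380865400")) {
                System.out.println("Code dir already removed; skipping worker launch");
            }
        }
    }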