storm supervisor start failed after machine is rebooted


haitao.yao

Oct 11, 2011, 4:48:12 AM
to storm-user

Hi all,
I have a test server running Storm. After the server was rebooted,
the supervisor failed to start.
The supervisor can only be started after I delete the supervisors
and workers folders under storm.local.dir.
What's the problem?
Here's the log:

2011-10-11 16:38:45 ClientCnxn [INFO] org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:738) Session establishment complete on server 10.130.137.169/10.130.137.169:2181, sessionid = 0x132f21610f8000f, negotiated timeout = 20000
2011-10-11 16:38:45 event [ERROR] clojure.contrib.logging$impl_write_BANG_.invoke(NO_SOURCE_FILE:0) Error when processing event backtype.storm.daemon.supervisor$fn__3405$exec_fn__855__auto____3406$sync_processes__3408@60407166
java.lang.RuntimeException: java.lang.RuntimeException: java.io.EOFException
at clojure.lang.LazySeq.sval(LazySeq.java:47)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor$read_worker_heartbeats.invoke(supervisor.clj:76)
at backtype.storm.daemon.supervisor$read_allocated_workers.invoke(supervisor.clj:92)
at backtype.storm.daemon.supervisor$fn__3405$exec_fn__855__auto____3406$sync_processes__3408.invoke(supervisor.clj:178)
at backtype.storm.event$event_manager$fn__1940$fn__1941.invoke(event.clj:25)
at backtype.storm.event$event_manager$fn__1940.invoke(event.clj:22)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: java.io.EOFException
at backtype.storm.utils.Utils.deserialize(Utils.java:47)
at backtype.storm.utils.LocalState.snapshot(LocalState.java:24)
at backtype.storm.utils.LocalState.get(LocalState.java:28)
at backtype.storm.daemon.supervisor$read_worker_heartbeat.invoke(supervisor.clj:64)
at backtype.storm.daemon.supervisor$read_worker_heartbeats$iter__3322__3326$fn__3327.invoke(supervisor.clj:77)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
... 12 more
Caused by: java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2280)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2749)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:779)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:279)
at backtype.storm.utils.Utils.deserialize(Utils.java:42)
... 17 more
2011-10-11 16:38:45 util [INFO] clojure.contrib.logging$impl_write_BANG_.invoke(NO_SOURCE_FILE:0) Halting process: ("Error when processing an event")


All I can do is delete all the data in zookeeper and reboot the

nathanmarz

Oct 11, 2011, 3:41:39 PM
to storm-user
This is a really strange error. The supervisor is failing to read a
worker's heartbeat even though those heartbeats are written
completely atomically. With what version of Storm are you seeing this
problem?
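
For anyone following the trace: the innermost frames are in ObjectInputStream.readStreamHeader, which the ObjectInputStream constructor calls to read the serialization header. An EOFException at that point suggests the stream ran out before the first two header bytes could be read, which is what a zero-length state file would produce. Whether the file on the affected server was actually empty is an assumption; it was not confirmed in this thread. A minimal sketch that reproduces the same exception in isolation (the class EmptyStateDemo and its helpers are hypothetical and not part of Storm):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical demo class, not Storm code.
public class EmptyStateDemo {

    // Returns true if opening an ObjectInputStream over the given bytes
    // fails with EOFException while reading the stream header.
    static boolean failsWithEof(byte[] bytes) throws IOException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // stream ended before the header was complete
        }
    }

    // Serializes an object with standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // An empty "file" fails the same way as in the log above.
        System.out.println(failsWithEof(new byte[0]));
        // A properly serialized value opens without error.
        System.out.println(failsWithEof(serialize("heartbeat")));
    }
}
```

If that is what happened here, checking the size of the files under the workers folder before deleting them would confirm it.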

I'm opening up an issue to track this: https://github.com/nathanmarz/storm/issues/23

-Nathan

haitao.yao

Oct 11, 2011, 11:13:44 PM
to storm-user
version: storm-0.5.3

Thanks~ Hope I can help to fix this.

haitao.yao

Oct 24, 2011, 10:48:50 PM
to storm-user
This problem still exists after I killed the nimbus process.

How can I gracefully stop the cluster?

I stopped the cluster to upgrade to storm-0.5.4.

The error still exists in 0.5.4.

Thanks very much.

Aaron Son

Oct 25, 2011, 11:13:17 AM
to storm-user
Was the host shut down cleanly? If it was, then the following
suggestion won't help with the actual problem you're running into.

It seems like the durability of the LocalState could be improved
slightly by flushing the file descriptor after the file is written.
Example of the change:

https://gist.github.com/1313059

Of course, when those bytes hit the disk is going to be dependent on a
lot of factors, but this at least tells the system we care. Assuming
we do care. This also risks changing performance characteristics of
the LocalState.
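
To make the idea concrete, here is a rough sketch of a write path that syncs before publishing the file. It is not the contents of the gist and not Storm's LocalState code; the class DurableWrite, the method writeDurably, and the ".tmp" naming are all hypothetical. It assumes the state is written to a temporary file and then renamed into place.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not Storm code.
public class DurableWrite {

    // Writes data to a temp file, asks the OS to push it to the device,
    // then renames the temp file over the target.
    static void writeDurably(File target, byte[] data) throws IOException {
        File tmp = new File(target.getPath() + ".tmp"); // assumed naming scheme
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(data);
            out.flush();
            // Without this, a rename can reach disk before the file contents,
            // leaving an empty or truncated file after a crash.
            out.getFD().sync();
        }
        // Atomic rename; on POSIX file systems this replaces an existing target.
        Files.move(tmp.toPath(), target.toPath(), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("localstate-demo").toFile();
        File state = new File(dir, "state");
        writeDurably(state, new byte[] {1, 2, 3});
        System.out.println(state.length());
    }
}
```

The sync call is where the performance cost mentioned above comes from: each write waits for the device instead of returning once the data is in the page cache.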

-- Aaron

nathanmarz

Oct 27, 2011, 2:22:43 AM
to storm-user
Thanks, I'll look into this more.