Re: [storm-user] Error on initialization of server mk-worker (stormconf.ser is missing)


Nathan Marz

Aug 13, 2012, 3:36:37 AM
to storm...@googlegroups.com
Are your supervisors sharing a directory over a network mount, by any chance? What happens if you turn off supervisor2 completely? Do topologies launch successfully on the other supervisor?


On Fri, Aug 10, 2012 at 2:06 AM, bmoshe <bmo...@gmail.com> wrote:
Hi guys,

I have a simple topology that puts values on a Redis server.
When I deploy it, no client manages to connect to Redis, although it works perfectly fine when I run it via LocalCluster.
I attached logs and conf for the nimbus and the supervisors.

The cluster is configured as follows:
  • 1GB RAM for the nimbus (192.168.1.22)
  • 1GB RAM for the zookeeper1 (192.168.1.31)
  • 2GB RAM for the supervisor1 (192.168.1.16; 4 workers)
  • 2GB RAM for the supervisor2 (192.168.1.19; 2 workers)

All machines are virtual and have JDK 6u33 x64 installed.
nimbus, supervisor1 & supervisor2 have Storm 0.8.0, ZeroMQ 2.1.7 and the latest JZMQ installed.
zookeeper1 has Python 2.6.6 (with default configuration) and Zookeeper 3.3.6 installed.
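
For reference, the storm.yaml entries describing a layout like this would look roughly as follows - an illustrative sketch based on the addresses and worker counts above, not the actual attached conf files:

storm.zookeeper.servers:
    - "192.168.1.31"
nimbus.host: "192.168.1.22"
storm.local.dir: "/opt/storm/local"

# on supervisor1 (4 worker slots)
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

# on supervisor2, only two ports (e.g. 6700 and 6701) would be listed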


I'm not sure this is the entire problem, but I'm getting the following exception on some of my supervisors (in this case, supervisor2):
2012-08-10 08:21:27 worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/opt/storm/local/supervisor/stormdist/DistributedSystem-1-1344586762/stormconf.ser' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
at backtype.storm.daemon.worker$fn__4316$exec_fn__1206__auto____4317.invoke(worker.clj:331)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:601)
at backtype.storm.daemon.worker$fn__4316$mk_worker__4372.doInvoke(worker.clj:322)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.worker$_main.invoke(worker.clj:432)
at clojure.lang.AFn.applyToHelper(AFn.java:172)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.worker.main(Unknown Source)
2012-08-10 08:21:27 util [INFO] Halting process: ("Error on initialization")


The topology I'm trying to run requires 4 workers altogether.
So even if supervisor2 malfunctions, the other supervisor should be able to run the entire topology on its own.
Am I doing something wrong here?
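
For context, the 4-worker requirement comes from the topology configuration rather than from the cluster. A minimal, illustrative Storm 0.8 sketch - with placeholder spout and bolt classes, since the real ones aren't shown in this thread - of how such a topology is declared and submitted:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

public class DistributedSystemTopology {

    // Placeholder spout (hypothetical): emits a constant value.
    public static class DummySpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            collector.emit(new Values("some-value"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("value"));
        }
    }

    // Placeholder bolt (hypothetical): a real one would write the tuple to Redis.
    public static class RedisWriterBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            // e.g. write input.getString(0) to Redis in the real topology
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, no output stream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("source", new DummySpout(), 1);
        builder.setBolt("redis-writer", new RedisWriterBolt(), 4).shuffleGrouping("source");

        Config conf = new Config();
        conf.setNumWorkers(4); // this is where the "4 workers altogether" comes from

        StormSubmitter.submitTopology("DistributedSystem", conf, builder.createTopology());
    }
}

With conf.setNumWorkers(4) and supervisor1 alone exposing 4 slots, Nimbus should indeed be able to schedule the whole topology on supervisor1 when supervisor2 is out of the picture.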


Thanks,
Moshe.



--
Twitter: @nathanmarz
http://nathanmarz.com

Moshe Bixenshpaner

Aug 13, 2012, 6:56:54 PM
to storm...@googlegroups.com
No, supervisors don't share directories.
They are virtual machines created with KVM, though (I'm not sure whether that has anything to do with the problem).

If I have enough workers on a single supervisor, everything works perfectly fine.
It seems the coordination between the supervisors is the cause of the problem.

Thanks,
Moshe.

Nathan Marz

Aug 14, 2012, 4:54:04 AM
to storm...@googlegroups.com
The error you're facing indicates that the supervisor failed to download the configuration file from Nimbus. Can you show me the results of doing an ls -R on the supervisor local dir for the node that's getting that error? (do it while the topology is active and causing the error – that is, don't shut it down and then do the ls -R). 
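
For reference, a healthy supervisor should have downloaded the topology's files into its stormdist directory before any worker starts, so the ls -R would be expected to include something roughly like this (illustrative output, using the topology id from the error above and the standard Storm 0.8 file names):

$ ls -R /opt/storm/local
...
/opt/storm/local/supervisor/stormdist/DistributedSystem-1-1344586762:
stormcode.ser  stormconf.ser  stormjar.jar
...

If that directory is missing or only partially populated while workers for the topology are being launched, the FileNotFoundException above is exactly what the worker prints.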
Message has been deleted

Moshe Bixenshpaner

Aug 14, 2012, 11:25:27 AM
to storm...@googlegroups.com
Hi,

I attached the local directory and log files for nimbus and each of the supervisors.
sv2 is the supervisor that fails to load.

Thanks,
Moshe.
nb-local.tar
nb-logs.tar
sv1-local.tar
sv1-logs.tar
sv2-local.tar
sv2-logs.tar

Nathan Marz

Aug 14, 2012, 2:20:28 PM
to storm...@googlegroups.com
I would need you to do the ls -R while the error is happening and the topology is still active.

Moshe Bixenshpaner

Aug 14, 2012, 3:35:25 PM
to storm...@googlegroups.com
This is exactly what I did (only I attached a tar file of the entire local directory, instead of just attaching the output from an ls -R).

Nathan Marz

Aug 17, 2012, 4:26:06 AM
to storm...@googlegroups.com
I don't quite understand – you said you did the ls -R a few days after the exception happened.

Moshe Bixenshpaner

Aug 17, 2012, 4:28:41 AM
to storm...@googlegroups.com
I deleted that post. The one I eventually posted came after I reset everything, reproduced the whole thing, and attached the logs and the contents of the local directories.

Nathan Marz

Aug 17, 2012, 4:35:20 AM
to storm...@googlegroups.com
The sv2 logs don't show any exceptions.

Moshe Bixenshpaner

Aug 25, 2012, 4:07:59 PM
to storm...@googlegroups.com
Hi Nathan,

The log files of both SV2 workers show java.io.FileNotFoundException: File '/opt/storm/local/supervisor/stormdist/DistributedSystem-1-1344956702/stormconf.ser' does not exist, followed by Halting process: ("Error on initialization").
On another note, the ZK1 log shows that clients are disconnecting every few seconds.

Moshe Bixenshpaner

Aug 25, 2012, 7:19:37 PM
to storm...@googlegroups.com
Hey guys,

Problem is solved.
There were actually two of them:
1. The documentation specifies which versions of ZeroMQ, JZMQ, Python and the JDK to use, but it doesn't say anything about Zookeeper. I assumed I could use the newest version (3.3.6), but that turned out to be a bad move. After a week of poor performance, I checked the jars shipped with Storm 0.8.0 and saw that it is built against Zookeeper 3.3.3 (a quick way to check this is sketched after this list).

2. I'm not sure how it is with physical clusters, but on a virtual cluster you need to list each node in the /etc/hosts file of every other node - pay attention to the following form (a full example is sketched after this list):
ip_address host_name.defaultdomain

Notice the .defaultdomain at the end of each host name - this is what actually solved the problem of getting a cluster of supervisors to work together.
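
To make both fixes concrete, here are two illustrative sketches; the install path, hostnames and domain are assumptions based on the setup described earlier in the thread, not taken from the actual machines.

Checking which Zookeeper version the Storm release ships with (assuming Storm is installed under /opt/storm, as the log paths suggest):

$ ls /opt/storm/lib | grep -i zookeeper
zookeeper-3.3.3.jar

And an /etc/hosts in the form described above, present on all four machines and listing every node of the cluster (hostnames are illustrative):

127.0.0.1      localhost
192.168.1.22   nimbus.defaultdomain       nimbus
192.168.1.31   zookeeper1.defaultdomain   zookeeper1
192.168.1.16   supervisor1.defaultdomain  supervisor1
192.168.1.19   supervisor2.defaultdomain  supervisor2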