3-machine cluster
-----------------
Nimbus and UI on one machine
2 supervisors on the other 2 machines, with 1 worker per machine
Configuration of one of the identical machines
----------------------------------------------
MemTotal: 198452480 kB (189 GB)
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
cache size : 15360 KB
cpu cores : 6
Software/platform details
-------------------------
Java.version=1.6.0_24 (Sun Microsystems Inc.)
Red Hat Enterprise Linux Server release 6.1 (Santiago)
0MQ : v2.1.7
jzmq : from Nathan's GitHub branch
Storm: v0.8.2
We installed 0MQ and jzmq from source and used the Storm binary. For testing we used a slightly modified word count topology (attached), with a parallelism of 25 for both the split-sentence bolt and the word-count bolt. Between runs we varied only the spout parameters. The details and our observations are below.
1) 2 spout threads emitting continuously (one emit per nextTuple call) with no sleep
This fails with the following exception after running for about 2 minutes:
java.lang.RuntimeException: Should always receive two-part ZMQ messages
at backtype.storm.messaging.zmq.ZMQConnection.recv_with_flags(zmq.clj:36)
at backtype.storm.messaging.loader$launch_receive_thread_BANG_$fn__1629$fn__1630.invoke(loader.clj:38)
at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
It is followed by a second exception:
java.io.FileNotFoundException: File '/home/arinto/local-var/storm/supervisor/stormdist/fed_word_count_two-4-1362153302/stormconf.ser' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
at backtype.storm.daemon.supervisor$fn__4793.invoke(supervisor.clj:414)
at clojure.lang.MultiFn.invoke(MultiFn.java:177)
at backtype.storm.daemon.supervisor$sync_processes$iter__4684__4688$fn__4689.invoke(supervisor.clj:249)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:473)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$dorun.invoke(core.clj:2725)
at clojure.core$doall.invoke(core.clj:2741)
at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:237)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:603)
at clojure.core$partial$fn__4070.doInvoke(core.clj:2343)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at backtype.storm.event$event_manager$fn__2507.invoke(event.clj:24)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:722)
2013-03-01 16:10:09 util [INFO] Halting process: ("Error when processing an event")
Other observations: latency was very high right from the start, and the workers died once the exceptions were thrown.
2) 2 spout threads emitting continuously (one emit per nextTuple call) with a 2 ms sleep in each nextTuple call
This ran without issues for about 10 minutes, after which we killed the topology.
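For reference, the only difference between the two cases is roughly what the sketch below shows. It is a simplified stand-in, not the attached topology: SentenceSink, RecordingSink, ThrottledSentenceSource and the sample sentences are placeholders made up for illustration, and the real spout emits through Storm's SpoutOutputCollector instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Storm's SpoutOutputCollector (not a Storm type). */
interface SentenceSink {
    void emit(String sentence);
}

/** Sink that simply records everything emitted to it. */
class RecordingSink implements SentenceSink {
    final List<String> emitted = new ArrayList<String>();

    public void emit(String sentence) {
        emitted.add(sentence);
    }
}

/** Simplified spout logic: one emit per nextTuple() call, with an optional sleep. */
class ThrottledSentenceSource {
    // Placeholder sentences, made up for this sketch.
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away",
        "four score and seven years ago"
    };

    private final SentenceSink sink;
    private final long sleepMillis; // 0 for case 1, 2 for case 2
    private int next = 0;

    ThrottledSentenceSource(SentenceSink sink, long sleepMillis) {
        this.sink = sink;
        this.sleepMillis = sleepMillis;
    }

    /** Mirrors a single nextTuple() call made by the spout thread. */
    void nextTuple() {
        if (sleepMillis > 0) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return;                             // skip this emit
            }
        }
        sink.emit(SENTENCES[next]);
        next = (next + 1) % SENTENCES.length;
    }

    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        ThrottledSentenceSource source = new ThrottledSentenceSource(sink, 2);
        for (int i = 0; i < 5; i++) {
            source.nextTuple();
        }
        System.out.println("emitted " + sink.emitted.size() + " sentences");
    }
}
```

With the 2 ms sleep, each spout thread can make at most about 500 nextTuple calls per second, so the two threads together emit at most roughly 1,000 sentences per second. With no sleep, the emit rate is limited only by how fast the loop runs.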
We're blocked on this issue. Any pointers on which component is causing it and how to resolve it would be much appreciated.
Thanks,
Kiran