Bruno D. Rodrigues
Oct 24, 2013, 5:23:08 PM
to storm...@googlegroups.com
I’ve been thinking about how to ask a more concrete question, with samples and everything, but I’ll try a simpler version first:
- will I be able to feed gigabit-rate (>128MB/sec) data into Storm and get something out of it?
My first test was a simple spout doing an HTTP request (or reading a file), which even with a simple BufferedReader and reader.readLine() gives me more than 600MB/sec (4.8Gbps).
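The baseline measurement is essentially just this (a minimal standalone sketch, not my exact code; the file name is a placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ReadBaseline {
        public static void main(String[] args) throws Exception {
            // read line by line, exactly as the spout would, and report MB/sec
            BufferedReader reader = new BufferedReader(new FileReader("input.json")); // placeholder path
            long bytes = 0;
            long start = System.currentTimeMillis();
            String line;
            while ((line = reader.readLine()) != null) {
                bytes += line.length();
            }
            long elapsedMs = Math.max(System.currentTimeMillis() - start, 1);
            System.out.printf("%.1f MB/sec%n", (bytes / 1e6) / (elapsedMs / 1000.0));
            reader.close();
        }
    }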
The bolt simply counts how many messages arrive and the total size of the strings.
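That bolt is roughly this (a sketch; the field names are mine, assuming the spout emits (id, line) tuples):

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;

    public class CountBolt extends BaseBasicBolt {
        private long msgs;
        private long bytes;

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // the entire "processing": msg++ and bytes += line.length
            msgs++;
            bytes += tuple.getString(1).length(); // field 1 is the raw line
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing is emitted downstream
        }
    }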
There is one spout and 4 bolts (tested 1 to 8). There is a local mode, or a cluster mode, either in a single machine, or between two machines.
My messages do not need reliability, so I’m emitting them with a simple emit([id, line], streamid), and as far as I understood I don’t need to ack the message later, nor listen for ack/fail on the spout.
So this leaves me deciding between emitting one line per nextTuple call, or putting a for(;;) loop inside it.
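The unreliable one-line-per-call variant looks roughly like this (a sketch, not my actual code: I’m emitting on the default stream here, and a file stands in for the HTTP source):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Map;
    import java.util.UUID;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class LineSpout extends BaseRichSpout {
        private transient BufferedReader reader;
        private transient SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            try {
                this.reader = new BufferedReader(new FileReader("input.json")); // placeholder source
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void nextTuple() {
            try {
                String line = reader.readLine();
                if (line != null) {
                    // no message id => unreliable: never acked, never failed, never replayed
                    collector.emit(new Values(UUID.randomUUID().toString(), line));
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "line"));
        }
    }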
If I dispatch one line per call, everything works perfectly. Memory usage is controlled, performance scales quite linearly with the number of bolts (measured by putting a sleep in the bolt), the maxSpoutMessages setting works as intended, all great, except that… I can’t go above 10 to 20MB/sec!
If I dispatch in a loop, the performance numbers are much nicer. I then see about 300MB/sec on the spout and the sum of the bolts slightly below that. Recall that the bolt simply does msg++ and bytes += line.length. And because it’s *slightly below*, messages accumulate somewhere and the workers die after a few seconds with OOM.
So I thought that a mixed mode - emitting batches of 128 or 1024 lines, or some value to be tuned later - would solve it (see the sketch below). Until I tried my next PoC and got confused.
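The mixed mode would just replace nextTuple in the spout sketch above, with BATCH as the value to be tuned (1 reproduces one-per-call, and a huge value degenerates into the for(;;) case):

    private static final int BATCH = 128; // placeholder; the value to be tuned

    @Override
    public void nextTuple() {
        try {
            for (int i = 0; i < BATCH; i++) {
                String line = reader.readLine();
                if (line == null) {
                    return; // source drained
                }
                collector.emit(new Values(UUID.randomUUID().toString(), line));
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }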
So I kept the one-line-per-emit and added a middle bolt (also 1 to 8 workers) that just does JSONValue.parse() and emits the JSON map forward. The final bolt then just counts messages, since the payload is no longer a string.
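The middle bolt is as dumb as it sounds (a sketch using json-simple’s JSONValue.parse; the field names are mine):

    import org.json.simple.JSONValue;

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class JsonBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // parse the raw line and pass the resulting object forward
            Object json = JSONValue.parse(tuple.getString(1));
            collector.emit(new Values(tuple.getString(0), json));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "json"));
        }
    }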
Once again, all looked great until… OOM again.
I was under the impression that spout.nextTuple would only be called when a message could actually be processed, but that doesn’t seem to be the case. So even emitting a mere 10MB/sec, line by line (the lines are JSON, averaging 800 bytes and ranging from 200 bytes to a couple of KB), with the JSON decoded and the messages counted by 4 parallel workers, that was still enough for the spout to fill up something and OOM the worker.
So besides the initial question - can I fill each of my Gb network cards, given enough nodes and sufficiently optimised bolts? - the follow-up question is how to control the flow.
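The one mechanism I’ve found in the docs is topology.max.spout.pending, which caps the number of un-acked tuples in flight - but it only applies to tuples emitted with a message id, so my unreliable emits bypass it entirely. If I understand correctly, making it effective would mean two changes (a sketch; 1000 is a placeholder value):

    // 1) when building the topology:
    Config conf = new Config();
    conf.setMaxSpoutPending(1000); // max in-flight (un-acked) tuples per spout task

    // 2) in the spout's nextTuple, emit WITH a message id so tuples count as pending:
    String id = UUID.randomUUID().toString();
    collector.emit(new Values(id, line), id);

    // BaseBasicBolt acks automatically; a BaseRichBolt would have to ack explicitly.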
PS: Sometimes I feel the work I’ve been tasked with has nothing to do with BigData. Or that I’m misunderstanding BigData. From my tests with many other products, I’ve seen a lot of attempts to solve the problem of processing big chunks of data in some relatively large amount of time, where the core problem is the data being one big block. The problem I have is a lot of small data, but in huge quantity. Real-time huge source data. Unicorns in the middle of enough garbage. Hence why Storm caught my attention (also because, from a beginner’s point of view, it just works out of the box). But let’s say mgmt is already asking me for 10Gb cards, when I can’t even fill up a regular Gb one. Or I can, with my current code, but unfortunately it’s an old-school monolithic block and of course it’s now getting hard to parallelize, hence why I’d prefer to contribute to a project that solves similar problems than continue with the “reinvented-here” approach.