Thanks Nathan. I changed the parallelism and throttled the spout with the max spout pending config, and the complete latency dropped from 700-800 ms to 300-500 ms. I then did a little calculation, and I'm still a bit confused by the performance.
Now we have 1 spout, 64 bolts to parse the logs, and 16 bolts to do the aggregation; the input rate to the Storm topology (from Scribe) is about 5000-10000 tuples per second.
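For concreteness, the setup described above corresponds to wiring roughly like the sketch below. This is a configuration fragment that needs storm-core on the classpath; the spout/bolt class names, the "key" field, and the pending limit of 1000 are hypothetical placeholders, not our actual code:

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
// 1 spout polling logs from the Scribe-fed queue
builder.setSpout("log-spout", new LogSpout(), 1);
// 64 parsing bolts; each log is parsed into 4 tuples
builder.setBolt("parse", new ParseBolt(), 64)
       .shuffleGrouping("log-spout");
// 16 aggregation bolts, grouped by key so the same key
// always lands on the same bolt instance
builder.setBolt("aggregate", new AggBolt(), 16)
       .fieldsGrouping("parse", new Fields("key"));

Config conf = new Config();
// throttle the spout: at most this many tuples un-acked at once
conf.setMaxSpoutPending(1000);
```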
The parsing bolt's process time is about 0.8 ms and the aggregation bolt's is about 0.02 ms.
Each log is parsed into 4 tuples, which are sent to different aggregation bolts.
Let's assume the time spent in the spout is almost zero (all it does is poll a log from a queue).
Even if we executed the whole processing chain synchronously, the time spent on one log would be roughly:
0 (polling the log) + 0.8 ms (parsing) + 4 x 0.02 ms (aggregation), i.e. no more than 1 ms.
With 64 parsing bolts, each parsing bolt only needs to process 100-200 tuples per second; at 0.8 ms per tuple, that is only 80-160 ms of actual work per second. So the parsing bolts look like the bottleneck stage, even though they are far from saturated.
With 16 aggregation bolts, each aggregation bolt only needs to process 1600-3200 tuples per second; at 0.02 ms per tuple, that is 32-64 ms of work per second.
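Putting the per-bolt arithmetic above in one place, as a quick sanity check (the rates, bolt counts, and per-tuple times are the ones quoted above; the class and method names are just for illustration):

```java
// Back-of-envelope utilization check for the topology described above.
public class LoadCheck {
    // ms of parsing work each parse bolt does per second of input
    static double parseBusyMsPerSec(double logsPerSec, int parseBolts, double parseMs) {
        return logsPerSec / parseBolts * parseMs;
    }

    // ms of aggregation work each agg bolt does per second of input
    static double aggBusyMsPerSec(double logsPerSec, int tuplesPerLog,
                                  int aggBolts, double aggMs) {
        return logsPerSec * tuplesPerLog / aggBolts * aggMs;
    }

    public static void main(String[] args) {
        // At 10000 logs/s: 10000/64 ~ 156 tuples/s per parse bolt,
        // 156 * 0.8 ms ~ 125 ms of work per second (~12% busy).
        System.out.println("parse busy ms/s: " + parseBusyMsPerSec(10000, 64, 0.8));
        // 10000 * 4 / 16 = 2500 tuples/s per agg bolt,
        // 2500 * 0.02 ms = 50 ms of work per second (~5% busy).
        System.out.println("agg busy ms/s: " + aggBusyMsPerSec(10000, 4, 16, 0.02));
    }
}
```

Either way, every bolt should be idle well over 80% of the time, which is why the 300-500 ms complete latency seems hard to explain from processing time alone.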
So even setting aside network/IO cost and the cost of acks, I would expect it to be much faster than what I see in the Storm UI.
Am I missing anything here?
Another data point: from a Ganglia monitor, the two machines' CPU usage is about 50%, of which about 10-15% is system CPU.
How could I make the CPU busier to reduce the latency? It looks like context switching between threads is already a bit heavier than I expected.
Thanks.