What does "Complete latency (ms)" mean for a spout?


Stanley Xu

Jun 11, 2012, 9:56:43 AM
to storm...@googlegroups.com
Hi buddies,

We are trying to migrate one of our realtime workflows from a customized application on a single machine to Storm 0.7.2 on 2 machines (16 cores, 48 GB memory each).
When we do the testing, we use the default Storm configuration, which means 2 supervisors on the 2 machines with 4 workers per machine.

We are using 1 ScribeSpout to accept logs from our serving system, several bolts to do the log parsing and transformation, and several more bolts to do the aggregation.

When we turn off the ackers (by setting the acker count to 0), the "Complete latency (ms)" shown by the Storm UI is 0.
When we turn on the ackers (using 1 or 2), the "Complete latency" shown by the Storm UI is about 700-800 ms.

Does this number mean the time between getting the log in the spout and all acks being received from all the bolts?
It seems a little long to me, since the process latency in the two kinds of bolts is 0.6 ms and 0.02 seconds.
Is there any possibility that anything would make the "Complete latency" look so long?

And another problem we have met is that we randomly hit a "GC overhead limit exceeded" exception in some of the tasks, which brings down one of the worker nodes. I am wondering what kind of things might make that happen?
I tried to change worker.childopts to "-Xmx4096M", but then the worker node could not start successfully. Do we need any other configuration to increase the memory the worker JVM uses?
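For reference, the worker heap is normally raised through the worker.childopts key in storm.yaml on each supervisor node; a minimal sketch (the heap size below is illustrative, not a recommendation, and the supervisors must be restarted for it to take effect):

```
# storm.yaml — illustrative sketch, not tuned values
worker.childopts: "-Xmx4096m"
```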

Best wishes,
Stanley Xu

Nathan Marz

Jun 11, 2012, 3:42:08 PM
to storm...@googlegroups.com




On Jun 11, 2012, at 6:56 AM, Stanley Xu <wenh...@gmail.com> wrote:

> When we turn off the ackers (by setting the acker count to 0), the "Complete latency (ms)" shown by the Storm UI is 0.
> When we turn on the ackers (using 1 or 2), the "Complete latency" shown by the Storm UI is about 700-800 ms.
>
> Does this number mean the time between getting the log in the spout and all acks being received from all the bolts?

It's the time between the spout emitting the tuple and the tuple tree being completed.

> It seems a little long to me, since the process latency in the two kinds of bolts is 0.6 ms and 0.02 seconds.
> Is there any possibility that anything would make the "Complete latency" look so long?

That means your tuples are getting backed up between bolts. Either increase your parallelism or throttle the spout with the max spout pending setting.
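For reference, the max spout pending throttle caps how many tuples a spout may have in flight (un-acked) at once; a minimal sketch of setting it via the topology config (the value is illustrative and must be tuned; in Java the same thing can be done with Config#setMaxSpoutPending):

```
# topology config / storm.yaml — illustrative value, tune for your load
topology.max.spout.pending: 1000
```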

>
> And another problem we have met is that we randomly hit a "GC overhead limit exceeded" exception in some of the tasks, which brings down one of the worker nodes. I am wondering what kind of things might make that happen?
> I tried to change worker.childopts to "-Xmx4096M", but then the worker node could not start successfully. Do we need any other configuration to increase the memory the worker JVM uses?

This means your spouts are out-pacing the consumers. Same advice as above applies.


Stanley Xu

Jun 11, 2012, 9:14:57 PM
to storm...@googlegroups.com
Sorry, Nathan.

I might have made a typo: my bolts only took 0.6 ms and 0.02 ms (I typed seconds before), and the spout complete latency is 700-800 ms, which is about 1000x the time spent on parsing and aggregation.

So I didn't get the idea of "tuples getting backed up" here. I will try your advice, but I am still confused: why would fast bolts back up tuples and slow down the spout?

Thanks.

Stanley Xu

Jun 12, 2012, 12:53:19 AM
to storm...@googlegroups.com
Thanks, Nathan. I changed the parallelism and throttled the spout with the max spout pending setting, and the complete latency decreased from 700-800 ms to 300-500 ms. I did a little calculation and I am still a little confused by the performance.

Now we have 1 spout, 64 bolts to parse the logs, and 16 bolts to do the aggregation. The TPS into the Storm topology (from Scribe) is about 5000-10000.

The parsing bolt process time is about 0.8 ms and the aggregation bolt process time is 0.02 ms.
Each log is parsed into 4 tuples and sent to the different aggregation bolts.

Let's assume the time spent in the spout is almost zero (since all it does is poll a log from a queue).

Even if we executed the whole processing chain synchronously, the time spent on processing a log would be:

0 (time to poll the log) + 0.8 ms (time to parse the log) + 0.02 ms x 4 (time to do the aggregation), which is no more than 1 ms.


With 64 parsing bolts, each parsing bolt only needs to process 100-200 tuples per second; at 0.8 ms per tuple, that is 80-160 ms of processing per second, so the parsing bolts look like the bottleneck.
With 16 aggregation bolts, each aggregation bolt only needs to process 1600-3200 tuples per second; at 0.02 ms per tuple, that is 32-64 ms of processing per second.

So if we don't consider the network/IO cost and the cost of acks, I would expect it to be much faster than what I saw in the Storm UI.
Do I miss anything here?
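The back-of-the-envelope numbers above can be sketched as a utilization check: the fraction of each second an executor is actually busy, using the figures quoted in this thread (5000-10000 logs/s, 4 tuples per log, 64 parse bolts at 0.8 ms/tuple, 16 aggregation bolts at 0.02 ms/tuple). The class and method names here are just for illustration.

```java
// Back-of-the-envelope utilization check for the figures in this thread.
public class ThroughputCheck {
    // Fraction of each second one executor spends processing (1.0 = saturated).
    static double utilization(double tuplesPerSec, int numBolts, double msPerTuple) {
        return (tuplesPerSec / numBolts) * msPerTuple / 1000.0;
    }

    public static void main(String[] args) {
        // 10,000 logs/s spread over 64 parse bolts at 0.8 ms each.
        double parse = utilization(10_000, 64, 0.8);      // 0.125 -> 12.5% busy
        // 4 tuples per log -> 40,000 tuples/s over 16 aggregation bolts at 0.02 ms.
        double agg = utilization(40_000, 16, 0.02);       // 0.05  -> 5% busy
        System.out.printf("parse=%.3f agg=%.3f%n", parse, agg);
    }
}
```

Since neither bolt type is anywhere near saturated, the observed complete latency has to come from queueing and acking overhead rather than processing time, which is consistent with Nathan's diagnosis.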

Another data point, from a Ganglia monitor: the two machines' CPU usage is about 50%, with about 10-15% system CPU.
How could I make the CPU busier to reduce the latency? It looks like the context switching between threads is already a little heavier than I expected.
 
Thanks.

Nathan Marz

Jun 12, 2012, 1:10:00 AM
to storm...@googlegroups.com
How many ackers do you have configured for your topology? If you haven't set the number of ackers, then you might be bottlenecked there. Try setting # ackers = # workers.
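For reference, the acker count is part of the topology config; a minimal sketch (the count below assumes 4 workers, per the setup in this thread; in Java the same thing can be done with Config#setNumAckers; note that later Storm versions renamed this key to topology.acker.executors):

```
# topology config — one acker per worker, per Nathan's suggestion
topology.ackers: 4
```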
--
Twitter: @nathanmarz
http://nathanmarz.com

Stanley Xu

Jun 12, 2012, 1:16:25 AM
to storm...@googlegroups.com
I just set 2 ackers. Got it, I will try increasing the number of ackers.

Stanley Xu

Jun 12, 2012, 2:19:02 AM
to storm...@googlegroups.com
Thanks a lot, Nathan. It looks like the latency decreased to about 20 ms at 6k TPS after setting the number of ackers equal to the number of workers. Now every worker node (JVM) has an acker.
Even though that is much larger than 1 ms, considering the overhead and network latency, the number makes more sense.

Nathan Marz

Jun 12, 2012, 2:23:56 AM
to storm...@googlegroups.com
Great, glad we figured that out. By the way, you might see better performance with 0.8.0.

Shrijeet Paliwal

Jun 12, 2012, 2:27:01 AM
to storm...@googlegroups.com
/off topic (sorry Stanley)

Nathan, 
Is 0.8.0 backward compatible with application code written using 0.7.2?
--
Shrijeet

Nathan Marz

Jun 12, 2012, 2:31:30 AM
to storm...@googlegroups.com
Not completely. The OutputCollector is no longer thread-safe, and there are some minor API changes. But it's pretty close.

Shrijeet Paliwal

Jun 12, 2012, 2:42:08 AM
to storm...@googlegroups.com
>> The OutputCollector is no longer thread-safe...
Interesting. Thanks Nathan. 
-Shrijeet

Richards Peter

Jun 13, 2012, 2:56:01 AM
to storm...@googlegroups.com
Nathan,

I would like to know the impact of this change (the OutputCollector not being thread-safe). Does it mean that every emit should be synchronized? I am seeing some errors in deserialization. Does that mean the tuples are getting mixed up? And if we are supposed to synchronize the emits, should we hold the lock on the collector?

Thanks,
Richards.

Nathan Marz

Jun 13, 2012, 3:13:47 AM
to storm...@googlegroups.com
No, you don't have to synchronize anything. You only need to if you yourself launch additional threads that use the OutputCollector. Each task is given its own OutputCollector.
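The pattern Nathan describes can be sketched in plain Java, with no Storm dependency: a stand-in collector class (hypothetical, not Storm's OutputCollector) whose emit is not thread-safe, guarded by a lock whenever user-launched threads call it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a non-thread-safe collector (like OutputCollector post-0.8.0).
class FakeCollector {
    private final List<String> emitted = new ArrayList<>();
    void emit(String tuple) { emitted.add(tuple); }  // not safe to call concurrently
    int count() { return emitted.size(); }
}

public class SynchronizedEmit {
    // Emit from several user-launched threads, holding a lock on the collector
    // around every emit, since the collector itself is not thread-safe.
    static int emitFromThreads(FakeCollector collector, int threads, int perThread)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    synchronized (collector) {  // guard each emit
                        collector.emit("tuple");
                    }
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) w.join();
        return collector.count();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(emitFromThreads(new FakeCollector(), 4, 1000));
    }
}
```

If all emits happen on the task's own executor thread (the normal case), no locking is needed at all, which is Nathan's point.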