Jet profiling tips


hern...@gmail.com

Mar 3, 2021, 2:22:46 PM
to hazelcast-jet
Hi Jet team,
What is the recommended approach to diagnosing performance issues in a Jet job when metrics show that the execution rate and throughput stop growing under increasing load? I have set up a profiler and captured some data, but honestly have no idea what the problem is. I can see a lot of time being spent moving items between vertices, no blocked threads, and not much time spent in user code, serialization, etc. It seems I am missing something important here.
Can you please share any hints, ideas, or approaches for pinpointing where the application spends most of its CPU time? I am aware of profiler shortcomings in the area of safepoint sampling.

Thank you 
Lukas Herman

Viliam Durina

Mar 4, 2021, 6:47:40 AM
to hern...@gmail.com, hazelcast-jet
Hi Lukas,

Maybe it's backpressure from the sink or some other processor: look at which processors have full queues, i.e. where the `queuesSize` metric is close to `queuesCapacity`. I'm not sure about the cause yet: is your CPU usage high?
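In case it helps, here's a minimal sketch of reading those metrics from a client. It assumes the Jet 4.x metrics API and the `MetricNames`/`MetricTags` constants; on older versions the metric names are the same strings, but the helper constants may not exist.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;
import com.hazelcast.jet.core.metrics.JobMetrics;
import com.hazelcast.jet.core.metrics.Measurement;
import com.hazelcast.jet.core.metrics.MetricNames;
import com.hazelcast.jet.core.metrics.MetricTags;

public class QueueMetricsCheck {
    public static void main(String[] args) {
        JetInstance jet = Jet.bootstrappedInstance();
        Job job = jet.getJobs().get(0);           // pick the job you want to inspect
        JobMetrics metrics = job.getMetrics();
        // A vertex whose queuesSize stays close to queuesCapacity is the one
        // applying backpressure to everything upstream of it.
        for (Measurement m : metrics.get(MetricNames.QUEUES_SIZE)) {
            System.out.printf("vertex=%s ordinal=%s queuesSize=%d%n",
                    m.tag(MetricTags.VERTEX), m.tag(MetricTags.ORDINAL), m.value());
        }
        for (Measurement m : metrics.get(MetricNames.QUEUES_CAPACITY)) {
            System.out.printf("vertex=%s queuesCapacity=%d%n",
                    m.tag(MetricTags.VERTEX), m.value());
        }
    }
}
```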

Looking at metrics and profiling are the basic tools for performance issues. Another is understanding the inner workings; maybe try to share some details about your job.

Viliam




hern...@gmail.com

Mar 4, 2021, 9:34:41 AM
to hazelcast-jet
Hi Viliam,
there is no backpressure and CPU saturation is around 60-80%. Queue sizes vary between 0 and 1, in rare cases 4 at most.
The job is basically a large DAG consisting of 80% AsyncTransformUsingContextUnorderedP-like processors and 20% cooperative processors: 1400 processor instances in total, distributed equally across TaskletExecutionService$CooperativeWorker.trackers.
Time spent inside the processors (measured separately and written into the processed item as a start timestamp plus elapsed time) is typically < 1 ms. This way I can observe the latency between individual processors by measuring the difference between their start timestamps. The time spent passing items between processors grows under load from 3-6 ms up to 120 ms, reducing overall throughput, which currently stands at 240 end-to-end tps.
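For context, the per-item timing above works roughly like the following simplified sketch; the `Timed` wrapper and the histogram hook are placeholders, not our actual classes.

```java
import java.io.Serializable;
import static java.util.concurrent.TimeUnit.NANOSECONDS;

// Simplified sketch of the measurement idea: each stage re-stamps the item,
// so the stamp difference approximates the time spent between two processors.
// Note: System.nanoTime() is only comparable within a single JVM/member.
public class Timed<T> implements Serializable {
    public final T payload;
    public final long stampNanos;

    public Timed(T payload, long stampNanos) {
        this.payload = payload;
        this.stampNanos = stampNanos;
    }

    public static <T> Timed<T> restamp(Timed<T> item) {
        long now = System.nanoTime();
        long betweenStagesMs = NANOSECONDS.toMillis(now - item.stampNanos);
        // record betweenStagesMs in a histogram / metric here (placeholder)
        return new Timed<>(item.payload, now);
    }
}
```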
Some statistics: 26000 Hazelcast operations per second, CPU at 6.7 cores out of 8, 14000 processor executions per second (4600 per node), 150 live threads. The expectation is to reach 2x more.

Regards
Lukas 

On Thursday, March 4, 2021 at 12:47:40 PM UTC+1, vil...@hazelcast.com wrote:

Viliam Durina

Mar 4, 2021, 11:18:49 AM
to hern...@gmail.com, hazelcast-jet
It seems you're on Jet 3.x; `AsyncTransformUsingContextUnorderedP` was renamed in 4.x :)
Did I get it right that your async-transform processors are non-cooperative and that you have 80% * 1400 = 1120 OS threads? That's quite a lot. The non-cooperative processors spin with short sleeps (up to 5 ms by default) while checking their input queues; the context switching alone might take that time. To confirm this, can you try increasing the `jet.idle.noncooperative.min.microseconds` and `jet.idle.noncooperative.max.microseconds` parameters? See here. If it decreases, that might be the cause.
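A minimal sketch of how those could be set when starting the member; the values below are arbitrary examples for the experiment, not recommendations.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;

public class IdleTuning {
    public static void main(String[] args) {
        // Raise the non-cooperative workers' idle sleep bounds (illustrative values);
        // equivalent to passing -Djet.idle.noncooperative.min.microseconds=... etc.
        // on the JVM command line of the member.
        System.setProperty("jet.idle.noncooperative.min.microseconds", "50");
        System.setProperty("jet.idle.noncooperative.max.microseconds", "10000");
        JetInstance jet = Jet.newJetInstance();
        // ... submit the job as usual
    }
}
```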
You should try to use cooperative processors if possible. You're already using transform-async processors; if the async API you're calling doesn't block the calling thread, they can stay cooperative. With cooperative processors, having thousands of them is fine.
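For illustration, a cooperative async stage in the Pipeline API could look roughly like this. The sketch assumes the Jet 4.x `mapUsingServiceAsync` API; `MyAsyncClient` and `lookupAsync` are placeholders for whatever async API you actually call.

```java
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.ServiceFactories;
import com.hazelcast.jet.pipeline.ServiceFactory;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import java.util.concurrent.CompletableFuture;

public class AsyncStageSketch {

    // Placeholder async client, standing in for the real non-blocking API.
    public static class MyAsyncClient {
        public CompletableFuture<String> lookupAsync(Object item) {
            return CompletableFuture.completedFuture(item.toString());
        }
    }

    public static Pipeline build() {
        // One shared client instance per member (created lazily on the member).
        ServiceFactory<?, MyAsyncClient> clientFactory =
                ServiceFactories.sharedService(ctx -> new MyAsyncClient());

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.list("input"))
         // Stays cooperative as long as lookupAsync() only returns a
         // CompletableFuture and never blocks the calling thread.
         .mapUsingServiceAsync(clientFactory, (client, item) -> client.lookupAsync(item))
         .writeTo(Sinks.logger());
        return p;
    }
}
```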

Viliam

Lukáš Herman

Mar 4, 2021, 11:24:11 AM
to Viliam Durina, hazelcast-jet
Hi Viliam,
Our async-transform processors are cooperative; there are 150 threads total. So this is a different story.
Is there a way to measure how fast the CooperativeWorker is spinning?

Lukas 

On Thu, 4 Mar 2021 at 17:18, Viliam Durina <vil...@hazelcast.com> wrote:

Viliam Durina

Mar 8, 2021, 4:23:33 AM
to hazelcast-jet
Hi Lukas,

there are "iterationCount" and "taskletCount" metrics, see https://jet-start.sh/docs/operations/monitoring. These indicate if the cooperative worker spins quickly enough.

But this whole thing doesn't make sense to me. You say that there's no full queue in your DAG. That means there's no bottleneck inside the DAG, which would mean that your source isn't able to read fast enough. Can you please check that?

Or another idea: maybe your network is the bottleneck? Unfortunately we don't show queuesSize/queuesCapacity for SenderTasklet, but you can check your network statistics.

Another thing to try is BlockHound. It will show you whether any of your cooperative processors does any blocking. When you have no backpressure from the sink and enough data from the source, your CPU usage should be 100%; if it's less, that's an indication of an issue. But this is more work to implement; some time ago I experimented with it. See the last commits in this branch or this file: you essentially need to list the places where the green threads are allowed to block. However, this tool doesn't show you whether cooperative threads are blocked waiting for a synchronization monitor, which is common.
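If you want to try it without the branch above, a rough standalone setup could look like the sketch below; the cooperative thread-name pattern and the allow-listed call are assumptions you would need to adjust for your version.

```java
import reactor.blockhound.BlockHound;

public class BlockHoundSetup {
    public static void install() {
        BlockHound.builder()
                // Assumption: mark Jet's cooperative worker threads as non-blocking;
                // the exact thread-name pattern may differ between Jet versions.
                .nonBlockingThreadPredicate(current ->
                        current.or(t -> t.getName().contains("cooperative.thread")))
                // Example of allow-listing a known-harmless blocking call.
                .allowBlockingCallsInside("java.util.UUID", "randomUUID")
                // Log instead of throwing, so the job keeps running while you observe.
                .blockingMethodCallback(m -> System.err.println("Blocking call: " + m))
                .install();
    }
}
```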

You can also enable finest logging for `com.hazelcast.jet.impl.execution.TaskletExecutionService`; it will print a message whenever a cooperative tasklet takes more than 5 ms to complete. It might print false positives (due to a GC pause or high system load), but it may point you to some issues.
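For example, assuming the member uses the JDK logger (`hazelcast.logging.type=jdk`), you could enable it programmatically like this; with log4j2/logback you would set the same logger name to TRACE in its config instead.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TaskletLogging {
    public static void enableFinest() {
        // Logger name taken from the class mentioned above; a handler must also
        // allow FINEST, otherwise the records are filtered out.
        Logger logger = Logger.getLogger("com.hazelcast.jet.impl.execution.TaskletExecutionService");
        logger.setLevel(Level.FINEST);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINEST);
        logger.addHandler(handler);
    }
}
```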

I'm not sure what your issue is, so these are just troubleshooting ideas.

Viliam

hern...@gmail.com

Mar 10, 2021, 3:00:54 AM
to hazelcast-jet
Hi Viliam,
thanks for the hints provided.
We are using BlockHound in unit tests to verify that business logic does not block important threads. It works well there.
I finally captured some interesting information with async-profiler and its flame graph output:
[attachment: flamegraph.png]
and located inefficient code in third-party libraries: mostly expensive object initialization, hidden monitor locks, classpath scanning for factories, etc. I observed significant improvements after a couple of easy fixes.
Just a question: is it expected that CooperativeWorker.run -> CopyOnWriteArrayList.forEach takes most of the active CPU time, since it is constantly looping over the tasklets?

Best regards
Lukas


On Monday, March 8, 2021 at 10:23:33 AM UTC+1, vil...@hazelcast.com wrote:

Marko Topolnik

Mar 10, 2021, 4:14:35 AM
to hern...@gmail.com, hazelcast-jet

On 10 Mar 2021, at 09:00, hern...@gmail.com <hern...@gmail.com> wrote:

it is expected that CooperativeWorker.run -> CopyOnWriteArrayList.forEach should take most of active CPU time, since it is constantly looping over tasklets, right?


All the processor work happens within the line

trackers.forEach(runTasklet);
where trackers is
private final CopyOnWriteArrayList<TaskletTracker> trackers;
and the main line of code in runTasklet() is this:
final ProgressState result = t.tasklet.call();
If there isn't any visible CPU time reported from this line, it probably means the cooperative worker thread is iterating over the tasklets looking for work to do, but all the tasklets are blocked from making progress.

--
Marko

hern...@gmail.com

Mar 11, 2021, 3:58:32 AM
to hazelcast-jet
>The flame graph looks as if your processors don't have much work to do. A big part is spent in ProcessorTasklet.fillInbox, that is when the processors are checking for more input to process. One disadvantage of having a >lot of processors that then one cooperative worker has many tasklets to run. If just one tasklet makes progress, the cooperative worker doesn't sleep and starts a new iteration immediately. This leads to higher wasted CPU. >But under load little CPU is wasted.
>
>Viliam


Thanks guys for the feedback, I appreciate it!
Now I have a better understanding of Jet internals and the areas to focus on.

Regards
Lukas


On Wednesday, March 10, 2021 at 10:14:35 AM UTC+1, Marko Topolnik wrote: