Overhead of WasmStackCheck


Sam Parker-Haynes

Jun 12, 2025, 10:44:36 AM
to v8-dev
Hi!

While profiling some AI code, the loop-header WasmStackCheck appeared prominently in the profile. Disabling the checks yields a ~1.5% speedup. So, is it necessary to execute these on every iteration? Or could we wrap inner loops that are devoid of a stack check in a new loop that has one, so that these checks execute only a fraction as often as they do now?

Cheers,
Sam



Emanuel Ziegler

Jun 12, 2025, 11:02:24 AM
to v8-...@googlegroups.com
Hi Sam,

In principle, loop unrolling should already reduce the number of stack checks, but it could be that it's insufficient or that for whatever reason this optimization does not get applied here. Did you take a look at the generated code?

Cheers,
    Emanuel

--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/v8-dev/b4a77d25-21f9-4a0d-a467-8cbe48275bfbn%40googlegroups.com.

Darius Mercadier

Jun 13, 2025, 3:25:30 AM
to v8-...@googlegroups.com
Hi Sam,

Just so that everybody is on the same page: these stack checks are really interrupt checks rather than stack-overflow checks or something like that.

And yeah, performing them on every loop iteration is a bit of a waste of time, but we kinda need them in every loop where we can't prove that the loop won't run forever (or even: where we can't prove that the loop won't run for a long time). There is already a mechanism in Turboshaft to estimate the iteration count of specific loops and then remove stack checks based on it: cf LoopStackCheckElisionReducer, whose analysis is done by LoopUnrollingAnalyzer in the LoopUnrollingPhase.

However, as I write this, I'm noticing that the Wasm pipeline doesn't use the LoopStackCheckElisionReducer outside of loop unrolling (and loop unrolling only runs if there is at least one small(ish) inner loop). So, it would make sense to put a LoopStackCheckElisionReducer in the phase after LoopUnrollingPhase, WasmGCOptimizePhase... except that this phase only runs for WasmGC. Maybe we can tweak the LoopStackCheckElisionReducer a little so that it runs in both WasmGCOptimizePhase and WasmLoweringPhase (the next phase, which runs unconditionally), making it a nop in WasmLoweringPhase if it has already run in WasmGCOptimizePhase.
Additional limitation of LoopUnrollingAnalyzer+LoopStackCheckElisionReducer: the LoopUnrollingAnalyzer only detects the number of iterations of very specific loops (something like "for (i = <constant>; i <comparison operation> <constant>; i += <constant>)"). But it's perfectly reasonable to improve the pattern matching there to support more loops.

Another way to improve these interrupt checks that we've talked about before would be to remove all loop stack checks, and patch the code in-place when we have an interrupt to trigger. This way, we wouldn't have to keep repeating those stack checks that most of the time don't do anything. Probably not super trivial to implement, but doable I would think. If you want to go down that route, I suggest writing a design doc and gathering approvals from V8 folks, since some might have concerns (regarding security for instance).

Cheers,
Darius

Darius Mercadier

Software Engineer

dmerc...@google.com


Google Germany GmbH

Erika-Mann-Straße 33

80636 München



Sam Parker-Haynes

Jun 13, 2025, 6:29:26 AM
to v8-dev
Hi,

It seems my previous reply to Emanuel didn't send... Basically, I have already enabled Wasm loop unrolling in LLVM to help mitigate this, and I have modified the TS unroller to improve compile time and performance, so I'd like to avoid increasing code size.

With respect to figuring out whether a loop iteration takes a long time, what is the upper bound for interrupt latency? If we have an inner loop, without calls, I would assume we could have a reasonable number of instructions and iterations before that latency would be adversely affected?

The runtime patching certainly sounds interesting, if more complicated than a loop transform. I had no idea what you were talking about though :) but Pierre has filled me in. I was wondering whether patching would be strictly necessary: could we not just always have a probing load in the loop?

Cheers,
Sam

Jakob Kummerow

Jun 13, 2025, 7:41:12 AM
to v8-...@googlegroups.com
what is the upper bound for interrupt latency? If we have an inner loop, without calls, I would assume we could have a reasonable number of instructions and iterations before that latency would be adversely affected?

Yes, interrupting doesn't need to be immediate. There are situations where users can observe the latency of interrupting a long-running loop, so it should be something that a human perceives as fast. To give numbers, I'd aim for no more than 10ms latency on fast hardware, so that it won't be more than 100ms on slow hardware. If an order of magnitude less than this upper bound can be achieved without significant drawbacks, that's even better.

Darius Mercadier

Jun 13, 2025, 7:53:08 AM
to v8-...@googlegroups.com
Hi,

> what is the upper bound for interrupt latency?

There is no clear official upper bound. LoopStackCheckElisionReducer removes stack checks from loops that have less than kMaxIterForStackCheckRemoval (= 5000) iterations, but I chose that number fairly arbitrarily, using the "meh, sounds vaguely reasonable" principle (and it doesn't even take into account the number of instructions in the loop).

Regarding your initial proposal (splitting loops into an inner loop without stack checks and an outer part with stack checks): if you can get that to work, sure, why not. Doesn't sound super easy to do, but why not. Also, once you've implemented that, we can reuse the analysis to improve our loop unrolling (which is currently very trivial: it repeats the whole loop header (including the condition) in between the unrolled iterations): we should then be able to remove the in-between condition checking.

> The runtime patching certainly sounds interesting, if more complicated that a loop transform. I had no idea what you were talking about though :) but Pierre has filled me in. 

For the readers that don't have Pierre nearby to fill them in: What I meant is that when we emit loops, we could add a few nops before the backedge jump (remember that loops only have a single backedge in Turbofan/shaft). When we want to trigger an interrupt and the main thread is executing a loop, we could patch the nops in-place to instead either call a specific function or jump to a trampoline that would take care of processing the interrupt. Details would need to be ironed out: the nops might not even be necessary as we could maybe patch the backedge itself to jump to a trampoline within the Turboshaft code; how to patch back to the original code safely is also a good question, etc, etc.

The benefit of patching compared to your proposal is that it can apply to every loop, without requiring an analysis pass in Turboshaft to figure out the general shape of the loop, like what is the loop condition & co, and without increasing the size of Turboshaft graphs (it might slightly increase the size of the generated code because of a few nops, but hopefully not much).

> could we not just always have a probing load in the loop?

Not quite sure what you mean by that. Do you mean a load from a memory region that we would protect when we want an interrupt to happen so that the load traps then and we would rely on the trap handler to process the interrupt request? If so: maybe, but that's still a load, which isn't great.

Also, it's not entirely clear to me why stack checks are expensive. In theory, there's 1 load + 1 comparison + 1 branch, and the branch is easily predictable, so the whole thing should be free on superscalar out-of-order CPUs... Yet it definitely isn't. If the load is the expensive part, then we could do the stack check only every 5000 iterations or so (which still requires 1 comparison + 1 branch per iteration, to know whether we are in an iteration where we need to do a stack check). If it's the comparison + branch part, a trapping load might indeed be helpful. And if it's simply the combination of load + compare + branch, then either code patching or your idea of splitting the loop into inner loop + outer loop are the only options I can think of. If you have insights on what is expensive in stack checks, I'm all ears :)

Cheers,
Darius



Sam Parker-Haynes

Jun 13, 2025, 8:49:30 AM
to v8-dev
> Do you mean a load from a memory region that we would protect when we want an interrupt to happen so that the load traps then and we would rely on the trap handler to process the interrupt request? If so: maybe, but that's still a load, which isn't great.

Yes, but because the subsequent instructions in the loop do not depend on the loaded value, I would hope it would be relatively cheap. This sounds like the easiest option, so perhaps the first thing to try?

> If you have insights on what is expensive in stack checks, I'm all ears :)

With the caveat that yesterday I looked at one loop on one machine: the loop is still doing a reasonable amount of vector number crunching, and perf reports the overhead of the load and compare as 18% and 10%, respectively. The branch doesn't register in the profile, but the first instruction after it is at 6%. I'm sure newer, bigger hardware with fancier predictors would do a better job, though.

Cheers,
Sam 

Darius Mercadier

Jun 13, 2025, 10:22:54 AM
to v8-...@googlegroups.com
> This sounds like the easiest option, so the first thing to try..?

Sure, go for it :)

> With the caveat that yesterday I looked at one loop, on one machine, the loop is still doing a reasonable amount of vector number crunching, perf reports the overhead of load and compare as 18% and 10%, respectively. The branch doesn't register in the profile but the first instruction afterwards is 6%. I'm sure newer, bigger hardware, with more fancy predictors would do a better job though.

Ok, so it feels a lot like "everything is a bit expensive" :'(

Sam Parker-Haynes

Jun 13, 2025, 1:44:24 PM
to v8-dev
Hi Emanuel,

I actually enabled loop unrolling for Wasm in LLVM for this same reason. The Turboshaft unroller doesn't appear to be triggering for the one loop I looked at, as I think it's too big. I have tweaked the TS unroller heuristics to generally unroll less, as the added compilation time was having a detrimental effect, so I would like to avoid having more static code. Also, I'd really like to reduce these overheads by at least an order of magnitude.

Cheers,
Sam
