Root Relative Constant Table for WASM SIMD


Dan Weber

Mar 16, 2021, 6:07:04 PM
to v8-...@googlegroups.com
Hi everyone,

I wanted to approach the group to understand the feasibility of creating a root relative constant pool for WASM to support reuse of intermediate constants generated during complex instruction sequences.

Right now, when a complex operation such as a shuffle or swizzle has no direct architectural match, the code generator emits in-flight code to build shuffle masks, which are passed to pshufb on x64 and tbl on a64. Because of how the code generator and assembler currently work, these masks can be regenerated multiple times even when the input is constant (https://bugs.chromium.org/p/v8/issues/detail?id=11545).
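To make the mask-building step concrete, here is a minimal software model of how a two-input byte shuffle can be lowered to two pshufb operations plus an OR. It is a sketch under assumptions, not V8's code generator: the names `Bytes16`, `Pshufb`, `ShuffleMasks`, `BuildShuffleMasks`, and `Shuffle` are hypothetical, and lane indices are assumed to be in the range 0..31.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Software model of pshufb: a mask byte with its top bit set produces zero,
// otherwise the low four bits of the mask byte select a byte of `src`.
inline Bytes16 Pshufb(const Bytes16& src, const Bytes16& mask) {
  Bytes16 out{};
  for (int i = 0; i < 16; ++i) {
    out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
  }
  return out;
}

// One pshufb mask per input register (hypothetical helper type).
struct ShuffleMasks {
  Bytes16 for_a;
  Bytes16 for_b;
};

// Splits a two-input shuffle (lane indices 0..31) into two masks.
// Lanes taken from the other input are marked 0x80 so pshufb zeroes them.
inline ShuffleMasks BuildShuffleMasks(const Bytes16& lanes) {
  ShuffleMasks m{};
  for (int i = 0; i < 16; ++i) {
    const uint8_t lane = lanes[i];
    m.for_a[i] = lane < 16 ? lane : 0x80;
    m.for_b[i] = lane >= 16 ? static_cast<uint8_t>(lane - 16) : 0x80;
  }
  return m;
}

// Full shuffle: shuffle each input with its own mask, then OR the results.
inline Bytes16 Shuffle(const Bytes16& a, const Bytes16& b,
                       const Bytes16& lanes) {
  const ShuffleMasks m = BuildShuffleMasks(lanes);
  const Bytes16 from_a = Pshufb(a, m.for_a);
  const Bytes16 from_b = Pshufb(b, m.for_b);
  Bytes16 out{};
  for (int i = 0; i < 16; ++i) out[i] = from_a[i] | from_b[i];
  return out;
}
```

The two masks depend only on the constant lane indices, so they are exactly the kind of 128-bit value that could be computed once and reused instead of being rebuilt at every use.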

Two options exist for resolving this. 

1) Generating a constant pool for the code to use at compile time and loading the data from there.
2) Lifting sections of multi instruction code up to the graph for optimization and reduction.

Each is a partial solution since both can benefit from the other.

What I'd like to enquire about now is the first option: the feasibility of implementing an isolate root-relative constant pool.

From what I can see, this might be an easy and effective solution covering address moves, security concerns, and alignment.

Since isolate() has a single coherent heap that is garbage collected and moved (https://v8.dev/blog/embedded-builtins), one can allocate objects relative to the root. If you use it with the ExternalReference operand mechanism we've been using, it'll automatically generate all of the address constants relative to the root register (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=124;bpv=1;bpt=1). This should preclude any concerns about addresses moving when the heap moves or gets reallocated. Likewise, isolate()->factory() provides mechanisms for aligning the pointers on each allocation. As such, if an external data structure such as a map or hash map is used to track the constants at code generation time, then each constant can be allocated individually on the isolate heap without respect to any other. If the heap persists the entire duration of the executed code and is deallocated at the end, then there are no memory management concerns. Lastly, there should be no security concerns since the data allocated on the isolate heap will not be executable by default.
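The map-based tracking of constants at code generation time can be sketched as follows. This is only an illustration of the deduplication idea, independent of where the pool's memory ends up living; `Simd128` and `Simd128ConstantPool` are hypothetical names, not V8 classes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Simd128 = std::array<uint8_t, 16>;

// Hypothetical helper: collects 128-bit constants during code generation
// and hands out one byte offset per distinct value.
class Simd128ConstantPool {
 public:
  // Returns the offset of `value` within the pool, adding it on first use.
  // Every entry is 16 bytes, so offsets are always multiples of 16.
  size_t Intern(const Simd128& value) {
    auto it = offsets_.find(value);
    if (it != offsets_.end()) return it->second;
    const size_t offset = data_.size();
    data_.insert(data_.end(), value.begin(), value.end());
    offsets_.emplace(value, offset);
    return offset;
  }

  size_t size_in_bytes() const { return data_.size(); }
  const std::vector<uint8_t>& data() const { return data_; }

 private:
  std::map<Simd128, size_t> offsets_;  // value -> offset in data_
  std::vector<uint8_t> data_;          // pool contents in emission order
};
```

With a pool like this, two shuffles that need the same mask would both receive the same offset, and the mask bytes would be stored only once.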

Is this correct? If so, what's the appropriate process for submitting and reviewing a design proposal?

Dan

Clemens Backes

Mar 17, 2021, 8:19:07 AM
to v8-dev, Zhi An Ng
Hi Dan,

That sounds like an interesting idea. In fact, I considered implementing a Wasm constant pool for floating point constants, but SIMD code might benefit even more.
Note that in Wasm code we do not (currently) have a root register, so any access to the isolate root is a few indirections away. I am also not sure I fully understand how you plan to use external references to refer to dynamically generated constants.

An alternative to allocating on the heap might be allocating in (or close to) the Wasm code space, and use PC-relative addressing. For security reasons, we should try to make the constant pool non-executable, so if we decide to allocate in the Wasm code space (instead of close by) we should reserve a whole page and remove execute permissions from that page.
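For the PC-relative option on x64, the constant has to be reachable with a signed 32-bit displacement measured from the address of the next instruction. A minimal sketch of that reachability check follows; the function names are hypothetical, and the alignment helper reflects the assumption that the constant is used as an aligned 16-byte memory operand.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Computes the disp32 for a RIP-relative operand. On x64 the displacement is
// relative to the address of the instruction following the one being encoded.
// Returns false if the target is out of the signed 32-bit range.
inline bool TryRipRelativeDisp32(uint64_t next_instruction_address,
                                 uint64_t target_address, int32_t* disp) {
  // Unsigned subtraction wraps; reinterpreting as signed gives the distance.
  const int64_t delta =
      static_cast<int64_t>(target_address - next_instruction_address);
  if (delta < std::numeric_limits<int32_t>::min() ||
      delta > std::numeric_limits<int32_t>::max()) {
    return false;
  }
  *disp = static_cast<int32_t>(delta);
  return true;
}

// A 128-bit constant used as an aligned memory operand must sit on a
// 16-byte boundary.
inline bool IsAligned16(uint64_t address) { return (address & 0xF) == 0; }
```

A pool placed "close by" but outside the code space would need to pass this check for every instruction that references it, which is one way to state the placement constraint.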

I think a good next step would be starting a little design doc (see https://v8.dev/docs/design-review-guidelines). I would propose to initially include @Zhi An Ng and me, and we can add more people once we figure out a working solution.

Thanks for working on this!

-Clemens

Zhi An Ng

Mar 17, 2021, 12:16:24 PM
to Clemens Backes, Brown, Andrew, v8-dev
We did a little exploration on this when trying to optimize shuffles, see relevant design doc (https://docs.google.com/document/d/1uCYwyQYjgNAtaXDNgHusDGCV1m9YGhOWJx2eqzv2rdI/edit). We ultimately decided against it because we found another way to improve shuffle performance without adding a constant pool.

+Brown, Andrew fyi, since you have interest in exploring a 128-bit constant pool as well.

Jakob Kummerow

Mar 17, 2021, 12:32:56 PM
to v8-dev, Clemens Backes, Brown, Andrew
I'd like to try to clear up the understanding of memory handling a bit:
There is indeed the option to put stuff into the Isolate, so that root-relative addressing can be used to access it. This makes sense when the amount of data is fixed and statically known (for example: a list of all builtins that exist).
It's also true that there's a 1:1 relation between Isolates and Heaps, but that's the managed heap! If you use the Factory to allocate an object, then it lives on the managed heap. It'll (potentially) move around on GC, you need Handles to refer to it from C++ code, and you can't use root-relative addressing to get to it.

So if you want to store constants required by a specific compiled function (and not particularly likely to be shared by other functions), then their storage should be dynamic and have the same lifetime as the code that needs them. Putting them right inside the code is an easy way to achieve that (though with security downsides, as Clemens points out), and in fact that's what "constant pool" has historically meant in V8. Putting them elsewhere (non-executable) but "nearby" would be even better.
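As a rough picture of the "right inside the code" layout, the sketch below appends a constant pool to the end of an instruction buffer, padding so that the pool starts on a 16-byte boundary. The type and function names are hypothetical, the padding byte 0xCC (int3 on x86) is an arbitrary choice for illustration, and the alignment only holds in memory if the buffer's base address is itself 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: instructions, then padding, then constants.
struct CodeObjectLayout {
  std::vector<uint8_t> bytes;
  size_t pool_offset;  // offset at which the constants begin
};

// Rounds n up to the next multiple of 16.
inline size_t RoundUp16(size_t n) {
  return (n + 15) & ~static_cast<size_t>(15);
}

// Appends `constants` after `code`, padding with 0xCC up to a 16-byte
// boundary so that 128-bit entries stay aligned relative to the buffer start.
inline CodeObjectLayout AppendConstantPool(
    const std::vector<uint8_t>& code, const std::vector<uint8_t>& constants) {
  CodeObjectLayout layout;
  layout.bytes = code;
  layout.pool_offset = RoundUp16(code.size());
  layout.bytes.resize(layout.pool_offset, 0xCC);
  layout.bytes.insert(layout.bytes.end(), constants.begin(), constants.end());
  return layout;
}
```

Because the pool lives in the same buffer as the instructions, it is freed together with the code, which is the lifetime property described above; the cost is that the constant bytes end up on executable pages.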

I hope this helps at least a little bit, and that I'm not misunderstanding the whole idea (not familiar with SIMD).


Dan Weber

Mar 17, 2021, 6:27:52 PM
to v8-...@googlegroups.com
Hi everyone, 

I've summarized the comments, questions, and responses so far at the top here to make this a little easier to read. My own comments, questions, and responses are just below.

Clemens:
- Currently, there is no kRootRegister in WASM SIMD (or at least x64). 
- Any access to data through the Isolate heap would require a few indirections since there wouldn't be any way to calculate a consistent offset or displacement for a specific constant.
- An alternative to using the heap is to allocate data blocks somewhere that's PC-relative (or within 32 bits of RIP). If pages can be allocated in that range and they are not code pages, they're not executable by default. This helps alleviate any security concerns. If allocating close by is not an option, we can use the code page allocator to allocate pages. However, if we do this, we should ensure that those pages are marked as not executable.
- Start putting together a design document and include Zhi and me on it.
- How would you use External References with the Heap?
Zhi:
- Have explored the possibility of using PC-relative/RIP relative addressing in https://docs.google.com/document/d/1uCYwyQYjgNAtaXDNgHusDGCV1m9YGhOWJx2eqzv2rdI/edit
- Proposal was specific to shuffles and abandoned when another solution could provide immediate performance benefits without the complexity of the constant pool.
- There is still interest in a constant pool and it warrants further investigation.
Jakob:
- Isolate is good for builtins and/or anything fixed and static in scope. This might not be a good use case for two reasons:
1) Constants are likely limited in scope to the code using them and are unlikely to get benefit from sharing.
2) If it's allocated with an isolate factory, it now requires a handle since the address can move if the GC moves it.
- A better alternative would be something like what Clemens is describing (a PC relative solution) since that will follow the same lifecycle as the code using it. 
- If the implementation can be made to work in such a way that it's PC relative but not in code space, that's even better, since it alleviates security concerns.

Clemens:
- With respect to External References, we've started using them quite a bit since they have some very nice properties. Regardless of address space, we can make any pointer address available with a movq, not just PC-relative ones, and then use the result as an aligned memory operand for any other instruction (pandn/pshufb...). My thought is that if we can find a way to ensure any given block of memory is deallocated after the code executes (or simply when the code itself falls out of scope), we can build a constant pool anywhere, at any time. In such a case, we could hypothetically have a std::set somewhere in heap space that could be used to deduplicate any/all constants we need and allow for their generation and use during the code generation process. The thought of using the Isolate Heap was appealing if kRootRegister existed and we could always generate a constant displacement, thus eliminating the extra movq instruction.
- Generally speaking, I would love your help and am open to any solution that performs better than constant re-generation with shuffles.  If the PC relative solution is viable and efficient, it's certainly worth testing.
- How would you like me to list you on the V8 design doc?  Do I put you and Zhi as technical leads?  I'm not sure what or whom to put in the LGTM column, and then what the next steps are.  Do we talk offline and then submit it to v8-dev+design?  Or is that where the dialog happens?
Zhi:
- This design doc and the prototype implementation are super helpful even if only for reference.  Thanks.
- With respect to the prototype implementation, does it actually build a constant pool or just inline constants before they're used? I'm curious about anything/everything that's happening in this green block: https://chromium-review.googlesource.com/c/v8/v8/+/2149408/2/src/codegen/x64/assembler-x64.cc#431
Jakob:
- With respect to memory allocations that tie to the scope of the code, what options are available to us if isolate isn't good?
- If we could allocate pages for data that were never in the pathway of being set as executable that would be awesome.  If we can then allocate from said pages each individual aligned constant, it would be really awesome.

Thanks!
Dan



Zhi An Ng

Mar 18, 2021, 12:31:33 PM
to v8-dev
On Wed, Mar 17, 2021 at 3:27 PM Dan Weber <dwe...@gmail.com> wrote:
> - With respect to the prototype implementation, does it actually build a constant pool or just inline constants before they're used? I'm curious about anything/everything that's happening in this green block: https://chromium-review.googlesource.com/c/v8/v8/+/2149408/2/src/codegen/x64/assembler-x64.cc#431

Good question. It inlines the constants into the instruction stream and refers to them via rip-relative addressing. I also just noticed that it doesn't try to share constants across two different calls to e.g. shuffle, which could be a reason why we didn't see performance improvements. (This probably points to the fact that a rip-relative load by itself doesn't give us obvious gains over using 2 or 3 instructions to build the constant.)


Clemens Backes

Mar 19, 2021, 8:05:43 AM
to v8-dev
On Wed, Mar 17, 2021 at 11:28 PM Dan Weber <dwe...@gmail.com> wrote:
> - How would you like me to list you on the V8 design doc?  Do I put you and Zhi as technical leads?  I'm not sure what or whom to put in the LGTM column, and then what the next steps are.  Do we talk offline and then submit it to v8-dev+design?  Or is that where the dialog happens?

I would propose to start with the design doc, and we can have "offline" discussions if we find things that require more in-depth discussion. Just put us as reviewers in the LGTM section, we will also help with the completion of the design doc and potentially add more reviewers if necessary. All this can happen before actually sending the doc to v8-dev.
Before starting actual work on this it would be nice to learn more about the motivation. Do we have evidence that what we currently do is too slow? Maybe it would be possible to prototype something and measure. I can certainly help with that, if we have a benchmark that we think would benefit. But let's discuss this in the design doc.

Jakob Kummerow

Mar 22, 2021, 12:51:02 PM
to v8-dev
On Wed, Mar 17, 2021 at 11:27 PM Dan Weber <dwe...@gmail.com> wrote:
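> - With respect to memory allocations that tie to the scope of the code, what options are available to us if isolate isn't good?
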
We currently do custom memory management for Wasm code objects. So the easiest option is to have constant pools within the code object; I think the alternative would require building some new infrastructure to allow storing data for each WasmCode that has the same lifetime.

> - If we could allocate pages for data that were never in the pathway of being set as executable that would be awesome.  If we can then allocate from said pages each individual aligned constant, it would be really awesome.

Everything is possible, it's "just" a question of effort...
I think a viable path forward might be: use constant pools inside the code object (as mentioned elsewhere in this thread) to evaluate performance. If the performance results indicate that the feature should be productionized, evaluate options for where to store the data: is in-code-object good enough? What would the alternatives look like, and how much effort would they be? That'd be the stuff for a design doc :-)

Andrew Brown

Mar 22, 2021, 1:20:34 PM
to v8-dev
> Maybe it would be possible to prototype something and measure.

I am also interested in this topic; in some microbenchmarks I ran (just looping on shuffle) I noticed a ~15% improvement by using a "constant pool" mask rather than constructing the mask using immediate scalars and RIP-relative deduplication as is done currently. If someone here prototypes this, I believe we can run some benchmarks (e.g. certain MediaPipe models) and share the speed-ups. I originally had it on my to-do list to take Zhi's initial patch (https://chromium-review.googlesource.com/c/v8/v8/+/2149408) a bit further, but I would prefer to benchmark a more "official" prototype. Please keep me in the loop!

Dan Weber

Mar 23, 2021, 6:32:45 PM
to v8-...@googlegroups.com
On Fri, Mar 19, 2021 at 12:05 PM Clemens Backes <clem...@chromium.org> wrote:
> I would propose to start with the design doc, and we can have "offline" discussions if we find things that require more in-depth discussion. Just put us as reviewers in the LGTM section, we will also help with the completion of the design doc and potentially add more reviewers if necessary. All this can happen before actually sending the doc to v8-dev.
> Before starting actual work on this it would be nice to learn more about the motivation. Do we have evidence that what we currently do is too slow? Maybe it would be possible to prototype something and measure. I can certainly help with that, if we have a benchmark that we think would benefit. But let's discuss this in the design doc.

This sounds great.  Be on the lookout later this week or early next. 

Dan Weber

Mar 29, 2021, 5:51:27 PM
to clem...@chromium.org, Jakob Kummerow, Brown, Andrew, Zhi An Ng
Here's the link for the draft proposal: https://docs.google.com/document/d/1W40WUTcLPYE3uBek_olTgiOgyKXMr4BP3o8m9uzv6Y8/edit?usp=sharing [Moving v8-dev to bcc].

Clemens/Jakob -- I hope this answers your questions about motivations and benchmarks.  Inside, you'll also find two options for an initial performance evaluation prototype.  Unless you have an objection to going with option #1 for the initial benchmarks, I'm going to see if I can get it done early this week.

Andrew -- If I can get you a patch by COB Wednesday (3/31), do you think we could start benchmarking by COB the following Wednesday (4/7)?

Thanks!
Dan