Hi all, I’ve run into an issue involving the fast allocator and constrained register classes that I’m looking for pointers on.
Consider a target with a register store instruction using an addressing mode with register base and offset:
*(base + index) = src;
This instruction may be modeled in tablegen with the operands: (outs) (ins REGCLASS:$src, REGCLASS:$base, REGCLASS_CONSTRAINED:$index)
In my particular case, REGCLASS contains 8 registers, R0-R7, and REGCLASS_CONSTRAINED has only 2, R0 and R1. The allocation order for REGCLASS is specifically set up to avoid R0 and R1 via an order that rotates the list left to (R2-R7, R0, R1).
In a piece of code at -O0, I’m at a point where I’ve allocated R2-R7, and the fast allocator sees a store of the form above. It chooses R0 and R1 for $src and $base. Because R0 and R1 are taken, $index can’t be allocated at all, and the allocator emits an error.
I first thought that the backend should be able to prioritize the constrained register first and found the AllocationPriority field in tablegen, but then noted that it was only used in the greedy allocator. I then looked at the fast allocator and noted that it seems to have no method for the target to affect priority or ordering. This is probably by design, so I pose my conundrum to the community since I can’t imagine that this hasn’t come up at some point in the past.
My current options seem to be:
Regards,
J.B. Nagurne
Code Generation
Texas Instruments
Thanks for the response, Matthias.
- Shuffling operands is viable, but I had only considered this somewhat lightheartedly since it seems very fragile to tie the chances that we get an unallocatable situation to something as seemingly unrelated as operand order.
- The option to effectively separate base/src and index allocation sets is also viable, but becomes more problematic as the ISA has fewer and fewer registers. Say if we remove all but R0-R3, it becomes extremely easy to get into pathological situations if we keep the 'general' set to 2 registers and the constrained set to the other 2, especially due to the relative rarity of base+index memory operations relative to other types of operations. I do admit that should this situation occur, the fault lies somewhat on overzealous architecture designers, but it's a possibility to consider nonetheless.
- Fixing index to R0 or R1 prior to allocation does have merit, whose spirit is reflected in the AMDGPU "staged" allocation mechanism. The target could tell the allocator to go through and assign the constrained registers first, and then go through to allocate the rest. I think this is a good solution in the long run, and am very happy that it has been shared upstream.
I did find it interesting that you used 'greedy' in the description for the fast allocator, given that there is a 'Greedy' allocator. I also feel that fast seems far greedier.
Can you or anyone else see an issue with always running the Greedy (tuned Basic) allocator at all optimization levels, or have history as to why Fast was chosen as the default at -O0? I can imagine it might involve the debugability of values as their live ranges are split.
JB
On Jul 21, 2021, at 10:36 PM, Nagurne, James via llvm-dev <llvm...@lists.llvm.org> wrote:Thanks for the response, Matthias.- Shuffling operands is viable, but I had only considered this somewhat lightheartedly since it seems very fragile to tie the chances that we get an unallocatable situation to something as seemingly unrelated as operand order.- The option to effectively separate base/src and index allocation sets is also viable, but becomes more problematic as the ISA has fewer and fewer registers. Say if we remove all but R0-R3, it becomes extremely easy to get into pathological situations if we keep the 'general' set to 2 registers and the constrained set to the other 2, especially due to the relative rarity of base+index memory operations relative to other types of operations. I do admit that should this situation occur, the fault lies somewhat on overzealous architecture designers, but it's a possibility to consider nonetheless.- Fixing index to R0 or R1 prior to allocation does have merit, whose spirit is reflected in the AMDGPU "staged" allocation mechanism. The target could tell the allocator to go through and assign the constrained registers first, and then go through to allocate the rest. I think this is a good solution in the long run, and am very happy that it has been shared upstream.I did find it interesting that you used 'greedy' in the description for the fast allocator, given that there is a 'Greedy' allocator. I also feel that fast seems far greedier.Can you or anyone else see an issue with always running the Greedy (tuned Basic) allocator at all optimization levels, or have history as to why Fast was chosen as the default at -O0?
> • Perhaps attempt to adapt the recently-committed staged-allocation mechanism fromhttps://reviews.llvm.org/rGeebe841a47cb to allocate constrained registers first
>
> Regards,
> J.B. Nagurne
> Code Generation
> Texas Instruments
> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
On Jul 21, 2021, at 10:36 PM, Nagurne, James via llvm-dev <llvm...@lists.llvm.org> wrote:I did find it interesting that you used 'greedy' in the description for the fast allocator, given that there is a 'Greedy' allocator. I also feel that fast seems far greedier.Can you or anyone else see an issue with always running the Greedy (tuned Basic) allocator at all optimization levels, or have history as to why Fast was chosen as the default at -O0? I can imagine it might involve the debugability of values as their live ranges are split.