I don't think the new instructions have equivalent semantics. If the null check
in an iload/istore fails, we need to throw the appropriate language-specific
exception and evaluate it against the landing pad. There may also need to be
an active exception that can later be resumed.
As far as I can tell, the frontend would need to emit IR that calls into the
language runtime to throw the exception manually. That IR would then need to
be recognized by the IR optimization that converts icmp+br+load/store into a
checked load/store. It seems simpler to me to start with the checked
load/store directly.
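To make the comparison concrete, here is a rough sketch of the two forms (the iload spelling and the runtime entry point @runtime_throw_npe are hypothetical stand-ins):

```llvm
; Explicit form: compare, branch, ordinary load, and a manual
; call into the language runtime on the null path.
entry:
  %null = icmp eq i32* %p, null
  br i1 %null, label %throw, label %ok

ok:
  %v = load i32, i32* %p
  ...

throw:
  ; Frontend-emitted call that throws the language-specific
  ; exception; the optimizer would have to recognize this.
  call void @runtime_throw_npe()
  unreachable

; Hypothetical checked form: the null check, the load, and the
; unwind edge are folded into one instruction.
;   %v = iload i32, i32* %p, label %lpad
```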
> ‘invoke’ is different. It is needed because there is no way for the caller to explicitly check for an exception.
>
> We do introduce intrinsics that encapsulate things like overflow checks. This is done to eliminate control flow edges that tend to inhibit LLVM’s optimizer and instruction selection. But you’re not removing the control flow, so this technique does not apply. Null checks should actually be exposed in IR so general optimizations can remove redundant checks.
My idea for removing redundant checks is to teach the IR optimizer to
treat iloads/istores as if they were null checks. Is there any reason
why this wouldn't work?
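A sketch of what I mean, again using the hypothetical iload spelling: once the optimizer knows that a successful iload implies its pointer operand is non-null, a later explicit check against the same pointer becomes foldable.

```llvm
; On the fall-through path of the iload, %p is known non-null,
; so the optimizer can fold the later icmp to false and delete
; the dead branch, just as it would after an explicit null check.
%a = iload i32, i32* %p, label %lpad
...
%c = icmp eq i32* %p, null      ; known false -> redundant
br i1 %c, label %throw, label %cont
```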
> Ideally this would just be a machine code pass that can hoist a load/store above a branch and nuke the compare. However, I can see how it’s easier to relate the compare operand to the address arithmetic at IR level.
>
> To do this at IR level, you could introduce a pre-CodeGen pass that converts cmp+br+load/store into a checked load intrinsic. Since this intrinsic only exists for instruction selection, the optimizer doesn’t need to know about it.
I did initially consider implementing the checked load/store as an intrinsic,
but there are relatively mundane reasons why this won't work at present. For
example, there is currently no way to represent a struct load using an
intrinsic, because there is no name mangling for struct types. Named structs
would also need a mangling that is resistant to renaming. Rather than solve
these problems, I decided to avoid intrinsics entirely.
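To illustrate the mangling problem (the checked-load intrinsic shown is hypothetical):

```llvm
%T = type { i32, i8* }

; Overloaded intrinsics encode their operand types in the name:
declare { i32, i1 } @llvm.uadd.with.overflow.i32(i32, i32)

; But a hypothetical checked load of %T has no such spelling --
; there is nothing to put where the type suffix goes:
;   declare %T @llvm.checked.load.???(%T*, ...)
; And even a name-based scheme for named structs would break when
; the struct is renamed (e.g. %T becoming %T.0 during linking).
```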
> The intrinsic would need to be lowered to an MI pseudo-instruction that feels like a load/store to the backend, but is a terminator. During code emission you could grab the address of that pseudo load/store and its resulting branch target to inform the runtime.
As far as I know, a single IR load can lower to multiple machine
instructions. This will certainly be the case for the Go frontend I'm working
with, as its IR uses struct loads/stores quite frequently, so I'm not sure
this approach will work. I think the lowering needs to look much like the
current lowering for invokes, with a pair of EH_LABELs around a set of
ordinary load/store MIs -- which is how I implemented it.
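Roughly, the lowered machine code would look like this (MIR-style pseudocode; the specific load opcodes are illustrative):

```llvm
; A struct iload of { i32, i32 } lowers to several loads, so a
; single pseudo-instruction can't cover it. Instead, bracket the
; whole group with EH_LABELs, as invoke lowering does for calls:
;
;   EH_LABEL <label 1>
;   %r0 = LDRWui %p, 0      ; first field
;   %r1 = LDRWui %p, 4      ; second field
;   EH_LABEL <label 2>
;
; The runtime's tables then map the [label 1, label 2) range to
; the landing pad, exactly as they do for an invoke's call site.
```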
Thanks,
--
Peter
Thanks for sharing these insights.
Since I've already started on an implementation, I think I'll experiment
locally and see whether these new instructions improve performance in the
context of the compiler I'm working on -- I'll probably need to implement
explicit null checks anyway to support targets that don't allow unwinding
through signal handlers. Once I have some numbers I'll re-evaluate.
Thanks,
Peter
I agree with Filip that implementing null checks as trapping loads is essentially a minor code-size optimization that is typically not worthwhile. However, my point is that this decision should not be made at IR level.
It could make sense to have a null check intrinsic that encapsulates and+cmp+br+invoke. I just don't like the idea of it subsuming the load/store. I think that will complicate both null check optimization and load/store optimization.
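For reference, the pattern such an intrinsic would encapsulate is roughly the following (@throw_npe is a stand-in for the frontend's runtime entry point):

```llvm
; The and+cmp+br+invoke sequence a null check intrinsic could
; wrap -- note the load itself stays outside the pattern.
%null = icmp eq i32* %p, null
br i1 %null, label %throw, label %ok

throw:
  invoke void @throw_npe()
          to label %unreach unwind label %lpad

ok:
  %v = load i32, i32* %p
```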
Whether a null check can be profitably implemented as a trapping load/store is specific to the target platform. That decision should be independent both of optimization in the presence of null checks and of optimization of the null checks themselves; it is purely a codegen issue.
Imagine someone else porting your frontend to a new platform. They should not be required to implement trapping loads/stores to get a working system, and either way they should benefit from the same platform-independent optimizations.