Hi,
I wanted to revive the discussion about supporting asm goto (Bug 9295).
As was already proposed, the best way seems to be introducing new IR.
If we’re changing the IR, we should probably provide an infrastructure that solves or at least enables future support for things like:
1. MS-style inline asm jmps and goto (Bug 24529)
2. Analyzing symbols defined/referenced in the inline assembly (Bug 28970), taking into account module/file-scope inline assembly.
3. Providing some information about the cost of the inline assembly? (I’m not sure if we want to couple it with this issue, nor whether the cost should be represented in this new IR or in some other way.)
This new IR should contain:
1. A list of symbols referenced by the inline assembly
a. Should we try to provide better analysis of how we use these symbols (e.g. in jmp instructions, in call instructions)? Can it help us do some error checking?
2. A list of symbols defined by the inline assembly
a. Should we provide additional information about how these symbols are defined (e.g. which symbols are defined as globals)?
At a later stage we also need to discuss other aspects of this new instruction (e.g. it being a block terminator and other behaviors we might want to be affected by it).
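To make the shape of the discussion concrete, here is one possible IR form for such an instruction. This is purely hypothetical syntax: the "asmgoto" keyword and the invoke-style successor list are placeholders invented for illustration, not a concrete proposal.

```
; Hypothetical syntax sketch only - "asmgoto" and the successor list
; are invented here for illustration.
define i32 @foo(i32 %x) {
entry:
  ; A terminator form of the inline-asm call: it names an explicit
  ; fall-through successor plus every C label the asm may jump to.
  asmgoto void asm sideeffect "0: nop", "r"(i32 %x)
      to label %fallthru [label %trace]

fallthru:                     ; normal path
  ret i32 0

trace:                        ; out-of-line path reachable from the asm
  ret i32 1
}
```

Something invoke-like seems natural because the verifier and CFG utilities already understand terminators with multiple listed successors.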
Any other thoughts and ideas we want to include in this change that I’m missing?
Thanks,
Marina
---------------------------------------------------------------------
Intel Israel (74) Limited
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
The Linux kernel uses the “asm goto” feature, and other projects probably use it as well. I think this provides motivation to support it in LLVM.
Regarding the complexity, I believe there is some infrastructure that we can at least partially reuse (the support for “indirectbr” instruction).
My focus is adding “asm goto” support; the other things are indeed completely orthogonal and came up in bugs related to Bug 9295.
The reason I mentioned them in this discussion is that they seem to require IR change as well.
I thought that if we’re doing this IR change for “asm goto”, it’s worth checking whether there is additional useful information we want to expose besides the C labels used by the inline assembly.
If you prefer to separate the IR changes for “asm goto” and the IR changes that will allow exposing defined symbols, then we can focus on the “asm goto” feature alone.
Thanks,
Marina
I haven't seen anything in all of pkgsrc. Just as a data point.
Joerg
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> Linux kernel is using the “asm goto” feature, other projects probably use it as well.
The asm goto feature was introduced in GCC in order to optimize the support for tracepoints in the Linux kernel (it can also be used for other things that do nop patching).
GCC documentation describes their motivating example here:
https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/Extended-Asm.html
#define TRACE1(NUM)                       \
  do {                                    \
    asm goto ("0: nop;"                   \
              ".pushsection trace_table;" \
              ".long 0b, %l0;"            \
              ".popsection"               \
              : : : : trace#NUM);         \
    if (0) { trace#NUM: trace(); }        \
  } while (0)
#define TRACE TRACE1(__COUNTER__)
In this example (which in fact inspired the asm goto feature) we want on rare occasions to call the trace function; on other occasions we'd like to keep the overhead to the absolute minimum. The normal code path consists of a single nop instruction. However, we record the address of this nop together with the address of a label that calls the trace function. This allows the nop instruction to be patched at run time to be an unconditional branch to the stored label. It is assumed that an optimizing compiler moves the labeled block out of line, to optimize the fall through path from the asm.
Here is the Linux kernel RFC which discusses the old C way of implementing it and the performance issues that were noticed.
It also states some performance numbers of the old C code vs. the asm goto:
https://lwn.net/Articles/350714/
This LTTng (Linux Trace Toolkit Next Generation) presentation talks about using this feature as a way of optimizing static tracepoints (slides 3-4):
https://www.computer.org/cms/ComputingNow/HomePage/2011/0111/rW_SW_UsingTracing.pdf
This presentation also mentions that a lot of other Linux applications use this tracing mechanism.
I believe we already have much of the infrastructure in place (using the indirectbr instruction infrastructure).
We do need to make sure MachineBlockPlacement optimizes the fall-through path so that we can gain the performance benefit of the nop patching.
Thanks,
Marina
From: Chandler Carruth [mailto:chan...@gmail.com]
Sent: Thursday, March 30, 2017 23:22
To: Yatsina, Marina <marina....@intel.com>; llvm...@lists.llvm.org; r...@google.com; jykn...@google.com; eh...@mozilla.com; rjmc...@apple.com; mehdi...@apple.com; ma...@braunis.de; Tayree, Coby <coby....@intel.com>
Subject: Re: [llvm-dev] [inline-asm][asm-goto] Supporting "asm goto" in inline assembly
Just responding to the motivation stuff as that remains an open question:
On Thu, Mar 30, 2017 at 4:44 PM Yatsina, Marina <marina....@intel.com> wrote:
> Linux kernel is using the “asm goto” feature,
But your original email indicated they have an alternative code path for compilers that don't support it?
What might be compelling would be if there are serious performance problems when using the other code path that cannot be addressed by less invasive (and more general) improvements to LLVM. If this is the *only* way to get comparable performance from the Linux Kernel, then I think that might be an interesting discussion. But it would take a very careful and detailed analysis of why IMO.
> other projects probably use it as well.
This is entirely possible, but I'd like to understand which projects and why they use it rather than any of the alternatives before we impose the implementation complexity on LLVM. At least that's my two cents.
-Chandler
_______________________________________________
On Apr 4, 2017, at 9:26 AM, Chandler Carruth via llvm-dev <llvm...@lists.llvm.org> wrote:
On Tue, Apr 4, 2017 at 6:07 AM Yatsina, Marina via llvm-dev <llvm...@lists.llvm.org> wrote:
On Apr 4, 2017, at 2:12 PM, Matthias Braun <ma...@braunis.de> wrote:
My two cents:
- I think inline assembly should work even if the compiler cannot parse the contents. This would rule out msvc inline assembly (or alternatively put all the parsing and interpretation burden on the frontend), but would work with gcc asm goto, which specifies possible targets separately.
- Supporting control flow in inline assembly by allowing jumps out of an assembly block seems natural to me.
- Jumping into an inline assembly block seems like an unnecessary feature to me.
- To have this working in lib/CodeGen we would need an alternative opcode with the terminator flag set. (There should also be opportunities to remodel some instruction flags in the backend, to be part of the MachineInstr instead of the opcode, but that is an orthogonal discussion to this.)
- I don't foresee big problems in CodeGen; we should take a look at how computed goto is implemented to find ways to reference arbitrary basic blocks.
- The register allocator fails when the terminator instruction also writes a register which is subsequently spilled (none of the existing targets does that, but you could specify this situation in inline assembly).
- I'd always prefer intrinsics over inline assembly. Hey, why don't we add a -Wassembly that warns on inline assembly usage and is enabled by default...
- I still think inline assembly is valuable for new architecture bringup/experimentation situations.
> - The register allocator fails when the terminator instruction also writes a register which is subsequently spilled (none of the existing targets does that, but you could specify this situation in inline assembly).
You can still have register clobbers, which I suppose might trigger the same failure case?
On Apr 4, 2017, at 11:44 AM, John McCall via llvm-dev <llvm...@lists.llvm.org> wrote:
> On Apr 4, 2017, at 2:12 PM, Matthias Braun <ma...@braunis.de> wrote:
> [snip]
To me, this feels like a great example of "we really wanted a language feature, but we figured out that we could hack it in using inline assembly in a way that's ultimately significantly harder for the compiler to support than a language feature, and now it's your problem."
I agree with Chandler that we should just design and implement the language feature.
I would recommend:

  if (__builtin_patchable_branch("section name")) {
    trace();
  }

==>

  %0 = call i1 @llvm.patchable_branch(i8* @sectionNameString)
  br i1 %0, ...

where @llvm.patchable_branch has the semantics of appending whatever patching information is necessary to the given section such that, if you apply the patch, it will change the result of the call from 0 to 1. That can then typically be pattern-matched in the backend to get the optimal codegen.
If I might recommend a better ABI for the patching information: consider using a pair of relative pointers, one from the patching information to the patchable instruction, and one from the patchable instruction to the new target. That would allow the patching information to be relocated at zero cost.
The actual details of how to apply the patch, and what the inline patchable-instruction sequence needs to be in order to accept the patch, would be target-specific. The documented motivating example seems to assume that a single nop is always big enough, which is pretty questionable.
This feature could be made potentially interesting to e.g. JIT authors by allowing the patching information to be embellished with additional information to identify the source branch.
> > I completely agree that for this example we rather want a proper
> > intrinsic. As a matter of fact we have similar mechanism in CodeGen
> > already to support the XRay feature.
Also, we're very much _NOT_ going to support anything 'similar' but
completely different. It's asm-goto or nothing.
Let's not conflate the asm-goto part with the .pushsection/.popsection.
The latter ("0: .pushsection foo; .long 0b; .popsection") is used *all*
over the kernel to build up tables of code locations — for exception
handling of instructions which might fault, as well as for runtime
patching of instructions like the above. It's not always a nop vs. call
alternative.
It would be nice to have the compiler assist with that. We currently
have code to trawl through all the built object files and find calls to
__fentry__ so we can patch them in/out at runtime, for example. And we
might consider doing the same for calls to the retpoline thunks.
But I think we would be best served right now by considering that out
of scope, and looking *only* at the part which is handled by 'asm
goto'.