> > The new LTO EXPORT_SYMBOL references symbols even without CONFIG_MODULES.
> > Since these functions are macros in this case this doesn't work.
> > Add an ifdef to fix the build.
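The failure mode described above can be sketched in plain C. This is a minimal illustration, not the actual patch: `DEMO_CONFIG_MODULES`, `DEMO_EXPORT_SYMBOL` and `demo_module_refcount` are hypothetical stand-ins for the kernel's config option, export macro and the affected functions. An export macro that references the symbol only works when a real function exists, so the export sits under the same ifdef that chooses between the function and the macro stub.

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_MODULES; set to 0 to mimic a
 * build without module support. */
#define DEMO_CONFIG_MODULES 1

/* Hypothetical export macro that references the symbol by storing
 * its address, which requires a real function to exist. */
#define DEMO_EXPORT_SYMBOL(sym) \
    int (*const demo_export_##sym)(void) = sym

#if DEMO_CONFIG_MODULES
/* A real function exists, so taking a reference to it is fine. */
int demo_module_refcount(void)
{
    return 1;
}
DEMO_EXPORT_SYMBOL(demo_module_refcount);
#else
/* Only a macro stub exists: there is no symbol to reference, so the
 * export is left out under the same ifdef. */
#define demo_module_refcount() (0)
#endif

/* Callers use call syntax only, so this works in both configurations. */
int demo_query(void)
{
    int v = demo_module_refcount();

    assert(v == DEMO_CONFIG_MODULES);
    return v;
}
```

Flipping `DEMO_CONFIG_MODULES` to 0 still compiles because nothing outside the guarded region refers to the symbol itself.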
> > This rather large patchkit enables gcc Link Time Optimization (LTO)
> > support for the kernel.
> > With LTO gcc will do whole program optimizations for
> > the whole kernel and each module. This increases compile time,
> > but can generate faster code.
> By how much does it increase compile time?
All numbers are preliminary at this point. I'm still missing some of the
code quality and compile time improvements that LTO could deliver, because
I have to work around some issues that are fixable.
Compile time:
Compilation slowdown depends on the largest binary size. I see between 50% and 4x. The 4x case is mainly for allyes (so unlikely); a normal
distro build, which is mostly modular, or a defconfig like build is more
towards the 50%.
Currently I have to disable slim LTO, which essentially means everything
is compiled twice. Once that's fixed it should compile faster for
the normal case too (although it will still be slower than non LTO).
A lot of the overhead on the larger builds also comes from some specific
gcc code that I'm working with the gcc developers on to improve.
So the 4x extreme case will hopefully go down.
The large builds also currently suffer from too much memory
consumption. That will hopefully improve too, as gcc improves.
I wouldn't expect anyone to use it for day to day kernel hacking
(I understand that a 50% slowdown is annoying for that). It's more like a
"release build" mode.
The performance is currently also missing some improvements due
to workarounds.
Performance:
Hackbench goes about 5% faster, so the scheduler benefits. Kbuild
is not changing much. Various network benchmarks over loopback
go faster too (best case seen 18%+), so the network stack seems to
benefit. A lot of micro benchmarks go faster, sometimes by larger
margins. There are some minor regressions.
A lot of benchmarking on larger workloads is still outstanding.
But the existing numbers are promising I believe. Things will still
change, it's still early.
I would welcome any benchmarking from other people.
I also expect gcc to do more LTO optimizations in the future, so we'll
hopefully see more gains over time. Essentially it gives more
power to the compiler.
Long term it would also help the kernel source organization. For example
there's no reason with LTO to have gigantic includes with large inlines,
because cross file inlining works in an efficient way without reparsing.
In theory (but that's not realized today) the automatic repartitioning of
compilation units could improve compile time with lots of small files.
>> Isn't this a little too harsh? Rather than not using popcnt at all, why don't
>> you just add the necessary clobbers to the asm() in the LTO case?
> gcc lacks the means to declare that an asm uses an external symbol
> currently. Ok we could make it visible. But there's no way to make the
> special calling convention work anyways, at least not without someone
> changing gcc to allow to declare this per function.
That's not the point: The point really is that you could allow the
alternative regardless of LTO, and just penalize the LTO case
by having even the asm clobber the registers that a function call
would not preserve.
> I'm not sure the optimization is really worth it anyways, hweight should
> be uncommon.
That's a separate question (but I sort of agree - not sure whether
CPU mask weights ever get calculated on hot paths).
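For context, the generic fallback that the popcnt alternative stands in for is an ordinary C function. A minimal sketch follows, assuming a 32-bit weight; `demo_sw_hweight32` is a hypothetical name used for illustration, and `__builtin_popcount` is the GCC/Clang builtin, used here only as a cross-check. Because such a fallback is a normal call, it may overwrite every register the calling convention does not preserve, which is the set the asm would have to list as clobbers in the LTO case.

```c
#include <assert.h>
#include <stdint.h>

/* Software population count (Hamming weight) of a 32-bit word. */
unsigned int demo_sw_hweight32(uint32_t w)
{
    w -= (w >> 1) & 0x55555555u;                      /* 2-bit partial sums */
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u); /* 4-bit partial sums */
    w = (w + (w >> 4)) & 0x0f0f0f0fu;                 /* 8-bit partial sums */
    return (w * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Cross-check a few sample values against the compiler builtin. */
void demo_hweight_selfcheck(void)
{
    const uint32_t samples[] = { 0u, 1u, 0xf0f0u, 0x80000001u, 0xffffffffu };
    unsigned int i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(demo_sw_hweight32(samples[i]) ==
               (unsigned int)__builtin_popcount(samples[i]));
}
```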
>>> On 19.08.12 at 17:01, Andi Kleen <a...@firstfloor.org> wrote:
> On Sun, Aug 19, 2012 at 09:46:04AM +0100, Jan Beulich wrote:
>> >>> Andi Kleen <a...@firstfloor.org> 08/19/12 5:05 AM >>>
>> >Work around a LTO gcc problem: when there is no reference to a variable
>> >in a module it will be moved to the end of the program. This causes
>> >reordering of initcalls which the kernel does not like.
>> >Add a dummy reference function to avoid this. The function is
>> >deleted by the linker.
>> This is not even true on x86, not to speak of generally.
> Why is it not true ?
> __initcall is only defined for !MODULE and there __exit discards.
__exit, on x86 and perhaps other arches, causes the code
to be discarded at runtime only.
>> >+#ifdef CONFIG_LTO
>> >+/* Work around a LTO gcc problem: when there is no reference to a variable
>> >+ * in a module it will be moved to the end of the program. This causes
>> >+ * reordering of initcalls which the kernel does not like.
>> >+ * Add a dummy reference function to avoid this. The function is
>> >+ * deleted by the linker.
>> >+ */
>> >+#define LTO_REFERENCE_INITCALL(x) \
>> >+ ; /* yes this is needed */ \
>> >+ static __used __exit void *reference_##x(void) \
>> Why not put it into e.g. section .discard.text? That could be expected to be
>> discarded by the linker without being arch dependent, as long as all arches
>> use DISCARDS in their linker script.
> That's what __exit does, doesn't it?
No - see above. Using .discard.* enforces the discarding at link
time.
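The suggestion can be sketched in C. This is a minimal illustration under stated assumptions, not the patch itself: all `demo_` and `DEMO_` names are hypothetical, and the dummy function only disappears at link time when the linker script discards `.discard.*` input sections. A default userspace link keeps the section, so the program below shows only where the dummy reference is placed.

```c
#include <assert.h>

typedef int (*demo_initcall_t)(void);

/* Place the dummy in .discard.text on ELF targets. Whether it is
 * dropped depends on the linker script, which is an assumption here. */
#if defined(__ELF__)
#define DEMO_DISCARD __attribute__((used, section(".discard.text")))
#else
#define DEMO_DISCARD __attribute__((used))
#endif

/* Hypothetical counterpart of the quoted macro: a dummy function whose
 * only job is to hold a reference to the initcall variable x. */
#define DEMO_LTO_REFERENCE_INITCALL(x) \
    static DEMO_DISCARD void *demo_reference_##x(void) \
    { \
        return &x; \
    }

int demo_init_count;

static int demo_driver_init(void)
{
    demo_init_count++;
    return 0;
}

/* Stand-in for an initcall table entry. */
static demo_initcall_t demo_initcall_driver = demo_driver_init;
DEMO_LTO_REFERENCE_INITCALL(demo_initcall_driver)

/* Run the registered initcall the way an init loop would. */
int demo_run_initcall(void)
{
    assert(demo_initcall_driver != 0);
    return demo_initcall_driver();
}
```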
> That's not the point: The point really is that you could allow the
> alternative regardless of LTO, and just penalize the LTO case
> by having even the asm clobber the registers that a function call
> would not preserve.
That's just what a normal call does, right?
-Andi
-- a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
>>> On 20.08.12 at 13:18, Andi Kleen <a...@linux.intel.com> wrote:
>> That's not the point: The point really is that you could allow the
>> alternative regardless of LTO, and just penalize the LTO case
>> by having even the asm clobber the registers that a function call
>> would not preserve.
On Sat, Aug 18, 2012 at 7:56 PM, Andi Kleen <a...@firstfloor.org> wrote:
> From: Andi Kleen <a...@linux.intel.com>
> Every includer of vvar.h currently gets own static variables
> for all the vvar addresses. Generate just one set each for the
> main kernel and for the vdso. This saves some data space.
> Cc: Andy Lutomirski <l...@mit.edu>
> Signed-off-by: Andi Kleen <a...@linux.intel.com>
[This doesn't apply to -linus or to 3.5, so I haven't actually tested it.]
NACK, without significant further evidence that this is a good idea.
On input like this:
static const int * const vvaraddr_test = 0xffffffffff601000;
Note, in particular, that (a) the load from the vvar uses an immediate
memory operand (this avoids a cacheline access, which is a measurable
speedup) and (b) vvaraddr_test was not emitted as data at all.
Your code will force each vvar address to be emitted as data and will
cause each reference to reference it as data. Barring cleverness (and
I don't remember whether the vdso build is currently clever), this
could result in double-indirect access via the GOT from the vdso.
This kind of change IMO needs actual size measurements, benchmarks,
and some evidence that duplicate .data/.rodata things were emitted.
> > > This rather large patchkit enables gcc Link Time Optimization (LTO)
> > > support for the kernel.
> > > With LTO gcc will do whole program optimizations for
> > > the whole kernel and each module. This increases compile time,
> > > but can generate faster code.
> > By how much does it increase compile time?
> All numbers are preliminary at this point. I miss both some
> code quality and compile time improvements that it could do,
> to work around some issues that are fixable.
> Compile time:
> Compilation slowdown depends on the largest binary size. I
> see between 50% and 4x. The 4x case is mainly for allyes (so
> unlikely); a normal distro build, which is mostly modular, or
> a defconfig like build is more towards the 50%.
> Currently I have to disable slim LTO, which essentially means
> everything is compiled twice. Once that's fixed it should
> compile faster for the normal case too (although it will be
> still slower than non LTO)
The other hope would be that if LTO is used by a high-profile project like the Linux kernel then the compiler folks might look at it and improve it.
> A lot of the overhead on the larger builds is also some
> specific gcc code that I'm working with the gcc developers on
> to improve. So the 4x extreme case will hopefully go down.
> The large builds also currently suffer from too much memory
> consumption. That will hopefully improve too, as gcc improves.
Are there any LTO build files left around, blowing up the size of the build tree?
> I wouldn't expect anyone using it for day to day kernel hacking
> (I understand that 50% are annoying for that). It's more like a
> "release build" mode.
> The performance is currently also missing some improvements
> due to workarounds.
> Performance:
> Hackbench goes about 5% faster, so the scheduler benefits.
> Kbuild is not changing much. Various network benchmarks over
> loopback go faster too (best case seen 18%+), so the network
> stack seems to benefit. A lot of micro benchmarks go faster,
> sometimes larger numbers. There are some minor regressions.
> A lot of benchmarking on larger workloads is still
> outstanding. But the existing numbers are promising I believe.
> Things will still change, it's still early.
> I would welcome any benchmarking from other people.
> I also expect gcc to do more LTO optimizations in the future,
> so we'll hopefully see more gains over time. Essentially it
> gives more power to the compiler.
> Long term it would also help the kernel source organization.
> For example there's no reason with LTO to have gigantic
> includes with large inlines, because cross file inlining works
> in a efficient way without reparsing.
Can the current implementation of LTO optimize to the level of inlining? A lot of our include file hell situation results from the desire to declare structures publicly so that inlined functions can use them directly.
If data structures could be encapsulated/internalized to subsystems and only global functions are exposed to other subsystems [which are then LTO optimized] then our include
file dependencies could become a *lot* simpler.
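The encapsulation described above can be sketched in C. This is a minimal single-file illustration with hypothetical `demo_counter` names: the part marked as the public header exposes only an opaque type and out-of-line functions, while the struct layout stays private to the subsystem. Without LTO every access is a real call; whether LTO inlines such accessors across files in a given build is an assumption that would need checking in the generated code.

```c
#include <assert.h>

/* ---- what a public header would contain ---- */
struct demo_counter;                    /* opaque to other subsystems */
struct demo_counter *demo_counter_get(void);
void demo_counter_add(struct demo_counter *c, int n);
int demo_counter_read(const struct demo_counter *c);

/* ---- what stays private in the subsystem's own .c file ---- */
struct demo_counter {
    int value;
};

static struct demo_counter demo_the_counter;

struct demo_counter *demo_counter_get(void)
{
    return &demo_the_counter;
}

void demo_counter_add(struct demo_counter *c, int n)
{
    assert(c != 0);
    c->value += n;
}

int demo_counter_read(const struct demo_counter *c)
{
    assert(c != 0);
    return c->value;
}
```

Users of the header never see `value`, so changing the struct layout does not force them to be recompiled or to include the subsystem's internal headers.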
On Tue, Aug 21, 2012 at 09:49:21AM +0200, Ingo Molnar wrote:
> > A lot of the overhead on the larger builds is also some
> > specific gcc code that I'm working with the gcc developers on
> > to improve. So the 4x extreme case will hopefully go down.
> > The large builds also currently suffer from too much memory
> > consumption. That will hopefully improve too, as gcc improves.
> Are there any LTO build files left around, blowing up the size
> of the build tree?
Hi Ingo,
Joe Mario from Red Hat has been assisting Andi with his LTO work. One of
the ideas he had which may help here is to push the LTO granularity down
to the directory level. This would allow subsystem maintainers to opt-in
and keep the compile overhead consistent across randconfigs as the linker
would have a smaller pool of files to deal with.
Joe was wondering: if he hacked something up for the scheduler directory
only, are there some preferred benchmark tools he could run to verify
whether there is a performance increase or not?
> Can the current implementation of LTO optimize to the level of
> inlining? A lot of our include file hell situation results from
> the desire to declare structures publicly so that inlined
> functions can use them directly.
> If data structures could be encapsulated/internalized to
> subsystems and only global functions are exposed to other
> subsystems [which are then LTO optimized] then our include
> file dependencies could become a *lot* simpler.
I think modules break this (if I understand what you mean correctly).
If the main kernel exposes symbol x as a global function, then lto will
not inline it into a module.
-- error compiling committee.c: too many arguments to function
> The other hope would be that if LTO is used by a high-profile
> project like the Linux kernel then the compiler folks might look
> at it and improve it.
Yes definitely. I already got a lot of help from toolchain people.
> > A lot of the overhead on the larger builds is also some
> > specific gcc code that I'm working with the gcc developers on
> > to improve. So the 4x extreme case will hopefully go down.
> > The large builds also currently suffer from too much memory
> > consumption. That will hopefully improve too, as gcc improves.
> Are there any LTO build files left around, blowing up the size
> of the build tree?
The objdir size increases from the intermediate information in the objects,
even though it's compressed. A typical LTO objdir is about 2.5x as big as
non LTO.
[this will go down a bit with slim LTO; right now there is an unnecessary
copy of the non LTOed code too; but I expect it will still be
significantly larger]
There's also the TMPDIR problem. If you put /tmp in tmpfs and gcc
defaults to putting the intermediate files during the final link into /tmp,
the memory fills up even faster, because tmpfs is competing
with anonymous memory.
4.7 improved a lot over 4.6 for this with better partitioning; with 4.6 I
had some spectacular OOMs. 4.6 is not supported for LTO anymore now,
with 4.7 it became much better.
I also hope tmpfs will get better algorithms eventually that make
this less likely.
Anyways this can be overridden by setting TMPDIR to the object directory.
With TMPDIR set and not too aggressive -j* for most kernels you should
be ok with 4GB of memory. Just allyes still suffers.
This was one of the reasons why I made it not default for allyesconfig.
> > so we'll hopefully see more gains over time. Essentially it
> > gives more power to the compiler.
> > Long term it would also help the kernel source organization.
> > For example there's no reason with LTO to have gigantic
> > includes with large inlines, because cross file inlining works
> > in a efficient way without reparsing.
> Can the current implementation of LTO optimize to the level of
> inlining? A lot of our include file hell situation results from
Yes, it does cross file inlining. Maybe a bit too much even
(Currently there are about 40% less static CALLs when LTOed)
In fact some of the current workarounds limit it, so there may be
even more in the future.
One side effect is that backtraces are harder to read. You'll
need to rely more on addr2line than before (or we may need
to make kallsyms smarter)
It only inlines inside a final binary though, as Avi mentioned,
so it's more useful inside a subsystem for modular kernels.
> If data structures could be encapsulated/internalized to
> subsystems and only global functions are exposed to other
> subsystems [which are then LTO optimized] then our include
> file dependencies could become a *lot* simpler.
Yes, long term we could have these benefits.
BTW I should add LTO does more than just inlining:
- Drop unused global functions and variables
  (so may cut down on ifdefs)
- Detect type inconsistencies between files
- Partial inlining (inline only parts of a function like a test
  at the beginning)
- Detect pure and const functions without side effects that can be
  more aggressively optimized in the caller.
- Detect global clobbers globally. Normally any global call has to
  assume all global variables could be changed. With LTO information
  some of them can be cached in registers over calls.
- Detect read only variables and optimize them
- Optimize arguments to global functions (drop unnecessary arguments,
  optimize input/output etc.)
- Replace indirect calls with direct calls, enabling other
  optimizations.
- Do constant propagation and specialization for functions. So if a
  function is called commonly with a constant it can generate a special
  variant of this function optimized for that. This still needs more
  tuning (and currently the code size impact is on the largish side),
  but I hope to eventually have e.g. a special kmalloc optimized for
  GFP_KERNEL. It can also in principle inline callbacks.
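The specialization item can be sketched by hand. This is a minimal illustration, not gcc output: `demo_reserve` and the `DEMO_GFP_*` flags are hypothetical, and the second function is a hand-written version of the kind of variant that constant propagation could generate when most callers pass the same constant flag.

```c
#include <assert.h>

/* Hypothetical flags, loosely modelled on allocation flags. */
enum { DEMO_GFP_KERNEL = 0x1, DEMO_GFP_ATOMIC = 0x2 };

/* Generic version: has to test the flags on every call. */
unsigned int demo_reserve(unsigned int size, unsigned int flags)
{
    unsigned int rounded = (size + 7u) & ~7u;   /* round up to 8 bytes */

    if (flags & DEMO_GFP_ATOMIC)
        return rounded;                         /* no extra bookkeeping */
    return rounded + 8u;                        /* room for bookkeeping */
}

/* Hand-written equivalent of a variant specialized for
 * flags == DEMO_GFP_KERNEL: the flag test has been folded away. */
unsigned int demo_reserve_gfp_kernel(unsigned int size)
{
    return ((size + 7u) & ~7u) + 8u;
}

/* The specialized variant must agree with the generic one. */
void demo_specialization_selfcheck(void)
{
    unsigned int size;

    for (size = 0; size < 256u; size++)
        assert(demo_reserve(size, DEMO_GFP_KERNEL) ==
               demo_reserve_gfp_kernel(size));
}
```

The cost mentioned above is visible even here: every specialized variant is an extra copy of the function body.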
-Andi
-- a...@linux.intel.com -- Speaking for myself only.
> This rather large patchkit enables gcc Link Time Optimization (LTO)
> support for the kernel.
> With LTO gcc will do whole program optimizations for
> the whole kernel and each module. This increases compile time,
> but can generate faster code.
> LTO allows gcc to inline functions between different files and
> do various other optimization across the whole binary.
This looks quite nice overall. Have you seen other disadvantages
besides bugs and compile time? There are two possible issues that
I can see happening:
* Debuggability: When we get more aggressive optimizations, it
often becomes harder to trace back object code to a specific source
line, which may be a reason for distros not to enable it for their
product kernels in the end because it can make the work of their
support teams harder.
* Stack consumption: If you do more inlining, the total stack usage
of large functions can become higher than what the deepest path through
the same code in the non-inlined version would be. This bites us
more in the kernel than in user applications, which have much more
stack space available.
Have you noticed problems with either of these so far? Do you think
they are realistic concerns or is the LTO implementation good enough
that they would rarely become an issue?
On Wed, Aug 22, 2012 at 08:43:35AM +0000, Arnd Bergmann wrote:
> On Sunday 19 August 2012, Andi Kleen wrote:
> > -static struct e1000_mac_operations e1000_mac_ops_82575 = {
> > +/* Workaround for LTO bug */
> > +__visible struct e1000_mac_operations e1000_mac_ops_82575 = {
> The comment is not very clear outside the context of this patch.
> Maybe change it to /* __visible added to work around an LTO bug */.
I hope to remove this soon, just needs another fix for initcalls
first.
-Andi
-- a...@linux.intel.com -- Speaking for myself only.
On Wed, Aug 22, 2012 at 08:58:02AM +0000, Arnd Bergmann wrote:
> * Debuggability: When we get more aggressive optimizations, it
> often becomes harder to trace back object code to a specific source
> line, which may be a reason for distros not to enable it for their
> product kernels in the end because it can make the work of their
> support teams harder.
Yes, that's a potential issue with the larger functions. People looking
at oopses may need to rely more on addr2line with debug info. It's probably
less of an issue for distributions (who should have debug info for their
kernels and may even use crash instead of only oops logs), but more for
random reports on linux-kernel.
That said for the few LTO crashes I looked at it wasn't that big an issue.
Usually the inline chains are still broken up by indirect calls, and a lot
of kernel paths have that, so I could make sense of all the backtraces
without debug info.
> * Stack consumption: If you do more inlining, the total stack usage
> of large functions can become higher than what the deepest path through
> the same code in the non-inlined version would be. This bites us
> more in the kernel than in user applications, which have much more
> stack space available.
Newer gcc has a heuristic to not inline when the stack frame gets too
large. We set that option. Also there's a warning for too large
stack frames. With these two together we should be pretty safe.
iirc the warning mostly showed up in some staging drivers which were likely
already too large on their own. I haven't hunted for it explicitly,
but I don't remember seeing it much in other places. Also it was always
still in a range that does not necessarily crash.
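The stack concern and the usual manual remedy can be sketched in C. This is a minimal illustration with hypothetical `demo_` names and buffer sizes: two helpers each need a large local buffer, and `noinline` (a GCC/Clang function attribute) keeps each buffer in its own short-lived frame instead of allowing both to be merged into the caller's frame. GCC's `-Wframe-larger-than=` warning can be used to check frame sizes; whether that is the exact warning referred to above is an assumption.

```c
#include <assert.h>

#define DEMO_NOINLINE __attribute__((noinline))
#define DEMO_BUF_SIZE 512

/* Each helper owns a large buffer only for the duration of its call. */
static DEMO_NOINLINE unsigned int demo_sum_pattern(void)
{
    unsigned char buf[DEMO_BUF_SIZE];
    unsigned int i, sum = 0;

    for (i = 0; i < DEMO_BUF_SIZE; i++)
        buf[i] = (unsigned char)(i & 0xff);
    for (i = 0; i < DEMO_BUF_SIZE; i++)
        sum += buf[i];
    return sum;                 /* 2 * (0 + 1 + ... + 255) = 65280 */
}

static DEMO_NOINLINE unsigned int demo_sum_ones(void)
{
    unsigned char buf[DEMO_BUF_SIZE];
    unsigned int i, sum = 0;

    for (i = 0; i < DEMO_BUF_SIZE; i++)
        buf[i] = 1;
    for (i = 0; i < DEMO_BUF_SIZE; i++)
        sum += buf[i];
    return sum;                 /* 512 */
}

/* The caller's own frame stays small because the helpers are not
 * inlined into it. */
unsigned int demo_entry(void)
{
    unsigned int total = demo_sum_pattern() + demo_sum_ones();

    assert(total == 65280u + 512u);
    return total;
}
```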
> Have you noticed problems with either of these so far? Do you think
> they are realistic concerns or is the LTO implementation good enough
> that they would rarely become an issue?
I think the first is a realistic possible concern, but I personally haven't
had much trouble with it so far.
-Andi
-- a...@linux.intel.com -- Speaking for myself only.
> We cannot assume that the inline assembler code always ends up
> in the same file as the original C file. So make any assembler labels
> that are called with "extern" by C global
-----Original Message-----
From: wi...@spo001.leaseweb.com [mailto:wi...@spo001.leaseweb.com] On Behalf Of Wim Van Sebroeck
Sent: Wednesday, August 22, 2012 2:25 PM
To: Andi Kleen
Cc: linux-ker...@vger.kernel.org; x...@kernel.org; mma...@suse.cz; linux-kbu...@vger.kernel.org; JBeul...@suse.com; a...@linux-foundation.org; Andi Kleen; Mingarelli, Thomas
Subject: Re: [PATCH 38/74] lto, watchdog/hpwdt.c: Make assembler label global
> We cannot assume that the inline assembler code always ends up
> in the same file as the original C file. So make any assembler labels
> that are called with "extern" by C global
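A minimal sketch of the rule, assuming an x86-64 ELF target built with GCC or Clang; on other targets the example falls back to plain C. `demo_asm_answer` is a hypothetical name. The label is defined in top-level asm and called from C through an `extern` declaration, so it is declared `.globl`: a local label would only resolve if the asm and its C caller happened to end up in the same object file, which cannot be assumed with LTO.

```c
#include <assert.h>

#if defined(__x86_64__) && defined(__ELF__)
/* Top-level asm defining a function-like label. The .globl directive
 * makes the label visible outside the object the asm lands in. */
__asm__(
    ".pushsection .text\n"
    ".globl demo_asm_answer\n"
    ".type demo_asm_answer, @function\n"
    "demo_asm_answer:\n"
    "    movl $42, %eax\n"
    "    ret\n"
    ".size demo_asm_answer, .-demo_asm_answer\n"
    ".popsection\n"
);
#else
/* Portable fallback so the sketch still builds on other targets. */
int demo_asm_answer(void)
{
    return 42;
}
#endif

/* The C side only knows the label through this declaration. */
extern int demo_asm_answer(void);

int demo_call_label(void)
{
    int v = demo_asm_answer();

    assert(v == 42);
    return v;
}
```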
> On a 32bit build gcc 4.7 with LTO decides to clobber the 6th argument on the
> stack. Unfortunately this corrupts the user EBP and leads to later crashes.
> For now mark do_futex noinline to prevent this.
> I wish there was a generic way to handle this. Seems like a ticking time
> bomb problem.
There is a generic way to handle this. This is actually a bug in Linux
that has been known for at least 15 years and which we keep hacking around.
The right thing to do is to change head_32.S to not violate the i386
ABI. Arguments pushed (by value) on the stack are property of the
callee, that is, they are volatile, so the hack of making them do double
duty as both being saved and passed as arguments is just plain bogus.
The problem is that it works "just well enough" that people (including
myself) keep hacking around it with hacks like this, with assembly
macros, and whatnot instead of fixing the root cause.
>> On a 32bit build gcc 4.7 with LTO decides to clobber the 6th argument on the
>> stack. Unfortunately this corrupts the user EBP and leads to later crashes.
>> For now mark do_futex noinline to prevent this.
>> I wish there was a generic way to handle this. Seems like a ticking time
>> bomb problem.
> There is a generic way to handle this. This is actually a bug in Linux
> that has been known for at least 15 years and which we keep hacking around.
> The right thing to do is to change head_32.S to not violate the i386
> ABI. Arguments pushed (by value) on the stack are property of the
> callee, that is, they are volatile, so the hack of making them do double
> duty as both being saved and passed as arguments is just plain bogus.
> The problem is that it works "just well enough" that people (including
> myself) keep hacking around it with hacks like this, with assembly
> macros, and whatnot instead of fixing the root cause.
> -hpa
Just a clarification (Andi knows this, I'm sure, but others might not): this wasn't done the way it is for no reason; back when Linus originally wrote the code, i386 passed *all* arguments on the stack, and we still do that for "asmlinkage" functions on i386. Since gcc back then rarely if ever mucked with the stack arguments, it made sense to make them "double duty." Fixing this really should entail changing the invocation of system calls on i386 to use the regparm convention, which means we only need to push three arguments twice, rather than six.
-hpa
-- H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
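The ABI point can be sketched in portable C. This is a minimal illustration with hypothetical names (`demo_regs`, `demo_sys_six`, `demo_dispatch`), not the kernel's entry code: the callee is free to overwrite its by-value arguments, so the dispatcher passes copies and keeps the saved user registers in separate storage. That corresponds to pushing the arguments a second time rather than letting the saved frame do double duty.

```c
#include <assert.h>

/* Hypothetical saved user registers for a six-argument system call. */
struct demo_regs {
    long bx, cx, dx, si, di, bp;
};

/* The callee owns its by-value arguments and may scribble on them. */
__attribute__((noinline))
long demo_sys_six(long a1, long a2, long a3, long a4, long a5, long a6)
{
    a6 = 0;     /* legal: only the callee's own copy is modified */
    return a1 + a2 + a3 + a4 + a5 + a6;
}

/* The dispatcher passes copies, so the saved frame is never aliased
 * with the argument slots. */
long demo_dispatch(const struct demo_regs *saved)
{
    return demo_sys_six(saved->bx, saved->cx, saved->dx,
                        saved->si, saved->di, saved->bp);
}

long demo_roundtrip(void)
{
    struct demo_regs saved = { 1, 2, 3, 4, 5, 6 };
    long ret = demo_dispatch(&saved);

    assert(saved.bp == 6);      /* the saved "user EBP" survives the call */
    return ret;                 /* 1 + 2 + 3 + 4 + 5 + 0 */
}
```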
> The right thing to do is to change head_32.S to not violate the i386
> ABI. Arguments pushed (by value) on the stack are property of the
> callee, that is, they are volatile, so the hack of making them do double
> duty as both being saved and passed as arguments is just plain bogus.
> The problem is that it works "just well enough" that people (including
> myself) keep hacking around it with hacks like this, with assembly
> macros, and whatnot instead of fixing the root cause.
How about just using register arguments for the first three arguments.
This should work for the syscalls at least (may be too risky for all
other asm entry points)
And for syscalls with more than three generate a stub that saves on the stack
explicitly. This could be done using the new fancy SYSCALL definition macros
(except that arch/x86 would need to start using them too in its own code)
Or is there some subtle reason with syscall restart and updated args that
prevents it?
Perhaps newer gcc can do regparm(X), X > 3 too, may be worth trying.
Don't have time to look into this currently though.
-Andi
-- a...@linux.intel.com -- Speaking for myself only
> How about just use register arguments for the first three arguments.
> This should work for the syscalls at least (may be too risky for all
> other asm entry points)
Well, it's just an effort to convert each one in turn...
> And for syscalls with more than three generate a stub that saves on the stack
> explicitly. This could be done using the new fancy SYSCALL definition macros
> (except that arch/x86 would need to start using them too in its own code)
I don't think there is any point. Just push the six potential arguments to the stack and be done with it.
> Or is there some subtle reason with syscall restart and updated args
> that prevents it?
> Perhaps newer gcc can do regparm(X), X > 3 too, may be worth trying.
No, there is no such ABI defined.
> Don't have time to look into this currently though.
Always the problem.
-hpa
-- H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
> > If data structures could be encapsulated/internalized to
> > subsystems and only global functions are exposed to other
> > subsystems [which are then LTO optimized] then our include
> > file dependencies could become a *lot* simpler.
> Yes, long term we could have these benefits.
Yes, in the long term LTO should make developers' lives easier; it is not
just a tool for getting a few extra % of performance.
There is a lot to do.
> BTW I should add LTO does more than just inlining:
> - Drop unused global functions and variables
> (so may cut down on ifdefs)
> - Detect type inconsistencies between files
> - Partial inlining (inline only parts of a function like a test
> at the beginning)
> - Detect pure and const functions without side effects that can be more
> aggressively optimized in the caller.
Also noreturn and nothrow are autodetected (the second is probably not a big
deal for the kernel, but it makes some C++ codebases a lot smaller by
eliminating EH and cleanups). We plan to add more in the near future.
> - Detect global clobbers globally. Normally any global call has to
> assume all global variables could be changed. With LTO information some
> of them can be cached in registers over calls.
> - Detect read only variables and optimize them
> - Optimize arguments to global functions (drop unnecessary arguments,
> optimize input/output etc.)
At this moment this really happens within compilation units only.
It is one of the harder optimizations to get working over the whole program;
we are slowly getting the infrastructure to make this possible.
> - Replace indirect calls with direct calls, enabling other
> optimizations.
> - Do constant propagation and specialization for functions. So if a
> function is called commonly with a constant it can generate a special
> variant of this function optimized for that. This still needs more tuning (and
> currently the code size impact is on the largish side), but I hope
> to eventually have e.g. a special kmalloc optimized for GFP_KERNEL.
> It can also in principle inline callbacks.
Also profile propagation is done. When a function is called only on cold
paths, it becomes cold.
Thanks for all the hard work on LTO kernel, Andi!
Honza