-falign-functions=16 for i386/amd64

Ryota Ozaki

Aug 29, 2016, 3:43:40 AM

to tech...@netbsd.org

Hi,

I propose setting -falign-functions=16 for
i386/amd64 kernels to reduce performance
fluctuations caused by small, unrelated changes.

[Background]

I noticed that IP forwarding performance
degraded by 10% between Aug. 1 and Aug. 16.
Bisecting the commits in that range shows that
the degradation came from several commits, and
unfortunately those commits aren't related to
IP forwarding performance; for example, a
change to ip6flow.

knakahara and I investigated how these
degradations happened and concluded that they
are caused by changes in the start addresses of
functions (alignment of function code), which
probably affect CPU cache hits. (Actually this is
just our guess because we don't have a way to
measure cache hit/miss ratios for now...)

[How does -falign-functions=16 help?]

Currently the start of functions in i386/amd64
kernels is unaligned, i.e., a function can start
at any byte depending on the objects linked
before it in the kernel. If the size of a leading
object changes, the start addresses of all
following functions also change.

You can see how functions are laid out by
running nm -n netbsd or by looking at the symbol
files generated in releasedir.
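
As a rough illustration of that check, the following Python sketch counts how many text symbols in nm -n style output start on a 16-byte boundary. The sample symbol names and addresses are made up for illustration, and the helper names (text_symbols, aligned_fraction) are hypothetical; real input would come from running nm -n on an actual kernel. Note that text symbols can also include labels that are not functions, so this is only an approximation.

```python
# Made-up sample in the "address type name" format printed by nm -n.
SAMPLE_NM_OUTPUT = """\
ffffffff80200000 T start
ffffffff8020013a t helper_a
ffffffff802001c0 T func_b
ffffffff80200257 T func_c
ffffffff80a00000 D some_data
"""


def text_symbols(nm_output):
    """Yield (address, name) for text symbols (type T or t)."""
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols have no address column
        addr, sym_type, name = parts
        if sym_type in ("T", "t"):
            yield int(addr, 16), name


def aligned_fraction(nm_output, alignment=16):
    """Return the fraction of text symbols whose address is a multiple of alignment."""
    addrs = [addr for addr, _ in text_symbols(nm_output)]
    if not addrs:
        return 0.0
    aligned = sum(1 for addr in addrs if addr % alignment == 0)
    return aligned / len(addrs)


if __name__ == "__main__":
    frac = aligned_fraction(SAMPLE_NM_OUTPUT)
    print(f"{frac:.0%} of text symbols are 16-byte aligned")
```

On a kernel built with -falign-functions=16, one would expect this fraction to be close to 1 for function symbols; on an unaligned kernel it should be noticeably lower.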

If you add -falign-functions=16 to COPTS in
your kernel config, functions are aligned to
16 bytes. The start address of every function
then always has the form 0xXXXXXXX0 on i386 and
0xffffffffXXXXXXX0 on amd64, so the alignment of
a function is no longer affected by other,
unrelated code changes.
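
The effect can be sketched with a toy layout model in Python. This is only an illustration under simplifying assumptions: functions are placed back to back in one section, the function sizes are invented, and the helper names (layout, low_bits) are hypothetical. A real linker layout is more involved.

```python
def layout(sizes, alignment=1, base=0):
    """Place functions one after another and return each start address.

    Each start address is rounded up to a multiple of `alignment`.
    """
    starts = []
    addr = base
    for size in sizes:
        addr = (addr + alignment - 1) // alignment * alignment  # round up
        starts.append(addr)
        addr += size
    return starts


def low_bits(starts, alignment=16):
    """Offset of each start address within its `alignment`-byte block."""
    return [s % alignment for s in starts]


# Invented function sizes in bytes; the first entry plays the leading code.
before = [100, 37, 250, 61]
after = [103, 37, 250, 61]  # unrelated change: leading code grew by 3 bytes

if __name__ == "__main__":
    for align in (1, 16):
        print(f"align={align}:",
              low_bits(layout(before, align)),
              low_bits(layout(after, align)))
```

Without alignment, the 3-byte change shifts the offset of every following function within its 16-byte block. With 16-byte alignment, every start stays at offset 0; in this particular example the growth is absorbed by padding, while a larger change would still move the following functions, but only in steps of 16 bytes.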

[Why aren't functions aligned in the first place?]

It seems to be because of -mtune=nocona, which is
specified in bsd.own.mk. -mtune=generic gives functions
aligned to 16 bytes, but gives poorer performance
than -mtune=nocona, so I don't propose that kind of
change.

[Does -falign-functions=16 solve the issue completely?]

No. There seem to be other causes of performance
fluctuations as well. Nonetheless, setting
-falign-functions=16 reduces the fluctuations.

[The point of the proposal]

The aim of the proposal isn't to get good
performance by aligning kernel functions,
but to reduce performance fluctuations caused by
small, unrelated changes. Such fluctuations make
it difficult to measure the small overhead of a
change because we cannot tell whether a given
performance change comes from the change itself
or from function alignment changes.

Any suggestions or comments?

Adding -falign-functions=16 is one solution, and
there may be a better way to reach the goal. Also,
I'm not sure where we should add such an option.

Thanks,
ozaki-r

Ryota Ozaki

Aug 29, 2016, 4:18:22 AM

to tech...@netbsd.org

[Where does 16 come from?]

From an old Intel Optimization Manual (for Pentium II
and III). For recent processors 32 may be better, but for
stock kernels (such as GENERIC) 16 is probably better (for
old machines). (And if we really want to optimize, we
should use -march or -mtune instead.)

Another reason is that stock kernels of other OSes
(FreeBSD, OpenBSD and Linux) appear to use 16-byte
alignment.

ozaki-r

Ryota Ozaki

Sep 1, 2016, 2:46:51 AM

to tech...@netbsd.org

On Mon, Aug 29, 2016 at 4:43 PM, Ryota Ozaki <oza...@netbsd.org> wrote:

http://www.netbsd.org/~ozaki-r/align-functions-16.diff

The patch adds the option to sys/arch/amd64/conf/Makefile.amd64.
Is it a feasible place to add?

ozaki-r

Joerg Sonnenberger

Sep 1, 2016, 5:45:20 AM

to tech...@netbsd.org

On Thu, Sep 01, 2016 at 03:46:15PM +0900, Ryota Ozaki wrote:
> http://www.netbsd.org/~ozaki-r/align-functions-16.diff
>
> The patch adds the option to sys/arch/amd64/conf/Makefile.amd64.
> Is it a feasible place to add?

There are two small issues I have with this patch:
(1) I think it should be restricted to GCC with an appropriate comment on
what this is a workaround for. Clang seems to behave a lot more sensibly
out of the box. If there are CPU models with a different base alignment
and the user asked for one of them as optimisation target, it should be
honored IMO.
(2) This should not touch CFLAGS, but COPTS.

Joerg

Ryota Ozaki

Sep 1, 2016, 7:33:24 AM

to matthew green, tech...@netbsd.org

On Thu, Sep 1, 2016 at 4:04 PM, matthew green <m...@eterna.com.au> wrote:
> have you tested other values than 1 and 16? what about 4 or 8?

4 and 8 are not so good; their performance fluctuations are
similar to the unaligned case in my experiments.

>
> can you post the size difference of kernels? particularly the
> kernel without DIAGNOSTIC or DEBUG (since those are the ones
> where performance matters most.)

I measured the sizes of GENERIC kernels, i.e., DIAGNOSTIC on
and DEBUG off.

The sizes of kernel binaries don't change in most cases because
the alignment of __rodata_start that begins just after kernel text
hides the changes due to -falign-functions.

The size of the actual kernel text (from kernel_text to _etext)
changes slightly. The difference between GENERIC kernels
w/ and w/o -falign-functions=16 is 200kB. That is 1% of the total
kernel text size.

BTW, as I noted, I'm not looking for the alignment size that provides
the best performance; I just want to reduce performance fluctuations.

Thanks,
ozaki-r

Ryota Ozaki

Sep 1, 2016, 9:01:24 AM

to tech...@netbsd.org

Okay, I see. How about the following patch?
(nonaka@ helped improve the Makefile options.)

http://www.netbsd.org/~ozaki-r/align-functions-16.v2.diff

Thanks,
ozaki-r

Adam

Sep 1, 2016, 9:14:40 AM

to Ryota Ozaki, tech...@netbsd.org

>>> http://www.netbsd.org/~ozaki-r/align-functions-16.diff
>>>
>>> The patch adds the option to sys/arch/amd64/conf/Makefile.amd64.
>>> Is it a feasible place to add?
>>
>> There are two small issues I have with this patch:
>> (1) I think it should be restricted to GCC with an appropriate comment on
>> what this is a workaround for. Clang seems to behave a lot more sensibly
>> out of the box. If there are CPU models with a different base alignment
>> and the user asked for one of them as optimisation target, it should be
>> honored IMO.
>> (2) This should not touch CFLAGS, but COPTS.
>
> Okay, I see. How about the following patch?
> (nonaka@ helped improve the Makefile options.)
>
> http://www.netbsd.org/~ozaki-r/align-functions-16.v2.diff
>
> Thanks,
> ozaki-r

Have you tried compiling with clang? AFAIK, clang does not support -falign-functions (and warns about it).

Kind regards,
Adam

Joerg Sonnenberger

Sep 1, 2016, 3:40:52 PM

to tech...@netbsd.org

Almost. You shouldn't need the whole if-block. I don't understand the
bsd.own.mk reference, it doesn't contain -mtune=nocona here?

Joerg

Ryota Ozaki

Sep 1, 2016, 8:21:08 PM

to Adam, tech...@netbsd.org

The ACTIVE_CC trick ensures that the option is added iff gcc is used.

ozaki-r

Ryota Ozaki

Sep 1, 2016, 8:40:23 PM

to tech...@netbsd.org

The if-block is intended to add the option even when a kernel
config (like GENERIC) has its own COPTS.

Well, should we add the option directly to COPTS in GENERIC instead?
Or we may be able to get rid of (or comment out) COPTS in GENERIC
because it's the same as DEFCOPTS.

> I don't understand the
> bsd.own.mk reference, it doesn't contain -mtune=nocona here?

I meant the GCC_CONFIG_TUNE.x86_64=nocona line, which makes gcc
use nocona as the default value of -mtune.

Okay, revised the comment like this:

-# By default, our gcc uses -mtune=nocona for compiling the kernels
-# (see share/mk/bsd.own.mk). With -mtune=nocona, gcc doesn't align
+# Our gcc is built to use nocona as the default value of -mtune
+# (see GCC_CONFIG_TUNE in share/mk/bsd.own.mk). With -mtune=nocona,
+# gcc doesn't align (...)

Thanks,
ozaki-r

Ryota Ozaki

Sep 5, 2016, 5:29:16 AM

to matthew green, tech...@netbsd.org

On Mon, Sep 5, 2016 at 1:19 PM, matthew green <m...@eterna.com.au> wrote:
> Ryota Ozaki writes:
>> On Thu, Sep 1, 2016 at 4:04 PM, matthew green <m...@eterna.com.au> wrote:
>> > have you tested other values than 1 and 16? what about 4 or 8?
>>
>> 4 and 8 are not so good; their performance fluctuations are
>> similar to the unaligned case in my experiments.
>>
>> >
>> > can you post the size difference of kernels? particularly the
>> > kernel without DIAGNOSTIC or DEBUG (since those are the ones
>> > where performance matters most.)
>>
>> I measured the sizes of GENERIC kernels, i.e., DIAGNOSTIC on
>> and DEBUG off.
>

> DIAGNOSTIC is enabled on most -current GENERIC kernels including
> the amd64 one. it's disabled on release branches.

I tried without DIAGNOSTIC. The overhead due to alignment doesn't
change but the total text size of the kernel is reduced by 660kB,
so the ratio of overhead increases a bit (< 1%).
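
The arithmetic behind these ratios can be sketched in Python. The three input figures are the ones quoted in this thread; treating the rounded "1%" as exact to derive a total text size of about 20000kB is an assumption made only for illustration, as are the helper names.

```python
ALIGN_OVERHEAD_KB = 200      # extra text with -falign-functions=16 (from the thread)
OVERHEAD_RATIO = 0.01        # "1% of the total kernel text size" (from the thread)
DIAGNOSTIC_SAVING_KB = 660   # text saved by disabling DIAGNOSTIC (from the thread)


def implied_text_kb(overhead_kb, ratio):
    """Total text size implied by an overhead and its ratio (assumes exact ratio)."""
    return overhead_kb / ratio


def overhead_ratio(overhead_kb, text_kb):
    """Alignment overhead as a fraction of the total text size."""
    return overhead_kb / text_kb


if __name__ == "__main__":
    with_diag = implied_text_kb(ALIGN_OVERHEAD_KB, OVERHEAD_RATIO)
    without_diag = with_diag - DIAGNOSTIC_SAVING_KB
    print(f"implied text size with DIAGNOSTIC: {with_diag:.0f}kB")
    print(f"overhead ratio without DIAGNOSTIC: "
          f"{overhead_ratio(ALIGN_OVERHEAD_KB, without_diag):.2%}")
```

Under these assumptions the ratio moves from 1% to roughly 1.03%, which is consistent with the statement that it increases only a bit.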

>
>> The sizes of kernel binaries don't change in most cases because
>> the alignment of __rodata_start that begins just after kernel text
>> hides the changes due to -falign-functions.
>>
>> The size of the actual kernel text (from kernel_text to _etext)
>> changes slightly. The difference between GENERIC kernels
>> w/ and w/o -falign-functions=16 is 200kB. That is 1% of the total
>> kernel text size.
>>
>> BTW, as I noted, I'm not looking for the alignment size that provides
>> the best performance; I just want to reduce performance fluctuations.
>

> 200KB is a lot of text. that's a non trivial i-cache issue.
>
> what are the CPU specifics of the system you're testing on?

dut1# cpuctl identify 0
cpu0: highest basic info 0000000b
cpu0: highest extended info 80000008
cpu0: "Intel(R) Atom(TM) CPU C2558 @ 2.40GHz"
cpu0: Intel Atom C2000 (686-class), 2400.27 MHz
cpu0: family 0x6 model 0x4d stepping 0x8 (id 0x406d8)
cpu0: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE>
cpu0: features 0xbfebfbff<MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2>
cpu0: features 0xbfebfbff<SS,HTT,TM,SBF>
cpu0: features1 0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MONITOR,DS-CPL,VMX,EST,TM2>
cpu0: features1 0x43d8e3bf<SSSE3,CX16,xTPR,PDCM,SSE41,SSE42,MOVBE,POPCNT>
cpu0: features1 0x43d8e3bf<DEADLINE,AES,RDRAND>
cpu0: features2 0x28100800<SYSCALL/SYSRET,XD,RDTSCP,EM64T>
cpu0: features3 0x101<LAHF,PREFETCHW>
cpu0: I-cache 32KB 64B/line 8-way, D-cache 24KB 64B/line 6-way
cpu0: L2 cache 1MB 64B/line 16-way
cpu0: ITLB 48 4KB entries fully associative
cpu0: DTLB 128 4KB entries 4-way, 4K/2M: 16 entries
cpu0: Initial APIC ID 0
cpu0: Cluster/Package ID 0
cpu0: Core ID 0
cpu0: SMT ID 0
cpu0: DSPM-eax 0x5<DTS,ARAT>
cpu0: DSPM-ecx 0x9<HWF,EPB>
cpu0: SEF highest subleaf 00000000
cpu0: SEF-main 0x2282<TSCADJUST,SMEP,ERMS,FPUCSDS>
cpu0: microcode version 0x127, platform ID 0
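
To relate the I-cache line above to function alignment, here is a small Python sketch of the cache geometry. The sizes are taken from the cpuctl output (32KB, 64B/line, 8-way); the simple modulo set indexing is an assumption for illustration, not something documented in this thread, and the helper names are hypothetical.

```python
ICACHE_BYTES = 32 * 1024  # "I-cache 32KB"
LINE_BYTES = 64           # "64B/line"
WAYS = 8                  # "8-way"


def num_sets(cache_bytes=ICACHE_BYTES, line_bytes=LINE_BYTES, ways=WAYS):
    """Number of sets in a set-associative cache."""
    return cache_bytes // (line_bytes * ways)


def line_offset(addr, line_bytes=LINE_BYTES):
    """Byte offset of an address within its cache line."""
    return addr % line_bytes


def set_index(addr, line_bytes=LINE_BYTES, sets=None):
    """Set an address maps to, assuming plain modulo indexing (an assumption)."""
    if sets is None:
        sets = num_sets()
    return (addr // line_bytes) % sets


if __name__ == "__main__":
    print("sets:", num_sets())
    # Offsets within a line at which a 16-byte-aligned function can start.
    print("start offsets:", sorted({line_offset(a) for a in range(0, 256, 16)}))
```

With 16-byte alignment a function can start at only 4 positions within a 64-byte line instead of 64, which limits how much an unrelated change can alter how a function straddles cache lines.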

> can you run performance tests on systems with small cache?

I have never tested that. It'll take some time because I don't
have a suitable machine. BTW, what size do you consider small?

Thanks,
ozaki-r

Ryota Ozaki

Sep 8, 2016, 5:04:45 AM

to matthew green, tech...@netbsd.org

> ah, heh. the above is a small system. :)

Oh, okay :)

>
> can you run it on a big system? like a xeon with a large cpu
> cache?

Hmm, we don't have any such big systems that we can conveniently
test on. I hope someone who has such a system tries it out.

ozaki-r