GCC builtins

Davide Libenzi

Oct 21, 2015, 7:47:31 PM

to Akaros

Any reason for not using GCC builtin functions for things like string and memory stuff?

They map to faster implementations than the open-coded ones.

Davide Libenzi

Oct 21, 2015, 7:49:51 PM

to Akaros

/me waits for Ron's rant about compiler compatibility :)

(clang has many of them, and all the string/mem ones for sure)

ron minnich

Oct 21, 2015, 8:09:09 PM

to Akaros

I'm saving all my rants today for intel. I'm out!

ron


Barret Rhoden

Oct 22, 2015, 10:51:49 AM

to aka...@googlegroups.com

On 2015-10-21 at 16:47 "'Davide Libenzi' via Akaros" wrote:
> Any reason for not using GCC builtin functions for things like string
> and memory stuff?

Depending on the builtin function, it's probably due to ignorance,
meaning we didn't know about them.

A lot of the string related functions came from JOS (the starting "lab"
that bootstrapped Akaros), so we never looked at the builtins.

That was also the case for memcpy, but that popped up as a major cycle
consumer (on a RISC-V or SPARC simulator, IIRC), so Andrew Waterman
fixed that one up.

Barret

Davide Libenzi

Oct 22, 2015, 11:02:16 AM

to Akaros

A solution could be to have "if gcc use builtins, otherwise fall back to open coded".

The if-gcc is actually going to be, in reality, wider than that, because many compilers support most of them.
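A minimal sketch of the "builtin if available, otherwise open coded" idea. The names my_memcpy, open_coded_memcpy and HAVE_BUILTIN_MEMCPY are made up for illustration and are not Akaros code. Note that __builtin_memcpy may still compile down to a call to memcpy, so a freestanding kernel needs to provide that symbol anyway.

```c
#include <stddef.h>

/* Clang and newer GCC releases provide __has_builtin; define a fallback
 * so the check below also compiles on compilers that lack it. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Assume any compiler that defines __GNUC__ (GCC, Clang, ICC) has the
 * string/memory builtins, even if it cannot answer __has_builtin. */
#if __has_builtin(__builtin_memcpy) || defined(__GNUC__)
#define HAVE_BUILTIN_MEMCPY 1
#else
#define HAVE_BUILTIN_MEMCPY 0
#endif

/* Hypothetical open-coded fallback: a plain byte-by-byte copy. */
void *open_coded_memcpy(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Hypothetical wrapper: the implementation is picked at compile time. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
#if HAVE_BUILTIN_MEMCPY
	return __builtin_memcpy(dst, src, n);
#else
	return open_coded_memcpy(dst, src, n);
#endif
}
```

Callers only ever see my_memcpy(); which body it gets depends on the compiler doing the build.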

Note: Many GCC string/memory functions have a preamble that aligns pointers, and then use the SSE instructions with xmm registers.

This is a big win for large memory blocks, but not so much for blocks smaller than 32 bytes (the cycles spent in the alignment code are not offset by the savings). We ended up, at my previous job, having two sets of functions: one using the glibc ones (which mapped to builtins), and open-coded ones (which we used when we knew the blocks we were working on were relatively short).

For things like memcpy() or memset() on big blocks, the win in using SSE is pretty big.
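A rough sketch of the "two sets of functions" approach described above. copy_small, copy_auto and SMALL_COPY_MAX are hypothetical names, and the 32-byte cutoff is just the figure quoted in this message, not a measured value. The original setup picked the variant at the call site; copy_auto only shows what a runtime size check would look like.

```c
#include <stddef.h>
#include <string.h>

/* Assumed cutoff, taken from the 32-byte figure above; tune by measuring. */
#define SMALL_COPY_MAX 32

/* Open-coded copy for callers that know their blocks are short:
 * no alignment preamble, no SIMD setup. */
void *copy_small(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* For callers that do not know the size up front: short blocks take the
 * open-coded path, everything else goes to the library memcpy. */
void *copy_auto(void *dst, const void *src, size_t n)
{
	if (n < SMALL_COPY_MAX)
		return copy_small(dst, src, n);
	return memcpy(dst, src, n);
}
```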

ron minnich

Oct 22, 2015, 11:10:01 AM

to Akaros

I think it makes sense to give this a go. The only thing I'd request is to do the switch in the makefile, i.e. you pick a different .c file depending on which compiler you're using. I just find it easier not to have to figure out which #ifdef is active. But that's me ;-)

ron

Barret Rhoden

Oct 22, 2015, 12:10:56 PM

to aka...@googlegroups.com

On 2015-10-22 at 08:02 "'Davide Libenzi' via Akaros" <aka...@googlegroups.com> wrote:
> A solution could be to have "if gcc use builtins, otherwise fall back
> to open coded".
> The if-gcc is actually going to be, in reality, wider than that,
> because many compilers support most of them.
>
> Note: Many GCC string/memory functions have a preamble that aligns
> pointers, and then use the SSE instructions with xmm registers.

This is a problem for the kernel and for some user code (vcore context
code). In general, we don't do anything that could touch the
FP/MMX/XMM registers.

> This is a big win for large memory blocks, but not so much for
> blocks smaller than 32 bytes (cycles spent in align code is not
> countered by savings). We ended up, at my previous job, having two
> sets of functions. One using glibc ones (which mapped to builtins),
> and open coded ones (which we used when we knew the blocks we were
> working on were relatively short). For things like memcpy() or
> memset() on big blocks, the win in using SSE is pretty big.

For kernel code, if we know we're memcpy/memsetting large blocks,
perhaps we could save and restore the FPU state around a special large
memcpy/memset function. The overhead of the save/restore would adjust
the break-even point where we switch from one style (the current
version) to the SSE style.
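As a sketch only, the idea could look like the following. save_fp_state() and restore_fp_state() are hypothetical hooks (stubbed here with counters so the example runs in user space), copy_simd() stands in for an SSE routine, and LARGE_COPY_MIN is a placeholder that would have to be found by measurement. None of these names come from the Akaros tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks. A kernel would save/restore the real FP/SSE state
 * here; these stubs only count calls. */
int fp_saves, fp_restores;
void save_fp_state(void) { fp_saves++; }
void restore_fp_state(void) { fp_restores++; }

/* Placeholder break-even point: it has to cover the save/restore cost. */
#define LARGE_COPY_MIN 4096

/* Current style: integer registers only, safe in kernel context. */
void *copy_no_simd(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Stand-in for an SSE-based copy; here it just calls memcpy. */
void *copy_simd(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Only large copies pay for the FP state save/restore. */
void *kmemcpy(void *dst, const void *src, size_t n)
{
	void *ret;

	if (n < LARGE_COPY_MIN)
		return copy_no_simd(dst, src, n);
	save_fp_state();
	ret = copy_simd(dst, src, n);
	restore_fp_state();
	return ret;
}
```

The more expensive the save/restore pair is, the higher LARGE_COPY_MIN has to be before the SSE path pays off.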

Barret

Edward Jee

Oct 22, 2015, 12:31:57 PM

to aka...@googlegroups.com

If I remember correctly, the Linux kernel uses ERMSB (Enhanced REP MOVSB) for most of its memcpy-like functions.

http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf says ERMSB is faster than an AVX-based implementation in many cases, but slower in some cases (e.g. copying small chunks, or unaligned, ...).

But the Linux kernel seems to avoid using AVX (and other SIMD instructions), probably because saving/restoring those special SIMD registers is expensive.
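For illustration, a "rep movsb" copy can be written with GCC-style inline assembly as below. This is a sketch and not the Linux implementation; copy_rep_movsb is a made-up name. The instruction works on any x86 CPU and touches no FP/SSE registers. ERMSB only means the CPU runs it fast, and a real implementation would check the CPUID feature bit (leaf 7, EBX bit 9) before preferring it. The code assumes the direction flag is clear, as the usual ABIs require, and falls back to a byte loop on other architectures.

```c
#include <stddef.h>

void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
	void *ret = dst;

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	/* rep movsb copies (r|e)cx bytes from [(r|e)si] to [(r|e)di],
	 * advancing all three registers as it goes. */
	__asm__ volatile("rep movsb"
			 : "+D"(dst), "+S"(src), "+c"(n)
			 :
			 : "memory");
#else
	/* Portable fallback for non-x86 builds. */
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
#endif
	return ret;
}
```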

Davide Libenzi

Oct 23, 2015, 10:36:15 AM

to Akaros

Oh, I need to check what GCC generates when SSE(2) is disabled.
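One way to check, sketched here as an assumption about the workflow: compile a small fixed-size copy like the one below with and without -mno-sse (e.g. gcc -O2 -S, then gcc -O2 -mno-sse -S) and compare the generated assembly. struct blob, copy_blob and built_with_sse2 are made-up names. GCC and Clang predefine __SSE2__ when SSE2 code generation is enabled, so the second function reports which mode a given build used.

```c
#include <stddef.h>

/* A fixed-size object, so the compiler can expand the copy inline. */
struct blob {
	unsigned char bytes[64];
};

/* Look at this function in the -S output: with SSE enabled the compiler
 * may use xmm moves, with -mno-sse it has to stick to integer registers
 * or call memcpy. */
void copy_blob(struct blob *dst, const struct blob *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));
}

/* Returns 1 if this translation unit was built with SSE2 enabled. */
int built_with_sse2(void)
{
#if defined(__SSE2__)
	return 1;
#else
	return 0;
#endif
}
```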

Saving FPU state at ctx switch might be an issue (though it is today much less of a pain than 10 years ago).
