[RFC PATCH] Adjust the _Float16 ABI to return in a GPR


Trevor Gross

Mar 4, 2026, 6:32:59 AM
to IA32 System V Application Binary Interface, H.J. Lu, gcc-p...@gcc.gnu.org, libc-...@sourceware.org, llvm...@lists.llvm.org, Joseph Myers, Pengfei Wang, Jacob Lifshay, Folkert de Vries, Trevor Gross
Hello all,

I am interested in revisiting the return ABI of _Float16 on i386.
Currently it is returned in xmm0, meaning SSE is required for the type.
This is rather inconvenient when _Float16 is otherwise quite well
supported. Compilers need to pick between hacking together a custom ABI
that works on the baseline, or passing the burden on to users to gate
everything.

Is there any interest in adjusting the specification such that _Float16
is returned in a GPR rather than SSE?

This was brought up before in the thread at [1], with the concern about
efficient 16-bit moves between GPRs or memory and XMM. This doesn't seem
to be relevant, however, given there isn't any reason to have a _Float16
in XMM unless F16C is available, implying SSE2 and SSE4.1 for PINSRW and
PEXTRW to/from memory (unless I am missing something?).

A sample patch to the psABI is below. Needless to say, there are
compatibility concerns that come with a change, but given that workarounds
already exist (e.g. in LLVM), it seems worth considering whether
something should be codified to make this simpler for everyone.
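To make the compatibility point concrete: a _Float16 value is just 16 bits, and simple operations on it can be done entirely in integer registers. A hypothetical sketch (`f16_bits` is a stand-in name used here for the raw IEEE binary16 representation, not anything from the psABI):

```c
#include <stdint.h>

/* Hypothetical stand-in for the raw IEEE binary16 representation of a
 * _Float16 value; the name is illustrative, not from the psABI. */
typedef uint16_t f16_bits;

/* Negating a binary16 value only flips the sign bit, so on baseline
 * i386 it needs nothing beyond a GPR -- the same register a `short`
 * return would use under the proposed ABI. */
f16_bits f16_negate(f16_bits x)
{
    return x ^ 0x8000u;
}
```

Under the current ABI, returning such a result from a non-inlined function still forces a round trip through xmm0.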

Best regards,
Trevor

[1]: https://inbox.sourceware.org/gcc-patches/20210701210537.5...@gmail.com/

(some CCs added from the linked discussion)

--- patch follows ---

From 1af72db89f9a10b93569fa0b9f64f65f2dd73334 Mon Sep 17 00:00:00 2001
From: Trevor Gross <tmg...@umich.edu>
Date: Fri, 23 Jan 2026 21:11:43 +0000
Subject: [PATCH] Return _Float16 and _Complex _Float16 in GPRs

Currently the ABI specifies that _Float16 is to be passed on the stack
and returned in xmm0, meaning SSE is required to support the type.
Adjust both _Float16 and _Complex _Float16 to return in eax, dropping
the SSE requirement.

This has the benefit of making _Float16 ABI-compatible with `short`.
---
low-level-sys-info.tex | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/low-level-sys-info.tex b/low-level-sys-info.tex
index 0015c8c..a2d8d6d 100644
--- a/low-level-sys-info.tex
+++ b/low-level-sys-info.tex
@@ -384,8 +384,7 @@ of some 64bit return types & No \\
\ESI & callee-saved register & yes \\
\EDI & callee-saved register & yes \\
\reg{xmm0} & scratch register; also used to pass the first \code{__m128}
- parameter and return \code{__m128}, \code{_Float16},
- \code{_Complex _Float16} & No \\
+ parameter and return \code{__m128} & No \\
\reg{ymm0} & scratch register; also used to pass the first \code{__m256}
parameter and return \code{__m256} & No \\
\reg{zmm0} & scratch register; also used to pass the first \code{__m512}
@@ -472,7 +471,11 @@ and \texttt{unions}) are always returned in memory.
& \texttt{\textit{any-type} *} & \EAX \\
& \texttt{\textit{any-type} (*)()} & \\
\hline
- & \texttt{_Float16} & \reg{xmm0} \\
+ & \texttt{_Float16} & \reg{ax} \\
+ & & The upper 16 bits of \EAX are undefined.
+ The caller must not \\
+ & & rely on these being set in a predefined
+ way by the called function. \\
\cline{2-3}
& \texttt{float} & \reg{st0} \\
\cline{2-3}
@@ -484,7 +487,7 @@ and \texttt{unions}) are always returned in memory.
\cline{2-3}
& \texttt{__float128} & memory \\
\hline
- & \texttt{_Complex _Float16} & \reg{xmm0} \\
+ & \texttt{_Complex _Float16} & \reg{eax} \\
& & The real part is returned in bits 0..15. The imaginary part is
returned \\
& & in bits 16..31.\\
--
2.50.1 (Apple Git-155)

Thiago Macieira

Mar 4, 2026, 2:09:24 PM
to IA32 System V Application Binary Interface, H.J. Lu, gcc-p...@gcc.gnu.org, libc-...@sourceware.org, llvm...@lists.llvm.org, Trevor Gross
On Wednesday, 4 March 2026 03:27:40 Pacific Standard Time Trevor Gross wrote:
> This was brought up before in the thread at [1], with the concern about
> efficient 16-bit moves between GPRs or memory and XMM. This doesn't seem
> to be relevant, however, given there isn't any reason to have a _Float16
> in XMM unless F16C is available, implying SSE2 and SSE4.1 for PINSRW and
> PEXTRW to/from memory (unless I am missing something?).

There is still a cost of transferring from one register file to another: those
operations cost 3 cycles. That would imply efficient software that uses F16C or
(better yet) AVX512FP16 would pay an extra 3-cycle penalty to move into a GPR
on function return and another 3 cycles to reload it back into the SSE
register file.

This is of course the opposite of what would happen on systems requiring
emulation of FP16 conversions: one would pay a 3-cycle penalty to move from GPR
to SSE on function return and another 3 cycles to move it back to make any use
of the returned number.

So there are two questions to be answered, one of which has already been:

1) does FP16 support require SSE?

H.J. stated it does in the discussion you linked to and no one argued.

2) whom are we optimising this for: emulated conversions or HW-backed ones?

F16C was first introduced in 2013, though there are still systems without AVX
being produced (e.g. embedded Pentium and Celeron). But they already have a
massive performance loss by having to convert to and from FP32 in software,
before performing even simple math like:

_Float16 f(_Float16 a, _Float16 b)
{
    return a + b;
}
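For concreteness, the f16 -> f32 half of that software conversion can be sketched portably as follows (a simplified sketch handling normal numbers and zero only; a real soft-float path also covers subnormals, infinities, and NaNs):

```c
#include <stdint.h>
#include <string.h>

/* Simplified sketch of the f16 -> f32 conversion an SSE-less target
 * performs in software around every _Float16 operation.  Handles
 * normal numbers and zero only; subnormals, infinities, and NaNs are
 * omitted for brevity. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t frac = h & 0x3ff;
    /* Rebias the exponent from 15 (binary16) to 127 (binary32) and
     * widen the fraction from 10 to 23 bits. */
    uint32_t bits = (exp == 0 && frac == 0)
                        ? sign
                        : sign | ((exp + 112) << 23) | (frac << 13);

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```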

So I'd argue it's not worth optimising for them, and it's far better to allow
the best performance when one has HW-backed conversion instructions (and for
GCC, using -mfpmath=sse).

Are you asking to reopen the "requires SSE" discussion?

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel Data Center - Platform & Sys. Eng.

John McCall

Mar 4, 2026, 2:28:45 PM
to ia32...@googlegroups.com, H.J. Lu, gcc-p...@gcc.gnu.org, libc-...@sourceware.org, llvm...@lists.llvm.org, Joseph Myers, Pengfei Wang, Jacob Lifshay, Folkert de Vries, Trevor Gross
On Wed, Mar 4, 2026 at 6:33 AM Trevor Gross <tmg...@umich.edu> wrote:
> Hello all,
>
> I am interested in revisiting the return ABI of _Float16 on i386.
> Currently it is returned in xmm0, meaning SSE is required for the type.
> This is rather inconvenient when _Float16 is otherwise quite well
> supported. Compilers need to pick between hacking together a custom ABI
> that works on the baseline, or passing the burden on to users to gate
> everything.
>
> Is there any interest in adjusting the specification such that _Float16
> is returned in a GPR rather than SSE?
>
> This was brought up before in the thread at [1], with the concern about
> efficient 16-bit moves between GPRs or memory and XMM. This doesn't seem
> to be relevant, however, given there isn't any reason to have a _Float16
> in XMM unless F16C is available, implying SSE2 and SSE4.1 for PINSRW and
> PEXTRW to/from memory (unless I am missing something?).
>
> A sample patch to the psABI is below. Needless to say, there are
> compatibility concerns that come with a change, but given that workarounds
> already exist (e.g. in LLVM), it seems worth considering whether
> something should be codified to make this simpler for everyone.

Both F16C (announced 2009) and SSE (announced 1999) are widely available
in practice. It's good that there's interest in supporting older CPUs, but I don't think
it's unreasonable for the ABI to be... I can't even say "forward-looking", more like
"not quite that backward-looking". Compiler flags that enable the type with an
incompatible ABI seem like a fine solution for code that's actually unwilling to
commit to a 25-year-old minimum hardware target; there's no problem as long
as the code doesn't have a call that crosses an ABI boundary.

John.

Trevor Gross

Mar 4, 2026, 6:10:09 PM
to Thiago Macieira, IA32 System V Application Binary Interface, H.J. Lu, gcc-p...@gcc.gnu.org, libc-...@sourceware.org, llvm...@lists.llvm.org, Joseph Myers
On Wed, Mar 4, 2026 at 2:09 PM Thiago Macieira <thi...@macieira.org> wrote:
>
> On Wednesday, 4 March 2026 03:27:40 Pacific Standard Time Trevor Gross wrote:
> > This was brought up before in the thread at [1], with the concern about
> > efficient 16-bit moves between GPRs or memory and XMM. This doesn't seem
> > to be relevant, however, given there isn't any reason to have a _Float16
> > in XMM unless F16C is available, implying SSE2 and SSE4.1 for PINSRW and
> > PEXTRW to/from memory (unless I am missing something?).
>
> There is still a cost of transferring from one register file to another: those
> operations cost 3 cycles. That would imply efficient software that uses F16C or
> (better yet) AVX512FP16 would pay an extra 3-cycle penalty to move into a GPR
> on function return and another 3 cycles to reload it back into the SSE
> register file.
>
> This is of course the opposite of what would happen on systems requiring
> emulation of FP16 conversions: one would pay a 3-cycle penalty to move from GPR
> to SSE on function return and another 3 cycles to move it back to make any use
> of the returned number.

It indeed is not maximally efficient, but any `float` or `double` code
is already paying a similar (or slightly higher) cost for %st0 return
right? At least if any operations are done in XMM registers, which
Clang likes to do whenever SSE2 is available (or GCC with options).

The compatibility issues from using XMM don't necessarily seem worth the
cycle savings specifically for _Float16, given the cost for other
floats at non-inlineable function boundaries. Especially when many ops
with the type require an f16<->f32 conversion, which itself doesn't
have the call overhead (if supported).

> So there are two questions to be answered, one of which has already been:
>
> 1) does FP16 support require SSE?
>
> H.J. stated it does in the discussion you linked to and no one argued.

I took Joseph's first reply on the thread to be an expression of some
disagreement, followed by discussion about efficient GPR<->XMM moves to
support a GPR return that didn't exactly come to a conclusion. But it
is possible I am misreading here; none of this is stated explicitly.

(Joseph's email address from that thread bounced, added a new one here.)

> 2) whom are we optimising this for: emulated conversions or HW-backed ones?
>
> F16C was first introduced in 2013, though there are still systems without AVX
> being produced (e.g. embedded Pentium and Celeron). But they already have a
> massive performance loss by having to convert to and from FP32 in software,
> before performing even simple math like:
>
> _Float16 f(_Float16 a, _Float16 b)
> {
>     return a + b;
> }
>

At the ABI level the choice isn't between two performance optimization
goals, but rather between optimization and compatibility. The current
_Float16 ABI does lean toward optimization (as much as possible with
stack passing), but this makes it the only C-specified type to not be
compatible with baseline i386.

> So I'd argue it's not worth optimising for them, and it's far better to allow
> the best performance when one has HW-backed conversion instructions (and for
> GCC, using -mfpmath=sse).

This is a bit of a tangent but I think it would be much more useful to
have an ABI-changing flag that raises the baseline to SSE2 and returns
_Float16, float, and double in XMM. That gets the return ABI
performance improvement for all float types, not just _Float16, and
effectively resolves a whole class of issues for x86-32 users like
[1], [2], [3].

> Are you asking to reopen the "requires SSE" discussion?

That is my interest here, to the extent that is possible at this point.

Thanks,
Trevor

> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Data Center - Platform & Sys. Eng.

[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93681
[2]: https://github.com/llvm/llvm-project/issues/44218
[3]: https://github.com/llvm/llvm-project/issues/66803

Thiago Macieira

Mar 4, 2026, 7:27:25 PM
to Trevor Gross, Joseph Myers, IA32 System V Application Binary Interface, H.J. Lu, gcc-p...@gcc.gnu.org, libc-...@sourceware.org
On Wednesday, 4 March 2026 15:09:55 Pacific Standard Time Trevor Gross wrote:
> It indeed is not maximally efficient, but any `float` or `double` code
> is already paying a similar (or slightly higher) cost for %st0 return
> right? At least if any operations are done in XMM registers, which
> Clang likes to do whenever SSE2 is available (or GCC with options).

Indeed, but that's an ABI issue now. There was no SSE when the ABI was
created. Even just enforcing the stack alignment for SSE about 25 years ago
was a problem.

A compiler might be able to decide to use SSE or x87 depending on the cost of
the transfer at the end. For most simple operations, x87 is as fast as SSE,
but complex code won't be, due to x87's stack-based register file.

> The compatibility issues from using XMM don't necessarily seem worth the
> cycle savings specifically for _Float16, given the cost for other
> floats at non-inlineable function boundaries. Especially when many ops
> with the type require an f16<->f32 conversion, which itself doesn't
> have the call overhead (if supported).

I disagree. The cost is additional regardless of how FP16 support is
implemented, except when it is emulated entirely in software.

> > So there are two questions to be answered, one of which has already been:
> >
> > 1) does FP16 support require SSE?
> >
> > H.J. stated it does in the discussion you linked to and no one argued.
>
> I took Joseph's first reply on the thread to be an expression of some
> disagreement, followed by discussion about efficient GPR<->XMM to
> support a GPR return that didn't exactly come to a conclusion. But it
> is possible I am misreading here, none of this is stated explicitly.

I took that as a question, to which H.J. replied saying "it is" and no one
argued. You seem to wish to reopen this discussion.

> (Joseph's email address from that thread bounced, added a new one here.)

The llvm-dev one too, so I dropped it.

> At the ABI level the choice isn't between two performance optimization
> goals, but rather between optimization and compatibility. The current
> _Float16 ABI does lean toward optimization (as much as possible with
> stack passing), but this makes it the only C-specified type to not be
> compatible with baseline i386.

Indeed. But what's the harm?

> > So I'd argue it's not worth optimising for them, and it's far better to
> > allow the best performance when one has HW-backed conversion instructions
> > (and for GCC, using -mfpmath=sse).
>
> This is a bit of a tangent but I think it would be much more useful to
> have an ABI-changing flag that raises the baseline to SSE2 and returns
> _Float16, float, and double in XMM. That gets the return ABI
> performance improvement for all float types, not just _Float16, and
> effectively resolves a whole class of issues for x86-32 users like
> [1], [2], [3].

Doesn't sseregparm do that?
https://i386.godbolt.org/z/Pxj6YM365

But the question here is whether we need an ABI-breaking option to be able to
use _Float16 efficiently, given that the type itself was only introduced after
SSE already existed.

> > Are you asking to reopen the "requires SSE" discussion?
>
> That is my interest here, to the extent that is possible at this point.

Ok, but why? While AVX is missing in some hardware still being sold, SSE has
been present in everything for two decades, with the exception of the Intel
Quark microcontroller (and that is no longer commercialised). Or am I missing
something relevant that would be excluded from using _Float16?

I suppose there are still people with old Pentium III or older systems still
running, but are they updating software for them? Do they have a *new* need
for _Float16? Software that *requires* _Float16 is incredibly rare, since it
was a non-standard type before C++23. Instead, software that needs the type
either ships its own emulation, deployed for years, or degrades gracefully,
not requiring the type to compile.