[PATCH] Add optional _Float16 support

13 views
Skip to first unread message

H.J. Lu

unread,
Jul 1, 2021, 5:05:40 PM7/1/21
to ia32...@googlegroups.com, gcc-p...@gcc.gnu.org, libc-...@sourceware.org, llvm...@lists.llvm.org
1. Pass _Float16 and _Complex _Float16 values on stack.
2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
---
low-level-sys-info.tex | 57 +++++++++++++++++++++++++++++-------------
1 file changed, 40 insertions(+), 17 deletions(-)

diff --git a/low-level-sys-info.tex b/low-level-sys-info.tex
index acaf30e..82956e3 100644
--- a/low-level-sys-info.tex
+++ b/low-level-sys-info.tex
@@ -30,7 +30,8 @@ object, and the term \emph{\textindex{\sixteenbyte{}}} refers to a
\subsubsection{Fundamental Types}

Table~\ref{basic-types} shows the correspondence between ISO C
-scalar types and the processor scalar types. \code{__float80},
+scalar types and the processor scalar types. \code{_Float16},
+\code{__float80},
\code{__float128}, \code{__m64}, \code{__m128}, \code{__m256} and
\code{__m512} types are optional.

@@ -79,22 +80,25 @@ scalar types and the processor scalar types. \code{__float80},
& \texttt{\textit{any-type} *} & 4 & 4 & unsigned \fourbyte \\
& \texttt{\textit{any-type} (*)()} & & \\
\hline
- Floating-& \texttt{float} & 4 & 4 & single (IEEE-754) \\
\cline{2-5}
- point & \texttt{double} & 8 & 4 & double (IEEE-754) \\
- & \texttt{long double}$^{\dagger\dagger\dagger\dagger}$ & & & \\
+ & \texttt{_Float16}$^{\dagger\dagger\dagger\dagger\dagger\dagger}$ & 2 & 2 & 16-bit (IEEE-754) \\
\cline{2-5}
- & \texttt{__float80}$^{\dagger\dagger}$ & 12 & 4 & 80-bit extended (IEEE-754) \\
- & \texttt{long double}$^{\dagger\dagger\dagger\dagger}$ & & & \\
+ & \texttt{float} & 4 & 4 & single (IEEE-754) \\
+ \cline{2-5}
+ Floating- & \texttt{double} & 8
+ & 8$^{\dagger\dagger\dagger\dagger}$ & double (IEEE-754) \\
+ \cline{2-5}
+ point & \texttt{__float80}$^{\dagger\dagger}$ & 16 & 16 & 80-bit extended (IEEE-754) \\
+ & \texttt{long double}$^{\dagger\dagger\dagger\dagger\dagger}$ & 16 & 16 & 80-bit extended (IEEE-754) \\
\cline{2-5}
& \texttt{__float128}$^{\dagger\dagger}$ & 16 & 16 & 128-bit extended (IEEE-754) \\
\hline
- Complex& \texttt{_Complex float} & 8 & 4 & complex single (IEEE-754) \\
+ & \texttt{_Complex float} & 8 & 4 & complex single (IEEE-754) \\
\cline{2-5}
- Floating-& \texttt{_Complex double} & 16 & 4 & complex double (IEEE-754) \\
- point & \texttt{_Complex long double}$^{\dagger\dagger\dagger\dagger}$ & & & \\
+ Complex& \texttt{_Complex double} & 16 & 4 & complex double (IEEE-754) \\
+ Floating-& \texttt{_Complex long double}$^{\dagger\dagger\dagger\dagger}$ & & & \\
\cline{2-5}
- & \texttt{_Complex __float80}$^{\dagger\dagger}$ & 24 & 4 & complex 80-bit extended (IEEE-754) \\
+ point & \texttt{_Complex __float80}$^{\dagger\dagger}$ & 24 & 4 & complex 80-bit extended (IEEE-754) \\
& \texttt{_Complex long double}$^{\dagger\dagger\dagger\dagger}$ & & & \\
\cline{2-5}
& \texttt{_Complex __float128}$^{\dagger\dagger}$ & 32 & 16 & complex 128-bit extended (IEEE-754) \\
@@ -125,6 +129,8 @@ The \texttt{long double} type is 64-bit, the same as the \texttt{double}
type, on the Android{\texttrademark} platform. More information on the
Android{\texttrademark} platform is available from
\url{http://www.android.com/}.}\\
+\multicolumn{5}{p{13cm}}{\myfontsize $^{\dagger\dagger\dagger\dagger\dagger\dagger}$
+The \texttt{_Float16} type, from ISO/IEC TS 18661-3:2015, is optional.}\\
\end{tabular}
}
\end{table}
@@ -323,6 +329,7 @@ at the time of the call.
\begin{table}
\Hrule
\caption{Register Usage}
+ \myfontsize
\label{fig-reg-usage}
\begin{center}
\begin{tabular}{l|p{8.35cm}|l}
@@ -346,13 +353,29 @@ of some 64bit return types & No \\
\EBP & callee-saved register; optionally used as frame pointer & Yes \\
\ESI & callee-saved register & yes \\
\EDI & callee-saved register & yes \\
-\reg{xmm0}, \reg{ymm0} & scratch registers; also used to pass and return
-\code{__m128}, \code{__m256} parameters & No\\
-\reg{xmm1}--\reg{xmm2},& scratch registers; also used to pass
-\code{__m128}, & No \\
-\reg{ymm1}--\reg{ymm2} & \code{__m256} parameters & \\
-\reg{xmm3}--\reg{xmm7},& scratch registers & No \\
-\reg{ymm3}--\reg{ymm7} & & \\
+\reg{xmm0} & scratch register; also used to pass the first \code{__m128}
+ parameter and return \code{__m128}, \code{_Float16},
+ the real part of \code{_Complex _Float16} & No \\
+\reg{ymm0} & scratch register; also used to pass the first \code{__m256}
+ parameter and return \code{__m256} & No \\
+\reg{zmm0} & scratch register; also used to pass the first \code{__m512}
+ parameter and return \code{__m512} & No \\
+\reg{xmm1} & scratch register; also used to pass the second \code{__m128}
+ parameter and return the imaginary part of
+ \code{_Complex _Float16} & No \\
+\reg{ymm1} & scratch register; also used to pass the second \code{__m256}
+ parameters & No \\
+\reg{zmm1} & scratch register; also used to pass the second \code{__m512}
+ parameters & No \\
+\reg{xmm2} & scratch register; also used to pass the third \code{__m128}
+ parameters & No \\
+\reg{ymm2} & scratch register; also used to pass the third \code{__m256}
+ parameters & No \\
+\reg{zmm2} & scratch register; also used to pass the third \code{__m512}
+ parameters & No \\
+\reg{xmm3}--\reg{xmm7} & scratch registers & No \\
+\reg{ymm3}--\reg{ymm7} & scratch registers & No \\
+\reg{zmm3}--\reg{zmm7} & scratch registers & No \\
\reg{mm0} & scratch register; also used to pass and return
\code{__m64} parameter & No\\
\reg{mm1}--\reg{mm2} & used to pass \code{__m64} parameters & No\\
--
2.31.1

H.J.

unread,
Jul 1, 2021, 5:14:09 PM7/1/21
to IA32 System V Application Binary Interface

Joseph Myers

unread,
Jul 1, 2021, 6:10:21 PM7/1/21
to H.J. Lu, ia32...@googlegroups.com, llvm...@lists.llvm.org, libc-...@sourceware.org, gcc-p...@gcc.gnu.org
On Thu, 1 Jul 2021, H.J. Lu via Gcc-patches wrote:

> 2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.

That restricts use of _Float16 to processors with SSE. Is that what we
want in the ABI, or should _Float16 be available with base 32-bit x86
architecture features only, much like _Float128 and the decimal FP types
are? (If it is restricted to SSE, we can of course ensure relevant libgcc
functions are built with SSE enabled, and likewise in glibc if that gains
_Float16 functions, though maybe with some extra complications to get
relevant testcases to run whenever possible.)

--
Joseph S. Myers
jos...@codesourcery.com

H.J. Lu

unread,
Jul 1, 2021, 6:28:14 PM7/1/21
to Joseph Myers, IA32 System V Application Binary Interface, llvm...@lists.llvm.org, GNU C Library, GCC Patches
On Thu, Jul 1, 2021 at 3:10 PM Joseph Myers <jos...@codesourcery.com> wrote:
>
> On Thu, 1 Jul 2021, H.J. Lu via Gcc-patches wrote:
>
> > 2. Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
>
> That restricts use of _Float16 to processors with SSE. Is that what we
> want in the ABI, or should _Float16 be available with base 32-bit x86
> architecture features only, much like _Float128 and the decimal FP types

Yes, _Float16 requires XMM registers.

> are? (If it is restricted to SSE, we can of course ensure relevant libgcc
> functions are built with SSE enabled, and likewise in glibc if that gains
> _Float16 functions, though maybe with some extra complications to get
> relevant testcases to run whenever possible.)
>

_Float16 functions in libgcc should be compiled with SSE enabled.

BTW, _Float16 software emulation may require more than just SSE
since we need to do _Float16 load and store with XMM registers.
There is no 16bit load/store for XMM registers without AVX512FP16.

--
H.J.

Joseph Myers

unread,
Jul 1, 2021, 6:40:25 PM7/1/21
to H.J. Lu, IA32 System V Application Binary Interface, llvm...@lists.llvm.org, GNU C Library, GCC Patches
On Thu, 1 Jul 2021, H.J. Lu wrote:

> BTW, _Float16 software emulation may require more than just SSE
> since we need to do _Float16 load and store with XMM registers.
> There is no 16bit load/store for XMM registers without AVX512FP16.

You should be able to make the move go via general-purpose registers (for
example) if you can't do a direct 16-bit load/store for XMM registers.

H.J. Lu

unread,
Jul 1, 2021, 7:02:21 PM7/1/21
to Joseph Myers, IA32 System V Application Binary Interface, llvm...@lists.llvm.org, GNU C Library, GCC Patches
On Thu, Jul 1, 2021 at 3:40 PM Joseph Myers <jos...@codesourcery.com> wrote:
>
> On Thu, 1 Jul 2021, H.J. Lu wrote:
>
> > BTW, _Float16 software emulation may require more than just SSE
> > since we need to do _Float16 load and store with XMM registers.
> > There is no 16bit load/store for XMM registers without AVX512FP16.
>
> You should be able to make the move go via general-purpose registers (for
> example) if you can't do a direct 16-bit load/store for XMM registers.
>

There is no 16bit move between GPRs and XMM registers without
AVX512FP16.


--
H.J.

H.J. Lu

unread,
Jul 13, 2021, 10:26:39 AM7/13/21
to Wang, Pengfei, llvm...@lists.llvm.org, Joseph Myers, GCC Patches, GNU C Library, IA32 System V Application Binary Interface
On Mon, Jul 12, 2021 at 8:59 PM Wang, Pengfei <pengfe...@intel.com> wrote:
>
> > Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
>
> Can you please explain the behavior here? Is there difference between _Float16 and _Complex _Float16 when return? I.e.,
> 1, In which case will _Float16 values return in both %xmm0 and %xmm1?
> 2, For a single _Float16 value, are both real part and imaginary part returned in %xmm0? Or returned in %xmm0 and %xmm1 respectively?

Here is the v2 patch to add the missing _Float16 bits. The PDF file is at

https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI

> Thanks
> Pengfei
> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



--
H.J.
v2-0001-Add-optional-_Float16-support.patch

H.J. Lu

unread,
Jul 13, 2021, 11:05:16 AM7/13/21
to Wang, Pengfei, llvm...@lists.llvm.org, Joseph Myers, GCC Patches, GNU C Library, IA32 System V Application Binary Interface
On Tue, Jul 13, 2021 at 7:48 AM Wang, Pengfei <pengfe...@intel.com> wrote:
>
> Hi H.J.,
>
> Our LLVM implementation currently use %xmm0 for both _Complex's real part and imaginary part. Do we have special reason to use two registers?
> We are using one register on X64. Considering the performance, especially the register pressure, should it be better to use one register for _Complex _Float16 on 32 bits target?

x86-64 psABI is unrelated to i386 psABI. Using a pair of registers is
more natural for
complex _Float16. Since it is only used for function return value, I
don't think there is
a register pressure issue.
--
H.J.

Joseph Myers

unread,
Jul 13, 2021, 11:41:18 AM7/13/21
to IA32 System V Application Binary Interface, Wang, Pengfei, llvm...@lists.llvm.org, GCC Patches, GNU C Library
On Tue, 13 Jul 2021, H.J. Lu wrote:

> On Mon, Jul 12, 2021 at 8:59 PM Wang, Pengfei <pengfe...@intel.com> wrote:
> >
> > > Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
> >
> > Can you please explain the behavior here? Is there difference between _Float16 and _Complex _Float16 when return? I.e.,
> > 1, In which case will _Float16 values return in both %xmm0 and %xmm1?
> > 2, For a single _Float16 value, are both real part and imaginary part returned in %xmm0? Or returned in %xmm0 and %xmm1 respectively?
>
> Here is the v2 patch to add the missing _Float16 bits. The PDF file is at
>
> https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI

This PDF shows _Complex _Float16 as having a size of 2 bytes (should be
4-byte size, 2-byte alignment).

It also seems to change double from 4-byte to 8-byte alignment, which is
wrong. And it's inconsistent about whether it covers the long double =
double (Android) case - it shows that case for _Complex long double but
not for long double itself.

H.J. Lu

unread,
Jul 13, 2021, 12:24:51 PM7/13/21
to IA32 System V Application Binary Interface, Wang, Pengfei, llvm...@lists.llvm.org, GCC Patches, GNU C Library
Here is the v3 patch with the fixes. I also updated the PDF file.

> --
> Joseph S. Myers
> jos...@codesourcery.com
>


--
H.J.
v3-0001-Add-optional-_Float16-support.patch

H.J. Lu

unread,
Jul 29, 2021, 9:40:14 AM7/29/21
to IA32 System V Application Binary Interface, Wang, Pengfei, llvm...@lists.llvm.org, GCC Patches, GNU C Library
Here is the final patch I checked in. _Complex _Float16 is changed to return
in XMM0 register. The new PDF file is at

https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI

--
H.J.
0001-Add-optional-_Float16-support.patch

John McCall

unread,
Aug 24, 2021, 1:55:39 AM8/24/21
to ia32...@googlegroups.com, Wang, Pengfei, LLVM Dev, GCC Patches, GNU C Library
This should be explicit that the real part is returned in bits 0..15 and the imaginary part is returned in bits 16..31, or however we conventionally designate subcomponents of a vector.

John.

H.J. Lu

unread,
Aug 25, 2021, 8:36:20 AM8/25/21
to IA32 System V Application Binary Interface, Wang, Pengfei, LLVM Dev, GCC Patches, GNU C Library
How about this?

diff --git a/low-level-sys-info.tex b/low-level-sys-info.tex
index 860ff66..8f527c1 100644
--- a/low-level-sys-info.tex
+++ b/low-level-sys-info.tex
@@ -457,6 +457,9 @@ and \texttt{unions}) are always returned in memory.
& \texttt{__float128} & memory \\
\hline
& \texttt{_Complex _Float16} & \reg{xmm0} \\
+ & & The real part is returned in bits 0..15. The imaginary part is
+ returned \\
+ & & in bits 16..31.\\
\cline{2-3}
Complex & \texttt{_Complex float} & \EDX:\EAX \\
floating- & & The real part is returned in \EAX. The imaginary part is

https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/uploads/89eb3e52c7e5eadd58f7597508e13f34/intel386-psABI-2021-08-25.pdf

--
H.J.

John McCall

unread,
Aug 25, 2021, 4:33:04 PM8/25/21
to ia32...@googlegroups.com, Wang, Pengfei, LLVM Dev, GCC Patches, GNU C Library
On Wed, Aug 25, 2021 at 8:36 AM H.J. Lu <hjl....@gmail.com> wrote:
On Mon, Aug 23, 2021 at 10:55 PM John McCall <rjmc...@gmail.com> wrote:
> On Thu, Jul 29, 2021 at 9:40 AM H.J. Lu <hjl....@gmail.com> wrote:
>> On Tue, Jul 13, 2021 at 9:24 AM H.J. Lu <hjl....@gmail.com> wrote:
>> > On Tue, Jul 13, 2021 at 8:41 AM Joseph Myers <jos...@codesourcery.com> wrote:
>> > > On Tue, 13 Jul 2021, H.J. Lu wrote:
>> > > > On Mon, Jul 12, 2021 at 8:59 PM Wang, Pengfei <pengfe...@intel.com> wrote:
>> > > > >
>> > > > > > Return _Float16 and _Complex _Float16 values in %xmm0/%xmm1 registers.
>> > > > >
>> > > > > Can you please explain the behavior here? Is there difference between _Float16 and _Complex _Float16 when return? I.e.,
>> > > > > 1, In which case will _Float16 values return in both %xmm0 and %xmm1?
>> > > > > 2, For a single _Float16 value, are both real part and imaginary part returned in %xmm0? Or returned in %xmm0 and %xmm1 respectively?
>> > > >
>> > > > Here is the v2 patch to add the missing _Float16 bits.   The PDF file is at
>> > > >
>> > > > https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI
>> > >
>> > > This PDF shows _Complex _Float16 as having a size of 2 bytes (should be
>> > > 4-byte size, 2-byte alignment).
>> > >
>> > > It also seems to change double from 4-byte to 8-byte alignment, which is
>> > > wrong.  And it's inconsistent about whether it covers the long double =
>> > > double (Android) case - it shows that case for _Complex long double but
>> > > not for long double itself.
>> >
>> > Here is the v3 patch with the fixes.  I also updated the PDF file.
>>
>> Here is the final patch I checked in.   _Complex _Float16 is changed to return
>> in XMM0 register.   The new PDF file is at
>>
>> https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI
>
>
> This should be explicit that the real part is returned in bits 0..15 and the imaginary part is returned in bits 16..31, or however we conventionally designate subcomponents of a vector.

How about this?

diff --git a/low-level-sys-info.tex b/low-level-sys-info.tex
index 860ff66..8f527c1 100644
--- a/low-level-sys-info.tex
+++ b/low-level-sys-info.tex
@@ -457,6 +457,9 @@ and \texttt{unions}) are always returned in memory.
     & \texttt{__float128} & memory \\
     \hline
     & \texttt{_Complex _Float16} & \reg{xmm0} \\
+    & & The real part is returned in bits 0..15. The imaginary part is
+        returned \\
+    & & in bits 16..31.\\
     \cline{2-3}
     Complex & \texttt{_Complex float} & \EDX:\EAX \\
     floating- & & The real part is returned in \EAX. The imaginary part is

https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/uploads/89eb3e52c7e5eadd58f7597508e13f34/intel386-psABI-2021-08-25.pdf

Looks good to me, thanks.

John.
Reply all
Reply to author
Forward
0 new messages