[llvm-dev] [RFC] carry-less multiplication instruction

194 views
Skip to first unread message

Shawn Landden via llvm-dev

unread,
Jul 5, 2020, 5:18:55 AM7/5/20
to llvm...@lists.llvm.org
 

Carry-less multiplication[1] instructions exist (at least optionally) on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly more.

This proposal is to add a llvm.clmul instruction. Or if that is contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer operands of the same width, and returns an integer with twice the width of the operands. (Is there a good reason to make these the same width, as all the other operations do even when it doesn’t really make sense for the mathematical operation–like multiplication or ctpop/ctlz/cttz?)

If the CPU does not have a dedication clmul operation, it can be lowered to regular multiplication, by using holes to avoid carrys.

==Where is clmul used?==

While somewhat specialized, the RISC-V manual documents many uses: [2]

The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]

and Galois/CounterMode (GCM), but more applications exist, including the following examples.There are obvious applications in hashing and pseudo random number generations. For exam-ple, it has been reported that hashes based on carry-less multiplications can outperform Google’sCityHash [17].

clmulof a number with itself inserts zeroes between each input bit. This can be useful for generatingMorton code [23].

clmulof a number with -1 calculates the prefix XOR operation. This can be useful for decodinggray codes.Another application of XOR prefix sums calculated withclmulis branchless tracking of quotedstrings in high-performance parsers. [16]

Carry-less multiply can also be used to implement Erasure code efficiently. [14]

==clmul lowering without hardware support==
A 8x8=>16 clmul can also be lowered to a 32x32=>64 multiplication when there is no specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].

[1] https://en.wikipedia.org/wiki/Carry-less_product
[2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
[3] https://www.bearssl.org/constanttime.html

 

(First posted to discord

-- 
Shawn Landden
 

Roman Lebedev via llvm-dev

unread,
Jul 5, 2020, 6:22:26 AM7/5/20
to Shawn Landden, llvm...@lists.llvm.org

What benefit would this intrinsic would bring to the middle-end IR,
over it's current naive expanded form?

Note that teaching backends to produce it, or even adding it to
backend (ISD opcodes)
and matching it in DAGCombiner has much lower barrier of entry, i
would suggest to start there.

> (First posted to discord
>
> --
> Shawn Landden

Roman

> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Nicolai Hähnle via llvm-dev

unread,
Jul 5, 2020, 8:12:48 AM7/5/20
to Roman Lebedev, Shawn Landden, llvm...@lists.llvm.org
On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:
> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>>
>>
>>
>> Carry-less multiplication[1] instructions exist (at least optionally) on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly more.
>>
>> This proposal is to add a llvm.clmul instruction. Or if that is contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer operands of the same width, and returns an integer with twice the width of the operands. (Is there a good reason to make these the same width, as all the other operations do even when it doesn’t really make sense for the mathematical operation–like multiplication or ctpop/ctlz/cttz?)
>>
>> If the CPU does not have a dedication clmul operation, it can be lowered to regular multiplication, by using holes to avoid carrys.
>>
>> ==Where is clmul used?==
>>
>> While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>> The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]
>>
>> and Galois/CounterMode (GCM), but more applications exist, including the following examples.There are obvious applications in hashing and pseudo random number generations. For exam-ple, it has been reported that hashes based on carry-less multiplications can outperform Google’sCityHash [17].
>>
>> clmulof a number with itself inserts zeroes between each input bit. This can be useful for generatingMorton code [23].
>>
>> clmulof a number with -1 calculates the prefix XOR operation. This can be useful for decodinggray codes.Another application of XOR prefix sums calculated withclmulis branchless tracking of quotedstrings in high-performance parsers. [16]
>>
>> Carry-less multiply can also be used to implement Erasure code efficiently. [14]
>>
>> ==clmul lowering without hardware support==
>> A 8x8=>16 clmul can also be lowered to a 32x32=>64 multiplication when there is no specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>>
>> [1] https://en.wikipedia.org/wiki/Carry-less_product
>> [2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>> [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?

Isn't a "naive" expansion of NxN carryless multiply extremely involved?
I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't
exactly call it naive...

Cheers,
Nicolai


>
> Note that teaching backends to produce it, or even adding it to
> backend (ISD opcodes)
> and matching it in DAGCombiner has much lower barrier of entry, i
> would suggest to start there.
>
>> (First posted to discord
>>
>> --
>> Shawn Landden
> Roman
>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm...@lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.

James Y Knight via llvm-dev

unread,
Jul 5, 2020, 10:55:09 AM7/5/20
to Shawn Landden, llvm-dev
It'd be useful in your proposal to to note which of the existing llvm target specific intrinsics this generic intrinsic can effectively supersede. (E.g. llvm.x86.pclmulqdq for x86.)

When we are already supporting a given function via target specific intrinsics for a number of different targets, that seems a pretty good argument for making it available as a more generic target independent intrinsic.

Craig Topper via llvm-dev

unread,
Jul 5, 2020, 2:44:53 PM7/5/20
to James Y Knight, llvm-dev
Shawn,

Are you able to summarize the different instructions from the various targets. It looks like there different implementation choices made for each target. For example, X86 takes two v2i64 inputs and picks either an even or odd element from each to multiply to produce a v1i128 result. It looks like RISC-V has instructions to produce either the high half of the result or the low half of the result. Those are the only two I checked.

Will a common intrinsic need custom handling for each target or is there a common version that multiple targets use that we should choose for the intrinsic?

~Craig

Chris Lattner via llvm-dev

unread,
Jul 5, 2020, 4:07:44 PM7/5/20
to Nicolai Hähnle, llvm...@lists.llvm.org
On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <llvm...@lists.llvm.org> wrote:

[1] https://en.wikipedia.org/wiki/Carry-less_product
[2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
[3] https://www.bearssl.org/constanttime.html
What benefit would this intrinsic would bring to the middle-end IR,
over it's current naive expanded form?

Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive…

No, the typical expansion (used by the existing LLVM code generator) uses “grade school math” to decompose large multiplications into smaller ones.

Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example.  This is all already handled by the existing selection dag infra, my understanding is that it works well in practice.

-Chris

Roman Lebedev via llvm-dev

unread,
Jul 5, 2020, 4:09:53 PM7/5/20
to Chris Lattner, llvm...@lists.llvm.org
On Sun, Jul 5, 2020 at 11:07 PM Chris Lattner <clat...@nondot.org> wrote:
>
>
>
> On Jul 5, 2020, at 5:12 AM, Nicolai Hähnle via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> [1] https://en.wikipedia.org/wiki/Carry-less_product
> [2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
> [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
>
>
> Isn't a "naive" expansion of NxN carryless multiply extremely involved? I'd expect something like 2N shifts, N truncs, N selects, and N xors.
>
> That link mentions an alternative that is more efficient, but I wouldn't exactly call it naive…
>
>
> No, the typical expansion (used by the existing LLVM code generator) uses “grade school math” to decompose large multiplications into smaller ones.
>
> Wide multiplications (e.g. those on X86) are pretty easy to pattern match from “sign/zero extend operands from 64 bits to 128 bits, then do a 128x128=128 multiply” for example.
That is what i meant, yes.

> This is all already handled by the existing selection dag infra, my understanding is that it works well in practice.
>
> -Chris

James Y Knight via llvm-dev

unread,
Jul 5, 2020, 5:28:33 PM7/5/20
to Chris Lattner, llvm-dev
"Carry-less multiplication" isn't a wider bit width multiplication, it's an entirely different operation. (See the Wikipedia page linked in the proposal).

Shawn Landden via llvm-dev

unread,
Jul 6, 2020, 6:34:59 AM7/6/20
to Craig Topper, James Y Knight, llvm-dev
 
 
05.07.2020, 13:44, "Craig Topper" <craig....@gmail.com>:
Shawn,
 
Are you able to summarize the different instructions from the various targets. It looks like there different implementation choices made for each target. For example, X86 takes two v2i64 inputs and picks either an even or odd element from each to multiply to produce a v1i128 result. It looks like RISC-V has instructions to produce either the high half of the result or the low half of the result. Those are the only two I checked.
 
Will a common intrinsic need custom handling for each target or is there a common version that multiple targets use that we should choose for the intrinsic?

Only the Power8 instructions are differen't, as it can do two 64+64=>128 multiplications at the same time, with the result xored together (Karatsuba-style), and if you don't want that you have to make sure you are multiplying by zero for one of them. So Power would require special lowing for 128+128=>256 multiply. So with Power you get 3 multiplys, but you have to zero some registers, while on RISC-V you would have 4 right after each other, and them have to xor the middle two.
-- 
Shawn Landden
 

Shawn Landden via llvm-dev

unread,
Jul 6, 2020, 6:44:38 AM7/6/20
to Nicolai Hähnle, Roman Lebedev, llvm...@lists.llvm.org
 
 
05.07.2020, 07:12, "Nicolai Hähnle" <nhae...@gmail.com>:

On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:

 On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
 <llvm...@lists.llvm.org> wrote:
 This proposal is to add a llvm.clmul instruction.

 What benefit would this intrinsic would bring to the middle-end IR,
 over it's current naive expanded form?


Isn't a "naive" expansion of NxN carryless multiply extremely involved?


Yes it is. And this is then sped up with a table (such as in the official GCM spec), however using a table can introduce key-dependent loads and security problems. The 32+32->64 or 64+64->64 multiplication lowering is generally constant-time and does not have these security problems.

I'd expect something like 2N shifts, N truncs, N selects, and N xors.

That link mentions an alternative that is more efficient, but I wouldn't
exactly call it naive...

Cheers,
Nicolai

 


 Note that teaching backends to produce it, or even adding it to
 backend (ISD opcodes)
 and matching it in DAGCombiner has much lower barrier of entry, i
 would suggest to start there.
 
 (First posted to discord

 --
 Shawn Landden
 Roman
 
 _______________________________________________
 LLVM Developers mailing list
 llvm...@lists.llvm.org
 https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
 _______________________________________________
 LLVM Developers mailing list
 llvm...@lists.llvm.org
 https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
 

 

--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
 
 
-- 
Shawn Landden
 

Shawn Landden via llvm-dev

unread,
Jul 6, 2020, 7:42:06 AM7/6/20
to Nicolai Hähnle, Roman Lebedev, llvm...@lists.llvm.org

05.07.2020, 07:12, "Nicolai Hähnle" <nhae...@gmail.com>:

> On 05.07.20 12:21, Roman Lebedev via llvm-dev wrote:
>>  On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
>>  <llvm...@lists.llvm.org> wrote:
>>>  This proposal is to add a llvm.clmul instruction.
>>

>>  What benefit would this intrinsic would bring to the middle-end IR,
>>  over it's current naive expanded form?
>
> Isn't a "naive" expansion of NxN carryless multiply extremely involved?
> I'd expect something like 2N shifts, N truncs, N selects, and N xors.

Yes it is. And this is then sped up with a table (such as in the official GCM spec), however using a table can introduce key-dependent loads and security problems. The 32+32->64 or 64+64->64 multiplication lowering is generally constant-time and does not have these security problems.
>

> That link mentions an alternative that is more efficient, but I wouldn't
> exactly call it naive...
>
> Cheers,
> Nicolai

Shawn Landden via llvm-dev

unread,
Jul 6, 2020, 7:42:50 AM7/6/20
to Craig Topper, James Y Knight, llvm-dev

05.07.2020, 13:44, "Craig Topper" <craig....@gmail.com>:
> Shawn,
>
> Are you able to summarize the different instructions from the various targets. It looks like there different implementation choices made for each target. For example, X86 takes two v2i64 inputs and picks either an even or odd element from each to multiply to produce a v1i128 result. It looks like RISC-V has instructions to produce either the high half of the result or the low half of the result. Those are the only two I checked.
>
> Will a common intrinsic need custom handling for each target or is there a common version that multiple targets use that we should choose for the intrinsic?
>
Only the Power8 instructions are differen't, as it can do two 64+64=>128 multiplications at the same time, with the result xored together (Karatsuba-style), and if you don't want that you have to make sure you are multiplying by zero for one of them. So Power would require special lowing for 128+128=>256 multiply. So with Power you get 3 multiplys, but you have to zero some registers, while on RISC-V you would have 4 right after each other, and them have to xor the middle two.

_______________________________________________

Steve (Numerics) Canon via llvm-dev

unread,
Jul 8, 2020, 12:23:13 PM7/8/20
to Shawn Landden, llvm...@lists.llvm.org
FWIW, this seems like a no-brainer to me (as llvm.experimental initially), assuming that it can be designed in such a way that it would eliminate the need for intrinsics on at least two targets (I think it should be possible to do so, with a small amount of back-end work).

– Steve

Roman Lebedev via llvm-dev

unread,
Jul 9, 2020, 10:42:08 AM7/9/20
to llvm...@lists.llvm.org
(As per IRC discussion)

I understand that the carry-less multiplication algorithm has it's uses
since/and it is implemented as an instruction in many architectures
and that adding it as a general-purpose intrinsic will allow us
to drop target-specific intrinsics as by-product.

What i do *NOT* understand is: what is the actual/main goal/driving
factor of adding an LLVM intrinsic for it?

The use that was mentioned is crypto, and i'm personally not really
registering anything else. Am i just misreading it?
The crypto use-case doesn't make sense to me, because
as of this moment LLVM "explicitly" has zero constant-time
guarantees for LLVM IR instructions/intrinsics.

I feel like it's a really important question.
If there isn't interest in crypto/constant-time here, i think it would
be best to explicitly state so. If there is, i think it may be good
to hear from Chandler (who IIRC is driving some constant-time
work for C++/LLVM, that is not yet widely public)

Roman

Steve (Numerics) Canon via llvm-dev

unread,
Jul 9, 2020, 11:13:48 AM7/9/20
to Roman Lebedev, llvm...@lists.llvm.org
CLMUL is absolutely useful outside of “crypto” contexts that want/require “constant time” operation.

To name just two families of uses, it’s the backbone of many hash/checksum algorithms and error-correcting codes, where the goal is often simply to go as fast as possible, and uArch side-channel resistance is not a concern.

– Steve

Hal Finkel via llvm-dev

unread,
Jul 9, 2020, 11:25:15 AM7/9/20
to Steve (Numerics) Canon, Roman Lebedev, llvm...@lists.llvm.org


On 7/9/20 10:13 AM, Steve (Numerics) Canon via llvm-dev wrote:
CLMUL is absolutely useful outside of “crypto” contexts that want/require “constant time” operation.

To name just two families of uses, it’s the backbone of many hash/checksum algorithms and error-correcting codes, where the goal is often simply to go as fast as possible, and uArch side-channel resistance is not a concern.

– Steve


+1

See, e.g., https://lemire.me/blog/2015/10/26/crazily-fast-hashing-with-carry-less-multiplications/ -- and also, https://en.wikipedia.org/wiki/CLMUL_instruction_set, "One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2^k) multiplication. Another application is the fast calculation of CRC values, including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush."

 -Hal



On Jul 9, 2020, at 10:41 AM, Roman Lebedev via llvm-dev <llvm...@lists.llvm.org> wrote:


What i do *NOT* understand is: what is the actual/main goal/driving
factor of adding an LLVM intrinsic for it?

The use that was mentioned is crypto, and i'm personally not really
registering anything else. Am i just misreading it?
The crypto use-case doesn't make sense to me, because
as of this moment LLVM "explicitly" has zero constant-time
guarantees for LLVM IR instructions/intrinsics.


_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Shawn Landden via llvm-dev

unread,
Jul 9, 2020, 12:39:42 PM7/9/20
to Roman Lebedev, llvm...@lists.llvm.org

05.07.2020, 05:22, "Roman Lebedev" <lebed...@gmail.com>:


> On Sun, Jul 5, 2020 at 12:18 PM Shawn Landden via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>>  Carry-less multiplication[1] instructions exist (at least optionally) on many architectures: armv8, RISC-V, x86_64, POWER, SPARC, C64x, and possibly more.
>>
>>  This proposal is to add a llvm.clmul instruction. Or if that is contentious, llvm.experimental.bitmanip.clmul instruction. It takes two integer operands of the same width, and returns an integer with twice the width of the operands. (Is there a good reason to make these the same width, as all the other operations do even when it doesn’t really make sense for the mathematical operation–like multiplication or ctpop/ctlz/cttz?)
>>
>>  If the CPU does not have a dedication clmul operation, it can be lowered to regular multiplication, by using holes to avoid carrys.
>>
>>  ==Where is clmul used?==
>>
>>  While somewhat specialized, the RISC-V manual documents many uses: [2]
>>
>>  The classic applications forclmulare Cyclic Redundancy Check (CRC) [11, 26]
>>
>>  and Galois/CounterMode (GCM), but more applications exist, including the following examples.There are obvious applications in hashing and pseudo random number generations. For exam-ple, it has been reported that hashes based on carry-less multiplications can outperform Google’sCityHash [17].
>>
>>  clmulof a number with itself inserts zeroes between each input bit. This can be useful for generatingMorton code [23].
>>
>>  clmulof a number with -1 calculates the prefix XOR operation. This can be useful for decodinggray codes.Another application of XOR prefix sums calculated withclmulis branchless tracking of quotedstrings in high-performance parsers. [16]
>>
>>  Carry-less multiply can also be used to implement Erasure code efficiently. [14]
>>
>>  ==clmul lowering without hardware support==
>>  A 8x8=>16 clmul can also be lowered to a 32x32=>64 multiplication when there is no specialized instruction (also 15x15=>30, to a 60x60=>120, or if bitreverse is available 16x16=>32 to TWO 64x64=>64 multiplications)[3].
>>
>>  [1] https://en.wikipedia.org/wiki/Carry-less_product
>>  [2] (page 30) https://raw.githubusercontent.com/riscv/riscv-bitmanip/master/bitmanip-0.92.pdf
>>  [3] https://www.bearssl.org/constanttime.html
>
> What benefit would this intrinsic would bring to the middle-end IR,
> over it's current naive expanded form?
>
> Note that teaching backends to produce it, or even adding it to
> backend (ISD opcodes)
> and matching it in DAGCombiner has much lower barrier of entry, i
> would suggest to start there.

It cannot be matched.


>
>>  (First posted to discord
>>
>>  --
>>  Shawn Landden
>
> Roman
>
>>  _______________________________________________
>>  LLVM Developers mailing list
>>  llvm...@lists.llvm.org
>>  https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Shawn Landden

Krzysztof Parzyszek via llvm-dev

unread,
Jul 9, 2020, 1:28:01 PM7/9/20
to llvm...@lists.llvm.org
FWIW, Hexagon has a pass to recognize polynomial multiplication:
llvm/lib/Target/Hexagon/HexagonLoopIdiomRecognition.cpp
See "PolynomialMultiplyRecognize"

--
Krzysztof Parzyszek kpar...@quicinc.com AI tools development

Craig Topper via llvm-dev

unread,
Jul 9, 2020, 3:15:54 PM7/9/20
to Krzysztof Parzyszek, llvm...@lists.llvm.org
I think i'd prefer to have the output type match the input type. Makes it more similar to other intrinsics and binary operators. We should be able to zero extend the inputs if you want something like 64x64->128.

Are you planning to expose this to C through clang? What types would we expose?

~Craig

Krzysztof Parzyszek via llvm-dev

unread,
Jul 9, 2020, 3:32:04 PM7/9/20
to Craig Topper, llvm...@lists.llvm.org

I didn’t have any specific plans.  What happens next depends on whether people find this code useful.  It’s kind of complicated because since the time it was written the “canonical form” of LLVM IR has changed multiple times, and I added more and more stuff to “recanonicalize” it back into the form the idiom recognition code is used to seeing.

 

I was thinking about inventing a different way of recognizing the idiom, but I haven’t had time to spend on it.  Whatever it is, it should be immune to ongoing changes in instcombine.

 

--

Krzysztof Parzyszek  kpar...@quicinc.com   AI tools development

 

Shawn Landden via llvm-dev

unread,
Jul 9, 2020, 3:52:58 PM7/9/20
to Craig Topper, Krzysztof Parzyszek, llvm...@lists.llvm.org

09.07.2020, 14:16, "Craig Topper via llvm-dev" <llvm...@lists.llvm.org>:


> I think i'd prefer to have the output type match the input type. Makes it more similar to other intrinsics and binary operators. We should be able to zero extend the inputs if you want something like 64x64->128.
>
> Are you planning to expose this to C through clang? What types would we expose?

That is difficult because __int128 doesn't work on all the targets that have this instruction. So I was thinking void __builtin_clmul(uint64_t l, uint64_t r, uint64_t *lo, uint64_t *hi), which is pretty ugly. And then to get the best code for 128+128->256 you have to pull those lo and hi right off off a v2i64, and then put these results right back into a v2i64, and then xor the middle sum (power can do two multiplies and an xor in one instruction) and do this all in a way that is nice to the optimizer.....just because we don't have a __int128 type everywhere we have 128-bit SIMD.

> ,

Shawn Landden via llvm-dev

unread,
Jul 9, 2020, 4:33:36 PM7/9/20
to Craig Topper, Krzysztof Parzyszek, llvm...@lists.llvm.org
 
 
09.07.2020, 14:16, "Craig Topper via llvm-dev" <llvm...@lists.llvm.org>:
I think i'd prefer to have the output type match the input type. Makes it more similar to other intrinsics and binary operators. We should be able to zero extend the inputs if you want something like 64x64->128.
 
Are you planning to expose this to C through clang? What types would we expose?
Like we could do v4i64 __builtin_clmul128(v2i64, v2i64) (using vector extensions) but really it is 128+128->256 and we are just using these funny types to represent __int128 and __int256....and then emit a i256 multiply in the IR. And also expose the 64+64->i128 as
 
void __builtin_clmul64(uint64_t l, uint64_t r, uint64_t *lo, uint64_t *hi)
,

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

 
 
-- 
Shawn Landden
 

Shawn Landden via llvm-dev

unread,
Jul 19, 2020, 11:24:10 AM7/19/20
to Shawn Landden, llvm...@lists.llvm.org
In trying to implement my RFC I ran into difficulty selecting
zero-extends, either in a custom lowering, or in tablegen.

For example, Power 8 has <2 x i32> -> <2 x i64>, <4 x i16> -> <4 x
i32>, <1 x i64> -> <1 x i128> instructions.

And the generated MIR looks different for the different cases:

LLVM ERROR: Cannot select: t10: v2i64 = clmul t36, t35
t36: v2i64 = bitcast t33
t33: v16i8 =
vector_shuffle<0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23> t32, t29
t32: v16i8 = bitcast t2
t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %0
t1: v4i32 = Register %0
t29: v16i8 = bitcast t19
t19: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>,
Constant:i32<0>, Constant:i32<0>
t18: i32 = Constant<0>
t18: i32 = Constant<0>
t18: i32 = Constant<0>
t18: i32 = Constant<0>
t35: v2i64 = bitcast t30
t30: v16i8 =
vector_shuffle<0,1,2,3,16,17,18,19,4,5,6,7,20,21,22,23> t28, t29
t28: v16i8 = bitcast t4
t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
t3: v4i32 = Register %1
t29: v16i8 = bitcast t19
t19: v4i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>,
Constant:i32<0>, Constant:i32<0>
t18: i32 = Constant<0>
t18: i32 = Constant<0>
t18: i32 = Constant<0>
t18: i32 = Constant<0>

LLVM ERROR: Cannot select: t9: v1i128 = clmul t24, t28
t24: v1i128 = bitcast t23
t23: v2i64 = BUILD_VECTOR t2, Constant:i64<0>
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t1: i64 = Register %0
t13: i64 = Constant<0>
t28: v1i128 = bitcast t27
t27: v2i64 = BUILD_VECTOR t4, Constant:i64<0>
t4: i64,ch = CopyFromReg t0, Register:i64 %1
t3: i64 = Register %1
t13: i64 = Constant<0>


I might be able to use demandedBits in a custom rule, but it sure
would nice if zext actually worked, like it would in LLVM-IR. The
burden of learning totally distinct tools for different levels of the
compiler (in addition to the assembly of the target) is quite
annoying.

Reply all
Reply to author
Forward
0 new messages