use SIMD to speed up POLYVEC


Qian Yun

Oct 14, 2024, 9:34:20 AM
to fricas-devel
I've taken a brief look, and I think it is possible to speed up
functions in the package POLYVEC with SIMD, via the SBCL package sb-simd,
on machines with AVX2 instructions.

- Qian

Qian Yun

Oct 16, 2024, 4:24:57 AM
to fricas-devel
The functions in POLYVEC use modular arithmetic (reduction modulo
a prime). This makes the SIMD algorithm more complex than I thought,
but if implemented correctly, it should easily give
over 10x speedup.
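
For context, the scalar pattern that needs vectorizing looks roughly
like the loop below (the names are illustrative, not the actual POLYVEC
code). The difficulty is that SSE/AVX have no vector integer division
or remainder instruction, so the '% p' has to be replaced by shifts
and multiplications:

#include <stdint.h>

/* Illustrative scalar kernel: pointwise modular multiply-accumulate.
   SIMD units have no integer division, so the "% p" is the hard part. */
void scalar_muladd_mod(uint32_t *z, const uint32_t *x, const uint32_t *y,
                       int n, uint32_t p) {
    int i;
    for (i = 0; i < n; i++) {
        uint64_t t = (uint64_t)x[i] * y[i] + z[i]; /* needs 64-bit room */
        z[i] = (uint32_t)(t % p);                  /* no SIMD equivalent */
    }
}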

- Qian

Waldek Hebisch

Oct 29, 2024, 8:21:33 AM
to fricas...@googlegroups.com
Already SSE2 (which is supposed to be present on all 64-bit PCs)
should give a significant speedup. Comparatively, AVX/AVX2 offers
limited gains, basically only on machines which have sufficiently
wide hardware (IIUC only top models). Potential troubles to solve:

- aligning data to a 16-byte boundary, which is required by normal SSE2
  instructions (I did not check what is required by AVX)
- extension from 32 bits to 64 bits. Main operations limit
  coefficients to 32 bits (actually less than this), but to
  prevent overflow we need operations with 64-bit accuracy.
  SSE has one operation that returns increased accuracy, but
  IIRC this is for 16-bit to 32-bit (we could use such a thing
  sometimes, but not always). AVX may be better here. But
  for example 'vector_combination' needs extension before
  the actual arithmetic (or some way to access the high part of the
  product, but IIUC AVX has no way to do this).
- a fast way to compute remainders. There are fast ways, but they are
  quite low-level; it is hard to use them from higher-level languages.
- for polynomial multiplication we probably should reverse one
  of the arguments, otherwise the loop is hard to vectorize (see the
  sketch after this list).
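
As an illustration of the last point, here is a sketch (not the actual
POLYVEC code) of a convolution inner loop with the second argument
pre-reversed, so that both arrays are read with stride +1 and the loop
becomes friendly to autovectorization; the reduction mod p is omitted,
and the caller must ensure the 64-bit sums cannot overflow:

#include <stdint.h>

/* Sketch: polynomial multiplication with b stored reversed,
   br[j] == b[nb - 1 - j].  Both indices in the inner loop ascend,
   avoiding the backward access that blocks vectorization. */
void poly_mul_rev(uint64_t *c, const uint32_t *a, int na,
                  const uint32_t *br, int nb) {
    int k, i;
    for (k = 0; k < na + nb - 1; k++) {
        int lo = (k < nb) ? 0 : k - nb + 1;
        int hi = (k < na) ? k : na - 1;
        uint64_t s = 0;
        for (i = lo; i <= hi; i++)
            s += (uint64_t)a[i] * br[nb - 1 - k + i]; /* unit stride */
        c[k] = s; /* caller reduces mod p; coefficients must be small
                     enough that the sum fits in 64 bits */
    }
}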

To check what is possible I tried polynomial multiplication code
in C. Using '-O' the C version results in scalar code, but faster
than the sbcl-compiled one. At '-O3' gcc tries to autovectorize the
loop, resulting in slower code. Namely, since there is no guarantee
of alignment, gcc inserts extra code to specially handle the first few
elements, then does the main part in an aligned way, and then a trailer.
Since one of the indices goes in the "wrong direction", gcc inserts a
shuffle instruction. All this slows down the code so that in the
interesting range it runs at about half the speed of the scalar code.
Reversing one of the arguments should eliminate most of the slowdown.
But the extra handling of the initial part is problematic too; for this
we should get proper alignment of data.
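
For the alignment part, C11's aligned_alloc (or POSIX posix_memalign)
can guarantee the boundaries that the aligned load/store forms want;
a minimal sketch of the allocation side:

#include <stdlib.h>
#include <stdint.h>

/* Allocate a coefficient buffer aligned for SSE2 (16) or AVX (32);
   alignment must be a power of two.  C11 aligned_alloc wants the size
   to be a multiple of the alignment, so round up; the padding also
   lets loops run in whole vector blocks. */
uint32_t *alloc_coeffs(size_t n, size_t alignment) {
    size_t bytes = n * sizeof(uint32_t);
    bytes = (bytes + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, bytes);
}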

BTW: My past profile data indicates that most work is done in
'inner_mul' and 'vector_combination'. 'inner_mul' performs
several additions before computing the remainder, so it should be easier
to speed up. 'vector_combination' has only 3 arithmetic operations
before the final remainder, so here the remainder is critical. One
possible way to speed up the remainder is to use primes of special form.
If p = 2^k - a with small a and reasonable k, then the remainder can be
computed in 6 operations (2 shifts, 2 multiplications and 2 subtractions).
On my Core 2 the remainder takes 20 clocks, so 6 operations would be
a substantial improvement. There are other schemes for computing
the remainder, of similar cost.
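
A scalar sketch of that special-prime scheme (assuming p = 2^k - a with
small a, and an input below 2^(2k)): since 2^k is congruent to a mod p,
subtracting (x >> k)*p twice brings the value close to the target range,
and one conditional subtraction finishes the reduction:

#include <stdint.h>

/* Remainder mod p = 2^k - a, a small, for x < 2^(2k):
   2 shifts, 2 multiplications, 2 subtractions, plus one conditional
   correction.  After the first step the value is below about (a+1)*2^k,
   after the second below 2^k + a^2 < 2p, so one subtraction suffices. */
static inline uint64_t rem_special(uint64_t x, uint64_t p, int k) {
    x -= (x >> k) * p;   /* shift, multiply, subtract */
    x -= (x >> k) * p;   /* and once more */
    return (x >= p) ? x - p : x;
}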

--
Waldek Hebisch

Qian Yun

Oct 29, 2024, 9:41:14 AM
to fricas...@googlegroups.com


On 10/29/24 8:21 PM, Waldek Hebisch wrote:
> Already SSE2 (which is supposed to be present on all 64-bit PCs)
> should give a significant speedup. Comparatively, AVX/AVX2 offers
> limited gains, basically only on machines which have sufficiently
> wide hardware (IIUC only top models). Potential troubles to solve:

The Steam survey shows that 97.49% of machines have AVX and 94.61% have AVX2.
AVX512 is much less common, only 13%.

Isn't AVX supposed to have twice the throughput of SSE2?

> - aligning data to a 16-byte boundary, which is required by normal SSE2
> instructions (I did not check what is required by AVX)
> - extension from 32 bits to 64 bits. Main operations limit
> coefficients to 32 bits (actually less than this), but to
> prevent overflow we need operations with 64-bit accuracy.
> SSE has one operation that returns increased accuracy, but
> IIRC this is for 16-bit to 32-bit (we could use such a thing
> sometimes, but not always). AVX may be better here. But
> for example 'vector_combination' needs extension before
> the actual arithmetic (or some way to access the high part of the
> product, but IIUC AVX has no way to do this).
> - a fast way to compute remainders. There are fast ways, but they are
> quite low-level; it is hard to use them from higher-level languages.

I saw "Modular SIMD arithmetic in Mathemagix" by J. van der Hoeven.
There's an algorithm doing modular multiplication entirely in 32 bits.
Our current method stores the temporary value in 64 bits.

The potential 10x speedup I wrote about is based on the numbers in
this article.


BTW, which "high level" functions use POLYVEC underneath?
I'd like to use those as benchmarks.

- Qian

Waldek Hebisch

Nov 1, 2024, 2:15:52 PM
to fricas...@googlegroups.com
On Tue, Oct 29, 2024 at 09:41:20PM +0800, Qian Yun wrote:
> The Steam survey shows that 97.49% of machines have AVX and 94.61% have AVX2.
> AVX512 is much less common, only 13%.

My Core 2 has only SSE2, no AVX. There are faster machines now,
but it is still faster than modern low-end machines.

> Isn't AVX supposed to have twice the throughput of SSE2?

AVX doubles decode efficiency and adds some new instructions.
But throughput depends on the width of the execution units. If a machine
has 128-bit execution units (or narrower, but I hope no machine
is narrower), then the gain compared to SSE2 will be
quite modest (if any). IIUC quite a lot of AVX machines have
128-bit execution units.

> I saw "Modular SIMD arithmetic in Mathemagix" by J. van der Hoeven.
> There's an algorithm doing modular multiplication entirely in 32 bits.
> Our current method stores the temporary value in 64 bits.

No, AFAICS Joris is extending 32-bit numbers to 64-bit.

> The potential 10x speedup I wrote about is based on the numbers in
> this article.

Well, in 'vector_combination' the base step is 'z(i) := (a*v(i) + b*w(i)) rem p'.
IIRC the time per step on Core 2 is 20 cycles. Joris is considering
each operation as a separate thing, so that is two multiplications
and an addition. Using his timings for the operations, that is 4.2 cycles
in the best case with SSE and 2.5 with AVX. So there is potential
for significant speedup (roughly 5x to 8x by these numbers).

> BTW, which "high level" functions use POLYVEC underneath?
> I'd like to use those as benchmarks.

The original motivation for POLYVEC was the guessing package, that
is functions like 'guessHolo', 'guessPRec', etc. Theoretically the
modular part is the only important one (the others have lower
theoretical complexity), so I initially worked on the modular
part. However, when the modular part was integrated into the whole
thing it turned out that in several cases the other parts took
more than 90% of the execution time. So the remaining work was mostly
to make the non-modular parts faster. Still, sometimes they take
substantial time.

The second use is polynomial GCD over algebraic extensions, which
in turn is used by the integrator.

Currently, factorization over prime fields uses POLYVEC and similar
modular operations, like multiplication of a vector by a matrix
or matrix multiplication.

--
Waldek Hebisch

Qian Yun

Nov 1, 2024, 7:48:40 PM
to fricas...@googlegroups.com


On 11/2/24 2:15 AM, Waldek Hebisch wrote:
>>
>> I saw "Modular SIMD arithmetic in Mathemagix" by J. van der Hoeven.
>> There's an algorithm doing modular multiplication entirely in 32 bits.
>> Our current method stores the temporary value in 64 bits.
>
> No, AFAICS Joris is extending 32-bit numbers to 64-bit.
>

Yes, you are right. I took a deeper look: for multiplication,
it unpacks each vector into 2 vectors (high bits and low bits),
and for u32 it does the multiplication in u64.

The main contribution is that it does the remainder in SIMD.
- Qian

Waldek Hebisch

Nov 5, 2024, 8:56:56 PM
to fricas...@googlegroups.com
I looked at information about vector instructions. One thing:
it is a mess, the available instructions are quite irregular.
Concerning availability, it seems that to have 256-bit operations
we need AVX2; more precisely, we need 'vpaddq', 'vpmullq',
'vpminuq' (and 'vmovdqu', but it is probably available on all machines
that have the first 3). AFAICS, of the machines that I have or can use,
at least 4 do not have AVX; one is a newly bought mini-PC. Two other
machines apparently have AVX but lack AVX2. You can try the C code below
implementing 'vector_combination'. When you direct clang or gcc to
generate code for new high-end processors, and one gives it
'-ftree-vectorize -O' (or '-O3'), then it generates 256-bit
vector instructions. However, the code is quite bulky and it is
not clear to me how well it will perform on the moderate-length vectors
which are typical in our use. One reason for the size is that vector
operations need initial setup. Another is that they have substantial
latency and the processor needs to have several in the pipeline to attain
good throughput. But then there is a partial block at the start or at the
end which needs separate handling.

When compiling for older high-end processors or low-end ones, the
compiler uses 128-bit instructions or scalar code.

BTW1: The code would be smaller and faster with properly aligned and
padded buffers, but IIUC neither the C compiler nor sb-simd has
good support for this (the generated code uses unaligned loads, which
are slower than aligned ones).

BTW2: Our current code is equivalent to replacing the part after the
initialization of s1 with 'res[i] = s1 % p;'. My measurements
indicate that on Core 2 this works essentially as fast as the division
(IIUC all other computations can be done in parallel with the division).
When using a C compiler this is likely to hold also on newer processors,
which have faster division.
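
Spelled out as a full routine, that baseline variant is just:

#include <stdint.h>

/* Baseline: same loop as the code below, with the remainder left to
   the hardware divider instead of the multiply-based reduction. */
void vector_combination_div(uint32_t *v1, uint32_t c1, uint32_t *v2,
                            uint32_t c2, uint32_t *res, int n, uint32_t p) {
    int i;
    for (i = 0; i < n; i++) {
        uint64_t s1 = ((uint64_t)(v1[i]))*c1 + ((uint64_t)(v2[i]))*c2;
        res[i] = (uint32_t)(s1 % p);
    }
}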

BTW3: Agner Fog in his tables claims that on Zen 1 vector multiplication
works essentially at the speed of scalar multiplication (the processor
can do 1 scalar multiplication per clock and 1 vector multiplication
per 4 clocks). There should be a gain from doing the other operations
in a vector way, but clearly on such a processor the gain from vector
code is limited.

BTW4: Ideally we should have tens of routines in a similar spirit to
cover various use cases. When you multiply this by the number of
processor architectures there is a combinatorial explosion.

#include <stdint.h>

/* res[i] := (v1[i]*c1 + v2[i]*c2) mod p.
   q is a precomputed scaled reciprocal of p (roughly 2^63/p), used
   for a multiply-based reduction in place of a division. */
void
vector_combination(uint32_t * v1, uint32_t c1, uint32_t * v2, uint32_t c2,
                   uint32_t * res, int n, uint32_t p, uint32_t q) {
    int i;
    for(i = 0; i < n; i++) {
        uint64_t s1 = ((uint64_t)(v1[i]))*c1 + ((uint64_t)(v2[i]))*c2;
        uint64_t hs1 = s1 >> 32;        /* high half of the sum */
        uint64_t q1 = (hs1*q) >> 31;    /* approximate quotient s1/p */
        uint64_t r1 = s1 - q1*p;        /* remainder, possibly too big */
        /* May need also a second correction */
        r1 = (r1 >= p)?(r1 - p):r1;     /* >= so that r1 == p reduces to 0 */
        // r1 = (r1 >= p)?(r1 - p):r1;
        res[i] = r1;
    }
}
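
To experiment with it (compiled together with the function above, e.g.
'gcc -O3 -mavx2'), a throwaway driver along these lines works. The prime
and the coefficients here are illustrative; q is the scaled reciprocal
floor(2^63/p), which fits in 32 bits because p is just above 2^31. Since
the result may be short one correction, the check only tests congruence
mod p:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const uint32_t p = 2147483659u;               /* prime, 2^31 + 11 */
    const uint32_t q = (uint32_t)((1ULL << 63) / p);
    enum { N = 1000 };
    uint32_t *v1 = malloc(N * sizeof *v1);
    uint32_t *v2 = malloc(N * sizeof *v2);
    uint32_t *res = malloc(N * sizeof *res);
    int i;
    for (i = 0; i < N; i++) { v1[i] = i; v2[i] = 7u * i; }
    vector_combination(v1, 123456789u, v2, 987654321u, res, N, p, q);
    for (i = 0; i < N; i++) {   /* check against plain 64-bit arithmetic */
        uint64_t ref = (((uint64_t)v1[i]) * 123456789u
                        + ((uint64_t)v2[i]) * 987654321u) % p;
        if (res[i] % p != ref) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("ok\n");
    return 0;
}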

--
Waldek Hebisch