Already SSE2 (which is supposed to be present on all 64-bit PC-s)
should give significant speedup. Comparatively, AVX/AVX2 offers
limited gains, basicaly only on machines which have sufficiently
wide hardware (IIUC only top models). Potential troubles to solve:
- aligning data to 16 byte boundary which is required by normal SSE2
instructions (I did not check what is required by AVX)
- extention from 32-bits to 64-bits. Main operations limit
coeffiecints to 32-bits (actually less than this) but to
prevent overflow we need operations with 64-bit accuracy.
SSE has one operation that returns increased accuracy, but
IIRC this is for 16-bit to 32-bit (we could use such thing
sometimes, but not always). AVX may be better here. But
for example 'vector_combination' needs extention before
actual arithmetic (or some way to access high part of the
product, but IIUC AVX have no way to do this).
- fast way to compute remainders. There are fast ways but quite
low-level, it is hard to use them from higher-level languages
- for polynomial mutiplication we probably should reverse one
of arguments, otherwise the loop is hard to vectorise.
To check what is possible I tried polynomial multiplication code
on C. Using '-O' the C version results in scalar code, but faster
than sbcl-compiled one. At '-O3' gcc tries to autovectorize the
loop resulting in slower code. Namely, since there is no warranty
of alignment gcc inserts extra code to specially handle few first
elements, then do main part in aligned way and then trailer.
since one of indices goes in "wrong direction" gcc inserts shuffle
instruction. All this slows down the code so that in interesting
range it run at about half of speed of scalar code. Reversing
one of arguments should eliminate most of the slowdown.
But extra handling of inital part is problematic too, for this
we should get proper alignment of data.
BTW: My past profile data indicates that most work is done in
'inner_mul' and 'vector_combination'. 'inner_mul' is performing
several additions before computing remainder so should be easier
to speed up. 'vector_combination' has only 3 arithmetic operations
before final remainder, so here remainder is critical. One possible
way to speed up remainder is to use primes of special form.
If p = 2^k - a with small a and resonable k, then remainder could be
done in 6 operations (2 shifts, 2 multiplications and 2 subtractions).
On my Core2 remainder takes 20 clocks, so 6 operations would be
a substantial improvement. There are other schemes for computing
remainder, of similar cost.
--
Waldek Hebisch