On my machine the mpir FFT starts beating toom algorithms at about
160000 bits, so I start the comparison just above that point. There
are timings missing because I do not yet have the truncated matrix
fourier sqrt2 transforms implemented. Also, the latter timings make
use of the mpir Nussbaumer convolution as I do not have this
implemented yet.
Bits mpir new_fft
195840 1.149s 1.105s
261120 1.483s 1.415s
391296 0.261s 0.248s
521728 0.344s 0.315s
782592 0.577s 0.539s
1043456 0.706s 0.688s
1569024 1.229s 1.153s
2092032 1.543s 1.462s
3127296 0.283s 0.286s
4169728 0.357s 0.350s
6273024 0.621s 0.615s
8364032 0.831s 0.799s
12539904 1.441s 1.471s
16719872 0.230s 0.209s
25122816 0.379s 0.357s
33497088 0.524s 0.462s
50245632 0.833s 0.782s
66994176 1.596s 0.967s
100577280 1.906s 1.704s
134103040 2.784s 2.162s
201129984 3.971s 3.524s
268173312 5.146s 4.572s
....
536608768 9.841s 9.428s
....
1073217536 21.17s 19.33s
....
2146959360 43.25s 39.17s
....
4293787648 96.00s 80.15s
....
8588754944 208.4s 183.3s
....
17177509888 485.0s 382.2s
If anyone is interested in helping with the implementation, one thing
we need is a fast naive acyclic convolution for 1 and 2 limb
coefficients.
So if [a0, a1, ...., a{n-1}], [b0, b1, ..., b{n-1}] is a pair of
vectors with entries that are one limb (or resp. two limbs) then the
convolution is simply the vector
[c0, c1, ...., c{n-1}] where
ci = sum_{k + j = i mod n} (-1)^{(k+j)/n} ak.bj mod B^s
where s = 1 (resp. 2) and B = 2^GMP_LIMB_BITS
If anyone would like to implement this (we need a 1 limb and a 2 limb
version) that would go a long way towards getting the Nussbaumer
convolution implemented.
Bill.
bits mpir new_fft
195840 1.149s 1.105s
261120 1.483s 1.415s
391296 0.261s 0.248s
521728 0.344s 0.315s
782592 0.577s 0.539s
1043456 0.706s 0.688s
1569024 1.229s 1.153s
2092032 1.543s 1.462s
3127296 0.283s 0.286s
4169728 0.357s 0.350s
6273024 0.621s 0.615s
8364032 0.831s 0.799s
12539904 1.441s 1.471s
16719872 0.230s 0.209s
25122816 0.379s 0.357s
33497088 0.524s 0.462s
50245632 0.833s 0.782s
66994176 1.596s 0.967s
100577280 1.906s 1.704s
134103040 2.784s 2.162s
201129984 3.971s 3.524s
268173312 5.146s 4.572s
402456576 7.548s 7.041s
536608768 9.841s 9.428s
804913152 15.48s 14.18s
1073217536 21.17s 19.33s
1610219520 31.64s 30.83s
2146959360 43.25s 39.17s
3220340736 70.14s 61.39s
4293787648 96.00s 80.15s
6441566208 150.2s 136.9s
8588754944 208.4s 183.3s
12883132416 327.4s 290.2s
17177509888 485.0s 382.2s
Times for larger multiplications will drop further when I combine
pointwise multiplications with the inverse matrix fourier transforms.
There's a vast number of other things that could be done to speed it
up.
Here are some features of the new fft:
* No tuning required for multiplications beyond about 3 million bits
* Tuning is easy and should be exceeding fast
* Works well for unbalanced multiplications
* Relies on very few low level functions and butterflies which are
candidates for assembly optimisation
* Can be used easily for multiplying polynomials over the integers
* heaps of test code (nearly half the code base)
Disadvantages:
* 5300 lines of code including test code, and not so many ways to reduce this
The slides are available here:
http://wiki.sagemath.org/SageFlintDays/slides?action=AttachFile&do=view&target=fastntt.pdf
On the final slide you will find some timings of his new code versus
mpir-2.4.0 and gmp-5.0.2. On that machine (an Intel Xeon) it appears
that GMP is way faster than MPIR (and his code is slightly faster
again, and much faster when parallelised).
Anyhow, I decided to do all my timings for gmp-5.0.2 to see how it
compares with mpir-2.4.0 and my new fft on my machine (a K10-2). I
don't have a copy of David's code, so I can't compare with that at
present.
bits iters mpir new_fft gmp
195840 1000 1.149s 1.105s 0.997s
261120 1000 1.483s 1.415s 1.396s
391296 100 0.261s 0.248s 0.282s
521728 100 0.344s 0.315s 0.411s
782592 100 0.577s 0.539s 0.628s
1043456 100 0.706s 0.688s 0.848s
1569024 100 1.229s 1.153s 1.317s
2092032 100 1.543s 1.462s 2.765s
3127296 10 0.283s 0.286s 0.408s
4169728 10 0.357s 0.350s 0.543s
6273024 10 0.621s 0.615s 0.843s
8364032 10 0.831s 0.799s 1.156s
12539904 10 1.441s 1.471s 1.798s
16719872 1 0.230s 0.209s 0.288s
25122816 1 0.379s 0.357s 0.434s
33497088 1 0.524s 0.462s 0.646s
50245632 1 0.833s 0.782s 1.035s
66994176 1 1.596s 0.967s 1.358s
100577280 1 1.906s 1.704s 2.177s
134103040 1 2.784s 2.162s 2.984s
201129984 1 3.971s 3.524s 4.536s
268173312 1 5.146s 4.572s 5.781s
402456576 1 7.548s 7.041s 9.867s
536608768 1 9.841s 9.428s 12.71s
804913152 1 15.48s 14.18s 20.06s
1073217536 1 21.17s 19.33s 27.19s
1610219520 1 31.64s 30.83s 43.37s
2146959360 1 43.25s 39.17s 57.66s
3220340736 1 70.14s 61.39s 92.94s
4293787648 1 96.00s 80.15s 146.1s
6441566208 1 150.2s 136.9s 217.5s
8588754944 1 208.4s 183.3s 312.8s
12883132416 1 327.4s 290.2s 447.7s
17177509888 1 485.0s 382.2s 614.2s
I've checked multiple ways that I am really compiling against the
right GMP. It also appears that GMP is configuring correctly on my
machine. So it looks very much like either MPIR misconfigured on
David's machine, or the GMP FFT is much, much better with Intel
machines or the MPIR FFT is way, way worse with Intel machines.
I'll try to get some configuration details from David and eventually
I'll do some timings on an Intel machine to see how everything
compares.
Last night I tried combining pointwise multiplications with the
inverse transform to improve memory locality. However, this failed
because calling mpir's Nussbaumer convolution rapidly in succession
results in an "Unable to allocate memory" error. I have no idea what
causes this (it is certainly not out of memory). The code was
exhibiting some very odd behaviour which I was unable to account for.
My best guess is that it is to do with too many allocations on the
stack or something. I am doubtlessly calling the function in a way it
was not intended to be called. Anyhow, this just underscores the need
to write a fast Nussbaumer convolution, which I currently do not have
time to do (I don't need it for my application).
Bill.
Bill.
Just a few minor test functions to go, some tuning code and a nice
integer mul wrapper and it will be fully usable. I'm thinking of just
doing a Nussbaumer convolution now rather than leaving it until later.
Bill.
The fourth column in the following table shows the times for the new
code. It is only used from a certain point on hence no changes to the
first half of the table:
bits mpir new_fft gmp nussbaumer
195840 1.149s 1.105s 0.997s
261120 1.483s 1.415s 1.396s
391296 0.261s 0.248s 0.282s
521728 0.344s 0.315s 0.411s
782592 0.577s 0.539s 0.628s
1043456 0.706s 0.688s 0.848s
1569024 1.229s 1.153s 1.317s
2092032 1.543s 1.462s 2.765s
3127296 0.283s 0.286s 0.408s
4169728 0.357s 0.350s 0.543s
6273024 0.621s 0.615s 0.843s
8364032 0.831s 0.799s 1.156s
12539904 1.441s 1.471s 1.798s
16719872 0.230s 0.209s 0.288s
25122816 0.379s 0.357s 0.434s
33497088 0.524s 0.462s 0.646s
50245632 0.833s 0.782s 1.035s
66994176 1.596s 0.967s 1.358s
100577280 1.906s 1.704s 2.177s 1.684s
134103040 2.784s 2.162s 2.984s 2.154s
201129984 3.971s 3.460s 4.536s 3.349s
268173312 5.146s 4.419s 5.781s 4.350s
402456576 7.548s 7.150s 9.867s 6.717s
536608768 9.841s 9.264s 12.71s 8.804s
804913152 15.48s 14.71s 20.06s 13.92s
1073217536 21.17s 18.85s 27.19s 17.96s
1610219520 31.64s 30.64s 43.37s 29.64s
2146959360 43.25s 39.17s 57.66s 39.07s
3220340736 70.14s 61.39s 92.94s 61.17s
4293787648 96.00s 80.15s 146.1s 79.13s
6441566208 150.2s 136.9s 217.5s 132.9s
8588754944 208.4s 183.3s 312.8s 177.2s
12883132416 327.4s 290.2s 447.7s 281.8s
17177509888 485.0s 382.2s 614.2s 372.6s
The final step is to write a few missing test functions and to write
tuning code. (Note the Nussbaumer code is in the mpir-fft repo not the
flint repo at this stage. I will fix this tomorrow.)
Happy New Year!
Bill.
At the moment some hard coded tuning values are present in
mulmod_2expp1.c. There are no tuning values for the main
multiplication routine because there is no wrapper function
mpn_mul_fft or whatever at this point (it's trivial to write one, it
just gets the appropriate tuning values and calls
fft_mul_mfa_sqrt2_truncate).
A port to MPIR would involve the following:
* replace FLINT_BITS with GMP_LIMB_BITS throughout, similarly with
FLINT_MIN -> MIN
* replace malloc with TMP_MARK, TMP_LIMBS_BALLOC throughout
* replace free with TMP_FREE throughout
* #include gmp-impl.h throughout
* copy n_revbin from the ulong_extras directory (I hereby license it
BSD as per the remainder of the FFT implementation)
* add alternative code paths for mpn_sumdiff_n and mpn_addsub_n where
these are used but not available on certain architectures
Eventually when I add tuning code it will need rewriting for MPIR.
It's trivial to tune this FFT.
Obviously as two thirds of the time is spent in butterflies, any
assembly optimisation of these will speed things up. I don't plan to
do this myself.
Bill.
This results in about a 5% speedup over much of the range. Here are
the times now.
bits iters mpir this gmp n w
195840 1000 1.149s 1.105s 0.997s 7 16
261120 1000 1.483s 1.415s 1.396s 7 16
391296 100 0.261s 0.248s 0.282s 8 8
521728 100 0.344s 0.315s 0.411s 8 8
782592 100 0.577s 0.539s 0.628s 9 4
1043456 100 0.706s 0.688s 0.848s 9 4
1569024 100 1.229s 1.153s 1.317s 9 8
2092032 100 1.543s 1.440s 2.765s 9 8
3127296 10 0.283s 0.266s 0.408s 11 1
4169728 10 0.357s 0.335s 0.543s 11 1
6273024 10 0.621s 0.597s 0.843s 11 2
8364032 10 0.831s 0.742s 1.156s 11 2
12539904 10 1.441s 1.394s 1.798s 12 1
16719872 1 0.230s 0.205s 0.288s 12 1
25122816 1 0.379s 0.336s 0.434s 12 2
33497088 1 0.524s 0.428s 0.646s 12 2
50245632 1 0.833s 0.693s 1.035s 13 1
66994176 1 1.596s 0.896s 1.358s 13 1
100577280 1 1.906s 1.552s 2.177s 13 2
134103040 1 2.784s 2.076s 2.984s 13 2
201129984 1 3.971s 3.158s 4.536s 14 1
268173312 1 5.146s 4.137s 5.781s 14 1
402456576 1 7.548s 6.443s 9.867s 14 2
536608768 1 9.841s 8.365s 12.71s 14 2
804913152 1 15.48s 13.29s 20.06s 15 1
1073217536 1 21.17s 17.16s 27.19s 15 1
1610219520 1 31.64s 28.60s 43.37s 15 2
2146959360 1 43.25s 37.02s 57.66s 15 2
3220340736 1 70.14s 58.09s 92.94s 16 1
4293787648 1 96.00s 74.26s 146.1s 16 1
6441566208 1 150.2s 131.1s 217.5s 16 2
8588754944 1 208.4s 175.0s 312.8s 16 2
12883132416 1 327.4s 278.6s 447.7s 17 1
17177509888 1 485.0s 360.ss 614.2s 17 1
Bill.
I can give it go , starting this week and also the
fast naive acyclic convolution for 1 and 2 limb
coefficients.
Jason
you made quick work of that!
All three of those tests would fail if there was a problem in
fft_split_bits or fft_combine_bits. Those are relatively
straightforward functions. They break a long integer up into
coefficients. They are documented pretty well in the doc directory and
should be relatively easy to figure out.
Nothing obvious strikes me, but it is most likely our old friend
sizeof(long) != sizeof(ptr) or something similar. If you can get the
test split/combine bits to pass, chances are the others will too.
Bill.
> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mpir-devel/-/-FPbPpm5_EMJ.
>
> To post to this group, send email to mpir-...@googlegroups.com.
> To unsubscribe from this group, send email to
> mpir-devel+...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mpir-devel?hl=en.
Bill.
> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mpir-devel/-/1DOqZHcRabcJ.
BH> I do make the assumption that sizeof(mp_limb_t) = sizeof(ptr) in the
BH> memory allocation. But I think this is true on 64 bit Windows, is it
BH> not?
doze 64 has 32bit long; if mp_limb_t is based on long it will not be the
same size as pointer.
-JimC
--
James Cloos <cl...@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
It should be longlong in mpir, and hence 64 bits.
Bill.
Bill.
> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mpir-devel/-/8F-vCR84hjAJ.
The new code doesn't call fft_radix2_twiddle.
I hope you are using the implementation here:
https://github.com/wbhart/flint2/tree/trunk/fft
The other implementation is extremely messy in comparison.
Bill.
On 3 January 2012 01:15, Cactus <riem...@gmail.com> wrote:
you are absolutely right. I had completely forgotten about the
function. I renamed a whole pile of others that were called twiddle,
but I forgot that these still existed. It's ok though, that is what I
intended.
Bill.
> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mpir-devel/-/mq8l6hag37UJ.
This shouldn't be a problem if you use BALLOC which I think allocates
on the heap.
>
> On another issue, how do I actually integrate the new FFT with MPIR?
>
> I assume that I take out mul_fft but the call to the FFT code in MPIR is not
> the same as the calls available in the new FFT - there are extra parameters.
> Does this mean we need a new routine to do the integration?
>
Yes, there has to be a tuned wrapper to set n and w appropriately. The
conditions on n and w are in the documentation.
Bill.
I am not sure how n ever gets to be zero. It should start as a power
of 2 and always remain thus. Probably the bug is elsewhere.
Bill.
On 3 January 2012 13:42, Cactus <riem...@gmail.com> wrote:
> I think I have found the problem.
>
> Both ifft_radix2_twiddle and ifft_radix2_twiddle don't guard against entry
> Brian
>
> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mpir-devel/-/TQnb2K3MiKgJ.
Bill.