FFT progress

57 views
Skip to first unread message

Bill Hart

unread,
Dec 24, 2011, 1:12:50 PM12/24/11
to flint-devel, mpir-devel
I have finally managed to implement the truncated sqrt2 transforms and
the matrix fourier sqrt2 transforms. This enables me to get actual
timings vs mpir. These times for multiplying two integers of the given
numbers of bits.

On my machine the mpir FFT starts beating toom algorithms at about
160000 bits, so I start the comparison just above that point. There
are timings missing because I do not yet have the truncated matrix
fourier sqrt2 transforms implemented. Also, the latter timings make
use of the mpir Nussbaumer convolution as I do not have this
implemented yet.

Bits mpir new_fft

195840 1.149s 1.105s
261120 1.483s 1.415s
391296 0.261s 0.248s
521728 0.344s 0.315s
782592 0.577s 0.539s
1043456 0.706s 0.688s
1569024 1.229s 1.153s
2092032 1.543s 1.462s
3127296 0.283s 0.286s
4169728 0.357s 0.350s
6273024 0.621s 0.615s
8364032 0.831s 0.799s
12539904 1.441s 1.471s
16719872 0.230s 0.209s
25122816 0.379s 0.357s
33497088 0.524s 0.462s
50245632 0.833s 0.782s
66994176 1.596s 0.967s
100577280 1.906s 1.704s
134103040 2.784s 2.162s
201129984 3.971s 3.524s
268173312 5.146s 4.572s
....
536608768 9.841s 9.428s
....
1073217536 21.17s 19.33s
....
2146959360 43.25s 39.17s
....
4293787648 96.00s 80.15s
....
8588754944 208.4s 183.3s
....
17177509888 485.0s 382.2s

If anyone is interested in helping with the implementation, one thing
we need is a fast naive acyclic convolution for 1 and 2 limb
coefficients.

So if [a0, a1, ...., a{n-1}], [b0, b1, ..., b{n-1}] is a pair of
vectors with entries that are one limb (or resp. two limbs) then the
convolution is simply the vector

[c0, c1, ...., c{n-1}] where

ci = sum_{k + j = i mod n} (-1)^{(k+j)/n} ak.bj mod B^s

where s = 1 (resp. 2) and B = 2^GMP_LIMB_BITS

If anyone would like to implement this (we need a 1 limb and a 2 limb
version) that would go a long way towards getting the Nussbaumer
convolution implemented.

Bill.

Bill Hart

unread,
Dec 25, 2011, 6:18:47 PM12/25/11
to flint-devel, mpir-devel
I finished implementing the truncated matrix fourier square root two
transforms and I can finally multiply integers of all sizes. (Note
that I'm still using the mpir Nussbaumer convolution for pointwise
multiplications when the products get huge. Times will drop further
when at some point in the future this is implemented independently.)

bits mpir new_fft


195840 1.149s 1.105s
261120 1.483s 1.415s
391296 0.261s 0.248s
521728 0.344s 0.315s
782592 0.577s 0.539s
1043456 0.706s 0.688s
1569024 1.229s 1.153s
2092032 1.543s 1.462s
3127296 0.283s 0.286s
4169728 0.357s 0.350s
6273024 0.621s 0.615s
8364032 0.831s 0.799s
12539904 1.441s 1.471s
16719872 0.230s 0.209s
25122816 0.379s 0.357s
33497088 0.524s 0.462s
50245632 0.833s 0.782s
66994176 1.596s 0.967s
100577280 1.906s 1.704s
134103040 2.784s 2.162s
201129984 3.971s 3.524s
268173312 5.146s 4.572s

402456576 7.548s 7.041s
536608768 9.841s 9.428s
804913152 15.48s 14.18s
1073217536 21.17s 19.33s
1610219520 31.64s 30.83s
2146959360 43.25s 39.17s
3220340736 70.14s 61.39s
4293787648 96.00s 80.15s
6441566208 150.2s 136.9s
8588754944 208.4s 183.3s
12883132416 327.4s 290.2s
17177509888 485.0s 382.2s

Times for larger multiplications will drop further when I combine
pointwise multiplications with the inverse matrix fourier transforms.
There's a vast number of other things that could be done to speed it
up.

Here are some features of the new fft:

* No tuning required for multiplications beyond about 3 million bits
* Tuning is easy and should be exceeding fast
* Works well for unbalanced multiplications
* Relies on very few low level functions and butterflies which are
candidates for assembly optimisation
* Can be used easily for multiplying polynomials over the integers
* heaps of test code (nearly half the code base)

Disadvantages:

* 5300 lines of code including test code, and not so many ways to reduce this

Bill Hart

unread,
Dec 26, 2011, 9:44:49 AM12/26/11
to flint-devel, mpir-devel
A few days ago, David Harvey gave a really nice talk at Warwick about
some new number theoretic transform code he has been working on. It
uses some nice tricks to get *really* fast butterflies. Even a C
implementation turned out to be quite fast, although he is also
working on some assembly optimised versions. Two really nice features
are very low memory usage and ability to easily parallelise it (which
he has done).

The slides are available here:

http://wiki.sagemath.org/SageFlintDays/slides?action=AttachFile&do=view&target=fastntt.pdf

On the final slide you will find some timings of his new code versus
mpir-2.4.0 and gmp-5.0.2. On that machine (an Intel Xeon) it appears
that GMP is way faster than MPIR (and his code is slightly faster
again, and much faster when parallelised).

Anyhow, I decided to do all my timings for gmp-5.0.2 to see how it
compares with mpir-2.4.0 and my new fft on my machine (a K10-2). I
don't have a copy of David's code, so I can't compare with that at
present.

bits iters mpir new_fft gmp
195840 1000 1.149s 1.105s 0.997s
261120 1000 1.483s 1.415s 1.396s
391296 100 0.261s 0.248s 0.282s
521728 100 0.344s 0.315s 0.411s
782592 100 0.577s 0.539s 0.628s
1043456 100 0.706s 0.688s 0.848s
1569024 100 1.229s 1.153s 1.317s
2092032 100 1.543s 1.462s 2.765s
3127296 10 0.283s 0.286s 0.408s
4169728 10 0.357s 0.350s 0.543s
6273024 10 0.621s 0.615s 0.843s
8364032 10 0.831s 0.799s 1.156s
12539904 10 1.441s 1.471s 1.798s
16719872 1 0.230s 0.209s 0.288s
25122816 1 0.379s 0.357s 0.434s
33497088 1 0.524s 0.462s 0.646s
50245632 1 0.833s 0.782s 1.035s
66994176 1 1.596s 0.967s 1.358s
100577280 1 1.906s 1.704s 2.177s
134103040 1 2.784s 2.162s 2.984s
201129984 1 3.971s 3.524s 4.536s
268173312 1 5.146s 4.572s 5.781s
402456576 1 7.548s 7.041s 9.867s
536608768 1 9.841s 9.428s 12.71s
804913152 1 15.48s 14.18s 20.06s
1073217536 1 21.17s 19.33s 27.19s
1610219520 1 31.64s 30.83s 43.37s
2146959360 1 43.25s 39.17s 57.66s
3220340736 1 70.14s 61.39s 92.94s
4293787648 1 96.00s 80.15s 146.1s
6441566208 1 150.2s 136.9s 217.5s
8588754944 1 208.4s 183.3s 312.8s
12883132416 1 327.4s 290.2s 447.7s
17177509888 1 485.0s 382.2s 614.2s

I've checked multiple ways that I am really compiling against the
right GMP. It also appears that GMP is configuring correctly on my
machine. So it looks very much like either MPIR misconfigured on
David's machine, or the GMP FFT is much, much better with Intel
machines or the MPIR FFT is way, way worse with Intel machines.

I'll try to get some configuration details from David and eventually
I'll do some timings on an Intel machine to see how everything
compares.

Last night I tried combining pointwise multiplications with the
inverse transform to improve memory locality. However, this failed
because calling mpir's Nussbaumer convolution rapidly in succession
results in an "Unable to allocate memory" error. I have no idea what
causes this (it is certainly not out of memory). The code was
exhibiting some very odd behaviour which I was unable to account for.
My best guess is that it is to do with too many allocations on the
stack or something. I am doubtlessly calling the function in a way it
was not intended to be called. Anyhow, this just underscores the need
to write a fast Nussbaumer convolution, which I currently do not have
time to do (I don't need it for my application).

Bill.

Bill Hart

unread,
Dec 27, 2011, 10:47:58 AM12/27/11
to flint-devel, mpir-devel
David clarified the timing in that last slide of his. Apparently it
was just a misconfiguration external to MPIR. He reports that the
correct time for MPIR is 224s, which is in line with what we might
expect from the figures I gave for AMD chips.

Bill.

Bill Hart

unread,
Dec 29, 2011, 1:28:28 PM12/29/11
to flint-devel, mpir-devel
I have cleaned up my FFT massively and it now takes just over 2000
lines of code including test code. You can view the code here:

http://selmer.warwick.ac.uk/gitweb/flint2.git?a=tree;f=fft;h=1fcb640bdbf46137b01611a7ff7e1cc90fd2e4c0;hb=HEAD

Just a few minor test functions to go, some tuning code and a nice
integer mul wrapper and it will be fully usable. I'm thinking of just
doing a Nussbaumer convolution now rather than leaving it until later.

Bill.

Bill Hart

unread,
Dec 31, 2011, 9:28:22 PM12/31/11
to flint-devel, mpir-devel
I have finished implementing and debugging a new Nussbaumer
convolution. It isn't the absolutely fastest possible thing one can
do, but it is still slightly better than what is in mpir.

The fourth column in the following table shows the times for the new
code. It is only used from a certain point on hence no changes to the
first half of the table:

bits mpir new_fft gmp nussbaumer
195840 1.149s 1.105s 0.997s
261120 1.483s 1.415s 1.396s
391296 0.261s 0.248s 0.282s
521728 0.344s 0.315s 0.411s
782592 0.577s 0.539s 0.628s
1043456 0.706s 0.688s 0.848s
1569024 1.229s 1.153s 1.317s
2092032 1.543s 1.462s 2.765s
3127296 0.283s 0.286s 0.408s
4169728 0.357s 0.350s 0.543s
6273024 0.621s 0.615s 0.843s
8364032 0.831s 0.799s 1.156s
12539904 1.441s 1.471s 1.798s
16719872 0.230s 0.209s 0.288s
25122816 0.379s 0.357s 0.434s
33497088 0.524s 0.462s 0.646s
50245632 0.833s 0.782s 1.035s
66994176 1.596s 0.967s 1.358s
100577280 1.906s 1.704s 2.177s 1.684s
134103040 2.784s 2.162s 2.984s 2.154s
201129984 3.971s 3.460s 4.536s 3.349s
268173312 5.146s 4.419s 5.781s 4.350s
402456576 7.548s 7.150s 9.867s 6.717s
536608768 9.841s 9.264s 12.71s 8.804s
804913152 15.48s 14.71s 20.06s 13.92s
1073217536 21.17s 18.85s 27.19s 17.96s
1610219520 31.64s 30.64s 43.37s 29.64s
2146959360 43.25s 39.17s 57.66s 39.07s
3220340736 70.14s 61.39s 92.94s 61.17s
4293787648 96.00s 80.15s 146.1s 79.13s
6441566208 150.2s 136.9s 217.5s 132.9s
8588754944 208.4s 183.3s 312.8s 177.2s
12883132416 327.4s 290.2s 447.7s 281.8s
17177509888 485.0s 382.2s 614.2s 372.6s

The final step is to write a few missing test functions and to write
tuning code. (Note the Nussbaumer code is in the mpir-fft repo not the
flint repo at this stage. I will fix this tomorrow.)

Happy New Year!

Bill.

Bill Hart

unread,
Jan 2, 2012, 2:30:03 AM1/2/12
to flint-devel, mpir-devel
I've now added the negacyclic transforms to the flint repo and cleaned
it all up ready for some tuning code to be written.

At the moment some hard coded tuning values are present in
mulmod_2expp1.c. There are no tuning values for the main
multiplication routine because there is no wrapper function
mpn_mul_fft or whatever at this point (it's trivial to write one, it
just gets the appropriate tuning values and calls
fft_mul_mfa_sqrt2_truncate).

A port to MPIR would involve the following:

* replace FLINT_BITS with GMP_LIMB_BITS throughout, similarly with
FLINT_MIN -> MIN
* replace malloc with TMP_MARK, TMP_LIMBS_BALLOC throughout
* replace free with TMP_FREE throughout
* #include gmp-impl.h throughout
* copy n_revbin from the ulong_extras directory (I hereby license it
BSD as per the remainder of the FFT implementation)
* add alternative code paths for mpn_sumdiff_n and mpn_addsub_n where
these are used but not available on certain architectures

Eventually when I add tuning code it will need rewriting for MPIR.
It's trivial to tune this FFT.

Obviously as two thirds of the time is spent in butterflies, any
assembly optimisation of these will speed things up. I don't plan to
do this myself.

Bill.

Bill Hart

unread,
Jan 2, 2012, 3:57:49 AM1/2/12
to flint-devel, mpir-devel
The code can be more easily browsed here:

https://github.com/wbhart/flint2/tree/trunk/fft

Bill.

Bill Hart

unread,
Jan 2, 2012, 8:54:49 AM1/2/12
to flint-devel, mpir-devel
I found another speedup: combine the inner transforms and pointwise
mults and combine the ifft and normalisation in a cache friendly way.

This results in about a 5% speedup over much of the range. Here are
the times now.

bits iters mpir this gmp n w

195840 1000 1.149s 1.105s 0.997s 7 16
261120 1000 1.483s 1.415s 1.396s 7 16
391296 100 0.261s 0.248s 0.282s 8 8
521728 100 0.344s 0.315s 0.411s 8 8
782592 100 0.577s 0.539s 0.628s 9 4
1043456 100 0.706s 0.688s 0.848s 9 4
1569024 100 1.229s 1.153s 1.317s 9 8

2092032 100 1.543s 1.440s 2.765s 9 8
3127296 10 0.283s 0.266s 0.408s 11 1
4169728 10 0.357s 0.335s 0.543s 11 1
6273024 10 0.621s 0.597s 0.843s 11 2
8364032 10 0.831s 0.742s 1.156s 11 2
12539904 10 1.441s 1.394s 1.798s 12 1
16719872 1 0.230s 0.205s 0.288s 12 1
25122816 1 0.379s 0.336s 0.434s 12 2
33497088 1 0.524s 0.428s 0.646s 12 2
50245632 1 0.833s 0.693s 1.035s 13 1
66994176 1 1.596s 0.896s 1.358s 13 1
100577280 1 1.906s 1.552s 2.177s 13 2
134103040 1 2.784s 2.076s 2.984s 13 2
201129984 1 3.971s 3.158s 4.536s 14 1
268173312 1 5.146s 4.137s 5.781s 14 1
402456576 1 7.548s 6.443s 9.867s 14 2
536608768 1 9.841s 8.365s 12.71s 14 2
804913152 1 15.48s 13.29s 20.06s 15 1
1073217536 1 21.17s 17.16s 27.19s 15 1
1610219520 1 31.64s 28.60s 43.37s 15 2
2146959360 1 43.25s 37.02s 57.66s 15 2
3220340736 1 70.14s 58.09s 92.94s 16 1
4293787648 1 96.00s 74.26s 146.1s 16 1
6441566208 1 150.2s 131.1s 217.5s 16 2
8588754944 1 208.4s 175.0s 312.8s 16 2
12883132416 1 327.4s 278.6s 447.7s 17 1
17177509888 1 485.0s 360.ss 614.2s 17 1

Bill.

Cactus

unread,
Jan 2, 2012, 9:32:44 AM1/2/12
to mpir-...@googlegroups.com, flint-devel
Fantastic work Bill - it offers a really substantial speed up and is hence something we should add to MPIR as soon as you feel it is ready.

From your list of changes needed for moving it into MPIR, it seems that almost all are search/replace operations - is it always obvious where TMP_MARK should be inserted?

I might be able to have a go at this if it is little more than a search/replace operation (does your list of changes cover the tests?).

    Brian

Jason

unread,
Jan 2, 2012, 11:41:16 AM1/2/12
to mpir-...@googlegroups.com, flint-devel
On Monday 02 January 2012 07:30:03 Bill Hart wrote:
> I've now added the negacyclic transforms to the flint repo and cleaned
> it all up ready for some tuning code to be written.
>
> At the moment some hard coded tuning values are present in
> mulmod_2expp1.c. There are no tuning values for the main
> multiplication routine because there is no wrapper function
> mpn_mul_fft or whatever at this point (it's trivial to write one, it
> just gets the appropriate tuning values and calls
> fft_mul_mfa_sqrt2_truncate).
>
> A port to MPIR would involve the following:
>
> * replace FLINT_BITS with GMP_LIMB_BITS throughout, similarly with
> FLINT_MIN -> MIN
> * replace malloc with TMP_MARK, TMP_LIMBS_BALLOC throughout
> * replace free with TMP_FREE throughout
> * #include gmp-impl.h throughout
> * copy n_revbin from the ulong_extras directory (I hereby license it
> BSD as per the remainder of the FFT implementation)
> * add alternative code paths for mpn_sumdiff_n and mpn_addsub_n where
> these are used but not available on certain architectures
>
> Eventually when I add tuning code it will need rewriting for MPIR.
> It's trivial to tune this FFT.
>
> Obviously as two thirds of the time is spent in butterflies, any
> assembly optimisation of these will speed things up. I don't plan to
> do this myself.
>

I can give it go , starting this week and also the

fast naive acyclic convolution for 1 and 2 limb
coefficients.

Jason

Cactus

unread,
Jan 2, 2012, 3:58:43 PM1/2/12
to mpir-...@googlegroups.com, flint-devel
I have got the FFT building on Windows but three of the tests fail in x64:

mul_mfa_truncate_sqrt2....error in limb 0, ea987380 != aa987380
mul_truncate_sqrt2....error in limb 0, ea987380 != aa987380
split/combine_bits....FAIL: Error in limb 2, 1002153456 != 1520

I'm not sure how to debug these as the code is completely new to me.

    Brian

Bill Hart

unread,
Jan 2, 2012, 4:12:06 PM1/2/12
to mpir-...@googlegroups.com, flint-devel
Hi Brian,

you made quick work of that!

All three of those tests would fail if there was a problem in
fft_split_bits or fft_combine_bits. Those are relatively
straightforward functions. They break a long integer up into
coefficients. They are documented pretty well in the doc directory and
should be relatively easy to figure out.

Nothing obvious strikes me, but it is most likely our old friend
sizeof(long) != sizeof(ptr) or something similar. If you can get the
test split/combine bits to pass, chances are the others will too.

Bill.

> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mpir-devel/-/-FPbPpm5_EMJ.
>
> To post to this group, send email to mpir-...@googlegroups.com.
> To unsubscribe from this group, send email to
> mpir-devel+...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mpir-devel?hl=en.

Cactus

unread,
Jan 2, 2012, 4:21:07 PM1/2/12
to mpir-...@googlegroups.com, flint-devel
The split/combine works in win32 so this is, as you suggest, the old 32/64 bit integer problem.  I couldn't test the other tow failures in win32 as they require gmpn_addsub_n which ends up as an undefined symbol in the 32-bit  builds that I can use.

I have not yet done the malloc conversions so it could be some odd interaction in memory management.

    Brian
 

Bill Hart

unread,
Jan 2, 2012, 4:23:29 PM1/2/12
to mpir-...@googlegroups.com
I do make the assumption that sizeof(mp_limb_t) = sizeof(ptr) in the
memory allocation. But I think this is true on 64 bit Windows, is it
not?

Bill.

> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/mpir-devel/-/1DOqZHcRabcJ.

Cactus

unread,
Jan 2, 2012, 4:51:47 PM1/2/12
to mpir-...@googlegroups.com, flint-devel
I think the FLINT64 define is a potential issue.  

Should the FFT work with MPIR's longlong.h

What should I do about flint_clz_tab?

I have removed all references to the include file 'ulong_extras.h' - maybe this is needed somewhere (although everything builds without it)..
 
    Brian 

Bill Hart

unread,
Jan 2, 2012, 5:06:19 PM1/2/12
to mpir-...@googlegroups.com
Hi Brian,


On Monday, 2 January 2012, Cactus <riem...@gmail.com> wrote:
> I think the FLINT64 define is a potential issue.

It's the size of an mp_limb_t in Flint, not an unsigned long, which I barely use (if I do it's probably a bug).


 
> Should the FFT work with MPIR's longlong.h

Yes.

> What should I do about flint_clz_tab?

It is only needed by count_leading_zeros which is provided by mpir's longlong.h.


> I have removed all references to the include file 'ulong_extras.h' - maybe this is needed somewhere (although everything builds without it)..

It is only needed for n_revbin and the random functions which you have presimsble replaced.

Bill.


>  
>     Brian 
>
> --
> You received this message because you are subscribed to the Google Groups "mpir-devel" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/mpir-devel/-/ME4hy1sfZPoJ.

James Cloos

unread,
Jan 2, 2012, 6:06:30 PM1/2/12
to mpir-...@googlegroups.com
>>>>> "BH" == Bill Hart <goodwi...@googlemail.com> writes:

BH> I do make the assumption that sizeof(mp_limb_t) = sizeof(ptr) in the
BH> memory allocation. But I think this is true on 64 bit Windows, is it
BH> not?

doze 64 has 32bit long; if mp_limb_t is based on long it will not be the
same size as pointer.

-JimC
--
James Cloos <cl...@jhcloos.com> OpenPGP: 1024D/ED7DAEA6

Bill Hart

unread,
Jan 2, 2012, 6:27:03 PM1/2/12
to mpir-...@googlegroups.com
On 2 January 2012 23:06, James Cloos <cl...@jhcloos.com> wrote:
>>>>>> "BH" == Bill Hart <goodwi...@googlemail.com> writes:
>
> BH> I do make the assumption that sizeof(mp_limb_t) = sizeof(ptr) in the
> BH> memory allocation. But I think this is true on 64 bit Windows, is it
> BH> not?
>
> doze 64 has 32bit long; if mp_limb_t is based on long it will not be the
> same size as pointer.
>

It should be longlong in mpir, and hence 64 bits.

Bill.

Cactus

unread,
Jan 2, 2012, 8:15:20 PM1/2/12
to mpir-...@googlegroups.com, flint-devel
Only one test now fails:

   mul_mfa_truncate_sqrt2

It runs out of memory after a huge number of recursive calls to fft_radix2_twiddle

   Brian

Bill Hart

unread,
Jan 2, 2012, 8:30:27 PM1/2/12
to mpir-...@googlegroups.com, flint-devel
That's interesting. How much memory does your machine have?

Bill.

> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/mpir-devel/-/8F-vCR84hjAJ.

Bill Hart

unread,
Jan 2, 2012, 8:33:14 PM1/2/12
to mpir-...@googlegroups.com
Oh dear, which implementation are you using?

The new code doesn't call fft_radix2_twiddle.

I hope you are using the implementation here:

https://github.com/wbhart/flint2/tree/trunk/fft

The other implementation is extremely messy in comparison.

Bill.

On 3 January 2012 01:15, Cactus <riem...@gmail.com> wrote:

Cactus

unread,
Jan 3, 2012, 3:16:29 AM1/3/12
to mpir-...@googlegroups.com
I am using the version you refer to above.  

As far as I can see, the test t-mul_mfa_truncate_sqrt2 does end up calling fft_radix2_twiddle via this call sequence:

   mul_mfa_truncate_sqrt2(r1, i1, int_limbs, i2, int_limbs, depth, w);
   fft_mfa_truncate_sqrt2_outer(ii, n, w, &t1, &t2, &s1, sqrt, trunc);
   fft_radix2_twiddle(ii + i, n1, n2/2, w*n1, t1, t2, w, 0, i, 1);

Is this test now redundant?

     Brian

Bill Hart

unread,
Jan 3, 2012, 4:49:20 AM1/3/12
to mpir-...@googlegroups.com
Hi Brian,

you are absolutely right. I had completely forgotten about the
function. I renamed a whole pile of others that were called twiddle,
but I forgot that these still existed. It's ok though, that is what I
intended.

Bill.

> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/mpir-devel/-/mq8l6hag37UJ.

Cactus

unread,
Jan 3, 2012, 5:00:02 AM1/3/12
to mpir-...@googlegroups.com
Thanks Bill,

So the test failure matters - I was hoping that it didn't :-)

My laptop, which I am using right now, has 4GB of RAM - I am not clear whether it is the stack or the heap that overflows but I suspect the former.

On another issue, how do I actually integrate the new FFT with MPIR?   

I assume that I take out mul_fft but the call to the FFT code in MPIR is not the same as the calls available in the new FFT - there are extra parameters.  Does this mean we need a new routine to do the integration?

   Brian

Bill Hart

unread,
Jan 3, 2012, 5:35:03 AM1/3/12
to mpir-...@googlegroups.com
On 3 January 2012 10:00, Cactus <riem...@gmail.com> wrote:
> Thanks Bill,
>
> So the test failure matters - I was hoping that it didn't :-)
>
> My laptop, which I am using right now, has 4GB of RAM - I am not clear
> whether it is the stack or the heap that overflows but I suspect the former.

This shouldn't be a problem if you use BALLOC which I think allocates
on the heap.

>
> On another issue, how do I actually integrate the new FFT with MPIR?
>
> I assume that I take out mul_fft but the call to the FFT code in MPIR is not
> the same as the calls available in the new FFT - there are extra parameters.
>  Does this mean we need a new routine to do the integration?
>

Yes, there has to be a tuned wrapper to set n and w appropriately. The
conditions on n and w are in the documentation.

Bill.

Message has been deleted

Cactus

unread,
Jan 3, 2012, 8:46:23 AM1/3/12
to mpir-...@googlegroups.com
I think I have found the problem.

Both fft_radix2_twiddle and ifft_radix2_twiddle don't guard against entry with n = 0.  I changed the first part of both routines according to the template:

   if (n < 2) 
   {
      mp_size_t tw1, tw2;
      tw1 = r*c;
      tw2 = tw1 + rs*c;

      if(n)
      {
          ifft_butterfly_twiddle(*t1, *t2, ii[0], ii[is], limbs, tw1*ws, tw2*ws);
      
          SWAP_PTRS(ii[0],  *t1);
          SWAP_PTRS(ii[is], *t2);
      }

      return;
   }

and all tests now pass.  

I noticed that the routine fft_radix2_twiddle is tail recursive so we can reduce its memory footprint and maybe gain some speed by eliminating the tail call.

    Brian

Bill Hart

unread,
Jan 3, 2012, 10:39:40 AM1/3/12
to mpir-...@googlegroups.com
Hi Brian,

I am not sure how n ever gets to be zero. It should start as a power
of 2 and always remain thus. Probably the bug is elsewhere.

Bill.

On 3 January 2012 13:42, Cactus <riem...@gmail.com> wrote:
> I think I have found the problem.
>

> Both  ifft_radix2_twiddle and ifft_radix2_twiddle don't guard against entry

>     Brian
>
> --
> You received this message because you are subscribed to the Google Groups
> "mpir-devel" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/mpir-devel/-/TQnb2K3MiKgJ.

Bill Hart

unread,
Jan 3, 2012, 10:59:41 AM1/3/12
to mpir-...@googlegroups.com
One possibility is that in t-mul_mfa_truncate_sqrt2 the top limbs of
the integers being multiplied are zero, so that ultimately trunc <=
2*n which would cause the problem.

Bill.

Cactus

unread,
Jan 3, 2012, 11:06:53 AM1/3/12
to mpir-...@googlegroups.com
This is one sequence IN fft_mfa_truncate_sqrt2.c that ends with entry to the failing code with n = 0.

void fft_truncate1_twiddle(mp_limb_t ** ii, mp_size_t is,
      mp_size_t n, mp_bitcnt_t w, mp_limb_t ** t1, mp_limb_t ** t2,
      mp_size_t ws, mp_size_t r, mp_size_t c, mp_size_t rs, mp_size_t trunc)
{
   mp_size_t i;
   mp_size_t limbs = (w*n)/GMP_LIMB_BITS;
   
   if (trunc == 2*n)
      fft_radix2_twiddle(ii, is, n, w, t1, t2, ws, r, c, rs);
   else if (trunc <= n)
   {
      for (i = 0; i < n; i++)
         mpn_add_n(ii[i*is], ii[i*is], ii[(i+n)*is], limbs + 1);
      
      fft_truncate1_twiddle(ii, is, n/2, 2*w, t1, t2, ws, r, c, 2*rs, trunc);
   } else

If this is entered with n = 1 and trunk = 0 or 1, the last call above will have n = 0.   This sequence is invoked in my code.

    Brian

Cactus

unread,
Jan 3, 2012, 11:16:03 AM1/3/12
to mpir-...@googlegroups.com
The twiddle code is entered with n = 0 from the above sequence both when n = 1 and also when trunk = 0 and n = 0.

    Brian
 
 

Bill Hart

unread,
Jan 3, 2012, 11:48:13 AM1/3/12
to mpir-...@googlegroups.com
Hi Brian,

yes n=1, trunc=0 is an invalid set of params.

We must have  trunc > n. So we need to track down where it is called from to see how it's happening.

Bill.


On Tuesday, 3 January 2012, Cactus <riem...@gmail.com> wrote:
> The twiddle code is entered with n = 0 from the above sequence both when n = 1 and also when trunk = 0 and n = 0.
>     Brian
>  
>  
>
> --
> You received this message because you are subscribed to the Google Groups "mpir-devel" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/mpir-devel/-/U7M2dMhUOoAJ.

Bill Hart

unread,
Jan 3, 2012, 11:51:39 AM1/3/12
to mpir-...@googlegroups.com
Oh sorry, I'm confused. Obviously trunc <= n is valid.

What we need though is the entire call trace that leads to that set of params so we can see where the bug is. I am sure the bug is in my code, I'm just not sure where.

Bill.
Message has been deleted

Cactus

unread,
Jan 3, 2012, 12:01:58 PM1/3/12
to mpir-...@googlegroups.com
exe!fft_truncate1_twiddle(ii=649f90, is=8, n=0, w=000100, t1=2ef878, t2=2ef898, ws=1, r=0, c=0, rs=000020, trunc=0)  Line 113
exe!fft_truncate1_twiddle(ii=649f90, is=8, n=1, w=000080, t1=2ef878, t2=2ef898, ws=1, r=0, c=0, rs=000010, trunc=0)  Line 123
exe!fft_truncate1_twiddle(ii=649f90, is=8, n=2, w=000040, t1=2ef878, t2=2ef898, ws=1, r=0, c=0, rs=8, trunc=0)  Line 123
exe!fft_truncate1_twiddle(ii=649f90, is=8, n=4, w=000020, t1=2ef878, t2=2ef898, ws=1, r=0, c=0, rs=4, trunc=0)  Line 123
exe!fft_truncate1_twiddle(ii=649f90, is=8, n=8, w=000010, t1=2ef878, t2=2ef898, ws=1, r=0, c=0, rs=2, trunc=0)  Line 123
exe!fft_truncate1_twiddle(ii=649f90, is=8, n=000010, w=8, t1=2ef878, t2=2ef898, ws=1, r=0, c=0, rs=1, trunc=0)  Line 123
exe!fft_mfa_truncate_sqrt2_outer(ii=649f90, n=000080, w=1, t1=2ef878, t2=2ef898, temp=2ef8b8, n1=8, trunc=000100)  Line 329
mul_mfa_truncate_sqrt2.exe!mul_mfa_truncate_sqrt2(* r1=648810, * i1=648090, n1=000078, * i2=648450, n2=000078, depth=7, w=1)  Line 79 C
exe!main()  Line 71

calls read from bottom upwards

   Brian

Reply all
Reply to author
Forward
0 new messages