NTL v FLINT


Victor Shoup

Feb 8, 2021, 3:52:49 PM
to flint-devel
For those who have been following, I was able to rerun my NTL v FLINT benchmark suite using FLINT v2.7.1. I had to rework some of my programs to use the new APIs for polynomial arithmetic modulo bignums. I also built FLINT to use OpenBLAS, to see how the new code for matrix arithmetic mod (small) primes works.

You can download the report and the test programs at the usual place (which is now also pointed to by https://libntl.org).

The upshot: not much changed since my comparison with FLINT v2.6.0, with the exception of matrix arithmetic modulo small primes. FLINT now does much better, and is sometimes as much as 2x faster than NTL. Here, I'm comparing on a Haswell machine. To be fair, my implementation is really only properly optimized for Intel machines, as it relies explicitly on AVX instructions, so FLINT will no doubt perform better on other machines.

One thing that surprised me is that FLINT's matrix multiplication modulo large primes did not get more of a speed boost from OpenBLAS.

Bill Hart

Feb 8, 2021, 4:27:24 PM
to flint-devel
Hi Victor,

Thanks again for these benchmarks, which are a wonderful resource for
the community. It's not too surprising that matrix mul mod large
primes is not much faster: so far we have mainly focused on matrix
mul mod small primes with BLAS.
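[For readers unfamiliar with the trick being discussed: the usual way to use BLAS for matrix multiplication mod a small prime p is to lift the entries to doubles, do one BLAS-backed floating-point multiply, and reduce the result mod p. This is exact as long as every dot product fits in a double, i.e. k*(p-1)^2 < 2^53 for inner dimension k. A minimal illustrative sketch using NumPy (whose `@` dispatches to a BLAS such as OpenBLAS); this is not FLINT's actual code:]

```python
import numpy as np

def matmul_mod_p(A, B, p):
    """Multiply integer matrices A and B modulo a small prime p by
    lifting to doubles and using one BLAS-backed matrix multiply.

    Exact provided k * (p - 1)**2 < 2**53 (k = inner dimension),
    so every dot product is representable in a double."""
    k = A.shape[1]
    assert k * (p - 1) ** 2 < 2 ** 53, "entries too large for one multiply"
    C = np.asarray(A, dtype=np.float64) @ np.asarray(B, dtype=np.float64)
    return np.asarray(C, dtype=np.int64) % p

# Sanity check against a direct integer product mod 97
rng = np.random.default_rng(0)
A = rng.integers(0, 97, size=(4, 4))
B = rng.integers(0, 97, size=(4, 4))
print(np.array_equal(matmul_mod_p(A, B, 97), (A @ B) % 97))  # True
```

[When k*(p-1)^2 exceeds 2^53, implementations split the multiplication into blocks or use a multimodular scheme; the sketch above only covers the single-multiply case.]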

Whilst we have done bits and pieces here to improve the speed of
Flint, we definitely aren't surprised that it falls behind NTL in
many ways.
I've been saying for a while that Flint is starting to show its age
and that there is insufficient research and effort going into speeding
it up.

Having said that, Dan Schultz has been systematically improving many
things and I would think that this will start to show soon if nothing
else does. He did start trying to write a small prime FFT, but wasn't
able to show a significant advantage right away, so that will have to
be revisited at a later date. I'm still using less than optimal
routines for division of polynomials and series, and the code in
fmpz_mod_poly, whilst improved, is still an absolute mess and needs
considerable work for us to be happy with it.
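[For context on the division routines mentioned here: the standard asymptotically fast approach to polynomial and power-series division is Newton iteration, which computes the inverse of a series f (with invertible constant term) mod x^n by doubling the precision each step via g <- g*(2 - f*g). A toy illustration over Z/pZ using schoolbook truncated products; this is purely illustrative and is not FLINT's routine:]

```python
def series_inverse(f, n, p):
    """Invert the power series f (list of coefficients mod p, with
    f[0] invertible) modulo x^n by Newton iteration:
    g <- g * (2 - f*g) mod x^(2k), doubling the precision k each step."""
    g = [pow(f[0], -1, p)]          # inverse of the constant term
    k = 1
    while k < n:
        k = min(2 * k, n)
        # h = f * g truncated to k terms
        h = [0] * k
        for i, fi in enumerate(f[:k]):
            for j, gj in enumerate(g):
                if i + j < k:
                    h[i + j] = (h[i + j] + fi * gj) % p
        # g = g * (2 - h) truncated to k terms
        t = [(-c) % p for c in h]
        t[0] = (t[0] + 2) % p
        newg = [0] * k
        for i, gi in enumerate(g):
            for j, tj in enumerate(t):
                if i + j < k:
                    newg[i + j] = (newg[i + j] + gi * tj) % p
        g = newg
    return g[:n]

# Check: f * f^{-1} = 1 mod x^5 over Z/7Z
f = [1, 3, 0, 5, 2]
g = series_inverse(f, 5, 7)
prod = [0] * 5
for i in range(5):
    for j in range(5 - i):
        prod[i + j] = (prod[i + j] + f[i] * g[j]) % 7
print(prod)  # [1, 0, 0, 0, 0]
```

[A fast implementation replaces the quadratic truncated products above with FFT-based multiplication, which is where the quality of the underlying mul routines feeds directly into division speed.]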

Much of our linear algebra is woefully inadequate and has frankly been
embarrassing at times. We have years of effort ahead of us to improve
it to a state-of-the-art implementation.

The only times that are really surprising to me in your benchmarks are
the multiplication times for Z[x]. I would have thought we'd be doing
pretty well there given the FFT which is still state of the art as far
as I know. But as you've pointed out, more can be done with a
multimodular approach. Many years ago we were ahead in this regime due
to the Kronecker segmentation, and we admittedly did optimise for AMD
processors at the time, but since then hardware seems to have changed
in ways that make the multimodular approach much faster again.
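[For readers unfamiliar with Kronecker segmentation (also called Kronecker substitution): it reduces multiplication in Z[x] to one big-integer multiplication by evaluating both polynomials at x = 2^b, with b chosen large enough that the product's coefficients cannot overlap, and then reading the coefficients back out of the integer product. A minimal sketch for non-negative coefficients; signed coefficients need an extra encoding step, and this is not FLINT's implementation:]

```python
def kronecker_mul(f, g):
    """Multiply f, g in Z[x] (lists of non-negative coefficients,
    lowest degree first) via Kronecker substitution: evaluate both at
    x = 2^b with b large enough that product coefficients don't
    overlap, multiply the two big integers, then split the result."""
    if not f or not g:
        return []
    # Each product coefficient is at most min(len) * max(f) * max(g)
    bound = min(len(f), len(g)) * max(f) * max(g)
    b = bound.bit_length() + 1
    F = sum(c << (b * i) for i, c in enumerate(f))
    G = sum(c << (b * i) for i, c in enumerate(g))
    P = F * G
    mask = (1 << b) - 1
    out = []
    for _ in range(len(f) + len(g) - 1):
        out.append(P & mask)
        P >>= b
    return out

# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(kronecker_mul([1, 2, 3], [4, 5]))  # [4, 13, 22, 15]
```

[The multimodular alternative instead multiplies the reductions mod several small primes and recombines by CRT; which of the two wins depends heavily on how fast the underlying big-integer or small-prime multiplication is on the given hardware, which is the trade-off being discussed above.]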

Anyhow, thanks again for your insights and efforts as always.

Bill.