Just posted a new version. In addition to a few performance improvements,
I've added new routines that give direct access to the underlying "limbs" of a ZZ,
which is a feature that has been requested from time to time.
................
The next thing I will be working on is implementing a multi-modular algorithm
for matrix multiplication over ZZ_p, using the now much faster matrix multiplication
algorithms over zz_p. My preliminary estimates indicate that this should
speed up NTL's mat_ZZ_p mul routines by a factor of 10-20x.
And a multicore version should make it even faster.
After I get faster matrix multiplication working, I want to first experiment with
applying it to the modular composition algorithm used in the ZZ_pX factoring
algorithm. I'm hoping to get a 2-3x speedup in the DDF routine.
After that, I want to work on implementing the reductions from matrix inversion, etc,
to matrix multiplication (over ZZ_p). For this I can probably "re-purpose" FLINT or other
open-source code that has already been written. If anyone would like to help me
with that, I would be grateful.
This multi-modular approach to matrix mul is of course not new --- it is in Sage already
via FFLAS, I think. But still, I'd like to work on getting a fairly decent implementation in NTL.
It's a hobby :-)
Note: Based on my benchmarks, for 20-bit primes, NTL's mat_zz_p multiplication
is about 20% slower than FFLAS on an x86 with AVX2, so not too bad.
Also, NTL's inversion routine is a bit faster than FFLAS's (between 15-35%).