Hi all, a couple of things to note.
1) FLINT only uses zn_poly for FLINT 1.1 fmpz_poly stuff at present
(note *NOT* zmod_poly at this stage - though this is certainly still
planned).
2) zn_poly is certainly faster than zmod_poly, and FLINT *should* use
zn_poly to make it easy and faster for sage. I don't know what the
exact differences are in speed, but I know it is 40% in some medium
sized cases with diminishing returns for larger problems, and perhaps
in other cases for small polynomials (where FLINT will likely also get
improved eventually) it may be a factor of 2 or 3 at present, I simply
don't know. David may be able to answer this. Of course after Python/
Cython overhead this becomes a smaller difference.
3) There are some problems with "just using" zn_poly.
3a) zn_poly doesn't offer a packed type yet. To rival Magma, or at
least get the best performance possible, this seems very important.
Note, FLINT 1.1 doesn't have a packed type yet either. FLINT 2 does,
but that won't be available for almost a year.
3b) It is difficult to see how to use zn_poly in FLINT. FLINT needs
to use the same packed array type for Z/nZ polynomials, matrices over
Z/nZ and also for exponent vectors for multivariable polynomials (both
multipolynomials over Z and Z/pZ). The problem is, zn_poly is a
separate library, not part of FLINT, and it certainly doesn't do all
these other things. The only way I can see to use zn_poly at present
is to convert from FLINT format to zn_poly format and back again *when
that will be faster than using FLINT directly*. I can do that, but
this is all turning out to be much more complicated than I hoped. It
doesn't help that the format for packed arrays in FLINT is still not
fully decided.
4) zn_poly's eventual packed arrays may differ from those eventually
used in FLINT. At this stage it looks like I need a packed array
format which stores things as packed arrays of 8, 16, 32 or 64 bit
limbs, or as doubles and packed doubles, i.e. it must provide numerous
packed formats and convert between them all automatically. It's a
difficult problem to solve.
4a) One doesn't want to have to rewrite every algorithm to handle
all the different kinds of packed format one uses. At the same time,
one doesn't want to wrap all these formats in some kind of easy to use
uber wrapper, as then the time taken to access any individual
coefficient becomes way too high, and you lose most of the benefit of
having a packed array in the first place. I believe using iterators
over a very thin structure may be optimal in both cases, and I've been
experimenting with that (and finding out about its downfalls).
4b) In some cases one wants to pack into entries of 8 bits. One can
use an array of chars for that. Whilst this is fast, on some machines
the order of the bytes within limbs will be different than on others.
This is fine if you are doing scalar operations, where the order of
bytes within limbs doesn't matter. But it is worthless for algorithms
where the order is important. In the experimental packed vector format
in FLINT 2 there is a compile time option to choose between arrays of
chars and actually packing 8 bit entries 8 bits into limbs using
arithmetic operations (which preserve arithmetic order in limbs). The
problem is the packing option is slower than the array option. I
haven't had time to solve this problem yet, or to even analyse all the
algorithms I am going to want to implement and convince myself that it
can all be done with the faster arrays of chars. Probably it is not
possible.
4c) On some machines uint8_t, uint16_t, uint32_t may not all be
available. Don't know, haven't checked. I think c99 only requires the
interface to provide them when they are available as native formats on
the machine hardware. uint64_t is not required on 32 bit machines, I
know that for sure.
4d) On some machines one absolutely *must* pack one or more entries
into a double for optimum performance over some problem ranges. Any
basic zmod_t format needs to take that into account, whether used for
polynomials or not.
In short, this problem is waaaay complicated and just using zn_poly
for polynomial multiplication over Z/pZ is not necessarily going to be
easy.
All I can offer at present is the following:
a) Wrap FLINT and we'll eventually sort it out. FLINT will offer much
higher level stuff than zn_poly. Eventually we'll figure out how to
milk all the speed from zn_poly in FLINT.
b) I want to offer people the option of compiling FLINT without
zn_poly. So FLINT will still have to have its own code for everything.
It just may not be as highly optimised as David's stuff. A compile
time option will allow people to link against zn_poly for maximum
performance. Perhaps one way around the problem will be to have two
versions of zmod_poly, one which wraps zn_poly and uses the zn_poly
packed format, the other of which uses FLINT native code and FLINT's
own format. A conversion will have to be performed for the algorithms
which need to convert. But this would be an absolute nuisance when it
comes to implementing multipolynomial stuff over Z/pZ, so I am
reluctant to do things this way.
*IF* David and I could agree on a common low level packed format which
was sufficiently capable of dealing with all the problems in FLINT
that it is needed for, then this problem could be avoided altogether.
But I have to say that the different emphases of FLINT and zn_poly
make this already a tall order. I reckon we are going to end up just
converting to zn_poly format and back again when it is faster to do
so.
Anyhow, it seems like now is a good time to begin discussing these
issues. Maybe David and I can hack something out between us. I think I
understand the problems much better now than I did before, having had
a basic go at implementing all of the above in FLINT experimentally
using the packed vector format I came up with. But at a minimum I need
to have a packed double vector format as well.
David, any comments??
Bill.