William Stein suggested I join the group and post something I emailed
to him, for comments from the sage-devel list.
I see he has already posted a link to the quadratic sieve I wrote for
SAGE (around 3 times faster than the current one I think):
So on to the current matter....
Here is a copy of my correspondence with William re NTL. Please see his
question to the group (sorry for the backwards quoting - apparently
William replied to my email before I sent it to him - he works faster
than the speed of light ;-) ):
>> Hi Everybody,
>>
>> Any comments on this? William Hart is a super-energetic, skilled
>> programmer, who seems to be suddenly obsessed with re-implementing
>> certain of the functionality of NTL as a C library on top of GMP.
>> Any advice, thoughts, comments, etc., for him? He would make the
>> results GPL'd and include them in SAGE. He is doing this 100%
>> because he wants optimal performance -- see below.
>>
> ------- Forwarded message -------
> From: "William Hart" <har...@yahoo.com>
> To: "William Stein" <wst...@gmail.com>
> Cc:
> Subject: NTL speed test
> Date: Sat, 14 Oct 2006 17:23:40 -0700
>
> Hi William,
> I decided to see just how much of an overhead there
> is using NTL as opposed to something written
> specifically *for* GMP.
>
> So over the weekend I wrote my own library (some of it
> from code I already had lying around) of a selection
> of functions similar to those in NTL's ZZ library
> (which end up mimicking pretty much everything NTL does
> in ZZ that GMP doesn't already do).
> In comparing the times, I compared my code with NTL
> compiled on top of GMP and tuned with Victor's script.
> The result is that everything I implemented was
> between 1.33 and 2.2 times faster than NTL (including
> the stuff that doesn't rely on GMP at all - I just use
> slightly better algorithms and save some time not
> doing bounds checking in time-critical functions).
>
> As I suspected, there is quite a lot of loss in
> building one package on top of another that it wasn't
> originally written to sit on top of, as NTL does.
>
> I also implemented a square root mod p^k function,
> which NTL doesn't have (I don't think).
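> (For the curious: the lifting is just Newton iteration applied to
> x^2 - a. A minimal sketch in C with GMP -- illustrative only, not the
> actual library code:)
>
>     #include <gmp.h>
>
>     /* Lift a square root of a mod p to a square root mod p^k.
>      * Assumes p odd, gcd(a, p) = 1, and r^2 = a (mod p) on entry
>      * (e.g. from Tonelli-Shanks); r holds the result on exit. */
>     void sqrt_mod_pk(mpz_t r, const mpz_t a, const mpz_t p,
>                      unsigned long k)
>     {
>         mpz_t pj, t;
>         mpz_init(pj);
>         mpz_init(t);
>         unsigned long j = 1;
>         while (j < k) {
>             j = (2 * j < k) ? 2 * j : k;  /* precision doubles    */
>             mpz_pow_ui(pj, p, j);
>             mpz_invert(t, r, pj);         /* t = 1/r mod p^j      */
>             mpz_mul(t, t, a);             /* t = a/r              */
>             mpz_add(t, t, r);             /* t = r + a/r          */
>             if (mpz_odd_p(t))
>                 mpz_add(t, t, pj);        /* p^j odd: make t even */
>             mpz_fdiv_q_2exp(t, t, 1);     /* t = (r + a/r) / 2    */
>             mpz_mod(r, t, pj);
>         }
>         mpz_clear(pj);
>         mpz_clear(t);
>     }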
>
> Anyhow, I attach the library to this email, though
> note that this library is *NOT* a drop-in
> replacement for Victor's ZZ library (something I am
> sure Victor thought of writing when GMP came along,
> his current product essentially being the compromise).
>
>
> The point of the implementation I give is to expose
> the underlying GMP types and allow one to work
> directly with GMP functions where available.
>
> I'll work on a replacement for his vec_ZZ (very small)
> and mat_ZZ (more involved) libraries next. I have some
> ideas for how to speed those up a LOT, especially in
> terms of making use of processor caches, etc!!
>
> I'm not sure how much of NTL I need to implement
> before I can implement algebraic number theory stuff.
> But I get the impression it isn't that much.
> Regards,
>
> Bill Hart.
Hi, you seem to row reduce a dense boolean matrix in F2matrix.cpp, so I
decided to spam this list again by recommending the "Method of the Four
Russians" for this task: it should reduce the complexity from n^3 to
(n^3)/log(n) without introducing a huge constant factor. The paper can be
found here:
http://eprint.iacr.org/2006/251.pdf
or an older version here:
http://eprint.iacr.org/2006/163.pdf
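For orientation, the baseline this improves on is ordinary Gaussian
elimination over GF(2) with rows packed 64 entries to a machine word. A
minimal sketch (my own illustration, not Gregory Bard's code) is below;
the Four-Russians trick replaces the inner row updates with lookups into
a table built from k pivot rows at a time, which is where the log(n)
saving comes from:

    #include <stdint.h>

    #define W 64
    /* Row reduce an n x n GF(2) matrix, each row packed into
     * ceil(n/W) words; returns the rank. Illustrative sketch only. */
    int gf2_row_reduce(uint64_t **row, int n)
    {
        int words = (n + W - 1) / W;
        int rank = 0;
        for (int col = 0; col < n && rank < n; col++) {
            int w = col / W;
            uint64_t bit = 1ULL << (col % W);
            int piv = -1;
            for (int i = rank; i < n; i++)
                if (row[i][w] & bit) { piv = i; break; }
            if (piv < 0)
                continue;              /* no pivot in this column */
            uint64_t *tmp = row[piv];  /* swap pivot row into place */
            row[piv] = row[rank];
            row[rank] = tmp;
            for (int i = 0; i < n; i++)   /* clear the column elsewhere */
                if (i != rank && (row[i][w] & bit))
                    for (int j = 0; j < words; j++)
                        row[i][j] ^= row[rank][j];
            rank++;
        }
        return rank;
    }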
The author Gregory Bard btw. agreed to release his implementation under a GPL
compatible license for inclusion in SAGE.
Feel free to ignore this - probably not very helpful - e-mail.
Martin
--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_icq: 177334829
_jab: martinr...@jabber.ccc.de
>> I decided to see just how much of an overhead there
>> is using NTL as opposed to something written
>> specifically *for* GMP.
>>
>> So over the weekend I wrote my own library (some of it
>> from code I already had lying around) of a selection
>> of functions similar to those in NTL's ZZ library
>> (which end up mimicking pretty much everything NTL does
>> in ZZ that GMP doesn't already do).
>> In comparing the times, I compared my code with NTL
>> compiled on top of GMP and tuned with Victor's script.
>> The result is that everything I implemented was
>> between 1.33 and 2.2 times faster than NTL (including
>> the stuff that doesn't rely on GMP at all - I just use
>> slightly better algorithms and save some time not
>> doing bounds checking in time-critical functions).
>>
>> As I suspected, there is quite a lot of loss in
>> building one package on top of another that it wasn't
>> originally written to sit on top of, as NTL does.
Hi Bill,
A few comments.
I should say in advance I don't completely understand where you're
coming from, but hopefully I can work that out over the next few
emails :-)
(1)
I hope you're aware that currently SAGE's NTL wrapper is pretty
hopeless. I mentioned this in one of my talks at Sage Days 2:
http://sage.math.washington.edu/home/dmharvey/padic-heights-talk.pdf
(specifically the slide titled "Problems with NTL interface")
These problems make NTL's functionality not very useful from a SAGE
user's point of view, if they actually want to do any real
calculations. I am planning to fix these problems, but first I need
to do some work on SAGE's polynomial and polynomial ring classes, so
it will take a little time.
(2)
I think it's wonderful that you're beating NTL. On the other hand,
NTL is a big, mature library. To the extent that your code can
actually replace the stuff in the NTL library, you might consider
contributing your code to NTL itself. I've discussed a few
improvements relating to polynomial multiplication with Victor Shoup,
he seems quite receptive to new ideas and code, and there are plenty
of examples in the code where he gives attribution to other people.
He doesn't exactly have a blazingly fast release schedule, but in the
long run that might be a good option.
(3)
The main stuff we're interested in (or at least, the main stuff I am
interested in) from NTL is the following:
* ZZ_X
* ZZ_pX
* finite fields and polynomials over them
* Possibly zz_x, zz_pX
Do you think the kind of things you are doing to beat NTL are going
to be able to beat its implementation of these things too?
I'm not that interested in the matrix classes. Also I'm not
interested in wrapping NTL's ZZ class. (We do have a wrapper
currently, but it's only there for completeness since we have a
wrapper for ZZ_X, but really there's no need to have it.)
David
I think the matrix would be called sparse. There are maybe 10 or 15
1's in a row of 5000 0's.
>
> http://eprint.iacr.org/2006/251.pdf
>
> or an older version here:
>
> http://eprint.iacr.org/2006/163.pdf
The standard algorithm to use is the block Lanczos algorithm. It is
quite adequate to the task. I do not yet use this algorithm as its
implementation is another 500 lines of code that I haven't had time to
write yet.
Computing its complexity seems to be a little difficult, but in
practice it is asymptotically and computationally much faster than
Gaussian elimination.
>
> The author Gregory Bard btw. agreed to release his implementation under a GPL
> compatible license for inclusion in SAGE.
That may be worth a look. How long will it take to reduce a 5000x5000
matrix over GF_2? My Gaussian elimination will take around 1.5 seconds
on a 2 GHz machine.
Block Lanczos will do about a 10000x10000 matrix in around 3 seconds,
which is a tiny fraction of the runtime of the factoring algorithm.
>
> Feel free to ignore this - proably not very helpfull - e-mail.
>
> Martin
>
> --
> name: Martin Albrecht
> _pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
> _www: http://www.informatik.uni-bremen.de/~malb
> _icq: 177334829
> _jab: martinr...@jabber.ccc.de
Bill.
> I should say in advance I don't completely understand where you're
> coming from, but hopefully I can work that out over the next few
> emails :-)
NTL was originally written when modern processors were not available
and when GMP was not available (or not very good). As a result, NTL was
built for something else entirely.
Using NTL is basically using antiquated software, though I really,
really appreciate what Victor has done for computational number theory,
and I am not saying things that he doesn't know himself.
My intention is to provide coverage of some of what is in NTL, and some
of what is in Pari (another dinosaur), in a modernised library. It
doesn't exist yet, and may never exist, so we are talking hypotheticals
at this point.
>
> (1)
> I hope you're aware that currently SAGE's NTL wrapper is pretty
> hopeless. I mentioned this in one of my talks at Sage Days 2:
I have no understanding of wrappers. I suppose that you could say that,
currently, NTL seems to put some kind of "wrapper" on GMP, and I don't
like it (though hats off to Victor for optimizing it pretty well
notwithstanding).
I'm afraid I have no insight into how wrappers work, so I'm not going
to be able to help you optimize those. Can they even be optimized?
SAGE is written in Python, right? And you are calling C functions from
within Python scripts. But some of this is difficult to do by hand, so
you are using Pyrex to do it for you. That's my current level of
understanding of the matter, and no, I have never heard of Pyrex before
I met SAGE and don't program in Python.
I can't say I am a fan of bundling functions together with Python
scripts whichever way it is done. But SAGE pretty much has to do that,
as there is no package which offers everything. I understand that. But
this means there will be *some* limitations to what SAGE can offer if I
am not mistaken.
>
> http://sage.math.washington.edu/home/dmharvey/padic-heights-talk.pdf
>
> (specifically the slide titled "Problems with NTL interface")
>
> These problems make NTL's functionality not very useful from a SAGE
> user's point of view, if they actually want to do any real
> calculations. I am planning to fix these problems, but first I need
> to do some work on SAGE's polynomial and polynomial ring classes, so
> it will take a little time.
When you say SAGE's polynomial classes, etc, what is actually doing the
grunt work there? SAGE? I mean have you written algorithms in Python?
Or are you wrapping some other package, such as NTL or Pari?
>
> (2)
> I think it's wonderful that you're beating NTL. On the other hand,
> NTL is a big, mature library. To the extent that your code can
> actually replace the stuff in the NTL library, you might consider
> contributing your code to NTL itself. I've discussed a few
> improvements relating to polynomial multiplication with Victor Shoup,
> he seems quite receptive to new ideas and code, and there are plenty
> of examples in the code where he gives attribution to other people.
> He doesn't exactly have a blazingly fast release schedule, but in the
> long run that might be a good option.
I may end up just using it myself and not contributing it anywhere in
particular. If a rewrite of NTL can be of use in SAGE somehow, then all
well and good. If not, then that doesn't matter. I am sure there is
plenty more I can contribute to SAGE.
I don't plan to contribute directly to NTL at present. Although I
*could* speed up some of the algorithms within the context of NTL
itself (gcd, extgcd, modinvert, CRT so far seem to be the biggest
winners there), that would only be piecemeal optimization. The real
advantages in runtime come from changing the fundamental implementation
strategy of NTL itself (something Victor no doubt would have strong
opinions about). That means a rewrite from the ground up, though
obviously the further you get from GMP basic types, the more the
library begins to look like NTL and become compatible with it.
What I had in mind was a dictionary for converting any NTL function
call to a call to my library. You could almost get away with replacing
all instances of ZZ, QQ and RR with appropriate GMP types in any NTL
compatible code, and then using a whole pile of #defines to convert
calls to NTL functions to calls to another equivalent library (such as
mine) based on the fundamental GMP types. Absolutely no overhead at all
would be incurred in doing this. You'd get the full benefit of the
optimized library and the whole process of converting NTL compatible
code to the alternative form could be pretty much automated.
If a new function is added to NTL, one would simply need to implement it
in the new library and add it to the dictionary. If I can make
something like that work, I'll implement it and offer it to SAGE as a
drop-in replacement for NTL. But it obviously isn't going to happen
overnight.
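To make that concrete, here is a toy version of the kind of dictionary I
mean (all the names here are invented for illustration; this is not the
actual mapping):

    #include <gmp.h>
    #include <stdio.h>

    /* Toy "dictionary" header: NTL-flavoured names rewritten by the
     * preprocessor into direct GMP calls, so no wrapper code runs. */
    #define ZZ             mpz_t
    #define ZZ_init(x)     mpz_init(x)
    #define ZZ_clear(x)    mpz_clear(x)
    #define ZZ_set_ui(x,n) mpz_set_ui(x, n)
    #define ZZ_mul(r,a,b)  mpz_mul(r, a, b)

    int main(void)
    {
        ZZ a, b, c;
        ZZ_init(a); ZZ_init(b); ZZ_init(c);
        ZZ_set_ui(a, 12345);
        ZZ_set_ui(b, 67890);
        ZZ_mul(c, a, b);        /* expands to mpz_mul: zero overhead */
        gmp_printf("%Zd\n", c);
        ZZ_clear(a); ZZ_clear(b); ZZ_clear(c);
        return 0;
    }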
>
> (3)
> The main stuff we're interested in (or at least, the main stuff I am
> interested in) from NTL is the following:
> * ZZ_X
> * ZZ_pX
> * finite fields and polynomials over them
> * Possibly zz_x, zz_pX
Currently I'm working towards an algebraic number theory package
sitting on an optimized rewrite of NTL. The packages you mention are
not high on my list of current priorities though, except for the
polynomials over ZZ. I don't have any really good ideas there yet
though apart from the usual CS type of optimizations.
>
> Do you think the kind of things you are doing to beat NTL are going
> to be able to beat its implementation of these things too?
Of course, most of the speedups I use in code are general speedups for
modern processors, and apply to almost any kind of code. Algorithms on
the other hand take time to develop and happen by chance, though I am a
number theorist, so anything is possible.
>
> I'm not that interested in the matrix classes. Also I'm not
> interested in wrapping NTL's ZZ class. (We do have a wrapper
> currently, but it's only there for completeness since we have a
> wrapper for ZZ_X, but really there's no need to have it.)
Yes, of course ZZ doesn't offer very much, but some of NTL's other
classes use ZZ. So if ZZ is not optimized, those classes aren't
either. I wouldn't like to see ZZ wrapped. It performs very few
functions whose runtime wouldn't be dominated by the overheads in the
wrapper.
I presume you wrap Pari for that type of stuff, along with some Python
algorithms for small int stuff.
The only way to wrap something like NTL ZZ would be to write an
interface to NTL in C, which can handle multiple requests at once, or
something like that. But I think it would be a disaster. I personally
wouldn't try, and I think there are distinct benefits in not doing so,
especially if an NTL equivalent library does emerge.
>
> David
Bill.
>> (1)
>> I hope you're aware that currently SAGE's NTL wrapper is pretty
>> hopeless. I mentioned this in one of my talks at Sage Days 2:
>
> I have no understanding of wrappers. I suppose that you could say
> that,
> currently, NTL seems to put some kind of "wrapper" on GMP, and I don't
> like it (though hats off to Victor for optimizing it pretty well
> notwithstanding).
>
> I'm afraid I have no insight into how wrappers work, so I'm not going
> to be able to help you optimize those. Can they even be optimized?
Sometimes, to some extent. For example, at the recent workshop, we
spent a long time working on optimising the SAGE GMP wrapper. (i.e.
the Integer object in SAGE.) Basically what it comes down to is that
an Integer object has to be a Python object to be visible to a SAGE
user, and that's where the slowness comes from. So we were looking at
how one can optimise construction of Python objects. Unfortunately,
one finds for example that with a single-word mpz_add, only something
like 5-10% of the time is currently in mpz_add. This is
disappointing, and we're working on it, but we aren't going to get
too close to what you can get calling directly from C.
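(For reference, the all-C baseline is easy to measure; a rough sketch,
nothing SAGE-specific about it:)

    #include <gmp.h>
    #include <stdio.h>
    #include <time.h>

    /* Time raw single-word mpz_add calls from C -- the baseline the
     * Python Integer wrapper is being compared against. */
    int main(void)
    {
        mpz_t a, b, c;
        mpz_init_set_ui(a, 123456789);
        mpz_init_set_ui(b, 987654321);
        mpz_init(c);
        const long N = 10000000;
        clock_t t0 = clock();
        for (long i = 0; i < N; i++)
            mpz_add(c, a, b);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%.1f ns per mpz_add\n", secs * 1e9 / N);
        mpz_clear(a); mpz_clear(b); mpz_clear(c);
        return 0;
    }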
> SAGE is written in Python right? And you are calling C functions from
> within Python scripts. But some of this is difficult to do by hand, so
> you are using Pyrex to do it for you. That's my current level of
> understanding of the matter, and no, I have never heard of Pyrex
> before
> I met SAGE and don't program in Python.
>
> I can't say I am a fan of bundling functions together with Python
> scripts whichever way it is done.
Heh, I don't think of Python code as "scripts" any more.
> But SAGE pretty much has to do that,
> as there is no package which offers everything. I understand that. But
> this means there will be *some* limitations to what SAGE can offer
> if I
> am not mistaken.
Absolutely there are limitations. As I said above, everything
directly visible to the user needs to go via Python objects, and
that's a fundamental bottleneck.
On the other hand, there is nothing stopping us from writing C code
in subroutines underneath. A good example is the work on exact
linear algebra being done at the moment. We're writing a bunch of
generic Python matrix classes, with specialised C versions (currently
Pyrex) for doing e.g. matrix arithmetic over Z/nZ for word-sized n.
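(As a sketch of the kind of kernel meant -- my illustration, not the
actual SAGE code -- a word-sized Z/nZ matrix multiply in C can be as
simple as this, assuming n < 2^32 so products fit in 64 bits:)

    #include <stdint.h>

    /* c = a * b for k x k matrices over Z/nZ, stored row-major with
     * entries already reduced mod n. Illustrative sketch only. */
    void matmul_mod(uint64_t *c, const uint64_t *a, const uint64_t *b,
                    int k, uint64_t n)
    {
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++) {
                uint64_t s = 0;
                for (int t = 0; t < k; t++)
                    s = (s + a[i*k + t] * b[t*k + j]) % n;
                c[i*k + j] = s;
            }
    }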
>> These problems make NTL's functionality not very useful from a SAGE
>> user's point of view, if they actually want to do any real
>> calculations. I am planning to fix these problems, but first I need
>> to do some work on SAGE's polynomial and polynomial ring classes, so
>> it will take a little time.
>
> When you say SAGE's polynomial classes, etc, what is actually doing
> the
> grunt work there? SAGE? I mean have you written algorithms in Python?
> Or are you wrapping some other package, such as NTL or Pari?
It's complicated, and it's changing. Right now SAGE has a bunch of
generic classes for dense and sparse polynomials over any
(commutative) ring. So it can handle anything you throw at it, but
slowly. On the other hand, if you create a polynomial over Z, it is
represented internally by an NTL object, and arithmetic is done all
using NTL.
Still, the NTL wrapper sucks, and it's high on my list of priorities
to fix.
> I don't plan to contribute directly to NTL at present. Although I
> *could* speed up some of the algorithms within the context of NTL
> itself (gcd, extgcd, modinvert, CRT so far seem to be the biggest
> winners there), that would only be piecemeal optimization. The real
> advantages in runtime come from changing the fundamental
> implementation
> strategy of NTL itself (something Victor no doubt would have strong
> opinions about). That means a rewrite from the ground up, though
> obviously the further you get from GMP basic types, the more the
> library begins to look like NTL and become compatible with it.
If you can write a library that's better than NTL, then SAGE will
welcome it.
Using a different "fundamental implementation strategy" to NTL is no
problem, we're agnostic about that.
(For example, it's a colossal pain to be changing the NTL modulus all
the time, and SAGE buggers it up quite a bit at the moment.)
Also, it would not be necessary to rewrite everything. SAGE uses all
kinds of libraries in bits and pieces. For example, if you could
duplicate the most important functionality in ZZ_X, then we could
talk to that directly without using NTL for that at all.
But this is a big project, it's not the kind of thing anyone can pull
off overnight. Victor Shoup knows his stuff and has obviously put a
lot of time into it.
By the way, you mention gcd, extgcd, modinvert. I know there's work
afoot to improve these things in GMP a lot. They're finally getting
around to making gcd subquadratic, and for modular inverses there
might be more mpn-level support for doing precomputations.
David
> http://sage.math.washington.edu/home/dmharvey/padic-heights-talk.pdf
I just read this. Yeouch!! I see what you mean.
We have wrappers on wrappers on wrappers. And passing data from GMP to
GMP by first converting to strings.... This is awful!! The time
involved with converting binary to decimal is already ridiculously
bad....
By the way, you mention that only Karatsuba is used for multiplying
integers. What did you have in mind to use? FFT really only kicks in
for numbers of a few thousand digits or so, I think.
As far as I am aware, Pari uses all the best tricks for multiplying
numbers of less than about 6000-10000 decimal digits, say. Of course I
may very well not be up with all the latest advances in multiplying
numbers.
Actually, come to think of it, isn't Pari built on GMP, which uses
Karatsuba, Toom-Cook 3-way and eventually FFT (I think)?
I'd say the biggest problems you have are that ZZ_pX is not wrapped
in SAGE, and that data is not passed from GMP within NTL to GMP within
SAGE in a sensible way. You'd almost be better passing the mpz_t (which
is a pointer) as a binary string, rather than the number itself. As the
whole of SAGE is compiled together as a single package, I would imagine
this would not initiate a GPF.
Of course, you have three problems here. Firstly, NTL doesn't really
expose the underlying GMP types all that well. Secondly it doesn't
really use mpz_t's, but uses mpn's or something like that. Another
reason for a rewrite of NTL. Thirdly, passing anything as text at the
level of fundamental types is BAD!!
What is really needed is a function which takes an NTL ZZ integer and
makes it available to the rest of SAGE. I guess it is the consequences
of doing that, for the Pyrex gluing code, which throws a spanner into
the works. People actually want as much as possible dealt with by
Pyrex.
Bill.
> Sometimes, to some extent. For example, at the recent workshop, we
> spent a long time working on optimising the SAGE GMP wrapper. (i.e.
> the Integer object in SAGE.)
Oh, I see. GMP has a wrapper too. I just presumed that SAGE's "seat of
consciousness" was a C program (which would have direct access to GMP)
and that the Python stuff was just to negotiate between all the other
packages. But of course that is illogical. You don't need Python to
negotiate between C code and C code.
Unfortunately,
> one finds for example that with a single-word mpz_add, only something
> like 5-10% of the time is currently in mpz_add.
Yeouch!!
I recently came upon such a problem in the web operating system YouOS.
They wrote it all in JavaScript, which is hopelessly slow for
mathematical computation (which is what I was interested in it for,
because of its inherent massively distributed capabilities). The
solution was to write mathematics code entirely in native Java, and if
one really has to talk to the YouOS operating system, to do so by
passing Java objects back out to the JavaScript.
I quickly discovered that JavaScript had a bigint library, but didn't
implement FFT (in fact it might even just use an O(n^2) algorithm for
multiplication).
Personally I think the only thing that should be written in Python in
SAGE should be the interface with the user. The core of SAGE should be
written in C. But with my limited understanding of how SAGE works, and
of how Python works, I wouldn't be surprised to find this wasn't even
possible.
> On the other hand, there is nothing stopping us from writing C code
> in subroutines underneath. A good example is the work on exact
> linear algebra being done at the moment. We're writing a bunch of
> generic Python matrix classes, with specialised C versions (currently
> Pyrex) for doing e.g. matrix arithmetic over Z/nZ for word-sized n.
Great.
> (For example, it's a colossal pain to be changing the NTL modulus all
> the time, and SAGE buggers it up quite a bit at the moment.)
I'll bear this in mind when I get to it. I'll make sure I post to the
group for comments when I get to implementing this part of the NTL
library.
> Also, it would not be necessary to rewrite everything. SAGE uses all
> kinds of libraries in bits and pieces. For example, if you could
> duplicate the most important functionality in ZZ_X, then we could
> talk to that directly without using NTL for that at all.
Yes, I see that. Any comments on what you'd like this ZZ_X library to
include/do/improve?
After vectors/matrices, it will be the next thing I work on. (I need
the matrices for my own projects even if SAGE doesn't.)
> But this is a big project, it's not the kind of thing anyone can pull
> off overnight. Victor Shoup knows his stuff and has obviously put a
> lot of time into it.
Yeah, but improving someone else's project is much easier than writing
it all from scratch yourself. Victor has done an excellent job, but
there are also limitations in his product which even he can't get
around without rewriting it all.
I gather NTL is GPL'd, or it wouldn't be being used. Thus I can modify
and redistribute it if I want (not that it will look too much like the
original).
> By the way, you mention gcd, extgcd, modinvert. I know there's work
> afoot to improve these things in GMP a lot.
Oh at long last!! I can write a modinvert algorithm in C which is
faster than the GMP algorithm written in assembly language!
> They're finally getting
> around to making gcd subquadratic, and for modular inverses there
> might be more mpn-level support for doing precomputations.
That sounds good. I could use that in my sieve, though the
self-initialising sieve isn't as affected by this issue as my ordinary MPQS
version was.
> David
Bill.
> David Harvey wrote:
>
>> http://sage.math.washington.edu/home/dmharvey/padic-heights-talk.pdf
>
> I just read this. Yeouch!! I see what you mean.
>
> We have wrappers on wrappers on wrappers. And passing data from GMP to
> GMP by first converting to strings.... This is awful!! The time
> involved with converting binary to decimal is already ridiculously
> bad....
Indeed.
> By the way, you mention that only Karatsuba is used for multiplying
> integers. What did you have in mind to use? FFT really only kicks in
> for numbers of a few thousand digits or so, I think.
I was referring to polynomial multiplication. PARI only knows
Karatsuba for polynomials (unlike NTL which knows several FFT variants).
> Actually, come to think of it, isn't Pari built on GMP, which uses
> Karatsuba, Toom-Cook 3-way and eventually FFT (I think)?
Yes.
> I'd say the biggest problems you have are that ZZ_pX is not wrappered
> in SAGE, and that data is not passed from GMP within NTL to GMP within
> SAGE in a sensible way.
SAGE has an Integer type and an IntegerMod type (integers mod a fixed
n) which both use GMP for the underlying representation. What I am
planning to do is copy bytes directly from ZZ_pX/ZZ_X coefficients to
our own representation, plus get the sign correct. This can be done
in Pyrex without much trouble. To do it safely we actually have to do
two copies, because NTL exposes e.g. ZZToBytes(), and it's not the
same byte order as GMP's representation. But if we assume that our
NTL is actually using GMP (via mpn_*), which is a pretty safe
assumption in the SAGE environment, then we can literally copy the
data, though we may have to modify NTL slightly to make certain
members public etc.
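(In pure GMP terms, the kind of copy meant looks like the sketch below,
using mpz_export/mpz_import, with the two mpz_t's standing in for an NTL
coefficient and a SAGE Integer; the real thing would go through Pyrex:)

    #include <gmp.h>
    #include <stdlib.h>

    /* Move an integer between two GMP-backed objects by copying raw
     * words instead of going through decimal strings. dst must be
     * initialised by the caller. Illustrative sketch only. */
    void copy_via_words(mpz_t dst, const mpz_t src)
    {
        if (mpz_sgn(src) == 0) { mpz_set_ui(dst, 0); return; }
        size_t count;
        /* export |src| least-significant word first */
        void *buf = mpz_export(NULL, &count, -1, sizeof(mp_limb_t),
                               0, 0, src);
        mpz_import(dst, count, -1, sizeof(mp_limb_t), 0, 0, buf);
        if (mpz_sgn(src) < 0)
            mpz_neg(dst, dst);  /* export drops the sign; restore it */
        free(buf);
    }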
The IntegerMod type is fairly new (i.e. it got moved down from Python
to Pyrex quite recently) and it's not heavily optimised yet.
> You'd almost be better passing the mpz_t (which
> is a pointer) as a binary string, rather than the number itself. As
> the
> whole of SAGE is compiled together as a single package, I would
> imagine
> this would not initiate a GPF.
>
> Of course, you have three problems here. Firstly, NTL doesn't really
> expose the underlying GMP types all that well. Secondly it doesn't
> really use mpz_t's, but uses mpn's or something like that.
Yes. But we're still okay copying bytes.
>> Sometimes, to some extent. For example, at the recent workshop, we
>> spent a long time working on optimising the SAGE GMP wrapper. (i.e.
>> the Integer object in SAGE.)
>
> Oh, I see. GMP has a wrapper too. I just presumed that SAGE's "seat of
> consciousness" was a C program (which would have direct access to GMP)
> and that the Python stuff was just to negotiate between all the other
> packages. But of course that is illogical. You don't need Python to
> negotiate between C code and C code.
That's one of the really nice things about Pyrex. You're really
writing C, except with Python-like syntax, and very convenient access
to Python stuff when you need it.
>> Unfortunately,
>> one finds for example that with a single-word mpz_add, only something
>> like 5-10% of the time is currently in mpz_add.
>
> Yeouch!!
Yeah it sucks. But this is the price you pay for embedding such
functionality in a general-purpose environment like this.
We're still going to get it down a bit. My aim is to have mpz_add
take about 30% of the time.
> Personally I think the only thing that should be written in Python in
> SAGE should be the interface with the user. The core of SAGE should be
> written in C. But with my limited understanding of how SAGE works, and
> of how Python works, I wouldn't be surprised to find this wasn't even
> possible.
The core of SAGE is getting moved down into Pyrex/C, slowly but
surely. The advantage of a Python framework is mostly that the
development time for getting prototypes working is really short, and
then the path to migrate down to Pyrex is pretty painless.
>> Also, it would not be necessary to rewrite everything. SAGE uses all
>> kinds of libraries in bits and pieces. For example, if you could
>> duplicate the most important functionality in ZZ_X, then we could
>> talk to that directly without using NTL for that at all.
>
> Yes, I see that. Any comments on what you'd like this ZZ_X library to
> include/do/improve?
I did some benchmarks of ZZ_X multiplication against MAGMA. It's
really fast when the coefficients are small and the degree is huge,
maybe 1.5x slower than MAGMA when the coefficients are huge and the
degree is small, and it has real overhead problems when both are
small. (MAGMA would be 5-10x faster in this last case.) You can see
some results at:
http://sage.math.washington.edu:9001/DavidHarvey/TimingExperiments
(Warning: my timing methodology wasn't that careful.)
So one of my main gripes with NTL's performance is that it doesn't do
too well on small input data. It's been optimised for really huge stuff.
Apart from that, I haven't really explored the higher-level
algorithms very much, so I can't offer much.
> I gather NTL is GPL'd, or it wouldn't be being used. Thus I can modify
> and redistribute it if I want (not that it will look too much like the
> original).
Yes, the GPL is crucial.
David
The limitations are on the interpreter part of SAGE. But SAGE is not
just the interpreter -- it's:
(1) a collection of software and C/C++ libraries
(2) a new computer algebra system (the Python/Pyrex libraries)
(3) interfaces to other math software.
It's important to keep all three aspects of SAGE in mind for this
discussion.
SAGE is an attempt to answer Gonzalo Tornaria's question: "if I design a
new algorithm, what should I implement it in?" My answer is to implement
it as something that can be built in the SAGE environment (with the SAGE
libraries etc.), and which provides an easy-to-use documented interface
to the SAGE command line, so people can very easily use it.
> Absolutely there are limitations. As I said above, everything
> directly visible to the user needs to go via Python objects, and
> that's a fundamental bottleneck.
>
> On the other hand, there is nothing stopping us from writing C code
> in subroutines underneath. A good example is the work on exact
> linear algebra being done at the moment. We're writing a bunch of
> generic Python matrix classes, with specialised C versions (currently
> Pyrex) for doing e.g. matrix arithmetic over Z/nZ for word-sized n.
With proper documentation, etc., there is nothing stopping
any user from writing speed-critical parts of their code in C. A big
part of what SAGE provides is a coherent collection of C/C++ accessible
libraries, with *lots* of examples (in Pyrex code mostly) of how to use
them. The Python-interpreter access to this functionality is often
useful, but it is not the only goal of SAGE (see above).
Even if you never touch the Python interpreter for your work or
contributions to SAGE, you could create a very useful C library, which
others who write for "the SAGE collection of libraries" would greatly
appreciate.
>> I don't plan to contribute directly to NTL at present. Although I
>> *could* speed up some of the algorithms within the context of NTL
>> itself (gcd, extgcd, modinvert, CRT so far seem to be the biggest
>> winners there), that would only be piecemeal optimization. The real
>> advantages in runtime come from changing the fundamental
>> implementation
>> strategy of NTL itself (something Victor no doubt would have strong
>> opinions about). That means a rewrite from the ground up, though
>> obviously the further you get from GMP basic types, the more the
>> library begins to look like NTL and become compatible with it.
>
> If you can write a library that's better than NTL, then SAGE will
> welcome it.
I second that!
> Also, it would not be necessary to rewrite everything. SAGE uses all
> kinds of libraries in bits and pieces. For example, if you could
> duplicate the most important functionality in ZZ_X, then we could
> talk to that directly without using NTL for that at all.
>
> But this is a big project, it's not the kind of thing anyone can pull
> off overnight. Victor Shoup knows his stuff and has obviously put a
> lot of time into it.
That said, I wouldn't underestimate what is possible. I've seen
some absolutely amazing mathematical programming happen, when the
timing is right and there is sufficient will power.
William
I haven't had the chance yet to either implement it or use Gregory Bard's
implementation. However, he gives some benchmarks in his papers, and he
claims to provide very accurate complexity estimates.
> > By the way, you mention that only Karatsuba is used for multiplying
> > integers. What did you have in mind to use? FFT really only kicks in
> > for numbers of a few thousand digits or so, I think.
>
> I was referring to polynomial multiplication. PARI only knows
> Karatsuba for polynomials (unlike NTL which knows several FFT variants).
Ah, now I see. That is useful information.
> I did some benchmarks of ZZ_X multiplication against MAGMA. It's
> really fast when the coefficients are small and the degree is huge,
> maybe 1.5x slower than MAGMA when the coefficients are huge and the
> degree is small, and it has real overhead problems when both are
> small. (MAGMA would be 5-10x faster in this last case.) You can see
> some results at:
>
> http://sage.math.washington.edu:9001/DavidHarvey/TimingExperiments
Great!! That will be very useful as a guide when I'm implementing this
library.
I can see where a factor of 2 goes in NTL at the low end, and obviously
Pari doesn't have any decent algorithms at the top end. We'll see what
else I can find when I actually do the implementation. I don't expect
to find more than the factor of 2 at the bottom end. The Z mod nZ
library stands to gain more at the low end.
I imagine the factor of 5-10 you are talking about includes the Python
interpreter overheads or something. I can't see MAGMA being 5x faster
than NTL in this regime.
Incidentally, since GMP is GPL'd and MAGMA is not, I presume they don't
use GMP. What *do* they use?
Bill.
They do (partly?) use GMP, as it is not GPL'd but LGPL'd, which means
you may link to it.
See http://en.wikipedia.org/wiki/LGPL for details.
Martin
> Bill.
> I imagine the factor of 5-10 you are talking about includes the Python
> interpreter overheads or something. I can't see MAGMA being 5x faster
> than NTL in this regime.
The factor of 10 possibly includes the overhead, but the factor of 5
doesn't. At some point I switched to calling NTL directly from C++ to
convince myself that I wasn't imagining things. In fact, if anything
MAGMA had interpreter overhead. Clearly MAGMA is using a different
algorithm for small polynomials. You may want to repeat the tests
yourself. If you still don't believe me I'm happy to repeat them
myself too, some extra convincing would be helpful for me too.
E.g. degree = 250, with 1000 bits per coefficient was a test case
where MAGMA was much faster than NTL.
Clearly MAGMA is using a different algorithm. NTL is using SSMul
(schonhage-strassen) in this range, at least on my machine. You can
take degree = 255 if you like, to convince yourself it's not a
rounding-up problem.
Here is an email I wrote to Victor some time ago about this case. I
can see where to find about 20-25% of the time, but beyond that I
have no idea.
> I've been looking at the breakdown of timings of SSMul and I think
> I can offer some concrete suggestions on how to speed things up a bit.
>
> When the coefficient size is large, there are really no surprises.
> Here's a breakdown for degree = 250, with 10^5 bits per coefficient
> (this is all on my machine at home, a G5 PowerPC):
>
> ffts: 0.958722
> ifft: 0.518043
> points: 12.6698
> rest: 0.183027
>
> (ffts = first two FFTs, ifft = inverse FFT, points = pointwise
> multiplications, rest = everything else)
>
> In this range, the runtime is dominated by the pointwise
> multiplications, and we just have to trust that GMP is doing the
> best it can.
>
> When the coefficients get smaller, funny things start to happen,
> relating to either caching or memory management (or both). For
> example, with 10^4 bits per coefficient I get this the first time:
>
> ffts: 0.159956
> ifft: 0.052138
> points: 0.691226
> rest: 0.019053
>
> But then if I run the same multiplication again, I get something
> more like:
>
> ffts: 0.080769
> ifft: 0.051914
> points: 0.657812
> rest: 0.010194
>
> You can see especially the first FFTs are much faster. I can't
> believe it's an issue with caching of the *input* data because the
> random input polynomials were generated just before the
> multiplication. But it's possible that it has to do with caching of
> temporary storage that just got allocated by the multiplication
> routine. For the remainder of this email, I'll assume that our
> temporary storage is nicely cache-primed, and I'll give the timing
> breakdowns for attempts other than the first one.
>
> Now let's go down to 10^3 bits per coefficient, where things get
> more interesting. These are the multiplications where MAGMA's
> advantage over NTL appears to be greatest.
>
> NTL still chooses SSMul over HomMul. The timing breakdown is:
>
> ffts: 0.010266
> ifft: 0.00607
> points: 0.020025
> rest: 0.001379
>
> (By the way, a *really* weird thing is that the cache effects only
> kick in on the FOURTH attempt for these multiplications, even when
> I perform exactly the same multiplication every time. I have no
> idea how to explain this.)
>
> As you can see, the FFTs now take up around 43% of the total time.
> So here it's really worth trying to speed up the FFTs.
>
> There is one glaring thing that can be sped up a lot in the fft()
> and ifft() functions, which is adding and subtracting p. The thing
> is, p contains just two nonzero bits! So calling add() and sub() to
> operate on p is really much slower than necessary. It's just adding
> lots of zeroes most of the time. More specifically: the fft()
> function itself calls add/sub about 3.5 times (I'm counting 0.5 for
> the conditional one that happens probably about half the time). But
> if you do the additions of p properly, it should only be 2
> additions/subtractions, plus a little bit-fiddling. Further, the
> LeftRotate function (which is called by fft()) also has an
> unnecessary 0.5 additions. So really it's doing 4 instead of 2
> additions. So I bet you could speed up the FFTs by a factor of two.
>
> Based on the 43% figure above, this would speed up the whole
> multiplication by around 25%.
>
> If you really want to get down and dirty, I observe that the
> integers we're dealing with here are only about 30 machine words
> long! (Assuming a 32-bit machine.) I bet that all the ZZ overhead
> is pretty heavy compared to the actual arithmetic in these cases.
> It has to check for aliasing, handle carries, check for whether
> there's enough space etc. Ideally you would want to just allocate
> the right amount of space for each coefficient at the beginning
> (which is easy to work out), and then use mpn functions directly.
> But that unfortunately would break your interface, since it's
> supposed to work with the non-GMP integer package too. It's a
> shame, because I think you could get a big speedup by doing things
> directly here. Would it be practical to have a GMP-specific version
> of fft() which does this all at the lower level?
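(To make the add/subtract-p point in the quoted message concrete: since
p has the form B^n + 1, with B the word radix, a reduction only needs to
touch two "digits". A minimal sketch at GMP's mpn level, assuming a
limb-aligned modulus and already-reduced operands; this is not NTL's
code:)

    #include <gmp.h>

    static int limbs_zero(const mp_limb_t *x, mp_size_t n)
    {
        for (mp_size_t i = 0; i < n; i++)
            if (x[i]) return 0;
        return 1;
    }

    /* r = a + b mod p, p = B^n + 1; operands are n+1 limbs, < p.
     * Subtracting p touches only the top limb and the low end,
     * instead of a full subtraction of n+1 mostly-zero limbs. */
    void addmod_fermat(mp_limb_t *r, const mp_limb_t *a,
                       const mp_limb_t *b, mp_size_t n)
    {
        mpn_add_n(r, a, b, n + 1);        /* top limb is now 0, 1 or 2 */
        if (r[n] > 1 || (r[n] == 1 && !limbs_zero(r, n))) {
            r[n] -= 1;                    /* subtract the B^n part of p */
            mpn_sub_1(r, r, n + 1, 1);    /* subtract the +1 part       */
        }
    }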
David
>
>> Incidentally, since GMP is GPL'd and MAGMA is not, I presume they
>> don't
>> use GMP. What *do* they use?
>
> They do (partly?) use GMP as it is not GPL'd but LGPL'd which means
> you may
> link to it.
>
> See http://en.wikipedia.org/wiki/LGPL for details.
For MAGMA's position on this, see
https://magma.maths.usyd.edu.au/magma/export/mpfr_gmp.shtml
David
> E.g. degree = 250, with 1000 bits per coefficient was a test case
> where MAGMA was much faster than NTL.
Those numbers are much bigger than what I imagined when you said
"small". I presumed you meant with coefficients that would fit into a
machine word, where there are various things you can do to speed things
up (though obviously more when you are working in Z/nZ).
> Clearly MAGMA is using a different algorithm. NTL is using SSMul
> (schonhage-strassen) in this range, at least on my machine.
It is possibly using a different algorithm, but I'd be more confident
of that if I saw the MAGMA code. One should of course have a set of
different algorithms with tuned cutoff points for switching from one to
another. But I think NTL does this. Could the MAGMA people really have
found an algorithm which is *that* much faster?
> Here is an email I wrote to Victor some time ago about this case. I
> can see where to find about 20-25% of the time, but beyond that I
> have no idea.
In my experience, caching issues (which you indicate exist) can easily
be responsible for a factor of 2 or more. The way NTL sits on GMP may
be responsible for a factor of 1.5 - 2.0. I'll bet there isn't too much
more to it than this. A couple of other minor modifications here and
there may explain the rest.
> > I've been looking at the breakdown of timings of SSMul and I think
> > I can offer some concrete suggestions on how to speed things up a bit.
> >
> > When the coefficient size is large, there are really no surprises.
> > Here's a breakdown for degree = 250, with 10^5 bits per coefficient
<snip>
> > In this range, the runtime is dominated by the pointwise
> > multiplications, and we just have to trust that GMP is doing the
> > best it can.
Well, if you have an AMD 64-bit machine, perhaps not. As pointed out on
the MAGMA website, they use a special patch to GMP when working with
AMD 64 machines, since GMP is quite a bit slower there (around 20%
slower) than it should be.
I get the impression that William compiles the GMP in SAGE with the AMD
64 patch, but it might be that the ordinary GMP package on his machine
may not have it. So if you were using William's machine... It might be
worth checking this.
> > Now let's go down to 10^3 bits per coefficient, where things get
> > more interesting. These are the multiplications where MAGMA's
> > advantage over NTL appears to be greatest.
> >
> > NTL still chooses SSMul over HomMul.
Have you tried changing this to see what happens? Perhaps Victor's
tuning script already adjusts this optimally, though one might need to
be careful of the assumptions it makes vs the conditions your
application supplies. It might be worth changing the cutoff manually
and seeing if that helps.
> > (By the way, a *really* weird thing is that the cache effects only
> > kick in on the FOURTH attempt for these multiplications, even when
> > I perform exactly the same multiplication every time. I have no
> > idea how to explain this.)
Really weird, though I have seen something analogous which I also have
not been able to fully explain. My impression is that it is related to
the allocation of memory, at least in the instance I am aware of.
I mean, have you checked the obvious things, for example that your hard
disk is not thrashing on the first run?
> > If you really want to get down and dirty, I observe that the
> > integers we're dealing with here are only about 30 machine words
> > long! (Assuming a 32-bit machine.) I bet that all the ZZ overhead
> > is pretty heavy compared to the actual arithmetic in these cases.
There's probably a significant constant factor hiding in here. I
actually doubt that the size of the integers makes much difference to
that factor, though obviously for *very* small integers the proportion
of time wasted will be slightly larger.
As you indicate, one of the ways of getting around this factor is to
break Victor's implementation so it doesn't work with the other integer
package. I will certainly be doing that when I reimplement it.
Bill.
You can ask Allan Steel or somebody at MAGMA. They'll probably answer.
>> > In this range, the runtime is dominated by the pointwise
>> > multiplications, and we just have to trust that GMP is doing the
>> > best it can.
>
> Well if you have an AMD 64 bit machine, perhaps not. As pointed out on
> the MAGMA website, they use a special patch to GMP when working with
> AMD 64 machines, since GMP is quite a bit slower there (around 20%
> slower) than it should be.
>
> I get the impression that William compiles the GMP in SAGE with the AMD
> 64 patch, but it might be that the ordinary GMP package on his machine
> may not have it. So if you were using William's machine... It might be
> worth checking this.
I don't know anything about this AMD 64 patch. I don't compile the GMP
in SAGE with it applied. Where do I get it?
>> > (By the way, a *really* weird thing is that the cache effects only
>> > kick in on the FOURTH attempt for these multiplications, even when
>> > I perform exactly the same multiplication every time. I have no
>> > idea how to explain this.)
>
> Really weird, though I have seen something analogous which I also have
> not been able to fully explain. My impression is that it is related to
> the allocation of memory, at least in the instance I am aware of.
>
> I mean, have you checked the obvious things, for example that your hard
> disk is not thrashing on the first run?
He's running on a machine with 64GB RAM that's in a server room, so
it's harder to tell if that's happening.
William
> I don't know anything about this AMD 64 patch. I don't compile the GMP
> in SAGE with it applied. Where do I get it?
My reading of it is that this patch below will speed up GMP, though I'm
not sure if that is then automatically compatible with NTL or what.
Worth a try I guess.
http://www.loria.fr/~gaudry/mpn_AMD64/index.html
Bill
>> Clearly MAGMA is using a different algorithm. NTL is using SSMul
>> (schonhage-strassen) in this range, at least on my machine.
>
> It is possibly using a different algorithm, but I'd be more confident
> of that if I saw the MAGMA code. One should of course have a set of
> different algorithms with tuned cutoff points for switching from
> one to
> another. But I think NTL does this. Could the MAGMA people really have
> found an algorithm which is *that* much faster?
I don't know. Anything's possible :-)
One possibility I thought about was some kind of FFT working
simultaneously in both axes (degree and coefficient). It might be the
case that for these sizes, the "rectangle" isn't long enough or tall
enough to do fast arithmetic in either direction independently, but
taken as a whole it might be okay. For example, consider degree 256
times 1024 bit coefficients again. Say the machine word is 32 bits. I
think NTL chooses SSMul here. Then your input data is 8192 machine
words. This is probably large enough to efficiently perform a 2-d FFT
in one go. But if you use SS, then each coefficient multiplication is
only 32 words long, and GMP uses basecase (or possibly one level of
karatsuba?) in this range.
Of course this is a very woolly, imprecise idea, and I'm not sure
exactly what I mean.
I'm kind of surprised that NTL's SSMul is faster than HomMul for this
case. It feels like HomMul should be faster, especially seeing as NTL
does small-coefficient multiplications very quickly.
>> Here is an email I wrote to Victor some time ago about this case. I
>> can see where to find about 20-25% of the time, but beyond that I
>> have no idea.
>
> In my experience, caching issues (which you indicate exist) can easily
> be responsible for a factor of 2 or more. The way NTL sits on GMP may
> be responsible for a factor of 1.5 - 2.0. I'll bet there isn't too
> much
> more to it than this. A couple of other minor modifications here and
> there may explain the rest.
Sure, that's possible.
>>> Now let's go down to 10^3 bits per coefficient, where things get
>>> more interesting. These are the multiplications where MAGMA's
>>> advantage over NTL appears to be greatest.
>>>
>>> NTL still chooses SSMul over HomMul.
>
> Have you tried changing this to see what happens? Perhaps Victor's
> tuning script already adjusts this optimally, though one might need to
> be careful of the assumptions it makes vs the conditions your
> application supplies. It might be worth changing the cutoff manually
> and seeing if that helps.
I did try switching to HomMul and it was even slower. I tried a few
things like this and got the impression that NTL's tuning was pretty
good.
>>> (By the way, a *really* weird thing is that the cache effects only
>>> kick in on the FOURTH attempt for these multiplications, even when
>>> I perform exactly the same multiplication every time. I have no
>>> idea how to explain this.)
>
> Really weird, though I have seen something analogous which I also have
> not been able to fully explain. My impression is that it is related to
> the allocation of memory, at least in the instance I am aware of.
>
> I mean, have you checked the obvious things, for example that your
> hard
> disk is not thrashing on the first run?
I got the same result over multiple trials. I can't think why the
drive would be thrashing every time I started it. It wasn't a memory-
intensive program.
David
> One possibility I thought about was some kind of FFT working
> simultaneously in both axes (degree and coefficient). It might be the
> case that for these sizes, the "rectangle" isn't long enough or tall
> enough to do fast arithmetic in either direction independently, but
> taken as a whole it might be okay. For example, consider degree 256
> times 1024 bit coefficients again. Say the machine word is 32 bits. I
> think NTL chooses SSMul here. Then your input data is 8192 machine
> words. This is probably large enough to efficiently perform a 2-d FFT
> in one go. But if you use SS, then each coefficient multiplication is
> only 32 words long, and GMP uses basecase (or possibly one level of
> karatsuba?) in this range.
I very much doubt a 2-D FFT would work, though the thought of trying it
had certainly occurred to me. Basically any multiplications in Z that
you do are already highly optimized because of GMP. I don't believe
breaking them apart will help.
I mean, I see what you are suggesting. Don't let GMP sort out any
coefficient multiplication, but do it as part of a 2D FFT. But I don't
think there is a hope of doing in C more efficiently what GMP does in
assembly code, especially for such a complicated algorithm. Thus I
doubt that what is lost by working in C would be made up by doing a 2D FFT.
Actually, I am sure enough about this that I wouldn't even try.
I am actually extremely skeptical that SSMul is the correct
multiplication to be using in the range that you are speaking of (~300
digit coefficients in ~300 term polynomials). I believe what NTL calls
HomMul (which is actually a generic term as Dan Bernstein has shown,
and probably should be eschewed, since it isn't very descriptive) is
the correct way to do things. This suggests that there is something
seriously wrong with the HomMul function in NTL. In fact, I really can't see
why Karatsuba, Toom-3 or one of their many variants shouldn't be enough
if composed in a sufficiently potent cocktail.
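To fix ideas, here is a bare-bones version of the kind of thing I mean:
one fully recursive Karatsuba over mpz_t coefficients with a schoolbook
basecase. This is only a sketch (cutoffs, unbalanced operands and the
Toom variants are where the real work is), not the code I will actually
write:

    #include <gmp.h>
    #include <stdlib.h>

    /* r[0..la+lb-2] += a * b, schoolbook. */
    static void mul_basecase(mpz_t *r, mpz_t *a, int la, mpz_t *b, int lb)
    {
        mpz_t t;
        mpz_init(t);
        for (int i = 0; i < la; i++)
            for (int j = 0; j < lb; j++) {
                mpz_mul(t, a[i], b[j]);
                mpz_add(r[i + j], r[i + j], t);
            }
        mpz_clear(t);
    }

    static mpz_t *vec_init(int n)
    {
        mpz_t *v = malloc(n * sizeof(mpz_t));
        for (int i = 0; i < n; i++) mpz_init(v[i]);
        return v;
    }

    static void vec_clear(mpz_t *v, int n)
    {
        for (int i = 0; i < n; i++) mpz_clear(v[i]);
        free(v);
    }

    /* r[0..2n-2] += a * b, n a power of two. Karatsuba split:
     * a*b = a0*b0 + ((a0+a1)(b0+b1) - a0*b0 - a1*b1) X^m + a1*b1 X^2m */
    static void mul_kara(mpz_t *r, mpz_t *a, mpz_t *b, int n)
    {
        if (n <= 8) { mul_basecase(r, a, n, b, n); return; }
        int m = n / 2;
        mpz_t *lo = vec_init(2*m - 1);          /* a0 * b0 */
        mpz_t *hi = vec_init(2*m - 1);          /* a1 * b1 */
        mpz_t *sa = vec_init(m), *sb = vec_init(m);
        mul_kara(lo, a, b, m);
        mul_kara(hi, a + m, b + m, m);
        for (int i = 0; i < m; i++) {
            mpz_add(sa[i], a[i], a[i + m]);
            mpz_add(sb[i], b[i], b[i + m]);
        }
        mul_kara(r + m, sa, sb, m);             /* += (a0+a1)(b0+b1) */
        for (int i = 0; i < 2*m - 1; i++) {
            mpz_add(r[i], r[i], lo[i]);             /* low  += a0*b0 */
            mpz_sub(r[i + m], r[i + m], lo[i]);     /* mid  -= a0*b0 */
            mpz_sub(r[i + m], r[i + m], hi[i]);     /* mid  -= a1*b1 */
            mpz_add(r[i + 2*m], r[i + 2*m], hi[i]); /* high += a1*b1 */
        }
        vec_clear(lo, 2*m - 1); vec_clear(hi, 2*m - 1);
        vec_clear(sa, m); vec_clear(sb, m);
    }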
I will try implementing some polynomial multiplication routines over
the next week and see just how bad NTL's routines are. I don't expect
to beat NTL straight away, since there are so many possible algorithms
to use, and so many variants, that it is going to take me quite some
time to determine the best ones to use. However, at least I should be
able to get some idea of what is really going on.
I realised that cache locality is not an issue. There is hardly enough
data to actually fill the cache, and certainly not enough to cause a
particularly bad problem. Thus I now believe the algorithm is at fault
after all. Some multiplication algorithms are particularly cache
friendly, but that is not going to be the problem at the regimes you
are talking about.
>
> Of course this is a very woolly, imprecise idea, and I'm not sure
> exactly what I mean.
>
> I'm kind of surprised that NTL's SSMul is faster than HomMul for this
> case. It feels like HomMul should be faster, especially seeing as NTL
> does small-coefficient multiplications very quickly.
Not as quickly as GMP, and this may be part of the problem. I wouldn't
be surprised to find a factor of two here, and a factor of 2 or 3 in
the algorithm used. The MAGMA people probably have a nice intermediate
algorithm.
> I got the same result over multiple trials. I can't think why the
> drive would be thrashing every time I started it. It wasn't a memory-
> intensive program.
A memory leak is about the only thing I can think of, but I am certain
Victor is too careful for that. Still, strange things can happen, and
it would explain the symptoms.
> David
Bill.
> I will try implementing some polynomial multiplication routines over
> the next week and see just how bad NTL's routines are. I don't expect
> to beat NTL straight away, since there are so many possible algorithms
> to use, and so many variants, that it is going to take me quite some
> time to determine the best ones to use. However, at least I should be
> able to get some idea of what is really going on.
Let me know if you come up with anything, I would be extremely
interested. The goal of course is to beat both NTL and MAGMA :-)
David
> Let me know if you come up with anything, I would be extremely
> interested. The goal of course is to beat both NTL and MAGMA :-)
Naturally! We aren't playing little league. :-)
Actually, I have just had a thorough look at NTL, and I see numerous
possible problems.
What Victor calls HomMul is definitely not a good algorithm. In fact,
I believe that is quite an old algorithm, and not considered to be that
good these days.
It also relies on computing modular inverses and computing with
polynomials over ZZ_p (presumably that's Z/pZ). I certainly know
Victor's modular inverse code is very slow (I already improved it
drastically when I wrote my replacement for ZZ). I haven't looked at
his ZZ_pX library, but that would have to be done extremely carefully
to be optimal, and given the constraints he works with in relation to
supporting two different integer formats, I sincerely doubt this can
have been done optimally.
Thus, what I believe is happening is that SSMul is OK, but for a
completely different regime (though I believe his FFT code can be
improved - at the cost of quite a few lines of C code). But because the
HomMul algorithm is completely out of date, and reliant on a number of
other inefficient pieces of the library, SSMul actually ends up being
the faster algorithm.
The Karatsuba algorithm also has a major flaw that I can see, but I
haven't figured out an elegant solution yet, so I'll keep my mouth shut
until I do.
At any rate, I think I have some ideas about what may work. I'll let
you know if my hunches turn out to be correct. I am feeling more
confident about beating NTL now anyway. I could already substantially
improve what's there, so I'm quite sure that when I implement the
correct algorithms and base them on the more efficient ZZ module I
wrote, I will be ahead.
My first aim is to drastically improve polynomial multiplications in
the regime where you have noticed NTL is particularly bad. Improvements
in other regimes may require quite a few more lines of code than I am
prepared to write at present, and may have to wait for later on. But
I'll get to that eventually.
Bill.