thoughts on SparseUnivariatePolynomial multiplication


Qian Yun

Oct 2, 2022, 1:18:38 AM
to fricas-devel
I'm talking about multiplication in SparseUnivariatePolynomial.
It is implemented in the domain PolynomialRing; the relevant part starts
at line 431 of poly.spad (the 'addm!' local function).

First, this function has O(n^3) complexity for "truly sparse" input.

By "truly sparse" I mean that multiplying polynomials with M and N
terms results in (close to) MxN terms, instead of (M+N) terms.

n := 400 ; f := reduce(+,[monomial(1,x)$SUP INT for x in 1..n]) ;
g := reduce(+,[monomial(1,x*n)$SUP INT for x in 1..n]) ; f*g;

You can check with the above test that n := 800 takes 8 times longer
than n := 400.

The reason is that for "truly sparse" cases, we iterate through the
"res" list N times, and each time it grows by N terms, giving a total
number of list iteration steps of N+2*N+3*N+...+N*N ~= 1/2*N^3.

Now, compare with "dense" cases:

n := 5000 ; f := reduce(+,[monomial(1,x)$SUP INT for x in 1..n]) ;
g := reduce(+,[monomial(1,x)$SUP INT for x in 1..n]) ; f*g;

The time complexity is O(n^2), because "res" grows by only one term
each time in this case.
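To make the cost model above concrete, here is a minimal Python sketch (not the actual SPAD code; names like `mul_merge` are made up) of this style of multiplication, assuming polynomials are lists of (exponent, coefficient) pairs sorted by decreasing exponent. Each partial product p * (term of q) is merged into the growing "res" list, so "res" is re-walked once per term of q:

```python
def merge_terms(a, b):
    """Merge two term lists sorted by decreasing exponent,
    combining terms with equal exponents and dropping zeros."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] > b[j][0]:
            out.append(a[i]); i += 1
        elif a[i][0] < b[j][0]:
            out.append(b[j]); j += 1
        else:
            c = a[i][1] + b[j][1]
            if c != 0:
                out.append((a[i][0], c))
            i += 1; j += 1
    return out + a[i:] + b[j:]

def mul_merge(p, q):
    """Schoolbook multiplication in the accumulate-and-merge style:
    each partial product p * (term of q) is merged into the growing
    result, so the result list is traversed once per term of q."""
    res = []
    for (eq, cq) in q:
        res = merge_terms(res, [(ep + eq, cp * cq) for (ep, cp) in p])
    return res
```

In the sparse case nothing combines, so the k-th merge walks a result of length about k*N, which is where the 1/2*N^3 total comes from; in the dense case the result grows by one term per merge, giving O(n^2).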

====

Now, what about realistic scenarios: are the inputs dense or sparse?

I modified the definition of "*" to print the number of terms in p, q
and (p*q), and ran the tests in src/input/integ.input as the workload.

The result is 2234170 multiplication calls in total.

The biggest one multiplies polynomials with 67 and 68 terms, giving a
134-term result.

So, at least for integration, there are no big SUP multiplications, and
the polynomials are dense in this case.

In the following table, the second column is the number of terms in the
multiplication result and the first column is its frequency:

37361 10
35673 9
41543 8
40835 7
56356 6
61486 5
164553 4
1072030 3
183491 2
206388 1

So out of the 2 million multiplications, over half of the results have
3 or fewer terms, and 85% of the results have fewer than 10 terms.

====

So the first conclusion is to optimize for small inputs. There's not
much room for that, I think.

For bigger inputs, I think the current implementation is bad in both
cases:

a) For sparse cases, simply chaining the MxN terms together, then
sorting and combining duplicates, is O(N^2*log(N)), better than the
current O(N^3).

b) For dense cases, using an actual dense representation like an array
should be a great improvement over the current implementation:
we only need an array of length (slightly larger than) (M+N) to store
the temporary result.
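Approach (a) can be sketched in a few lines of Python (a hedged illustration under the same assumed list-of-(exponent, coefficient) representation; `mul_sort` is a made-up name, not FriCAS API):

```python
def mul_sort(p, q):
    """Approach (a): form all M*N term products, sort them once,
    then combine equal exponents in a single linear pass.
    Total cost is O(MN log(MN)) dominated by the sort."""
    prods = [(ep + eq, cp * cq) for (ep, cp) in p for (eq, cq) in q]
    prods.sort(key=lambda t: t[0], reverse=True)
    res = []
    for (e, c) in prods:
        if res and res[-1][0] == e:
            res[-1] = (e, res[-1][1] + c)   # combine duplicate exponent
        else:
            res.append((e, c))
    return [(e, c) for (e, c) in res if c != 0]
```

Since log(MN) = log M + log N, for M ~ N this is the O(N^2*log(N)) bound claimed above.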

====

What's your comment? If there's no disagreement with my analysis,
I can start writing an actual patch to address the issues mentioned
above, and we can discuss that patch and performance numbers next time.

- Qian

Qian Yun

Oct 2, 2022, 6:37:36 AM
to fricas-devel


On 10/2/22 13:18, Qian Yun wrote:
> So first conclusion is to optimize for small inputs.  There's not much
> room for it, I think.
>
> For bigger inputs, I think current implementation is bad both ways:
>
> a) For sparse cases, simply chain the MxN terms together, sort and
> dedupe is O(N^2*log(N)) better than current O(N^3).
>
> b) For dense cases, using actual dense representation like array
> should be a great improvement over current implementation:
> we only need a (slightly larger than) (M+N) length array to store
> the temp result.

No need to create an array to store the temporary result; we only need
to create arrays for the inputs (p, q), and the coefficients of (p * q)
can then be computed directly, for example:

https://github.com/sympy/sympy/blob/sympy-1.11.1/sympy/polys/densearith.py#L735

I think this will be a valid optimization for small (but dense)
inputs as well. We will need data to determine when to use this
method -- what qualifies a polynomial as "dense"?
For example, max degree - min degree < 2 * number of monomials?
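The suggested test is trivial to state in code; a hedged sketch (the function name and the constant 2 are just the proposal above, to be validated with data):

```python
def looks_dense(max_deg, min_deg, n_monomials):
    """Candidate density test from the proposal above: a polynomial
    counts as 'dense' when the span of occupied exponents is small
    relative to the number of monomials (factor 2 is a guess)."""
    return (max_deg - min_deg) < 2 * n_monomials
```

For example, the dense test polynomial f = x + ... + x^5000 passes (span 4999 vs 10000), while the sparse g = x^400 + x^800 + ... + x^160000 fails (span 159600 vs 800).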

- Qian

Kurt Pagani

Oct 2, 2022, 9:09:16 AM
to fricas...@googlegroups.com
A good question. I've never seen a satisfactory definition. While some
define it this way (https://math.berkeley.edu/~bernd/cbms.pdf):

"""
Here and throughout this book, a polynomial is called sparse if we know a priori
which monomials appear with non-zero coefficients in that polynomial.
"""

it may well make sense to distinguish sparse/dense:

What Can (and Can’t) we Do with Sparse Polynomials?
https://arxiv.org/pdf/1807.08289.pdf
https://www.usna.edu/Users/cs/roche/research/phd/


Qian Yun

Oct 3, 2022, 7:05:33 AM
to fricas-devel
Now some comparison with Maxima and Reduce:

IIUC, the Maxima implementation is "ptimes" at
https://github.com/calyau/maxima/blob/master/src/rat3a.lisp#L850
and the Reduce implementation is "poly!-multf" at
https://github.com/reduce-algebra/reduce-algebra/blob/master/packages/poly/polrep.red#L199

The Maxima algorithm looks similar to ours, although it uses goto
instead of a loop. But it turns out to be O(n^3) even
for dense cases.

The Reduce algorithm is recursive, basically p*q = p*car(q)+p*rest(q).
So its behavior is close to ours, but a bit less efficient. Also,
it can overflow the stack because it is recursive. Tests show it is
(a little slower than) O(n^2) when the polys are dense and O(n^3)
when they are sparse.

- Qian

Test files are:

(Set "sparse" to 1 for dense cases and to "n" for sparse cases.)

=== Maxima ===
n : 400;
sparse : n;
p : 0;
q : 0;
for i : 1 thru n do p : p + x^i;
for i : 1 thru n do q : q + x^(i*sparse);
display2d : false;
ttyoff : true;
showtime : all;
expand(p * q);
====


=== Reduce ===
n:=400;
sparse := n;
p := 0;
q := 0;
for i := 1 step 1 until n do p := p + x^i;
for i := 1 step 1 until n do q := q + x^(i*sparse);
p*q;
showtime;
====

Qian Yun

Oct 3, 2022, 11:20:10 PM
to fricas-devel


On 10/2/22 13:18, Qian Yun wrote:
>
> For bigger inputs, I think current implementation is bad both ways:
>
> a) For sparse cases, simply chain the MxN terms together, sort and
> dedupe is O(N^2*log(N)) better than current O(N^3).

This way the time complexity is O(N^2*log(N^2)) = O(2*N^2*log(N)).
This is just the problem of merging N sorted lists (each of length N);
with a "divide and conquer" (destructive, in-place) merge sort (which
is convenient for lists), the time complexity is O(N^2*log(N)).

We can implement it this way:

Provide (and export) these 2 functions:

add! : (%, %) -> %
mul! : (%, %) -> %

which do addition and multiplication but may change the structure
of their arguments.

mul!(p, q) can be implemented as
add!(mul!(p, first_half(q)), mul!(p, second_half(q)))
which has the same "divide and conquer" spirit.
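The mul!/add! recursion can be sketched in Python (a hedged, non-destructive illustration of the "divide and conquer" shape; the real mul!/add! would reuse cells in place, and the names here are made up):

```python
def add_terms(a, b):
    """Merge two term lists sorted by decreasing exponent,
    combining equal exponents and dropping zero coefficients."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] > b[j][0]:
            out.append(a[i]); i += 1
        elif a[i][0] < b[j][0]:
            out.append(b[j]); j += 1
        else:
            c = a[i][1] + b[j][1]
            if c != 0:
                out.append((a[i][0], c))
            i += 1; j += 1
    return out + a[i:] + b[j:]

def mul_dc(p, q):
    """mul!(p, q) as add!(mul!(p, first half), mul!(p, second half)):
    the two recursive results have comparable lengths, so the merges
    form a balanced tree and total traversal is O(N^2 log N)."""
    if len(q) == 1:
        eq, cq = q[0]
        return [(ep + eq, cp * cq) for (ep, cp) in p]
    mid = len(q) // 2
    return add_terms(mul_dc(p, q[:mid]), mul_dc(p, q[mid:]))
```

The point of splitting q in half, rather than peeling one term at a time, is exactly that merges happen between lists of similar size, which is what replaces the N+2N+...+N*N traversal of the current scheme with a log-depth merge tree.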

- Qian

Qian Yun

Oct 4, 2022, 7:23:24 AM
to fricas-devel
Correction for the Maxima case:

I used expressions instead of polynomials in my previous version.
The following version is corrected. (Using "rat" is the key.)

The conclusion also changes: the Maxima implementation is O(n^2)
when polys are dense and O(n^3) when polys are sparse.
Similar to current FriCAS.

- Qian

====
n : 400;
sparse : n;
p : 0;
q : 0;
for i : 1 thru n do p : p + rat(x^i);
for i : 1 thru n do q : q + rat(x^(i*sparse));
display2d : false;
ttyoff : true;
showtime : all;
p * q;

Waldek Hebisch

Oct 4, 2022, 10:41:37 AM
to fricas...@googlegroups.com
On Sun, Oct 02, 2022 at 01:18:32PM +0800, Qian Yun wrote:
> I'm talking about multiplication in SparseUnivariatePolynomial,
> it is implemented in Domain PolynomialRing, the relevant part starts
> at line 431 of poly.spad. (aka the 'addm!' local function.)
>
> First, this function has complexity of O(n^3) for "truly sparse" input:
>
> By "truly sparse" I mean multiplication of polynomials with M and N
> terms results in (close to) MxN terms, instead of (M+N) terms.
>
> n := 400 ; f := reduce(+,[monomial(1,x)$SUP INT for x in 1..n]) ; g :=
> reduce(+,[monomial(1,x*n)$SUP INT for x in 1..n]) ;f*g;
>
> You can check in the above test that, n:=800 takes 8 times longer than
> n:=400.
>
> The reason is that for "truly sparse" cases, we iterate through the
> "res" list N times, each time it grows by N terms, resulting in total
> number of list iteration of N+2*N+3*N+...+N*N ~= 1/2*N^3.
>
> Now, compare with "dense" cases:
>
> n := 5000 ; f := reduce(+,[monomial(1,x)$SUP INT for x in 1..n]) ; g :=
> reduce(+,[monomial(1,x)$SUP INT for x in 1..n]) ;f*g;
>
> The time complexity is O(n^2). (Because "res" is growing by 1 each time
> in this case.)
>
> ====
>
> Now, what about realistic scenarios, is it dense or sparse?

Multivariate. In the case of 'integ.input' there is a good chance
that most calls to SUP multiplication come from multivariate
multiplications.
There are some extra aspects:
- memory allocation is much more expensive than coefficient operations
  on small integers or than list traversal, so we prefer "in place"
  versions for moderate lengths;
- when coefficient operations are expensive (top level of multivariate
  multiplication), even a big N^3 number of list operations is not that
  bad, so the main case to optimize is when coefficients are simple and
  small;
- some sparse cases are "structural", say an arithmetic progression of
  exponents. This is more frequent in the multivariate case but
  sometimes happens in the univariate case too. With respect to the
  number of terms such a multiplication is more similar to a dense one;
- in really large dense cases we may be more limited by memory
  than by runtime, so we want to de-dupe relatively early;
- for the sparse case it probably makes sense from the start to
  optimize the multivariate case. There are several approaches in
  this direction: Giac uses a hash table to collect similar terms.
  Others use various heaps and buckets that simultaneously
  de-dupe and sort.
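The hash-table idea mentioned in the last point can be sketched in a few lines of Python (a hedged illustration of the general technique, not Giac's actual code; representation and names are assumptions):

```python
def mul_hash(p, q):
    """Collect like terms in a dictionary keyed by exponent while
    forming the M*N products, then sort the surviving terms once.
    De-duplication is O(1) amortized per product term."""
    acc = {}
    for (ep, cp) in p:
        for (eq, cq) in q:
            acc[ep + eq] = acc.get(ep + eq, 0) + cp * cq
    return sorted(((e, c) for e, c in acc.items() if c != 0),
                  key=lambda t: t[0], reverse=True)
```

Compared with sort-then-dedupe, the hash table pays the log factor only on the (possibly much smaller) set of distinct exponents, at the cost of hashing overhead and less cache-friendly access.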

There are publications giving some details of fast multiplication
(and also division and gcd) for large sparse case by Monagan and Pearce
(their code is included in Maple) and Gastineau and Laskar (they
develop Trip system).


> b) For dense cases, using actual dense representation like array
> should be a great improvement over current implementation:
> we only need a (slightly larger than) (M+N) length array to store
> the temp result.

The main possible gain here is in the modular case, for which we have
POLYVEC (but, as I wrote, more like this would be useful).

> ====
>
> What's your comment? If there's no disagreement for my analysis,
> I can start writing actual patch to address the issues mentioned above.
> And we can discuss that patch and performance numbers next time.

--
Waldek Hebisch

Qian Yun

Oct 4, 2022, 10:31:47 PM
to fricas...@googlegroups.com
I have an idea for distinguishing sparse from dense in the context of
multiplication:

For polynomial p and q, define:
sp := numberOfMonomials p -- number of cells of sparse rep
sq := numberOfMonomials q
dp := degree p - minimumDegree p + 1 -- length of dense rep
dq := degree q - minimumDegree q + 1

When doing classical multiplication (not considering optimizations
like Karatsuba), there is a fixed number of coefficient
multiplications: sp*sq.

The problem we discussed in the previous threads is about the data
structure -- the storage consumed by those potentially temporary sp*sq
monomials, and the time spent traversing that storage.

Our current algorithm: in the best case, when sp = dp and sq = dq,
space complexity is sp+sq and time complexity is sq*(sp+sq).
In the worst case, dp >> sp and dq >> sq, space complexity is sp*sq
and time complexity is sq*sp*sq.

My proposed sparse algorithm, best/worst/average space complexity is
sp*sq, time complexity is sp*sq*log(sq).

My proposed dense algorithm, best/worst/average space complexity is
dp+dq, time complexity (spent on data structure) is sp*sq.

So the criterion could be: when (dp+dq) < sp*sq*Const, use the
dense representation, otherwise use the sparse representation.

The Const can be determined by benchmarks. But IIUC, it should be
at least 4 -- the sparse representation needs to store the coefficient,
the exponent, a cell to cons them together, and another cell for the
pointer to the next monomial, while the dense representation only
needs to store the coefficient.
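As a sketch, the proposed dispatch is a one-line predicate (Python for illustration; the name and the default Const=4 are the assumptions stated above, pending benchmarks):

```python
def use_dense(sp, sq, dp, dq, const=4):
    """Proposed dispatch: prefer the dense (array) route when its
    dp+dq storage is cheap relative to the up-to sp*sq temporary
    monomials of the sparse route. const ~ 4 reflects the roughly
    four cells per monomial in the list representation versus one
    array slot (an estimate, to be tuned by benchmarks)."""
    return dp + dq < sp * sq * const
```

For example, two fully dense 5000-term inputs choose the dense route, while two 2-term inputs whose exponents span hundreds of degrees do not.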

- Qian

Qian Yun

Oct 20, 2022, 7:46:22 PM
to fricas-devel

> For bigger inputs, I think current implementation is bad both ways:
>
> a) For sparse cases, simply chain the MxN terms together, sort and
> dedupe is O(N^2*log(N)) better than current O(N^3).

For sparse cases, another important factor is cell allocation:
M terms times N terms can result in anywhere from (M+N-1) to MxN terms.
The current implementation is efficient in allocation, but at the cost
of (worst case) O(n^3) traversal.

The above method needs O(n^2) allocation and O(n^2*log(n)) traversal,
but I think the following method can achieve minimal allocation (same
as the current one) and still O(n^2*log(n)) traversal:

Simply put, instead of doing an N-way merge of M-length lists
(assume N < M), we do an N-way merge of streams -- we use a heap
to do the comparisons, and the term multiplications happen "lazily",
to reduce unnecessary memory allocation.
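The lazy N-way merge can be sketched with a binary heap (a hedged Python illustration of the idea, not the planned SPAD code; `mul_heap` and the tuple layout are made up, and in the real version the popped cells would be reused to keep allocation minimal):

```python
import heapq

def mul_heap(p, q):
    """N-way merge of lazy streams: one heap entry per term of q, each
    walking p in decreasing-exponent order. A product term is only
    materialized when popped, so besides the result at most len(q)
    candidate terms are live at any time."""
    # p, q: lists of (exponent, coefficient), decreasing exponent.
    # Heap entries: (-exponent, index into q, index into p);
    # exponents are negated because heapq is a min-heap.
    heap = [(-(p[0][0] + eq), j, 0) for j, (eq, _) in enumerate(q)]
    heapq.heapify(heap)
    res = []
    while heap:
        negexp, j, i = heapq.heappop(heap)
        e = -negexp
        c = p[i][1] * q[j][1]          # coefficient product, done lazily
        if res and res[-1][0] == e:
            res[-1] = (e, res[-1][1] + c)
        else:
            res.append((e, c))
        if i + 1 < len(p):             # advance this stream within p
            heapq.heappush(heap, (-(p[i + 1][0] + q[j][0]), j, i + 1))
    return [(e, c) for (e, c) in res if c != 0]
```

Each of the M*N product terms passes through a heap of size at most N, giving the O(n^2*log(n)) traversal bound with only the result cells allocated.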

> b) For dense cases, using actual dense representation like array
> should be a great improvement over current implementation:
> we only need a (slightly larger than) (M+N) length array to store
> the temp result.

For dense cases, the implementation will look like this.
(Note that this is only valid when "E is NNI", i.e. it is valid
for SUP but not for the more general PR.)

- Qian

vec_to_poly(minDegree : E, vec : PrimitiveArray R) : % ==
    res : Rep := []
    for i in 0..(# vec - 1) repeat
        if not zero? vec.i then res := cons([i + minDegree, vec.i], res)
    res

NNI ==> NonNegativeInteger
mult_dense(p1 : %, p2 : %) : % ==
    max_deg1 : NNI := degree p1; min_deg1 : NNI := minimumDegree p1
    max_deg2 : NNI := degree p2; min_deg2 : NNI := minimumDegree p2
    new_len := (max_deg1 + max_deg2 - min_deg1 - min_deg2 + 1)::NNI
    new_vec : PrimitiveArray R := new(new_len, 0$R)
    new_minDegree := min_deg1 + min_deg2
    -- for i in 0..(# new_vec - 1) repeat
    --     for j in max(0, i - maxIndex a2)..min(i, maxIndex a1) repeat
    --         new_vec.i := new_vec.i + a1.j * a2.(i-j)
    for t1 in listOfTerms p1 repeat
        t1k := t1.k
        t1c := t1.c
        for t2 in listOfTerms p2 repeat
            i := t1k + t2.k - new_minDegree
            new_vec.i := new_vec.i + t1c * t2.c
    vec_to_poly(new_minDegree, new_vec)