
n-ary roots from complex numbers...


Chris M. Thomasson

unread,
Apr 10, 2017, 7:10:07 PM4/10/17
to
The goal of this crude sample code is to take the n-ary roots of a
complex number: iterating a complex number z to all of its z^(1/n)
roots. The code derives the polar form from z, and iterates all of the
roots. This is an essential part of providing the ability to store
n-ary tokens in a set of iterated roots. We can map the n roots to a
token table, and store data. I will explain more later; for now, please
take a look at the code, run it, tear it apart, and give me pointers on
how to make it better. Thank you all for your time. At the heart of this
is the function ct_roots, which breaks a complex number apart into its
n-ary roots. Here is the code:

______________________________
#include <complex>
#include <iostream>
#include <vector>
#include <limits>
#include <cstdint> // want to work with 64-bit numbers
#include <cassert> // want to sanity check run time...


// Well, I need some consistent typenames... ;^)
typedef std::int64_t ct_int;
typedef std::uint64_t ct_uint;
typedef double ct_float;
typedef std::numeric_limits<ct_float> ct_float_nlim;
typedef std::complex<ct_float> ct_complex;
typedef std::complex<ct_uint> ct_complex_uint;
typedef std::vector<ct_complex> ct_complex_vec;

#define CT_PI 3.14159265358979323846


// Round the absolute values of the real and imaginary
// parts of z to the nearest unsigned integers of type ct_uint;
// return a complex number with unsigned integer parts
ct_complex_uint
ct_round_uint(
ct_complex const& z
) {
ct_uint re = (ct_uint)std::floor(std::abs(z.real()) + .5);
ct_uint im = (ct_uint)std::floor(std::abs(z.imag()) + .5);
return ct_complex_uint(re, im);
}


// the integer p shall not be zero
// create abs(p) roots of z wrt z^(1/p);
// store them in out, and return the average error.
ct_float
ct_roots(
ct_complex const& z,
ct_int p,
ct_complex_vec& out
) {
assert(p != 0);

// Gain the basics
ct_float radius = std::pow(std::abs(z), 1.0 / p);
ct_float angle_base = std::arg(z) / p;
ct_float angle_step = (CT_PI * 2.0) / p;

// Setup the iteration
ct_uint n = std::abs(p);
ct_float avg_err = 0.0;

// Calculate the n roots...
for (ct_uint i = 0; i < n; ++i)
{
// our angle
ct_float angle = angle_step * i;

// our point
ct_complex c = {
std::cos(angle_base + angle) * radius,
std::sin(angle_base + angle) * radius
};

// output data
out.push_back(c);

// Raise our root to the power p...
ct_complex raised = std::pow(c, p);
ct_complex_uint raised_rounded = ct_round_uint(raised);

// Output...
std::cout << "ct_roots::raised[" << i << "]:" <<
raised << "\n";
std::cout << "ct_roots::raised_rounded[" << i << "]:" <<
raised_rounded << "\n";
std::cout << "___________________________________\n";


// Sum up the Go% damn floating point errors!
avg_err = avg_err + std::abs(raised - z);

}

// gain the average error sum... ;^o
return avg_err / n;
}




int main()
{
std::cout.precision(ct_float_nlim::max_digits10);

{
// Our origin point z
ct_complex z = { 94143178827, 0 };

std::cout << "z:" << z << " - " << std::abs(z) << "\n";
std::cout << "**********************************************\n\n";

// The call to ct_roots...
ct_complex_vec roots;
ct_float avg_err = ct_roots(z, 23, roots);
std::cout << "\n\navg_err:" << avg_err << "\n";

// Output
std::cout << "___________________________________\n";
for (unsigned int i = 0; i < roots.size(); ++i)
{
ct_complex& root = roots[i];

std::cout << "root[" << i << "]:" << root << " - " <<
std::abs(root) << "\n";
}
std::cout << "___________________________________\n";
}

// Fin...
std::cout << "\n\nFin, hit <ENTER> to exit...\n";
std::fflush(stdout);
std::cin.get();

return 0;
}
______________________________


Can you run this? BTW, what about possibly reducing the average error
ratio? Any ideas?

Will post more context later tonight or tomorrow.

:^)

hsm...@gmail.com

unread,
Apr 10, 2017, 9:41:38 PM4/10/17
to
I'll give it a go tomorrow! Right now I'm working on a project to check out nearly a thousand proposed fractal formulas *grin*. I'm writing a parser to rewrite them as Ultra Fractal formulas, which allows a quick worthiness check…

Chris M. Thomasson

unread,
Apr 10, 2017, 10:35:21 PM4/10/17
to
Nice! I need a good parser for fractal formulas. :^)

Christian Gollwitzer

unread,
Apr 11, 2017, 1:41:59 AM4/11/17
to
On 11.04.17 at 01:09, Chris M. Thomasson wrote:
> The goal of this crude sample code is to take the n-ary roots of a
> complex number. Iterating a complex number z to all of its z^(1/n)
> roots. This code derives the polar form from z, and iterates all of the

> // Gain the basics
> ct_float radius = std::pow(std::abs(z), 1.0 / p);
> ct_float angle_base = std::arg(z) / p;
> ct_float angle_step = (CT_PI * 2.0) / p;
>
> // Setup the iteration
> ct_uint n = std::abs(p);
> ct_float avg_err = 0.0;
>
> // Calculate the n roots...
> for (ct_uint i = 0; i < n; ++i)
> {
> // our angle
> ct_float angle = angle_step * i;
>
> // our point
> ct_complex c = {
> std::cos(angle_base + angle) * radius,
> std::sin(angle_base + angle) * radius
> };


While this is correct, I think that the following is easier and should
run faster:

std::vector<ct_complex> myroots;
ct_complex baseroot = std::pow(z, 1.0/p);
myroots.push_back(baseroot);

ct_complex root = baseroot;
ct_complex phasor = baseroot / std::abs(baseroot);
for (int i=1; i<n; i++) {
root *= phasor;
myroots.push_back(root);
}

(untested)

This algorithm saves the computation of 2*(n-1) transcendental
functions. Concerning accuracy, your algorithm is more accurate if you
compute roots of very high order (say 1e10) - probably not for useful
cases.

Christian

Alf P. Steinbach

unread,
Apr 11, 2017, 2:23:44 AM4/11/17
to
On 11-Apr-17 7:41 AM, Christian Gollwitzer wrote:
> On 11.04.17 at 01:09, Chris M. Thomasson wrote:
>> The goal of this crude sample code is to take the n-ary roots of a
> > [snip]
>
> While this is correct, I think that the following is easier and should
> run faster:
>
> std::vector<ct_complex> myroots;
> ct_complex baseroot = std::pow(z, 1.0/p);
> myroots.push_back(baseroot);
>
> ct_complex root = baseroot;
> ct_complex phasor = baseroot / std::abs(baseroot);
> for (int i=1; i<n; i++) {
> root *= phasor;
> myroots.push_back(root);
> }
>
> (untested)
>
> This algorithm saves the computation of 2*(n-1) transcendental
> functions. Concerning accuracy, your algorithm is more accurate if you
> compute roots of very high order (say 1e10) - probably not for useful
> cases.

I still don't understand how rotation, and thereby the structure
(metric) of our physical environment, emerges in complex arithmetic.

It's amazing.

It's simple to show that `i` multiplied by itself goes round through
four positions (i, -1, -i, 1, ...), and then that with e.g. the point 45
degrees up right, the powers give a rotation through eight positions,
including `i` so that it can be identified with the square root of `i`,
and so on. But what /causes/ this behavior eludes me. And even though
I'm old and pretty stupid now, I have been good at math, and I believe
I've been thinking about this also in my younger days.


Cheers!,

- Alf (baffled, again :) )
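
One way to see where the rotation comes from, using nothing beyond
i^2 = -1: multiplying an arbitrary point x + iy by a unit complex number
cos t + i sin t gives

$$(\cos t + i\sin t)(x + iy) = (x\cos t - y\sin t) + i\,(x\sin t + y\cos t),$$

which is exactly the 2x2 rotation matrix

$$\begin{pmatrix}\cos t & -\sin t\\ \sin t & \cos t\end{pmatrix}$$

applied to the vector (x, y). Multiplication by i is the special case
t = 90 degrees, which is why repeated multiplication by i cycles through
the four positions i, -1, -i, 1.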

Chris M. Thomasson

unread,
Apr 11, 2017, 3:13:47 PM4/11/17
to
Here is a function I wrote that uses your algorithm:
_________________________________
ct_float
cg_roots(
ct_complex const& z,
ct_int p,
ct_complex_vec& out
) {
assert(p != 0);

/*
Per: Christian Gollwitzer's untested code at:
https://groups.google.com/d/msg/comp.lang.c++/05XwgswUnDg/gm_uqbcwAgAJ

std::vector<ct_complex> myroots;
ct_complex baseroot = std::pow(z, 1.0/p);
myroots.push_back(baseroot);

ct_complex root = baseroot;
ct_complex phasor = baseroot / std::abs(baseroot);
for (int i=1; i<n; i++) {
root *= phasor;
myroots.push_back(root);
}
*/

// Did I fuc% up his code! Read here:

ct_complex baseroot = std::pow(z, 1.0 / p);
out.push_back(baseroot);

ct_complex root = baseroot;
ct_complex phasor = baseroot / std::abs(baseroot);

ct_uint n = std::abs(p);
ct_float avg_err = 0.0;

for (ct_uint i = 1; i < n; ++i)
{
root *= phasor;
out.push_back(root);

// Raise our root to the power p...
ct_complex raised = std::pow(root, p);
ct_complex_uint raised_rounded = ct_round_uint(raised);

// Output...
std::cout << "ct_roots_fast::raised[" << i << "]:" <<
raised << "\n";
std::cout << "ct_roots_fast::raised_rounded[" << i << "]:" <<
raised_rounded << "\n";
std::cout << "___________________________________\n";

// Sum up the Go% damn floating point errors!
avg_err = avg_err + std::abs(raised - z);
}

// gain the average error sum... ;^o
return avg_err / n;
}
_________________________________

When I change the call in main from ct_roots to cg_roots, wrt the
function in:

_________________________________
int main()
{
std::cout.precision(ct_float_nlim::max_digits10);

{
// Our origin point z
ct_complex z = { 94143178827, 0 };

std::cout << "z:" << z << " - " << std::abs(z) << "\n";
std::cout << "**********************************************\n\n";

// The call to ct_roots...
ct_complex_vec roots;
//ct_float avg_err = ct_roots(z, 23, roots);
ct_float avg_err = cg_roots(z, 23, roots);
std::cout << "\n\navg_err:" << avg_err << "\n";

// Output
std::cout << "___________________________________\n";
for (unsigned int i = 0; i < roots.size(); ++i)
{
ct_complex& root = roots[i];

std::cout << "root[" << i << "]:" << root << " - " <<
std::abs(root) << "\n";
}
std::cout << "___________________________________\n";
}

// Fin...
std::cout << "\n\nFin, hit <ENTER> to exit...\n";
std::fflush(stdout);
std::cin.get();

return 0;
}
_________________________________


I get the following incorrect output, in the sense that it does not show
all of the roots:


z:(94143178827,0) - 94143178827
**********************************************

ct_roots_fast::raised[1]:(94143178827,0)
ct_roots_fast::raised_rounded[1]:(94143178827,0)
___________________________________
ct_roots_fast::raised[2]:(94143178827,0)
ct_roots_fast::raised_rounded[2]:(94143178827,0)
___________________________________
ct_roots_fast::raised[3]:(94143178827,0)
ct_roots_fast::raised_rounded[3]:(94143178827,0)
___________________________________
ct_roots_fast::raised[4]:(94143178827,0)
ct_roots_fast::raised_rounded[4]:(94143178827,0)
___________________________________
ct_roots_fast::raised[5]:(94143178827,0)
ct_roots_fast::raised_rounded[5]:(94143178827,0)
___________________________________
ct_roots_fast::raised[6]:(94143178827,0)
ct_roots_fast::raised_rounded[6]:(94143178827,0)
___________________________________
ct_roots_fast::raised[7]:(94143178827,0)
ct_roots_fast::raised_rounded[7]:(94143178827,0)
___________________________________
ct_roots_fast::raised[8]:(94143178827,0)
ct_roots_fast::raised_rounded[8]:(94143178827,0)
___________________________________
ct_roots_fast::raised[9]:(94143178827,0)
ct_roots_fast::raised_rounded[9]:(94143178827,0)
___________________________________
ct_roots_fast::raised[10]:(94143178827,0)
ct_roots_fast::raised_rounded[10]:(94143178827,0)
___________________________________
ct_roots_fast::raised[11]:(94143178827,0)
ct_roots_fast::raised_rounded[11]:(94143178827,0)
___________________________________
ct_roots_fast::raised[12]:(94143178827,0)
ct_roots_fast::raised_rounded[12]:(94143178827,0)
___________________________________
ct_roots_fast::raised[13]:(94143178827,0)
ct_roots_fast::raised_rounded[13]:(94143178827,0)
___________________________________
ct_roots_fast::raised[14]:(94143178827,0)
ct_roots_fast::raised_rounded[14]:(94143178827,0)
___________________________________
ct_roots_fast::raised[15]:(94143178827,0)
ct_roots_fast::raised_rounded[15]:(94143178827,0)
___________________________________
ct_roots_fast::raised[16]:(94143178827,0)
ct_roots_fast::raised_rounded[16]:(94143178827,0)
___________________________________
ct_roots_fast::raised[17]:(94143178827,0)
ct_roots_fast::raised_rounded[17]:(94143178827,0)
___________________________________
ct_roots_fast::raised[18]:(94143178827,0)
ct_roots_fast::raised_rounded[18]:(94143178827,0)
___________________________________
ct_roots_fast::raised[19]:(94143178827,0)
ct_roots_fast::raised_rounded[19]:(94143178827,0)
___________________________________
ct_roots_fast::raised[20]:(94143178827,0)
ct_roots_fast::raised_rounded[20]:(94143178827,0)
___________________________________
ct_roots_fast::raised[21]:(94143178827,0)
ct_roots_fast::raised_rounded[21]:(94143178827,0)
___________________________________
ct_roots_fast::raised[22]:(94143178827,0)
ct_roots_fast::raised_rounded[22]:(94143178827,0)
___________________________________


avg_err:0
___________________________________
root[0]:(3,0) - 3
root[1]:(3,0) - 3
root[2]:(3,0) - 3
root[3]:(3,0) - 3
root[4]:(3,0) - 3
root[5]:(3,0) - 3
root[6]:(3,0) - 3
root[7]:(3,0) - 3
root[8]:(3,0) - 3
root[9]:(3,0) - 3
root[10]:(3,0) - 3
root[11]:(3,0) - 3
root[12]:(3,0) - 3
root[13]:(3,0) - 3
root[14]:(3,0) - 3
root[15]:(3,0) - 3
root[16]:(3,0) - 3
root[17]:(3,0) - 3
root[18]:(3,0) - 3
root[19]:(3,0) - 3
root[20]:(3,0) - 3
root[21]:(3,0) - 3
root[22]:(3,0) - 3
___________________________________


Fin, hit <ENTER> to exit...




Notice the roots are all (3, 0)? This is not correct. Please try to find
any error I may have created in my implementation of your code, Christian
Gollwitzer, wrt the cg_roots function. Sorry if I fuc%ed it up man... ;^o


Fwiw, here is the output I get from using the slow, trig-infested
original of mine:

z:(94143178827,0) - 94143178827
**********************************************

ct_roots::raised[0]:(94143178827,0)
ct_roots::raised_rounded[0]:(94143178827,0)
___________________________________
ct_roots::raised[1]:(94143178826.999603,6.0557511273702414e-05)
ct_roots::raised_rounded[1]:(94143178827,0)
___________________________________
ct_roots::raised[2]:(94143178827.000259,-4.6116857044817097e-05)
ct_roots::raised_rounded[2]:(94143178827,0)
___________________________________
ct_roots::raised[3]:(94143178827.000259,-6.9175285567225643e-05)
ct_roots::raised_rounded[3]:(94143178827,0)
___________________________________
ct_roots::raised[4]:(94143178827.000259,-9.2233714089634195e-05)
ct_roots::raised_rounded[4]:(94143178827,0)
___________________________________
ct_roots::raised[5]:(94143178826.999603,-0.00044975590179648512)
ct_roots::raised_rounded[5]:(94143178827,0)
___________________________________
ct_roots::raised[6]:(94143178826.999603,-0.00013835057113445031)
ct_roots::raised_rounded[6]:(94143178827,0)
___________________________________
ct_roots::raised[7]:(94143178827.000259,-0.00016140899965685982)
ct_roots::raised_rounded[7]:(94143178827,0)
___________________________________
ct_roots::raised[8]:(94143178827.000259,-0.00018446742817926839)
ct_roots::raised_rounded[8]:(94143178827,0)
___________________________________
ct_roots::raised[9]:(94143178827.000259,-0.00087645337507056794)
ct_roots::raised_rounded[9]:(94143178827,0)
___________________________________
ct_roots::raised[10]:(94143178827.000259,-0.00089951180359297653)
ct_roots::raised_rounded[10]:(94143178827,0)
___________________________________
ct_roots::raised[11]:(94143178827.000259,0.00041528480462239701)
ct_roots::raised_rounded[11]:(94143178827,0)
___________________________________
ct_roots::raised[12]:(94143178827.000259,-0.00041528480462239701)
ct_roots::raised_rounded[12]:(94143178827,0)
___________________________________
ct_roots::raised[13]:(94143178827.000259,-0.0011072707515136966)
ct_roots::raised_rounded[13]:(94143178827,0)
___________________________________
ct_roots::raised[14]:(94143178826.999603,-0.00046140166166721089)
ct_roots::raised_rounded[14]:(94143178827,0)
___________________________________
ct_roots::raised[15]:(94143178827.000259,0.00018446742817926839)
ct_roots::raised_rounded[15]:(94143178827,0)
___________________________________
ct_roots::raised[16]:(94143178827.000259,-0.00050751851871203122)
ct_roots::raised_rounded[16]:(94143178827,0)
___________________________________
ct_roots::raised[17]:(94143178827.000259,-0.0005305769472344397)
ct_roots::raised_rounded[17]:(94143178827,0)
___________________________________
ct_roots::raised[18]:(94143178827.000259,-0.00088809913494129387)
ct_roots::raised_rounded[18]:(94143178827,0)
___________________________________
ct_roots::raised[19]:(94143178827.000259,-0.0012456213226481479)
ct_roots::raised_rounded[19]:(94143178827,0)
___________________________________
ct_roots::raised[20]:(94143178827.000259,-0.0012686797511705563)
ct_roots::raised_rounded[20]:(94143178827,0)
___________________________________
ct_roots::raised[21]:(94143178827.000259,4.6116857044817097e-05)
ct_roots::raised_rounded[21]:(94143178827,0)
___________________________________
ct_roots::raised[22]:(94143178827.000259,-0.00031140533066203698)
ct_roots::raised_rounded[22]:(94143178827,0)
___________________________________


avg_err:0.00056815732700208645
___________________________________
root[0]:(3,0) - 3
root[1]:(2.8887518620433976,0.80939031347107282) - 2.9999999999999996
root[2]:(2.5632582136394655,1.5587518501063007) - 3
root[3]:(2.0476594296559623,2.1925078928343722) - 3
root[4]:(1.3801951131934567,2.6636556552071258) - 3
root[5]:(0.61036803915790194,2.9372522630469682) - 2.9999999999999996
root[6]:(-0.20472724009401264,2.9930063075716173) - 2.9999999999999996
root[7]:(-1.0046388365129584,2.8267827663564615) - 3
root[8]:(-1.7300409663446012,2.4509096790313265) - 3
root[9]:(-2.3271338721132588,1.893263832978159) - 3
root[10]:(-2.7516339045163587,1.1952032695387254) - 3
root[11]:(-2.9720578381089924,0.40849994728873995) - 3
root[12]:(-2.9720578381089924,-0.40849994728873917) - 3
root[13]:(-2.7516339045163596,-1.1952032695387236) - 3
root[14]:(-2.3271338721132593,-1.8932638329781581) - 2.9999999999999996
root[15]:(-1.7300409663446015,-2.4509096790313261) - 3
root[16]:(-1.0046388365129593,-2.8267827663564611) - 3
root[17]:(-0.20472724009401405,-2.9930063075716173) - 3
root[18]:(0.61036803915789994,-2.9372522630469691) - 3
root[19]:(1.3801951131934547,-2.6636556552071267) - 3
root[20]:(2.0476594296559609,-2.1925078928343735) - 3
root[21]:(2.5632582136394655,-1.5587518501063007) - 3
root[22]:(2.8887518620433976,-0.80939031347107315) - 3
___________________________________


Fin, hit <ENTER> to exit...


This shows all of the roots. Yours only shows the primary root.

Please correct me if I am totally wrong.

;^o

Chris M. Thomasson

unread,
Apr 11, 2017, 4:39:56 PM4/11/17
to
On 4/10/2017 10:41 PM, Christian Gollwitzer wrote:
> On 11.04.17 at 01:09, Chris M. Thomasson wrote:
>> The goal of this crude sample code is to take the n-ary roots of a
>> complex number. Iterating a complex number z to all of its z^(1/n)
>> roots. This code derives the polar form from z, and iterates all of the
>
>> // Gain the basics
>> ct_float radius = std::pow(std::abs(z), 1.0 / p);
>> ct_float angle_base = std::arg(z) / p;
>> ct_float angle_step = (CT_PI * 2.0) / p;
[..]
> While this is correct, I think that the following is easier and should
> run faster:
>
> std::vector<ct_complex> myroots;
> ct_complex baseroot = std::pow(z, 1.0/p);
> myroots.push_back(baseroot);
>
> ct_complex root = baseroot;
> ct_complex phasor = baseroot / std::abs(baseroot);

Afaict, phasor will always be 1+0i.
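
A quick check of why that happens for this particular z (my own
arithmetic): here z = 94143178827 is a positive real, so

$$\text{baseroot} = z^{1/p} > 0 \quad\Longrightarrow\quad \frac{\text{baseroot}}{|\text{baseroot}|} = 1 + 0i,$$

and the loop keeps multiplying by 1, never leaving the principal root.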

Christian Gollwitzer

unread,
Apr 11, 2017, 4:46:04 PM4/11/17
to
On 11.04.17 at 21:13, Chris M. Thomasson wrote:
> On 4/10/2017 10:41 PM, Christian Gollwitzer wrote:
>> ct_complex phasor = baseroot / std::abs(baseroot);
> [...]
>
> The shows all of the roots. Yours only shows the primary root.
>
> Please correct me if I am totally wrong.

Phew, this endless listing of code and results makes it hard to find the
bug. The problem is that the phasor is wrong; it should not be z/|z| as I
posted (a "unit" complex pointing to z), but a unit complex with the
angle 2*pi/n.
Try:

const ct_complex i = ct_complex(0,1);
ct_complex phasor = std::exp(2 * i * M_PI / p);

(untested, again).
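
Spelled out as a complete routine, a sketch could look like the
following (cg_roots_fixed is just an illustrative name; it assumes the
ct_* typedefs and CT_PI from the original post, and uses std::polar to
avoid mixing integer and complex operands):

_________________________________
ct_float
cg_roots_fixed(
ct_complex const& z,
ct_int p,
ct_complex_vec& out
) {
    assert(p != 0);

    ct_uint n = std::abs(p);

    // principal root, and the primitive p-th root of unity e^(2*pi*i/p)
    ct_complex root = std::pow(z, 1.0 / static_cast<ct_float>(p));
    ct_complex phasor = std::polar(1.0, 2.0 * CT_PI / static_cast<ct_float>(p));

    ct_float err_sum = 0.0;

    for (ct_uint i = 0; i < n; ++i)
    {
        out.push_back(root);

        // check: raising the root back to the power p should reproduce z
        err_sum += std::abs(std::pow(root, static_cast<ct_float>(p)) - z);

        root *= phasor; // rotate to the next root
    }

    return err_sum / n; // average absolute error, as in ct_roots
}
_________________________________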

Sorry for the mistake.

Christian

Christian Gollwitzer

unread,
Apr 11, 2017, 4:58:51 PM4/11/17
to
On 11.04.17 at 22:39, Chris M. Thomasson wrote:
> On 4/10/2017 10:41 PM, Christian Gollwitzer wrote:
>> ct_complex baseroot = std::pow(z, 1.0/p);
>> ct_complex root = baseroot;
>> ct_complex phasor = baseroot / std::abs(baseroot);
>
> Afaict, phasor will always be 1+0i.

For positive real input, yes - try a complex number like 1+2i to see
what it gives. Of course, the correct solution is a different number.

Christian

Chris M. Thomasson

unread,
Apr 11, 2017, 5:00:46 PM4/11/17
to
On 4/11/2017 1:45 PM, Christian Gollwitzer wrote:
> On 11.04.17 at 21:13, Chris M. Thomasson wrote:
>> On 4/10/2017 10:41 PM, Christian Gollwitzer wrote:
>>> ct_complex phasor = baseroot / std::abs(baseroot);
>> [...]
>>
>> The shows all of the roots. Yours only shows the primary root.
>>
>> Please correct me if I am totally wrong.
>
> puh. this endless listing of code and results makes it hard to find the
> bug.

sorry about that.

> The problem is that the phasor is wrong; it is not z/|z| as I
> posted (a "unit" complex pointing to z) but a unit complex with the
> angle 2*pi/n.
> Try:
>
> const ct_complex i = ct_complex(0,1);
> ct_complex phasor = std::exp(2 * i * M_PI / p);
>
> (untested, again).

Works like a charm. Thank you Christian. :^)

As you mentioned, the average error is greater at:

cg_roots = avg_err:0.0010976249994551542

vs. the slow trig original:

ct_roots = avg_err:0.00056815732700208645

but it works fine. Thanks again.

>
> Sorry for the mistake.

No problem at all! :^)

Chris M. Thomasson

unread,
Apr 11, 2017, 5:15:29 PM4/11/17
to
Doh! Of course you are correct here. Sorry.

Chris M. Thomasson

unread,
Apr 11, 2017, 5:20:44 PM4/11/17
to
On 4/11/2017 11:54 AM, Stefan Ram wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> I still don't understand how rotation, and thereby the structure
>> (metric) of our physical environment, emerges in complex arithmetic.
>
> The numbers 1 and i also can be represented as 2×2 matrices
>
>
>     1 = | 1  0 |
>         | 0  1 |
>
> and
>
>     i = | 0 -1 |
>         | 1  0 |  .
>
>
> See Wikipedia »7.2 Matrix representation of complex
> numbers«.
>
> Now, compare this with the matrix M_z on page 2 "259" in
>
> www.astro.caltech.edu/~golwala/ph125ab/ph106ab_notes_sec5.1.pdf
>
> . We can thus see that this matrix is the infinitesimal
> generator of rotations around the z axis. (The following
> pages then explain how finite rotations can be obtained
> from such "infinitesimal rotations".)
>

Nice.

Chris M. Thomasson

unread,
Apr 11, 2017, 5:25:22 PM4/11/17
to
[...]

Once one converts a complex number z to polar form, raising to a power
becomes clear to me. Christian's use of raising e to the power of

2.0 * i * CT_PI / (ct_float)p

_________________
const ct_complex i = ct_complex(0, 1);
ct_complex phasor = std::exp(2.0 * i * CT_PI / (ct_float)p);
_________________

is right on par! ;^)

A little less accurate, but much much faster!

Marcel Mueller

unread,
Apr 12, 2017, 2:42:56 AM4/12/17
to
On 11.04.17 07.41, Christian Gollwitzer wrote:
> While this is correct, I think that the following is easier and should
> run faster:
>
> std::vector<ct_complex> myroots;
> ct_complex baseroot = std::pow(z, 1.0/p);
> myroots.push_back(baseroot);
>
> ct_complex root = baseroot;
> ct_complex phasor = baseroot / std::abs(baseroot);
> for (int i=1; i<n; i++) {
> root *= phasor;
> myroots.push_back(root);
> }

I did not check whether your code gives correct results, but this kind
of calculation results in amazingly large floating point errors. A
similar method is used in the GPU FFT algorithm for the Raspberry Pi and
has errors on the order of 10^-3 for n of a few hundred. That is bad even
for single precision floats.
The iterative multiplication tends to accumulate errors for small
angles. Chris's method is by far more accurate.

There is a workaround possible by dealing with residuals of the complex
multiplication, more precisely the additions and subtractions in the
complex multiplication are the root of the errors. This workaround
increases accuracy but not to the level of the transcendental functions.


Marcel

Marcel Mueller

unread,
Apr 12, 2017, 3:09:36 AM4/12/17
to
On 11.04.17 01.09, Chris M. Thomasson wrote:
> // our point
> ct_complex c = {
> std::cos(angle_base + angle) * radius,
> std::sin(angle_base + angle) * radius
> };

Depending on your portability and speed requirements you might want to
calculate sin and cos at once. Unfortunately sincos() never made it into
the standard. However,
ct_complex c = std::polar(radius, angle_base + angle);
might do the job for you if it is appropriately optimized on your target
platform. At least it is more readable.


> // Sum up the Go% damn floating point errors!
> avg_err = avg_err + std::abs(raised - z);
> }
>
> // gain the average error sum... ;^o
> return avg_err / n;

Is there a reason why you calculated the /absolute/ floating point error
rather than the RMS value as usual?


> Can you run this? BTW, what about possibly reducing the average error
> ratio? Any ideas?

Hmm, there is probably no much you can do to reduce floating point
errors except if you do your own calculations at higher precision. E.g.
the complex angle calculations could be done with integer arithmetic.
This gives you about 63 significant bits rather than 52 bits of IEEE 754
double. Since you are operating at the unit circle fixed point is no big
deal. But reasonably implementing the transcendental functions is no fun
either.
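
A minimal sketch of the fixed-point-angle idea, assuming a 64-bit
"fraction of a turn" representation (turn_to_radians is just an
illustrative name, not part of the posted code):

_________________________________
#include <cstdint>
#include <cmath>

// An angle stored as a 64-bit fraction of a full turn:
// angle = turn * 2*pi / 2^64. Adding step values wraps modulo 2^64,
// i.e. modulo one full turn, with no rounding error in the angle itself.
double turn_to_radians(std::uint64_t turn)
{
    const double two_pi = 6.283185307179586476925286766559;
    return two_pi * std::ldexp(static_cast<double>(turn), -64); // turn / 2^64
}
_________________________________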


Marcel

Manfred

unread,
Apr 12, 2017, 8:58:03 AM4/12/17
to
On 4/11/2017 1:09 AM, Chris M. Thomasson wrote:
> ct_complex c = {
> std::cos(angle_base + angle) * radius,
> std::sin(angle_base + angle) * radius
> };
>
The pair of cos+sin calculations can be efficiently combined in a single
call on platforms that allow that. This makes use of the power of
numerical algorithms to calculate both values simultaneously in
significantly less time than the two separate calls.
I think it is a pity that sincos() was not included in the standard (was
it considered?). Anyway, these are the options I know of:
- GNU glibc exports a sincos() function
- on x87 an FSINCOS instruction is built into the FPU, which can easily be
used on 32-bit x86
- on x86_64 it requires more work, but still the core numerical
algorithm (e.g. a Newton expansion) can be easily adapted to this.
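
A small sketch of the first option (polar_fast is an illustrative name;
sincos() is a GNU extension, so the __GLIBC__ guard below is an
assumption about the platform, not portable ISO C++):

_________________________________
#include <cmath>
#include <complex>

// radius * e^(i*angle); both trig values from one call where available.
std::complex<double> polar_fast(double radius, double angle)
{
    double s, c;
#if defined(__GLIBC__)
    ::sincos(angle, &s, &c); // GNU extension, not ISO C++
#else
    s = std::sin(angle);     // portable fallback: two separate calls
    c = std::cos(angle);
#endif
    return std::complex<double>(radius * c, radius * s);
}
_________________________________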

Chris M. Thomasson

unread,
Apr 13, 2017, 1:26:02 PM4/13/17
to
On 4/12/2017 12:09 AM, Marcel Mueller wrote:
> On 11.04.17 01.09, Chris M. Thomasson wrote:
>> // our point
>> ct_complex c = {
>> std::cos(angle_base + angle) * radius,
>> std::sin(angle_base + angle) * radius
>> };
>
> Depending on your portability and speed requirements you might want to
> calculate sin and cos at once. Unfortunately sincos() never made it in
> the standard. However,
> ct_complex c = std::polar(radius, angle_base + angle);
> might do the job for you if it is appropriately optimized at your target
> platform. At least it is more readable.

Agreed. std::polar is easier to read, and can be faster.


>
>
>> // Sum up the Go% damn floating point errors!
>> avg_err = avg_err + std::abs(raised - z);
>> }
>>
>> // gain the average error sum... ;^o
>> return avg_err / n;
>
> Is there a reason why you calculated the /absolute/ floating point error
> rather than the RMS value as usual?

Not really. I just wanted to get a quick idea about the errors that are
going on.

>
>
>> Can you run this? BTW, what about possibly reducing the average error
>> ratio? Any ideas?
>
> Hmm, there is probably no much you can do to reduce floating point
> errors except if you do your own calculations at higher precision. E.g.
> the complex angle calculations could be done with integer arithmetic.
> This gives you about 63 significant bits rather than 52 bits of IEEE 754
> double. Since you are operating at the unit circle fixed point is no big
> deal. But reasonably implementing the transcendental functions is no fun
> either.

Right. I am going to use a high-precision floating point library in
later versions. 64-bit doubles are pretty good for this experimental phase.

Chris M. Thomasson

unread,
Apr 13, 2017, 1:28:10 PM4/13/17
to
On 4/12/2017 5:57 AM, Manfred wrote:
> On 4/11/2017 1:09 AM, Chris M. Thomasson wrote:
>> ct_complex c = {
>> std::cos(angle_base + angle) * radius,
>> std::sin(angle_base + angle) * radius
>> };
>>
> The pair of cos+sin calculations can be efficiently combined in a single
> call on platforms that allow that. This makes use of the power of
> numerical algorithms to calculate both values simultaneously in
> significantly less time than the two separate calls.
> I think it is a pity that sincos() was not included (was it considered?)

That would have been a nice function to have.

> in the standard, anyway these are the options I know of:
> - GNU glibc exports a sincos() function
> - on x87 a FSINCOS instruction is built in the FPU which can easily be
> used on 32-bit x86
> - on x86_64 it requires more work, but still the core numerical
> algorithm (e.g. a Newton expansion) can be easily adapted to this.

Nice. Well, I think that std::polar has the best chance of using
sincos(). Need to check the implementation.

David Brown

unread,
Apr 13, 2017, 2:19:56 PM4/13/17
to
On 11/04/17 01:09, Chris M. Thomasson wrote:
> The goal of this crude sample code is to take the n-ary roots of a
> complex number. Iterating a complex number z to all of its z^(1/n)
> roots. This code derives the polar form from z, and iterates all of the
> roots. This is an essential part of providing the ability of storing
> n-ary tokens in a set of iterated roots. We can map the n roots, to a
> token table, and store data. I will explain more later, however, please
> take a look at the code, run it, tear it apart, and give me pointers on
> how to make it better, please? Thank you all for your time. The function
> at the heart of this is the function ct_roots that breaks apart a
> complex number into its n-ary roots. Here is the code:
>
<snip>
> Can you run this? BTW, what about possibly reducing the average error
> ratio? Any ideas?
>
> Will post more context later on tonight or tommorw.
>
> :^)

There are, I think, two big issues here. First, you have started the
coding before really understanding the maths. Second, you ran an
iterative algorithm in floating point arithmetic - the errors there will
always build up.

Finding the nth roots of a complex number z is easy in polar
coordinates. Express z = r.e^it, where r = abs(z). Find the real nth
root of r, call it q. All your complex roots of z will have this magnitude.

Then consider the angle. Your original angle is t. Let s = t / n.
This is the angle for your principal root, which is then q.e^is. All
your other roots are this principal root multiplied by the n roots of
unity, e^(2iπ.j/n) for j = 0 .. n-1.

This means that you can easily calculate all the nth roots of your
original number directly, rather than iteratively, and thus avoid any
cumulative floating point rounding errors.
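
A minimal sketch of that direct computation, in terms of the ct_*
typedefs and CT_PI from the original post (ct_nth_root is a hypothetical
name; it jumps straight to the j-th root without iterating):

_________________________________
// j-th of the n roots of z: |z|^(1/n) * e^(i*(arg(z) + 2*pi*j)/n), j in [0, n)
ct_complex
ct_nth_root(
ct_complex const& z,
ct_uint n,
ct_uint j
) {
    assert(n != 0 && j < n);

    ct_float q = std::pow(std::abs(z), 1.0 / static_cast<ct_float>(n)); // real n-th root of |z|
    ct_float s = (std::arg(z) + 2.0 * CT_PI * static_cast<ct_float>(j))
                 / static_cast<ct_float>(n);

    return std::polar(q, s); // q * e^(i*s)
}
_________________________________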

Chris M. Thomasson

unread,
Apr 13, 2017, 2:38:25 PM4/13/17
to
Makes sense to me. Thank you. Will this have the precision that cos/sin
has? The higher the precision, the more data we can store in a complex
number. Also, for the ct_store function, like you said, I do not need to
return all of the roots. Just the n'th root. So I do not need to return
a vector of all the roots in the ct_store function. I need to make two
functions. One that returns all the roots, and one that returns a single
targeted root. It's easy, and should be done today. The parametric nature
can be exploited in ct_store. However, I am not sure how to avoid
returning all the roots in the ct_load function.

Thanks again David.

>
> This means that you can easily calculate all the nth roots of your
> original number directly, rather than iteratively, and thus avoid any
> cumulative floating point rounding errors.
>

I can make two functions. One for targeting the n'th root, and one that
returns all the roots. ct_store uses the targeted version, ct_load uses
the other one to lookup the correct symbol in the roots.

Any more very interesting thoughts? :^)

Chris M. Thomasson

unread,
Apr 13, 2017, 2:40:40 PM4/13/17
to
Raising e to gain the base angle is very fast, and I need to test it to
see if I can store the same amount of data vs the slow trig version. If
I cannot, then the precision loss of the fast method should be closely
examined.



> The higher the precision, the more data we can store in a complex
> number.
[...]

Chris M. Thomasson

unread,
Apr 13, 2017, 2:48:30 PM4/13/17
to
On 4/13/2017 11:19 AM, David Brown wrote:
> On 11/04/17 01:09, Chris M. Thomasson wrote:
>> The goal of this crude sample code is to take the n-ary roots of a
>> complex number. Iterating a complex number z to all of its z^(1/n)
>> roots. This code derives the polar form from z, and iterates all of the
>> roots. This is an essential part of providing the ability of storing
>> n-ary tokens in a set of iterated roots. We can map the n roots, to a
>> token table, and store data. I will explain more later, however, please
>> take a look at the code, run it, tear it apart, and give me pointers on
>> how to make it better, please? Thank you all for your time. The function
>> at the heart of this is the function ct_roots that breaks apart a
>> complex number into its n-ary roots. Here is the code:
>>
> <snip>
>> Can you run this? BTW, what about possibly reducing the average error
>> ratio? Any ideas?
>>
>> Will post more context later on tonight or tommorw.
>>
>> :^)
>
> There are, I think, two big issues here. First, you have started the
> coding before really understanding the maths.

I am not a mathematician, David! Fwiw, I thank you for your comments.

[...]

Ralf Goertz

unread,
Apr 14, 2017, 4:04:27 AM4/14/17
to
On Thu, 13 Apr 2017 10:28:02 -0700,
"Chris M. Thomasson" <inv...@invalid.invalid> wrote:

> Nice. Well, I think that std::polar has the best chance of using
> sincos(). Need to check the implementation.

It doesn't use it here :-(

template<typename _Tp>
inline complex<_Tp>
polar(const _Tp& __rho, const _Tp& __theta)
{
__glibcxx_assert( __rho >= 0 );
return complex<_Tp>(__rho * cos(__theta), __rho * sin(__theta));
}

> g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/6/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada,go --enable-offload-targets=hsa --enable-checking=release --with-gxx-include-dir=/usr/include/c++/6 --enable-ssp --disable-libssp --disable-libvtv --disable-libcc1 --enable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function --program-suffix=-6 --without-system-libunwind --enable-multilib --with-arch-32=x86-64 --with-tune=generic --build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 6.3.1 20170202 [gcc-6-branch revision 245119] (SUSE Linux)


Ralf Goertz

unread,
Apr 14, 2017, 5:42:18 AM4/14/17
to
On Wed, 12 Apr 2017 14:57:52 +0200,
Manfred <non...@invalid.add> wrote:

> The pair of cos+sin calculations can be efficiently combined in a
> single call on platforms that allow that. This makes use of the power
> of numerical algorithms to calculate both values simultaneously in
> significantly less time than the two separate calls.

Do you know where I can read more about that efficient algorithm? Sine
being odd and cos even means that their Taylor series don't share
any common powers of x. So I am a bit puzzled how that is accomplished.

> I think it is a pity that sincos() was not included (was it
> considered?) in the standard, anyway these are the options I know of:
> - GNU glibc exports a sincos() function
> - on x87 a FSINCOS instruction is built in the FPU which can easily
> be used on 32-bit x86
> - on x86_64 it requires more work, but still the core numerical
> algorithm (e.g. a Newton expansion) can be easily adapted to this.

Hm, I just experimented a bit. Calculating 10 million sines and cosines
separately takes as long as doing it in one go using sincos on my Linux
x86_64 architecture. Is it because the glibc sincos function is bogus here?

Christian Gollwitzer

unread,
Apr 14, 2017, 8:26:23 AM4/14/17
to
On 14.04.17 at 11:42, Ralf Goertz wrote:
> On Wed, 12 Apr 2017 14:57:52 +0200,
> Manfred <non...@invalid.add> wrote:
>
>> The pair of cos+sin calculations can be efficiently combined in a
>> single call on platforms that allow that. This makes use of the power
>> of numerical algorithms to calculate both values simultaneously in
>> significantly less time than the two separate calls.
>
> Do you know where I can read more about that efficient algorithm. Sine
> being odd and cos even means that their Taylor serieses don't share
> any common powers of x. So I am a bit puzzled how that is accomplished.

There are ways to do it together, e.g. if an algorithm is involved which
is based upon rotation like CORDIC. Here is another paper:
https://arxiv.org/abs/cs/0406049


>> I think it is a pity that sincos() was not included (was it
>> considered?) in the standard, anyway these are the options I know of:
>> - GNU glibc exports a sincos() function
>> - on x87 a FSINCOS instruction is built in the FPU which can easily
>> be used on 32-bit x86
>> - on x86_64 it requires more work, but still the core numerical
>> algorithm (e.g. a Newton expansion) can be easily adapted to this.
>
> Hm, I just experimented a bit. Calculating 10 Million sines and cosine
> separately takes as long as doing it in one go using sincos on my linux x86_64
> architecture. Is it because the glibc sincos function is bogus here?
>

1) sincos only exists in (deprecated) 8087 math, so the instruction is
no longer used for SSE math

2) Have you run the sin and cos in separate loops or in two consecutive
lines? gcc may have optimized it to a single call to _sincos - check the
disassembly to be sure

Christian

Ralf Goertz

unread,
Apr 14, 2017, 9:39:55 AM4/14/17
to
On Fri, 14 Apr 2017 14:26:07 +0200,
Christian Gollwitzer <auri...@gmx.de> wrote:

> On 14.04.17 at 11:42, Ralf Goertz wrote:
> >
> > Do you know where I can read more about that efficient algorithm.
> > Sine being odd and cos even means that their Taylor serieses don't
> > share any common powers of x. So I am a bit puzzled how that is
> > accomplished.
>
> There are ways to do it together, e.g. if an algorithm is involved
> which is based upon rotation like CORDIC. Here is another paper:
> https://arxiv.org/abs/cs/0406049

Thanks, I'll have a look.

> > Hm, I just experimented a bit. Calculating 10 Million sines and
> > cosine separately takes as long as doing it in one go using sincos
> > on my linux x86_64 architecture. Is it because the glibc sincos
> > function is bogus here?
>
> 1) sincos only exists in (deprecated) 8087 math, so the instruction
> is no longer used for SSE math
>
> 2) Have you run the sin and cos in spearate loops or in two
> consecutive lines? gcc may have optimized it to a single call to
> _sincos - check the disassembly to be sure


diff'ing the two assembler files I get:

< movl $_ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1c, %esi
< movl $_ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1s, %edi
1084c1082,1091
< call sincos
---
> call sin
> movq %xmm0, %rax
> movq %rax, _ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1s(%rip)
> movq -32(%rbp), %rax
> movq (%rax), %rax
> movq %rax, -40(%rbp)
> movsd -40(%rbp), %xmm0
> call cos
> movq %xmm0, %rax
> movq %rax, _ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1c(%rip)

So I guess g++ is not that clever. The function (mypolar) I used was
defined according to the std::polar function I posted elsewhere in this
thread:

template<typename T> inline complex<T> mypolar(const T& rho, const T& theta) {
static T s,c;
sincos(theta,&s,&c);
//s=sin(theta);
//c=cos(theta);
return complex<T>(rho*c,rho*s);
}

The 10 million runs of the loop take about 350 ms. The loop does nothing
but replacing the appropriate element of a vector<complex<double>> (vc)
with the result of the mypolar call:

for (auto &i:vc) {
i=mypolar(1.0,i.real());
}

I had placed [0, M_2_PI)-uniformly distributed angles into the real
components of the elements of vc beforehand.

bitrex

unread,
Apr 14, 2017, 4:05:39 PM4/14/17
to
e^(i*x) = cos(x) + i*sin(x).

HTH ;)

bitrex

unread,
Apr 14, 2017, 4:16:43 PM4/14/17
to
The proof is, roughly speaking, that given certain assumptions, two
functions that are solutions of the same homogeneous linear ordinary
differential equation and have the same Taylor series coefficients over
some domain must be the same function.

Here the domain is actually the entire set of complex numbers, z, of
which the real numbers are a subset.
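
Spelling that out with the series themselves: substituting ix into the
exponential series and separating real and imaginary parts gives

$$e^{ix}=\sum_{n=0}^{\infty}\frac{(ix)^n}{n!}=\sum_{k=0}^{\infty}\frac{(-1)^k x^{2k}}{(2k)!}+i\sum_{k=0}^{\infty}\frac{(-1)^k x^{2k+1}}{(2k+1)!}=\cos x+i\sin x.$$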

Manfred

unread,
Apr 15, 2017, 1:39:30 PM4/15/17
to
On 04/14/2017 03:39 PM, Ralf Goertz wrote:
> On Fri, 14 Apr 2017 14:26:07 +0200,
> Christian Gollwitzer <auri...@gmx.de> wrote:
>
>> On 14.04.17 at 11:42, Ralf Goertz wrote:
>>>
>>> Do you know where I can read more about that efficient algorithm.
>>> Sine being odd and cos even means that their Taylor serieses don't
>>> share any common powers of x. So I am a bit puzzled how that is
>>> accomplished.
About the simple observation of the series: they don't share common
powers, but along the series, in a plain implementation some values must
be calculated even if they accumulate into only one of the two series:
look e.g. at the 1/(n!) terms (and also x^n, although for these you can
use x^2 to step over the unused powers).
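
Concretely, the two series interleave:

$$\sin x = x-\frac{x^3}{3!}+\frac{x^5}{5!}-\cdots,\qquad \cos x = 1-\frac{x^2}{2!}+\frac{x^4}{4!}-\cdots,$$

so a single running term x^n/n!, updated once per step, feeds the cosine
sum for even n and the sine sum for odd n; computed separately, each
series still has to step through (or skip over) the other's powers and
factorials.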
These algorithms have been subject to deep analysis and optimization,
and I can't say whether modern implementations actually use these series;
nevertheless, the Intel instruction set reference manual says about FSINCOS:
"This instruction is faster than executing the FSIN and FCOS
instructions in succession."
If the Intel designers decided to implement this in their hardware, I
think there must have been a good reason.


>>
>> There are ways to do it together, e.g. if an algorithm is involved
>> which is based upon rotation like CORDIC. Here is another paper:
>> https://arxiv.org/abs/cs/0406049
>
> Thanks, I'll have a look.
>
>>> Hm, I just experimented a bit. Calculating 10 Million sines and
>>> cosine separately takes as long as doing it in one go using sincos
>>> on my linux x86_64 architecture. Is it because the glibc sincos
>>> function is bogus here?
I did the same, with 1 billion iterations (code below **).
With no optimization, sincos is faster:
$ c++ -DUSE_SINCOS sincos.cc -o sincos && ./sincos
evals: 1073741824
cycle result: -1.41666e-14, -1.19209e-07
cycle time: 42.1826 sec
$ c++ sincos.cc -o sincos && ./sincos
evals: 1073741824
cycle result: -1.41666e-14, -1.19209e-07
cycle time: 51.1113 sec


with -O2 no difference (see asm below):
$ c++ -O2 -DUSE_SINCOS sincos.cc -o sincos && ./sincos
evals: 1073741824
cycle result: -1.41666e-14, -1.19209e-07
cycle time: 25.9019 sec
$ c++ -O2 sincos.cc -o sincos && ./sincos
evals: 1073741824
cycle result: -1.41666e-14, -1.19209e-07
cycle time: 26.4266 sec


>>
>> 1) sincos only exists in (deprecated) 8087 math, so the instruction
>> is no longer used for SSE math
x87 FPUs do have some age, but I am not sure they are generally
deprecated for 32-bit code.
It is forbidden to use them in some contexts (e.g. conflicts with MMX,
which aliases its registers with the FPU registers).
SSE math does not implement such instructions in hardware; they have to
be implemented in application (library) code.

>>
>> 2) Have you run the sin and cos in spearate loops or in two
>> consecutive lines? gcc may have optimized it to a single call to
>> _sincos - check the disassembly to be sure
>
>
> diff'ing the two assembler files I get:
>
> < movl $_ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1c, %esi
> < movl $_ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1s, %edi
> 1084c1082,1091
> < call sincos
> ---
>> call sin
>> movq %xmm0, %rax
>> movq %rax, _ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1s(%rip)
>> movq -32(%rbp), %rax
>> movq (%rax), %rax
>> movq %rax, -40(%rbp)
>> movsd -40(%rbp), %xmm0
>> call cos
>> movq %xmm0, %rax
>> movq %rax, _ZZ7mypolarIdESt7complexIT_ERKS1_S4_E1c(%rip)
>
> So I guess g++ is not that clever. The function (mypolar) I used was
> defined according to the std::polar function I posted elsewhere in this
> thread:
I got a similar result with no optimization:
$ c++ -S -DUSE_SINCOS sincos.cc -o sincos1.s && c++ -S sincos.cc -o
sincos2.s && diff sincos{1,2}.s | less
< movq %rcx, %rsi
< movq %rdx, %rdi
< movq %rax, -200(%rbp)
< movsd -200(%rbp), %xmm0
< call sincos
< movsd -192(%rbp), %xmm0
< movsd -184(%rbp), %xmm1
< movsd %xmm1, -168(%rbp)
< movsd %xmm0, -176(%rbp)
---
> movq %rax, -168(%rbp)
116c106,108
< movsd -32(%rbp), %xmm1
---
> call sin
> movapd %xmm0, %xmm1
> movsd -32(%rbp), %xmm0
119,120c111,116
< movsd -176(%rbp), %xmm0
< movsd -40(%rbp), %xmm1
---
> movq -72(%rbp), %rax
> movq %rax, -168(%rbp)
> movsd -168(%rbp), %xmm0
> call cos
> movapd %xmm0, %xmm1
> movsd -40(%rbp), %xmm0

But I got a definitely different result with -O2:
$ c++ -S -O2 -DUSE_SINCOS sincos.cc -o sincos1.s && c++ -S -O2 sincos.cc
-o sincos2.s && diff sincos{1,2}.s
55c55
< cmpq %rbx, %rax
---
> cmpq %rax, %rbx
96c96
< cmpq %rbx, %r15
---
> cmpq %r15, %rbx

So indeed, with -O2 the result is practically identical.



>
> template<typename T> inline complex<T> mypolar(const T& rho, const T& theta) {
> static T s,c;
> sincos(theta,&s,&c);
> //s=sin(theta);
> //c=cos(theta);
> return complex<T>(rho*c,rho*s);
> }
>
> The 10 million runs of the loop take about 350 ms. The loop does nothing
> but replacing the appropriate element of a vector<complex<double>> (vc)
> with the result of the mypolar call:
>
> for (auto &i:vc) {
> i=mypolar(1.0,i.real());
> }
>
> I had placed [0, M_2_PI)-uniformly distributed angles into the real
> components of the elements of vc beforehand.
Note that M_2_PI is defined as "two times the reciprocal of pi"


**) sample code:
#include <iostream>

#include <vector>

#include <cmath>
#include <ctime>


#define ANGLES 1024
#define ITER (1024*1024)


int main()
{
unsigned long evalCount = 0;

double sinAcc = 0.0;
double cosAcc = 0.0;

std::vector<double> angles(ANGLES, 0.0);


for(int a=0; ANGLES>a; ++a)
{
angles[a] = a*(2.0*M_PI/ANGLES);
}

clock_t startTime = clock();

for(int iter=0; ITER>iter; ++iter)
{
for(auto a : angles)
{
#ifdef USE_SINCOS
double sinVal, cosVal;
sincos(a, &sinVal, &cosVal);
sinAcc += sinVal;
cosAcc += cosVal;
#else
sinAcc += sin(a);
cosAcc += cos(a);
#endif
++evalCount;
}
}

clock_t endTime = clock();

std::cout << "evals: " << evalCount << std::endl;
std::cout << "cycle result: " << sinAcc << ", " << cosAcc << std::endl;
std::cout << "cycle time: " <<
static_cast<double>(endTime-startTime)/CLOCKS_PER_SEC << " sec" <<
std::endl;

}

Ralf Goertz

unread,
Apr 16, 2017, 7:39:24 AM4/16/17
to
On Sat, 15 Apr 2017 19:39:21 +0200,
Manfred <non...@invalid.add> wrote:

> On 04/14/2017 03:39 PM, Ralf Goertz wrote:
> >> On 14.04.17 at 11:42, Ralf Goertz wrote:
> >>>
> >>> Hm, I just experimented a bit. Calculating 10 Million sines and
> >>> cosine separately takes as long as doing it in one go using sincos
> >>> on my linux x86_64 architecture. Is it because the glibc sincos
> >>> function is bogus here?
> I did the same, with 1 billion iterations (code below **):
> No optimization sincos is faster:

Yeah stupid me. I only tried it with -O2 so I got no difference. When I
compared the assembler codes I forgot the -O2.

> > The 10 million runs of the loop take about 350 ms. The loop does
> > nothing but replacing the appropriate element of a
> > vector<complex<double>> (vc) with the result of the mypolar call:
> >
> > for (auto &i:vc) {
> > i=mypolar(1.0,i.real());
> > }
> >
> > I had placed [0, M_2_PI)-uniformly distributed angles into the real
> > components of the elements of vc beforehand.
> Note that M_2_PI is defined as "two times the reciprocal of pi"

Ouch, thanks. I should have looked at the value. So there seems to be no
constant 2*π defined in math.h? Isn't that constant worth being defined?
Of course I can use 2*M_PI, but then I could also use 2/M_PI.
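
A minimal workaround sketch (TWO_PI is just a name I picked; it assumes
M_PI itself is available, which is POSIX rather than ISO C++):

_________________________________
#include <cmath>

// 2*pi as a named constant, computed once from M_PI.
constexpr double TWO_PI = 2.0 * M_PI;
_________________________________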

Tim Rentsch

unread,
Apr 16, 2017, 11:23:34 AM4/16/17
to
Marcel Mueller <news.5...@spamgourmet.org> writes:

> On 11.04.17 07.41, Christian Gollwitzer wrote:
>> While this is correct, I think that the following is easier and should
>> run faster:
>>
>> std::vector<ct_complex> myroots;
>> ct_complex baseroot = std::pow(z, 1.0/p);
>> myroots.push_back(baseroot);
>>
>> ct_complex root = baseroot;
>> ct_complex phasor = baseroot / std::abs(baseroot);
>> for (int i=1; i<n; i++) {
>> root *= phasor;
>> myroots.push_back(root);
>> }
>
> I did not ckeck whether your code gives correct results, but this kind
> of calculation results in amazingly large floating point errors. A
> similar method is used in the GPU FFT algorithm for the Raspberry Pi
> and has errors in the order of 10^-3 for n of a few hundred. This even
> bad for single precision floats.
> The iterative multiplication tends to create cumulative errors for
> small angles. Chris's method is by far more accurate. [...]

I suggest an intermediate approach that uses sin/cos less but
still has good accuracy. The roots needed are all 'baseroot'
times the nth roots of unity. The nth roots of unity can be
calculated using only O( log n ) calls to sin/cos, plus O(n)
multiplications, as follows. Calculate e ** (k*2*pi*i/n), with
k = all powers of two < n/2 (this is done using sin/cos). Now
all the roots of unity can be calculated using these factors
using only one multiplication per root, using a binary-recursive
expansion. The results take one more multiplication each. (In
case it needs saying the aforementioned multiplications are of
complex numbers.)

Also, the roots of unity are symmetric about the x axis, so
only half of them need to be calculated directly, the other
half can be done with just a sign change of the imaginary
component.
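
A sketch of my reading of that scheme (not Tim's code; roots_of_unity is
an illustrative name): sin/cos is evaluated only for the power-of-two
angles, and every other root of unity is produced by a single complex
multiplication from one already computed.

_________________________________
#include <cmath>
#include <complex>
#include <vector>

// All n-th roots of unity with O(log n) sin/cos calls: sin/cos only for the
// power-of-two angles, one complex multiplication for every remaining root.
std::vector<std::complex<double>> roots_of_unity(std::size_t n)
{
    std::vector<std::complex<double>> w(n);
    if (n == 0) return w;

    const double two_pi = 6.283185307179586476925286766559;
    w[0] = std::complex<double>(1.0, 0.0);

    for (std::size_t p = 1; p < n; p <<= 1)                        // p = 1, 2, 4, ...
    {
        std::complex<double> wp = std::polar(1.0, two_pi * p / n); // one sin/cos pair
        for (std::size_t j = p; j < n && j < 2 * p; ++j)
            w[j] = w[j - p] * wp;                                  // one multiplication per root
    }
    return w;
}
_________________________________

The conjugate symmetry Tim mentions could halve the work further; it is
left out to keep the sketch short.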

David Brown

unread,
Apr 17, 2017, 4:25:37 PM4/17/17
to
Yes.

> The higher the precision, the more data we can store in a complex
> number.

Storing data in a floating point or complex number is possible in theory
with infinitely accurate real numbers - but totally and utterly useless
in practice. So don't worry about such details - if you think this is
fun to play with, then go ahead and play with it. Such experimentation
is fun and educational. Just don't be surprised when your double
precision numbers can't hold more than a couple of chars after manipulation.

> Also, for the ct_store function, link you said, I do not need to
> return all of the roots. Just the n'th root.

Doing the calculations in polar coordinates is perfect for this. You
can jump straight to the n'th root without going through any of the ones
in between.

David Brown

unread,
Apr 17, 2017, 4:29:50 PM4/17/17
to
No problem. As with many of your experiments, I cannot be optimistic
that you will get anything useful or practical in the end - but the
maths and the coding on the journey there can be fun and educational.
So enjoy the road, and don't feel disappointed about the destination.

<https://en.wikipedia.org/wiki/Complex_number#Polar_form>

Chris M. Thomasson

unread,
Apr 17, 2017, 5:53:22 PM4/17/17
to
Totally fair enough. For instance, the size of the "ciphertext" wrt
storing the encoding of a plaintext in complex numbers is not all that
practical. However, it just might be useful in the sense of providing
security, wrt somebody working in a niche field saying "I do not care
about the large ciphertext..."


> but the
> maths and the coding on the journey there can be fun and educational. So
> enjoy the road,

Oh, the journey is very enlightening indeed! Coding the math is just
great, and vice versa. I have always thought it was neat when one of my
teachers accepted my homework as a series of little programs in
Applesoft BASIC. Fwiw, this neat little site has an online emulator:

http://www.calormen.com/jsbasic

It brings back some memories.


> and don't feel disappointed about the destination.

Never. I need to hear from all the smart people out there. If you find
something iffy, let me know! I try to not get mad about being wrong, and
having to learn new things. Afaict, this is the essence of life!
Learning is fun, and a computer is a fun tool to have.

Thanks again David.

>
> <https://en.wikipedia.org/wiki/Complex_number#Polar_form>
>

Love the polar form. Can you think of a fast way to add two complex
numbers together in polar form? I always have to convert them back to
Cartesian to perform the addition. If this can be avoided, wrt gaining
the roots (z-c)^(1/n) from z^n+c, well, that would be great.

Converting back and forth between polar and Cartesian on every iteration
is very slow. Any more clever thoughts? :^)

Chris M. Thomasson

unread,
Apr 17, 2017, 6:06:16 PM4/17/17
to
On 4/17/2017 1:25 PM, David Brown wrote:
> On 13/04/17 20:38, Chris M. Thomasson wrote:
[...]
>> The higher the precision, the more data we can store in a complex
>> number.
>
> Storing data in a floating point or complex number is possible in theory
> with infinitely accurate real numbers - but totally and utterly useless
> in practice. So don't worry about such details - if you think this is
> fun to play with, then go ahead and play with it. Such experimentation
> is fun and educational. Just don't be surprised when your double
> precision numbers can't hold more than a couple of chars after
> manipulation.

So far, I can only store around 70 random bits in a complex number
comprised of two 64-bit doubles, using heavy rounding. That is not that
good. However, there are certain classes of data that sort of like to be
stored. They tend to spiral in z^n+c. Think of zooming in on a spiral in
a Julia set. They all hold data. We can store it within limits
deliberately, or read it rather wildly from random starting conditions.

The fact that we can store data in the actual roots of z^n+c via
(z-c)^(1/n) is just neat and very fun to me, and extremely educational:
Especially when I get great feedback like this.

:^)

David Brown

unread,
Apr 17, 2017, 7:16:06 PM4/17/17
to
There are certainly situations in cryptography where the size of the
encrypted message is of little matter. You will often use a large
random key for the main data encryption algorithm, which must be fast
and size efficient. The random key is then encrypted with an algorithm
that must be very secure, but can be slow or space inefficient.

>
>> but the
>> maths and the coding on the journey there can be fun and educational. So
>> enjoy the road,
>
> Oh the journey is very enlightening indeed! Coding the math is just
> great and vise-versa. I have always thought is was neat when one of my
> teachers accepted my homework as a series of little programs in
> AppleSoft BASIC. Fwiw, this neat little site has an online emulator:
>
> http://www.calormen.com/jsbasic
>
> It brings back some memories.
>
>
>> and don't feel disappointed about the destination.
>
> Never. I need to hear from all the smart people out there. If you find
> something iffy, let me know! I try to not get mad about being wrong, and
> having to learn new things. Afaict, this is the essence of life!
> Learning is fun, and a computer is a fun tool to have.
>
> Thanks again David.
>
>>
>> <https://en.wikipedia.org/wiki/Complex_number#Polar_form>
>>
>
> Love the polar form. Can you think of a fast way to add two complex
> numbers together in polar form? I always have to convert them back to
> Cartesian to perform the addition.

Cartesian form is much easier for addition. Polar form is much easier
for powers (including roots) and logarithms. Either form works fine for
multiplication and division.
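
Put as formulas: in polar form, multiplication is just

$$r_1 e^{i\theta_1}\cdot r_2 e^{i\theta_2}=r_1 r_2\,e^{i(\theta_1+\theta_2)},$$

while the sum r_1 e^{i\theta_1} + r_2 e^{i\theta_2} has no comparably
simple form; its modulus already involves

$$\sqrt{r_1^2+r_2^2+2\,r_1 r_2\cos(\theta_1-\theta_2)},$$

which is why addition effectively forces a trip through Cartesian
coordinates.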

> If this can be avoided, wrt gaining
> the roots (z-c)^(1/n) from z^n+c, well, that would be great.

I'm afraid not.

>
> Converting back and forth between polar and Cartesian on every iteration
> is very slow. Any more clever thoughts? :^)

Yes - avoid iteration, and do the calculations directly!


Ralf Goertz

unread,
Apr 18, 2017, 6:01:14 AM4/18/17
to
On Sun, 16 Apr 2017 13:39:14 +0200,
Ralf Goertz <m...@myprovider.invalid> wrote:
I am really interested in an answer to that. Would the value the compiler
produces for 2/M_PI be less accurate than the one for 2*M_PI, so that one
really needs M_2_PI predefined but no constant for 2*π?


Ralf Goertz

unread,
Apr 18, 2017, 6:45:06 AM4/18/17
to
Am Wed, 12 Apr 2017 14:57:52 +0200
schrieb Manfred <non...@invalid.add>:

> The pair of cos+sin calculations can be efficiently combined in a
> single call on platforms that allow that. This makes use of the power
> of numerical algorithms to calculate both values simultaneously in
> significantly less time than the two separate calls.

I thought I'd give it a try and define my own sincos function using a
Taylor approximation. I thought all I need are the results for the
interval [0,M_PI/4] and trigonometric identities to figure out the
rest. I guess M_PI/4 is small enough for the series to be cut at x^10.

The output of the program below is

root mean square errors: sine=1.663e-11 cosine=1.663e-11

But I don't know how accurate sincos is, so the error might be smaller.
Or I am measuring noise ;-). At least it is no surprise that both errors
are the same. Anyway, the function is quite fast although not as fast as
sincos or even sin and cos in succession. (By the way, is it possible to
stop g++ from optimizing the single calls into one sincos call with -O2
still intact? In order to do that here I had to cheat and use different
angles for sine and cosine, respectively.) So with my loop of 10000000
calls I got the following times:

mysincos: 0.670069 seconds
sincos: 0.525864 seconds
sin/cos: 0.605176 seconds

Here is the function and the program to find out the rms error. Comments
are very welcome.



#include <iostream>
#include <cmath>

using namespace std;

template<typename T> inline void mysincos(const T &rho, T* s, T*c) {
//taylor coeffs starting from 2
const static T coeffs[]{-T(1)/2,-T(1)/6,T(1)/24,T(1)/120,-T(1)/720,-T(1)/5040,
        T(1)/40320,T(1)/362880,-T(1)/3628800,-T(1)/39916800};

//sin(-x)=-sin(x) cos(-x)=cos(x) so only deal with positive angles;
bool neg=rho<0;
T r(fabs(rho));

//shift angle into the [0,2π) interval
r=fmod(r,2*M_PI);

//figure out in which of 8 parts the angle is
int octo(0);
while (r>=(octo+1)*M_PI/4) ++octo;

//put angle into the first of those parts
r=fmod(r,M_PI/4);

//reflect r at π/8 if original value was in an odd interval
if (octo%2) r=M_PI/4-r;

//initialize taylor approximation
*s=r;*c=1;

//initialize x powers
T x(r*r);
//loop over taylor coefficients
for (unsigned i=0;i<sizeof(coeffs)/sizeof(T);++i) {
if (i%2) //sine
*s+=x*coeffs[i];
else //cosine
*c+=x*coeffs[i];
//increase the exponent of x
x*=r;
}

//figure out what to do to get back to the full interval
//for octo ∈ {1,2,5,6} swap sine and cosine
auto o4=octo%4;
if (o4 && o4<3) swap(*s,*c);

switch (octo) {
case 2:
case 3:*c=-*c;break;
case 4:
case 5:*s=-*s;*c=-*c;break;
case 6:
case 7:*s=-*s;break;
default:break;
}
if (neg) *s=-*s;
}


int main() {
double s,c;
const int max_rms_n=256;
double rms_s(0),rms_c(0);
for (int i=0;i<max_rms_n;++i) {
double phi(2*i*M_PI/max_rms_n);
sincos(phi,&s,&c);
double sq(s-sin(phi));
rms_s+=sq*sq;
sq=c-cos(phi);
rms_c+=sq*sq;
}
cout<<"root mean square errors: sine="<<sqrt(rms_s/max_rms_n)<<" cosine="<<sqrt(rms_c/max_rms_n)<<endl;
return 0;
}


Ralf Goertz

unread,
Apr 18, 2017, 6:51:02 AM4/18/17
to
Am Tue, 18 Apr 2017 12:43:57 +0200
schrieb Ralf Goertz <m...@myprovider.invalid>:

>
>
> int main() {
> double s,c;
> const int max_rms_n=256;
> double rms_s(0),rms_c(0);
> for (int i=0;i<max_rms_n;++i) {
> double phi(2*i*M_PI/max_rms_n);
> sincos(phi,&s,&c);

correction: mysincos(phi,&s,&c)

Manfred

unread,
Apr 18, 2017, 7:13:08 AM4/18/17
to
I don't have a definite answer. An obvious observation is that in a
radix-2 FP representation, if you have a constant x, then the expression
2.0*x has exactly the same representation accuracy (it is only an
increment of the exponent). So either there are reasons other than
accuracy, or the difference only matters for non-radix-2 representations.

David Brown

unread,
Apr 18, 2017, 7:37:28 AM4/18/17
to
Another obvious observation is that as long as you keep everything
declared "const" (or "constexpr") or use literals, and have optimisation
enabled, then the compiler should do all these calculations at compile
time rather than run time. That may also help avoid rounding errors.
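
For example (the names are just illustrative):

#include <cmath>

// With literals and const/constexpr objects, an optimising compiler can
// fold these at compile time, so no extra run-time rounding is added
// beyond what the representation itself forces.
constexpr double pi      = 3.14159265358979323846;
constexpr double two_pi  = 2.0 * pi;   // exact: only the exponent changes
constexpr double half_pi = pi / 2.0;   // also exact in a radix-2 format

double wrap_angle(double x)
{
    // two_pi is a compile-time constant; only fmod runs at run time.
    return std::fmod(x, two_pi);
}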


David Brown

unread,
Apr 18, 2017, 8:34:37 AM4/18/17
to
I have a few brief comments. I haven't tested any of these points, so
they may or may not have a significant effect.

1. The function is too long to be "inline" - remove that, and rely on
the compiler to inline the function if appropriate.

2. Use local variables for your sin and cos, rather than *s and *c.
This is especially relevant when the function is not inlined.

3. Make sure "-ffast-math" is in use (for gcc - other compilers may have
similar options). This can make quite a difference to some floating
point code speed.

4. Taylor series are easy to understand, but they are not the fastest
converging polynomial series for sin/cos.

5. You first put the angle into the range [0, π/4). Taylor series are
faster and/or more accurate for a small range, centred around 0.
Consider rotating another π/8 to get in the range [-π/8, π/8). Use:

sin(x ± π/8) = cos(π/8).sin(x) ± sin(π/8).cos(x)
cos(x ± π/8) = cos(π/8).cos(x) ∓ sin(π/8).sin(x)

6. Every iteration of your loop, you multiply x by r. This makes it
easy to calculate, but your rounding errors can build up over the ten
cycles. Tracking r^i and r^(i + 1) separately, multiplying each by r²,
will let you have a loop of 5 steps, avoid the conditional on (i % 2),
allow better instruction scheduling, and halve the error buildup (see
the sketch after this list).

7. Consider only calculating sin or cos, and using c = sqrt(1 - s²) to
calculate the other half.
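
A minimal sketch of point 6 (untested, like the rest of these points: it
assumes r has already been reduced to [0, π/4] as in your code, and uses
plain factorial reciprocals with an explicit alternating sign instead of
your signed coeffs[] array):

#include <cmath>
#include <cstdio>

template<typename T>
void mysincos2(T r, T* s, T* c)
{
    // 1/2!, 1/3!, ..., 1/11!
    static const T inv_fact[] = {
        T(1)/2, T(1)/6, T(1)/24, T(1)/120, T(1)/720, T(1)/5040,
        T(1)/40320, T(1)/362880, T(1)/3628800, T(1)/39916800
    };

    const T r2 = r * r;
    T even = r2;        // r^2, r^4, ...   (cosine terms)
    T odd  = r2 * r;    // r^3, r^5, ...   (sine terms)
    T sgn  = T(-1);     // alternating sign

    T sv = r, cv = T(1);
    for (int k = 0; k < 5; ++k) {
        cv += sgn * even * inv_fact[2 * k];      // r^(2k+2)/(2k+2)!
        sv += sgn * odd  * inv_fact[2 * k + 1];  // r^(2k+3)/(2k+3)!
        even *= r2;
        odd  *= r2;
        sgn = -sgn;
    }
    *s = sv;
    *c = cv;
}

int main()
{
    double s, c;
    mysincos2(0.5, &s, &c);
    std::printf("%.3e %.3e\n", s - std::sin(0.5), c - std::cos(0.5));
}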



Ralf Goertz

unread,
Apr 18, 2017, 11:16:38 AM4/18/17
to
Am Tue, 18 Apr 2017 14:34:25 +0200
schrieb David Brown <david...@hesbynett.no>:

> On 18/04/17 12:43, Ralf Goertz wrote:
> >
> >
> > mysincos: 0.670069 seconds
> > sincos: 0.525864 seconds
> > sin/cos: 0.605176 seconds
> >
> > Here is the function and the program to find out the rms error.
> > Comments are very welcome.
> >

> I have a few brief comments. I haven't tested any of these points, so
> they may or may not have a significant effect.

Hi David, thanks for your comments!

> 1. The function is too long to be "inline" - remove that, and rely on
> the compiler to inline the function if appropriate.

Point taken. But the program is actually faster with inline, see below.

> 2. Use local variables for your sin and cos, rather than *s and *c.
> This is especially relevant when the function is not inlined.

Yep, normally I would have used references, but I wanted to be
compatible with the sincos function call. I am aware that references are
probably nothing but pointers in disguise, so the problem wouldn't go
away.

> 3. Make sure "-ffast-math" is in use (for gcc - other compilers may
> have similar options). This can make quite a difference to some
> floating point code speed.

I always thought that -ffast-math is dangerous. But with that option my
function actually wins (when inlined)!

mysincos: 0.496899 seconds. (inline)
mysincos: 0.545307 seconds. (no inline)
sincos: 0.522989 seconds.

> 4. Taylor series are easy to understand, but they are not the fastest
> converging polynomial series for sin/cos.

Sigh, as you said, Taylor is easy!

> 5. You first put the angle into the range [0, π/4). Taylor series are
> faster and/or more accurate for a small range, centred around 0.
> Consider rotating another π/8 to get in the range [-π/8, π/8). Use:
>
> sin(x ± π/8) = cos(π/8).sin(x) ± sin(π/8).cos(x)
> cos(x ± π/8) = cos(π/8).cos(x) ± -sin(π/8).sin(x)

This would mean to use hexo instead of octo ;-) Looking at the
extraordinary match of the tenth Taylor polynomial even for an interval
like [-5,5]
<https://upload.wikimedia.org/wikipedia/commons/c/cc/Sine_GIF.gif> I
thought I could get away with π/4. And I am pretty sure that I don't
need to use the negative half.

> 6. Every iteration of your loop, you multiply x by r. This makes it
> easy to calculate, but your rounding errors can build up over the ten
> cycles. Tracking r^i and r^(i + 1) separately, multiplying each by
> r², will let you have a loop of 5 steps, avoid the conditional on (i
> % 2), allow better instruction scheduling, and half the error buildup.

I will consider this.

> 7. Consider only calculating sin or cos, and using c = sqrt(1 - s²) to
> calculate the other half.

But won't sqrt be as costly as cos itself?

Thanks again for your time.

Ralf

Christian Gollwitzer

unread,
Apr 18, 2017, 2:48:18 PM4/18/17
to
Am 18.04.17 um 17:15 schrieb Ralf Goertz:
> Am Tue, 18 Apr 2017 14:34:25 +0200
> schrieb David Brown <david...@hesbynett.no>:
>> 4. Taylor series are easy to understand, but they are not the fastest
>> converging polynomial series for sin/cos.
>
> Sigh, as you said, Taylor is easy!
>

Taylor is a bad choice for one simple reason: the Taylor approximation
is exact at the expansion point (it matches the function and all
derivatives up to the order of the series there), and the error grows as
you move away from that point. So the series must be expanded until the
error at the endpoint of the interval is small enough, while close to
the point of expansion the error is much smaller, i.e. precision is wasted.

Instead, for numerical approximation, a series should spread the error
evenly across the interval, which is called a "minimax
approximation". A Chebyshev series approximates this with polynomials.
A very good approximation can often be done with rational functions
(Rational Chebyshev approximation), which cost only a single division in
the end. There is the Remez algorithm to compute this kind of series,
which is a bit complex. A simpler approach yields almost the same result
by least-squares fitting the rational function; see
http://www.aip.de/groups/soe/local/numres/bookfpdf/f5-13.pdf

Boost includes an implementation of the Remez algorithm to implement
special functions
http://www.boost.org/doc/libs/1_63_0/libs/math/doc/html/math_toolkit/remez.html

For sin/cos there is another method: there are recurrence formulae for
doubling or tripling the angle. If you divide the angle in the octant by
4, you can apply a shorter series and then use the angle-doubling
formula twice. This was done in this article
https://arxiv.org/pdf/cs/0406049
This formula also generates both sine and cosine together, therefore
should be faster than any of the aforementioned methods.
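
A rough sketch of the idea (my illustration, not the code from the paper;
it assumes x is already reduced to an octant, and the short series is not
tuned for full double precision):

#include <cmath>
#include <cstdio>

// Evaluate a short series on x/4, then apply the double-angle identities
//   sin(2t) = 2 sin(t) cos(t),  cos(2t) = 1 - 2 sin(t)^2
// twice to get back to x.
static void sincos_quartered(double x, double* s, double* c)
{
    const double t  = x * 0.25;
    const double t2 = t * t;

    // truncated Taylor series; |t| <= pi/16, so few terms are needed
    double st = t * (1.0 - t2 / 6.0 * (1.0 - t2 / 20.0));
    double ct = 1.0 - t2 / 2.0 * (1.0 - t2 / 12.0);

    // first doubling: t -> 2t
    double s2 = 2.0 * st * ct;
    double c2 = 1.0 - 2.0 * st * st;

    // second doubling: 2t -> 4t = x
    *s = 2.0 * s2 * c2;
    *c = 1.0 - 2.0 * s2 * s2;
}

int main()
{
    double s, c;
    sincos_quartered(0.7, &s, &c);
    std::printf("%.3e %.3e\n", s - std::sin(0.7), c - std::cos(0.7));
}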

Christian


David Brown

unread,
Apr 18, 2017, 3:44:42 PM4/18/17
to
On 18/04/17 17:15, Ralf Goertz wrote:
> Am Tue, 18 Apr 2017 14:34:25 +0200
> schrieb David Brown <david...@hesbynett.no>:
>
>> On 18/04/17 12:43, Ralf Goertz wrote:
>>>
>>>
>>> mysincos: 0.670069 seconds
>>> sincos: 0.525864 seconds
>>> sin/cos: 0.605176 seconds
>>>
>>> Here is the function and the program to find out the rms error.
>>> Comments are very welcome.
>>>
>
>> I have a few brief comments. I haven't tested any of these points, so
>> they may or may not have a significant effect.
>
> Hi David, thanks for your comments!
>
>> 1. The function is too long to be "inline" - remove that, and rely on
>> the compiler to inline the function if appropriate.
>
> Point taken. But the program is actually faster with inline, see below.
>
>> 2. Use local variables for your sin and cos, rather than *s and *c.
>> This is especially relevant when the function is not inlined.
>
> Yep, normally I would have used references, but I wanted to be
> compatible to sincos function call. I am aware that references are
> probably nothing but pointers in disguise, so the problem wouldn't go
> away.
>

Don't use references either - a reference is just a pointer in disguise.
Use local variables for your calculations, and assign the results to
*s and *c at the end. When you are trying to make fast code, never use
your external data for intermediary results. (Personally, I prefer not
to do that anyway - I don't like my "real" data objects to hold
half-calculated values.)

>> 3. Make sure "-ffast-math" is in use (for gcc - other compilers may
>> have similar options). This can make quite a difference to some
>> floating point code speed.
>
> I always thought that -ffast-math is dangerous. But with that option my
> function actually wins (when inlined)!

-ffast-math is not "dangerous" - but it does mean that you lose certain
IEEE floating point guarantees. For example, it assumes your maths is
all finite - no NaNs, infinities, etc. And it lets the compiler treat
maths as associative, use multiplication by reciprocals instead of
division, etc.
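
For a concrete flavour of what that permission buys (whether a given
compiler actually does these rewrites depends on version and target):

// With -ffast-math the compiler may hoist the division out of the loop
// as a multiplication by 1.0/d, and reassociate the sum (e.g. in order
// to vectorise it). Under strict IEEE semantics both rewrites could
// change the rounded result, so they need the opt-in.
double scale_sum(const double* p, int n, double d)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += p[i] / d;
    return sum;
}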

>
> mysincos: 0.496899 seconds. (inline)
> mysincos: 0.545307 seconds. (no inline)
> sincos: 0.522989 seconds.
>
>> 4. Taylor series are easy to understand, but they are not the fastest
>> converging polynomial series for sin/cos.
>
> Sigh, as you said, Taylor is easy!
>
>> 5. You first put the angle into the range [0, π/4). Taylor series are
>> faster and/or more accurate for a small range, centred around 0.
>> Consider rotating another π/8 to get in the range [-π/8, π/8). Use:
>>
>> sin(x ± π/8) = cos(π/8).sin(x) ± sin(π/8).cos(x)
>> cos(x ± π/8) = cos(π/8).cos(x) ± -sin(π/8).sin(x)
>
> This would mean to use hexo instead of octo ;-)

No, still octants. But instead of using 0..45 degree octants, you
balance them about 0 for -22.5 to +22.5 degree sectors.

> Looking at the
> extraordinary match of the tenth Taylor polynom even for an Intervall
> like [-5,5]
> <https://upload.wikimedia.org/wikipedia/commons/c/cc/Sine_GIF.gif> I
> thought I could get away with π/4. And I am pretty sure that I don't
> need to use the negative half.

It is all a balance between more complicated algorithms, speed, and
accuracy.

>
>> 6. Every iteration of your loop, you multiply x by r. This makes it
>> easy to calculate, but your rounding errors can build up over the ten
>> cycles. Tracking r^i and r^(i + 1) separately, multiplying each by
>> r², will let you have a loop of 5 steps, avoid the conditional on (i
>> % 2), allow better instruction scheduling, and half the error buildup.
>
> I will consider this.
>
>> 7. Consider only calculating sin or cos, and using c = sqrt(1 - s²) to
>> calculate the other half.
>
> But won't sqrt be as costly as cos itself?

Not necessarily - sqrt is usually cheaper.

>
> Thanks again for your time.
>

Another method that I didn't mention, but which you might like to look
up on Wikipedia, is the CORDIC algorithm.
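
For a taste of it, a minimal floating-point sketch of rotation-mode
CORDIC (a real implementation uses fixed point with precomputed
arctangent and gain tables; this only shows the structure, and assumes
the angle is already in [-π/2, π/2]):

#include <cmath>
#include <cstdio>

static void cordic_sincos(double theta, double* s, double* c)
{
    const int iters = 40;
    double x = 1.0, y = 0.0, z = theta;
    double pow2 = 1.0;   // 2^-i
    double gain = 1.0;   // accumulated scale factor, prod sqrt(1 + 2^-2i)

    for (int i = 0; i < iters; ++i) {
        double sigma = (z >= 0.0) ? 1.0 : -1.0;
        double xn = x - sigma * y * pow2;
        double yn = y + sigma * x * pow2;
        z -= sigma * std::atan(pow2);           // from a small table in real code
        x = xn;
        y = yn;
        gain *= std::sqrt(1.0 + pow2 * pow2);   // usually precomputed too
        pow2 *= 0.5;
    }
    *c = x / gain;
    *s = y / gain;
}

int main()
{
    double s, c;
    cordic_sincos(0.5, &s, &c);
    std::printf("%.3e %.3e\n", s - std::sin(0.5), c - std::cos(0.5));
}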

Alf P. Steinbach

unread,
Apr 18, 2017, 7:58:17 PM4/18/17
to
On 12-Apr-17 2:57 PM, Manfred wrote:
> On 4/11/2017 1:09 AM, Chris M. Thomasson wrote:
>> ct_complex c = {
>> std::cos(angle_base + angle) * radius,
>> std::sin(angle_base + angle) * radius
>> };
>>
> The pair of cos+sin calculations can be efficiently combined in a single
> call on platforms that allow that. This makes use of the power of
> numerical algorithms to calculate both values simultaneously in
> significantly less time than the two separate calls.

I would think simple interpolation (possibly even linear, but I'm
thinking parabola) from a large table of very accurate precomputed
results is both fastest and simplest and most accurate.

That is, I fail to think of any way that couldn't be.

Disclaimer: haven't implemented.

When you have essentially superfast sin and cos, roughly just a couple
of arithmetic operations, themselves overwhelmed by the time used on a
memory access or two, as the interpolation approach would give, I think
something like sincos only has costs, primarily added complexity.

But in an embedded system with a low-end processor the situation might
be that memory, instead of being a cheap throw-away resource, is a very
limited and costly one. Then one would need to do run time computations.
And preferably fast-ish such computations. :)


> I think it is a pity that sincos() was not included (was it considered?)
> in the standard, anyway these are the options I know of:
> - GNU glibc exports a sincos() function
> - on x87 a FSINCOS instruction is built in the FPU which can easily be
> used on 32-bit x86
> - on x86_64 it requires more work, but still the core numerical
> algorithm (e.g. a Newton expansion) can be easily adapted to this.

I'm still learning something every week, thank you.

It's down from learning something every day, which in turn was down from
learning something at least twice every day, and so on. But I'm not yet
dead. And the level of the discussions here in clc++, the insight and
knowledge that's brought to bear, is just amazing now. :)


Cheers!,

- Alf

Ralf Goertz

unread,
Apr 19, 2017, 2:59:13 AM4/19/17
to
Am Tue, 18 Apr 2017 21:44:26 +0200
schrieb David Brown <david...@hesbynett.no>:

> On 18/04/17 17:15, Ralf Goertz wrote:
> > schrieb David Brown <david...@hesbynett.no>:
> >
> >> 5. You first put the angle into the range [0, π/4). Taylor series
> >> are faster and/or more accurate for a small range, centred around
> >> 0. Consider rotating another π/8 to get in the range [-π/8, π/8).
> >> Use:
> >>
> >> sin(x ± π/8) = cos(π/8).sin(x) ± sin(π/8).cos(x)
> >> cos(x ± π/8) = cos(π/8).cos(x) ± -sin(π/8).sin(x)
> >
> > This would mean to use hexo instead of octo ;-)
>
> No, still octants. But instead of using 0..45 degree octants, you
> balance them about 0 for -22.5 to +22.5 degree sectors.

What I meant is I don't get any information from the negative half of
the interval that I do not get from the positive half due to
sin(-x)=-sin(x) and cos(-x)=cos(x). Therefore, I can only utilise a
sixteenth of the full period.


David Brown

unread,
Apr 19, 2017, 3:24:36 AM4/19/17
to
On 19/04/17 01:58, Alf P. Steinbach wrote:
> On 12-Apr-17 2:57 PM, Manfred wrote:
>> On 4/11/2017 1:09 AM, Chris M. Thomasson wrote:
>>> ct_complex c = {
>>> std::cos(angle_base + angle) * radius,
>>> std::sin(angle_base + angle) * radius
>>> };
>>>
>> The pair of cos+sin calculations can be efficiently combined in a single
>> call on platforms that allow that. This makes use of the power of
>> numerical algorithms to calculate both values simultaneously in
>> significantly less time than the two separate calls.
>
> I would think simple interpolation (possibly even linear, but I'm
> thinking parabola) from a large table of very accurate precomputed
> results, is both fastest and simplest and most accurate.
>
> That is, I fail to think of any way that couldn't be.

It is a question of requirements of accuracy and speed, as well as table
space. If you only need a rough value, then a linear lookup table
(i.e., with linear interpolation between the points) can be fine. But
for more accurate values, the table quickly gets big. And you have the
choice between equally spaced points (which are fast to find), or a more
compact table which then requires a binary search to look up the points.
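
A minimal sketch of the equally-spaced flavour with linear interpolation
(table size and the [0, π/2] domain are just picked for illustration):

#include <cmath>
#include <array>
#include <cstdio>

constexpr int    table_n = 1024;
constexpr double half_pi = 1.57079632679489661923;

static std::array<double, table_n + 1> make_sin_table()
{
    std::array<double, table_n + 1> t{};
    for (int i = 0; i <= table_n; ++i)
        t[i] = std::sin(half_pi * i / table_n);
    return t;
}

static double sin_lut(double x)   // x in [0, pi/2]
{
    static const std::array<double, table_n + 1> table = make_sin_table();

    double pos  = x * (table_n / half_pi);   // index by scaling, no search
    int    i    = static_cast<int>(pos);
    if (i >= table_n) i = table_n - 1;       // guard the top end
    double frac = pos - i;
    return table[i] + frac * (table[i + 1] - table[i]);
}

int main()
{
    std::printf("%.3e\n", sin_lut(1.0) - std::sin(1.0));
}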

For my own implementations, I have often found a cubic spline to be the
best compromise. This can get you quite accurate values while keeping
the table small enough to fit in cache. It can also keep the joins at
splice points smoother, which for some applications can be more
important than the absolute accuracy.

>
> Disclaimer: haven't implemented.
>
> When you have essentially superfast sin and cos, roughly just a couple
> of arithmetic operations, themselves overwhelmed by the time used on a
> memory access or two, as the interpolation approach would give, I think
> something like sincos only has costs, primarily added complexity.
>
> But in an embedded system with low-end processor the situation might be
> that memory, instead of being a cheap throw-away resource, might be a
> very limited and costly resource. And then one would need to do run time
> computations. And preferentially, fast-ish such computations. :)
>

I /have/ implemented sine and/or cosine on a number of microcontrollers
over the years - some bigger, some smaller, some with fast
multiplication or hardware floating point support, some 8-bit devices
that don't even have a multiply instruction. There is no single "best"
method.

Ralf Goertz

unread,
Apr 19, 2017, 3:50:41 AM4/19/17
to
Am Tue, 18 Apr 2017 20:48:10 +0200
schrieb Christian Gollwitzer <auri...@gmx.de>:

> Am 18.04.17 um 17:15 schrieb Ralf Goertz:
> > Am Tue, 18 Apr 2017 14:34:25 +0200
> > schrieb David Brown <david...@hesbynett.no>:
> >> 4. Taylor series are easy to understand, but they are not the
> >> fastest converging polynomial series for sin/cos.
> >
> > Sigh, as you said, Taylor is easy!
> >
>
> Taylor is a bad choice for one simple reason: the Taylor
> approximation is exact at the point of the approximation, and with
> all derivatives of the order of the series; the error increases when
> you move away from the expansion point. So your series must be
> expanded until the error at the endpoint of the interval reaches is
> small enough, and close to the point of expansion the error is much
> smaller / precision is wasted.

Of course you are right. I did this for fun, just to see how well I can
get away with the first idea that popped up in my mind. I had a look at
the error, knowing that with this approach it would be biggest at the
upper end of my interval, i.e. at around odd multiples of 45°. Here is
what I get there:

31 0.760854470791278: 0.689540544732481 0.689540544737067 0.724247082873142 0.724247082951467
32 0.785398163397448: 0.707106781071925 0.707106781186547 0.707106781179619 0.707106781186548
33 0.809941856003619: 0.724247082873142 0.724247082951467 0.689540544732481 0.689540544737067
34 0.834485548609789: 0.740951125302102 0.740951125354959 0.671558954844023 0.671558954847018


The first number is i, the second the angle (i*π/128, as per the test
loop), and the remaining four are sin, mysin, cos and mycos. IIANM they
are correct to at least 9 digits, which is quite nice I think for a shot
in the dark.


> Instead, for numerical approximation, a series should spread the
> error evenly across the interval, which is called the a "minimax
> approximation". A Chebyshev series approximates this with
> polynomials. A very good approximation can often be done with
> rational functions (Rational Chebyshev approximation), which cost
> only a single division in the end. There is the Remez algorithm to
> compute this kind of series, which is a bit complex. A simpler
> approach yields almost the same result by least-squares fitting the
> rational function; see
> http://www.aip.de/groups/soe/local/numres/bookfpdf/f5-13.pdf
>
> Boost includes an implementation of the Remez algorithm to implement
> special functions
> http://www.boost.org/doc/libs/1_63_0/libs/math/doc/html/math_toolkit/remez.html
>
> For sin/cos there is another method, there are recurrence formulae
> for doubling or tripling the angle. If you divide the angle in the
> octant by 4, you can apply a shorter series and then use the
> angle-doubling formula twice. This was done in this article
> https://arxiv.org/pdf/cs/0406049
> This formula also generates both sine and cosine together, therefore
> should be faster than any of the aforementioned methods.

Thanks for the references. I will have a look at them. As I said my goal
was to see how far and fast I can get without consulting clever papers
on how to do it right. Not that I am not interested to learn how to do
it right ;-)

Alf P. Steinbach

unread,
Apr 19, 2017, 7:20:38 AM4/19/17
to
On 19-Apr-17 9:24 AM, David Brown wrote:
> On 19/04/17 01:58, Alf P. Steinbach wrote:
>> On 12-Apr-17 2:57 PM, Manfred wrote:
>>> On 4/11/2017 1:09 AM, Chris M. Thomasson wrote:
>>>> ct_complex c = {
>>>> std::cos(angle_base + angle) * radius,
>>>> std::sin(angle_base + angle) * radius
>>>> };
>>>>
>>> The pair of cos+sin calculations can be efficiently combined in a single
>>> call on platforms that allow that. This makes use of the power of
>>> numerical algorithms to calculate both values simultaneously in
>>> significantly less time than the two separate calls.
>>
>> I would think simple interpolation (possibly even linear, but I'm
>> thinking parabola) from a large table of very accurate precomputed
>> results, is both fastest and simplest and most accurate.
>>
>> That is, I fail to think of any way that couldn't be.
>
> It is a question of requirements of accuracy and speed, as well as table
> space. If you only need a rough value, then a linear lookup table
> (i.e., with linear interpolation between the points) can be fine. But
> for more accurate values, the table quickly gets big.

Well, with parabola fitting, and 1000 precomputed points over the first
quadrant, using 64-bit `double`, I get a maximum error of roughly

2.3893426126520012e-010

… compared to the built-in `sin`.

Increasing to 10 000 precomputed points the error reduces to

2.4825280720008891e-013

… which surprised me: 3 orders of magnitude improvement for 1 order of
magnitude increased table size.

The memory cost isn't that much: 30 000 doubles at 8 bytes each, that's
240 000 bytes, less than a quarter of a thousandth of a GB.

I haven't timed the execution, but in spite of the memory accesses I
/expect/ this function to be faster than the standard library's `sin`
(sorry, I forgot to use `constexpr`, might make a little difference):


----------------------------------------------------------------------
#include "sin_generator_values.hpp"

#define _USE_MATH_DEFINES
#include <algorithm>
#include <iostream>
#include <math.h>
using namespace std;

namespace cppx{
double const half_pi = M_PI/2;

inline auto basic_fast_sin( double const radians )
-> double
{
int const n = sin_generator_values().size();
int const i = n*radians/half_pi;
Poly2 const& poly = sin_generator_values()[i];
return (poly.a*radians + poly.b)*radians + poly.c;
}
} // namespace cppx

auto main()
-> int
{
double max_diff = 0;
for( int i = 0; i < 57; ++i )
{
double const x = i*cppx::half_pi/57;
double const std_y = sin( x );
double const my_y = cppx::basic_fast_sin( x );

max_diff = max( max_diff, abs( my_y - std_y ) );
}
cout.precision( 17 );
cout << max_diff << endl;
}
----------------------------------------------------------------------


For completeness, here's the code I used to generate the
`"sin_generator_values.hpp"` header:


----------------------------------------------------------------------
#define _USE_MATH_DEFINES
#include <iostream>
#include <math.h> // sin
#include <string>
using namespace std;

/*
Points: (t, y0) (t+d, y1) (t+2d, y2)

[0]
Ax² + Bx + C = y

[1]
At² + Bt + C = y0
A(t² + 2td + d²) + B(t+d) + C = y1
A(t² + 4td + 4d²) + B(t+2d) + C = y2

[2]
A(2td + d²) + Bd = y1 - y0
A(2td + 3d²) + Bd = y2 - y1

[3]
A2d² = (y2 - y1) - (y1 - y0) = y2 - 2y1 + y0

[4]
A = (y2 - 2y1 + y0)/(2d²)
B = ((y1 - y0) - A(2td + d²))/d
C = y0 - Bt - At²
*/

#define N 10000
#define STR_( a ) #a
#define STR( a ) STR_( a )

char const top[] =
"#pragma once\n"
"#include <array>\n"
"namespace cppx{\n"
" struct Poly2{ double a, b, c; };\n"
" inline auto sin_generator_values()\n"
" -> std::array<Poly2, " STR(N) "> const&\n"
" {\n"
" static std::array<Poly2," STR(N) "> const the_values =\n"
" {{";

char const bottom[] =
" }};\n"
" return the_values;\n"
" }\n"
"} // namespace cppx";

auto main() -> int
{
int const n = N;
double const half_pi = M_PI/2;

cout << top << endl;
cout.precision( 17 );
auto const indent = string( 3*4, ' ' );
for( int i = 0; i < n; ++i )
{
if( i > 0 ) { cout << "," << endl; }

double const t = half_pi*i/n;
double const d = half_pi/n;
double const y0 = sin( t );
double const y1 = sin( t + d );
double const y2 = sin( t + 2*d );
double const a = (y2 - 2*y1 + y0)/(2*d*d);
double const b = ((y1 - y0) - a*(2*t*d + d*d))/d;
double const c = y0 - b*t - a*t*t;

cout << indent << "{ " << a << ", " << b << ", " << c << " }";
}
cout << endl;
cout << bottom << endl;
}
----------------------------------------------------------------------


> And you have the
> choice between equally spaced points (which are fast to find), or a more
> compact table which then requires a binary search to look up the points.

Well, when one does this for speed then there's not much choice, I think.


> For my own implementations, I have often found a cubic spline to be the
> best compromise. This can get you quite accurate values while keeping
> the table small enough to fit in cache. It can also keep the joins at
> splice points smoother, which for some applications can be more
> important than the absolute accuracy.

Now don't get mathy with me. ;-) I know splines but I would have to look
them up and just copy a formula from e.g. Wikipedia.


Cheers!,

- Alf

David Brown

unread,
Apr 19, 2017, 9:40:50 AM4/19/17
to
You are only making 57 stab-in-the-dark tests, so there is a fair amount
of luck here.

>
> The memory cost isn't that much: 30 000 doubles at 8 bytes each, that's
> 240 000 bytes, less than a quarter of a thousandth of a GB.

In my world, 240 KB is /huge/ :-)

As I said, different algorithms suit different purposes. 240 KB is tiny
for some uses, but far too big for others. That is what makes this sort
of thing fun.
Surely you could generate this table directly at compile time, with
constexpr functions?

You might also get higher accuracy using long doubles for the
calculations (but only storing double precision in the table).


>
>> And you have the
>> choice between equally spaced points (which are fast to find), or a more
>> compact table which then requires a binary search to look up the points.
>
> Well, when one does this for speed then there's not much choice, I think.

Yes.

>
>
>> For my own implementations, I have often found a cubic spline to be the
>> best compromise. This can get you quite accurate values while keeping
>> the table small enough to fit in cache. It can also keep the joins at
>> splice points smoother, which for some applications can be more
>> important than the absolute accuracy.
>
> Now don't get mathy with me. ;-) I know splines but I would have to look
> them up and just copy a formula from e.g. Wikipedia.
>

For one system, I used a quadratic spline such as yours. But I
calculated the spline coefficients for cos(sqrt(x)), since this function
is close to linear. When calculating my_cos(x), I first squared x and
then used it in the lookup table and interpolation. In effect, I was
getting roughly 1 - x²/2! + x⁴/4!, fourth-power interpolation, without
using more than quadratic calculations. You can also imagine it as
using the squaring to bias the x axis spacing of the table into having
more points where the derivative is higher, while still keeping the fast
access to the table.

A 16 entry table was sufficient for 21 bits of accuracy - fine for
single precision maths.
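
Roughly like this in spirit (illustrative constants, and a simple
three-point fit standing in for the real spline coefficients - not the
code from that system):

#include <cmath>
#include <array>
#include <cstdio>

constexpr int    seg_n = 16;
constexpr double pi    = 3.14159265358979323846;
constexpr double u_max = (pi / 2) * (pi / 2);

struct Quad { double a, b, c; };   // a*u*u + b*u + c on one segment

static std::array<Quad, seg_n> make_cos_sqrt_table()
{
    std::array<Quad, seg_n> t{};
    const double d = u_max / seg_n;
    for (int i = 0; i < seg_n; ++i) {
        // fit f(u) = cos(sqrt(u)) through the segment ends and midpoint
        double u0 = i * d, u1 = u0 + 0.5 * d, u2 = u0 + d, h = 0.5 * d;
        double y0 = std::cos(std::sqrt(u0));
        double y1 = std::cos(std::sqrt(u1));
        double y2 = std::cos(std::sqrt(u2));
        double a  = (y2 - 2 * y1 + y0) / (2 * h * h);
        double b  = (y1 - y0) / h - a * (u0 + u1);
        double c  = y0 - (a * u0 + b) * u0;
        t[i] = { a, b, c };
    }
    return t;
}

static double my_cos(double x)   // assumes |x| <= pi/2
{
    static const std::array<Quad, seg_n> table = make_cos_sqrt_table();
    double u = x * x;                              // the "squaring" step
    int i = static_cast<int>(u * (seg_n / u_max));
    if (i >= seg_n) i = seg_n - 1;
    const Quad& q = table[i];
    return (q.a * u + q.b) * u + q.c;
}

int main()
{
    std::printf("%.3e\n", my_cos(1.0) - std::cos(1.0));
}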


>
> Cheers!,
>
> - Alf
>

Christian Gollwitzer

unread,
Apr 19, 2017, 2:35:27 PM4/19/17
to
Am 19.04.17 um 13:20 schrieb Alf P. Steinbach:
> Well, with parabola fitting, and 1000 precomputed points over the first
> quadrant, using 64-bit `double`, I get a maximum error of roughly
>
> 2.3893426126520012e-010
>
> … compared to the built-in `sin`.
>
> Increasing to 10 000 precomputed points the error reduces to
>
> 2.4825280720008891e-013
>
> … which surprised me: 3 orders of magnitude improvement for 1 order of
> magnitude increased table size.

Now compare this with an implementation by approximation:

http://www.netlib.org/fdlibm/k_sin.c

This implementation for one quadrant has full double precision and
requires only a couple of multiplications, which today are cheaper than
memory retrieval. It'd be interesting to compare both speed and accuracy
for a large number of values.

Christian


Alf P. Steinbach

unread,
Apr 19, 2017, 7:00:56 PM4/19/17
to
Nice. I count 8 to 10 multiplications, and 6 to 8 additions, depending
on the `iy` argument.

Argument `y` is described as “the tail of x”.

Whatever “the tail of x” is, I could compare it to the function I
posted (2 multiplications, 2 additions and like 4 memory accesses) if
you could wrap up that efficient internal basic function in a function
of one argument `x` that calculates sin(x).

I would guess that due to the memory accesses the relative performance
would depend on whether the argument values are wildly different or just
increasing, in a sequence of calls.

Still, that beast looks so lean 'n mean it's probably fastest. :)


Cheers!,

- Alf

Alf P. Steinbach

unread,
Apr 19, 2017, 7:09:57 PM4/19/17
to
No, 57 is a prime number. That means the stabs fall all over the range
of the approximation. In different parabolas, yes, but they're all the
same kind.


>>
>> The memory cost isn't that much: 30 000 doubles at 8 bytes each, that's
>> 240 000 bytes, less than a quarter of a thousandth of a GB.
>
> In my world, 240 KB is /huge/ :-)

He he, teeny tiny. :)

Well it would be huge if most every function was outfitted with that
kind of supporting baggage. But `sin` is sort of fundamental.


>> [snip]
No, sorry. First of all the standard library's `sin` isn't `constexpr`.
And secondly, the usual way to generate a table at compile time is to
use `std::index_sequence`, a type templated by the index values, and I
guess one would run into an implementation limit long before 10 000.


> You might also get higher accuracy using long doubles for the
> calculations (but only storing double precision in the table).

Yeah, maybe. But I think errors with `double` here would be in the last
digit only.



[snip]
> For one system, I used a quadratic spline such as yours. But I
> calculated the spline coefficients for cos(sqrt(x)), since this function
> is close to linear. When calculating my_cos(x), I first squared x and
> then used it in the lookup table and interpolation. In effect, I was
> getting roughly 1 - x²/2! + x⁴/4!, forth-power interpolation, without
> using more than quadratic calculations. You can also imagine it as
> using the squaring to bias the x axis spacing of the table into having
> more points where the derivative is higher, while still keeping the fast
> access to the table.
>
> A 16 entry table was sufficient for 21 bits of accuracy - fine for
> single precision maths.

Oh, that's pretty smart. :)


Cheers,

- Alf


Alain Ketterlin

unread,
Apr 20, 2017, 5:49:18 AM4/20/17
to

Sorry I'm off-topic, but I can't resist...

[...]
> No, 57 is a prime number.
[...]

Sure, it's called the "Grothendieck prime". See

https://en.wikipedia.org/wiki/57_(number)

The fact that 57 = 19*3 is irrelevant here.

-- Alain.

David Brown

unread,
Apr 20, 2017, 7:24:56 AM4/20/17
to
On 20/04/17 01:09, Alf P. Steinbach wrote:
> On 19-Apr-17 3:40 PM, David Brown wrote:
>> On 19/04/17 13:20, Alf P. Steinbach wrote:
>>> On 19-Apr-17 9:24 AM, David Brown wrote:
>>>> On 19/04/17 01:58, Alf P. Steinbach wrote:
<snip>

>> You are only making 57 stab-in-the-dark tests, so there is a fair amount
>> of luck here.
>
> No, 57 is a prime number. That means the stabs fall all over the range
> of the approximation. In different parabolas, yes, but they're all the
> same kind.

57 is perhaps better than, say, 60. But it is still just stabs. It is
just a test routine done on a PC - why not use more stabs? 92221 would
be a nice number (it's prime too).

>
>
>>>
>>> The memory cost isn't that much: 30 000 doubles at 8 bytes each, that's
>>> 240 000 bytes, less than a quarter of a thousandth of a GB.
>>
>> In my world, 240 KB is /huge/ :-)
>
> He he, teeny tiny. :)
>
> Well it would be huge if most every function was outfitted with that
> kind of supporting baggage. But `sin` is sort of fundamental.

I once worked on a program where a new feature the customer wanted took
a few hours to code - but it took a couple of days to free up the extra
/bit/ of data space needed. (It was not written in C++ :-) )

>>
>> Surely you could generate this table directly at compile time, with
>> constexpr functions?
>
> No, sorry. First of all the standard library's `sin` isn't `constexpr`.

You could write a slow but accurate constexpr sin, using the Taylor series.
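
Something along these lines (a sketch assuming C++14 relaxed constexpr;
the table uses an aggregate wrapping a plain array, which also sidesteps
the std::array issue I mention below):

constexpr double ct_pi = 3.14159265358979323846;

// Slow but accurate compile-time sine: crude range reduction plus a
// straightforward Taylor sum. Only meant for building tables.
constexpr double ct_sin(double x)
{
    while (x >  ct_pi) x -= 2 * ct_pi;
    while (x < -ct_pi) x += 2 * ct_pi;

    double term = x, sum = x;
    for (int k = 1; k <= 15; ++k) {
        term *= -(x * x) / ((2 * k) * (2 * k + 1));   // next Taylor term
        sum  += term;
    }
    return sum;
}

struct SinTable { double v[16]; };   // "looks mostly like a std::array"

constexpr SinTable make_sin_table()
{
    SinTable t{};
    for (int i = 0; i < 16; ++i)
        t.v[i] = ct_sin(i * ct_pi / 32);   // 16 samples over [0, pi/2)
    return t;
}

constexpr SinTable sin_table = make_sin_table();
static_assert(sin_table.v[0] == 0.0, "sin(0) should be exactly zero");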

> And secondly, the usual way to generate a table at compile time is to
> use `std::index_sequence`, a type templated by the index values, and I
> guess one would run into an implementation limit long before 10 000.
>

Yes, you need to be more clever than that. I remember reading somewhere
that you cannot initialise a constexpr std::array using a loop, but it
is possible to do with something that looks mostly like a std::array. I
have, unfortunately, forgotten the details - maybe I will try to figure
it out when I get the time.

If you want to use template metaprogramming for something like this, you
have to build up using binary trees of templates, rather than a simple
head++tail list, so that your depth here will be 13 rather than 10000.

>
>> You might also get higher accuracy using long doubles for the
>> calculations (but only storing double precision in the table).
>
> Yeah, maybe. But I think errors with `double` here would be in the last
> digit only.
>

Yes - but if you can avoid the error easily, why not do so?

>
>
> [snip]
>> For one system, I used a quadratic spline such as yours. But I
>> calculated the spline coefficients for cos(sqrt(x)), since this function
>> is close to linear. When calculating my_cos(x), I first squared x and
>> then used it in the lookup table and interpolation. In effect, I was
>> getting roughly 1 - x²/2! + x⁴/4!, forth-power interpolation, without
>> using more than quadratic calculations. You can also imagine it as
>> using the squaring to bias the x axis spacing of the table into having
>> more points where the derivative is higher, while still keeping the fast
>> access to the table.
>>
>> A 16 entry table was sufficient for 21 bits of accuracy - fine for
>> single precision maths.
>
> Oh, that's pretty smart. :)
>

Yes, it was fun. The result was the customer's critical control loop
running in less than 1% of the time it originally took - they were
rather pleased with that.

I made another version that used larger tables, combined with a series
of complicated DMA setups triggered by hardware timers that worked even
faster, and generated sine waves on an analogue output even while the
processor was stopped. Once I got down to 0% processor usage, it was
time to deliver the software.


Alf P. Steinbach

unread,
Apr 20, 2017, 8:21:39 AM4/20/17
to
Lols. :)

It spreads nicely anyway. Sorry about that gaffe.


Cheers!,

- Alf
