Single-threaded calculation of hex digits of pi (compared to Python)

Frederick Gotham

Jan 20, 2020, 6:08:14 AM

I have multi-posted this to 'comp.lang.c' and 'comp.lang.c++'. I would have gotten more flak if I had cross-posted.

I have taken single-threaded code written in Python to calculate hex digits of pi, and I have ported it to C and C++ to compare speed. Both C and C++ versions are faster than the Python version. The C version is about 10% faster than Python. For some reason there is a 20% speed difference between the C and C++ versions, with the C++ one being faster. I don't know why.

All versions use the GNU Multiprecision library (-lgmp).

Here's the original Python code:

#!/usr/bin/env python

from gmpy import mpq, mpz

def Discard_Whole(arg):
    return arg-mpz(arg)

def next_pi(n,x):
    top_line = mpz( n*(n*120-89) + 16 )
    bottom_line = mpz( n*(n*(n*(512*n-1024)+712)-206) + 21 )
    p = mpq(top_line,bottom_line)
    q = mpq(p + x*16)
    z = Discard_Whole(q)
    return z

def allpi():
    from sys import stdout
    n = 1
    last_retval = 0
    while 1:
        this_retval = next_pi(n,last_retval)
        digit = 16 * this_retval
        stdout.write("%x" % digit)
        last_retval = this_retval
        if 80000 == n:
            break
        n += 1

allpi()
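
For anyone who wants to check the recurrence without installing gmpy, here is a minimal sketch using only the standard library. It assumes Python 3 and swaps gmpy's mpq for fractions.Fraction, which is exact but much slower, so it is only meant for a handful of digits. The names term and hex_digits are mine, not from the script above. Each step computes x = frac(16*x + top/bottom) and takes floor(16*x) as the next hex digit.

```python
from fractions import Fraction

def term(n):
    """Rational added at step n (same polynomials as next_pi above)."""
    top = n * (n * 120 - 89) + 16
    bottom = n * (n * (n * (512 * n - 1024) + 712) - 206) + 21
    return Fraction(top, bottom)

def hex_digits(count):
    """Return the first `count` hex digits of pi after the point."""
    x = Fraction(0)
    out = []
    for n in range(1, count + 1):
        x = (16 * x + term(n)) % 1      # keep only the fractional part
        out.append("%x" % int(16 * x))  # floor(16 * x) is the next digit
    return "".join(out)

if __name__ == "__main__":
    print(hex_digits(16))
```

Because x stays an exact rational, its denominator grows with every step, which is presumably where most of the run time goes in all three versions.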

And here's the C/C++ code:

#include <stdio.h>

//#define USE_CXX

#ifdef USE_CXX
# include <gmpxx.h>
#else
# include <gmp.h>
#endif

#ifdef USE_CXX
typedef mpz_class Bint;
typedef mpq_class Brat;
#else
typedef mpz_t Bint;
typedef mpq_t Brat;
#endif

int main(void)
{
#ifdef USE_CXX
    Brat const_rat_16(16ul,1ul);
#else
    Brat const_rat_16;
    mpq_init(const_rat_16);
    mpq_set_ui(const_rat_16,16ul,1ul);
#endif

#ifdef USE_CXX
    Brat tmp;
    Bint &top_line = tmp.get_num();
    Bint &bottom_line = tmp.get_den();
    Bint tmp_for_integral_part;
    Brat last_retval{0};
    Bint n{1};
#else
    Brat tmp;
    mpq_init(tmp);
    Bint *const top_line = (Bint*)mpq_numref(tmp);
    Bint *const bottom_line = (Bint*)mpq_denref(tmp);
    Bint tmp_integral_part;
    mpz_init2(tmp_integral_part,256u);
    Brat last_retval;
    mpq_init(last_retval);
    Bint n;
    mpz_init2(n,256u);
    mpz_set_ui(n,1u);
#endif

    double digit;

    for (; /* ever */ ;)
    {
#ifdef USE_CXX
        top_line = n;
        top_line *= 120u;
        top_line -= 89u;
        top_line *= n;
        top_line += 16u;
#else
        mpz_mul_ui(*top_line,n,120u);
        mpz_sub_ui(*top_line,*top_line,89u);
        mpz_mul(*top_line,*top_line,n);
        mpz_add_ui(*top_line,*top_line,16u);
#endif

#ifdef USE_CXX
        bottom_line = n;
        bottom_line *= 512u;
        bottom_line -= 1024u;
        bottom_line *= n;
        bottom_line += 712u;
        bottom_line *= n;
        bottom_line -= 206u;
        bottom_line *= n;
        bottom_line += 21u;

        tmp.canonicalize();
#else
        mpz_mul_ui(*bottom_line,n,512u);
        mpz_sub_ui(*bottom_line,*bottom_line,1024u);
        mpz_mul(*bottom_line,*bottom_line,n);
        mpz_add_ui(*bottom_line,*bottom_line,712u);
        mpz_mul(*bottom_line,*bottom_line,n);
        mpz_sub_ui(*bottom_line,*bottom_line,206u);
        mpz_mul(*bottom_line,*bottom_line,n);
        mpz_add_ui(*bottom_line,*bottom_line,21u);

        mpq_canonicalize(tmp);
#endif

#ifdef USE_CXX
        last_retval += tmp;
        tmp_for_integral_part = last_retval;
        last_retval -= tmp_for_integral_part;
        last_retval *= 16u; /* Don't multiply by another rational here */
        digit = last_retval.get_d();
#else
        mpq_add(last_retval,last_retval,tmp);
        mpz_set_q(tmp_integral_part,last_retval);
        mpq_set_z(tmp,tmp_integral_part);
        mpq_sub(last_retval,last_retval,tmp);
        mpq_mul(last_retval,last_retval,const_rat_16);
        digit = mpq_get_d(last_retval);
#endif

        printf("%01X", (unsigned)digit);

#ifdef USE_CXX
        ++n;

        if ( 80000ul == n )
            break;
#else
        mpz_add_ui(n,n,1u);

        if ( 0 == mpz_cmp_ui(n,80000ul) )
            break;
#endif
    }

    fflush(stdout);
}

I compiled the C and C++ versions with "-O3 -DNDEBUG". It doesn't make sense to me that the C++ one is faster if the GNU Multiprecision C++ classes are just an interface/wrapper around the C code.

Anyone got any ideas?

Frederick

Paavo Helde

Jan 20, 2020, 7:15:42 AM

On 20.01.2020 13:08, Frederick Gotham wrote:
>
> I have multi-posted this to 'comp.lang.c' and 'comp.lang.c++'. I would have gotten more flak if I had cross-posted.
>
> I have taken single-threaded code written in Python to calculate hex digits of pi, and I have ported it to C and C++ to compare speed. Both C and C++ versions are faster than the Python version. The C version is about 10% faster than Python. For some reason there is a 20% speed difference between the C and C++ versions, with the C++ one being faster. I don't know why.
>
> All versions use the GNU Multiprecision library (-lgmp).

[...]

> I compiled the C and C++ versions with "-O3 -DNDEBUG". It doesn't make sense to me that the C++ one is faster if the GNU Multiprecision C++ classes are just an interface/wrapper around the C code.
>
> Anyone got any ideas?

Code speed depends on many factors. 20% is not so much and might be
specific to your compiler version, hardware and hardware-specific
compiler options. So there are no quick answers.

Anyway, one usual source of speed differences is function inlining.
It might be that for some reason the compiler can inline the C++ Bint *=
operator, but not the C mpz_mul_ui() function call. You can see if this
is the case by studying the generated assembly.

My 1-minute googling shows that indeed mpz_mul_ui() is only declared in
gmp.h while Bint *= seems to be both declared and defined in gmpxx.h.
If so, the compiler/linker needs to do much more work to get
functions like mpz_mul_ui() inlined. For that you would probably need to
pass more compiler flags to switch on whole program optimization and
even then it is not certain that it can be done.

Even if the C version does not inline the gmp library calls, it is not
certain this is the actual reason for the slowdown. Even if it is the
reason with your artificial test case, the slowdown might not appear
with real data. Etc.

Also, note that -O3 is not guaranteed to always produce the fastest code.
Sometimes it can optimize the wrong thing and produce slower code. In
any case, you should try -O2 as well.
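
Since a 20% gap can come from noise as easily as from the code, it helps to time each build several times before comparing. Below is a rough sketch of such a harness, assuming Python 3. The function name best_time and the binary names in the usage comment are made up for illustration. It takes the fastest of several runs and discards the program's output.

```python
import subprocess
import sys
import time

def best_time(argv, repeats=5):
    """Run argv `repeats` times; return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard the digits so terminal output does not dominate the timing.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    # Hypothetical usage: python bench.py ./pi_c_O2 ./pi_c_O3 ./pi_cxx_O3
    for path in sys.argv[1:]:
        print("%-20s %.3f s" % (path, best_time([path])))
```

Taking the minimum rather than the mean is a deliberate choice here: it filters out runs that were disturbed by other processes on the machine.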

Öö Tiib

Jan 20, 2020, 7:17:17 AM

What is the supposed benefit of splicing the .c and .cpp files into
one "c/c++" file using the preprocessor? For me that just makes it
more tedious to read and harder to follow.

> I compiled the C and C++ versions with "-O3 -DNDEBUG". It doesn't make sense to me that the C++ one is faster if the GNU Multiprecision C++ classes are just an interface/wrapper around the C code.

Actually, C++ code is very rarely a dumb, bloated wrapper around
C code. Modern C++ contains a fair number of performance
features that C does not have, so it would be odd to expect
that library maintainers (typically decent specialists) ignore
those features.

>
> Anyone got any ideas?

The best idea for performance questions is to use a profiler. If
you do not have a commercial profiler then use an open-source one.
For example, on Windows try this one:
<http://www.codersnotes.com/sleepy/>. It works with MSVC and
MinGW/gcc. For other platforms there are several similar tools.

Commercial profiling tools can be easier to integrate into some
kind of DevOps Continuous Integration pipeline and will produce
more sophisticated and/or browser-friendly output, but accuracy-wise
the open-source tools are precise enough. Also, having fewer
bells and whistles may even help you grasp the main point of
those tools a bit faster.
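
The same idea applies to the Python version, which can be profiled with the standard library's cProfile. The following is only a sketch, assuming Python 3. It uses fractions.Fraction in place of gmpy so it runs anywhere, and the names run and profile_report are mine. Its numbers will therefore not match the gmpy script, but it shows the shape of the output a profiler gives you.

```python
import cProfile
import io
import pstats
from fractions import Fraction

def next_pi(n, x):
    """One step of the recurrence: frac(16*x + top/bottom)."""
    top = n * (n * 120 - 89) + 16
    bottom = n * (n * (n * (512 * n - 1024) + 712) - 206) + 21
    return (16 * x + Fraction(top, bottom)) % 1

def run(count):
    """Return the first `count` hex digits after the point."""
    x = Fraction(0)
    digits = []
    for n in range(1, count + 1):
        x = next_pi(n, x)
        digits.append("%x" % int(16 * x))
    return "".join(digits)

def profile_report(count=300, top=8):
    """Profile run(count) and return the `top` costliest entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    run(count)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()

if __name__ == "__main__":
    print(profile_report())
```

Sorting by cumulative time shows which calls the loop spends its time under. For the C and C++ builds, a native profiler would play the same role.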

Juha Nieminen

Jan 20, 2020, 8:48:37 AM

Frederick Gotham <cauldwel...@gmail.com> wrote:
> I compiled the C and C++ versions with "-O3 -DNDEBUG".

By the way, I would recommend using "-Ofast -march=native".

(And even better, find out which arch 'native' is choosing, and if
it doesn't match your actual computer closely, choose one that does.
For example -march=skylake, or whichever processor family you
are using.)

With certain kinds of code, choosing the proper -march can have a
surprisingly large effect in terms of speed. (I have absolutely no
idea if it will make a difference in this particular case, but it
doesn't hurt to try.)

-Ofast might generate faster code than -O3 (because it turns on
optimization options that -O3 does not.)