nbody code - any suggestions for improvement?

Ian Ozsvald

unread,

Jun 1, 2011, 1:08:55 PM6/1/11

to shedskin...@googlegroups.com

Here's the nbody code I'm using for the EuroPython tutorial. I'm
comparing CPython, PyPy, Cython, ShedSkin and a few others tools (and
looking at profiling first), the goal is to teach people about their
options for making CPU-bound code faster.

The nbody code is based on the language shootout here:
http://shootout.alioth.debian.org/u32/performance.php?test=nbody

I'm not worried about whether the test is really 'apples vs apples', I
just care about the rough orders of magnitude. E.g. CPython on their
Intel box takes 20 mins, JavaScript V8 takes 71 seconds, Fortran, C
and Java take about 20 seconds each. Their site seems be undergoing an
upgrade right now - the graphs don't work and you can't seem to get to
the source code (but I have a copy of the nbody code).

I had to make a minor change to the code so that it would compile in
ShedSkin (the main dictionary had lists and floats, now it just has
lists *of* floats and I dereference mass where required). Here's the
code:
http://dl.dropbox.com/u/1314015/nbody_shedskin.py
- it is just a couple of functions, 'advance' is the monster that eats
all the time.

Roughly speaking for 50,000,000 iterations (e.g. 'python2.7
nbody_shedskin.py 50000000) on my MacBook 2GHz:
CPython 35 mins
PyPy 1.5 JIT 5mins11sec
ShedSkin0.8 -l 2min28sec
ShedSkin0.8 -l -b -w 1min56sec # e.g. shedskin -l -b -w
nbody_shedskin.py; make; ./nbody_shedskin 50000000

Ultimately this means that the compiled and 'best' version of ShedSkin
that I can make (and I'm hoping you can spot any flaws I've made...)
is still beaten by JavaScript V8! I'd love to be able to announce
better figures during my tutorial at EuroPython. However - ShedSkin
does beat PyPy (and PyPy nicely beats CPython). These are great
results, I'd just like to know if I've missed anything obvious in the
benchmark.

In each of the above 4 test cases I confirm that only 1 CPU (50% of my
dual-core MacBook) is used. I'm using CPython 2.7 32bit.

Any feedback gratefully received,
Ian.

--
Ian Ozsvald (A.I. researcher, screencaster)
i...@IanOzsvald.com

http://IanOzsvald.com
http://SocialTiesApp.com/
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald

Mark Dufour

unread,

Jun 1, 2011, 2:21:55 PM6/1/11

to shedskin...@googlegroups.com

hi ian,

thanks for all your mails :) I will try to reply to all of them soon. for now, please see attachment for a version of your code that is almost twice as fast after compilation.

the main difference is that I added a 'body' class, and use attributes instead of dicts and tuples to hold the physical constants. dict and tuple indexing is very slow compared to simple attribute accesses in C/C++. attributes are slower in cpython I guess, and that's probably the reason this shootout implementation was coded up as it was, but there you have it. imnsho, this version is also more readable.. ;)

thanks!
mark.

--
You received this message because you are subscribed to the Google Groups "shedskin-discuss" group.
To post to this group, send email to shedskin...@googlegroups.com.
To unsubscribe from this group, send email to shedskin-discu...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/shedskin-discuss?hl=en.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

nbody2.py

Brent Pedersen

unread,

Jun 1, 2011, 2:24:05 PM6/1/11

to shedskin...@googlegroups.com

On Wed, Jun 1, 2011 at 12:21 PM, Mark Dufour <mark....@gmail.com> wrote:
> hi ian,
>
> thanks for all your mails :) I will try to reply to all of them soon. for
> now, please see attachment for a version of your code that is almost twice
> as fast after compilation.
>
> the main difference is that I added a 'body' class, and use attributes
> instead of dicts and tuples to hold the physical constants. dict and tuple
> indexing is very slow compared to simple attribute accesses in C/C++.
> attributes are slower in cpython I guess, and that's probably the reason
> this shootout implementation was coded up as it was, but there you have it.
> imnsho, this version is also more readable.. ;)
>
>
> thanks!
> mark.

Heh, I was just writing to explain that I did almost identical here:
http://paste.pocoo.org/show/399045/

Mark Dufour

unread,

Jun 1, 2011, 2:32:58 PM6/1/11

to shedskin...@googlegroups.com

Heh, I was just writing to explain that I did almost identical here:
http://paste.pocoo.org/show/399045/

ah, I was probably faster because I practiced this.. :S

http://gitorious.org/shedskin/mainline/blobs/master/examples/nbody.py

note that here the algorithm has also been improved (by douglas mcneill iirc).

thanks!!
mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Brent Pedersen

unread,

Jun 1, 2011, 2:39:48 PM6/1/11

to shedskin...@googlegroups.com

On Wed, Jun 1, 2011 at 12:32 PM, Mark Dufour <mark....@gmail.com> wrote:
>
>> Heh, I was just writing to explain that I did almost identical here:
>> http://paste.pocoo.org/show/399045/
>>
>
> ah, I was probably faster because I practiced this.. :S
>
> http://gitorious.org/shedskin/mainline/blobs/master/examples/nbody.py
>

cool! I didn't know you could use an empty class like that and just
assign the attributes after.
That--and yours-- is also faster since it's not indexing into the
global MASS thing--which seems
weird even for a normal python implementation.

Also, Ian, you can make the code shorter using itertools.combinations:

PAIRS = list(combinations(SYSTEM, 2))

not sure of the effect on speed.

> note that here the algorithm has also been improved (by douglas mcneill
> iirc).
>
>
> thanks!!
> mark.
> --
> http://www.youtube.com/watch?v=E6LsfnBmdnk
>

Ian Ozsvald

unread,

Jun 1, 2011, 5:09:38 PM6/1/11

to shedskin...@googlegroups.com

Gentlemen - many thanks :-)

Mark - I'm using your version. Brent, cheers for your version too.

Re. the version in gitorious, it evolves differently so I'll ignore it
(trying not to investigate too many things at once!), I hadn't
realised that someone had tried this already :-) If I'd have looked in
the examples/ directory I'd have known!

My goal is to try to keep the programs mostly the same (I only changed
the shedskin version to make it compile) and to try various tools to
make the code faster. I'm being pragmatic and trying to teach how I
make code faster+maintainable for clients (and often - clients who
don't want to learn new things or change the way they support
things!).

More tomorrow I guess,
i.

--

Ian Ozsvald

unread,

Jun 1, 2011, 5:11:47 PM6/1/11

to shedskin...@googlegroups.com

Interestingly - I think I can claim that the Mark/Brent version would
beat the V8 Javascript benchmark. In terms of squeezing a lot of
performance out of a piece of code with little work, it is quite
impressive.

Brent - I note your point about the odd indexing approach in the
original code. I agree, I found it quite tricky to read. However, that
often occurs in client HPC projects and if I go changing their code
too much, they reject the alterations in favour of what they know.
That'll be part of the story I tell.

i.

Thomas Spura

unread,

Jun 1, 2011, 8:02:06 PM6/1/11

to shedskin...@googlegroups.com, mark....@gmail.com

On Wed, 1 Jun 2011 20:32:58 +0200
Mark Dufour wrote:

> > Heh, I was just writing to explain that I did almost identical here:
> > http://paste.pocoo.org/show/399045/
> >
> >
> ah, I was probably faster because I practiced this.. :S
>
> http://gitorious.org/shedskin/mainline/blobs/master/examples/nbody.py
>
> note that here the algorithm has also been improved (by douglas
> mcneill iirc).
>
>
> thanks!!
> mark.

When changing the devision of distande**3 to /
(distance*distance*distance) the time drops to 55-60%.

Maybe the power function could still need some love to be optimized...

Thomas

Isaac Gouy

unread,

Jun 1, 2011, 6:15:09 PM6/1/11

to shedskin-discuss

On Jun 1, 10:08 am, Ian Ozsvald <i...@ianozsvald.com> wrote:
-snip-

> Their site seems be undergoing an upgrade right now - the graphs don't work and you
> can't seem to get to the source code (but I have a copy of the nbody code).

1) The benchmarks game website is OK now - please try refreshing your
browser cache.

2) You can web browse CVS to see Python 2 and Python 3 implementations
of n-body

http://anonscm.debian.org/viewvc/shootout/shootout/bench/nbody/?hideattic=0

3) And there's an implementation specifcally for PyPy

http://anonscm.debian.org/viewvc/shootout/shootout/bench/nbody/nbody.py?hideattic=0&revision=1.2&view=markup

Ian Ozsvald

unread,

Jun 2, 2011, 4:10:46 AM6/2/11

to shedskin...@googlegroups.com

Thanks Thomas, that's on my list to try (but I'm switching to a
mandelbrot solver for now). Hopefully I can return to this tomorrow.
i.

Mark Dufour

unread,

Jun 2, 2011, 4:20:37 AM6/2/11

to shedskin...@googlegroups.com

On Wed, Jun 1, 2011 at 11:11 PM, Ian Ozsvald <i...@ianozsvald.com> wrote:

Interestingly - I think I can claim that the Mark/Brent version would
beat the V8 Javascript benchmark. In terms of squeezing a lot of
performance out of a piece of code with little work, it is quite
impressive.

I played a bit with GCC flags, and tried something suggested to me a while a go.. -ffast-math. IIUC, we basically tell the CPU here that it shouldn't care about the precise IEEE specification in weird corner cases, such as dividing infinity by -1, things you may never want to occur in your code anyway.

CCFLAGS=-O2 -march=native

srepmub@akemi:~/shedskin$ ./nbody2 10000000
-0.169075164
-0.169077842
Took: 0:00:12.129899

CCFLAGS=-O3 -ffast-math (so without -march=native!)

srepmub@akemi:~/shedskin$ ./nbody2 10000000
-0.169075164
-0.169077842
Took: 0:00:02.625101

I wasn't sure why the difference is this big, until I read games usually don't care about IEEE preciseness....

so I should probably add this to the 'performance tips' section of the tutorial.. :-)

thanks,
mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 2, 2011, 4:41:52 AM6/2/11

to shedskin...@googlegroups.com

The website is better - last night I saw the graphs again but the
source code links still failed, this morning I see that the src code
pages work again too. Cool.

Re. the pypy version - maybe I'm staring at it too much but it looks
very much like the cpython version. What's different?

i.

> --
> You received this message because you are subscribed to the Google Groups "shedskin-discuss" group.
> To post to this group, send email to shedskin...@googlegroups.com.
> To unsubscribe from this group, send email to shedskin-discu...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/shedskin-discuss?hl=en.
>
>

--

Ian Ozsvald

unread,

Jun 2, 2011, 4:58:09 AM6/2/11

to shedskin...@googlegroups.com

IIRC ffast-math will use register-based short-cut arithmetic which can
have shorter precision than IEEE based arithmetic. From memory it'll
give you more rounding errors but should run faster. Normally my goal
is to preserve absolute precision (e.g. for physics - double precision
is already imprecise enough!). Definitely one for the tips section
though, many apps don't need all that precision.

E.g.
http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Optimize-Options.html
"This option is not turned on by any -O option since it can result in
incorrect output for programs which depend on an exact implementation
of IEEE or ISO rules/specifications for math functions. It may,
however, yield faster code for programs that do not require the
guarantees of these specifications. "

I'm checking it here:
shedskin0.8 on MacOS X (no native flag anyhow), 02 takes 1min16 using
your class-based code from yesterday.
Adding -ffast-math - no change in speed (result exactly the same)
-O3 and --fast-math - no change in speed (result exactly the same)

So my GCC on a Core2 Duo Macbook doesn't show any improvement (boo),
but that's probably because GCC is already using my hardware fairly
efficiently (yay). Anyhow, I've got to move on to the next task! Maybe
for you the difference was more to do with the native flag (you didn't
say if you'd tried that independently?)?

i.

Mark Dufour

unread,

Jun 2, 2011, 5:10:16 AM6/2/11

to shedskin...@googlegroups.com

So my GCC on a Core2 Duo Macbook doesn't show any improvement (boo),
but that's probably because GCC is already using my hardware fairly
efficiently (yay). Anyhow, I've got to move on to the next task! Maybe
for you the difference was more to do with the native flag (you didn't
say if you'd tried that independently?)?

I tried it independently just now, and it's definitely -ffast-math.

thanks,
mark.
--

http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 2, 2011, 5:14:51 AM6/2/11

to shedskin...@googlegroups.com

Interesting...what's your CPU? My Core2 Duo is old, it might be that
mine isn't that clever and yours is smarter?

Mark Dufour

unread,

Jun 2, 2011, 5:23:01 AM6/2/11

to shedskin...@googlegroups.com

On Thu, Jun 2, 2011 at 11:14 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:

Interesting...what's your CPU? My Core2 Duo is old, it might be that
mine isn't that clever and yours is smarter?

mine's a bit newer I guess, but not that much:

model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 2, 2011, 5:41:32 AM6/2/11

to shedskin...@googlegroups.com

Certainly my Snow Leopard's g++ is older than most (4.2.1 - a common
complaint from Mac users). Since -ffast-math is potentially unsafe I'm
not so worried but it is nice to know that the option can do something
useful (it was the only real 'trick' I had back as a Senior Programmer
if IEEE precision wasn't required!).

Mark Dufour

unread,

Jun 2, 2011, 5:44:42 AM6/2/11

to shedskin...@googlegroups.com

On Thu, Jun 2, 2011 at 11:41 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:

Certainly my Snow Leopard's g++ is older than most (4.2.1 - a common
complaint from Mac users). Since -ffast-math is potentially unsafe I'm
not so worried but it is nice to know that the option can do something
useful (it was the only real 'trick' I had back as a Senior Programmer
if IEEE precision wasn't required!).

it would be interesting to see what happens when you install a recent linux distro on there, with GCC 4.5 or 4.6.. ;-)

mark.

--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 2, 2011, 5:46:43 AM6/2/11

to shedskin...@googlegroups.com

Sadly that's a right pain on MacOS and/or might get in the way of
system libs. I know people do upgrade GCC but I'd frankly be a bit
scared! I've nuked this machine once, I'm not losing a day again like
that :-) I hope to give the timings another go on my bigger
physics-office machine in a few weeks (but that's Windows - does
ShedSkin work with MSVC?).
i.

Mark Dufour

unread,

Jun 2, 2011, 5:52:58 AM6/2/11

to shedskin...@googlegroups.com

On Thu, Jun 2, 2011 at 11:46 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:

Sadly that's a right pain on MacOS and/or might get in the way of
system libs. I know people do upgrade GCC but I'd frankly be a bit

I didn't mean to upgrade GCC, but to install a nice linux distro on a separate partition.. :-) this distribution will probably upgrade your GCC every 6 months or so for you.

scared! I've nuked this machine once, I'm not losing a day again like
that :-) I hope to give the timings another go on my bigger
physics-office machine in a few weeks (but that's Windows - does
ShedSkin work with MSVC?).

well, it doesn't really have to, because the windows version comes with GCC (4.5.. :-)).. but there is a hidden (-v) flag, to generate more or less MSVC compatible output (including makefile). I haven't heard of anyone trying this recently, so I actually just made the option hidden for 0.8.. there's also a hidden 'pypy' compatibility (iirc, -p) mode that someone sent a patch for at one point.

mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 2, 2011, 7:02:21 AM6/2/11

to shedskin...@googlegroups.com

Mark suggested I try -march=native as it'll enable SSE2 (I confess I'd
forgotten that - the native architecture switches did give me small
benefits in the past on e.g. Pentium, AMD64 specific platforms).

Annoyingly this switch doesn't work in my g++ (4.2.1), the suggestion
online is to use:
-m64 -mtune=core2
in its place. This doesn't make it run any faster. I also added
--fast-math but the speed didn't change.

Can someone else confirm Mark's -ffast-math switch improves
performance without changing the numerical output?

Ian.

Mark Dewing

unread,

Jun 3, 2011, 12:37:28 AM6/3/11

to shedskin-discuss

I tried it using gcc 4.5.0 (on OpenSuse 11.3) and the version of
nbody.py in the shedskin examples (0.7.1.3).
Running with the default parameters on a Core 2 Duo, I get

4.0 seconds for -march=native
0.4 seconds for -ffast-math

One potential issue is the code raises values to the power of 0.5
rather than calling sqrt. When I change the power to sqrt (either in
the python file or in the cpp file), the -march=native time drops to
2.8s. (The -ffast-math time is unaffected).

Mark

On Jun 2, 6:02 am, Ian Ozsvald <i...@ianozsvald.com> wrote:
> Mark suggested I try -march=native as it'll enable SSE2 (I confess I'd
> forgotten that - the native architecture switches did give me small
> benefits in the past on e.g. Pentium, AMD64 specific platforms).
>
> Annoyingly this switch doesn't work in my g++ (4.2.1), the suggestion
> online is to use:
> -m64 -mtune=core2
> in its place. This doesn't make it run any faster. I also added
> --fast-math but the speed didn't change.
>
> Can someone else confirm Mark's -ffast-math switch improves
> performance without changing the numerical output?
>
> Ian.
>

Ian Ozsvald

unread,

Jun 3, 2011, 4:11:14 AM6/3/11

to shedskin...@googlegroups.com

I'm definitely missing something at this end it seems :-) I'll try
changing **2 -> sqrt, right now I'm arguing with a set of mandelbrot
solvers (I'll probably submit the shedskin version here shortly for
more suggestions!).

Cheers,
Ian.

http://IanOzsvald.com

Mark Dufour

unread,

Jun 3, 2011, 4:22:04 AM6/3/11

to shedskin...@googlegroups.com

On Fri, Jun 3, 2011 at 10:11 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:

I'm definitely missing something at this end it seems :-) I'll try
changing **2 -> sqrt, right now I'm arguing with a set of mandelbrot
solvers (I'll probably submit the shedskin version here shortly for
more suggestions!).

I also see the speedup with just -O2 -ffast-math (without -march=native), so it's starting to look like GCC 4.2 is missing something.. :-)

a speedup of 10 times seems absurd though! (thanks mark, btw!)

mark.
--

http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 3, 2011, 5:05:12 AM6/3/11

to shedskin...@googlegroups.com

Yeah, I've just tried a bunch of optimisation flags on my mandelbrot
problem and they barely change the speed at all. I think my gcc might
be out of date. I might have found a drag/drop newer version of gcc
for Snow Leopard (but I'm wary about conflicts).

I might just use your timing results in my talk!

i.

Mark Dufour

unread,

Jun 4, 2011, 9:52:08 AM6/4/11

to Thomas Spura, shedskin...@googlegroups.com

When changing the devision of distande**3 to /

(distance*distance*distance) the time drops to 55-60%.

Maybe the power function could still need some love to be optimized...

hmm, seems we forgot to look at the float**int case for 0.7.1. this should fix that for common powers of 2 and 3:

http://gitorious.org/shedskin/mainline/commit/7eee6c957bfda20aba79cf85229ceae684045680

GCC will inline the crap out of this, so there's no real difference with straight multiplication now I think.

thanks for noting!!
mark.
--
http://www.youtube.com/watch?v=E6LsfnBmdnk

Ian Ozsvald

unread,

Jun 4, 2011, 2:02:12 PM6/4/11

to shedskin...@googlegroups.com

Is there any value in adding x**0.5 -> sqrt(x) in there, or is that
going too far? I've seen the pattern on a bunch of occasions (and
again it cropped up for the nbody problem), maybe it'll help compilers
in a few situations?
i.

Reply all

Reply to author

Forward