94% Instruction fetch stall.

Daniel Hill

May 31, 2016, 7:30:53 AM

to Haskell Repa

A while back I wrote an NBody sim, and ported it to Repa shortly after because of unsatisfactory performance. I was happy with that, but a few days ago I did the math on what I should be expecting from my CPU, and it was a good order of magnitude out. (800 bodies = 640,000 compares in about 16 ms; even with hundreds of instructions per compare, we are only in the hundreds of MHz range.)
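For reference, the figures above can be worked through as a small sketch. 800 bodies compared all-against-all gives 640,000 compares, and at 16 ms per step that is about 40 million compares per second. The clock speed used below is an assumed value for illustration (it is not stated in the post), and the calculation considers a single core only.

```haskell
module Main where

-- Number of pairwise comparisons when every body is compared
-- against every body (n * n, as in the post).
comparisons :: Int -> Int
comparisons n = n * n

-- Comparisons per second for a given step time in seconds.
comparesPerSecond :: Int -> Double -> Double
comparesPerSecond n stepSeconds =
  fromIntegral (comparisons n) / stepSeconds

-- Clock cycles available per comparison on one core.
-- clockHz is an assumption supplied by the caller.
cyclesPerCompare :: Double -> Int -> Double -> Double
cyclesPerCompare clockHz n stepSeconds =
  clockHz / comparesPerSecond n stepSeconds

main :: IO ()
main = do
  let n            = 800
      stepSeconds  = 0.016   -- 16 ms per step, from the post
      assumedClock = 3.0e9   -- hypothetical 3 GHz core, not from the post
  print (comparisons n)
  print (comparesPerSecond n stepSeconds)
  print (cyclesPerCompare assumedClock n stepSeconds)
```

Changing `assumedClock` to match the actual CPU gives the per-compare cycle budget to hold the measured time against.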

So I ran my code under a performance-counter profiler (AMD's code analysis), and the stats were odd. Nothing on the data side is the issue: there are very few cache misses, and the unboxed array should be small enough to fit into L1 cache (so I guess that makes sense). What is happening is that the CPU is stalling a lot while waiting to fetch the next instruction.

(As a side note, I'm running the code on Windows, and LLVM segfaults my code.)

My binary is huge (21MB).

My code: http://lpaste.net/164681 (which is a bit of a mess at the moment as I tried to disable/enable things to test the segfault / understand what could be causing the issues with instruction fetching).

My build options: -O2 -rtsopts -threaded -funfolding-use-threshold1000 -funfolding-keeness-factor1000

Ben Lippmeier

May 31, 2016, 7:41:08 AM

to haskel...@googlegroups.com

Begin forwarded message:

From: Ben Lippmeier <be...@ouroborus.net>
Subject: Re: 94% Instruction fetch stall.
Date: 31 May 2016 9:40:41 pm AEST
To: Daniel Hill <dan...@enemyplanet.geek.nz>

On 28 May 2016, at 3:07 pm, Daniel Hill <dan...@enemyplanet.geek.nz> wrote:

> A while back I wrote an NBody sim, and ported it to Repa shortly after because of unsatisfactory performance. I was happy with that, but a few days ago I did the math on what I should be expecting from my CPU, and it was a good order of magnitude out. (800 bodies = 640,000 compares in about 16 ms; even with hundreds of instructions per compare, we are only in the hundreds of MHz range.)

Yeah, this is a real problem with programming techniques that rely so heavily on general-purpose compiler optimisations. If the compiler doesn’t do what it’s supposed to, the performance of the compiled code will be awful, and there isn’t a way to guarantee that it does what it’s supposed to. In practice I end up dumping the intermediate Core code generated during compilation and checking that.
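As a minimal sketch of what dumping the Core can look like: the GHC flags `-ddump-simpl`, `-dsuppress-all` and `-ddump-to-file` write the simplified Core for a module to a `.dump-simpl` file. The `sumSquares` function below is a made-up example, not taken from the code in this thread.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl -dsuppress-all -ddump-to-file #-}
{-# LANGUAGE BangPatterns #-}
module Main where

-- A small strict loop (hypothetical example). In the dumped Core one
-- would hope to see a tight worker over unboxed values (Double#, Int#)
-- rather than allocation of boxed numbers on every iteration.
sumSquares :: Int -> Double
sumSquares n = go 0 0
  where
    go :: Double -> Int -> Double
    go !acc i
      | i >= n    = acc
      | otherwise = go (acc + fromIntegral i * fromIntegral i) (i + 1)

main :: IO ()
main = print (sumSquares 1000)
```

The same flags can be passed on the command line instead of in the pragma. In the output, signs of trouble in a hot loop include boxed constructors such as `D#` being built repeatedly, or calls that take a type class dictionary argument.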

> So I ran my code under a performance-counter profiler (AMD's code analysis), and the stats were odd. Nothing on the data side is the issue: there are very few cache misses, and the unboxed array should be small enough to fit into L1 cache (so I guess that makes sense). What is happening is that the CPU is stalling a lot while waiting to fetch the next instruction.

Probably because you still have mostly lazy code. In compiled lazy code the CPU does lots of indirect jumps (jumps to an address loaded from data memory), which confounds the branch predictor.
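To illustrate the lazy versus strict distinction in general terms, here is a small sketch using made-up vector types; it is not the code from the thread. The lazy version can leave unevaluated thunks in the fields and in the accumulator, and each thunk is later entered through a code pointer. The strict version uses strict fields and `foldl'` so values are computed as the loop runs. With `-O2`, GHC's strictness analysis may remove some of the difference on its own, so the Core is still worth checking.

```haskell
module Main where

import Data.List (foldl')

-- Lazy fields: each component may be stored as an unevaluated thunk.
data VecLazy = VecLazy Double Double

-- Strict fields: components are evaluated when the value is built.
data Vec = Vec !Double !Double

addLazy :: VecLazy -> VecLazy -> VecLazy
addLazy (VecLazy a b) (VecLazy c d) = VecLazy (a + c) (b + d)

addStrict :: Vec -> Vec -> Vec
addStrict (Vec a b) (Vec c d) = Vec (a + c) (b + d)

-- Lazy left fold: may build a chain of suspended additions.
sumLazy :: [VecLazy] -> VecLazy
sumLazy = foldl addLazy (VecLazy 0 0)

-- Strict left fold: the accumulator is forced at every step.
sumStrict :: [Vec] -> Vec
sumStrict = foldl' addStrict (Vec 0 0)

main :: IO ()
main = do
  let VecLazy lx ly = sumLazy   [VecLazy 1 2, VecLazy 3 4]
      Vec     sx sy = sumStrict [Vec 1 2, Vec 3 4]
  print (lx, ly)
  print (sx, sy)
```

Both versions compute the same result; the difference is in when the work is done and how much indirection the generated code goes through.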

> (As a side note, I'm running the code on Windows, and LLVM segfaults my code.)
>
> My binary is huge (21MB).

GHC compiled programs are statically linked, so that will be part of it.

> My code: http://lpaste.net/164681 (which is a bit of a mess at the moment as I tried to disable/enable things to test the segfault / understand what could be causing the issues with instruction fetching).

Just eyeballing it, it’s probably not inlining enough of your numerical functions. Add INLINE pragmas like the one you have for ‘gravity’ to the other functions. If GHC compiles a polymorphic function like ‘toBody’ using dictionary passing to implement the type class, then performance will be a non-starter.
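A sketch of the pragmas in question, using hypothetical stand-ins: the `Body` type and the signatures of `toBody` and `kineticEnergy` below are invented for illustration, since the real definitions are only in the linked paste. `INLINE` asks GHC to copy the function body into call sites, where the concrete type is usually known; `SPECIALIZE` asks for a dedicated copy at a given type so no class dictionary is passed at run time.

```haskell
module Main where

-- Hypothetical body record; the real type in the thread may differ.
data Body a = Body
  { bodyPos  :: !a
  , bodyVel  :: !a
  , bodyMass :: !a
  }

-- Class-polymorphic conversion. Without inlining or specialisation,
-- GHC may compile this to take Real and Fractional dictionaries as
-- extra arguments and call their methods indirectly.
toBody :: (Real a, Fractional b) => (a, a, a) -> Body b
toBody (p, v, m) = Body (realToFrac p) (realToFrac v) (realToFrac m)
{-# INLINE toBody #-}

-- Alternative to INLINE: request a monomorphic copy at Double.
kineticEnergy :: Fractional a => Body a -> a
kineticEnergy b = 0.5 * bodyMass b * bodyVel b * bodyVel b
{-# SPECIALIZE kineticEnergy :: Body Double -> Double #-}

main :: IO ()
main = print (kineticEnergy (toBody (0 :: Int, 2, 3) :: Body Double))
```

Whether the pragmas took effect can be confirmed in the dumped Core: dictionary arguments (names starting with `$f` or `$d`) should no longer appear in the inner loop.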

> My build options: -O2 -rtsopts -threaded -funfolding-use-threshold1000 -funfolding-keeness-factor1000