It's possible that your code represents some edge condition where the
distinction matters, but it's unlikely.
If it helps: I use 200+ MB of heap and very little stack (short functions with tiny scopes) in most of the code, no recursion of any kind, and hundreds of goroutines and channels (I hope the scheduler is cache-friendly :) ).
For heavy data crunching where data locality does not really matter
because you just access it once and it's interleaved with I/O, using
multiple threads really helps.
About hyperthreading, I have simply never managed to understand which
use cases it applies to. I don't know whether you have to share data
to make it efficient, or exactly the opposite. In what cases does it
actually improve anything?
Rémy.
Thanks for the help and insight, guys. And thanks for the warning about the scheduler, Kyle.
After digging around for metrics, I found that the one most relevant to me (best correlation with speed) was pipeline stalls (performance registers are sweet), which confuses me - I thought stalls were failures of branch prediction? My most stable configuration seems to be with GOMAXPROCS set to logical processors minus one. Not a general rule, obviously.
Just something else to add to the list of TBDs in optimization.
Thanks
-Simon
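For reference, the setting described above is a single call into the standard runtime package; a minimal sketch:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Leave one logical processor free, as in the configuration above.
	n := runtime.NumCPU() - 1
	if n < 1 {
		n = 1
	}
	prev := runtime.GOMAXPROCS(n)
	fmt.Printf("GOMAXPROCS %d -> %d\n", prev, n)
}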
Depending on whose benchmark you read, HT is almost as good as a real processor, or barely worth it. IMO the variance is explained by the type of code being executed: how well it decomposes for superscalar instruction dispatch (if all execution units are already busy, HT will be ineffective), and whether the memory access patterns allow simple arithmetic operations to be interleaved with memory-bound ones.
Making very broad generalizations, HT was shown to be a boon for business and productivity applications, but showed little improvement when fed transcoding or ray-tracing jobs.
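One rough way to see this on your own machine: time a compute-bound loop against a memory-bound loop at different GOMAXPROCS settings. A sketch - the workload sizes are arbitrary and the timings only indicative:

package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"time"
)

// flops is compute-bound: the FP units stay busy, so a hyperthread
// sharing the core has little left to work with.
func flops(iters int) float64 {
	x := 1.0001
	for i := 0; i < iters; i++ {
		x = x*x - 0.0001*x
	}
	return x
}

// chase is memory-bound: it walks a random permutation far larger
// than cache, so the core mostly waits on misses and a second
// hyperthread can fill the gaps.
func chase(perm []int, steps int) int {
	i := 0
	for s := 0; s < steps; s++ {
		i = perm[i]
	}
	return i
}

// timeWorkers runs n copies of work concurrently and returns the
// wall-clock time until all of them finish.
func timeWorkers(n int, work func()) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			work()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	perm := rand.Perm(1 << 23) // 8M ints, well beyond any cache
	for _, n := range []int{1, runtime.NumCPU()} {
		runtime.GOMAXPROCS(n)
		fmt.Printf("n=%2d compute=%v memory=%v\n", n,
			timeWorkers(n, func() { flops(1 << 26) }),
			timeWorkers(n, func() { chase(perm, 1<<23) }))
	}
}

If HT helps your code, the memory-bound column should keep improving past the physical core count while the compute-bound one flattens out.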
HTH
Dave
The main loop is a massive set of FP calculations, since this is a physics simulator for a multiplayer game engine.
I've tried to follow some sort of traditional design: one core loop that communicates asynchronously with client handler threads. The main loop does not wait for net I/O to continue processing. Everything assumes lag and delayed input, and everything is based on buffered inputs and available-data checks. I like it for its robustness, but it leaves me in the dark as to how to optimize, as everything is so squishy.
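A stripped-down sketch of that pattern - Input, clientCh, apply, and simulateStep are placeholder names, not the real engine code:

package main

import "time"

// Input is a hypothetical message sent by a client handler goroutine.
type Input struct {
	Player int
	Cmd    string
}

func apply(in Input) { /* update player state from buffered input */ }
func simulateStep()  { /* advance the physics one tick */ }

// mainLoop never waits on net I/O: each tick it drains whatever input
// has already arrived (the available-data check), then steps anyway.
func mainLoop(clientCh <-chan Input) {
	tick := time.NewTicker(16 * time.Millisecond) // ~60 Hz
	defer tick.Stop()
	for range tick.C {
	drain:
		for {
			select {
			case in := <-clientCh:
				apply(in)
			default:
				break drain // nothing queued; carry on
			}
		}
		simulateStep()
	}
}

func main() {
	ch := make(chan Input, 1024) // buffered, so handlers rarely block
	go mainLoop(ch)
	ch <- Input{Player: 1, Cmd: "jump"}
	time.Sleep(50 * time.Millisecond)
}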
I will try locking my main loop to an OS thread, and pinning that thread's processor affinity via syscalls, to see if that improves things. So much voodoo.
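Something like this might work for the pinning - an untested sketch; runtime.LockOSThread is standard library, while the affinity call assumes golang.org/x/sys/unix and is Linux-only:

package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU locks the calling goroutine to its OS thread, then binds
// that thread to a single logical CPU (Linux only).
func pinToCPU(cpu int) error {
	runtime.LockOSThread() // this goroutine now owns the thread
	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread" for sched_setaffinity.
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinToCPU(0); err != nil {
		panic(err)
	}
	// ... run the main simulation loop on this pinned thread ...
}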
-Simon
As an attempt at validating my comments about HT, you may enjoy this
presentation by Cliff Click:
http://www.infoq.com/presentations/click-crash-course-modern-hardware
For me, his assertions that CPU performance today is dominated by
memory accesses, and that pipeline stalls are delayed as long as
possible to 'preload' cache misses, were very enlightening.
On Go parallelism: I put considerable effort into making a few of the math/big routines able to compute in parallel. Despite being very careful about allocation/reuse, alignment, and cache issues, it seemed impossible to get a reliable benefit on my 8-physical-core Xeon system. The scheduler seems not up to the task (at the moment) for completely compute-bound threads. However, for threaded web applications that make frequent system and I/O calls it seems to work well indeed. None of the parallel code is checked into big yet for just that reason. Later, as the scheduler matures and if library code is deemed OK, it can go back in. (It would let Go beat GMP ;-)
cd $GOROOT/src/pkg/math/big
go test -test.bench="Scan.*Base10"
go test -test.bench="String.*Base10"
C'est la vie
--
On Mar 14, 2:22 pm, ⚛ <0xe2.0x9a.0...@gmail.com> wrote:
> On Wednesday, March 14, 2012 4:33:30 AM UTC+1, Dave Cheney wrote:
>
> > As an attempt at validating my comments about HT, you may enjoy this
> > presentation by Cliff Click
>
> > http://www.infoq.com/presentations/click-crash-course-modern-hardware
>
> > For me, his assertion that CPU performance today is dominated by
> > memory accesses, and delaying pipeline stalls long as possible to
> > 'preload' cache misses, were very enlightening.
>
> The average instructions-per-clock (IPC) nowadays is somewhere near 1.0. If
> performance were dominated by memory accesses, IPC would be much lower (0.1
> and below) - so I do not agree that performance is dominated by memory
> accesses. A multiplication instruction has approximately the same
> performance as a memory access instruction.
You appear to hold a contrary view to some of the leading lights in
the field - Dr. Click is hardly a lone voice holding this view. I am
typically not persuaded by arguments that resort to authorities, but
Cliff Click is a serious practitioner in the field and also an
independent, out-of-the-box thinker (cf. his FSM-based concurrent
non-blocking data structures). Possibly the issue is over the
semantics of his word choice of "dominated". Would "critically
constrained by memory access" work better for you? (I would
appreciate any citations you have for the IPC average you mention.)
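(If you want your own numbers: on Linux, perf stat reads the same counters Simon mentioned, e.g.

perf stat -e cycles,instructions ./yourprog

and reports the ratio as "insn per cycle". Exact event names vary by CPU; that is just an example invocation.)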