General question about hyper-threading (non-critical)


si guy

Mar 13, 2012, 5:43:13 AM
to golan...@googlegroups.com
Hi all!
Before I draw a number out of a hat for GOMAXPROCS, I was wondering...

I'm not sure whether to cater to hyper-threaded cores or not (i.e. 4 vs. 8). Performance tests I've run on my program yield... ambiguous results.
More procs eventually leads to thrashing, but seems to run slightly faster... sometimes...
I _think_ that even with GOMAXPROCS set to < 4, some processes are getting hyper-threaded anyway... (Is there a way to prevent this? Would that be a good idea?)
I've looked at some old discussions on the topic, but I don't have the expertise to judge their merits at all. (I know how CPUs work... maybe, but I have only an abstract idea of how the Intel pipeline works, and _no_ idea how it interacts with code compiled by gc.)
Has anyone experienced issues with gc on Intel x64 and hyper-threading before? Is there anything to look for? Any advice on where to start?
If it helps: I use 200+ MB of heap and very little stack (short functions with tiny scopes) in most of the code, no recursion of any kind, and hundreds of goroutines and channels (I hope the scheduler is cache-friendly :) ).
Not really a critical thing, just wondering if there was a simple answer (doubtful) or some decent links (everything I can find reads like a brochure: lots of fancy words and no help).
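
A plain way to compare settings is to sweep GOMAXPROCS and time a fixed workload. A minimal sketch, where runWorkload is a stand-in for the real benchmark:

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// runWorkload is a stand-in for the real benchmark: a fixed amount of
// CPU-bound work split across a handful of goroutines.
func runWorkload() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for j := 0; j < 1e7; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	wg.Wait()
}

func main() {
	for n := 1; n <= runtime.NumCPU(); n++ {
		runtime.GOMAXPROCS(n)
		start := time.Now()
		runWorkload()
		fmt.Printf("GOMAXPROCS=%d: %v\n", n, time.Since(start))
	}
}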

Please let me know if I've made mistakes in my assumptions or appear to be talking out of my... ahem... somewhere.
Platform: Debian Linux 2.6.x, 4x Intel x86, GB RAM (will be larger in production)
Thanks

-Simon

PS: if one would absolutely need to see code to help, thank you, but I'm not allowed to post the whole thing :(

Peter Bourgon

Mar 13, 2012, 6:23:19 AM
to si guy, golan...@googlegroups.com
In general, hyperthreaded cores "count" as real cores at the user
layer. runtime.NumCPU() "returns the number of logical CPUs", i.e.
inclusive of hyperthreading. And Linux reports them as distinct
in /proc/cpuinfo (albeit annotated as virtual).
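
A quick check of what the runtime sees (a minimal sketch):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("logical CPUs:", runtime.NumCPU()) // includes hyperthread siblings
	runtime.GOMAXPROCS(runtime.NumCPU())           // schedule onto all of them
}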

It's possible that your code represents some edge condition where the
distinction matters, but it's unlikely.

Hotei

Mar 13, 2012, 11:56:22 AM
to golan...@googlegroups.com
I get somewhat more consistent results with a program that is completely "splittable", so pieces of arbitrary size can be chunked off to a goroutine/CPU. When I bench it in multi-CPU mode I get about a 5.5x speedup on a 6-core AMD 1090T computing pieces in parallel, as compared to a single goroutine running the whole program.

On an Intel i5 with the same program I get a ratio of 3.5x with 2 cores and HT (compared to a single-goroutine run on the same machine). So while not quite a "whole" CPU's worth of power, I think HT is definitely worth looking at. Obviously your workload's serial/parallel nature determines the benefit, but the potential seems to be there.
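
The chunking pattern looks roughly like this (a sketch, not the benchmarked program; the doubling loop stands in for the real computation):

package main

import "sync"

// process splits data into fixed-size chunks and fans each chunk out to
// its own goroutine, then waits for all of them to finish.
func process(data []float64, chunk int) {
	var wg sync.WaitGroup
	for lo := 0; lo < len(data); lo += chunk {
		hi := lo + chunk
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(part []float64) {
			defer wg.Done()
			for i := range part {
				part[i] *= 2 // stand-in for the real computation
			}
		}(data[lo:hi])
	}
	wg.Wait()
}

func main() {
	data := make([]float64, 1e6)
	process(data, len(data)/8) // e.g. one chunk per logical CPU
}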

The surest way to prevent your code from being hyper-threaded is to not use goroutines :-( I know, not the right answer.

If you print the number of goroutines running at different places in your code, don't forget that the "main" program has a goroutine of its own, and maybe others (I/O?). I have observed up to 3 goroutines running just before program exit, even after I have collected (via WaitGroup) all those that I spawned myself.
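
You can watch this with runtime.NumGoroutine(), which counts every goroutine, including main itself and any the runtime or libraries started. A minimal sketch:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done() }()
	}
	wg.Wait()
	// Even after collecting everything we spawned, the count can be > 1:
	// main is a goroutine, and the runtime may have its own.
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}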

André Moraes

Mar 13, 2012, 12:22:55 PM
to golan...@googlegroups.com
This might help (or not), but you can make one goroutine always run on the same thread:
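
A minimal sketch, assuming runtime.LockOSThread (which pins the calling goroutine to its current OS thread); mainLoop is a stand-in for the pinned work:

package main

import "runtime"

// mainLoop stands in for the work you want pinned to one thread.
func mainLoop() {}

func main() {
	done := make(chan bool)
	go func() {
		runtime.LockOSThread()         // pin this goroutine to its OS thread
		defer runtime.UnlockOSThread() // unpin when the work is done
		mainLoop()
		done <- true
	}()
	<-done
}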


That way you can have a little more control over where things are running.

Kyle Lemons

Mar 13, 2012, 4:04:56 PM
to si guy, golan...@googlegroups.com
If it helps: I use 200+ MB of heap and very little stack (short functions with tiny scopes) in most of the code, no recursion of any kind, and hundreds of goroutines and channels (I hope the scheduler is cache-friendly :) ).

It is not. Honestly, I never use GOMAXPROCS > 1 in my production code, since the scheduler is not smart enough to manage data locality. If I had a CPU-bound application (which I do not), I might shard the data among processes instead of goroutines for the time being.
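
Sharding among processes could look something like the following sketch (the -shard flag and the worker body are purely illustrative):

package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
)

var shard = flag.Int("shard", -1, "shard index; -1 means run as the parent")

func main() {
	flag.Parse()
	if *shard >= 0 {
		// Worker: operate only on this shard's slice of the data.
		fmt.Printf("worker %d running\n", *shard)
		return
	}
	// Parent: launch one worker process per shard, so each OS process
	// (and its caches) owns a disjoint piece of the data.
	const nShards = 4
	var cmds []*exec.Cmd
	for i := 0; i < nShards; i++ {
		cmd := exec.Command(os.Args[0], "-shard", fmt.Sprint(i))
		cmd.Stdout = os.Stdout
		if err := cmd.Start(); err != nil {
			panic(err)
		}
		cmds = append(cmds, cmd)
	}
	for _, cmd := range cmds {
		cmd.Wait() // ignoring per-worker errors in this sketch
	}
}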

Rémy Oudompheng

Mar 13, 2012, 4:47:37 PM
to Kyle Lemons, si guy, golan...@googlegroups.com

For heavy data crunching where data locality does not really matter,
because you access the data just once and the work is interleaved
with I/O, using multiple threads really helps.

About hyperthreading, I have simply never managed to understand what
use cases it applies to. I don't know whether you have to share data
to make it efficient, or exactly the inverse. In what cases does it
actually improve something?

Rémy.

si guy

Mar 13, 2012, 5:15:36 PM
to golan...@googlegroups.com
Thanks for the help and insight, guys. And thanks for the warning about the scheduler, Kyle.

After digging around for metrics, I found that the one most relevant to me (best correlation with speed) was pipeline stalls (performance registers are sweet), which confuses me (I thought stalls were failures of branch prediction?? w/e). My most stable configuration seems to be with GOMAXPROCS set to logical processors minus one. Not a general rule, obviously.
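
One way to read those counters is Linux perf (a sketch; ./server is a stand-in binary, and the stalled-cycles events need a reasonably recent kernel and CPU support):

perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./server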

Just something else to add to the list of TBDs in optimization.

Thanks
-Simon

Kyle Lemons

Mar 13, 2012, 5:21:47 PM
to si guy, golan...@googlegroups.com
On Tue, Mar 13, 2012 at 2:15 PM, si guy <sjw...@gmail.com> wrote:
Thanks for the help and insight, guys. And thanks for the warning about the scheduler, Kyle.

After digging around for metrics, I found that the one most relevant to me (best correlation with speed) was pipeline stalls (performance registers are sweet), which confuses me (I thought stalls were failures of branch prediction?? w/e). My most stable configuration seems to be with GOMAXPROCS set to logical processors minus one. Not a general rule, obviously.

For a CPU-bound application, that seems reasonable; the garbage collector runs partly in a goroutine, and the scheduler may also be its own goroutine (I don't remember). I seem to recall also needing an extra "empty" core when I added networking or timers, because both can use a goroutine behind the scenes that would cause scheduling churn.

Dave Cheney

Mar 13, 2012, 5:32:22 PM
to si guy, golan...@googlegroups.com
I always thought HT worked by assigning two (or more, in the case of Oracle's Niagara processor) sets of registers and instruction pointers per real core. The instruction decoder and execution units continue to be shared by all HT logical processors, so they become the bottleneck.

Depending on whose benchmark you read, HT is either almost as good as a real processor or barely worth it. IMO the variance is explained by the type of code being executed: how well it decomposes for superscalar instruction dispatch (if all execution units are busy, HT will be ineffective), and whether the memory access patterns allow simple arithmetic operations to be interleaved while the other thread is memory-bound.

Making very broad generalizations, HT was shown to be a boon for business and productivity applications, but showed little improvement when fed transcoding or ray tracing jobs.

HTH

Dave

si guy

Mar 13, 2012, 5:37:51 PM
to golan...@googlegroups.com
Well, my app is both CPU- and network-I/O-bound, but the net I/O is totally asynchronous, so only parts of the app are I/O-bound.

The main loop is a massive set of FP calculations, since this is a physics simulator for a multiplayer game engine.

I've tried to follow a fairly traditional design: one core loop that communicates asynchronously with client-handler threads. The main loop does not wait for net I/O to continue processing. Everything assumes lag and delayed input, and everything is based on buffered inputs and available-data checks. I like it for its robustness, but it leaves me in the dark as to how to optimize, as everything is so squishy.

I will try locking my main loop to an OS thread, and locking processor affinity for that thread via syscalls, to see if that improves things. So much voodoo.
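
A sketch of that combination on Linux (sched_setaffinity(2) with pid 0 applies to the calling thread, which is why the goroutine must be locked to its thread first; error handling kept minimal):

package main

import (
	"runtime"
	"syscall"
	"unsafe"
)

// pinToCPU locks the calling goroutine to its OS thread, then binds that
// thread to a single CPU via sched_setaffinity(2). Linux-only.
func pinToCPU(cpu uint) {
	runtime.LockOSThread()
	var mask [16]uint64 // a 1024-bit CPU set
	mask[cpu/64] |= 1 << (cpu % 64)
	_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
		0, // pid 0 means the calling thread
		uintptr(len(mask)*8),
		uintptr(unsafe.Pointer(&mask[0])))
	if errno != 0 {
		panic(errno)
	}
}

func main() {
	pinToCPU(0) // run the main loop's thread on CPU 0
	// ... main simulation loop would go here ...
}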


-Simon

Dave Cheney

Mar 13, 2012, 11:33:30 PM
to si guy, golan...@googlegroups.com
As an attempt at validating my comments about HT, you may enjoy this
presentation by Cliff Click

http://www.infoq.com/presentations/click-crash-course-modern-hardware

For me, his assertions that CPU performance today is dominated by
memory accesses, and that delaying pipeline stalls as long as possible
to "preload" cache misses, were very enlightening.

Peter Bourgon

Mar 14, 2012, 5:09:58 AM
to si guy, golan...@googlegroups.com
Wow, I totally misread your original question. Sorry for the noise.
(Twice, now.)

si guy

Mar 14, 2012, 10:00:35 AM
to golan...@googlegroups.com
So, for anyone still listening: I tried the LockOSThread-and-CPU-affinity thing... It made things _much_ worse, perhaps a 5 to 10 times slowdown. I'm starting to feel like a caveman trying to tune up a Formula One car with a hammer. Best left alone, then. :)

-Simon

Michael Jones

Mar 14, 2012, 10:57:21 AM
to golan...@googlegroups.com
On Go parallelism:

I put considerable effort into making a few of the math/big routines able to compute in parallel. Despite being very careful about allocation/reuse, alignment, and cache issues, it seemed impossible to get a reliable benefit on my 8-physical-core Xeon system. The scheduler seems not up to the task (at the moment) for completely compute-bound threads. However, for threaded web applications that make frequent system and I/O calls it seems to work well indeed. None of the parallel code is checked into big yet for just that reason. Later, as the scheduler matures and if parallel library code is deemed OK, it can go back in. (It would let Go beat GMP ;-)

On Hyperthreading:

A quick comment on "but showed little improvement when fed transcoding or ray tracing jobs." The inner truth in all such schemes (Cray's 6600 barrel processor, the IBM 360/91, ATI/NVIDIA GPU implementations, Intel "SMT", etc.) is about instruction issue rate vs. execution unit dispatch rate. Each type of execution desire (say, floating-point multiply or integer divide) has a maximum sustainable issue rate based on the pipelining within said execution unit, the number of such units, and the completion rate of those units. Natively, you can always do an integer add faster than a divide, but you could have a corresponding number of dividers so that you could issue either operation on each issue cycle.

Once you get that fancy, you discover that there are not enough divides in the instruction stream to keep the dividers busy (s/divide/everything/g). One approach is to have "just enough" execution units for the common cases. This is Intel, and the common case is Windows/Office. Even so, you may have more fast-completion capability than slow-completion capability, such that when a thread does something expensive, like an integer divide, and there are dependencies that keep other computations from continuing very far, the bulk of the execution units stall with nothing to do. If only there were something to keep them busy... thus SMT, aka hyperthreading.

The relative speed between one physical/two logical CPUs in this case comes down to the instruction issue priority scheme (A whenever possible, A and B in alternation, A or B until they block) and the degree to which there are enough execution units of the particular types the threads need to minimize contention. If you had enough of each unit that you could issue two of each per clock, then the two threads, in alternating mode, would run at the same rate. If they use a mix of resources and their average uses are compatible, then you get almost that speedup. However, if you have two threads that do operations for which the CPU has few units and slow completion rates (aka floating point on Intel CPUs), then each will inhibit progress on the other.

This is enough to explain why a benchmark of hyperthreading on Intel x86 CPUs would show ray tracing with less benefit. The same on an Itanium (with more FP execution units) would show better paired performance, if not absolute performance. AMD has long allocated more resources to FP, and back in the day when other workstation CPUs mattered, they all had greater FP resources by comparison too.
--
Michael T. Jones | Chief Technology Advocate | m...@google.com | +1 650-335-5765

Jan Mercl

Mar 14, 2012, 1:14:45 PM
to golan...@googlegroups.com
On Wednesday, March 14, 2012 3:57:21 PM UTC+1, Michael Jones wrote:
Later, as the scheduler matures and if parallel library code is deemed OK, it can go back in. (It would let Go beat GMP ;-)

Looking forward to FFT multiplication ;-)

⚛

Mar 14, 2012, 2:22:52 PM
to golan...@googlegroups.com, si guy
On Wednesday, March 14, 2012 4:33:30 AM UTC+1, Dave Cheney wrote:
As an attempt at validating my comments about HT, you may enjoy this
presentation by Cliff Click

http://www.infoq.com/presentations/click-crash-course-modern-hardware

For me, his assertions that CPU performance today is dominated by
memory accesses, and that delaying pipeline stalls as long as possible
to "preload" cache misses, were very enlightening.

The average instructions-per-clock (IPC) nowadays is somewhere near 1.0. If performance were dominated by memory accesses, IPC would be much lower (0.1 and below), so I do not agree that performance is dominated by memory accesses. A multiplication instruction has approximately the same performance as a memory-access instruction.

⚛

Mar 14, 2012, 2:28:49 PM
to golan...@googlegroups.com
On Wednesday, March 14, 2012 3:57:21 PM UTC+1, Michael Jones wrote:
On Go parallelism:

I put considerable effort into making a few of the math/big routines able to compute in parallel. Despite being very careful about allocation/reuse, alignment, and cache issues, it seemed impossible to get a reliable benefit on my 8-physical-core Xeon system. The scheduler seems not up to the task (at the moment) for completely compute-bound threads. However, for threaded web applications that make frequent system and I/O calls it seems to work well indeed. None of the parallel code is checked into big yet for just that reason. Later, as the scheduler matures and if parallel library code is deemed OK, it can go back in. (It would let Go beat GMP ;-)

Was the cost of creating a goroutine (and returning the resources used by a goroutine when it is destroyed) too high?

Alternatively: if your implementation preallocated a couple (8) of worker goroutines, was the cost of sending work units through a channel and resuming the receiving goroutine too high?

Paul Borman

Mar 14, 2012, 2:37:01 PM
to ⚛, golan...@googlegroups.com, si guy
You are confusing L1-cached memory with L2-cached memory with L3-cached memory with actual memory. Not all memory accesses are from the L1 cache. The situation became even worse on Intel when the Nehalem line was introduced. For L1-cached memory one would hope you would be right.

Kyle Lemons

Mar 14, 2012, 2:37:37 PM
to ⚛, golan...@googlegroups.com
In my experience, it's the context switching (think three goroutines on two threads: [A B], then [B C], then [C A], and so on), because goroutines have no affinity for threads. Hence the scheduler needs to get smarter before GOMAXPROCS > 1 broadly introduces improvements.

⚛

Mar 14, 2012, 3:26:37 PM
to golan...@googlegroups.com, ⚛, si guy
I stand by my previous statement: CPU performance today is not dominated by memory accesses.

Joubin Houshyar

Mar 14, 2012, 3:43:20 PM
to golang-nuts
You appear to hold a contrary view to some of the leading lights in
the field; Dr. Click is hardly a lone voice holding this view. I am
typically not persuaded by arguments that appeal to authority, but
Cliff Click is a serious practitioner in the field and also an
independent, out-of-the-box thinker (cf. his FSM-based concurrent
non-blocking data structures). Possibly the issue is over the
semantics of his word choice of "dominated". Would "critically
constrained by memory access" work better for you? (I would
appreciate any citations you have for the IPC average you mention.)

Paul Borman

Mar 14, 2012, 3:52:41 PM
to Kyle Lemons, ⚛, golan...@googlegroups.com
I did quite a bit of research into multi-threaded scheduling several years back. As soon as threads move between chips their performance suffers greatly because of memory access (they lose their cache). Migrating between cores on a single chip is not nearly as bad, but there is still a penalty compared to staying on a single core. Multi-threaded programming can grossly distort this if threads are on different chips, as it can often induce extreme memory contention and slowdown. Again, this is exaggerated by the Nehalem architecture. BTW, I think Nehalem was a great step forward, but it did make memory more NUMA than before. CPU performance is dominated by application/workload-specific constraints. Memory access is often one of those constraints.

Michael Jones

Mar 14, 2012, 4:13:40 PM
to Paul Borman, Kyle Lemons, ⚛, golan...@googlegroups.com
One anecdote on cache and memory is in the math/big benchmarks of
String and Scan, where you can see performance fall off a cliff
(beyond the underlying non-linearity in the algorithms) as the inner
computation's "working set" grows beyond the size of the cache
hierarchy.

cd $GOROOT/src/pkg/math/big
go test -test.bench="Scan.*Base10"
go test -test.bench="String.*Base10"

C'est la vie

--

⚛

Mar 16, 2012, 2:14:01 PM
to golan...@googlegroups.com
On Wednesday, March 14, 2012 8:43:20 PM UTC+1, Joubin Houshyar wrote:
On Mar 14, 2:22 pm, ⚛ <0xe2.0x9a.0...@gmail.com> wrote:
> On Wednesday, March 14, 2012 4:33:30 AM UTC+1, Dave Cheney wrote:
>
> > As an attempt at validating my comments about HT, you may enjoy this
> > presentation by Cliff Click
>
> > http://www.infoq.com/presentations/click-crash-course-modern-hardware
>
> > For me, his assertions that CPU performance today is dominated by
> > memory accesses, and that delaying pipeline stalls as long as possible
> > to "preload" cache misses, were very enlightening.
>
> The average instructions-per-clock (IPC) nowadays is somewhere near 1.0.
> If performance were dominated by memory accesses, IPC would be much lower
> (0.1 and below), so I do not agree that performance is dominated by memory
> accesses. A multiplication instruction has approximately the same
> performance as a memory-access instruction.

You appear to hold a contrary view to some of the leading lights in
the field; Dr. Click is hardly a lone voice holding this view. I am
typically not persuaded by arguments that appeal to authority, but
Cliff Click is a serious practitioner in the field and also an
independent, out-of-the-box thinker (cf. his FSM-based concurrent
non-blocking data structures).

I did not mean any disrespect.
  
Possibly the issue is over the
semantics of his word choice of "dominated". Would "critically
constrained by memory access" work better for you?

I would prefer: software and algorithms need to take the memory architecture (DRAM + caches) into account.

In my opinion, Java as a programming language ignores the memory architecture. There isn't any explicit control over the placement of objects in memory, except for value types (int, byte, ...). If Java leads some people to the conclusion that "CPU performance is dominated by DRAM memory accesses", it should be noted that this is partially Java's own fault.
 
(I would appreciate any citations
you have for the IPC average you mention.)

I do not know what kind of workload you are running. Measuring the IPC of a running program on Linux is easy with the 'perf' tool from the Linux kernel, so you can measure IPC (as well as cache misses) on your computer to find out whether it significantly differs from 1.0.
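
For example (the derived instructions-per-cycle figure appears in perf stat's summary on reasonably recent versions; ./yourprog is a stand-in):

perf stat ./yourprog
perf stat -e instructions,cycles,cache-misses ./yourprog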
 