Performance of SMT on Atom

Anton Ertl

Nov 7, 2009, 12:17:37 PM

We recently got a Zotac IONATX A board with a 1600MHz Atom N330 CPU, which
supports SMT (or "hyperthreading" in Intel's marketingspeak).

We tested it using our LaTeX benchmark
<http://www.complang.tuwien.ac.at/anton/latex-bench/>. It runs in
2.3s-2.4s (in 32-bit mode), about the same speed as a 900MHz Athlon,
or a little faster than a 1066MHz PPC 7447A, or about 5 times slower
than a 3GHz Core 2 Duo.

Then we tested the performance when other processes were running.
With 4 hardware threads (2 cores with two threads each), we ran three
processes doing "yes >/dev/null" and one process running our LaTeX
benchmark. The results varied, but we saw user times of 5.5s and 6s
for the LaTeX benchmark.

Just for comparison, we turned off hyperthreading in the BIOS, and ran
the same setup again (i.e., 3 yes processes and one latex process).
This time we saw 2.3s-2.4s user time for the latex benchmark and 4.7s
real time for the latex benchmark.

So, at least for this benchmark setup, hyperthreading is a significant
loss on the Atom.
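
As a rough check on these numbers, the slowdown can be worked out directly. The sketch below uses only the timings reported above. The assumption that four runnable processes share two cores equally when hyperthreading is off is an editorial one, not something measured in the post.

```python
# Timings reported in the post (seconds).
ALONE = (2.3, 2.4)         # user time, LaTeX benchmark on an idle machine
HT_ON_LOADED = (5.5, 6.0)  # user time with three "yes" processes, HT on
HT_OFF_REAL = 4.7          # real time with three "yes" processes, HT off


def slowdown_range(baseline, loaded):
    """Return (smallest, largest) slowdown factor between two timing ranges."""
    return loaded[0] / baseline[1], loaded[1] / baseline[0]


def expected_real_time(user_time, processes, cores):
    """Real time if `processes` runnable processes share `cores` equally.

    Assumption (not from the post): a perfectly fair scheduler and no
    other overhead.
    """
    share = min(1.0, cores / processes)
    return user_time / share


if __name__ == "__main__":
    low, high = slowdown_range(ALONE, HT_ON_LOADED)
    print(f"HT on: user time grows by a factor of {low:.2f} to {high:.2f}")
    for user in ALONE:
        est = expected_real_time(user, processes=4, cores=2)
        print(f"HT off: {user}s user time -> about {est:.1f}s real time "
              f"(measured: {HT_OFF_REAL}s)")
```

Under that fair-sharing assumption the estimate of 4.6s to 4.8s brackets the measured 4.7s. Since a single-threaded process cannot finish in less real time than its user time, the run with hyperthreading on took at least 5.5s, so it also finished later than the run with hyperthreading off.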

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Bernd Paysan

Nov 7, 2009, 4:17:55 PM

Anton Ertl wrote:
> So, at least for this benchmark setup, hyperthreading is a significant
> loss on the Atom.

Probably not a real surprise. The Atom is in-order, and SMT probably
helps when you have many cache misses. Cache misses in the LaTeX
benchmark should be rare.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Anton Ertl

Nov 8, 2009, 1:29:36 PM

Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>> So, at least for this benchmark setup, hyperthreading is a significant
>> loss on the Atom.
>
>Probably not a real surprise. The Atom is in-order, and SMT probably
>helps when you have many cache misses. Cache misses in the LaTeX
>benchmark should be rare.

They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
where I measured cache misses. Ideally SMT would also help when the
functional units are not completely utilized even with loads hitting
the D-cache (which is probably quite frequent on an in-order machine),
but I don't know if that's the case for the Atom.

In any case, no speedup from SMT is one thing, but a significant
slowdown is pretty disappointing. Unless you know that you run lots
of code that benefits from SMT, it's probably better to disable SMT on
the Atom.

nm...@cam.ac.uk

Nov 8, 2009, 1:57:02 PM

In article <2009Nov...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Bernd Paysan <bernd....@gmx.de> writes:
>>Anton Ertl wrote:
>>> So, at least for this benchmark setup, hyperthreading is a significant
>>> loss on the Atom.
>>
>>Probably not a real surprise. The Atom is in-order, and SMT probably
>>helps when you have many cache misses. Cache misses in the LaTeX
>>benchmark should be rare.
>
>They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
>where I measured cache misses. Ideally SMT would also help when the
>functional units are not completely utilized even with loads hitting
>the D-cache (which is probably quite frequent on an in-order machine),
>but I don't know if that's the case for the Atom.
>
>In any case, no speedup from SMT is one thing, but a significant
>slowdown is pretty disappointing. Unless you know that you run lots
>of code that benefits from SMT, it's probably better to disable SMT on
>the Atom.

And not just on the Atom. I ran some tests on the Core i7, and got
a degradation of throughput by using more threads. My limited
experience is that applies to virtually anything where the bottleneck
is memory accesses. There MAY be some programs where SMT helps with
cache misses, but I haven't seen them.

Where I think that it helps is with heterogeneous process mixtures;
e.g. one is heavy on floating-point, another on memory accesses, and
another on branching. I could be wrong, as that's based on as much
guesswork as knowledge, but it matches what I know.

Regards,
Nick Maclaren.

Andy "Krazy" Glew

Nov 8, 2009, 3:18:50 PM

to nm...@cam.ac.uk

This is interesting. What Nick says about heterogeneous workloads is certainly
true - e.g. a compute intensive non-cache missing thread to switch to
when a memory intensive thread cache misses.
(Or, rather, that is always running, and which keeps running when the memory
intensive thread cache misses.)

However, in theory two memory intensive threads should be able to coexist
- computing when the other thread is idle. E.g. two cache missing pointer chasing
threads should be able to practically double throughput.
(I've usually been on the other side of this argument, since as comp.arch
knows I am the leading exponent of single threaded MLP architectures.
My opponents in industry would usually say "Can't you just get MLP from TLP?"
and I would have to say "Yes, but...".)
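
The "practically double" argument can be made concrete with a very simple latency-hiding model. This is an illustrative sketch, not anything from the post: the function name and the cycle counts are made up, and the model assumes that misses from different threads overlap fully, that only one thread computes at a time, and that the threads do not evict each other's cache lines.

```python
def smt_speedup(compute_cycles, miss_cycles, threads=2):
    """Idealised throughput gain from running `threads` identical threads
    on one core, relative to running a single thread.

    Each thread repeats: compute for `compute_cycles`, then stall on a
    cache miss for `miss_cycles`. Assumptions (for illustration only):
    misses overlap perfectly, only one thread computes at a time, and
    there is no cache or bandwidth contention between threads.
    """
    period = compute_cycles + miss_cycles
    # With n threads, at most n iterations complete per period, but the
    # core also cannot deliver more than one cycle of compute per cycle.
    return min(threads, period / compute_cycles)


if __name__ == "__main__":
    # Pointer chasing: almost no work between long misses.
    print(smt_speedup(compute_cycles=10, miss_cycles=200))
    # Compute-heavy code with rare, short misses.
    print(smt_speedup(compute_cycles=200, miss_cycles=10))
```

In this model the pointer-chasing case reaches the full factor of two, while the compute-heavy case gains only about 5%. Any of the effects listed below (a split instruction window, cache thrashing) would reduce these figures.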

That so many people find threading a lossage for memory intensive workloads
(and it is not just these comp.arch posters - most people in the supercomputer
community disable hyperthreading) implies

a) workloads that are already highly MLP, e.g. throughput limited workloads

b) lousy threading microarchitectures. Which is typical - so many Intel processors
arbitrarily split the instruction window in half, giving half to the compute intensive
threads which do not need the window, and only half to the cache missing thread
which can use more.

c) contention between threads - e.g. thrashing out of useful D$ state.

It's ironic - take a long latency L3 cache miss to DRAM, and the chances of more such
are increased - because the other threads, which may only be taking L1 misses to L2,
are thrashing your state out of the caches. Positive feedback.

nm...@cam.ac.uk

Nov 8, 2009, 4:04:18 PM

In article <4AF727A...@patten-glew.net>,

Andy "Krazy" Glew <ag-...@patten-glew.net> wrote:
>>
>> And not just on the Atom. I ran some tests on the Core i7, and got
>> a degradation of throughput by using more threads. My limited
>> experience is that applies to virtually anything where the bottleneck
>> is memory accesses. There MAY be some programs where SMT helps with
>> cache misses, but I haven't seen them.
>>
>> Where I think that it helps is with heterogeneous process mixtures;
>> e.g. one is heavy on floating-point, another on memory accesses, and
>> another on branching. I could be wrong, as that's based on as much
>> guesswork as knowledge, but it matches what I know.
>
>This is interesting. What Nick says about heterogeneous workloads is certainly
>true - e.g. a compute intensive non-cache missing thread to switch to
>when a memory intensive thread cache misses.
>(Or, rather, that is always running, and which keeps running when the memory
>intensive thread cache misses.)
>
>However, in theory two memory intensive threads should be able to coexist
>- computing when the other thread is idle. E.g. two cache missing pointer chasing
>threads should be able to practically double throughput.

Yes. I am puzzled by the slowdowns I have seen, and which have been
reported to me by reliable sources, but none of us have had the time
to investigate the matter in depth. The issue is certainly rather
more complicated than the simplistic analyses make out.

It is quite possible that my description above is also simplistic,
and assigns the cause incorrectly.

Regards,
Nick Maclaren.

Bernd Paysan

Nov 8, 2009, 4:53:53 PM

Andy "Krazy" Glew wrote:
> However, in theory two memory intensive threads should be able to coexist
> - computing when the other thread is idle. E.g. two cache missing pointer
> chasing threads should be able to practically double throughput.

How? Let's assume there is only one memory interface: the pointer
chasing threads will ask for new memory data as soon as they got their
previous cache-line, and therefore just compete for the same resource.
There can be an advantage when there are two memory interfaces, so the
chance of having them both busy is 50% - the throughput then should go
up to 150% in SMT mode (or even more with three memory interfaces).
However, as the Core i7 is already a quad-core, four native threads
already compete for three memory channels, and therefore, adding SMT
threads can't possibly help for this kind of stuff.

If there's a moderate cache miss rate, and the process is still doing
useful work between the memory requests, so the memory bandwidth is used
up only to 50% (which also takes about 50% of the execution time), then
SMT should help.
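
The 150% figure for two memory interfaces can be reproduced with a small probability model. This is a sketch under stated assumptions, not a description of any real memory controller: every thread always has exactly one outstanding miss, each miss goes to a channel chosen uniformly at random, and a channel serves one request at a time. The function name is made up for illustration.

```python
from fractions import Fraction


def expected_busy_channels(channels, threads):
    """Expected number of memory channels with at least one request.

    Model (an assumption for illustration): each of `threads` threads
    has one outstanding miss, sent to a uniformly random channel.
    """
    # Probability that a given channel receives none of the requests.
    idle_probability = (1 - Fraction(1, channels)) ** threads
    return channels * (1 - idle_probability)


if __name__ == "__main__":
    for channels in (1, 2, 3):
        one = expected_busy_channels(channels, threads=1)
        two = expected_busy_channels(channels, threads=2)
        print(f"{channels} channel(s): two threads give "
              f"{float(two / one):.0%} of one thread's throughput")
```

With one channel the second thread adds nothing, with two channels the model gives 150%, and with three it gives about 167%, in line with the estimate above.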

Niels Jørgen Kruse

Nov 8, 2009, 5:33:55 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
> >Anton Ertl wrote:
> >> So, at least for this benchmark setup, hyperthreading is a significant
> >> loss on the Atom.
> >
> >Probably not a real surprise. The Atom is in-order, and SMT probably
> >helps when you have many cache misses. Cache misses in the LaTeX
> >benchmark should be rare.
>
> They certainly are on the machines (IIRC Athlon 64 and Pentium 4)
> where I measured cache misses. Ideally SMT would also help when the
> functional units are not completely utilized even with loads hitting
> the D-cache (which is probably quite frequent on an in-order machine),
> but I don't know if that's the case for the Atom.
>
> In any case, no speedup from SMT is one thing, but a significant
> slowdown is pretty disappointing. Unless you know that you run lots
> of code that benefits from SMT, it's probably better to disable SMT on
> the Atom.

Running 'yes' may be quite L1 unfriendly, depending on the size of IO
buffer. Perhaps 4 copies of Latex would run better.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Bernd Paysan

Nov 8, 2009, 6:05:07 PM

Niels Jørgen Kruse wrote:
> Running 'yes' may be quite L1 unfriendly, depending on the size of IO
> buffer. Perhaps 4 copies of Latex would run better.

Or for Anton, something like

gforth -e ": endless begin again ; endless"

which would just branch in an endless loop (no memory resources used,
cache footprint minimal, just one slot of the branch target prediction).

Chris Gray

Nov 8, 2009, 6:18:15 PM

Bernd Paysan <bernd....@gmx.de> writes:

> How? Let's assume there is only one memory interface: the pointer
> chasing threads will ask for new memory data as soon as they got their
> previous cache-line, and therefore just compete for the same resource.
> There can be an advantage when there are two memory interfaces, so the
> chance of having them both busy is 50% - the throughput then should go
> up to 150% in SMT mode (or even more with three memory interfaces).
> However, as the Core i7 is already a quad-core, four native threads
> already compete for three memory channels, and therefore, adding SMT
> threads can't possibly help for this kind of stuff.

I didn't do performance stuff on the Tera MTA, but I'm thinking that
from this discussion you could view it as having 128-way SMT, with
only one memory interface. However, that one memory interface could
have an outstanding fetch for each of the 128 threads, so maybe that
means it had 128 memory interfaces for this purpose. Things did
speed up with more threads running. Perhaps the relative costs of
the various activities were so different that the comparison doesn't
work?

Excuse my ignorance here - are today's memory systems limited to one
outstanding fetch per CPU memory interface?

--
Experience should guide us, not rule us.

Chris Gray c...@GraySage.COM
http://www.Nalug.ORG/ (Lego)
http://www.GraySage.COM/cg/ (Other)

Robert Myers

Nov 8, 2009, 10:01:21 PM

On Nov 8, 3:18 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:
> n...@cam.ac.uk wrote:
> > In article <2009Nov8.192...@mips.complang.tuwien.ac.at>,
> > Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

I don't know how you can discuss hyper-threading without discussing
the scheduler. There is a recent discussion on lkml.org, which seems,
well, primitive

http://lkml.org/lkml/2009/10/28/287

It refers to an Intel document

http://software.intel.com/sites/oss/pdfs/mclinux.pdf

which also seems primitive.

As to "memory-intensive." Does someone really mean "memory-bound?"
If something is memory-bound, which many HPC applications are, that's
it. Either you optimally use bandwidth or you don't. If a single
thread is memory-bound, then SMT is a loser. If a single thread on a
single core is memory bound, then using more than one core is a loser,
too.

Robert.

Andrew Reilly

Nov 9, 2009, 2:52:35 AM

On Sun, 08 Nov 2009 19:01:21 -0800, Robert Myers wrote:

> I don't know how you can discuss hyper-threading without discussing the
> scheduler.

Why is that? I thought that schedulers were largely ignorant of SMT
threads, other than, perhaps, as pairs of cores with fully-shared cache.
Should the scheduler take notice of the uber-NUMA characteristics of
the pair of shared virtual processors and schedule only appropriately-
matching processes on each? I think that there is a certain amount of
NUMA awareness in most modern (Unix) schedulers, but no doubt there could
be more. I haven't heard of any that (for example) opt to schedule a
process with active FPU state and one without on the same physical CPU.
Could be interesting? It seems to me from this discussion that it's not
at all clear what characteristics would ideally be selected-for, in
making such a decision. [*] Have threads from the same process share an
SMT core, on the grounds that they might also share hot cache rows, and
save some fetches, or have them use separate cores, on the grounds that
they want to work on separate data, and more cache is better?

Seems like an intractable problem to me.

Maybe we could add some sort of a notion of "progress made good" hint
that applications could provide to the OS, so that it could have a better
chance at scheduling them stochastically?

[*] We often hear of loads that perform worse with SMT. Are there
equivalent rules of thumb for load classes that *do* show improvement
with HyperThreading turned on?

Cheers,

--
Andrew

nm...@cam.ac.uk

Nov 9, 2009, 2:59:48 AM

In article <87aaywf...@ami-cg.GraySage.com>,

Chris Gray <c...@graysage.com> wrote:
>
>I didn't do performance stuff on the Tera MTA, but I'm thinking that
>from this discussion you could view it as having 128-way SMT, with
>only one memory interface. However, that one memory interface could
>have an outstanding fetch for each of the 128 threads, so maybe that
>means it had 128 memory interfaces for this purpose. Things did
>speed up with more threads running. Perhaps the relative costs of
>the various activities were so different that the comparison doesn't
>work?

Yes. That is why the effects are somewhat puzzling.

>Excuse my ignorance here - are today's memory systems limited to one
>outstanding fetch per CPU memory interface?

No. But the rules are non-trivial.

Regards,
Nick Maclaren.

Robert Myers

Nov 9, 2009, 3:16:48 AM

On Nov 9, 2:52 am, Andrew Reilly <areilly...@bigpond.net.au> wrote:
> On Sun, 08 Nov 2009 19:01:21 -0800, Robert Myers wrote:
> > I don't know how you can discuss hyper-threading without discussing the
> > scheduler.
>
> Why is that? I thought that schedulers were largely ignorant of SMT
> threads, other than, perhaps, as pairs of cores with fully-shared cache.
> Should the scheduler take notice of the uber-NUMA characteristics of
> the pair of shared virtual processors and schedule only appropriately-
> matching processes on each? I think that there is a certain amount of
> NUMA awareness in most modern (Unix) schedulers, but no doubt there could
> be more. I haven't heard of any that (for example) opt to schedule a
> process with active FPU state and one without on the same physical CPU.
> Could be interesting? It seems to me from this discussion that it's not
> at all clear what characteristics would ideally be selected-for, in
> making such a decision. [*] Have threads from the same process share an
> SMT core, on the grounds that they might also share hot cache rows, and
> save some fetches, or have them use separate cores, on the grounds that
> they want to work on separate data, and more cache is better?
>
> Seems like an intractable problem to me.
>
> Maybe we could add some sort of a notion of "progress made good" hint
> that applications could provide to the OS, so that it could have a better
> chance at scheduling them stochastically?

The current Linux scheduler is SMT-aware. It knows which "processors"
are on the same core and will load balance so that two CPU-hungry
threads won't compete on the same physical core.
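
On Linux systems that expose CPU topology through sysfs, the same sibling information can be inspected from user space. The sketch below is not how the scheduler itself works. It only reads the `thread_siblings_list` files under `/sys/devices/system/cpu`, and the helper names are made up for illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,2" or "0-3,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return the distinct sets of logical CPUs that share a physical core."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_cpu_dir).glob(pattern):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    for group in sibling_groups():
        kind = "SMT siblings" if len(group) > 1 else "no SMT sibling"
        print(f"logical CPUs {list(group)}: {kind}")
```

A group with more than one entry means those logical CPUs are hardware threads of the same core, which is the situation the load balancer tries to avoid filling with two CPU-hungry threads while other cores are idle.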

I can imagine all kinds of possibilities that would monitor activity
in more detail and attempt to place threads accordingly, but I've
heard no one who would be up to the task propose such a thing.

> [*] We often hear of loads that perform worse with SMT. Are there
> equivalent rules of thumb for load classes that *do* show improvement
> with HyperThreading turned on?

The best results I saw for the P4 were as much as a 35% improvement
for a chess-playing game. Lots of pointer-chasing?

Robert.

James Van Buskirk

Nov 9, 2009, 12:04:13 PM

"Robert Myers" <rbmye...@gmail.com> wrote in message
news:1f518c33-55c6-4b2c...@j4g2000yqe.googlegroups.com...

http://www.mikusite.de/pages/x86.htm

Scroll down to the last table of results, compare Intel Dual Xeon
Nacona 2800 MHz HT on/off in the FPU speed column: 320.813/177.028
million iterations/second.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Rick Jones

Nov 9, 2009, 1:21:40 PM

nm...@cam.ac.uk wrote:
> And not just on the Atom. I ran some tests on the Core i7, and got
> a degradation of throughput by using more threads. My limited
> experience is that applies to virtually anything where the
> bottleneck is memory accesses.

By that I presume you mean throughput?

> There MAY be some programs where SMT helps with cache misses, but I
> haven't seen them.

Wouldn't they be alluded to in some of the SPECcpu2006 "rate"
benchmarks published with HT on vs off? The "base" rules require that
all benchmarks run the same number of copies, so loss vs gain may be
obscured, but peak allows different numbers of copies for each
benchmark, so one might see copy number changes from base to peak as
suggesting something about the effectiveness of HT for that benchmark.

rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

nm...@cam.ac.uk

Nov 9, 2009, 2:20:15 PM

In article <hd9mjk$use$3...@usenet01.boi.hp.com>,

Rick Jones <rick....@hp.com> wrote:
>
>> And not just on the Atom. I ran some tests on the Core i7, and got
>> a degradation of throughput by using more threads. My limited
>> experience is that applies to virtually anything where the
>> bottleneck is memory accesses.
>
>By that I presume you mean throughput?

Yes.

>> There MAY be some programs where SMT helps with cache misses, but I
>> haven't seen them.
>
>Wouldn't they be alluded to in some of the SPECcpu2006 "rate"
>benchmarks published with HT on vs off? The "base" rules require that
>all benchmarks run the same number of copies, so loss vs gain may be
>obscured, but peak allows different numbers of copies for each
>benchmark, so one might see copy number changes from base to peak as
>suggesting something about the effectiveness of HT for that benchmark.

Yes. As I said, I haven't had time to study this area in depth.

Regards,
Nick Maclaren.

Gavin Scott

Nov 9, 2009, 2:27:54 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> So, at least for this benchmark setup, hyperthreading is a significant
> loss on the Atom.

Just for anecdote, on my dual Nehalem E5530 Dell T7500, if I run a
3D rendering test I see linear speedup going from 1->2->4->8 threads,
then about a 20-25% improvement going from 8->16 threads which seems
pretty good to me.

I haven't tried physically disabling hyperthreading, so this assumes
Windows (Vista 64) scheduler doesn't suck completely.

G.

Rick Jones

Nov 9, 2009, 2:38:04 PM

I've always ass-u-me-d that HT helped when the application(s) in use
were unable to generate as many cache misses per unit time as the
processor(s) and memory subsystem could handle. Put another, perhaps
overly flippant way, HT helps with crappy programs produced by
crappy compilers :)

Of course 'crappy' is entirely subjective here - hence branding it
flippant.

Andrew Reilly

Nov 9, 2009, 5:11:28 PM

On Mon, 09 Nov 2009 10:04:13 -0700, James Van Buskirk wrote:

> "Robert Myers" <rbmye...@gmail.com> wrote in message
> news:1f518c33-55c6-4b2c...@j4g2000yqe.googlegroups.com...
>
>> On Nov 9, 2:52 am, Andrew Reilly <areilly...@bigpond.net.au> wrote:
>
>> > [*] We often hear of loads that perform worse with SMT. Are there
>> > equivalent rules of thumb for load classes that *do* show
>> > improvement with HyperThreading turned on?
>
>> The best results I saw for the P4 were as much as a 35% improvement for
>> a chess-playing game. Lots of pointer-chasing?
>
> http://www.mikusite.de/pages/x86.htm
>
> Scroll down to the last table of results, compare Intel Dual Xeon Nacona
> 2800 MHz HT on/off in the FPU speed column: 320.813/177.028 million
> iterations/second.

Closer to the top, though, is a pair of Core i7 920 results at 3200MHz
(admittedly already four cores/socket: don't know how much this benchmark
uses out-of-cache memory) FPU Mill iter/sec drops from 1869244 to 1820197
when HT is turned on. SSE performance goes up from 4573828 to 5138498
though. That suggests that memory isn't an issue, but that the SSE units
are better at being shared than the traditional FPU?
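
For a side-by-side view, the relative changes can be computed from the figures quoted in this thread. The sketch below only does that arithmetic. Which number belongs to HT on and which to HT off is taken from the wording of the posts, and the helper name is made up.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (label, HT off, HT on), as quoted in this thread from
# http://www.mikusite.de/pages/x86.htm
RESULTS = [
    ("Core i7 920, FPU", 1869244, 1820197),
    ("Core i7 920, SSE", 4573828, 5138498),
    ("Dual Xeon 2800 MHz, FPU", 177.028, 320.813),
]

if __name__ == "__main__":
    for label, ht_off, ht_on in RESULTS:
        print(f"{label}: {percent_change(ht_off, ht_on):+.1f}% with HT on")
```

The Core i7 rows and the Xeon row come from different tables on that page, so they should only be compared with caution.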

The page is a bit of a blog, with new items at the top. The figures
you've quoted are from July 2006.

Mandelbrot calculation is a benchmark of fairly limited predictive power,
IMO. :-)

Cheers,

--
Andrew

James Van Buskirk

Nov 9, 2009, 11:50:52 PM

"Andrew Reilly" <areil...@bigpond.net.au> wrote in message
news:7lricgF...@mid.individual.net...

> On Mon, 09 Nov 2009 10:04:13 -0700, James Van Buskirk wrote:

>> http://www.mikusite.de/pages/x86.htm

>> Scroll down to the last table of results, compare Intel Dual Xeon Nacona
>> 2800 MHz HT on/off in the FPU speed column: 320.813/177.028 million
>> iterations/second.

> Closer to the top, though, is a pair of Core i7 920 results at 3200MHz
> (admittedly already four cores/socket: don't know how much this benchmark
> uses out-of-cache memory) FPU Mill iter/sec drops from 1869244 to 1820197
> when HT is turned on. SSE performance goes up from 4573828 to 5138498
> though. That suggests that memory isn't an issue, but that the SSE units
> are better at being shared than the traditional FPU?

Actually the benchmark uses no memory. The table closer to the top
is a different benchmark that tries harder to saturate the FPU than
the earliest versions. It doesn't follow that the FPU is saturated
yet because there may be some sequence that prevents the CPU from
reordering instructions (when the CPU is a Core i7).