Funky divide by immediate instruction?

Paul A. Clayton

unread,

Dec 28, 2012, 6:01:20 PM12/28/12

to

Thinking about the use of multiplication by
reciprocal for handling division by a constant
brought me to wondering whether it would be
useful to have a multiply high immediate
instruction which expanded a relatively small
immediate (16 bits or less) into a full size
reciprocal. For reciprocals of small integers,
converting such an approximation to a far more
precise value would seem to require less
hardware than a similarly capable table lookup.

I suspect such is not worthwhile (division by
a non-power-of-two small constant is probably
very rare and it would probably not be that
difficult to provide a hardware reciprocal
estimate table that special-cased certain
values, providing fully precise "estimates"),
but I thought I would post this anyway.

Somewhat offtopic, I wonder if the replacement
of something like rand()%d with something like
multiply_high(rand(), d) has been used. (For
C's rand function, I think this would require a
left shift first--perhaps usually pre-applied to
the d value. This assumes that all bits are
equally random. For small divisors, on ISAs
that do not provide a multiply high, right
shifts before and after an ordinary multiply
could be used.) Doubtless there are more
clever methods of generating an integer between
zero and d-1, but having seen the rand()%d
method used (and being easily obsessed with
pointless micro-optimizations) I thought at
least this somewhat more efficient technique
should be more broadly known.

We now return you to your regular program. :-)

Quadibloc

unread,

Dec 28, 2012, 10:45:59 PM12/28/12

to

On Dec 28, 4:01 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> Thinking about the use of multiplication by
> reciprocal for handling division by a constant
> brought me to wondering whether it would be
> useful to have a multiply high immediate
> instruction which expanded a relatively small
> immediate (16 bits or less) into a full size
> reciprocal.

1) That kind of instruction would leave a lot of people disappointed
who were hoping for an accurate divide immediate instruction.

2) Exactly where are you going to store the *precomputed* long
reciprocal if this instruction is supposed to be faster than divide
immediate? In other words, this sounds more like a candidate for an
assembly-language pseudo-op than a real instruction, at least on
*conventional* architectures. (P-code/bytecode and anything of that
sort, though, doesn't play by the same rules, so you could do it on
architectures that are somehow interpretive.)

John Savard

Paul A. Clayton

unread,

Dec 29, 2012, 10:00:17 AM12/29/12

to

On Friday, December 28, 2012 10:45:59 PM UTC-5, Quadibloc wrote:
> On Dec 28, 4:01 pm, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
> > Thinking about the use of multiplication by
> > reciprocal for handling division by a constant
> > brought me to wondering whether it would be
> > useful to have a multiply high immediate
> > instruction which expanded a relatively small
> > immediate (16 bits or less) into a full size
> > reciprocal.
>
> 1) That kind of instruction would leave a lot of people disappointed
> who were hoping for an accurate divide immediate instruction.

Other than compiler writers, few would be aware of
the existence of such an instruction.

In addition, the instruction would be exact or off
by one (requiring a compare followed by an addition)
for most inputs. The hardware would expand the
immediate based on the repeating bit pattern; however,
after looking at this a tiny bit more, the repeating
bit pattern might be too long in too many cases for
such to be useful (e.g., a 7th repeats every 12 bits).
In addition, small divisors might be associated with
small dividends, so even a one cycle per bit divider
would not be especially slow.

With the software interface issues and such, it
would probably be better to just provide a divide
by immediate. (I guess thinking out loud can be
dangerous. :-)

> 2) Exactly where are you going to store the *precomputed* long
> reciprocal if this instruction is supposed to be faster than divide
> immediate? In other words, this sounds more like a candidate for an
> assembly-language pseudo-op than a real instruction, at least on
> *conventional* architectures. (P-code/bytecode and anything of that
> sort, though, doesn't play by the same rules, so you could do it on
> architectures that are somehow interpretive.)

The intention was for hardware to support extension
of a short immediate by exploiting repeating bit
patterns for fractions, thinking that cutting and
pasting bit patterns would be relatively simple.
However, with a bit more consideration, the idea
seems even sillier than it did at first because the
length of the immediate needed to communicate enough
information for simple generation of an "exact"
(as precise as desired) reciprocal would be too great.

Terje Mathisen

unread,

Jan 1, 2013, 11:47:36 AM1/1/13

to

Paul A. Clayton wrote:
> Thinking about the use of multiplication by
> reciprocal for handling division by a constant
> brought me to wondering whether it would be
> useful to have a multiply high immediate
> instruction which expanded a relatively small
> immediate (16 bits or less) into a full size
> reciprocal. For reciprocals of small integers,
> converting such an approximation to a far more
> precise value would seem to require less
> hardware than a similarly capable table lookup.

We already have all of this!

For compile-time constants all current compilers already generate
reciprocal muls.

We have (4-way parallel) reciprocal lookup which is the best for doing
many independent divisions, using zero to two NR iterations if you need
more than ~12 bits of precision.

For run-time constants that will be used for many divisions the best
approach is to use the reciprocal lookup + NR to generate a rounded up
reciprocal.

>
> I suspect such is not worthwhile (division by
> a non-power-of-two small constant is probably
> very rare and it would probably not be that
> difficult to provide a hardware reciprocal
> estimate table that special-cased certain
> values, providing fully precise "estimates"),
> but I thought I would post this anyway.
>
> Somewhat offtopic, I wonder if the replacement
> of something like rand()%d with something like
> multiply_high(rand(), d) has been used. (For

I have always done this, even before I learned about how expensive DIV
was relative to MUL. :-)

With a fp [0-1> rand value this approach is very natural, for integer
rand values of full register size a MUL by the desired modulus leaves
the desired result in the high (EDX) register.

> C's rand function, I think this would require a
> left shift first--perhaps usually pre-applied to
> the d value. This assumes that all bits are
> equally random. For small divisors, on ISAs
> that do not provide a multiply high, right
> shifts before and after an ordinary multiply
> could be used.) Doubtless there are more
> clever methods of generating an integer between
> zero and d-1, but having seen the rand()%d
> method used (and being easily obsessed with
> pointless micro-optimizations) I thought at
> least this somewhat more efficient technique
> should be more broadly known.

There are no pointless optimizations, only some that might not be
currently applicable: These are still valuable as a learning experience. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Paul A. Clayton

unread,

Jan 1, 2013, 2:31:17 PM1/1/13

to

On Tuesday, January 1, 2013 11:47:36 AM UTC-5, Terje Mathisen wrote:
> Paul A. Clayton wrote:
[snip immediate multiply high reciprocal instruction]

>
> We already have all of this!
>
> For compile-time constants all current compilers already generate
> reciprocal muls.

I was proposing a single instruction that shifted
the immediate, extended it with the repeating
portion, and performed the multiply. I do not know
of any ISA that had such an instruction (likely
because loading an immediate--even with low
precision--and multiplying is good enough for
most uses and iterative refinement and/or larger
immediate values could be used for other cases).

> We have (4-way parallel) reciprocal lookup which is the best for doing
> many independent divisions, using zero to two NR iterations if you need
> more than ~12 bits of precision.

Of course, some ISAs do not have reciprocal approximation
instructions (especially for scalar operands).

[snip]

>> Somewhat offtopic, I wonder if the replacement
>> of something like rand()%d with something like
>> multiply_high(rand(), d) has been used. (For
>
> I have always done this, even before I learned about how expensive DIV
> was relative to MUL. :-)

If you did not know DIV was more expensive, why
did you prefer MUL?

> With a fp [0-1> rand value this approach is very natural, for integer
> rand values of full register size a MUL by the desired modulus leaves
> the desired result in the high (EDX) register.

For x86. (MIPS also has a HI multiply result.)
Some ISAs (e.g., at least older ARM) do not even
have a multiply instruction that provides the
high result.

[snip]

>> Doubtless there are more
>> clever methods of generating an integer between
>> zero and d-1, but having seen the rand()%d
>> method used (and being easily obsessed with
>> pointless micro-optimizations) I thought at
>> least this somewhat more efficient technique
>> should be more broadly known.
>
> There are no pointless optimizations, only some that might not be
> currently applicable: These are still valuable as a learning experience. :-)

I was thinking pointless in the sense of the
benefit not justifying the cost. Maintenance
costs can easily overshadow benefits from a
micro-optimization. Micro-optimizations can
also interfere with higher level optimizations
which would provide greater benefit. If the
micro-optimization is not in a critical path,
then the optimization may be relatively
pointless. (Even in a critical path, if a
small improvement makes that path non-critical,
then a large improvement of that path without
improvements in the other paths becomes
"pointless".) Furthermore, the same effort
applied to another area might produce greater
benefit.

The learning benefit of misapplied optimization
may be significant; not just to reinforce the
dangers of 'premature optimization'--one could
also learn better how to evaluate optimizations,
learn about some subtle interactions or
assumptions in the code (which could make later
debugging and extension easier), learn about
strengths and weaknesses of one's co-workers
and development tools, etc.

Optimizing for fun or as a work of art may not
require practical justification, and morale
factors can justify some acceptance of "pointless"
efforts even from a practical perspective.

(I suspect all the posters here know all of
that, but perhaps some lurkers might not.)

Terje Mathisen

unread,

Jan 1, 2013, 3:56:06 PM1/1/13

to

Paul A. Clayton wrote:
> On Tuesday, January 1, 2013 11:47:36 AM UTC-5, Terje Mathisen wrote:
>>> Somewhat offtopic, I wonder if the replacement
>>> of something like rand()%d with something like
>>> multiply_high(rand(), d) has been used. (For
>>
>> I have always done this, even before I learned about how expensive DIV
>> was relative to MUL. :-)
>
> If you did not know DIV was more expensive, why
> did you prefer MUL?

Working with pen & paper, DIV and MOD were approximately as hard as MUL,
but when programming the only way I knew to do a MOD was by first
dividing, then back-multiply and subtract.

At that point it was obviously better to just multiply. :-)

>> There are no pointless optimizations, only some that might not be
>> currently applicable: These are still valuable as a learning experience. :-)
>
> I was thinking pointless in the sense of the
> benefit not justifying the cost. Maintenance
> costs can easily overshadow benefits from a
> micro-optimization. Micro-optimizations can
> also interfere with higher level optimizations
> which would provide greater benefit. If the
> micro-optimization is not in a critical path,
> then the optimization may be relatively
> pointless. (Even in a critical path, if a
> small improvement makes that path non-critical,
> then a large improvement of that path without
> improvements in the other paths becomes
> "pointless".) Furthermore, the same effort
> applied to another area might produce greater
> benefit.

I'm currently working on a multithreaded version of a common internet
protocol, the multithreading is to enable a current x86 multicore cpu to
handle more clients at the same time.

While doing this I have of course had to carefully evaluate the current
reference (open source) implementation, and it has become very obvious
that I can probably at least double (maybe even more?) the speed of the
single-threaded code by relatively simple (and totally standard)
re-organization of the code, in order to construct a fast path through
for the most common case.

Since my target is x86 and I'll have a single receive function that gets
the incoming requests I want to hand them out to secondary worker
threads using a round robin set of fifos, one per thread.

This means that I have single producer/single receiver for each fifo,
making the code very simple.

I should also be able to use MONITOR/MWAIT in the worker threads so they
can consist of a simple spin loop, but without the actual power cost of
spinning.

Piotr Wyderski

unread,

Jan 2, 2013, 4:58:04 AM1/2/13

to

Terje Mathisen wrote:

> I should also be able to use MONITOR/MWAIT in the worker threads so they
> can consist of a simple spin loop, but without the actual power cost of
> spinning.

There is no such thing as monitor/mwait, Terje, you should better
forget about them. They are CPL0 instructions with conditional
ability to be executed at CPL3. And, guess what, on Linux their
execution in user mode is prohibited, so if you do not want an
immediate core dump, don't use them.

I think the main purpose of these instructions is pissing off
advanced programmers...

Best regards, Piotr

Terje Mathisen

unread,

Jan 2, 2013, 6:07:28 AM1/2/13

to

OUCH!!!

That means that the best I can do is probably to use the Linux futex
facility, unless I want to spend the watts to let all worker threads
spin waiting for their own queue to be updated. :-(

Any other ideas?

Piotr Wyderski

unread,

Jan 2, 2013, 8:08:59 AM1/2/13

to

Terje Mathisen wrote:

> OUCH!!!

Unfortunately -- two weeks ago I stepped on the same mine... :-/

> That means that the best I can do is probably to use the Linux futex
> facility, unless I want to spend the watts to let all worker threads
> spin waiting for their own queue to be updated. :-(
>
> Any other ideas?

Nothing more than pause, i.e. rep nop. And pthread_yield()
when spinning for too long.

Best regards, Piotr

Andy Valencia

unread,

Jan 2, 2013, 12:11:11 PM1/2/13

to

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> > I think the main purpose of these instructions is pissing off
> > advanced programmers...

> That means that the best I can do is probably to use the Linux futex
> facility, unless I want to spend the watts to let all worker threads
> spin waiting for their own queue to be updated. :-(

Or change Linux.

Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html

Andy (Super) Glew

unread,

Jan 2, 2013, 1:54:18 PM1/2/13

to

On 1/2/2013 3:07 AM, Terje Mathisen wrote:> Piotr Wyderski wrote:
>> Terje Mathisen wrote:
>>
>>> I should also be able to use MONITOR/MWAIT in the worker threads so
they
>>> can consist of a simple spin loop, but without the actual power cost of
>>> spinning.
>>
>> There is no such thing as monitor/mwait, Terje, you should better
>> forget about them. They are CPL0 instructions with conditional
>> ability to be executed at CPL3. And, guess what, on Linux their
>> execution in user mode is prohibited, so if you do not want an
>> immediate core dump, don't use them.
>>
>> I think the main purpose of these instructions is pissing off
>> advanced programmers...
>
> OUCH!!!

I wish that there was a public explnataion of why MONITOR/MWAIT are only
available in CPL=0.

Patents such as http://www.google.com/patents/US20030126379
and
http://www.google.com/patents/US7328293 (Queued locks using
monitor-memory wait)

imply that MONITOPR/MWAIT was originally intended for user level code.

Although by the time
http://download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf
was published, the spin says that MONITOR/MWAIT are for OS code only.

It would be a very good case study for students of computer architecture
to know why this restriction was imposed.

One may conjecture that

> That means that the best I can do is probably to use the Linux futex
> facility, unless I want to spend the watts to let all worker threads
> spin waiting for their own queue to be updated. :-(
>
> Any other ideas?
>
> Terje

I am not a big fan of futex. It's much more complex than simpler
examples of similar code, which solves the simple case of spinning on a
process that has been suspended.

The basic idea is to have the OS write a status field in shared memory
when it has context switched a process (or thread) out. Locks hold the
identity of the lock-owners; waiters look at the swapped-out status
field for the lock-owner, and if the lock-owner is swapped out, they ask
the OS to suspend themselves (i.e. if the waiter sees the lock-owner is
swapped out, the waiter asks to be suspended).

Yeah, yeah, ideally the waiter is only suspended until the lock-owner is
swapped back in and is releasing the lock. Which begins the slippery
slope to futex. Especially if you want to make these operations atomic.

But if you can just get that teensy little bit of an OS change -
register a location that the OS updates when it swaps out a
process/thread - then you can get most of the goodness.

... And, yeah, use a queue lock for the user code only stuff.

--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.

Paul A. Clayton

unread,

Jan 2, 2013, 5:29:51 PM1/2/13

to

On Wednesday, January 2, 2013 1:54:18 PM UTC-5, Andy (Super) Glew wrote:
[snip]

> I wish that there was a public explnataion of why MONITOR/MWAIT are only
> available in CPL=0.
>
> Patents such as http://www.google.com/patents/US20030126379
> and
> http://www.google.com/patents/US7328293 (Queued locks using
> monitor-memory wait)
>
> imply that MONITOPR/MWAIT was originally intended for user level code.
>
> Although by the time
> http://download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf
> was published, the spin says that MONITOR/MWAIT are for OS code only.
>
> It would be a very good case study for students of computer architecture
> to know why this restriction was imposed.
>
> One may conjecture that

Since the conjecture did not make it into that message,
I will offer a wild(ish) guess: Unprivileged MWAIT
could waste execution resources when the thread to
write to memory (or otherwise wake the waiting thread)
is rescheduled.

OS code has several advantages: 1) it generally has
substantial control of thread scheduling; 2) it is
generally somewhat commonly developed and maintained
software (so the code maintainers have incentives and
ability to avoid bad behavior like using MWAIT on
persistent locks); 3) it is usually expected to be the
most reliable code in the system (so bad behavior is
less likely). (Another advantage might be that OS
code is not expected to typically run for full
timeslices.)

Could requiring the MONITORed event be registered by
the event provider (by a MONITOR-like operation)
avoid such persistent stalls in an implementable
manner? If the thread providing the event is
rescheduled, it would lose its registration and
inform any waiting threads of this event. If the
thread providing the event misbehaves and does not
register, the MONITOR/MWAIT would always fail.

Obviously, that would not work well with software
that is not cooperative and would only help (a
little) in the case of thread rescheduling.

Along similar lines, I wonder if a lock elision
mechanism might be able to attempt a transaction
even if the lock is held by not committing until
the lock is released. Such might allow a little
more parallelism under moderate, relatively
isolated contention with relatively coarse-grained
locks. Or not.

Ivan Godard

unread,

Jan 2, 2013, 6:52:52 PM1/2/13

to

IMO, the prospect of losing control while still holding a lock is the
fundamental failing of pessimistic concurrency control.

The only trouble-free implementation surrenders all held locks when
losing control. This is a well-known and widely used technique; it's
called optimistic concurrency :-)

Hardware implementations of optimistic concurrency have strictly bounded
read/write set sizes, fewer for LLSC and more for the cache based
schemes, but in all cases much smaller than the sets that are frequent
under locking. There is also a well known and widely used solution to
this problem: it's called two-phase commit. Unfortunately TPC requires a
bit more front-end thinking on the part of the programmer than is needed
for locks.

Of course, TPC requires no back-end thinking, while locking is rife with
failures requiring massive post-failure thunk.

But that cost is some other programmer thinking and somebody else's
budget. So locks remain popular.

Andy (Super) Glew

unread,

Jan 2, 2013, 10:57:57 PM1/2/13

to

On 1/2/2013 2:29 PM, Paul A. Clayton wrote:

> On Wednesday, January 2, 2013 1:54:18 PM UTC-5, Andy (Super) Glew wrote:
> [snip]
>> I wish that there was a public explnataion of why MONITOR/MWAIT are only
>> available in CPL=0.
>>
>> Patents such as http://www.google.com/patents/US20030126379
>> and
>> http://www.google.com/patents/US7328293 (Queued locks using
>> monitor-memory wait)
>>
>> imply that MONITOPR/MWAIT was originally intended for user level code.
>>
>> Although by the time
>> http://download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf
>> was published, the spin says that MONITOR/MWAIT are for OS code only.
>>
>> It would be a very good case study for students of computer architecture
>> to know why this restriction was imposed.
>>
>> One may conjecture that
>
> Since the conjecture did not make it into that message,

Sorry about that.

Now I am not even sure what I was conjecturing.

This is what I get when I try to squeeze in a post while tests /
simulations are running.

Michael S

unread,

Jan 3, 2013, 8:49:06 AM1/3/13

to

I can tell for Terje, but I, personally, preferred, and still prefer,
MUL because assumption about bits being equally random is correct only
for "good" implementations of rand(). For "less good" implementations,
which were very common 10-20 years ago, and still can be encountered
even today, less significant bits are less random.

Piotr Wyderski

unread,

Jan 3, 2013, 9:53:15 AM1/3/13

to

Andy Valencia wrote:

> Or change Linux.

In many companies you cannot change the preinstalled base.
And even if you can, it is a problem with non-free Linuxes like RHEL.

Best regards, Piotr

Torben Ægidius Mogensen

unread,

Jan 3, 2013, 10:12:28 AM1/3/13

to

"Paul A. Clayton" <paaron...@gmail.com> writes:

> Somewhat offtopic, I wonder if the replacement
> of something like rand()%d with something like
> multiply_high(rand(), d) has been used.

It was a very common trick in BASIC on home computers: In some BASIC
interpreters, the RND function would just generate a floating-point
value between 0 and 1 (not included), so generating an integer from 0 to
N-1 meant multiplying this by N and taking the integer part. Other
BASIC interpreters had built-in integer RND functions, making this trick
redundant. I never looked deeply enough into the interpreters to see
what trick they used internally, but I would guess something similar was
done.

I used a similar trick in ARM assembly language on my Archimedes
computer: Get a 32-bit random number from a PRNG, clear the upper
log2(N) bits, multiply by N and shift down by 32-log2(N) bits.

Torben

EricP

unread,

Jan 3, 2013, 12:58:42 PM1/3/13

to

Paul A. Clayton wrote:
> On Wednesday, January 2, 2013 1:54:18 PM UTC-5, Andy (Super) Glew wrote:
> [snip]
>> I wish that there was a public explnataion of why MONITOR/MWAIT are only
>> available in CPL=0.
>>
>> Patents such as http://www.google.com/patents/US20030126379
>> and
>> http://www.google.com/patents/US7328293 (Queued locks using
>> monitor-memory wait)
>>
>> imply that MONITOPR/MWAIT was originally intended for user level code.
>>
>> Although by the time
>> http://download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf
>> was published, the spin says that MONITOR/MWAIT are for OS code only.
>>
>> It would be a very good case study for students of computer architecture
>> to know why this restriction was imposed.
>>
>> One may conjecture that
>
> Since the conjecture did not make it into that message,
> I will offer a wild(ish) guess: Unprivileged MWAIT
> could waste execution resources when the thread to
> write to memory (or otherwise wake the waiting thread)
> is rescheduled.

Don't think of these as a waste of resources,
think of them as an optimized spinwait.

Firstly, it waste time but saves on power (and heat),
and on an SMT (hyperthreaded) cpu it doesn't even waste time.
The alternative is not a thread yield, it is a spinwait
which waste time and power and hampers hyperthreads.

Secondly, it is not for the OS to judge how a thread
spends its time slice. That's a can of worms.

Also while the Intel instruction manual says its only
available at CPL=0, newer versions of the SysProg manual
say it _may_ be available to higher CPL (user mode) in the future.

I'm guessing the reason they restricted the initial version was
because MONITOR/MWAIT is so intimately tied to power state transitions.
It interacts with so many other subsystems (power, various sleep states,
interrupts, NMI, SMI, multiprocessing, hyperthreading)
that they may have wanted to get the mechanism to market
quickly for the controlled kernel environment, then do a
more limited user mode version later when they had
more time to analyze the consequences more fully.

For example, whereas the kernel MONITOR/MWAIT instructions
pass information in EAX, ECX and EDX registers that control
important kernel operations like sleep states,
you would not want that specified by user mode.
Instead this info could be put in a protected MSR register
set up by the kernel on behalf of the user threads.
If the user thread execute MONITOR/MWAIT then the
control and hints values in the MSR are used.

Eric

Terje Mathisen

unread,

Jan 3, 2013, 2:15:33 PM1/3/13

to

EricP wrote:
> I'm guessing the reason they restricted the initial version was
> because MONITOR/MWAIT is so intimately tied to power state transitions.
> It interacts with so many other subsystems (power, various sleep states,
> interrupts, NMI, SMI, multiprocessing, hyperthreading)
> that they may have wanted to get the mechanism to market
> quickly for the controlled kernel environment, then do a
> more limited user mode version later when they had
> more time to analyze the consequences more fully.

Sounds like a very believable scenario.

>
> For example, whereas the kernel MONITOR/MWAIT instructions
> pass information in EAX, ECX and EDX registers that control
> important kernel operations like sleep states,
> you would not want that specified by user mode.
> Instead this info could be put in a protected MSR register
> set up by the kernel on behalf of the user threads.
> If the user thread execute MONITOR/MWAIT then the
> control and hints values in the MSR are used.

I'm hoping this turns out to be true!

For my particular problem each worker thread would map very naturally to
a setup with a tight spin loop that used MWAIT to save power when its
work queue was empty.

What I did today instead was to spin until there was no more work, then
yield (in the form of a sleep(0) call) before repeating the spin.

Andy Valencia

unread,

Jan 3, 2013, 2:39:46 PM1/3/13

to

Piotr Wyderski <pete...@neverland.mil> writes:
> Andy Valencia wrote:
> > Or change Linux.
> In many companies you cannot change the preinstalled base.
> And even if you can, it is a problem with non-free Linuxes like RHEL.

I still remember the "Live Free Or Die" UNIX license plate which made
the rounds at USENIX....

Piotr Wyderski

unread,

Jan 3, 2013, 2:40:29 PM1/3/13

to

Andy (Super) Glew wrote:

> It would be a very good case study for students of computer architecture
> to know why this restriction was imposed.

I think that there is no need for a deeper theory.
A plain old fear is enough. Fear that the instruction
could be abused in an unpredicted way by some malware
and jeopardize Intel's good name and profits and, as
a consequence, cause the originator to be immediately fired. In kernel
mode it is much safer, as, first of all, one must somehow
enter CPL0, wich is a much harder problem. And because
of such a fear the implementation of this nice tool has
been screwed.

I presume that you, Andy, have a lot of technical reasons
to do it the way it has been done, but I dare to claim that
the real reason is that simple: fear.

Best regards, Piotr

Paul A. Clayton

unread,

Jan 3, 2013, 5:33:22 PM1/3/13

to

On Thursday, January 3, 2013 2:40:29 PM UTC-5, Piotr Wyderski wrote:
> Andy (Super) Glew wrote:
>
>> It would be a very good case study for students of computer architecture
>> to know why this restriction was imposed.
>
> I think that there is no need for a deeper theory.
> A plain old fear is enough. Fear that the instruction
> could be abused in an unpredicted way by some malware
> and jeopardize Intel's good name and profits and, as
> a consequence, cause the originator to be immediately fired. In kernel
> mode it is much safer, as, first of all, one must somehow
> enter CPL0, wich is a much harder problem. And because
> of such a fear the implementation of this nice tool has
> been screwed.

It is probably also easier to ensure that undefined
behavior is limited to the equivalent of an
arbitrary instruction sequence without violating
privilege guarantees when in most privileged mode.
(I do not remember MONITOR/MWAIT having any cases of
undefined behavior, but in most privileged mode
undefined behavior [in some ISAs] is allowed to do
anything short of locking up the system to the point
of needing a hard reset.)

This would not seem to be a significant consideration,
but it is a "benefit" of allowing only most
privileged use.

Quadibloc

unread,

Jan 3, 2013, 11:55:45 PM1/3/13

to

On Jan 3, 10:58 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> Secondly, it is not for the OS to judge how a thread
> spends its time slice. That's a can of worms.

What? If useful work can be done in another thread when one thread is
waiting, *of course* the OS had better see to it that it is so done,
and *of course* CPU designers should ensure that no unprivileged
instruction has the opportunity to waste time.

There's a system call for throwing control away until the thing you're
waiting for happens; if that can be the model for an instruction, so
much the better.

Performance is to be maximized. As well as civilized behavior among
the hundreds of users who are paying $100 per hour for CPU time on our
computer... (what? Are you trying to tell me that this isn't 1967 any
more?)

John Savard

Quadibloc

unread,

Jan 3, 2013, 11:58:28 PM1/3/13

to

On Jan 3, 6:49 am, Michael S <already5cho...@yahoo.com> wrote:
> For "less good" implementations,
> which were very common 10-20 years ago, and still can be encountered
> even today, less significant bits are less random.

I thought that the "less good" remained the universal standard to this
very day. I'm surprised to hear that it is now instead in the
minority.

John Savard

Noob

unread,

Jan 4, 2013, 2:43:38 AM1/4/13

to

Piotr Wyderski wrote:

> In many companies you cannot change the preinstalled base.
> And even if you can, it is a problem with non-free Linuxes like RHEL.

The last statement is ambiguous, at best.

https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux

> While Red Hat uses strict trademark rules to restrict free
> re-distribution of their officially supported versions of Red Hat
> Enterprise Linux, Red Hat freely provides the source code for the
> distribution's software even for software where this is not
> mandatory. As a result, several distributors have created re-branded
> and/or community-supported re-builds of Red Hat Enterprise Linux that
> can legally be made available, without official support from Red Hat.
> CentOS aims to provide 100% binary compatibility to Red hat Linux.

Michael S

unread,

Jan 4, 2013, 4:48:19 AM1/4/13

to

For plain clib rand() "less good" is, indeed, universal even today.
However, in most environments better PRNGs are available today either
out of the box or through simple download (boost, gsl, etc).

Also there are different levels of "less good".
Wikipedia article lists several applications of Linear Congruential
Generator in various clib rand() implementations:
http://en.wikipedia.org/wiki/Linear_congruential_generator
Out of those, only glibc rand() and Borland C++ lrand() are in
category of "bad enough to often cause problem to majority of naive
novices". And in case of Borland, if the person picked lrand() over
rand(), then he is more likely to know what he is doing and to not use
% to reduce range. So, I'd say that only glibc left LS bits pitfall
wide open.

Piotr Wyderski

unread,

Jan 4, 2013, 4:53:29 AM1/4/13

to

Noob wrote:

>> While Red Hat uses strict trademark rules to restrict free
>> re-distribution of their officially supported versions of Red Hat
>> Enterprise Linux, Red Hat freely provides the source code for the
>> distribution's software even for software where this is not
>> mandatory. As a result, several distributors have created re-branded
>> and/or community-supported re-builds of Red Hat Enterprise Linux that
>> can legally be made available, without official support from Red Hat.
>> CentOS aims to provide 100% binary compatibility to Red hat Linux.

Will a user-modified RHEL still be recognized by RedHat as
a genuine RHEL and supported accordingly?

Best regards, Piotr

Andy (Super) Glew

unread,

Jan 4, 2013, 12:34:00 PM1/4/13

to

By the way: this is one of the reasons why I am an advocate of hardware
schedulers for CPUs - much as there are already hardware schedulers for
GPUs. Although I suspect that many of them are just really software
schedulers running on an embedded processor. Just as what I am
proposing may often end up being microcode or similar.

It is often too expensive to do a system call if you are spinning to
check to see if you should throw away control.
This is the reason why we have stuff like what I proposed, and
FUTEXes for that matter: some user level way to determine that you
should surrender control, combined with the OS call if you should.

For that matter, it is often too expensive to do a system to throw away
control.
This is why we have the PAUSE and should have a user level MWAIT
instructions - they don't throw away contrl all the way to the OS, but
the do make the currently running thread "cheaper", e.g. by putting it
to sleep for 300 cycles or going into a low power state.
(Note the division of responsibilities: the user thread may decide
that it should PAUSE or wait for an event, but some of the steps
involved may require privilege.)

Heck: IBM mainframe Z-series has the ability for a virtual machine
monitor to intercept synchronization instructions like CAS (Compare and
Swap). So that the VMM can control policy - so that guest OSes that do
not know they are running in a virtual machine can be prevented from
spinning uselessly.
(In a virtual machine environment, every place that the app or
guest thinks it has full control, interrupts blocked, etc - no longer is.)

Now, PAUSE just puts one thread of an SMT to sleep for a short time.
It can pause/unpause much faster than a roundtrip through a syscall.

Basically, PAUSE is a hardware scheduling operation, where the
"scheduler queue" is trivial - just one of the currently running SMT
threads.

We can imagine SMT machines that have larger number of thread contexts,
say 16, out of which only a smaller number are active, say 4. In which
case the hardware scheduler may be making one of the inactive but loaded
threads active, and vice versa. Both for the PAUSE instruction, but
potentially for other reasons. Like hardware timesharing. (I.e.
instead of SMT, interleaving threads within a cycle, or barrel,
interleaving ion cycle basis, interleave at a coarser scale - 10 cycles?
100 cycles? Anything up to the typical system call time would be useful.)

We don't need to stop at "thread contexts loaded into a hardware
register file". If you have microcode you might have the "hardware"
(microcode) scheduled threads from a simple "hardware" (microcode)
scheduling queue that is stored in memory.

(BTW, I say "microcode", because I don't like hardware instructions that
do complicated sequences of access to memory. I'm a RISC guy at heart.
But I am okay with microcode - which is just software, except with
special features and support, really a pecial privilege level.)

Hierarchy:
* SMT threads loaded into register file and running
* Less active threads loaded into (slower) RFs
* Memory data structures managed by hw/microcode
* Memory datastructures managed by OS
* Memory and disk datastructures managed by network batch queueing
engines, Grid Engines, like Condor

You draw the line according to speed. If you can do fast syscalls, and
wake up quickly on events, then the OS level may be all you need.

But hw/ucode memory scheduler queues can be useful
a) when OS is slow and stupid
b) to access even signalling mechanisms that are aster than normal
interrupts, but which you are not quite ready to expose at the ISA level
for OS.

EricP

unread,

Jan 4, 2013, 2:40:40 PM1/4/13

to

The reason I said it's a can of worms is because how is an
operating system to measure what constitutes "useful" work
(forward progress) and judge when I am making enough of it?

Any such discussion is context dependent, highly subjective,
and likely to end in up talking about morality
and mans obligations to fellow man (or timesharers).
It gets all metaphysical very quickly.

Is a spinwait ok if I do it on my PC but morally wrong on a cloud?
I am paying for time on both.

How about on a dedicated HPC system to coordinate
threads doing weather prediction?

On Windows the user mode critical section allows an optional
spinwait value. If lock duration is low, say guarding a
double linked list, the odds in a collision are usually
low and the odds in a double collsion even lower.
If a requester finds the structure locked then the
*best* course of action is to spinwait for "a while" before
giving up and yielding because yielding has a cost too.
How big should "a while" be allowed to get?

Like I said, a can of worms.

So anticipating all of this I decided the best course is to
define my timeslice as my ration to do with as I choose.
Then the OS doesn't have to define or measure forward progress,
or decide concepts like what "worthwhile" is.

(Actually, I would set the "moral" spinwait limit as the cost
in clock cycles of at least two trips through the scheduler.
If the average spinwait is less than that then there is no
point in checking whether there is something better to do.
But others may define that limit differently,
and there is no way to enforce this anyway.)

Eric

Noob

unread,

Jan 4, 2013, 2:45:26 PM1/4/13

to

What does support have to do with your assertion that RHEL is one
of several "non-free Linuxes" ?

Quadibloc

unread,

Jan 4, 2013, 3:11:55 PM1/4/13

to

On Jan 4, 12:40 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> Is a spinwait ok if I do it on my PC but morally wrong on a cloud?
> I am paying for time on both.

That depends if the owner/operator of the cloud only cares about
making his money, or if he is trying to provide a useful service to as
many people as possible (even if that is merely to better motivate
others to give him their money). That will determine what the fine
print of the terms of service contain.

John Savard

Stephen Fuld

unread,

Jan 5, 2013, 2:38:30 PM1/5/13

to

On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:

snip

This is an interesting possibility, but there are a number of "issues"

> We can imagine SMT machines that have larger number of thread contexts,
> say 16, out of which only a smaller number are active, say 4. In which
> case the hardware scheduler may be making one of the inactive but loaded
> threads active, and vice versa. Both for the PAUSE instruction, but
> potentially for other reasons.

But if the cores are still superscaler, wouldn't the size of the
multiported register file be an issue? I suppose you could add register
save and restore to the hardware/microcode scheduler, as long as you had
agreement on the memory layout of the save area.

Would you want to allow multiple scheduling policies to help make
choices among multiple ready threads?

> Like hardware timesharing. (I.e.
> instead of SMT, interleaving threads within a cycle, or barrel,
> interleaving ion cycle basis, interleave at a coarser scale - 10 cycles?
> 100 cycles? Anything up to the typical system call time would be useful.)

Wasn't there a version of the Power processor that switched on cache misses?

> We don't need to stop at "thread contexts loaded into a hardware
> register file". If you have microcode you might have the "hardware"
> (microcode) scheduled threads from a simple "hardware" (microcode)
> scheduling queue that is stored in memory.

Absolutely, but see my comments about register save/restore.

> (BTW, I say "microcode", because I don't like hardware instructions that
> do complicated sequences of access to memory. I'm a RISC guy at heart.
> But I am okay with microcode - which is just software, except with
> special features and support, really a pecial privilege level.)

Yup. Note that you allow the hardware scheduler to be disabled by the
OS in order to allow an OS to use an existing code, a unique scheduling
requirement etc. But since the scheduler is a fairly highly used portion
of the OS, and presumably the HW implementation would be faster, OS
vendors would be encouraged to use it.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Michael S

unread,

Jan 5, 2013, 4:46:40 PM1/5/13

to

On Jan 5, 9:38 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:
> On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:
>
> snip
>
> This is an interesting possibility, but there are a number of "issues"
>
> > We can imagine SMT machines that have larger number of thread contexts,
> > say 16, out of which only a smaller number are active, say 4. In which
> > case the hardware scheduler may be making one of the inactive but loaded
> > threads active, and vice versa. Both for the PAUSE instruction, but
> > potentially for other reasons.
>
> But if the cores are still superscaler, wouldn't the size of the
> multiported register file be an issue?

How about register cache?

Paul A. Clayton

unread,

Jan 5, 2013, 6:12:57 PM1/5/13

to

On Saturday, January 5, 2013 2:38:30 PM UTC-5, Stephen Fuld wrote:
> On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:

[snip]

>> We can imagine SMT machines that have larger number of thread contexts,
>> say 16, out of which only a smaller number are active, say 4. In which
>> case the hardware scheduler may be making one of the inactive but loaded
>> threads active, and vice versa. Both for the PAUSE instruction, but
>> potentially for other reasons.
>
> But if the cores are still superscaler, wouldn't the size of the
> multiported register file be an issue? I suppose you could add register
> save and restore to the hardware/microcode scheduler, as long as you had
> agreement on the memory layout of the save area.

There are several possibilities:

Itanium uses a register file that shares ports between
the two threads. (The same basic mechanism was used
on at least one Sun SPARC design for register
windows.) This design exploits the fact that
registers from different threads are guaranteed to be
accessed in different cycles. For a register file
with a large port count, this can provide a
significant savings in area. (It *might* be possible
to use such knowledge to reduce static power use in
the inactive portion.)

Saving/restoring to/from a secondary register file
would also be possible (given reasonably coarse-
grained switching). Register save/restore could
generally substantially benefit from banking
(especially if logical register name correlates
with physical bank number), so bandwidth could
be relatively high. In addition, switches
might often occur when one thread is relatively
idle, so some extra ports may be available.

In some ways, the power save SRAM region in some
Intel Atom processors might be viewed as a
thread context save area--except that it also
saves other processor state and is only used for
power saving not to allow a different thread to
execute (though I have suggested--on comp.arch,
I think--that such might be a useful extension
of the hardware).

> Would you want to allow multiple scheduling policies to help make
> choices among multiple ready threads?

Yes. I think one would want to be able to
optimize for performance (perhaps a priority-
oriented scheduler) or throughput. One might
also want to account for power/energy and
thermal considerations (e.g., saving up thermal
headroom to allow a turboboost when it would be
most useful). Best effort also differs from
guaranteed performance/resource use (and in
some cases more than adequate performance has
little value). The threads themselves could
also give scheduling hints/directives (along
the lines of MIPS MT-ASE 'yield' instruction).

>> Like hardware timesharing. (I.e.
>> instead of SMT, interleaving threads within a cycle, or barrel,
>> interleaving ion cycle basis, interleave at a coarser scale - 10 cycles?
>> 100 cycles? Anything up to the typical system call time would be useful.)
>
> Wasn't there a version of the Power processor that switched on cache misses?

You might be thinking of Northstar and its follow-ons:
en.wikipedia.org/wiki/IBM_RS64#RS64-II

More recent Itanium implementations also have SoEMT.

>> We don't need to stop at "thread contexts loaded into a hardware
>> register file". If you have microcode you might have the "hardware"
>> (microcode) scheduled threads from a simple "hardware" (microcode)
>> scheduling queue that is stored in memory.
>
> Absolutely, but see my comments about register save/restore.
>
>
>> (BTW, I say "microcode", because I don't like hardware instructions that
>> do complicated sequences of access to memory. I'm a RISC guy at heart.
>> But I am okay with microcode - which is just software, except with
>> special features and support, really a pecial privilege level.)
>
>
> Yup. Note that you allow the hardware scheduler to be disabled by the
> OS in order to allow an OS to use an existing code, a unique scheduling
> requirement etc. But since the scheduler is a fairly highly used portion
> of the OS, and presumably the HW implementation would be faster, OS
> vendors would be encouraged to use it.

Faster and quite possibly more informed since
exposing implementation-specific event counters
and configuration details to OSes may be
problematic (since such details might not be
accepted as implementation specific and there
may be issues with trade secrets).

MitchAlsup

unread,

Jan 5, 2013, 8:06:23 PM1/5/13

to SF...@alumni.cmu.edu.invalid

On Saturday, January 5, 2013 1:38:30 PM UTC-6, Stephen Fuld wrote:
> But if the cores are still superscaler, wouldn't the size of the
> multiported register file be an issue?

GPUs contain thousands of threads each with their register file on
die simultaneously--no saving or restoring needed. These files are
multiported (via banking) and support 12 register reads and 4 register
writes per cycle continuously. The cores are 4-5 way SuperScalar.

After thinking about this for a couple of months, the problem CPU
designers have is that they believe that register file access BW
has to be accompanied by low latency to the RF. Eliminate this
belief and the problem can be made to "go away"

Mitch

Paul A. Clayton

unread,

Jan 5, 2013, 9:39:57 PM1/5/13

to

On Saturday, January 5, 2013 8:06:23 PM UTC-5, MitchAlsup wrote:
> On Saturday, January 5, 2013 1:38:30 PM UTC-6, Stephen Fuld wrote:
>> But if the cores are still superscaler, wouldn't the size of the
>> multiported register file be an issue?
>
> GPUs contain thousands of threads each with their register file on
> die simultaneously--no saving or restoring needed. These files are
> multiported (via banking) and support 12 register reads and 4 register
> writes per cycle continuously. The cores are 4-5 way SuperScalar.

Aargh! Please do not count SIMD elements as threads!
(Yes, from the standpoint of storage capacity, they
are threads, but the access control complexity is
quite different as you know better than I do.)

AMD's VLIW4 ISA is also much friendlier to a graphics
workload (extremely great available parallelism)
than to even a GPGPU workload. (I do not _think_ it
was just to simplify the compiler that AMD moved to
Graphics Core Next, which is a much more CPU-like
architecture. I suspect that even with off-line
compilation, VLIW4 would be less than ideal even for
GPGPU codes--which as a rule have high degrees of
parallelism--, though some Architecturally-defined
banking _is_ probably practical, at least for off-
line compilation [it might even be extended somewhat
to cache/scratchpad memory]. Yes, I am just guessing.)

Each "core" (SIMD) is four wide and four "deep" in
terms of vectors and the VLIW operations (as you
pointed out) use distinct register files (banks) to
provide high port counts (somewhat like some DSPs).

> After thinking about this for a couple of months, the problem CPU
> designers have is that they believe that register file access BW
> has to be accompanied by low latency to the RF. Eliminate this
> belief and the problem can be made to "go away"

But is it a groundless belief (I very much doubt
that), a belief grounded only on existing
software practice (I am skeptical), a belief
extremely biased by existing software practice
(this is more credible to me), or something else?

Rewriting software to fit a better model of
computation seems _less_ likely to occur (in the
short term) than the abandonment of x86. Which,
as you have pointed out before, is _quite_
unlikely. (I doubt x86 will be modified--or
even use hints based on logical register names--
to expose register file banks to software. Yes,
if one did not care much about latency and had
moderate parallelism, banking and other tricks
could be used even without software assistance
to increase access bandwidth.)

The above seems to be a bit of a knee-jerk
reaction ("Threads! Did he write 'threads'?!"),
but even so there might be more than two bits
of truth there (four states not $0.25).

Michael S

unread,

Jan 6, 2013, 3:58:23 AM1/6/13

to

Eliminate this belief and the usefulness of CPU as the most "general-
purpose" in the spectrum of computing devices goes away.

Michael S

unread,

Jan 6, 2013, 4:38:37 AM1/6/13

to

On Jan 5, 9:38 pm, Stephen Fuld <SF...@alumni.cmu.edu.invalid> wrote:

> On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:
>
> snip
>
> This is an interesting possibility, but there are a number of "issues"
>
> > We can imagine SMT machines that have larger number of thread contexts,
> > say 16, out of which only a smaller number are active, say 4. In which
> > case the hardware scheduler may be making one of the inactive but loaded
> > threads active, and vice versa. Both for the PAUSE instruction, but
> > potentially for other reasons.
>
> But if the cores are still superscaler, wouldn't the size of the
> multiported register file be an issue?

Thinking about it a little more, today's dual-thread x86 cores have
~100 physical GP registers, i.e. ~70 more than architected.
Following the same "70 more" guideline will have 330 physical
registers. with assumptions like
a) Cycle time is limited by register file access time
(overpessimistic)
b) Register file access time is dominated by gate delays
(overoptimistic) = O(logN)
we end up with factor 1.25 impact to the cycle time.
Which could be either acceptable or not.

Quadibloc

unread,

Jan 6, 2013, 2:00:50 PM1/6/13

to

On Jan 5, 6:06 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> After thinking about this for a couple of months, the problem CPU
> designers have is that they believe that register file access BW
> has to be accompanied by low latency to the RF. Eliminate this
> belief and the problem can be made to "go away"

Is this a belief, or a requirement for what a CPU does?

If it were possible to eliminate conditional branches from long
stretches of code, and write them so that multiple logical threads are
interleaved - so that the result of one instruction isn't needed until
many instructions later - then there would be no problem.

Without that, the option of fetching register contents well in advance
of when they are used is not a possible way to eliminate delays.

John Savard

Ivan Godard

unread,

Jan 6, 2013, 3:25:38 PM1/6/13

to

On 1/6/2013 11:00 AM, Quadibloc wrote:
> On Jan 5, 6:06 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
>> After thinking about this for a couple of months, the problem CPU
>> designers have is that they believe that register file access BW
>> has to be accompanied by low latency to the RF. Eliminate this
>> belief and the problem can be made to "go away"
>
> Is this a belief, or a requirement for what a CPU does?
>
> If it were possible to eliminate conditional branches from long
> stretches of code, and write them so that multiple logical threads are
> interleaved - so that the result of one instruction isn't needed until
> many instructions later - then there would be no problem.

Been done. http://en.wikipedia.org/wiki/Barrel_processor

Andy (Super) Glew

unread,

Jan 6, 2013, 4:00:59 PM1/6/13

to

On 1/5/2013 11:38 AM, Stephen Fuld wrote:
> On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:
>
> snip
>
> This is an interesting possibility, but there are a number of "issues"
>
>
>> We can imagine SMT machines that have larger number of thread contexts,
>> say 16, out of which only a smaller number are active, say 4. In which
>> case the hardware scheduler may be making one of the inactive but loaded
>> threads active, and vice versa. Both for the PAUSE instruction, but
>> potentially for other reasons.
>
> But if the cores are still superscaler, wouldn't the size of the
> multiported register file be an issue? I suppose you could add register
> save and restore to the hardware/microcode scheduler, as long as you had
> agreement on the memory layout of the save area.

Yes.

But...

I can imagine that the distinction between active/inactive threads is
reflected in register file.

Naively, the inactive threads might be stored in a large,
non-multiported register file. The active threads in a smaller
multiported regisyter file. Microcode might copy.

More intelligently, use register renaming to track values between the
less-ported and more-ported register files. No need for microcode:
register renaming hardware naturally flows registers between inactive
and active.

> Would you want to allow multiple scheduling policies to help make
> choices among multiple ready threads?

That's the sticky part.

Probably the hardware thread scheduler would have knobs, parameters that
can be tweaked.

E.g. making some threads always highest priority.

E.g. I am a big advocate of fair share scheduling, esp. easy to
implement using randomization. Knobs to change weights, number and kind
of lottery tickets.

But a hardwired, baked in, hardware (or even microcode) is always going
to be less flexible than an OS scheduler in SW. Not that IS schedulers
are very sophisticated nowadays.

The hard part about this design direction is figuring out a good-enough
hardware scheduler, that the OS can live with.

Of course, we already do that all the time. Consider hyperthreading.

>> Like hardware timesharing. (I.e.
>> instead of SMT, interleaving threads within a cycle, or barrel,
>> interleaving ion cycle basis, interleave at a coarser scale - 10 cycles?
>> 100 cycles? Anything up to the typical system call time would be
>> useful.)
>
> Wasn't there a version of the Power processor that switched on cache
> misses?

http://semipublic.comp-arch.net/wiki/Switch_on_Event_Multithreading_(SoEMT)

SoEMT has been implemented in IBM's RS64-II, RS64-III, and RS64-IV and
Intel's Itanium2 9000 series (Montecito). All of these implementations
targeted commercial workloads, which tend both to be heavily threaded
and to have a larger frequency of high-latency events (cache misses and
perhaps also (uncached) I/O accesses).

>> We don't need to stop at "thread contexts loaded into a hardware
>> register file". If you have microcode you might have the "hardware"
>> (microcode) scheduled threads from a simple "hardware" (microcode)
>> scheduling queue that is stored in memory.
>
> Absolutely, but see my comments about register save/restore.
>
>
>> (BTW, I say "microcode", because I don't like hardware instructions that
>> do complicated sequences of access to memory. I'm a RISC guy at heart.
>> But I am okay with microcode - which is just software, except with

>> special features and support, really a special privilege level.)

>
>
> Yup. Note that you allow the hardware scheduler to be disabled by the
> OS in order to allow an OS to use an existing code, a unique scheduling
> requirement etc. But since the scheduler is a fairly highly used portion
> of the OS, and presumably the HW implementation would be faster, OS
> vendors would be encouraged to use it.

I think the biggest thing is that the hardware/microcode scheduler can
do optimizations that the OS cannot.

For example, the hw/uc scheduler can run threads even though not all
registers are loaded. It can use the register renamer to fetch
registers that are not loaded yet.

You might be able to run more active SMT threads if there are each only
using 8 registers, and fewer if they are using all 32? 64? regs.

Of course, an OS scheduler can do this if it has "fault on access" bits
on the register file. GPUs do this (more explicitly, with a count).
There have been many proposals for such. But, no matter what, at the
moment taking a fault to the OS to fetch in a missing register is a lot
slower than the hardware would be. And it is an OS change for many
machines.

Ditto the part about having microarchitecturally heterogenous register
files (low versus high porting) implementing architecturallyt
homogenous register files.

Stuff flows across this boundary a lot.

Leave something in microarchitecture if SW can't use it effectively.
Or if it is useful in some configfurations, but not others, and it is
too expensive to customize the software for each configuration, for each
OS. Or to get goodness to OS and SW that is slow to change, and/or is
in a fragmented market, where it is too expensive to change every
version of the umpteen versions of OS/SW that matter.

Move to OS/SW if good enough performance, and if the feature is long
term stable.

Andy (Super) Glew

unread,

Jan 6, 2013, 4:12:29 PM1/6/13

to

On 1/5/2013 6:39 PM, Paul A. Clayton wrote:
> On Saturday, January 5, 2013 8:06:23 PM UTC-5, MitchAlsup wrote:
>> On Saturday, January 5, 2013 1:38:30 PM UTC-6, Stephen Fuld wrote:
>>> But if the cores are still superscaler, wouldn't the size of the
>>> multiported register file be an issue?
>>
>> GPUs contain thousands of threads each with their register file on
>> die simultaneously--no saving or restoring needed. These files are
>> multiported (via banking) and support 12 register reads and 4 register
>> writes per cycle continuously. The cores are 4-5 way SuperScalar.
>
> Aargh! Please do not count SIMD elements as threads!
> (Yes, from the standpoint of storage capacity, they
> are threads, but the access control complexity is
> quite different as you know better than I do.)

I heard many people from Intel complain that it was dishonest for GPU
vendors to count SIMD elements as threads.

So I went and asked some authorities: Burton Smith and Arvind of MIT.
Two guys who may have invented the concept of threading, or if not have
certainly been involved in threading longer than nearly anyone on this
newsgroup.

After I explained "GPU vector lane threads" to them, they said: "Yes,
these are threads".

Like me, they prefer to qualify them: "Vector lane threads". Rather
than completely independent hardware threads.

But they are still threads from a hardware perspective.

---

Anyway, Paul: even if you discount vector lane threads, GPUs still have
many, many more threads than, CPUs. The bigger machines have in the
high hundreds, approaching 1000. Typically 32-100 threads per group of
SIMD units.

You need to do this when, as is not uncommon, a dependent operation from
the same thread cannot run for 40 cycles.

> AMD's VLIW4 ISA

seems now to be old news.
dead.

moved away from.

>> After thinking about this for a couple of months, the problem CPU
>> designers have is that they believe that register file access BW
>> has to be accompanied by low latency to the RF. Eliminate this
>> belief and the problem can be made to "go away"
>
> But is it a groundless belief (I very much doubt
> that), a belief grounded only on existing
> software practice (I am skeptical), a belief
> extremely biased by existing software practice
> (this is more credible to me), or something else?

Some workloads need low latency back to back operations.

Other workloads do not.

We can build heterogenous CPU/GPU systems that run the different
workloads on different processors in the same system.

I guess I am figuring out an intermediate, a processor that can do both,
without losing much if anything in either workload.

Andy (Super) Glew

unread,

Jan 6, 2013, 4:19:04 PM1/6/13

to

On 1/6/2013 11:00 AM, Quadibloc wrote:

It's not just rewriting. It may not require rewriting.

Your 1980s era Artifical Intelligence code might have only one or a few
processors traversing the complicated linked data structures that are
the knowledge base.

In that case, any branch misprediction, any extra latency, hurts.

But now, more and more, we use almost exactly the same codes, with
flakey conditional branches and cache misses - but we run many more
instances of the AI code, on different parts of the knowledge base.

In this case, stalls can be tolerated.

Sometimes these parallel works feed summaries to a smaller set of
processors. *These* guys are latency oriented.

It's a question of hardware cost, leakage, bandwidth utilization.

If hardware is cheap, and you have enough threads, even running flakey
unpredictable code, it is cheaper to stall their processor. At so,e
intermediate point, switch to a different thread on the same processor.
At some other point, spend hardware to reduce stalls.

Flakey code with conditional branches and cache misses will always be
with us. What has changed is that we now can run more copies of it.
What has changed is that the data is now bigger, so that it is
worthwhile running more copies of the flakey code.

Andy (Super) Glew

unread,

Jan 6, 2013, 4:23:42 PM1/6/13

to

On 1/4/2013 11:40 AM, EricP wrote:
> Quadibloc wrote:
>> On Jan 3, 10:58 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>>
>>> Secondly, it is not for the OS to judge how a thread
>>> spends its time slice. That's a can of worms.

Yes, a can of worms - but one that was opened long ago.

E.g. IBM Z-series mainframes have the ability to intercept
synchronization instructions like CAS (Compare and Swap), so that the
virtual machine host can context switch gusts that are in spin loops.

This is decades old.

E.g. one of the most important things in improving power efficiency for
laptops was Intel and Microsoft doing a hunt and kill on software apps
that did spinloops.

You don't buy timeslices from a webhosting provider.

You buy work.

If the webhoster can squeeze more work onto the same hardware by context
switching, turning threading on or off, etc. --- well, he can do so.

Andy (Super) Glew

unread,

Jan 6, 2013, 4:25:01 PM1/6/13

to

Or better yet: stupidity.

MitchAlsup

unread,

Jan 6, 2013, 4:42:03 PM1/6/13

to

On Saturday, January 5, 2013 8:39:57 PM UTC-6, Paul A. Clayton wrote:
> But is it a groundless belief (I very much doubt
> that), a belief grounded only on existing
> software practice (I am skeptical), a belief
> extremely biased by existing software practice
> (this is more credible to me), or something else?

Having longer latency through the register file only
adds complexity to the forwarding logic--you can 'pipe'
the rest of it away.

> Rewriting software to fit a better model of
> computation seems _less_ likely to occur (in the
> short term) than the abandonment of x86.

Who said anything about changing the ISA?

What I am implying is that 'it' is still an x86 if
the register file takes 5-10 cycles to access as if
the register file takes 1-2 cycles to access.

With 5-10 cycles of access latency, one can afford
to build a register file that supports dozens to
hundreds of threads.

Mitch

MitchAlsup

unread,

Jan 6, 2013, 4:48:00 PM1/6/13

to

On Sunday, January 6, 2013 3:38:37 AM UTC-6, Michael S wrote:
> a) Cycle time is limited by register file access time
> (overpessimistic)
> b) Register file access time is dominated by gate delays
> (overoptimistic) = O(logN)

There is no reason the cycle time of a CPU should be limited
by the access time of the register file. I built a machine
in the early 00s where the register file took 3 cycles to
access. What you want to say is that the cycle time of the
slowest stage of the register file access pipeline can be no
slower than the slowest stage of the calculation pipeline.

Register file access tmes are more dominated by wire delay
than by gate delay. There is wire delay into the decodeer,
wire delay down th select (word) line, and wire delay over
the read-out (bit) lines. The actual decoder is 2 gates of
delay, the select is 2 gates of delay, and the read-out is
another 2 gates of delay.

Mitch

Stephen Fuld

unread,

Jan 6, 2013, 6:20:24 PM1/6/13

to

OK, Neat idea!

>> Would you want to allow multiple scheduling policies to help make
>> choices among multiple ready threads?
>
> That's the sticky part.
>
> Probably the hardware thread scheduler would have knobs, parameters that
> can be tweaked.
>
> E.g. making some threads always highest priority.
>
> E.g. I am a big advocate of fair share scheduling, esp. easy to
> implement using randomization. Knobs to change weights, number and kind
> of lottery tickets.
>
> But a hardwired, baked in, hardware (or even microcode) is always going
> to be less flexible than an OS scheduler in SW. Not that IS schedulers
> are very sophisticated nowadays.
>
> The hard part about this design direction is figuring out a good-enough
> hardware scheduler, that the OS can live with.

Yes. I am out of date on what current OS schedulers do, but I get the
impression that they aren't super sophisticated.

> Of course, we already do that all the time. Consider hyperthreading.

Certainly.

snip

> I think the biggest thing is that the hardware/microcode scheduler can
> do optimizations that the OS cannot.

I thought of one. Say you have instructions "Lock" and Unlock", which
can be for example, atomic fetch and increment and store zero. The key
is that the hardware recognizes that the instructions are for locks.
Then you add a table in the CPU able to hold one real memory address per
thread. Whenever a lock instruction is encountered, if the attempting
to get the lock fails, the hardware puts the address of the lock (the
real memory address resulting from the lock instruction) in its table
entry. The thread is then automatically marked as inactive, waiting for
a lock. When the Hardware encounters an unlock instruction, it checks
to see if there is a matching lock address in its table. If so, the
corresponding thread is marked available.

This eliminates the need for the application to code spin locks, mwaits,
etc. and allows faster start of the next lock requester when a lock is
unlocked (no need to wait for an arbitrary length of an mwait to
expire). It Reduces bus contention from eliminating the spin locks and
may save power as well. Lastly, if the hardware chooses only one thread
to activate, it prevents a storm when multiple requesters are waiting
for the same lock.

> For example, the hw/uc scheduler can run threads even though not all
> registers are loaded. It can use the register renamer to fetch
> registers that are not loaded yet.

Another good idea.

Andy (Super) Glew

unread,

Jan 6, 2013, 8:31:00 PM1/6/13

to an...@spam.comp-arch.net

Ooops.

I meant to say that they are still threads from a SOFTWARE perspective.

Sure, you may only want to use these threads on simple kernels, with
limited internal flow control. But they are still exactly what you
would get if you did

FOR ALL loop index i
operate on A[i]

Which is different from classic vectors.

Although an interesting thing is the tradeoff between vector execution
models, and vector threaded execution models.

Andy (Super) Glew

unread,

Jan 6, 2013, 8:35:15 PM1/6/13

to

On 1/6/2013 12:25 PM, Ivan Godard wrote:
> On 1/6/2013 11:00 AM, Quadibloc wrote:
>> On Jan 5, 6:06 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>>
>>> After thinking about this for a couple of months, the problem CPU
>>> designers have is that they believe that register file access BW
>>> has to be accompanied by low latency to the RF. Eliminate this
>>> belief and the problem can be made to "go away"
>>
>> Is this a belief, or a requirement for what a CPU does?
>>
>> If it were possible to eliminate conditional branches from long
>> stretches of code, and write them so that multiple logical threads are
>> interleaved - so that the result of one instruction isn't needed until
>> many instructions later - then there would be no problem.
>
> Been done. http://en.wikipedia.org/wiki/Barrel_processor

So has Mitch: Denelcoe HEP, the poster child for barrel processing.

Piotr Wyderski

unread,

Jan 7, 2013, 6:34:38 AM1/7/13

to

Noob wrote:

> What does support have to do with your assertion that RHEL is one
> of several "non-free Linuxes" ?

If RedHat will not support such a modified RHEL and your
client, say a bank, has RHEL installed on their servers,
it means you cannot modify the kernel in order to enable
CPL3 monitor/mwait. And support costs a lot of money.

Best regards, Piotr

Tom Gardner

unread,

Jan 7, 2013, 6:50:35 AM1/7/13

to

You are at liberty to modify the RHEL kernel if you wish,
and you don't have to pay any money to anybody in order
to do that. Hence RHEL is free, as in both beer and liberty.

If you choose not to do that for commercial reasons
applicable in some cases but not in others, then that
has no bearing on whether or not RHEL is free. Can I
suggest that such commercial discussions are off topic?

Terje Mathisen

unread,

Jan 7, 2013, 7:31:17 AM1/7/13

to

Stephen Fuld wrote:
> I thought of one. Say you have instructions "Lock" and Unlock", which
> can be for example, atomic fetch and increment and store zero. The key
> is that the hardware recognizes that the instructions are for locks.
> Then you add a table in the CPU able to hold one real memory address per
> thread. Whenever a lock instruction is encountered, if the attempting
> to get the lock fails, the hardware puts the address of the lock (the
> real memory address resulting from the lock instruction) in its table
> entry. The thread is then automatically marked as inactive, waiting for
> a lock. When the Hardware encounters an unlock instruction, it checks
> to see if there is a matching lock address in its table. If so, the
> corresponding thread is marked available.
>
> This eliminates the need for the application to code spin locks, mwaits,
> etc. and allows faster start of the next lock requester when a lock is

The scenario you are describing _IS_ MONITOR/MWAIT, isn't it?

You try to get a lock and fail, so you use your single thread-private
monitor resource to point at the lock variable, then go to sleep (mwait).

As soon as the lock is updated you wake up again...

> unlocked (no need to wait for an arbitrary length of an mwait to
> expire). It Reduces bus contention from eliminating the spin locks and
> may save power as well. Lastly, if the hardware chooses only one thread
> to activate, it prevents a storm when multiple requesters are waiting
> for the same lock.

This is the reason I decided early on to have one queue per thread,
distributing work in a round-robin fashion.

When you wake up, you know you have some work to do. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

unread,

Jan 7, 2013, 11:39:32 AM1/7/13

to an...@spam.comp-arch.net

On Sunday, January 6, 2013 7:35:15 PM UTC-6, Andy (Super) Glew wrote:
> Denelcoe HEP

Make that Denelcor, H.E.P.

Mitch

Stephen Fuld

unread,

Jan 7, 2013, 1:18:22 PM1/7/13

to

On 1/7/2013 4:31 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> I thought of one. Say you have instructions "Lock" and Unlock", which
>> can be for example, atomic fetch and increment and store zero. The key
>> is that the hardware recognizes that the instructions are for locks.
>> Then you add a table in the CPU able to hold one real memory address per
>> thread. Whenever a lock instruction is encountered, if the attempting
>> to get the lock fails, the hardware puts the address of the lock (the
>> real memory address resulting from the lock instruction) in its table
>> entry. The thread is then automatically marked as inactive, waiting for
>> a lock. When the Hardware encounters an unlock instruction, it checks
>> to see if there is a matching lock address in its table. If so, the
>> corresponding thread is marked available.
>>
>> This eliminates the need for the application to code spin locks, mwaits,
>> etc. and allows faster start of the next lock requester when a lock is
>
> The scenario you are describing _IS_ MONITOR/MWAIT, isn't it?

Yes, but you don't have to explicitly code them so it saves some work,
both by the user and the CPU.

Andy (Super) Glew

unread,

Jan 7, 2013, 3:39:27 PM1/7/13

to

By the way, I believe that Willamette's "fireball" had a 7 cycle
register file access. At least multiple cycles. All hidden by
pipelining. It's the bypass that needs to be low latency
- but for cycle of latency you add a layer of bypassing.

Ivan Godard

unread,

Jan 7, 2013, 8:01:39 PM1/7/13

to

On 1/7/2013 12:39 PM, Andy (Super) Glew wrote:
> On 1/6/2013 1:48 PM, MitchAlsup wrote:
>> On Sunday, January 6, 2013 3:38:37 AM UTC-6, Michael S wrote:
>>> a) Cycle time is limited by register file access time
>>> (overpessimistic)
>>> b) Register file access time is dominated by gate delays
>>> (overoptimistic) = O(logN)
>>
>> There is no reason the cycle time of a CPU should be limited
>> by the access time of the register file. I built a machine
>> in the early 00s where the register file took 3 cycles to
>> access. What you want to say is that the cycle time of the
>> slowest stage of the register file access pipeline can be no
>> slower than the slowest stage of the calculation pipeline.
>>
>> Register file access tmes are more dominated by wire delay
>> than by gate delay. There is wire delay into the decodeer,
>> wire delay down th select (word) line, and wire delay over
>> the read-out (bit) lines. The actual decoder is 2 gates of
>> delay, the select is 2 gates of delay, and the read-out is
>> another 2 gates of delay.
>>
>> Mitch
>>
>
> By the way, I believe that Willamette's "fireball" had a 7 cycle
> register file access. At least multiple cycles. All hidden by
> pipelining. It's the bypass that needs to be low latency
> - but for cycle of latency you add a layer of bypassing.
>
>

Well, all hidden by pipelining until you mispredict.

Andy (Super) Glew

unread,

Jan 8, 2013, 12:11:34 AM1/8/13

to

On 1/5/2013 3:12 PM, Paul A. Clayton wrote:
> On Saturday, January 5, 2013 2:38:30 PM UTC-5, Stephen Fuld wrote:
>> On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:
> [snip]

> Itanium uses a register file that shares ports between
> the two threads. (The same basic mechanism was used
> on at least one Sun SPARC design for register
> windows.) This design exploits the fact that
> registers from different threads are guaranteed to be
> accessed in different cycles. For a register file
> with a large port count, this can provide a
> significant savings in area. (It *might* be possible
> to use such knowledge to reduce static power use in
> the inactive portion.)

I am not quite sure how what you describe works.

... Oh, I get it.

You don't mean "shares ports between the two threads". *All* Intel
processors share ports - the hyperthreaded/SMT x86 processors
essentially have a single unified register file, each register with N
read and M write ports, determined by the worst case for single threaded
behavior.

Or, at least, the register file ARRAY ports are always shared.

Instead, what you mean, I think, is that the ports of individual
registers, individual bitcells, are shared. Each wordline selects a set
of T=2 registers, for T=2 threads, which also share a bitline.
Which of those T=2 registers in a set selected by an active bitline is
selected is determined by which of the non-simultaneously running
threads is active.

What this means is for a pair of logical registers, => 4 registers in a
=2 threaded machine - whereas you would have 4 wordlines for these 4
registers, with 1 bitline per bit - you could fairly easily get away
with 1 wordline per set of two registers, plus another "active thread
select" wordline shared amongst all 4 registers, i.e. amongst both pairs.

I.e. 4 wordlines => 2 wordlines + 1 set select wordline

And, depending on whether layout allowed it, you might be able to go
further, and get one set-select wordline per 4 logical registers

I.e. 8 wordlines => 4 wordlines + 1 set select wordline.

In a stacked array, you would not win anything for the bitlines,
but you might get some benefit if you are squaring out the array anyway.

>
> Saving/restoring to/from a secondary register file
> would also be possible (given reasonably coarse-
> grained switching). Register save/restore could
> generally substantially benefit from banking
> (especially if logical register name correlates
> with physical bank number), so bandwidth could
> be relatively high. In addition, switches
> might often occur when one thread is relatively
> idle, so some extra ports may be available.
>
> In some ways, the power save SRAM region in some
> Intel Atom processors might be viewed as a
> thread context save area--except that it also
> saves other processor state and is only used for
> power saving not to allow a different thread to
> execute (though I have suggested--on comp.arch,
> I think--that such might be a useful extension
> of the hardware).

I think Penryn also had such an SRAM state save area,
for when the CPU was powered off and the CPU registers would not retain
state,

If you power down the CPU and registers, but not (part of) the cache,
you can also save state to parts of the cache that have been locked down
(after flushing dirty memory data fist, of course).

So this is even more like... Except that this save area is not memory
mapped.

But of course, you can also save all processor state to DRAM, or to
disk. But that's usually done under SW control.

Paul A. Clayton

unread,

Jan 8, 2013, 1:07:04 AM1/8/13

to

On Tuesday, January 8, 2013 12:11:34 AM UTC-5, Andy (Super) Glew wrote:
> On 1/5/2013 3:12 PM, Paul A. Clayton wrote:

[snip]

>> In some ways, the power save SRAM region in some
>> Intel Atom processors might be viewed as a
>> thread context save area--except that it also
>> saves other processor state and is only used for
>> power saving not to allow a different thread to
>> execute (though I have suggested--on comp.arch,
>> I think--that such might be a useful extension
>> of the hardware).
>
> I think Penryn also had such an SRAM state save area,
> for when the CPU was powered off and the CPU registers would not retain
> state,
>
> If you power down the CPU and registers, but not (part of) the cache,
> you can also save state to parts of the cache that have been locked down
> (after flushing dirty memory data fist, of course).

Saving to cache would seem to have two issues. 1) The
cache storage cells are unlikely to be optimized for
low leakage at the cost of access latency to the
degree that a specialized save area might be. 2) The
cache access bandwidth is unlikely to be optimized to
support fast saves and restores as a specialized save
area might be (though, with a traditional set
associative parallel read implementation and support
for wide SIMD register accesses, this might not be an
issue).

(I had proposed using SIMD register storage to support
a larger number of scalar contexts. Thread switching
overhead would be probably be significant, though
some cleverness might be applied. Executing from
SIMD register elements would avoid some issues but
introduce others.)

> So this is even more like... Except that this save area is not memory
> mapped.

Of course, if one were allowed to lock part of
the cache capacity as a scratchpad memory with a
specific address (or even page-granular tags--
page granular tags would facilitate virtualization),
then this would be treated as a memory region.
(Note, this is distinct from block or way locking
which still uses tags and replicates main memory
contents.) I am somewhat disappointed that--as
far as I know (huge caveat, especially with the
diversity of embedded systems)--no system
supports such a mapping of cache sections.
(This incidentally can be used to decrease
block size with the same tags [possibly requiring
an extra bit per tag if the maximum physical
address space was used] or use the tags that
are made unnecessary for other purposes.)

Just babbling, it seems.

> But of course, you can also save all processor state to DRAM, or to
> disk. But that's usually done under SW control.

As you have pointed out the distinction between
software (OS) and hardware (processor vendor
firmware/hypervisor) need not be obvious in
terms of functionality.

Paul A. Clayton

unread,

Jan 8, 2013, 1:40:07 AM1/8/13

to

On Sunday, January 6, 2013 4:42:03 PM UTC-5, MitchAlsup wrote:
> On Saturday, January 5, 2013 8:39:57 PM UTC-6, Paul A. Clayton wrote:
>> But is it a groundless belief (I very much doubt
>> that), a belief grounded only on existing
>> software practice (I am skeptical), a belief
>> extremely biased by existing software practice
>> (this is more credible to me), or something else?
>
> Having longer latency through the register file only
> adds complexity to the forwarding logic--you can 'pipe'
> the rest of it away.

OK. I was obviously not thinking very clearly!

Even so, longer register file access would presumably
have some overheads in power and area (from more
complex forwarding [and dividing RF access into
stages?]). Branch mispredictions might be a bit
more expensive as well.

(Operand capture and/or register caching might be
more attractive? For an OoO implementation, read
before scheduling might be more attractive--such
might give more time to resolve bank conflicts?--
though perhaps doing some reads relatively late
might reduce the storage requirements for
operands?)

>> Rewriting software to fit a better model of
>> computation seems _less_ likely to occur (in the
>> short term) than the abandonment of x86.
>
> Who said anything about changing the ISA?

I was thinking that your mentioning GPUs explicit
banking meant that such was the best (or perhaps
most practical) way to support effective banking.

(Yes, hardware could dynamically discover such
access patterns and use renaming to provide better
bank behavior. Hardware might also be able to
exploit at least some common compiler use patterns
such that register names act as hints. OTOH,
having only 16 architected registers would seem
to present some constraints. Perhaps a range of
stack pointer offsets could be hinted for
register allocation [a bit like Todd Austin's
Knapsack cache, which allowed fast access to a
modest area of global memory], though dynamic
registerification choices may not benefit from
such hints.)

> What I am implying is that 'it' is still an x86 if
> the register file takes 5-10 cycles to access as if
> the register file takes 1-2 cycles to access.
>
> With 5-10 cycles of access latency, one can afford
> to build a register file that supports dozens to
> hundreds of threads.

Presumably there are tradeoffs in having a uniform
huge register file.

Beyond banking, it seems that there should be some
ability to exploit diversity of port requirements
(similar to the difference between registers and
cache) or latency/power. Not every storage element
has the same access pattern (especially as some
threads will likely be relatively dormant at some
times).

(Another possible trick to expand register capacity
for multithreading might be something like the
checkpoint registers proposed for OoO execution.
A single ported storage cell is attached to a
multiported cell with the value saved and restored.
["Cherry: Checkpointed Early Resource Recycling in
Out-of-order Microprocessors", José F. Martínez et
al., 2002] With extremely restricted access
patterns, might it even be possible to use some
kind of FIFO of latch-like storage cells??)

EricP

unread,

Jan 8, 2013, 1:55:22 PM1/8/13

to

Andy (Super) Glew wrote:
> On 1/4/2013 11:40 AM, EricP wrote:
>> Quadibloc wrote:
>>> On Jan 3, 10:58 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>>>
>>>> Secondly, it is not for the OS to judge how a thread
>>>> spends its time slice. That's a can of worms.
>
>
> Yes, a can of worms - but one that was opened long ago.
>
> E.g. IBM Z-series mainframes have the ability to intercept
> synchronization instructions like CAS (Compare and Swap), so that the
> virtual machine host can context switch gusts that are in spin loops.
>
> This is decades old.
>
> E.g. one of the most important things in improving power efficiency for
> laptops was Intel and Microsoft doing a hunt and kill on software apps
> that did spinloops.

I would hope that MS utilized MWAIT as soon as it was available
irrespective of the market.

The best approach for a kernel is to use an LH spin queue
(linked list of cache lines discussed in comp.arch earlier),
then do an MWAIT on that cache line.
The result is almost zero bus traffic and low power waste.

For user mode sometimes spinwaiting a while is the cheapest solution.
If a critical section is guarding something simple like a double
linked list, loading the 3 cache lines might take 300 ns.
If the critical section is locked when another requester arrives
then spinning for up to 350 ns and watching for a release is
often much cheaper than going straight to a WaitForEvent+SetEvent
(depending on the odds of a collision).

A user mode MWAIT would certainly improve this but it should
include something like an upper time limit in rdtsc ticks
with zero meaning wait forever (which is essentially a user
mode wait for interrupt). Maybe the kernel could specify
an upper tick count limit for user mode in an MSR.

Going much beyond that gets difficult. Each OS has its own
process and thread activation and deactivation sequences,
and their own synchronization (mutex, events, condvar) mechanisms.
Once the thread is activated it is bound to a core.
Hardware can do some fine grain SMT scheduling
but the OS will consider that SMT-core to be in use.

Eric

EricP

unread,

Jan 9, 2013, 1:11:03 PM1/9/13

to

EricP wrote:
>
> Going much beyond that gets difficult. Each OS has its own
> process and thread activation and deactivation sequences,
> and their own synchronization (mutex, events, condvar) mechanisms.
> Once the thread is activated it is bound to a core.
> Hardware can do some fine grain SMT scheduling
> but the OS will consider that SMT-core to be in use.

Most multi threaded programs are within the same process, and many
events and futexes (critical sections) are between those threads.
A user mode futex is easily created with atomic ops for the
fast path and WaitForEvent+SetEvent for the slow path.

Hardware could easily provide an intra-process event service.
Each process has a process Id or address Space ID (ASID)
used by the TLB to tag per-process entries.
Each event has a unique handle (EVHDL) assigned by the OS at creation.
The concatenation of ASID to EVHDL gives a system wide unique id.
The state transitions for different event operations are straight forward.

A CAM could attach to the system QPI/HT interconnect bus
storing the unique ID and event state. Instructions for
EventReset, EventSet, EventPulseOne, EventPulseAll and EventWait
wait for and release one or more threads. EventWait instruction
can return a status in eax indicating what happened.
Status code could indicate if ID not in CAM table so
thread falls back to use the slower OS sync tools.

The OS runs a ready thread, activating it on a particular core.
Activation involves some generic OS and some cpu specific manipulations.
For example, on x86 the FS segment register to points to the thread header.

A thread waits on an hardware event but the OS thinks it is running.
At what point should the core inform the OS that the thread is
waiting and tying up resources? Presumably this is done by an
exception but then it is pretty much the same cost as a system call.
Maybe a tick time limit, or if say 1/2 of the SMT threads are waiting
then the processor kicks the OS for a course grain reschedule?
At that point the scheduler deactivates the waiting threads
and activates new ready threads.

So basically the saving here is in avoiding the system calls
and not invoking the scheduler for short duration waits.

Eric

MitchAlsup

unread,

Jan 10, 2013, 12:14:10 AM1/10/13

to

On Wednesday, January 9, 2013 12:11:03 PM UTC-6, EricP wrote:
> A CAM could attach to the system QPI/HT interconnect bus
> storing the unique ID and event state. Instructions for
> EventReset, EventSet, EventPulseOne, EventPulseAll and EventWait
> wait for and release one or more threads. EventWait instruction
> can return a status in eax indicating what happened.
> Status code could indicate if ID not in CAM table so
> thread falls back to use the slower OS sync tools.

I think you could do just as well with a straight cache where
the base+UniqueID<<2 contains the data one wants. something like
8 bytes per thread overhead, so you can manage 512 threads per
4KB page.

Given you want tens of thousands of threads (do you really?)
you can manage this with 20 pages of overhead (80KB). This
is essentially invisible in the OS footprint.

Then you do the event stuff as R-M-W operations at the final
resting spot (DRAM) and send back the result, possibly wakening
sleeping threads through another mechanism.

The only thing hard is getting R-M-W operatings into the DRAM
controller! And yet the DRAM controller already does R-M-W when
it detects ECC failures and attemts recovery. Just put the M
part of R-M-W in the ECC Read-to-Write feed back loop used by
ECC repair.

No CAM or cache required.

Mitch

Ivan Godard

unread,

Jan 10, 2013, 2:38:14 AM1/10/13

to

And for the markets that can't afford registered memory?

Michael S

unread,

Jan 10, 2013, 8:02:15 AM1/10/13

to

Michael S

unread,

Jan 10, 2013, 8:12:42 AM1/10/13

to

On Jan 10, 7:14 am, MitchAlsup <MitchAl...@aol.com> wrote:

Do you mean, during patrol scrubbing?
I think, at least logically, scrubbing is done by separate agent
rather than controller itself. Controller sees regular reads and
writes.

MitchAlsup

unread,

Jan 10, 2013, 12:17:09 PM1/10/13

to

On Thursday, January 10, 2013 1:38:14 AM UTC-6, Ivan Godard wrote:
> And for the markets that can't afford registered memory?

Has nothing to do with Registered memory, this is all done
in the DRAM controller on the ECC repair feedback loop. That
is it is independent of the memory technology.

Mitch

MitchAlsup

unread,

Jan 10, 2013, 12:19:12 PM1/10/13

to

On Thursday, January 10, 2013 7:12:42 AM UTC-6, Michael S wrote:
> Do you mean, during patrol scrubbing?

I mean ANYTIME the DRAM controller reads out a hunk of data, detects
a correctable error, it has to write back the updated data. This could
be the scrubber or it could be the demand read ECC repair. Which in my
DRAM controllers were the same logic, its just the scrubber invents its
own demand stream so as to lower the visible ECC failure rate.

Mitch

Ivan Godard

unread,

Jan 10, 2013, 12:28:43 PM1/10/13

to

Sorry, I meant ECC memory - I'm so used to the notions always being
both-or-neither. So what do you do if there's no ECC?

EricP

unread,

Jan 10, 2013, 2:14:25 PM1/10/13

to

MitchAlsup wrote:
>
> I think you could do just as well with a straight cache where
> the base+UniqueID<<2 contains the data one wants. something like
> 8 bytes per thread overhead, so you can manage 512 threads per
> 4KB page.
>
> Given you want tens of thousands of threads (do you really?)
> you can manage this with 20 pages of overhead (80KB). This
> is essentially invisible in the OS footprint.
>
> Then you do the event stuff as R-M-W operations at the final
> resting spot (DRAM) and send back the result, possibly wakening
> sleeping threads through another mechanism.
>
> The only thing hard is getting R-M-W operatings into the DRAM
> controller! And yet the DRAM controller already does R-M-W when
> it detects ECC failures and attemts recovery. Just put the M
> part of R-M-W in the ECC Read-to-Write feed back loop used by
> ECC repair.
>
> No CAM or cache required.
>
> Mitch
>

Ok, I was jumping ahead here. I had assumed we were talking about a
processor with, say, 4 SMT threads Running out of a total pool of say
8 SMT threads that were Running, Ready or Waiting on that processor.
Then there can be, say, 128 processors in a SMP system.
I was exploring the interaction of this and the OS.
A major job of the OS is moving threads onto the processor,
or moving them to a different SMP processor if a thread is Ready
but unable to run and execution resources are available elsewhere.

Probably want to keep the number of running threads on a single
processor low for temporal locality of reference on the TLB and cache.
An alternative is to time slice per instruction -
if N threads are Ready then we round robin among all of them.
If N was large that would thrash the TLB and cache.

So one question would be how many Running SMT threads does it
take to fully utilize a processor without thrashing?
Another would be how big should the total pool of threads
in all states of Running, Ready and Waiting be?
And how does the OS recognize that processor P1 has
some Ready threads that can't Run because it is full,
and processor P2 has available SMT thread slots
(maybe timers and/or total count of Ready threads)?

So are we talking maybe 128 SMP processors, each of which has a
pool of 64 SMT threads, of which it can run 4 threads simultaneously.
Or something bigger?

Eric

Andy (Super) Glew

unread,

Jan 10, 2013, 10:26:54 PM1/10/13

to

ECC?

What fraction of the PC market uses ECC? What fraction of tablets?
What fraction of smartphones?

And, where can I buy a cheap PC that uses ECC?

(Anecdote: one of my friends at Intel always used to buy ECC PCs for
personal use. He was always debugging systems on new chips, and did not
need the hassle at home.)

--

But now I'll expose my ignorance:

ECC loop?

Yes, DRAMs are destructive read. So RMW is usually done at the lowest
level. But that's inside the DRAM chip, isn't it?

(I also say usually, because in some applications you don't care about
destroying the data read. But that's not commodity.)

Are you saying that ECC controllers do any RMW on top of this internal
RMW? This sounds wasteful!

Especially since one would hope that ECC errors are infrequent.

... oh, I get it. You aren't saying "ECC RMWs are always done". You
are saying "many memory controllers support ECC, even though they often
are not connected to ECC memory. So you could use the ECC RMW state
machine, slightly extended, to do this other RMW."

OK.

Andy (Super) Glew

unread,

Jan 10, 2013, 10:28:58 PM1/10/13

to

On 1/10/2013 11:14 AM, EricP wrote:

> So one question would be how many Running SMT threads does it
> take to fully utilize a processor without thrashing?

On Willamette: 1. Maybe 0.5.

Many supercomputer folks disable multithreading.

MitchAlsup

unread,

Jan 10, 2013, 11:24:41 PM1/10/13

to

On Thursday, January 10, 2013 11:28:43 AM UTC-6, Ivan Godard wrote:
> So what do you do if there's no ECC?

The CRAM controller figures out whether the DRAM is ECC or not
during boot time, and then configures itself to the memory that
is plugged in.

Thus, the ECC feedback loop is present (hint it's inside the chip
containing the CPUs/GPUs) and thus the ECC repair feedback loop
is present whether ot not ECC memory is plugged in.

So, you just have to put R-M-W operations into the DRC, and then
provide access to this new functionality in the instruction set.

So, if there is no ECC, i still have the R-M-W functionality.

Mitch

MitchAlsup

unread,

Jan 10, 2013, 11:34:47 PM1/10/13

to an...@spam.comp-arch.net

On Thursday, January 10, 2013 9:26:54 PM UTC-6, Andy (Super) Glew wrote:
> Yes, DRAMs are destructive read. So RMW is usually done at the lowest
> level. But that's inside the DRAM chip, isn't it?

Yes.

> Are you saying that ECC controllers do any RMW on top of this internal
> RMW?

No.

I am saying that the proper place to perform the R-M-W is in the DRAM
controller. Most of the time, the kinds of accesses will be just like
they are today. But on the occasional desire for software to be able to
atomically modify a memory element, that SW be able to send the modify
opcode and the modify operand to the DRC. The DRRC, then, reads the
specified address; applies the modification, and schedules the write
back to the DRAM chips. At the other side of the DRC, the DRC does not
begin processing of an address to the same line as the R-M-W operation
is scheduled.

The only reason the write and read would be coupled is that the DRC
sees a pending read to an open page, and it is power efficient to
do the write before leaving (prechargine the page). Thus, the DRC
reads teh data (probably a line), performs the modification, drops the
modified line in the to_be_written buffer, and signals completion of
the request, optionally returning the previous value in the memory
location.

It just so happens that if you look at modern DRAM controller micro-
architectures, the ECC repair feedback loop has all the properties
other than the atomic calculation. It attaches to the read data,
the read data attaches to the repair input fifo. The repair input fifo
attaches to the write buffer input. the write buffer input attaches
to the DRAM scheduler. Also attached to the DRAM scheduler is the
currnt state fo the open DRAM paages. The rest is simple sequencing.

Mitch

John Levine

unread,

Jan 11, 2013, 12:57:37 AM1/11/13

to

>And, where can I buy a cheap PC that uses ECC?

I got a nice HP DL385 server on ebay with ECC memory and a six drive RAID for $300.
Does that count?

--
Regards,
John Levine, jo...@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly

Andy (Super) Glew

unread,

Jan 11, 2013, 1:20:48 AM1/11/13

to

Be kind, man, I eventually figured out what you meant.

But, just to be contrarian: no, I do NOT necessarily agree with you.

The DRAM ECC RMW is one of the places where this could be put. Perhaps
the best place for a not very often contended synchronization variable
that exhibits little locality.

But... there have been several proposals to do what you describe, and
they have mostly failed to reach critical mass.

With the possible except of atomics on GPUs, in particular GPUs that
tend not to use the caches very much.

They win for some workloads, but not all.

I conjecture that one of the reasons they have not taken off for CPUs is
that this "obvious" place seems like low-hanging fruit. But then you
do the work, perf eval, and it turns out to not work as well as you thought.

Reason: some workloads exhibit both spatial and temporal locality in
their synchronization variables. Heck: even GUPS, the painful benchmark
in this area, has locality in some sense, enough that you can benefit by
convoying and combining.

If you have an atomic RMW instruction that is implemented by always
bypassing the cache, probably flushing the cache lines affected, and
then being implemented at the controller, then...
a) either it sucjs when there is locality, and the variable is
somewhere in your cache hierarchy.
b) how does the compiler decide when to tge ache-bypassing or
cache-resident forms of the RMW. (As you know, Mitch, Intel x86 does
fairly well on cache resident RMWs. So does AMD, although when we were
there AMD was lagging in this area.)

Here's what you need to do:

Create an "External RMW Op" that does the operation. That is requiered
by your scheme in any case.

Expose the x_rmw_op to the L1 cache. If the data is M or E, do it in the
L1. No further traffic.

Expose the x_rmw_op to the L2 cache. If the data is M or E, do it in
the L2. No further traffic.

And so on, down the memory hierarchy.

Consider combining - both ops to exactly the same location, as in
classic Gottlieb Ultracomputer fetch-and-op combing. And ops taht are
to different parts of the same cache line / DRAM page.

Consider both the synch variable closer to the point of request - e.g.
from L3 to L2, etc.

In any of these levels, the ECC RMW can be used. But not necessarily.

E.g. there is benefit to batching reads and wriotes - RMW of the same
page is faster than switching to a different page (in the same bank),
but slower than doing a bunch of reads to open pages, followed by a
bunch of writes.

Often acquiring a synchronization variable is accompanied by such a
bunch of reads. Often synch variables are colocated with the data they
are synchronizing.

Therefore, do the RMW in the queue for the memory controller. Read the
old, place the value in the queue. Do the RMW, the value is still in
the queue. Maintain coherence, and memory ordering. Do a bunch of reads
to the open pages. After a while, do a bunch of writes, including
possibly writing out the synch variable(s).

Andy (Super) Glew

unread,

Jan 11, 2013, 1:22:35 AM1/11/13

to

On 1/10/2013 9:57 PM, John Levine wrote:
>> And, where can I buy a cheap PC that uses ECC?
>
> I got a nice HP DL385 server on ebay with ECC memory and a six drive RAID for $300.
> Does that count?

Not bad. I need a rack...

MitchAlsup

unread,

Jan 11, 2013, 12:40:11 PM1/11/13

to an...@spam.comp-arch.net

On Friday, January 11, 2013 12:20:48 AM UTC-6, Andy (Super) Glew wrote:

There is no reason this same R-M-W cannot be placed throughout the
memory hierarchy. Operation climbs the cache hierarchy, and whereever
the data is found, it gets modified.

Since most CPUs have ECC in the larger caches, one can use the cache's
ECC repair loop to perform the R-M-W.

Mitch

EricP

unread,

Jan 11, 2013, 1:26:03 PM1/11/13

to

Andy (Super) Glew wrote:
> On 1/10/2013 11:14 AM, EricP wrote:
>
>> So one question would be how many Running SMT threads does it
>> take to fully utilize a processor without thrashing?
>
> On Willamette: 1. Maybe 0.5.
>
> Many supercomputer folks disable multithreading.

Ok, but Willamette was a P4 and didn't they have replay cyclones?
And those replay cyclones were due to early issue of the uOp
before the operands were ready, which was in part due to the
design goal of maximizing the clock Hz?

So if one targets maximizing aggregate SMT throughput
one would back off on all those things. Only issuing uOps when
all operands are ready should open up execution holes that can be
filled in by other SMT threads, and save on aggregate power too.

Eric

Michael S

unread,

Jan 12, 2013, 6:09:38 PM1/12/13

to

On Jan 11, 8:26 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> Andy (Super) Glew wrote:
> > On 1/10/2013 11:14 AM, EricP wrote:
>
> >> So one question would be how many Running SMT threads does it
> >> take to fully utilize a processor without thrashing?
>
> > On Willamette: 1. Maybe 0.5.
>
> > Many supercomputer folks disable multithreading.
>
> Ok, but Willamette was a P4

Yes.

> and didn't they have replay cyclones?

More like Willamette/Northwood had them. Prescott and later - not so
much.

> And those replay cyclones were due to early issue of the uOp
> before the operands were ready,

Main source of replays on Willamette/Northwood is as trivial as L1D
cache miss.
However, IIRC, it only happens for L1D miss of integer load
instruction. FP loads are scheduled deterministically. So, I'd expect
that in typical supercomputing scenarios, where FP load misses are far
more common than integer load misses, replay cyclones are not common,
even on Wmt/Nwd.

I'd guess, there are different, and simpler, reasons for inefficiency
of SMT for scientific apps on Wmt/Nwd:
1. FPU accepts only one FP_ADD opcode per 2 clocks and one FP_MUL
opcode per 2 clocks. So, well-behaving vector code often becomes
either FP_MUL or, more often, FP_ADD bandwidth limited with just one
thread.
2 .Execution core can issue only 1 load per clock. So, well-behaving
vector code etc..
3. L1D cache is small 2-way set-associated. So, trashing
4. L2 cache bandwidth is not very high ether. Relatively to (limited)
capabilities of the core, probably not as bad as Merom/Penryn (where
L2 cache was shared by 2 cores, each of them ~ 2 times more capable
than P4), but worse than Nehalem and esp. than SandyB. And in store-
intensive scenarios Wmt L2 bandwidth was bad even relatively to Merom/
Penryn.
5. On Wmt/Nwd (but not Psc) the latency-throughput product of FP_ADD
and FP_MUL was relatively low. So, even in latency-bound FP scenarios,
potential SMT gain is not as big as on Nehalem, and, especially, on
Power5/6/7 and SandyB.

EricP

unread,

Jan 13, 2013, 2:28:11 PM1/13/13

to

I wasn't saying it is inefficient (at least, not in a pejorative way)
as early issue is a consequence of setting maximum GHz and straight
line performance as the design goals. The compiler guide for the
Alpha 21264 also mentions quite a few sources for replay traps.
Just that Willamette should not be held up as a poster child
for SMT (non)performance.

Replay: Unknown Features of the NetBurst Core
http://www.xbitlabs.com/articles/cpu/display/replay.html

and in particular: Replay Influence on Hyper-Threading
http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0

which says "The results are more than illustrative. While one thread
is waiting for the data to arrive from the memory, it slows down the
processing speed of the second thread (>35% compared with the situation
when the data is expected from L1 cache)."

While they say that Northwood mostly corrects this, they also note
"second thread in Prescott still slows down by about 20%, if the
first thread working with the commands of the same type suffers
a lot of L1 cache misses".

So the question is what happens if one starts with design
goals of maximizing aggregate SMT performance?

An interesting example is the XMOS XS1 core which is designed
for real time SMT. However as far as I can tell it does not
have a cache or TLB or user+supervisor modes so is really not
comparable to x64.

http://en.wikipedia.org/wiki/XCore_XS1
https://www.xmos.com/published/xs1-architecture-product-brief
https://www.xmos.com/published/xmos-xs1-architecture

Eric

Ivan Godard

unread,

Jan 13, 2013, 3:36:00 PM1/13/13

to

On 1/13/2013 11:28 AM, EricP wrote:
> Michael S wrote:
<snip>

>
> So the question is what happens if one starts with design
> goals of maximizing aggregate SMT performance?
>
> An interesting example is the XMOS XS1 core which is designed
> for real time SMT. However as far as I can tell it does not
> have a cache or TLB or user+supervisor modes so is really not
> comparable to x64.
>
> http://en.wikipedia.org/wiki/XCore_XS1
> https://www.xmos.com/published/xs1-architecture-product-brief
> https://www.xmos.com/published/xmos-xs1-architecture
>
> Eric
>

The XMOS is a cute little barrel processor, not SMT. It will brush your
teeth for you if you need an i/o processor or other real-time
controller, and the hardware threads make programming much easier than
on a regular DSP, but it is in no way a CPU.

Ivan

Andy (Super) Glew

unread,

Jan 13, 2013, 7:17:34 PM1/13/13

to

We are in massive agreement here. In fact, I thought that i had posted
that in the post that you replied to - but apparently not.

Further:

Make the request look like

REQUEST_RMW(opcode,address,value_used_in_modification)

make the reply look like

REQUEST_RMW_REPLY(reply_state,reply_data)

If the guy that you sent the request to was capable of doing the RMW,
he replies

REQUEST_RMW_REPLY(reply_state=DONE,reply_data=old_data)

If he was not capable, or if he dd not want to, he replies

REQUEST_RMW_REPLY(
reply_state=NOT_DONE_you_now-own_data_in_M_or_E,
reply_data=cache_line_data_not_yet_modified
)

This allows you to talk to dumb systems.

And also to implement protocols that move the data around
if it seems like a good idea.

Andy (Super) Glew

unread,

Jan 13, 2013, 7:34:47 PM1/13/13

to

Yes, Wmt was substantially mismatched to multithreading, especially on
HPC. I don't want to diss multithreading more widely - after all,
MIPS does threads, as do all GPUs - but Intel has not by any means made
HT universally good. But even recent HPC articles are by no means
universal about recommending hyperthreading, as evidenced by the below:

Microway is a good "mass market HPC" vendor.

This recommendation from last year is not encouraging about Sandybridge
SMT / HT.

http://microway.com/hpc-tech-tips/2012/04/achieve-the-best-performance-intel-xeon-e5-2600-sandy-bridge/

Hyperthreading
Hyperthreading has long been a part of the Intel processor designs.
However, it has rarely shown benefit for computationally intensive
applications. It doesn’t provide faster access to data or a larger
number of math units, it simply allows additional threads to be in
flight at the same time.

You will have to test your application to be certain it offers any
benefit. With Hyperthreading enabled, the operating system will see
twice as many processor cores as are actually in the hardware. You’ll
want to run test jobs on both the real and virtual numbers of cores.
Then disable Hyperthreading and run a test again using one thread for
each real/physical processor core (Hyperthreading may be disabled from
the BIOS). Typically, we do not see dramatic performance differences.

Niels Jørgen Kruse

unread,

Jan 14, 2013, 10:54:55 AM1/14/13

to

EricP <ThatWould...@thevillage.com> wrote:

> Probably want to keep the number of running threads on a single
> processor low for temporal locality of reference on the TLB and cache.
> An alternative is to time slice per instruction -
> if N threads are Ready then we round robin among all of them.
> If N was large that would thrash the TLB and cache.

It seems to me that round robin would maximise thrashing.

Better to keep a strict priority among the threads. When a new thread
starts up, it goes into the highest priority slot because it might do
something urgent. The rest move one priority down. The new thread is
going to miss cache a lot and the old first priority thread has the best
chance of doing useful work in the openings.

--
Mvh./Regards, Niels J�rgen Kruse, Vanl�se, Denmark

Mark Thorson

unread,

Jan 14, 2013, 9:59:27 PM1/14/13

to

"Andy (Super) Glew" wrote:
>
> Microway is a good "mass market HPC" vendor.

Wow, there's a name I haven't heard since
I worked for Weitek.

http://www.microway.com/companyprofile.html

They must be doing something right to still
be in business nearly 30 years later.