On 1/5/2013 11:38 AM, Stephen Fuld wrote:
> On 1/4/2013 9:34 AM, Andy (Super) Glew wrote:
>
> snip
>
> This is an interesting possibility, but there are a number of "issues"
>
>
>> We can imagine SMT machines that have larger number of thread contexts,
>> say 16, out of which only a smaller number are active, say 4. In which
>> case the hardware scheduler may be making one of the inactive but loaded
>> threads active, and vice versa. Both for the PAUSE instruction, but
>> potentially for other reasons.
>
> But if the cores are still superscalar, wouldn't the size of the
> multiported register file be an issue? I suppose you could add register
> save and restore to the hardware/microcode scheduler, as long as you had
> agreement on the memory layout of the save area.
Yes.
But...
I can imagine that the distinction between active/inactive threads is
reflected in the register file.
Naively, the inactive threads might be stored in a large,
non-multiported register file. The active threads in a smaller
multiported register file. Microcode might copy between the two.
More intelligently, use register renaming to track values between the
less-ported and more-ported register files. No need for microcode:
register renaming hardware naturally flows registers between inactive
and active.
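
A toy sketch of that flow, in Python rather than gates. Everything here is an assumption made for illustration: the class name, the LRU eviction policy, and the idea that one map keyed by (thread, register) stands in for the rename tables. It only shows values migrating between a small "fast" file and a large "slow" file on demand, with no bulk context copy.

```python
from collections import OrderedDict


class TwoLevelRegisterFile:
    """Toy model: a small multiported 'fast' file backed by a large 'slow' file.

    Hypothetical illustration only. Real renaming hardware tracks physical
    register numbers; here a dict keyed by (thread, reg) plays that role.
    """

    def __init__(self, fast_slots):
        self.fast_slots = fast_slots
        self.fast = OrderedDict()  # (thread, reg) -> value, kept in LRU order
        self.slow = {}             # (thread, reg) -> value
        self.fills = 0             # number of slow -> fast moves

    def _make_room(self):
        # Assumed policy: evict least recently used entries to the slow file.
        while len(self.fast) >= self.fast_slots:
            key, value = self.fast.popitem(last=False)
            self.slow[key] = value

    def write(self, thread, reg, value):
        key = (thread, reg)
        self.slow.pop(key, None)   # any copy in the slow file is now stale
        if key not in self.fast:
            self._make_room()
        self.fast[key] = value
        self.fast.move_to_end(key)

    def read(self, thread, reg):
        # Raises KeyError if the register was never written.
        key = (thread, reg)
        if key not in self.fast:
            value = self.slow.pop(key)  # demand fill of one register
            self._make_room()
            self.fast[key] = value
            self.fills += 1
        self.fast.move_to_end(key)
        return self.fast[key]


rf = TwoLevelRegisterFile(fast_slots=4)
for r in range(4):
    rf.write("A", r, r * 10)    # thread A fills the fast file
for r in range(4):
    rf.write("B", r, r * 100)   # thread B pushes A's values to the slow file
value = rf.read("A", 2)         # only this one register is brought back
```

After the last line, only one register of thread A has moved back, which is the point: a thread that becomes active pays for the registers it actually touches.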
> Would you want to allow multiple scheduling policies to help make
> choices among multiple ready threads?
That's the sticky part.
Probably the hardware thread scheduler would have knobs, parameters that
can be tweaked.
E.g. making some threads always highest priority.
E.g. I am a big advocate of fair share scheduling, esp. easy to
implement using randomization. Knobs to change weights, number and kind
of lottery tickets.
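
A minimal sketch of the lottery idea, assuming integer ticket counts per ready thread. The function names and the ticket numbers are made up for illustration; a hardware version would presumably use an LFSR rather than a software PRNG.

```python
import random


def lottery_pick(tickets, rng):
    """Pick one ready thread with probability proportional to its tickets."""
    total = sum(tickets.values())
    if total <= 0:
        raise ValueError("need at least one ticket among the ready threads")
    winner = rng.randrange(total)      # draw a ticket number in [0, total)
    for thread, count in tickets.items():
        if winner < count:
            return thread
        winner -= count
    raise AssertionError("unreachable: winner is always below total")


def simulate(tickets, slots, seed=0):
    """Count how many scheduling slots each thread wins."""
    rng = random.Random(seed)
    wins = {thread: 0 for thread in tickets}
    for _ in range(slots):
        wins[lottery_pick(tickets, rng)] += 1
    return wins


# The "knob" is just the ticket count: here thread a holds 3 of 4 tickets.
wins = simulate({"a": 3, "b": 1, "c": 0}, slots=10000)
```

Over many slots a thread's share of wins approaches its share of tickets, and a thread holding zero tickets never runs. Changing the weights means rewriting a few counters, not the scheduler.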
But a hardwired, baked in, hardware (or even microcode) scheduler is
always going to be less flexible than an OS scheduler in SW. Not that OS
schedulers are very sophisticated nowadays.
The hard part about this design direction is figuring out a good-enough
hardware scheduler that the OS can live with.
Of course, we already do that all the time. Consider hyperthreading.
>> Like hardware timesharing. (I.e.
>> instead of SMT, interleaving threads within a cycle, or barrel,
>> interleaving on a cycle basis, interleave at a coarser scale - 10 cycles?
>> 100 cycles? Anything up to the typical system call time would be
>> useful.)
>
> Wasn't there a version of the Power processor that switched on cache
> misses?
http://semipublic.comp-arch.net/wiki/Switch_on_Event_Multithreading_(SoEMT)
SoEMT has been implemented in IBM's RS64-II, RS64-III, and RS64-IV and
Intel's Itanium2 9000 series (Montecito). All of these implementations
targeted commercial workloads, which tend both to be heavily threaded
and to have a higher frequency of high-latency events (cache misses and
perhaps also (uncached) I/O accesses).
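
To make the switch-on-event idea concrete, here is a toy cycle-count model. It is a sketch under stated assumptions, not a description of how the RS64 or Montecito actually work: each thread is a string of ops, '.' is a one-cycle op, 'M' is a miss that blocks the thread for miss_latency cycles, and the core stays on a thread until it misses. The function name and switch_cost parameter are hypothetical.

```python
def run_soemt(threads, miss_latency, switch_cost=0):
    """Return total cycles to finish all threads, switching on each miss."""
    pc = [0] * len(threads)        # next op index for each thread
    ready_at = [0] * len(threads)  # cycle at which each thread may run again
    cycle = 0
    current = None

    def unfinished():
        return [i for i, ops in enumerate(threads) if pc[i] < len(ops)]

    while unfinished():
        runnable = [i for i in unfinished() if ready_at[i] <= cycle]
        if not runnable:
            # Every unfinished thread is waiting on a miss: idle until one wakes.
            cycle = min(ready_at[i] for i in unfinished())
            continue
        if current not in runnable:
            if current is not None:
                cycle += switch_cost   # assumed fixed penalty per switch
            current = runnable[0]
        op = threads[current][pc[current]]
        pc[current] += 1
        cycle += 1
        if op == "M":
            ready_at[current] = cycle + miss_latency
    # A miss that is a thread's last op still has to complete.
    return max([cycle] + ready_at)


one_thread = run_soemt(["M.M."], miss_latency=10)           # 24 cycles
two_threads = run_soemt(["M.M.", "M.M."], miss_latency=10)  # 26 cycles
```

With these made-up numbers, two threads finish in 26 cycles instead of the 48 it would take to run them back to back, because one thread's miss latency hides behind the other's work.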
>> We don't need to stop at "thread contexts loaded into a hardware
>> register file". If you have microcode you might have the "hardware"
>> (microcode) schedule threads from a simple "hardware" (microcode)
>> scheduling queue that is stored in memory.
>
> Absolutely, but see my comments about register save/restore.
>
>
>> (BTW, I say "microcode", because I don't like hardware instructions that
>> do complicated sequences of access to memory. I'm a RISC guy at heart.
>> But I am okay with microcode - which is just software, except with
>> special features and support, really a special privilege level.)
>
>
> Yup. Note that you allow the hardware scheduler to be disabled by the
> OS in order to allow an OS to use existing code, a unique scheduling
> requirement etc. But since the scheduler is a fairly highly used portion
> of the OS, and presumably the HW implementation would be faster, OS
> vendors would be encouraged to use it.
I think the biggest thing is that the hardware/microcode scheduler can
do optimizations that the OS cannot.
For example, the hw/uc scheduler can run threads even though not all
registers are loaded. It can use the register renamer to fetch
registers that are not loaded yet.
You might be able to run more active SMT threads if they are each only
using 8 registers, and fewer if they are using all 32? 64? regs.
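
The arithmetic can be sketched as a greedy admission check. The 128-entry physical register file and the per-thread counts below are assumptions picked for illustration, and the function name is hypothetical.

```python
def admit_threads(regs_needed, physical_regs):
    """Greedily admit threads, in order, while their live registers still fit."""
    active, used = [], 0
    for thread, need in regs_needed.items():
        if used + need <= physical_regs:
            active.append(thread)
            used += need
    return active


PHYSICAL_REGS = 128  # assumed size of the multiported register file

light = admit_threads({f"t{i}": 8 for i in range(16)}, PHYSICAL_REGS)
heavy = admit_threads({f"t{i}": 32 for i in range(16)}, PHYSICAL_REGS)
print(len(light), len(heavy))  # 16 threads at 8 regs each, 4 at 32 regs each
```

Same hardware, four times as many active threads, if the scheduler knows how many registers each thread is really using.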
Of course, an OS scheduler can do this if it has "fault on access" bits
on the register file. GPUs do this (more explicitly, with a count).
There have been many proposals for such. But, no matter what, at the
moment taking a fault to the OS to fetch in a missing register is a lot
slower than the hardware would be. And it is an OS change for many
machines.
Ditto the part about having microarchitecturally heterogeneous register
files (low versus high porting) implementing architecturally
homogeneous register files.
Stuff flows across this boundary a lot.
Leave something in microarchitecture if SW can't use it effectively.
Or if it is useful in some configurations, but not others, and it is
too expensive to customize the software for each configuration, for each
OS. Or to get goodness to OS and SW that is slow to change, and/or is
in a fragmented market, where it is too expensive to change every
version of the umpteen versions of OS/SW that matter.
Move to OS/SW if good enough performance, and if the feature is long
term stable.