I only got a few responses, so I thought I would try here.
Suppose you were to build an ISA from scratch, with the primary
purpose being superior multiprocessor performance on commercial
workloads. How would this ISA differ from current high performance
designs?
The responses I got were:
1. Create instructions that could access memory without checking the
entire memory space for coherency conflicts. These could then be used
for hand coding and compiler optimizations to avoid snarling the
interconnect with snooping/coherency checks.
2. Create a local and global memory space; why, same idea as above
(avoiding coherency checks on the local memory space).
Anyway, I am fairly curious about this, since it seems we are pretty
much stuck with ISAs aimed at single processor performance rather than
MP performance.
Cheers,
David
How about instructions for supporting "Space Time Computing":
http://spacejug.org/resources/Embedded_Java/MAJC_Processor/Wirespeed_Multimedia.pdf
In brief -> A 1.5x speed-up for single threaded (Java) programs
running on a multi-processor CPU. Can this be applied as well to other
languages?
Anyone know what happened with MAJC?
Regards,
JonB
David Kanter wrote:
>
> I had a short and semi-unsatisfying discussion of multiprocessing ISAs
> at Real World Tech over the weekend.
>
> I only got a few responses, so I thought I would try here.
>
> Suppose you were to build an ISA from scratch, with the primary
> purpose being superior multiprocessor performance on commercial
> workloads. How would this ISA differ from current high performance
> designs?
>
1) Get rid of strongly coherent cache. It just generates a lot of cache
traffic that is unnecessary. The program mechanisms to indicate when
cache has to be updated are mostly there. Programs have to execute
proper synchronization to get valid data anyway. A program calculation
that is incorrect with data that is 10 msec stale will still be incorrect
even if the data is only 10 picoseconds stale.
2) Standardize the interlocked instructions a little better. It is a pain
to see these instructions pop in and out of architectures. Try writing
anything portable these days.
Candidates here: doubleword compare and swap (not DCAS), load locked/store
conditional combined with monitor/mwait. (First sketch below.)
3) Restartable intervals (second sketch below). These would be instruction
intervals that, if interrupted, would be restarted at the beginning of the
interval. This would let you do things
like implement RCU in user space and get the huge scalability benefits that have
heretofore been limited to non-preemptive kernels. Restartable intervals
can be done in the OS but it would be nice to have hw support. Kernel developers
get touchy about adding anything to the main dispatch path.
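Two quick sketches of the above (both show the idea, not any shipping
interface). First, the portability pain of (2): even with a GCC/Clang-style
compiler, double-width CAS is a per-target lottery (x86-64 wants -mcx16
for CMPXCHG16B; some ISAs have nothing):

    #include <stdbool.h>

    typedef unsigned __int128 dword_t;          /* GCC/Clang extension */

    /* Double-word compare-and-swap via compiler builtins. Not ISO C,
       and not guaranteed to be lock-free (or present) everywhere. */
    static bool dwcas(dword_t *addr, dword_t *expected, dword_t desired)
    {
        return __atomic_compare_exchange_n(addr, expected, desired,
                                           false,            /* strong */
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
    }

Second, a purely conceptual sketch of (3): the RESTARTABLE_BEGIN/COMMIT
markers are hypothetical (Linux's rseq(2) is a software take on the same
idea). A per-CPU free-list pop then needs no interlocked instruction:

    struct freelist { void *head; };            /* one list per CPU */

    void *percpu_pop(struct freelist *fl)
    {
        void *head;
        RESTARTABLE_BEGIN();        /* hypothetical: if preempted past
                                       here, restart at this point    */
        head = fl->head;            /* plain loads and stores only    */
        if (head)
            fl->head = *(void **)head;
        RESTARTABLE_COMMIT();       /* hypothetical: leave interval   */
        return head;
    }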
Joe Seigh
These are as relevant to single CPUs as to multiple ones, because
of the extra flexibility in scheduling. The big problem is
generating code that can make use of them, starting from spaghetti
C and similar codes.
Regards,
Nick Maclaren.
I've implemented RCU before. It's fairly simple.
Joe Seigh
>I had a short and semi-unsatisfying discussion of multiprocessing ISAs
>at Real World Tech over the weekend.
>
>I only got a few responses, so I thought I would try here.
I was thinking of responding, but I still have not thought enough
about the question.
I will hand off the following quickie: If the MP aspect is a tight
CMP or an SMT, having a JAL_parallel instruction might be
useful. This would allow a thread to be spawned if resources
are available. It might also be worthwhile for a traditional SMP
(though it would involve transmitting the spawn_thread command,
the start address, the appropriate register contents, the page
table pointer, etc. across chip boundaries). This would have
an advantage over a system call of having minimal overhead
when all thread resources are active (i.e., it would simply be a
regular procedure call). Providing a similar hint mechanism
for loops might also be valuable.
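To make the intended fallback semantics concrete, a C-level sketch,
where try_spawn() is a hypothetical wrapper around a JAL_parallel-style
instruction that succeeds only when a spare hardware thread context
exists - so when everything is busy, the "spawn" is just a call:

    /* Hypothetical: nonzero if fn(arg) was dispatched to a free
       hardware thread context, zero if none was available. */
    extern int try_spawn(void (*fn)(void *), void *arg);

    void maybe_parallel(void (*fn)(void *), void *arg)
    {
        if (!try_spawn(fn, arg))
            fn(arg);    /* all contexts active: plain procedure call */
    }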
Coarse-grain multiprocessing (both SMP/NUMA and message passing)
probably doesn't need MP-awareness at the (user) ISA level apart from
synchronization primitives, which already exist in all ISAs I know.
Even 1. & 2. of Rob Thorpe don't require user-level ISA support. Both
could be adequately handled by control bits in PTE/TLB and aliasing of
the physical pages.
Intel managed to cover SMT spin lock issues with just one user-level
instruction - pause.
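(For reference, the canonical use, sketched with C11 atomics and the
x86 _mm_pause() intrinsic; the hint is cheap on a non-SMT part and
frees pipeline resources for the sibling thread on an SMT one:)

    #include <immintrin.h>      /* _mm_pause() emits the PAUSE hint */
    #include <stdatomic.h>

    static void spin_lock(atomic_flag *lock)
    {
        while (atomic_flag_test_and_set_explicit(lock,
                                                 memory_order_acquire))
            _mm_pause();        /* back off inside the spin loop */
    }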
Fine-grain multiprocessing could be more interesting and involved.
Currently, we have two alternatives for inter-processor
synchronization - external interrupts and spin locks.
Interrupts are scalable, but heavyweight. On most modern OSes they
suffer from the overhead of multiple context switches and kernel-user
transitions.
Spin locks are lightweight, but non-scalable. Attempts to use spin
locks for fine-grain synchronization in a big system will eat up a
significant part of the precious shared bandwidth.
Since neither solution is ideal, it's tempting to try to combine the
best sides of each one - the scalability of interrupts with the light
weight of spin locks - in a single mechanism.
Let's provide in each processor N pairs of input and output lines
called synchronization flags. In a mid-range system the flags are
connected to N peers in a point-to-point manner. In a large-scale
system, build a hierarchy of crossbar switches.
Now we can start to invent synchronization primitives based on these
flags - wait on a set of events until all are set, wait until any is
set, etc...
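A sketch of those two primitives, taking flags_poll() and flag_set()
as hypothetical intrinsics that sample/drive the N point-to-point
lines. The wait loops spin on processor-local inputs, so they are as
lightweight as a spin lock but put nothing on the shared interconnect:

    #include <stdint.h>

    uint32_t flags_poll(void);     /* hypothetical: sample N input lines */
    void     flag_set(int peer);   /* hypothetical: raise output to peer */

    void wait_all(uint32_t mask)   /* proceed when every flag in mask is up */
    {
        while ((flags_poll() & mask) != mask)
            ;
    }

    int wait_any(uint32_t mask)    /* proceed when any flag in mask is up */
    {
        uint32_t hit;
        while ((hit = flags_poll() & mask) == 0)
            ;
        return __builtin_ctz(hit); /* index of a ready peer (GCC builtin) */
    }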
The biggest problem I see with this approach is an OS issue. The basic
requirement for all modern general-purpose OSes is an ability to
virtualize everything. The flags we are talking about are not easy to
virtualize, mainly because there are too few of them.
However, I am not sure how fine-grain multiprocessing is applicable to
commercial workloads...
I would do exactly the opposite of what your respondents suggested: move all
coherency checking into the load/store instructions, which are pretty long
latency already, and leave the system bus free of general snooping traffic.
Of course, this wouldn't change the ISA so much as the implementation. Under
this design loads would need to stall while other processors were queried
to see if the target address was modified by another processor, and stores
would update some processor local table (an extra bit in the cache tags)
to indicate dirty locations. I guess stores might also need to stall to
make certain that no other processor is trying to update the same location
in another cache. If several stores hit the same location, some mechanism
would need to arbitrate.
Again, this is an implementation issue, not an architectural one, and it
relies on loads/stores already being very slow (so that the added stalls
wouldn't greatly affect performance). For an architectural change I might
like to see semaphore registers common to all processors in the system.
The processors would snoop the limited space of the semaphore registers
to maintain coherency, but would be, otherwise, incoherent. This would
allow us to do away with atomic memory operations.
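A sketch of the resulting programming model, with the sem_* operations
standing in for hypothetical instructions that name one of the shared
semaphore registers by index instead of by memory address:

    /* Hypothetical ISA operations on the shared semaphore registers. */
    extern int  sem_test_and_set(int idx);   /* returns previous value */
    extern void sem_clear(int idx);

    void acquire(int idx)
    {
        while (sem_test_and_set(idx))
            ;       /* spin; only the small register space is snooped */
    }

    void release(int idx)
    {
        sem_clear(idx);
    }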
- Jeff Dutky
>There are several other known technologies that could be of immense
>benefit, several of which have few downsides. They include:
>
> 1) User mode interrupts, handled in thread. This immediately
>eliminates a vast proportion of the nastiest sorts of system global
>synchronisation. One can extend this to some system mode interrupts,
>such as TLB misses where there is no page fault, as on some existing
>systems.
Hey, I thought you wanted this for uniprocessing (e.g., to avoid
the OS overhead just to send a signal back to the same process). :-)
> 2) Proper lightweight threads, with UNPRIVILEGED HARDWARE
>operations for communication and 'scheduling' within a process.
>E.g. one thread could say 'suspend me and run X' without even
>informing the operating system. POSIX threads need not apply.
A more useful/flexible mechanism than JAL_parallel would
probably be nice. However, can't 'suspend me and run X' be
done in software with current hardware (well, I guess this would
have the nasty side effect of allowing all threads to access
thread-private memory)? Having something a little like x86 call-gates
might be useful (this could allow user-level process scheduling).
> 3) Much better performance monitoring and tuning facilities,
>including interrupt handling and memory affinity. Take a look at
>the Pentium 4, and do something completely different! This is a
>more radical proposal than it appears, especially in the light of
>(1), (2) and (4).
Could you elaborate? My ignorance is enormous WRT performance
monitoring hardware.
Providing user-level controllable memory move engines (to
accelerate user-level replication and migration [with
other uses]) might be helpful. (These might be accessed
simply through memory-mapped I/O [so not part of the
ISA as such], but their guaranteed presence and feature
set would make them more likely to be used.)
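A sketch of driving such an engine; the SRC/DST/LEN/GO register layout
is entirely hypothetical, but the pattern is the usual one for
memory-mapped devices:

    #include <stddef.h>
    #include <stdint.h>

    struct copy_engine {           /* hypothetical MMIO register block */
        volatile uint64_t src;
        volatile uint64_t dst;
        volatile uint64_t len;
        volatile uint64_t go;      /* write 1 to start; reads 0 when done */
    };

    void engine_copy(struct copy_engine *ce,
                     void *dst, const void *src, size_t n)
    {
        ce->src = (uintptr_t)src;
        ce->dst = (uintptr_t)dst;
        ce->len = n;
        ce->go  = 1;               /* kick off the move             */
        while (ce->go)             /* poll, or go do something else */
            ;
    }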
> 4) Support for large thread count models. I am talking about
>the ones where a program might want to spin off 5,000 threads for
>16 CPUs. The main extra need here above (2) is a thread context
>caching facility.
What would the benefits be for 5K threads vs. fewer threads
with (e.g.) queues of 'work units'? (Distinct threads might
allow hardware to accelerate saving the old thread/work unit
context and restoring the new thread/work unit context.
Providing block loads--a more broadly useful, more RISCy
approach--could provide similar bandwidth. Stacks and other
memory areas could be made thread-private.)
Providing scratchpad RAM might handle the context caching
with the flexibility of being useful in other ways. (The contexts
would be written to memory addresses that map to the on-die
scratchpad.)
>And so on ....
And so we find Nick guilty of felinocide--inducing fatal curiosity.
Paul A. Clayton
Curiosity killed the cat--but satisfaction brought him back?
Oh, yes, it does that, too. But such an approach has the advantage
that it ELIMINATES the need for a global operation on interrupt,
rather than merely reducing its cost.
>> 2) Proper lightweight threads, with UNPRIVILEGED HARDWARE
>>operations for communication and 'scheduling' within a process.
>>E.g. one thread could say 'suspend me and run X' without even
>>informing the operating system. POSIX threads need not apply.
>
>A more useful/flexible mechanism than JAL_parallel would
>probably be nice. However, can't 'suspend me and run X' be
>done in software with current hardware (well, I guess this would
>have the nasty side effect of allowing all threads to access
>thread-private memory)? Having something a little like x86 call-gates
>might be useful (this could allow user-level process scheduling).
Generally, no. The problem is that context switching is a privileged
operation. This HAS been done by the sort of application that runs
privileged all the time (e.g. some embedded codes), but can't be
done under any mainstream operating system both for software and
hardware reasons.
>> 3) Much better performance monitoring and tuning facilities,
>>including interrupt handling and memory affinity. Take a look at
>>the Pentium 4, and do something completely different! This is a
>>more radical proposal than it appears, especially in the light of
>>(1), (2) and (4).
>
>Could you elaborate? My ignorance is enormous WRT performance
>monitoring hardware.
Not easily, especially as the "gotchas" are very system-specific.
Some common problems include the inability to do any of
the following:
Accumulate accurate monitoring counts for each thread of a
program, unpolluted by actions done during interrupt.
Get reliable information on interrupts and their cost, caused
by actions of the program (think SIGFPE).
Allow the administrator to collect critical counts in the
process logs, in such a way that they are not corrupted if the user
does performance analysis.
Get any data on delays caused by switching from one thread to
another, as distinct from switching from one application to another.
>Providing user-level controllable memory move engines (to
>accelerate user-level replication and migration [with
>other uses]) might be helpful. (These might accessed
>simply through memory-mapped I/O [so not part of the
>ISA as such], but their guaranteed presence and feature
>set would make them more likely to be used.)
Yes. This is a pure software issue, as almost all modern systems
can do the work on complete pages simply by manipulating the page
tables. Actual copying might need a bit of hardware support.
>> 4) Support for large thread count models. I am talking about
>>the ones where a program might want to spin off 5,000 threads for
>>16 CPUs. The main extra need here above (2) is a thread context
>>caching facility.
>
>What would the benefits be for 5K threads vs. fewer threads
>with (e.g.) queues of 'work units'? (Distinct threads might
>allow hardware to accelerate saving the old thread/work unit
>context and restoring the new thread/work unit context.
>Providing block loads--a more broadly useful, more RISCy
>approach--could provide similar bandwidth. Stacks and other
>memory areas could be made thread-private.)
It allows dynamic and near-optimal load balancing, whereas current
approaches allow only semi-static load balancing. This sort of
transaction-based approach was studied some 20+ years back, and
looked good. The Tera MTA rediscovered it, and it too looked good.
In such a model, you probably do NOT split the stack, but each
thread uses a small, fixed work area.
>Providing scratchpad RAM might handle the context caching
>with the flexibility of being useful in other ways. (The contexts
>would be written to memory addresses that map to the on-die
>scratchpad.)
I don't see that context caching is any different in concept from
TLBs versus page tables. The problem with a more general scratchpad
model is ensuring that its invalidation is integrated with its use.
>And so we find Nick guilty of felinocide--inducing fatal curiosity.
Thanks for the compliment!
Regards,
Nick Maclaren.
Let's look at some statistics for commercial workloads: 1 MB L2 global
miss rate 1% to 5%, TLB miss rate 1/4 to 1/20 the miss rate of the L1
caches, distance to main memory 50 ns to 250 ns, and a 3 GHz processor.
A single thread will have at least 750 ps of memory overhead from
the L2 miss rate, maybe another 100 ps from the TLBs. In effect,
a single thread can utilize only 1/3rd of a (1-wide) 3 GHz CPU. The
question is whether we should try to drive 3 threads through an
underutilized core, or have many cores on a 1:1 basis. As a worst
case, a single thread may incur as much as 400 ns per memory
reference, or 1/50th of a (1-wide) 3 GHz CPU. So, it might take as
many as 50 threads to fully occupy a single modern CPU on a commercial
workload.
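A back-of-envelope check of those figures (assumptions: roughly one
memory reference per instruction, the low 1% miss rate, and a 75 ns
miss penalty picked from the middle of the 50-250 ns range):

    #include <stdio.h>

    int main(void)
    {
        double miss_rate = 0.01;       /* 1% global L2 miss rate */
        double mem_ns    = 75.0;       /* mid-range miss penalty */
        double cycle_ns  = 1.0 / 3.0;  /* one cycle at 3 GHz     */

        double stall_ns = miss_rate * mem_ns;          /* 0.75 ns */
        double util     = cycle_ns / (cycle_ns + stall_ns);

        printf("stall per reference = %.0f ps, utilization = %.0f%%\n",
               stall_ns * 1000.0, util * 100.0);   /* 750 ps, ~31% */
        return 0;
    }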
If our clean-sheet CPU were de-pipelined to operate at the same gate
performance as the 3 GHz CPU, the fetch-decode-execute-retire pipeline
would be more than correspondingly reduced, and large amounts of power
would be saved, large numbers of storage elements deleted, clock power
reduced, area reduced, pressure taken off any predictors, schedulers
would have plenty of time for back-to-back operations, and lots of other
good things would accrue. This design point requires no particular
out-of-order stuff, so all that can be jettisoned, leaving only branch
prediction. In effect we get to consider a machine with at least 24
gates/cycle and in practice more like 30-34 real gates per cycle;
486-style architecture. At this point you have a CPU core that is
on the order of 1/8th to 1/10th the size of a K8 core (with large
L1 caches).
6 to 8 of these things would dramatically outperform either the
hyperthreaded P4 approach or the "one-big-node" K8 approach
ON commercial workloads, take less power, yada, yada, yada.
Share some of the expensive (and underutilized) execution resources
between cores, and voila, 16 cores might be possible. For commercial
workloads, this thing would kick serious TPM-Cs.
Then, you can take these simplified cores and multithread them
until the size of the register file limits cycle time.
Problem is that it would seriously underperform on single user
applications (e.g. almost everything except commercial processing).
> 1) Get rid of strongly coherent cache. It just generates a lot of cache
> traffic that is unnecessary. The program mechanisms to indicate when
> cache has to be updated are mostly there. Programs have to execute
> proper synchronization to get valid data anyway. A program calculation
> that is incorrect with data that is 10 msec stale will still be incorrect
> even if the data is only 10 picoseconds stale.
1.a) Make the TLBs coherent, or at least automate TLB shootdowns.
Mitch Alsup wrote:
>
> > 1) Get rid of strongly coherent cache. It just generates a lot of cache
> > traffic that is unnecessary. The program mechanisms to indicate when
> > cache has to be updated are mostly there. Programs have to execute
> > proper synchronization to get valid data anyway. A program calculation
> > that is incorrect with data that is 10 msec stale will still be incorrect
> > even if the data is only 10 picoseconds stale.
>
> 1.a) Make the TLBs coherent, or at least automate TLB shootdowns.
RCU presumably. Page reclaim/stealing is just a form of GC. On the newer
style TLBs with memory space ids as part of the key, you'd probably
do the TLB invalidates as part of a local processor dispatch queue
reshuffle to create an event of interest for RCU.
Joe Seigh
Yes :-( Actually, I believe that it could deliver really impressive
SPECFp, NASPB, Linpack etc. figures, too - IFF it had sufficient
compiler support! That is now a proven technology. But, outside
the HPC area, those are really only used for bragging rights.
What I really don't know is how much other use could be redesigned
to take advantage of such a model. More than almost nothing, but
less than almost everything - which leaves a lot of uncertainty.
|> > 1) Get rid of strongly coherent cache. It just generates a lot of cache
|> > traffic that is unnecessary. The program mechanisms to indicate when
|> > cache has to be updated are mostly there. Programs have to execute
|> > proper synchronization to get valid data anyway. A program calculation
|> > that is incorrect with data that is 10 msec stale will still be incorrect
|> > even if the data is only 10 picoseconds stale.
|>
|> 1.a) Make the TLBs coherent, or at least automate TLB shootdowns.
One easy way to do both of these is to move to a single writer or
multiple reader model BY PAGE. This would be fine for almost all
multi-process applications, and 90% of 'shared memory' use. Don't
even think of using OpenMP ....
Regards,
Nick Maclaren.
[snip]
>Generally, no. The problem is that context switching is a privileged
>operation. This HAS been done by the sort of application that runs
>privileged all the time (e.g. some embedded codes), but can't be
>done under any mainstream operating system both for software and
>hardware reasons.
This was in the context of threads (shared global/heap memory).
If one allows the thread-private memory areas to be unprotected
(within a thread group), how is context switching a privileged
operation? One is merely saving one set of registers and
restoring another set. (If the different threads had differing
state in privileged registers, this would not work; but it is not
clear to (ignorant) me why this would be the case in
commercial workloads.)
I was under the impression that some thread libraries
implemented thread scheduling in user-land (i.e., the OS
scheduled thread groups and each thread group scheduled
'work units' within its group).
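For what it's worth, 'suspend me and run X' can be written today with
the creaky but real POSIX ucontext interface. The switch itself is
register save/restore, although typical implementations still make a
signal-mask system call underneath - which is part of Nick's point:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, work_ctx;
    static char work_stack[64 * 1024];

    static void worker(void)
    {
        puts("work unit running");
        swapcontext(&work_ctx, &main_ctx);   /* suspend me, run main */
    }

    int main(void)
    {
        getcontext(&work_ctx);
        work_ctx.uc_stack.ss_sp   = work_stack;
        work_ctx.uc_stack.ss_size = sizeof work_stack;
        work_ctx.uc_link          = &main_ctx;
        makecontext(&work_ctx, worker, 0);

        swapcontext(&main_ctx, &work_ctx);   /* suspend me, run worker */
        puts("back in the user-level scheduler");
        return 0;
    }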
[snip; 5K threads for 16-way SMP vs. fewer threads with internal
scheduling of 'work units']
>It allows dynamic and near-optimal load balancing, whereas current
>approaches allow only semi-static load balancing. This sort of
>transaction-based approach was studied some 20+ years back, and
>looked good. The Tera MTA rediscovered it, and it too looked good.
Hmm? How is it less dynamic for software to notice a stalling
event and change work units? A 'work unit' would not be visible
to the OS (so OS-based scheduling and protection would be at
the thread level--this differs from threads in a kernel-level
implementation of threading) and could have a specialized
concept of context (e.g., some registers might be considered
part of the group's context--this differs from threads in a user-level
implementation of threading which generalizes the interface).
A simple standard interface would allow simple hardware to
accelerate context switches (insert arguments translated
from the software vs. hardware TLB fill debate).
>I don't see that context caching is any different in concept from
>TLBs versus page tables. The problem with a more general scratchpad
>model is ensuring that its invalidation is integrated with its use.
Well there are some significant differences. A TLB is read for
every memory operation; a context cache would be read less
frequently. A TLB is much more frequently read than written
(especially excluding updating 'accessed' and 'dirty' bits)
relative to a context cache (typically an old context would be
written when a new context is read). A TLB entry's common
use involves limited communication with the processor core
(sending information to the cache for tag matching does not
involve the register file); a context cache almost exclusively
interacts with the core/register file.
These differences make a slower, more general interface
less problematic (and a more general interface allows for
other uses which makes its implementation more likely).
Providing 'invalid' bits for the scratchpad might help WRT
coherence for a context cache (and might be useful for
message passing, using the valid bit to indicate that a
buffer entry is ready for reading??).
Paul A. Clayton
highly ignorant, but not a complete idiot
Fine - in theory! Problem number one is that, for performance, you
need to use multiple CPUs and need to be able to control what another
CPU does from unprivileged code. The next problem is that there is
usually far too much that is bound to a CPU and needs operating
system help to move (e.g. performance counters!)
In the case of anything that uses POSIX threads or equivalent, it
is very likely that the state is different. Consider signal
masking in I/O, for one horrible example.
|> I was under the impression that some thread libraries
|> implemented thread scheduling in user-land (i.e., the OS
|> scheduled thread groups and each thread group scheduled
|> 'work units' within its group).
That is common; it is not the only way, and can have quite a lot of
problems, depending on the system.
|> [snip; 5K threads for 16-way SMP vs. fewer threads with internal
|> scheduling of 'work units']
|>
|> >It allows dynamic and near-optimal load balancing, whereas current
|> >approaches allow only semi-static load balancing. This sort of
|> >transaction-based approach was studied some 20+ years back, and
|> >looked good. The Tera MTA rediscovered it, and it too looked good.
|>
|> Hmm? How is it less dynamic for software to notice a stalling
|> event and change work units? A 'work unit' would not be visible
|> to the OS (so OS-based scheduling and protection would be at
|> the thread level--this differs from threads in a kernel-level
|> implementation of threading) and could have a specialized
|> concept of context (e.g., some registers might be considered
|> part of the group's context--this differs from threads in a user-level
|> implementation of threading which generalizes the interface).
How do you dynamically change the number of work groups, move
threads between them, and so on? Yes, it can be done, but the
cost of the extra level of abstraction is overhead.
|> >I don't see that context caching is any different in concept from
|> >TLBs versus page tables. The problem with a more general scratchpad
|> >model is ensuring that its invalidation is integrated with its use.
|>
|> Well there are some significant differences. A TLB is read for
|> every memory operation; a context cache would be read less
|> frequently. A TLB is much more frequently read than written
|> (especially excluding updating 'accessed' and 'dirty' bits)
|> relative to a context cache (typically an old context would be
|> written when a new context is read). A TLB entry's common
|> use involves limited communication with the processor core
|> (sending information to the cache for tag matching does not
|> involve the register file); a context cache almost exclusively
|> interacts with the core/register file.
No. A TLB entry is. But, similarly, the context needs to be read
for every operation to check the process's modes. In practice,
of course, THAT part of the context is kept in a register. My
point wasn't that the details are the same (as you point out, they
aren't) but that both are caches of largely hidden state.
|> These differences make a slower, more general interface
|> less problematic (and a more general interface allows for
|> other uses which makes its implementation more likely).
Less, actually. Think security. That would work if the security
separation had already been done, but it hasn't.
Regards,
Nick Maclaren.
>The
>question is whether we should try to drive 3 threads through an
>underutilized core, or have many cores on a 1:1 basis.
I'm amazed that you neglected to mention that the number of
outstanding memory ops is a key element of this problem. It's typical
for a system to allow 3 outstanding ops, but 50?
>Problem is that it would seriously underperform on single user
>applications (e.g. almost everything except commercial processing).
There's still a large market for high performance technical computing,
a large fraction of which is not single-threaded.
-- greg
> |> > 1) Get rid of strongly coherent cache. ...
Incidentally, I second (third? fourth?) this one.
> |> 1.a) Make the TLBs coherent, or at least automate TLB shootdowns.
>
> One easy way to do both of this is to move to a single writer or
> multiple reader model BY PAGE. This would be fine for almost all
> multi-process applications, and 90% of 'shared memory' use.
Do you have any data (or any other reasoning) to back up that last
sentence? Surely a fair amount of data has been collected (e.g. from
various DSM algorithms and performance data). Even if it's true, might
it be so only because of the rather crude tools currently available
for sharing at all?
I would imagine that as an execution progresses, more and more of the
data could become read only, but even if so, determining when all of the
data on a page has permanently gone to "no more writing will be
necessary" may well be more trouble than it's worth. (SC is able to
keep that info on a data structure-by-data structure basis.)
Or maybe your statement is based on an assumption that "almost all
multi-process applications" read (and don't update) some sort of large
database or other dataset?
-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing
Yes, but I think that we are at cross-purposes. See below.
>I would imagine that as an execution progresses, more and more of the
>data could become read only, but even if so, determining when all of the
>data on a page has permanently gone to "no more writing will be
>necessary" may well be more trouble than it's worth. (SC is able to
>keep that info on a data structure-by-data structure basis.)
If it were IMPOSSIBLE to change back to single writer mode, then I
agree that few programs would benefit.
>Or maybe your statement is based on an assumption that "almost all
>multi-process applications" read (and don't update) some sort of large
>database or other dataset?
Not that as such, but you are right that it is a statement about
application design.
Most of the shared memory multi-process and many of the multi-thread
programs I have seen do not really use shared memory for fine grain
inter-thread work. Quite a lot of them use it for one of the
following:
Read-only databases, of one sort or another.
Low-overhead point-to-point message passing (e.g. Cray SHMEM).
Simple initial broadcasting and/or final data gathering.
Simple, unshared random access to a large file.
A convenient way of passing arguments and results between threads.
And so on.
All of those have the property that they can be implemented efficiently
on a single writer or multiple reader model, provided that there is a
system call to switch between states (and that call need not be fast)
and another to switch owners of a single writer page (which should be).
If done at the page level, that is trivial with existing hardware.
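A sketch of those two switches on existing hardware, using nothing but
mprotect(). The cross-process protocol - telling the previous owner to
demote its own mapping, and the TLB shootdown that implies - is elided,
and is of course where the real cost lives:

    #include <sys/mman.h>

    /* Become the single writer of the pages in [p, p+len). */
    int take_write_ownership(void *p, size_t len)
    {
        return mprotect(p, len, PROT_READ | PROT_WRITE);
    }

    /* Revert the pages to the multiple-reader state. */
    int release_to_readers(void *p, size_t len)
    {
        return mprotect(p, len, PROT_READ);
    }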
As I said somewhere else, don't even think about using OpenMP (or
POSIX threads for fine grained work) if those are your primitives!
But there are very few applications that do - at present, at least.
Regards,
Nick Maclaren.
I think the SV1 allowed 384 outstanding references, the X1 does
more. ISTR 'thousands' from a talk, I don't remember if they dropped a
number or not.
-George
I am wondering where you got those statistics, cuz I have never really
seen anything like it. I would be quite interested in reading and
finding out more about such sources.
So you would advocate the Piranha approach to commercial workloads?
> Share some of the expensive (and underutilized) execution resources
> between cores, and voila, 16 cores might be possible. For commercial
> workloads, this thing would kick serious TPM-Cs.
>
> Then, you can take these simplified cores and multithread them
> until the size of the register file limits cycle time.
>
> Problem is that it would seriously underperform on single user
> applications (e.g. almost everything except commercial processing).
Yes, that is definitely a problem :) Incidentally, I was only
wondering about architectural designs, rather than the actual
implementation.
David Kanter
Nick Maclaren wrote:
>
> Most of the shared memory multi-process and many of the multi-thread
> programs I have seen do not really use shared memory for fine grain
> inter-thread work. Quite a lot of them use it for one of the
> following:
>
> Read-only databases, of one sort or another.
> Low-overhead point-to-point message passing (e.g. Cray SHMEM).
> Simple initial broadcasting and/or final data gathering.
> Simple, unshared random access to a large file.
> A convenient way of passing arguments and results between threads.
> And so on.
>
I don't think fine grained multi-threading is supported by either the
hardware or the OS, so it's no surprise you don't see much fine grained
multi-threading out there. And not much interest in it as far as I
can tell.
Joe Seigh
Not quite. Cache coherence is specifically to support it, and is
a major cost when developing systems. There is also more in most
operating systems than you seem to think, and then there is OpenMP
and some aspects of POSIX threads. There are also research projects.
You are correct that one reason that you don't see much is that it
is not supported WELL, and one of my frequent complaints is that
system developers should make up their minds whether to support it
as a first-level facility or gain the advantages of dropping it.
As my postings probably make clear, I can adapt to either solution,
and should ideally like some systems to do one and some the other.
Regards,
Nick Maclaren.
Nick Maclaren wrote:
> |> I don't think fine grained multi-threading is supported by either the
> |> hardware or the OS, so it's no surprise you don't see much fine grained
> |> multi-threading out there. And not much interest in it as far as I
> |> can tell.
>
> Not quite. Cache coherence is specifically to support it, and is
> a major cost when developing systems. There is also more in most
> operating systems than you seem to think, and then there is OpenMP
> and some aspects of POSIX threads. There are also research projects.
>
POSIX threads kind of falls short there. Any assumptions about the
conditions necessary for efficient fine grained multi-threading may
or may not hold. So you're damned if you do and damned if you don't.
Ironic really, since POSIX is supposed to buy you some kind of
portability. Also, the pthreads orthodoxy is fairly antagonistic
towards fine grained multi-threading. Contention is everything.
Ideally the perfect pthreads program is a single thread. No contention
and no context switching overhead.
Just to give me a better idea of what you are talking about, what kind
of support in OS's do you think is there?
Joe Seigh
Pretty close :-) Let's ignore POSIX threads as suffering from a
terminal case of cancerous featuritis.
|> Just to give me a better idea of what you are talking about, what kind
|> of support in OS's do you think is there?
Take a look under the covers and in the system-dependent facilities,
and you will see quite a lot of things implemented to be better for
fine-grained parallelism than POSIX threads. In most cases, they
are not intended for normal users, but for OpenMP support - and, in
some cases, for the support of OpenMP's predecessors.
Regards,
Nick Maclaren.
[lots of small cores on chip]
>What I really don't know is how much other use could be redesigned
>to take advantage of such a model. More than almost nothing, but
>less than almost everything - which leaves a lot of uncertainty.
My guess is that almost everything could be, if you had a clean sheet
of paper.
Doing it with an existing program consisting of a million lines of
spaghetti C++, on the other hand, would be... nontrivial :(
--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace
Nick Maclaren wrote:
>
> |> Just to give me a better idea of what you are talking about, what kind
> |> of support in OS's do you think is there?
>
> Take a look under the covers and in the system-dependent facilities,
> and you will see quite a lot of things implemented to be better for
> fine-grained parallelism than POSIX threads. In most cases, they
> are not intended for normal users, but for OpenMP support - and, in
> some cases, for the support of OpenMP's predecessors.
>
Nothing specific though. This is why I generally ignore the "issues"
because nobody wants to say what those issues actually are. And I
can pretty much deal with what the real issues are.
Joe Seigh
And before filling that paper with anything too complex and messy,
remember that making an application consisting of lots of independent
"threadlets" (i.e. with synchronization between rather than within) is
what Software Cabling is all about.
(Well OK, not *all* about.)
Is library-based threading that great of a performance problem?
>usually far too much that is bound to a CPU and needs operating
>system help to move (e.g. performance counters!)
I thought you were going to fix much of that problem (e.g., user-land
interrupts and user-writable counters)! :-)
>In the case of anything that uses POSIX threads or equivalent, it
>is very likely that the state is different. Consider signal
>masking in I/O, for one horrible example.
Is that really a problem? ISTM that one is only changing where
the selection happens (kernel vs. thread library vs. application's
thread/work-unit 'scheduler').
[snip]
>|> Hmm? How is it less dynamic for software to notice a stalling
>|> event and change work units? A 'work unit' would not be visible
>|> to the OS (so OS-based scheduling and protection would be at
>|> the thread level--this differs from threads in a kernel-level
>|> implementation of threading) and could have a specialized
>|> concept of context (e.g., some registers might be considered
>|> part of the group's context--this differs from threads in a user-level
>|> implementation of threading which generalizes the interface).
>
>How do you dynamically change the number of work groups, move
>threads between them, and so on? Yes, it can be done, but the
>cost of the extra level of abstraction is overhead.
For commercial workloads it is not clear how much extra overhead
this would introduce. Such presumably already have queues of
work units (transactions et al.) and probably do some scheduling of
these. Load balancing across _process_ boundaries might be
problematic (the same problem that user-land thread libraries have).
Note: I am not arguing against providing fast paths to 'privileged'
operations, but ISTM that such are not _necessary_ for decent
performance in 'commercial' workloads. (I am a bit sceptical
that a process would benefit from spawning 5K threads for 16
CPUs, and I do think that the OS need not be bothered with
some fine-grained scheduling.)
[snip]
>|> These differences make a slower, more general interface
>|> less problematic (and a more general interface allows for
>|> other uses which makes its implementation more likely).
>
>Less, actually. Think security. That would work if the security
>separation had already been done, but it hasn't.
Hmm? What is the problem with page-based protection of
the scratchpad? I agree that it would be nicer to have finer
separation of privilege (just as it is nice to be able to have
threads with stacks with separate protection and otherwise
selective sharing), but if a process (group of threads) can
only mangle threads in its group that might be considered
adequate (since any thread could mangle the shared
memory anyway). I suspect I am missing something
obvious.
Paul A. Clayton
just a technophile--not a software engineer, systems
architect, processor designer, or any kind of IT professional
This is basically how CDS works, but the "system calls" you mention are
user level, and access is at the data object (memory region)
granularity, rather than page. These regions are used for exactly the
sorts of purposes you mention above. It is what prompted my comment that
one would still need to determine whether everything (e.g. all of the
CDS memory regions) on a particular page were shared read. Maybe you
are just suggesting that the data structures above will often be large
enough as to be allocated (and access controlled) in page granularity?
e.g. in your SHMEM message passing item (again very CDS-ish), a "sender"
may be preparing a new message (perhaps destined for a different
"receiver") which is up against the "read-only" one being received.
In sum, color me skeptical that page-based access states would be very
useful for inter-thread communication optimization.
Nothing. I was commenting on the reasons that you don't want to
use it for saving contexts!
Regards,
Nick Maclaren.
Oh, I agree there! The reason that I favour them is that they allow
a great deal of simplification of the within-CPU performance. My
point is that, if you aren't actually using fine-grained parallelism,
then why pay for it in the form of cache coherence?
Regards,
Nick Maclaren.
David Kanter wrote:
>
> I had a short and semi-unsatisfying discussion of multiprocessing ISAs
> at Real World Tech over the weekend.
>
> I only got a few responses, so I thought I would try here.
>
> Suppose you were to build an ISA from scratch, with the primary
> purpose being superior multiprocessor performance on commercial
> workloads. How would this ISA differ from current high performance
> designs?
>
It occurred to me that actually some of the current multiprocessing
ISA is not adequately documented. So for all we know, the present
ISAs on the more recent processors are perfectly suited to producing
superior multiprocessor performance.
In fact I asked some ISA specific questions and despite c.a. having
more than a few knowledgeable people, nobody was able to explain those
features.
Joe Seigh
Still nothing brilliant to add, but here are a few thoughts:
Since a multiprocessor distributes memory and processing,
a multiprocessor ISA's distinctives would presumably relate to
communication of data and processing directives. Data
communications seems to involve the problem of moving data
to maximize probable locality with minimal eviction of soon-to-be-
used data from local memory and minimal use of interconnect
bandwidth and the problem of memory consistency and
synchronization. Being able to mark specific blocks or
pages as write-update might be useful (or not).
The principle of 'primitives not solutions' (i.e., it is usually
safer and more helpful to provide a simpler more broadly
useful operation than an 'efficient', specifically-targeted
operation) would tend to discourage use of commercial-
workload-specific operations. (This principle does not
exclude the possibility of optimizing a specific idiomatic
use of the primitives at the microarchitectural level.)
Nick Maclaren's much earlier suggestion of a sync
instruction that provided a sequence number might be
a useful primitive.
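Existing primitives can approximate it; here is a C11 sketch in which
a sequentially consistent fetch-and-add serves as both the barrier and
the sequence number. (The attraction of an ISA-level version would
presumably be avoiding the contended cache line this creates.)

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t global_seq;

    /* Acts as a full barrier and returns a globally ordered ticket. */
    static uint64_t sync_and_sequence(void)
    {
        return atomic_fetch_add_explicit(&global_seq, 1,
                                         memory_order_seq_cst);
    }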
WRT communicating processing information, supporting
small contexts would be useful. (Although this can be
done at the OS level--not saving/restoring unused
registers in a 'small' context and zeroing excluded
registers when switching from a 'large' context to a 'small'
context--, some additional hardware support might be
useful. [E.g., an SMT system might support running
twice as many small threads as large processes with
modest additional hardware.] Providing only a smaller
context might also be useful for code density.)
It might be useful to provide the ability to generate
a user-level exception on a cache miss for a remote
memory location along with some means to request
local processing. (This might allow pointer-chasing
code to be more localized to minimize latency as
well as bandwidth.)
Code density might be a minor factor (replicating an
additional 16MiB of code per node is not a major
problem with 8GiB of memory per node, and even
the improved hit rate and reduced memory bandwidth
use might not be significant benefits).
At a higher level than ISA (in the sense of what an
assembler needs to know), address translation
choices could have an effect on multiprocessor
performance. E.g., large pages are nice to reduce
TLB misses, but introduce significant overhead if
dynamic page replication and migration are desired
(and such can improve performance).
Paul A. Clayton
somnolence and ignorance--a deadly combination :-\
"Paul A. Clayton" wrote:
>
> Nick Maclaren's much earlier suggestion of a sync
> instruction that provided a sequence number might be
> a useful primitive.
>
"much earlier"? You mean like interlocked increment
instructions or atomic fetch and add instructions?
There's also the STCKE * (Store Clock Extended) instruction
in the Z architecture which guarantees a unique value
(provided the TOD clocks are in sync) and has membars
in the right places.
Also, you have things like eventcounts and eventtickets
which can be implemented really efficiently if you
know the trick.
* weird. It used to be that the 64 bit version, STCK,
guaranteed a unique value but they dropped that. CPUs
are getting too fast, 64 bits is not enough. Also,
nice leap second table for STCKE. Makes the unix "epoch
that is not an epoch" look like a joke.
Joe Seigh
I never said that I had invented the idea :-)
Regards,
Nick Maclaren.
NO WAY would IBM "drop" an architectural promise. The problem is
that keeping the promise is expensive. Machines are just about
becoming fast enough that STCK has to be artificially delayed until
time has caught up with the low-order ticking bit, and the ticking
bit has to tick slower as the size of the processor-number field
used to guarantee cross-CPU uniqueness (without slow cross-CPU
communication) is getting larger to accommodate larger maximum
CPUs/system counts. STCKE has enough bits to keep disambiguating
CPU-number bits way below the ticking bit, which can tick as fast
as necessary, even beyond the quarter-nanosecond resolution of the
64-bit TOD.
The 8 extra high-order bits to extend the epoch from 142.7 years to
more than 36,000 years are just gravy...
Michel.