I'm - happy? chuffed? not surprised? interested? - to see one of the
last bastions of RISC fall down. Fujitsu has added an instruction
prefix. Albeit a 32 bit instruction prefix, not an 8 bit prefix like
the amd x86-64 REX byte. But same idea. New register for the prefix state.
Also specifies extended opcodes for new instructions.
Pardon the mess, but I'll just cut and paste the text from the slide:
Large register sets 2/2
Instruction format for 256 FP registers
8 bit x 4 (3 read +1 write) register number fields are necessary for FMA
(Floating-point Multiply and Add) instruction.
But SPARC-V9 instruction length is limited (32bits �fixed)
Defined a new prefix instruction (SXAR) to specify upper-3bit of
register numbers of the following two instructions.
SXAR
inst1
inst2
Lower-5bit x4
Upper-3bit x4
SXAR (Set XAR) instruction
XAR: Extended Arithmetic Register
�Set by the SXAR instruction
�Valid bit is cleared once the corresponding subsequent instruction gets
executed.
Operand fields of SXAR
fv
furd
furs1
furs2
furs3
sv
surd
surs1
surs2
surs3
31
0
fsimd
ssimd
16
15
First Upper Register Source-1 bits
SXAR1: set XAR for subsequent one instruction.
SXAR2: set XAR for subsequent two instructions.
Does anyone still care about SPARC? If they do, that would be the
real news.
Robert.
Oracle say they do...
-p
--
Paul Gotch
--------------------------------------------------------------------
I care, because Fujitsu has been, IMHO, one of the most interesting
microprocessor design companies in the last 10 years.
Fujitsu SPARC.
Not Sun. Not Sun SPARC. Sun, to be honest, has not been terribly
interesting.
Overall, it only adds about 16 gates of total pipeline delay to have
byte level instruction lengths. Doing it at the word level cannot add
even this amount of gates. That is if a RISC design has 70 gates of
fall through delay, and x86 will have but 85-86. The fact that the
typical x86 has 256 gates of fall through delay cannot be blamed on
the instruction set! (But I digress)
In addition, last year I did some consultation with a company that was
considering adding a "payload" instruction to the instruction set. The
payload instruction carried a number of bits that other instructions
(already defined) could consume. You might use such a feature to carry
some more addressing bits, some register specifying bits, or some
instruciton set expanding bits. The payload instruction did not care
how the bits were consumed. (But again, I digress)
What we are arriving at is a point where we (the microarchitects and
implementers) have exploited all that is exploitable from the
architects of the past (6600, 360-91, 360-85, 360-67) in the context
of general purpose. If one looks at the distance in time between 360-
ISA introduction and the first RISC ISA introductions we have about 20
years. Now, it has been another 20 years and this keg seams tapped
out. In order to accrete that last modicum of performance for that
last application someone cares about, half a zillion instructions are
throw in. This is a sign that things are not well in architecture-
ville.
But, of course, the problem is not even in the instruction set, and
since the 1 million transistor level has not been. That is, as long as
the instructions that get created exist within the kinds of data-flow
the microarchitecture already supports; adding instructions is, for
all intents and purposes, (almost) free. It certainly takes more die
area to manage the data-flow than to manage the data-computations, so
to a first order, adding instructions is free (at the large end of
processor microarchitecture.) In addition, to a large extent, nobody
cares about the instruction set since compilers got "reasonably good".
As long as the programmer does not have to see the instructionset, why
should the customer care? This new foray into MAC-ville with 3-operand
instructions (sometimes with a 4-th destination) simply causes the
microarchitect to provide adequate register ports, and adequate
reservation station tracking. As long as this does not break the
camels back, its OK--not great but not worse than OK either. Just plan
for it and get on with life.
So, where are the instructions designed to allow the n-way
multiprocessor do synchronizations 10X faster than current? (OK, how
about 2X with guarenteed forward progress for at least one thread.)
This is really the kind of breakthrough that the large machines need.
(Where n is greater than 64) Even the scientific number chrunchers
would benefit from better synchronizations.
So, where are the new technologies to allow greater bandwidths to
greater memory with lesser latency? Say, 1 TB main memories with (say)
100 ns total latency average case (OK, maybe 150 ns total latency with
up to 64 nodes accessing the cabinet filled with DRAMs.)
Seems to me, that too many clever people doing the processors
(squeezing the last blood from the stone), and too few doing the
microarchitecture of the rest of the system (adding blood to the
stone).
Merry christmas
Mitch
>
> So, where are the instructions designed to allow the n-way
> multiprocessor do synchronizations 10X faster than current? (OK, how
> about 2X with guarenteed forward progress for at least one thread.)
Synchronization is just one part of the communication between two CPUs;
its generally followed by a transfer of some amount of data. In many
cases, the data-transfer completely dominates this overall
communication, so the cost of the synchronization is in the noise.
Further, synchronization is done at the level of "processes", not
hardware. If a process happens to be swapped out or not yet ready to
synchronize, the wait time for the last processes to get to the
synchronization point will dominate the overall cost.
The overall performance impact on the program of improving the hardware
support for synchronization is, in IMO, generally going to be
insignificant.
Can you show studies to the contrary?
> This is really the kind of breakthrough that the large machines need.
> (Where n is greater than 64) Even the scientific number chrunchers
> would benefit from better synchronizations.
Are there any studies, particularily on non-micro-benchmark codes, that
would quantify this improvement?
For message-passing codes, perhaps. For the shared-memory parallel
paradigms that are currently trendy, not at all.
>Further, synchronization is done at the level of "processes", not
>hardware. If a process happens to be swapped out or not yet ready to
>synchronize, the wait time for the last processes to get to the
>synchronization point will dominate the overall cost.
If you are working with very coarse-grained parallelism, then I agree
hardware instructions are irrelevant.
>The overall performance impact on the program of improving the hardware
>support for synchronization is, in IMO, generally going to be
>insignificant.
Don't bet on it. What it does is to make it feasible to parallelise
the sort of program where the parallelism cannot be made coarse-grained,
or where there is potentially much more gain for fine-grained.
>Can you show studies to the contrary?
I could, once. I no longer have easy access to the relevant classes
of system.
>> This is really the kind of breakthrough that the large machines need.
>> (Where n is greater than 64) Even the scientific number chrunchers
>> would benefit from better synchronizations.
>
>Are there any studies, particularily on non-micro-benchmark codes, that
>would quantify this improvement?
Yes. How many have been published in places you can find them, or
even written up suitable for publication, I don't know. I know that
mine weren't.
Note that the situation involves more than just the synchronisation
operations, because a lot of it is about scheduling. If you are
trying to parallelise code with a 10 microsecond grain, having to do
ANY interaction with the system scheduler runs the risk of a major
problem. That is one of the main reasons that almost all HPC codes
rely on gang scheduling, with all threads running all the time.
Regards,
Nick Maclaren.
So core 1 writes some data, core 1&2 synchronize, and core 2 reads the
data. What actually happens post-synchronization?
Well, cache lines get copied from dcache-CPU-1 to dcache-CPU-2. This
takes time. This time will be proportional to the shared data. The cost
can actually be higher than in the case of an explicit message passing
system.
The synchronization, by contrast, can involve the tranfer of exactly one
cache-line [e.g. if you're doing an atomic-increment].
More heavyweight synchronization operations (such as a lock with suspend
on the lock if already locked) *can* be more expensive - but the cost
is due to all the additional function in the operation. Its not clear
that tweaking the underlying hardware primitives is going to do much for
this.
> Yes. How many have been published in places you can find them, or
> even written up suitable for publication, I don't know. I know that
> mine weren't.
Pity
> Note that the situation involves more than just the synchronisation
> operations, because a lot of it is about scheduling. If you are
> trying to parallelise code with a 10 microsecond grain, having to do
> ANY interaction with the system scheduler runs the risk of a major
> problem. That is one of the main reasons that almost all HPC codes
> rely on gang scheduling, with all threads running all the time.
>
Agreed.
BTW: my experience is with systems where we're synchronizing on less
than 100 cycle granulatrity - at that granularity, you're basically
programming against bare metal, with fixed thread mappings, all-or-none
thread scheduling and no "system software" to speak of.
It's not clear, I agree, but one problem with existing ones is that
they are usually privileged, which forces a system call. That isn't
what you want, for many reasons.
>BTW: my experience is with systems where we're synchronizing on less
>than 100 cycle granulatrity - at that granularity, you're basically
>programming against bare metal, with fixed thread mappings, all-or-none
>thread scheduling and no "system software" to speak of.
That's largely because there are no adequate facilities for doing it
any other way :-(
Regards,
Nick Maclaren.
> In article <hIGdnRxV8ZyP1qvW...@bestweb.net>,
> Mayan Moudgill <ma...@bestweb.net> wrote:
>
>>More heavyweight synchronization operations (such as a lock with suspend
>> on the lock if already locked) *can* be more expensive - but the cost
>>is due to all the additional function in the operation. Its not clear
>>that tweaking the underlying hardware primitives is going to do much for
>>this.
>
>
> It's not clear, I agree, but one problem with existing ones is that
> they are usually privileged, which forces a system call. That isn't
> what you want, for many reasons.
>
Again, that supports my original point - the performance of
synchronization has nothing to do with improving synchronization
primitives, but with everything else in the system.
The reason you need that system call, I assume, is to suspend a thread
on a contended lock or to resume suspended threads. You could always use
spin-locks and avoid that system call - but then you get into the issue
of utilization.
"Nothing to to with" is too strong - part of the reason that the
rest of a system gets it wrong is that the hardware primitives do.
Only a part, I agree.
>The reason you need that system call, I assume, is to suspend a thread
>on a contended lock or to resume suspended threads. You could always use
>spin-locks and avoid that system call - but then you get into the issue
>of utilization.
It's worse than that :-(
Let's say that thread A wants to suspend itself in favour of thread B,
until the latter next suspends itself. If thread A uses a spin-loop
for its wait, thread B may never get to run, so thread A will wait for
ever ....
There are lots of important threading paradigms, which are known to be
useful, which are close to infeasible to use on modern systems.
Regards,
Nick Maclaren.
I believe Mitch is referring to potential new hardware functionality
like AMD's Advanced Synchronization Facility proposal.
I can't seem to find any info on it on the AMD website as the proposal
seems to have degenerated into just a registered trademark notice.
Having the ability to perform a LoadLocked/StoreConditional on
up to 4 separate memory locations would eliminate much of the
need to escalate to the heavyweight OS synchronization ops.
Eric
This appears to require a method of establishing a global
first-come-first-served ordering for the cpus that is
independent of the physical memory locations involved.
In the unlikely event of cache line ownership contention then
the first cpu to begin its multiple update sequence wins
and the other rolls back.
The trick is for it be a low cost mechanism (ideally the cost of
a single cache miss to establish the order) that works within
the existing cpu hardware, bus and coherency protocol.
For that I'm thinking that maybe a global device located
at some special physical memory location would establish
the global order at the start of a multiple update sequence.
Then using Invalidate bus ops to a set of other special
physical memory locations could communicate that ordering
to other cpus and they can associate that with the bus id.
So in this example the overhead cost would be 3 bus ops to
Read the global device, an Invalidate indicate my order to
the peers, and an Invalidate at the end of the sequence.
Eric
Huray for Google?
http://developer.amd.com/CPU/ASF/Pages/default.aspx
The problem with the current definition of ASF is that it has
_no_ guarantee of forward progress. Two contenders can wind up
ping-ponging back and forth in a livelock and software must
provide intervention that resolves this.
I was suggesting in my other message that it looks to me that
in order to guarantee that _one_ contender makes forward progress,
one need only establish a global order for aborts in the face of
contention. That does not mean fairness or eliminate livelock
as an individual can still wind up being continuously victimized.
To guarantee fairness there must be some sort of queue to
fall back upon.
Eric
I'm not so sure. I tend to agree with sonething that is implicit in what
Mayan is saying.
I have seen two main classes of system:
(1) Systems that allow thread and process migration from CPU to CPU.
(2) Systems that don't.
When you allow migration, if you are cache coherent you need do nothing
much. Maybe a fence, but no cache flushes.
However, if you use caches but are not cache coherent, then on a process
or thread migration you need to flush all of the cache footprint of the
thread being migrated. Not necessarily all the way to memory, but at
least to the shared cache level.
E.g. in a multicore chip, typically the L1 and L2 caches are private,
but the LLC is shared. So if you migrate between processors on the same
chip, you need to flush to the LLC. If you migrate between chips,
further out, typically to memory.
With AMD Bulldozer's MCMT, private L1's but shared L2s, sometimes you
would need to flush to L2, sometimes to L3, and sometimes all the way to
memory.
Unfortunately, x86 doesn't have all the varieties of cache flush you
would need to do this.
Unfortunately, it's chicken and egg: the vast majority of shared memory
cache coherent systems don't need this. So it isn't even provided, let
alone made fast.
Because of this, there seems to be a clustering of usage models:
(1) General purpose cache coherent systems that allow thread migration
(2) Special purpose systems that are not necessarily cache coherent,
which disallow, or at least highly deprecate, thread migration.
These latter I call embedded systems. Even if they are in big
supercomputers.
Unfortunately, high degrees of parallelism require thread migration for
load balancing. But thread migration in the absence of cache coherency
is really slow, unless somebody creates the appropriate cache flush
operations. Which are easy to do, but hard to justify in today's
incremental climate.
Actually I was being a bit dumb here.
I was thinking that each locker would acquire an 'order ticket'
and if there is a collision, the one with the higher (older)
order would abort, acquire a new ticket, and retry.
And that does lead to potential retry starvation.
However if the loser just aborts but retains its ticket
for the next try then that is a fair queue.
And since the winner knows who it collided with, it can send
a message to the loser letting it know when it is ok to retry.
So the loser can watch for a 'Done' message or have a timer
(just in case) and retry the multiple update sequence.
That doesn't guarantee success on the next try, but does
minimize latencies due timer delay retry loops.
Eric
> Mayan Moudgill wrote:
>> More heavyweight synchronization operations (such as a lock with
>> suspend on the lock if already locked) *can* be more expensive - but
>> the cost is due to all the additional function in the operation. Its
>> not clear that tweaking the underlying hardware primitives is going to
>> do much for this.
>
>
> I believe Mitch is referring to potential new hardware functionality
> like AMD's Advanced Synchronization Facility proposal.
> I can't seem to find any info on it on the AMD website as the proposal
> seems to have degenerated into just a registered trademark notice.
>
> Having the ability to perform a LoadLocked/StoreConditional on
> up to 4 separate memory locations would eliminate much of the
> need to escalate to the heavyweight OS synchronization ops.
>
> Eric
Let me ask the following question: assume that we have a N-process
barrier, how do you avoid using N system calls (use any hardware
synchronization you want)?
The fundamental problem becomes that, in the general case, the first N-1
processes that arrive at the barrier will need to (have the kernel) put
themselves on a waiting queue, and the last process to arrive will need
to (have the kernel) put the N-1 waiting processes back on the ready queue.
I doubt that software transactional memory solves this problem.
OK, so you want all N threads/cores/cpus to wait here until the last one
arrives?
>
> The fundamental problem becomes that, in the general case, the first N-1
> processes that arrive at the barrier will need to (have the kernel) put
> themselves on a waiting queue, and the last process to arrive will need
> to (have the kernel) put the N-1 waiting processes back on the ready queue.
>
> I doubt that software transactional memory solves this problem.
First of all, I assume dedicated cores for each thread, i.e. gang
scheduling, otherwise the OS is involved in thread switches anyway, right?
It seems to me that you could use a single LOCK XADD(counter,-1) from
every thread, followed by a spin loop that contains a MONITOR
instruction on 'counter' and a counter re-read to determine if it has
made it all the way down to zero by now.
This would seem to give forward progress for each bus transaction, the
others involved in a contention will do a hardware retry, leading to at
most N updates total as long as the bus protocol is such that when
multiple cores try to acquire the same cache line for update, one of
them will win and go on to decrement the counter and flush the line.
Alternatively, if N is _very_ large, then a multi-stage counter where
sqrt(N) cores use each of the sqrt(N) first-level counters, and the
"winners" of each of those update the top-level counter, and report back
to the other first-level cores.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
>
> First of all, I assume dedicated cores for each thread, i.e. gang
> scheduling, otherwise the OS is involved in thread switches anyway,
right?
>
> It seems to me that you could use a single LOCK XADD(counter,-1) from
> every thread, followed by a spin loop that contains a MONITOR
> instruction on 'counter' and a counter re-read to determine if it has
> made it all the way down to zero by now.
>
This solution is (as stated) not general; in the restricted case being
considered, it is .... ummm ... baroque? Cleaner solutions are
definitely possible.
One particular red-flag is the use of the MONITOR instruction. Its only
available at privilege level 0. How did you get there? With a system call?
This barrier question is quite different from the
atomic multiple update issue.
Monitor is available in user mode, and Mwait can be
if the OS enables it.
The general solution for a barrier spinwait requires 2 counters,
a WaitCount and RunCount, ideally on separate cache lines.
Each thread increments WaitCount and if less than N spinwaits
while RunCount == 0.
When WaitCount hits N then that thread resets it to zero
and sets RunCount = N-1 to release the peers and proceeds.
Each peer decrements RunCount and proceeds.
Its may not be a great solution because when the pack is
released they all immediately try to decrement RunCount.
Depending on the system that might cause a lot of contention
but can't be avoided as the last thread leaving the barrier
must shut the door behind it.
volatile int gWaitCount;
volatile int gRunCount;
if (AtomicFetchAdd (&gWaitCount, 1) == THREADCOUNT-1)
{ // Release the peers
AtomicWrite (&gWaitCount, 0);
AtomicWrite (&gRunCount, THREADCOUNT-1);
}
else
{ // Wait for quorum
while (AtomicRead (&gRunCount) == 0)
{ CpuPause(); }
AtomicFetchAdd (&gRunCount, -1);
}
Eric
If so, then I really don't see what the problem is?
>
> One particular red-flag is the use of the MONITOR instruction. Its only
> available at privilege level 0. How did you get there? With a system call?
I didn't know that, until now I had assumed that MONITOR+PAUSE was a
user-level way to sleep a core, without involving the OS at all.
Skipping the MONITOR part leaves me with a stock spin loop where every
core will re-read the counter until it has reached zero, which also
seems OK, since there's nothing else useful they can do.
What's important is _only_ the latency from when the last core executes
its XADD() operation until all the waiting cores have been able to
reload the counter and seen that they are ready to go on.
Since I have my head in hardware multithreading land - GPUs, various
flavours of CPUs with a small number of actively running threads and a
larger number of threads that are not actively running in a thread pool
- then it is straightforward.
The first N threads to arrive at the barrier switch to not activey
running. They are in the thread pool, but other threads take their
place on the actively running hardware thread contexts.
When the last thread arrives at the barrier, they all are marked
runnable again.
If the OS wants to context switch other threads/processes in/out, it can.
---
Yes, I am talking about a hardware or microcode thread scheduler.
I got into this space because of my work on speculative multithreading.
In an SpMT system you typically need many more potential threads that
you need actively running threads. If you don't want the SpMT to be
software visible, then you need to do such scheduling in hardware or
microcode, underneath the OS or VMM.
Some folks want to get the OS involved in such scheduling. Sure, maybe
if you work at Microsoft. Not if you work at Intel or AMD. If you work
at a hardware company, you don't want your project schedule to depend on
a different company.
However, my attitude has always been to reduce the need for software
support, but to allow software access. So, if you have a hardware or
microcode thread scheduler for SpMT, why not, eventually, expose it to
software? If it has any advantages.
The sorts of advantages it might expose things like power management,
load balancing, QOS.
Correct.
> In the unlikely event of cache line ownership contention then
> the first cpu to begin its multiple update sequence wins
> and the other rolls back.
But more importantly, that entity that determins who wins also
establishes order over current participants avoid contention on the
subsequent access. Thus, one can achieve something on the order of BigO
(log(n) instead of BigO(n**2) memory references worst case. You cannot
get to this point unless the synchronization 'event' returns an
integer number instead of simple win-retry.
>
> The trick is for it be a low cost mechanism (ideally the cost of
> a single cache miss to establish the order) that works within
> the existing cpu hardware, bus and coherency protocol.
In practice it requires a two way transfer through the fabric, but
does not require a DRAM access delay. So the latency is better than a
DRAM access. The entity looks and smells remarkably like a TLB and can
process a stream of requests as fast as the fabric can deliver a
stream of requests (i.e. no back pressure--at least none required).
And the TLB does not have to be "that big" either.
> For that I'm thinking that maybe a global device located
> at some special physical memory location would establish
> the global order at the start of a multiple update sequence.
Yep, programmed up by a standard header making it look like a device
witting anywhere in fabric addressible space.
> Then using Invalidate bus ops to a set of other special
> physical memory locations could communicate that ordering
> to other cpus and they can associate that with the bus id.
Nope, dead wrong, here. You return the order as an integer response to
a message that contains all of the participating addresses. This part
of the process does not use any side-band signaling. After a CPU hase
been granted exclusive access to those cache lines, it, then, is
enabled to NAK requests from other CPUs (or devices) so that it, the
blessed CPU makes forward progress while the unblessed are delayed.
> So in this example the overhead cost would be 3 bus ops to
> Read the global device, an Invalidate indicate my order to
> the peers, and an Invalidate at the end of the sequence.
In my model, there is a message carrying up to eight 64-bit physicall
addreses to the Observer entity, if there are no current grants to any
of the requested cache lines, the transaction as a whole is granted
and a 64-bit response is given as a response to the sending CPU. Most
would call this one fabric transaction. Just like a Read-cache-Line is
one fabric operation.
The CPUs contend for the actual cache lines in the standard maner
(with the exception of the NAK above).
Mitch
In your method, the MU (Multiple Update) participants collect
their physical addresses together into a batch and send them
to an arbiter.
I am exploring a slightly different approach which
dynamically self organizes amongst the processors.
It is based on the idea that only the order number is important
for dispute resolution. The physical addresses are just triggers.
If there is a collision then the lower (earlier) order number
wins and the higher number aborts.
Once the order is established, the lower order will always
eventually win. The loser will retry using its original order
number so it will eventually win too. The addresses don't matter
for dispute resolution.
It requires 2 bus message features: a broadcast of the order
number to all peers at the start of an MU attempt,
and the ability to NAK a ReadToOwn cache line request with
a special error code that triggers an Abort in the requester.
- A processor starting a MU attempt reads a global 64 unsigned
integer counter from a global up-counter device.
It then broadcasts that value to all processors so
all agree on that processor's order.
The broadcast is necessary because there can be multiple buses
and therefore multiple pathways through the interconnect.
Only a central global counter can establish a single order.
- Each processor has a flag bit vector indexed by bus id #.
Each peer processor compares the order number to itself
and sets a flag for that bus id of the broadcaster.
If a peer is not in a MU sequence the the flag = 0.
If a peer is in a MU sequence then this cpu has its own
order number. We compare the broadcast number to ours
and the flag = 1 if broadcast is higher, otherwise = 0.
- Each cpu now has a bit vector, indexed by bus id #,
that tells that processor whether it should respond to
an individual ReadToOwn by sending a line and aborting myself,
or sending a NAK which will trigger an abort in the peer.
- Each MU cpu executes a mu_load instruction.
If successful it records the physical address in a special
MU collision detect register.
If any peer later executes a mu_load to that same address
then there is a collision and we consult the bit vector
to decide how to respond.
- As an optimization, when a cpu wins a contention and
sends a Nak, then it also remembers the bus id of who it Nak'ed.
When it finished its sequence, or if it aborts, it sends
a TryAgain message to the loser who spins on a special
mu_wait instruction.
- The MU instruction sequence is then designed to support the
above handshake sequence.
Just thinking out loud...
Eric
This could also be done without a NAK, though it is
not very elegant: it could do a grab-back.
If a line owner receives a ReadToOwn it consults the bus id bit vector.
If the requester is lower order, this cpu replies as normal and
aborts its own MU sequence.
If the requester is higher order, this cpu replies with the value
but immediately requests it back. That will trigger the same logic
sequence in the requester (because we all agree on the order numbers)
who will reply and abort its MU sequence.
Eric