Integration of the right kind is finally upon us.

Robert Myers

Feb 15, 2013, 8:47:51 PM

Whether the graphics are on-board or not is of no interest to HPC. Whether the fabric is on-board or not is. Possibly this is what the transputer thought it was doing.

http://www.computerworld.com/s/article/9236846/Intel_works_toward_integration_as_it_aims_for_exascale

Unfortunately, Intel is trading for its dividend and appears to be EOL. It's hard to believe that HPC could matter enough to keep its foundries busy.

Robert.

MitchAlsup

Feb 15, 2013, 11:28:21 PM

On Friday, February 15, 2013 7:47:51 PM UTC-6, Robert Myers wrote:
>
I wish them luck; but consider that in less than 5 years, it is expected that 1TFlop will be available inside the power envelope of a cell phone.

Mitch

EricP

Feb 16, 2013, 11:50:08 AM

Robert Myers wrote:
> Whether the graphics are on-board or not is of no interest to HPC. Whether the fabric is on-board or not is. Possibly this is what the transputer thought it was doing.
>
> http://www.computerworld.com/s/article/9236846/Intel_works_toward_integration_as_it_aims_for_exascale

Intel's Infiniband products seem to be called TrueScale not TrueSwitch.
Their web site has only info on PCIx cards though.

It could be for "big data" too.
For distributed processing it would be nice to have device controllers
directly connected to the quick-path (or hypertransport) interconnect
so you don't stall writing to control registers and devices
can do coherent scatter/gather dma at top speed.

With all the spare transistors it is a bit surprising
it took so long to happen.

At Hot Chips 24:

Intel
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-8-DataCenter/HC24.29.827-Xeon-Rowland-Xeon-E5-2600-Disclaimer.pdf

It could be a "me too" to IBM's "big data" systems

Power7
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-8-DataCenter/HC24.29.815-Power7-Taylor-IBM-120828-Final.pdf

At Hot Chips 22:

PowerEN
http://www.hotchips.org/wp-content/uploads/hc_archives/archive22/HC22.23.310-Brown-PowerEN%20Presentation%2028July2010.pdf
Power7
http://www.hotchips.org/wp-content/uploads/hc_archives/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

Eric

Robert Myers

Feb 16, 2013, 12:47:43 PM

On Friday, February 15, 2013 11:28:21 PM UTC-5, MitchAlsup wrote:
> On Friday, February 15, 2013 7:47:51 PM UTC-6, Robert Myers wrote:
>
> >
>
> I wish them luck; but consider that in less than 5 years, it is expected that 1TFlop will be available inside the power envelope of a cell phone.
>
>

Connected to how much memory, and how? Flops are free. Bits per second are still expensive, latency is still critical, and really big memory is still microseconds away.

Robert.

Thomas Womack

Feb 16, 2013, 1:06:22 PM

In article <f5b3b2de-ceb2-43b7...@googlegroups.com>,

Calxeda have a reasonable number of reasonably fast links and a
distinctly non-trivial router on their proposed ARM server chips

http://www.calxeda.com/technology/products/processors/ecx-1000-techspecs/

(indeed, I would be quite amazed if four 1.4GHz Cortex-A9 could come
close to keeping eight 10Gbit channels busy; it's got off-chip network
bandwidth roughly equal to main memory bandwidth! On another hand,
bandwidth substitutes for cleverness quite well, and having hundreds
of the cores each putting their penny worth into the fabric probably
doesn't come out too badly)

I don't know enough about the standardisation of fast links on
motherboards; obviously you don't want to use 10GbaseT PHYs or
Infiniband, I don't know what the lowest-available-power solution for
running 10Gbit/s over a foot of (parallel differential-paired) trace
on FR4 is.

Tom

nm...@cam.ac.uk

Feb 16, 2013, 1:35:41 PM

In article <ae52e959-7043-4e0f...@googlegroups.com>,

If the power is dropped enough, back-to-back CPU and memory becomes
possible, which increases the memory bandwidth by a huge factor,
vastly reduces the cost, and reduces the latency a bit.

Latency remains an issue, but the other aspect that will hit is my
old hobby-horse of consistency - to get 1 TFlop within that power
envelope is probably incompatible with fully consistent memory.

Infiniband is a horror, but is a vastly better way of getting
bandwidth than Ethernet. And my guess is that Intel are planning
to use it to replace some of their current zoo of interconnects.

Regards,
Nick Maclaren.

Robert Myers

Feb 16, 2013, 2:04:09 PM

On Friday, February 15, 2013 8:47:51 PM UTC-5, Robert Myers wrote:
> Whether the graphics are on-board or not is of no interest to HPC. Whether the fabric is on-board or not is. Possibly this is what the transputer thought it was doing.
>

I should acknowledge that the computer architecture that initially set me off, Blue Gene, did, in fact, integrate the fabric with the processor. It just didn't accomplish all that much by doing so (as I measure success, of course). Blue Gene, at least in its initial incarnation, got admirable global latency and power consumption for the time. I've already said enough on the subject of global bandwidth. I just wanted to acknowledge that integrating the fabric with the processor was a part of the Blue Gene architecture from the beginning.

Robert.

Stephen Fuld

Feb 19, 2013, 7:09:04 PM

On 2/16/2013 8:50 AM, EricP wrote:
> Robert Myers wrote:
>> Whether the graphics are on-board or not is of no interest to HPC.
>> Whether the fabric is on-board or not is. Possibly this is what the
>> transputer thought it was doing.
>> http://www.computerworld.com/s/article/9236846/Intel_works_toward_integration_as_it_aims_for_exascale
>>
>
> Intel's Infiniband products seem to be called TrueScale not TrueSwitch.
> Their web site has only info on PCIx cards though.
>
> It could be for "big data" too.

Yes. People who have been here a while might remember that I have been
ranting about this for some time. The server group at Intel originally
planned to have Infiniband integrated into the then prevalent chip set
(i.e. north bridge). This got killed by the desktop group in favor of
serial PCI, which doesn't have nearly the capabilities but is an easier
change to absorb. Perhaps this change is partly due to recognition of
the increase in CPUs going into sites like Google data centers and
relative decrease in those going to desktops.

> For distributed processing it would be nice to have device controllers
> directly connected to the quick-path (or hypertransport) interconnect
> so you don't stall writing to control registers and devices
> can do coherent scatter/gather dma at top speed.

Infiniband is not on the QPI, but there is, in general, no need to write
to control registers nor devices. You create a packet in memory that
describes the operation then add it to the queue. If necessary, one
"poke" to a reserved location tells the built-in hardware to start
processing the packet. Even this could be eliminated at the cost of
adding an instruction to the ISA to accomplish the same thing. It has
built-in scatter/gather at top speed and can even directly operate
into/out of user virtual memory space, with appropriate safeguards.
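The "build a packet in memory, queue it, poke once" flow described above can be sketched in C as follows. This is only an illustration under stated assumptions: the structure names, the ring size, the opcode encoding, and the idea that the doorbell takes the new head index are all hypothetical and are not taken from the Infiniband specification or any Intel design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: none of these names come from the Infiniband spec. */
#define QUEUE_SLOTS 8   /* ring size (assumed) */
#define MAX_SEGS    4   /* scatter/gather entries per descriptor (assumed) */

struct sg_entry { void *addr; uint32_t len; };

struct work_desc {
    uint32_t        opcode;         /* made-up encoding, e.g. 1 = send */
    uint32_t        nsegs;          /* number of valid scatter/gather entries */
    struct sg_entry seg[MAX_SEGS];
};

struct work_queue {
    struct work_desc   slot[QUEUE_SLOTS];
    uint32_t           head;        /* next slot the producer fills */
    uint32_t           tail;        /* next slot the engine consumes */
    volatile uint32_t *doorbell;    /* the single "reserved location" */
};

/* Build the descriptor in ordinary memory, publish it, then poke once.
   Returns 0 on success, -1 if the ring is full. */
int post_work(struct work_queue *q, uint32_t opcode,
              const struct sg_entry *segs, uint32_t nsegs)
{
    assert(nsegs <= MAX_SEGS);
    if (q->head - q->tail == QUEUE_SLOTS)
        return -1;                              /* no free slot */

    struct work_desc *d = &q->slot[q->head % QUEUE_SLOTS];
    d->opcode = opcode;
    d->nsegs  = nsegs;
    for (uint32_t i = 0; i < nsegs; i++)
        d->seg[i] = segs[i];

    q->head++;
    /* One store is the whole "poke". Real hardware would also need a
       write barrier before this store; that is omitted in this sketch. */
    *q->doorbell = q->head;
    return 0;
}
```

Note that everything except the final store touches only ordinary cacheable memory, which is the point being made: the device-facing cost is one write per submission, however many scatter/gather segments the operation carries.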

> With all the spare transistors it is a bit surprising
> it took so long to happen.

Internal politics. :-(

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP

Feb 20, 2013, 4:05:28 PM

Yes but that poke has to be communicated somehow to a device register.
In the past that control register would be on the PCI bus and the
processor would stall while it wrote the poke to the slow device.

Looking at that Hot Chips 24 Intel document shows

E5-2600 <---> Interface <---> Tylersburg <---> ?Infiniband?
        ring             QPI             PCIe

In the E5-2600 processor, the 8 Sandybridge cores talk over
an internal 2 lane (1 clockwise, 1 counter clockwise) ring interconnect.
The E5-2600 connects to QPI,
the Tylersburg I/O caching agent connects to QPI on one side,
and PCIe connections on the other.

I assumed the Infiniband plugs into one of the PCIe connections
and I was thinking that it might get even higher bandwidth
by connecting directly into the QPI.

The recent Register article
http://www.theregister.co.uk/2013/02/19/intel_qlogic_network_update/

refers to QDR-80 and has a little picture but doesn't say
how the QDR-80 connects. Other Intel documents refer to
HCA or Host Channel Adaptors as being PCIe to Infiniband
adaptors so it sounds like QDR-80 is such an HCA.

I also see reference to "Intel Data Direct I/O Technology"
http://www.intel.com/content/www/us/en/io/direct-data-i-o.html

there is a short fluffy video which talks about moving data
directly into the cache but doesn't say exactly how.
But this says a bit more
http://www.intel.com/content/www/xr/en/io/data-direct-i-o-technology-brief.html

Eric

Stephen Fuld

Feb 20, 2013, 6:37:03 PM

On 2/20/2013 1:05 PM, EricP wrote:
> Stephen Fuld wrote:
>> On 2/16/2013 8:50 AM, EricP wrote:
>>> For distributed processing it would be nice to have device controllers
>>> directly connected to the quick-path (or hypertransport) interconnect
>>> so you don't stall writing to control registers and devices
>>> can do coherent scatter/gather dma at top speed.
>>
>> Infiniband is not on the QPI, but there is, in general, no need to
>> write to control registers nor devices. You create a packet in memory
>> that describes the operation then add it to the queue. If necessary,
>> one "poke" to a reserved location tells the built-in hardware to start
>> processing the packet. Even this could be eliminated at the cost of
>> adding an instruction to the ISA to accomplish the same thing. It has
>> built-in scatter/gather at top speed and can even directly operate
>> into/out of user virtual memory space, with appropriate safeguards.

Let me preface this with the statement that I have no inside knowledge
of what Intel is planning.

> Yes but that poke has to be communicated somehow to a device register.

Not exactly. See below.

> In the past that control register would be on the PCI bus and the
> processor would stall while it wrote the poke to the slow device.

Yes. That was the problem with Intel's decision to scrap the direct
interface to Infiniband and replace it with PCI-e.

>
> Looking at that Hot Chips 24 Intel document shows
>
> E5-2600 <---> Interface <---> Tylersburg <---> ?Infiniband?
>         ring             QPI             PCIe
>
> In the E5-2600 processor, the 8 Sandybridge cores talk over
> an internal 2 lane (1 clockwise, 1 counter clockwise) ring interconnect.
> The E5-2600 connects to QPI,
> the Tylersburg I/O caching agent connects to QPI on one side,
> and PCIe connections on the other.

OK.

> I assumed the Infiniband plugs into one of the PCIe connections
> and I was thinking that it might get even higher bandwidth
> by connecting directly into the QPI.

In the direct implementation (at least as I see it), Infiniband
doesn't plug into the PCIe connection, it *replaces* the PCIe
connection. Thus in your diagram above, Infiniband would be on the
other side of the Tylersburg logic from the QPI.

Since Infiniband itself is packet driven, this by itself eliminates the
pokes to the PCI registers. The pokes to the device registers are
eliminated by replacing the "poke a register" interface of older devices
with a "take this packet and do what it says" interface. Think SCSI
versus much older disk interfaces.

In the circa early 1990s plan, where the Infiniband logic was in the
"north bridge", you needed one reserved address that, when poked, told
the Infiniband logic that there was a packet to process. In the new
implementations, I don't know if they will retain this or implement a
new CPU instruction or special register. But there isn't much
contention for this interface as it is only "strobed" once to wake up
the hardware, and in this implementation, it is on the CPU chip itself,
not on the other side of the PCIe bus.

Michael S

Feb 21, 2013, 12:02:10 PM

On Feb 20, 11:05 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> Stephen Fuld wrote:
> > On 2/16/2013 8:50 AM, EricP wrote:
> >> For distributed processing it would be nice to have device controllers
> >> directly connected to the quick-path (or hypertransport) interconnect
> >> so you don't stall writing to control registers and devices
> >> can do coherent scatter/gather dma at top speed.
>
> > Infiniband is not on the QPI, but there is, in general, no need to write
> > to control registers nor devices. You create a packet in memory that
> > describes the operation then add it to the queue. If necessary, one
> > "poke" to a reserved location tells the built-in hardware to start
> > processing the packet. Even this could be eliminated at the cost of
> > adding an instruction to the ISA to accomplish the same thing. It has
> > built-in scatter/gather at top speed and can even directly operate
> > into/out of user virtual memory space, with appropriate safeguards.
>
> Yes but that poke has to be communicated somehow to a device register.
> In the past that control register would be on the PCI bus and the
> processor would stall while it wrote the poke to the slow device.
>
> Looking at that Hot Chips 24 Intel document shows
>
> E5-2600 <---> Interface <---> Tylersburg <---> ?Infiniband?
>         ring             QPI             PCIe
>

This picture does not look logical.
Tylersburg PCIe is for "slow" stuff and, maybe, for hot plug.
For fast "cold plug" devices like an Infiniband adapter you almost
certainly want to use the built-in Xeon-E5 PCIe ports.

> In the E5-2600 processor, the 8 Sandybridge cores talk over
> an internal 2 lane (1 clockwise, 1 counter clockwise) ring interconnect.
> The E5-2600 connects to QPI,
> the Tylersburg I/O caching agent connects to QPI on one side,
> and PCIe connections on the other.
>

> I assumed the Infiniband plugs into one of the PCIe connections
> and I was thinking that it might get even higher bandwidth
> by connecting directly into the QPI.
>

Xeon-E5 has 40 PCIe lanes capable of running at gen 3 speed.
Bandwidth-wise that's approximately twice as much as its two
full-width QPI ports at 8 GT/s.
But in dual-socket mode only one QPI port is available for I/O
devices, so in dual-socket there is a 4x bandwidth difference in
favor of PCIe.

Of course, latency-wise the picture is different, but that does not
change the conclusion: Intel QPI-based I/O is a technological dead
end that may survive for a few more years in the tiny Xeon-E7 niche, but
is irrelevant for mass-market Xeon-E3/Xeon-E5.
Development of IB adapters with a QPI host bus makes little business sense
for Intel and no business sense for 3rd parties.

> The recent Register article http://www.theregister.co.uk/2013/02/19/intel_qlogic_network_update/

>
> refers to QDR-80 and has a little picture but doesn't say
> how the QDR-80 connects. Other Intel documents refer to
> HCA or Host Channel Adaptors as being PCIe to Infiniband
> adaptors so it sounds like QDR-80 is such an HCA.
>
> I also see reference to "Intel Data Direct I/O Technology" http://www.intel.com/content/www/us/en/io/direct-data-i-o.html
>
> there is a short fluffy video which talks about moving data
> directly into the cache but doesn't say exactly how.

> But this says a bit more http://www.intel.com/content/www/xr/en/io/data-direct-i-o-technology-...
>
> Eric

Andy (Super) Glew

Mar 5, 2013, 11:51:06 PM

I'm going to slide past the question of whether a high speed device
should be "on the QPI" to access packets in memory (coherently).

I'd like to discuss the tradeoff between "poking a (memory mapped)
reserved location" and creating a new instruction to do the same thing.
I'll start:

a) poking a reserved location necessarily requires uncached memory
mapped I/O

b) if you want to do this from user mode, in almost every process (i.e.
if you want to support this sort of communication ubiquitously, as
opposed to in certain high priority user domain processes), then you
probably want to "virtualize" the reserved location. E.g. the OS may
need to know enough about it to unmap it from certain processes - if,
e.g. the number of hardware contexts that the actual device can handle is
exceeded. This may not be necessary if you have an IOMMU, and the
device is fully self-virtualizing. E.g. the hardware behind the memory
location may need to understand process contexts.

c) a reserved location probably needs at least a 4KiB page, either all
to itself, or shared with other, similar, MMIO locations.

d) an instruction is - well, an instruction.

e) I don't think that I would want to build a fully hardwired version of
such an instruction. That's excessively CISCy.

f) I would prefer microcode or PALcode.

g) such microcode or PALcode could actually implement the operation
without invoking anything like the Infiniband (R)DMA engine, which
might be appropriate for low end implementations. On a higher end
implementation, switch to using the engine.

--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies, which has been acquired by
Imagination Technologies; in the past of companies such as Intellectual
Ventures and QIPS, Intel, AMD, Motorola, and Gould), I reveal this only
so that the reader may account for any possible bias I may have towards
my employer's products. The statements I make here in no way represent
my employers' positions on the issue, nor am I authorized to speak on
behalf of my employers, past or present.

Stephen Fuld

Mar 6, 2013, 2:55:22 AM

I agree with everything you say above. I would prefer an instruction.
In the original planned implementation, a reserved location was used as
the IB logic was physically in the north bridge chip so I guess a
reserved location seemed the easiest way to do it. But now, an
instruction seems the clear choice.

> e) I don't think that I would want to build a fully hardwired version of
> such an instruction. That's excessively CISCy.
>
> f) I would prefer microcode or PALcode.
>
> g) such microcode or PALcode could actually implement the operation,
> without invoking anything like the Infiniband (R)DMA engine. which
> might be appropriate for low end implementations. on a higher end
> implementation, switch to using the engine.

IIRC, (and it has been a long time), the IB engine operated
asynchronously based on a user supplied queue of commands. If the
hardware finished an operation, it looked to see if there was anything
else on the queue, and if so, executed it. If the queue was empty, the
IB hardware went to sleep. The "poke" was just to wake up the HW if it
had gone to sleep. If it was already processing packets, the poke was
essentially a NOP.
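The sleep/wake behaviour recalled above can be modelled in a few lines of C. This is a toy state machine written from that recollection, not from any hardware documentation; the structure, field, and function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the engine; every name here is hypothetical. */
struct engine {
    int  pending;    /* commands queued but not yet executed */
    int  completed;  /* commands executed so far */
    int  wakeups;    /* pokes that actually woke the engine */
    bool awake;
};

/* Software side: add a command to the queue. */
void enqueue(struct engine *e)
{
    e->pending++;
}

/* Software side: the poke. It only has an effect if the engine is asleep;
   if the engine is already processing, it is effectively a NOP. */
void poke(struct engine *e)
{
    if (!e->awake) {
        e->awake = true;
        e->wakeups++;
    }
}

/* Hardware side: one step. Executes one queued command if there is one,
   otherwise goes to sleep until the next poke. */
void engine_step(struct engine *e)
{
    assert(e->pending >= 0);
    if (!e->awake)
        return;
    if (e->pending == 0) {
        e->awake = false;   /* queue drained: go to sleep */
        return;
    }
    e->pending--;
    e->completed++;
}
```

If the poke really is this simple, three back-to-back submissions cost one wakeup and two no-ops, which is the basis for asking whether it could be hardwired.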

I am certainly not a CPU architect or designer, but if that recollection
was correct, and if the newly planned implementation works the same way,
isn't the "poke" simple enough to be hardwired?

Now I can understand if you wanted to make the instruction actually do
some IB stuff, and perhaps they will implement it that way, then you
probably want some microcode, or at least a programmable sequencer of
some type, but I would guess that today, the amount of die space needed
for the IB engine is small enough that no one would want to do two
designs to save die space on the low end one.

Of course, I am talking out of my hat as I know nothing of Intel's
plans, and precious little about HW design, so please feel free to
correct me.

Ivan Godard

Mar 6, 2013, 4:46:56 AM

One reason for *not* using an instruction is that there would have to be
a mechanism for permission-to-use said instruction, which is not easy to
get right; is redundant with already extant memory-access permissions;
and leads to corner cases where multiple permission systems interact.

That's why the Mill has neither reserved operations nor supervisor mode.
All permissions are in the address space, and everything that must be
restricted is reached by MMIO.

Ivan

Stephen Fuld

Mar 6, 2013, 12:19:28 PM

On 3/6/2013 1:46 AM, Ivan Godard wrote:

snipped discussion about a new instruction versus a reserved location for
Infiniband

> One reason for *not* using an instruction is that there would have to be
> a mechanism for permission-to-use said instruction, which is not easy to
> get right; is redundant with already extant memory-access permissions;
> and leads to corner cases where multiple permission systems interact.
>
> That's why the Mill has neither reserved operations nor supervisor mode.
> All permissions are in the address space, and everything that must be
> restricted is reached by MMIO.

So how do you control who has access to what memory regions, i.e. who
sets the memory permissions? There has to be some mechanism for a
program with high trust to be able to prevent other errant programs
from doing damage.

Andy (Super) Glew

Mar 6, 2013, 12:22:00 PM

So, what are the reasons not to do an instruction?

a) On some machines this may be too complex for hardware - and some
people are allergic to microcode / PALcode. (Apart from complexity, you
might want it in PALcode to allow communication between privilege domains.)

b) On some machines, in some implementations, the "instruction" might be
just a wrapper around a poke to the privileged location.
If so, why bother? (A: because other implementations might microcode)

c) a CPU company may be reluctant to do anything that depends on having
an engine in the system - that might be on a different chip.
You don't want something to be architecture unless you can
guarantee that it is present. Or else, if not present, you have to
invest in another way of doing the same thing.

... any more? ...

Ivan Godard

Mar 6, 2013, 12:35:28 PM

Pretty conventional. Permissions for each hunk of address space are in
OS tables along with other metainfo, cached into a lookaside buffer by
the hardware. At power-on the machine comes up with the initial
execution having all permissions for every byte in the whole address
space (single global address space model, also called SASOS). That
initial execution is by definition the OS, and as it has read-write
access to the tables it can hand rights out as it pleases.

While SASOS is an unusual model, the same
all-permissions-are-address-space model also works fine in a
conventional aliasing-process model. If a driver (say) needs to be able
to shut off interrupts, you map the MMIO that does that into its address
space with write permission. An app without that mapping can't trigger
it because the MMIO isn't in its space, and it can't execute an INTSOFF
instruction because there isn't one.
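A small C model of the "all permissions are in the address space" idea is given below. It is a sketch only, not the Mill's actual tables or lookaside buffer: the grant structure, the linear search, the domain numbering, and the MMIO address for disabling interrupts are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PERM_R = 1, PERM_W = 2 };

/* Hypothetical protection-table entry: one region, one domain, its rights. */
struct grant {
    uint32_t  domain;   /* who holds the grant */
    uintptr_t base;     /* start of the region */
    size_t    len;      /* size of the region in bytes */
    unsigned  perms;    /* PERM_R and/or PERM_W */
};

/* Made-up address of the MMIO word that turns interrupts off. */
#define INTS_OFF_MMIO ((uintptr_t)0xFFFF0000u)

/* Returns true if `domain` holds a grant covering `addr` with all the
   rights in `want`. A real machine would cache this lookup in hardware. */
bool access_ok(const struct grant *tab, size_t n,
               uint32_t domain, uintptr_t addr, unsigned want)
{
    for (size_t i = 0; i < n; i++) {
        const struct grant *g = &tab[i];
        if (g->domain == domain &&
            addr >= g->base && addr - g->base < g->len &&
            (g->perms & want) == want)
            return true;
    }
    return false;
}
```

With a table that gives a driver domain write access to the word at INTS_OFF_MMIO and gives an application only its own data region, access_ok returns true for the driver's write and false for the application's. There is no privileged opcode to guard, so the only check in the system is the address check.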

Ivan

Paul A. Clayton

Mar 6, 2013, 1:54:52 PM

On Mar 6, 12:35 pm, Ivan Godard <i...@ootbcomp.com> wrote:
[snip]

> While SASOS is an unusual model, the same
> all-permissions-are-address-space model also works fine in a
> conventional aliasing-process model. If a driver (say) needs to be able
> to shut off interrupts, you map the MMIO that does that into its address
> space with write permission. An app without that mapping can't trigger
> it because the MMIO isn't in its space, and it can't execute an INTSOFF
> instruction because there isn't one.

Is it planned for the Mill to cache/prevalidate at least
some permissions? I seem to recall you mentioning that
stack accesses do not go through the PLB, so I am
guessing that some addresses might have a "virtual" PLB
entry directly attached with permissions for the current
permission domain. On the other hand, perhaps some
permissions are hardwired and/or are mapped as registers.

The Itanium 9500 (Poulson--previous implementations might
have done this) prevalidates permissions in the L1 TLB so
that the protection keys (page group ids) do not need to
be checked with every access. This has obvious advantages
in terms of reducing energy use and making some logic less
critical in timing. Of course, this also means that
invalidating a protection key or changing the associated
permissions becomes relatively expensive. (I suspect
such a change invalidated the protection-key permission
elements for all TLB entries, so such permission
information would need to be reinstalled.)

(Sadly, the Itanium features that would help with fine-
grained protection seem to be underutilized. Secure64
_might_ be the only ones to use protection keys--assuming
that they do--, and I do not know if any of the HP OSes use
the Enter Privileged Code instruction [fast system calls
speed changes in permission]--I think some Intel-supported
research ported some Linux system calls to that interface,
but, if I recall correctly, such was limited by lack of
access to the shadow registers and register stack issues
[and presumably also the structure of the existing Linux
kernel].)

(Prohibiting writes on blocks of registers might be useful
for sharing registers in a multithreaded design.
Prohibiting reads as well could facilitate variable sized
contexts. Even just prohibiting reads could be useful
for detecting misuse (effectively storing a signaling NaN
without using the full storage space), but that would
presumably want per-register control.)

EricP

Mar 6, 2013, 2:00:52 PM

Andy (Super) Glew wrote:
>
> I'm going to slide past the question of whether a high speed device
> should be "on the QPI" to access packets in memory (coherently).
>
> I'd like to discuss the tradeoff between "poking a (memory mapped)
> reserved location" and creating a new instruction to do the same thing.
> I'll start:
>
> a) poking a reserved location necessarily requires uncached memory
> mapped I/O
>
> b) if you want to do this from user mode, in almost every process (i.e.
> if you want to support this sort of communication ubiquitously, as
> opposed to in certain high priority user domain processes), then you
> probably want to "virtualize" the reserved location. E.g. the OS may
> need to know enough about it to unmap it from certain processes - if,
> e.g. the number of hardware contexts that the actual device can handle is
> exceeded. This may not be necessary if you have an IOMMU, and the
> device is fully self-virtualizing. E.g. the hardware behind the memory
> location may need to understand process contexts.

I presume there is an Infiniband init call at the start which
would create a memory section mapping the device register(s).

It would be handled just like an X-windows manager process
accessing the graphics hardware from user mode.
The OS must have a system service to, for suitably privileged processes,
map on request a physical address range to a user space virtual address
range, including possibly using cpu dependent cache controls.

Such a device memory section is managed differently from a
normal user mode memory section. In particular the physical
address is not main memory and must not be returned
to the free memory list on section destroy.
Other options can control whether the page table pages are all
allocated and pinned on section create, or are allocated on demand.

> c) a reserved location probably needs at least a 4KiB page, either all
> to itself, or shared with other, similar, MMIO locations.
>
> d) an instruction is - well, an instruction.
>
> e) I don't think that I would want to build a fully hardwired version of
> such an instruction. That's excessively CISCy.
>
> f) I would prefer microcode or PALcode.
>
> g) such microcode or PALcode could actually implement the operation,
> without invoking anything like the Infiniband (R)DMA engine. which
> might be appropriate for low end implementations. on a higher end
> implementation, switch to using the engine.

I don't see the need for an instruction.
The interface design does need to work reliably when writing
multiple items to the control register in the presence of
interrupts without requiring locks.

It is easy to handle by making all such control registers readable
and writable, and having the 'poke' routine save prior values
before writing its new values.

So say there are 3 value registers RegA, RegB, RegC
and a command register RegCmd.
The reentrancy safe way to send a command is to save
the old values, deposit your new values, poke the command,
then restore the original values.

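/* Assumed to be memory-mapped device registers. */
extern volatile int RegA, RegB, RegC, RegCmd;

void Poke (int cmd, int a, int b, int c)
{
    int oldA = RegA, oldB = RegB, oldC = RegC;  // save current state
    RegA = a; RegB = b; RegC = c;               // deposit new arguments
    RegCmd = cmd;                               // poke the command
    RegA = oldA; RegB = oldB; RegC = oldC;      // restore state
}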

Otherwise you need a guard lock.
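The save/restore argument can be checked with a small simulation. The sketch below is not real device code: plain variables stand in for the registers, a write to the command register is modelled as a function that logs what the device would latch, and a hypothetical test hook fires a nested poke at the worst moment, after the outer caller has written its arguments but before it writes the command.

```c
#include <assert.h>

/* Simulated device: plain variables stand in for memory-mapped registers. */
static int RegA, RegB, RegC;
static int log_cmd[4], log_a[4], log_b[4], log_c[4], log_n;

/* Models a write to RegCmd: the device latches the current arguments. */
static void write_cmd(int cmd)
{
    assert(log_n < 4);
    log_cmd[log_n] = cmd;  log_a[log_n] = RegA;
    log_b[log_n]   = RegB; log_c[log_n] = RegC;
    log_n++;
}

/* Hypothetical test hook: if set, runs once in the middle of a poke,
   the way an interrupt handler could. */
static void (*interrupt_hook)(void);

void poke(int cmd, int a, int b, int c)
{
    int oldA = RegA, oldB = RegB, oldC = RegC;  /* save current state */
    RegA = a; RegB = b; RegC = c;               /* deposit new arguments */
    if (interrupt_hook) {                       /* simulated interrupt */
        void (*h)(void) = interrupt_hook;
        interrupt_hook = 0;                     /* fire only once */
        h();
    }
    write_cmd(cmd);                             /* poke the command */
    RegA = oldA; RegB = oldB; RegC = oldC;      /* restore state */
}

/* The "interrupt handler" issues its own command with its own arguments. */
static void handler(void) { poke(99, 7, 8, 9); }

void run_nested_demo(void)
{
    log_n = 0;
    interrupt_hook = handler;
    poke(1, 10, 20, 30);
}
```

After run_nested_demo(), the log holds the handler's command with arguments 7, 8, 9 followed by the outer command with 10, 20, 30, because the handler restored the outer caller's values before returning. This covers only nested interruption on a single processor; two processors writing the same registers concurrently would still need the guard lock.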

Eric

Ivan Godard

Mar 6, 2013, 2:14:10 PM

On 3/6/2013 10:54 AM, Paul A. Clayton wrote:
> On Mar 6, 12:35 pm, Ivan Godard <i...@ootbcomp.com> wrote:
> [snip]
>> While SASOS is an unusual model, the same
>> all-permissions-are-address-space model also works fine in a
>> conventional aliasing-process model. If a driver (say) needs to be able
>> to shut off interrupts, you map the MMIO that does that into its address
>> space with write permission. An app without that mapping can't trigger
>> it because the MMIO isn't in its space, and it can't execute an INTSOFF
>> instruction because there isn't one.
>
> Is it planned for the Mill to cache/prevalidate at least
> some permissions? I seem to recall you mentioning that
> stack accesses do not go through the PLB, so I am
> guessing that some addresses might have a "virtual" PLB
> entry directly attached with permissions for the current
> permission domain. On the other hand, perhaps some
> permissions are hardwired and/or are mapped as registers.

Some of the most heavily used permission regions have SPRs as an
optimization of the general PLB. The semantics is the same, but the
power is less. These SPRs must be domain-swapped while the PLB is LRU,
but that overhead is minor.

Ivan

EricP

Mar 6, 2013, 2:23:33 PM

Stephen Fuld wrote:
>
> I agree with everything you say above. I would prefer an instruction.
> In the original planned implementation, a reserved location was used as
> the IB logic was physically in the north bridge chip so I guess a
> reserved location seemed the easiest way to do it. But now, an
> instruction seems the clear choice.

As I point out in another reply, I don't see the need
so far for an instruction provided the Infiniband controller
designer considers reentrancy issues.

>> e) I don't think that I would want to build a fully hardwired version of
>> such an instruction. That's excessively CISCy.
>>
>> f) I would prefer microcode or PALcode.
>>
>> g) such microcode or PALcode could actually implement the operation,
>> without invoking anything like the Infiniband (R)DMA engine. which
>> might be appropriate for low end implementations. on a higher end
>> implementation, switch to using the engine.
>
> IIRC, (and it has been a long time), the IB engine operated
> asynchronously based on a user supplied queue of commands. If the
> hardware finished an operation, it looked to see if there was anything
> else on the queue, and if so, executed it. If the queue was empty, the
> IB hardware went to sleep. The "poke" was just to wake up the HW if it
> had gone to sleep. If it was already processing packets, the poke was
> essentially a NOP.

Having the IB controller manage its own queue is the simplest
as it can avoid race conditions. The poke sends the command
and its args to the controller. If the controller is busy it
tucks them aside in its own memory.

I'd also want a user mode software defined command id tag with that
so one can cancel specific outstanding commands at a later time.

Also need to consider some mechanism to sync the pending queue
if a thread terminates, to make sure it has no outstanding commands.
If a command got hung somehow, you wouldn't want that to
prevent thread termination because there were outstanding commands.
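One way the tag and thread-exit ideas could fit together is sketched below in C. It is an assumption-laden illustration: the fixed-size table, the owner field, and the function names are invented here, and a real controller would also have to deal with commands that are already in flight rather than merely queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 16   /* assumed table size */

/* Hypothetical record of one outstanding command. */
struct pending {
    bool     in_use;
    uint32_t owner;   /* thread that issued the command */
    uint32_t tag;     /* software-defined command id */
};

static struct pending table[MAX_PENDING];

/* Record an outstanding command; returns false if the table is full. */
bool track(uint32_t owner, uint32_t tag)
{
    for (int i = 0; i < MAX_PENDING; i++)
        if (!table[i].in_use) {
            table[i] = (struct pending){ true, owner, tag };
            return true;
        }
    return false;
}

/* Cancel one command by tag; returns true if it was still outstanding. */
bool cancel(uint32_t owner, uint32_t tag)
{
    for (int i = 0; i < MAX_PENDING; i++)
        if (table[i].in_use && table[i].owner == owner && table[i].tag == tag) {
            table[i].in_use = false;
            return true;
        }
    return false;
}

/* On thread termination: drop everything the thread still has queued,
   so a hung command cannot hold up the exit. Returns the number dropped. */
int cancel_all(uint32_t owner)
{
    int n = 0;
    for (int i = 0; i < MAX_PENDING; i++)
        if (table[i].in_use && table[i].owner == owner) {
            table[i].in_use = false;
            n++;
        }
    assert(n <= MAX_PENDING);
    return n;
}
```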

Eric

Stephen Fuld

Mar 6, 2013, 3:00:43 PM

On 3/6/2013 9:22 AM, Andy (Super) Glew wrote:
> On 3/5/2013 11:55 PM, Stephen Fuld wrote:

snip

>> I agree with everything you say above. I would prefer an instruction.
>> In the original planned implementation, a reserved location was used as
>> the IB logic was physically in the north bridge chip so I guess a
>> reserved location seemed the easiest way to do it. But now, an
>> instruction seems the clear choice.
>
> So, what are the reasons not to do an instruction?
>
> a) On some machines this may be too complex for hardware - and some
> people are allergic to microcode / PALcode. (Apart from complexity, you
> might want it in PALcode to allow communication between privilege domains.)

If the instruction directly performs the IB functions, then absolutely.
But in this case, how does poking a memory location differ from an
instruction that essentially does the same thing? If it is too complex
to implement in hardware, but you don't want microcode/PALcode, then I
suppose you could have the IB function on a separate memory mapped card
and the poke is more straightforward, but in that case see my comments
on your b) below.

> b) On some machines, in some implementations, the "instruction" might be
> just a wrapper around a poke to the privileged location.
> If so, why bother? (A: because other implementations might microcode)

Yes, and to eliminate the mess you described previously with
virtualizing, protecting, etc. the memory location.

> c) a CPU company may be reluctant to do anything that depends on having
> an engine in the system - that might be on a different chip.

Agreed, but I think this is wrong headed. For a "CPU company" to ignore
I/O and punt it to memory mapped interfaces is, IMNSHO, a dereliction of
duty. As for having the IB hardware on a separate chip, I agree this
might be appropriate, but it requires only a single pin to poke that
chip. ISTM a small price to pay for eliminating the memory mapping
problems, as well as the throughput implications of memory mapping.

> You don't want something to be architecture unless you can
> guarantee that it is present. Or else, if not present, you have to
> invest in another way of doing the same thing.

Agreed. But I think that I/O should rightly be part of the
architecture. I know this might be controversial, and does put more
burden on a CPU company.

>
> ... any more? ...
>

Well, there is the argument that it does take another op-code.

Stephen Fuld

Mar 6, 2013, 3:59:11 PM3/6/13

to

On 3/6/2013 11:00 AM, EricP wrote:

snip

> I presume there is an Infiniband init call at the start which
> would create a memory section mapping the device register(s).

NO! This is a total misconception of how IB works. There are no device
registers to be mapped. IB is a packet oriented interface; you send and
receive packets from peripheral devices (or other computers). You have
to think of the peripheral at a higher level.

Stephen Fuld

Mar 6, 2013, 4:15:00 PM3/6/13

to

On 3/6/2013 11:23 AM, EricP wrote:
> Stephen Fuld wrote:
>>
>> I agree with everything you say above. I would prefer an instruction.
>> In the original planned implementation, a reserved location was used
>> as the IB logic was physically in the north bridge chip so I guess a
>> reserved location seemed the easiest way to do it. But now, an
>> instruction seems the clear choice.
>
> As I point out in another reply, I don't see the need
> so far for an instruction provided the Infiniband controller
> designer considers reentrancy issues.

There is no *need* for a new instruction. It is an alternative to a
reserved memory location. I prefer the instruction but YMMV.

>>> e) I don't think that I would want to build a fully hardwired version of
>>> such an instruction. That's excessively CISCy.
>>>
>>> f) I would prefer microcode or PALcode.
>>>
>>> g) such microcode or PALcode could actually implement the operation,
>>> without invoking anything like the Infiniband (R)DMA engine. which
>>> might be appropriate for low end implementations. on a higher end
>>> implementation, switch to using the engine.
>>
>> IIRC, (and it has been a long time), the IB engine operated
>> asynchronously based on a user supplied queue of commands. If the
>> hardware finished an operation, it looked to see if there was anything
>> else on the queue, and if so, executed it. If the queue was empty,
>> the IB hardware went to sleep. The "poke" was just to wake up the HW
>> if it had gone to sleep. If it was already processing packets, the
>> poke was essentially a NOP.
>
> Having the IB controller manage its own queue is the simplest
> as it can avoid race conditions. The poke sends the command
> and its args to the controller. If the controller is busy it
> tucks them aside in its own memory.

Then you have the issue of how much memory to provide within the
controller, and what to do if you run out.

> I'd also want a user mode software defined command id tag with that
> so one can cancel specific outstanding commands at a later time.

That's standard in IB.

> Also need to consider some mechanism to sync the pending queue
> if a thread terminates, to make sure it has no outstanding commands.
> If a command got hung somehow, you wouldn't want that to
> prevent thread termination because there were outstanding commands.

Agreed, and all provided for. And there is even a straightforward way
to prevent a transfer request that has already gone to the device from
corrupting memory that may be reallocated after the task terminates.
The standard is pretty robust.

nm...@cam.ac.uk

Mar 6, 2013, 4:16:57 PM3/6/13

to

In article <kh8ak0$9qa$1...@dont-email.me>,

Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>On 3/6/2013 11:00 AM, EricP wrote:
>
>> I presume there is an Infiniband init call at the start which
>> would create a memory section mapping the device register(s).
>
>NO! This is a total misconception of how IB works. There are no device
>registers to be mapped. IB is a packet oriented interface; you send and
>receive packets from peripheral devices (or other computers). You have
>to think of the peripheral at a higher level.

However, Infiniband's RDMA support does cache a copy of the page
tables in the 'device', which leads to some hairy issues with
recovery from device failure. It may be why OpenIB still doesn't
support it, as far as I know.

Regards,
Nick Maclaren.

Ivan Godard

Mar 6, 2013, 4:24:55 PM3/6/13

to

But you have to have all that anyway for plain DRAM, so it's free for
I/O use too.

Ivan

Stephen Fuld

Mar 6, 2013, 4:30:15 PM3/6/13

to

On 3/6/2013 1:24 PM, Ivan Godard wrote:
> On 3/6/2013 12:00 PM, Stephen Fuld wrote:

snip

>> Yes, and to eliminate the mess you described previously with
>> virtualizing, protecting, etc. the memory location.
>
> But you have to have all that anyway for plain DRAM, so it's free for
> I/O use too.

I just refer you to Andy's earlier posts about the issues with memory
mapped I/O.

Stephen Fuld

Mar 6, 2013, 4:33:16 PM3/6/13

to

On 3/6/2013 1:16 PM, nm...@cam.ac.uk wrote:
> In article <kh8ak0$9qa$1...@dont-email.me>,
> Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>> On 3/6/2013 11:00 AM, EricP wrote:
>>
>>> I presume there is an Infiniband init call at the start which
>>> would create a memory section mapping the device register(s).
>>
>> NO! This is a total misconception of how IB works. There are no device
>> registers to be mapped. IB is a packet oriented interface; you send and
>> receive packets from peripheral devices (or other computers). You have
>> to think of the peripheral at a higher level.
>
> However, Infiniband's RDMA support does cache a copy of the page
> tables in the 'device', which leads to some hairy issues with
> recovery from device failure.

Another reason to put the IB engine in the CPU chip. No need for
another copy of the page tables.

> It may be why OpenIB still doesn't
> support it, as far as I know.

I have been out of it for so long that I have no idea of what current
implementations support. You may be right.

EricP

Mar 6, 2013, 4:35:36 PM3/6/13

to

Stephen Fuld wrote:
> On 3/6/2013 11:00 AM, EricP wrote:
>
> snip
>
>
>> I presume there is an Infiniband init call at the start which
>> would create a memory section mapping the device register(s).
>
>
> NO! This is a total misconception of how IB works. There are no device
> registers to be mapped. IB is a packet oriented interface; you send and
> receive packets from peripheral devices (or other computers). You have
> to think of the peripheral at a higher level.
>
>
>

Uhm... there is an IB device called a Host Channel Adaptor.
It has control registers located at some physical address on the computer.
For performance reasons they want to control that from user mode.

That the IB sends packets from one place to another is irrelevant
WRT how it is controlled.

Eric

Stephen Fuld

Mar 6, 2013, 6:47:30 PM3/6/13

to

On 3/6/2013 1:35 PM, EricP wrote:
> Stephen Fuld wrote:
>> On 3/6/2013 11:00 AM, EricP wrote:
>>
>> snip
>>
>>
>>> I presume there is an Infiniband init call at the start which
>>> would create a memory section mapping the device register(s).
>>
>>
>> NO! This is a total misconception of how IB works. There are no
>> device registers to be mapped. IB is a packet oriented interface; you
>> send and receive packets from peripheral devices (or other
>> computers). You have to think of the peripheral at a higher level.
>>
>>
>>
>
> Uhm... there is an IB device called a Host Channel Adaptor.
> It has control registers located at some physical address on the computer.
> For performance reasons they want to control that from user mode.

Yes, that is the current implementation, based on attaching to the PCI
bus. The original plan, and what I presume they are talking about now,
"integrates" what was the HCA into the CPU chip, replacing the PCI-e
interface. In that context, they can do away with the "control registers
at some location" paradigm, which should, if they do things correctly,
improve performance.

If you want to see the kind of interface I am talking about, think IBM
channels. The analogy is far from exact, but the software executes an
instruction that sends the address of a list of channel commands (a
Channel Program), to the channel, which then executes the commands
without needing further CPU involvement (i.e. instruction processing) until the
channel signals done with an interrupt. There are no control registers
at physical locations. Or, to put it another way, nothing is memory mapped.
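
To make the analogy concrete, here is a toy model in C, purely illustrative. The ccw layout, the channel struct and start_channel() are invented for this sketch and are not the real IBM channel command word format. The "channel" also runs synchronously inside the call, whereas a real one runs alongside the CPU and interrupts when done:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MEDIA_SIZE 64   /* hypothetical size of the pretend device */

typedef enum { CCW_WRITE, CCW_READ } ccw_op;

/* One channel command: operation, device offset, buffer, chain flag. */
typedef struct {
    ccw_op  op;
    size_t  offset;
    void   *buf;
    size_t  len;
    int     chain;      /* nonzero: another command follows this one */
} ccw;

typedef struct {
    unsigned char media[MEDIA_SIZE];  /* stands in for the peripheral */
    int           done_irq;           /* set when the channel "interrupts" */
} channel;

/* Models the single start instruction: the CPU passes only the address
   of the channel program. Returns the number of commands executed,
   or -1 if a command would run off the end of the device. */
int start_channel(channel *ch, const ccw *prog)
{
    assert(ch != NULL && prog != NULL);
    int executed = 0;
    for (const ccw *c = prog; ; c++) {
        if (c->offset > MEDIA_SIZE || c->len > MEDIA_SIZE - c->offset)
            return -1;
        if (c->op == CCW_WRITE)
            memcpy(ch->media + c->offset, c->buf, c->len);
        else
            memcpy(c->buf, ch->media + c->offset, c->len);
        executed++;
        if (!c->chain)
            break;
    }
    ch->done_irq = 1;   /* completion is signalled, not polled from a register */
    return executed;
}
```

The point of the model is that the caller never touches a device register: everything the channel needs is in the command list in ordinary memory.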

EricP

Mar 6, 2013, 9:59:37 PM3/6/13

to

EricP wrote:
>
> So say there are 3 value registers RegA, RegB, RegC
> and a command register RegCmd.
> The reentrancy safe way to send a command is to save
> the old values, deposit your new values, poke the command,
> then restore the original values.
>
> void Poke (int cmd, int a, int b, int c)
> {
> int oldA, oldB, oldC;
> oldA = RegA; oldB = RegB; oldC = RegC; // save current state
> RegA = a; RegB = b; RegC = c;
> RegCmd = cmd;
> RegA = oldA; RegB = oldB; RegC = oldC; // restore state
> }

oops... posted that while half asleep.
That only works for nested priority interrupts, not thread switches.
If multiple priority interrupt levels have to write multiple values
to the same controller, the save/restore avoids having to disable
and re-enable interrupts.

Eric

MitchAlsup

Mar 7, 2013, 12:07:03 AM3/7/13

to an...@spam.comp-arch.net

On Wednesday, March 6, 2013 11:22:00 AM UTC-6, Andy (Super) Glew wrote:
> any more?

Poke interfaces that require several pokes in a row to initiate something.

Now imagine an interrupt between one Poke and the next Poke!

Terje Mathisen

Mar 7, 2013, 1:07:19 AM3/7/13

to

EricP wrote:
> It is easy to handle by making all such control registers readable
> and writable, and having the 'poke' routine save prior values
> before writing its new values.
>
> So say there are 3 value registers RegA, RegB, RegC
> and a command register RegCmd.
> The reentrancy safe way to send a command is to save
> the old values, deposit your new values, poke the command,
> then restore the original values.
>
> void Poke (int cmd, int a, int b, int c)
> {
> int oldA, oldB, oldC;
> oldA = RegA; oldB = RegB; oldC = RegC; // save current state
> RegA = a; RegB = b; RegC = c;
> RegCmd = cmd;
> RegA = oldA; RegB = oldB; RegC = oldC; // restore state
> }
>
> Otherwise you need a guard lock.

What about any kind of multi-threading?

The block above is _obviously_ unsafe if you have more than a single
thread ever accessing the same (possibly remapped) registers.

You need something like a "write once" memory region, private to each
thread, plus a single command address where you write the start command
(possibly just the address of the command block?)

I.e. you need an atomic write to start the process!
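
A minimal sketch of that scheme using C11 atomics, assuming the command address can be modelled as one atomic pointer. The names device, doorbell, poke() and device_fetch() are hypothetical, and an ordinary variable stands in for the hardware. Each thread fills in its own private command block, and the only shared write is the single store that publishes the block's address:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t cmd;
    uint32_t a, b, c;
} cmd_block;

/* Stand-in for the device's single command address. */
typedef struct {
    _Atomic(cmd_block *) doorbell;
} device;

/* Fill in a thread-private block, then start the command with one
   atomic store. The release ordering makes the block contents visible
   before the pointer is. */
void poke(device *dev, cmd_block *priv,
          uint32_t cmd, uint32_t a, uint32_t b, uint32_t c)
{
    assert(dev != NULL && priv != NULL);
    *priv = (cmd_block){ .cmd = cmd, .a = a, .b = b, .c = c };
    atomic_store_explicit(&dev->doorbell, priv, memory_order_release);
}

/* Device side: pick up the published block, or NULL if there is none. */
cmd_block *device_fetch(device *dev)
{
    assert(dev != NULL);
    return atomic_exchange_explicit(&dev->doorbell, NULL, memory_order_acquire);
}
```

There is nothing to save or restore here, so an interrupt or thread switch between filling the block and the store is harmless. A second poke that arrives before the device has fetched the first would overwrite it in this model, so real hardware would need a queue behind the address.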

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

Mar 7, 2013, 11:26:43 AM3/7/13

to

Terje Mathisen wrote:
> EricP wrote:
>> It is easy to handle by making all such control registers readable
>> and writable, and having the 'poke' routine save prior values
>> before writing its new values.
>>
>> So say there are 3 value registers RegA, RegB, RegC
>> and a command register RegCmd.
>> The reentrancy safe way to send a command is to save
>> the old values, deposit your new values, poke the command,
>> then restore the original values.
>>
>> void Poke (int cmd, int a, int b, int c)
>> {
>> int oldA, oldB, oldC;
>> oldA = RegA; oldB = RegB; oldC = RegC; // save current state
>> RegA = a; RegB = b; RegC = c;
>> RegCmd = cmd;
>> RegA = oldA; RegB = oldB; RegC = oldC; // restore state
>> }
>>
>> Otherwise you need a guard lock.
>
> What about any kind of multi-threading?
>
> The block above is _obviously_ unsafe if you have more than a single
> thread ever accessing the same (possibly remapped) registers.

Yes, it was obviously wrong.
The hazards of my posting before my brain was fully booted.

> You need something like a "write once" memory region, private to each
> thread, plus a single command address where you write the start command
> (possibly just the address of the command block?)
>
> I.e. you need an atomic write to start the process!

Yes. Poke it with a command packet pointer, then the device loads the packet.

I was also considering what to do if the pokes come in too fast.
Say 128 cores all start poking as fast as possible.
My concern is that stalling during a write operation is
not good as it could potentially hang the bus.

This happened on a PC I had years ago with a misbehaving
video device driver for a PCI video card. The driver was
not checking if the card's command register was empty before
trying to write a new command. If the register was full
then the card used the NotReady signal to stall the sender.
That had the effect of hanging the system bus until
the current command finished, and glitched the whole system.

So maybe rather than a write of the packet pointer,
an exchange instruction could write the pointer and
fetch the status. If the status was full, retry.
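
One way to sketch this in C11, as a model under stated assumptions rather than real hardware: the command register is a single atomic pointer where NULL means empty. A compare-and-swap stands in for the exchange instruction described above, since it reports "written" or "full" in one step without ever stalling the writer. mailbox, try_poke(), poke_retry() and device_take() are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int cmd; } packet;

typedef struct {
    _Atomic(packet *) slot;   /* NULL means the command register is empty */
} mailbox;

/* One attempt: install the packet pointer only if the slot is empty.
   Returns false when the status is "full". */
bool try_poke(mailbox *mb, packet *p)
{
    assert(mb != NULL && p != NULL);
    packet *expected = NULL;
    return atomic_compare_exchange_strong(&mb->slot, &expected, p);
}

/* Bounded retry loop. Returns the number of attempts used,
   or 0 if the slot was still full after max_tries. */
unsigned poke_retry(mailbox *mb, packet *p, unsigned max_tries)
{
    for (unsigned n = 1; n <= max_tries; n++) {
        if (try_poke(mb, p))
            return n;
    }
    return 0;
}

/* Device side: take the packet and mark the slot empty again. */
packet *device_take(mailbox *mb)
{
    assert(mb != NULL);
    return atomic_exchange(&mb->slot, NULL);
}
```

With 128 cores poking at once, the losers spin in their own retry loops instead of holding the bus in a stalled write. Whether that is actually kinder to the interconnect depends on how the exchange is implemented, which this sketch does not model.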

Eric

timca...@aol.com

Mar 7, 2013, 1:45:07 PM3/7/13

to SF...@alumni.cmu.edu.invalid

I know nothing about IB, but Intel did release an I2O specification years ago,
which put a nice FIFO port on the I/O card. Almost nobody used the I2O spec
exactly, but lots of companies used the FIFO hardware. The basic approach
was: build a command block in memory, then write the address of the command
block to the I/O device's FIFO. When the device finished, it indicated
completion (in various ways) and wrote something to the I/O device->host FIFO,
which caused an interrupt. The CPU reads the FIFO (which flushes out any
pending writes from the I/O device) and processes accordingly.

As I said, the details varied, but those FIFOs were pretty useful. And there was no problem with multiple CPUs doing I/O.
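
A rough sketch of that FIFO pair, with ordinary C structs in place of the card. The names fifo, io_card and card_service(), and the depth of 4, are made up for illustration and are not taken from the I2O spec. The completion interrupt is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4   /* hypothetical hardware FIFO depth */

typedef struct {
    uintptr_t entry[FIFO_DEPTH];
    size_t    head, count;
} fifo;

/* Returns false if the FIFO is full. */
bool fifo_push(fifo *f, uintptr_t v)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return false;
    f->entry[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

/* Returns false if the FIFO is empty. */
bool fifo_pop(fifo *f, uintptr_t *v)
{
    assert(f != NULL && v != NULL);
    if (f->count == 0)
        return false;
    *v = f->entry[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

typedef struct { int opcode; int status; } cmd_block;

typedef struct {
    fifo inbound;    /* host -> card: addresses of command blocks */
    fifo outbound;   /* card -> host: addresses of completed blocks */
} io_card;

/* Card side: take one command address, "execute" it, and post the
   same address to the outbound FIFO. Returns false if idle. */
bool card_service(io_card *card)
{
    assert(card != NULL);
    uintptr_t addr;
    if (!fifo_pop(&card->inbound, &addr))
        return false;
    cmd_block *cb = (cmd_block *)addr;
    cb->status = 1;                      /* pretend the command succeeded */
    return fifo_push(&card->outbound, addr);
}
```

On real hardware each push is a single write to the FIFO port, which is presumably why multiple CPUs could share it without a lock. The software model above is single-threaded and makes no such guarantee.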

ChrisQ

Mar 8, 2013, 11:09:31 AM3/8/13

to

On 03/07/13 16:26, EricP wrote:

>
> Yes. Poke it with a command packet pointer, then the device loads the
> packet.
>
> I was also considering what to do if the pokes come in too fast.
> Say 128 cores all start poking as fast as possible.
> My concern is that stalling during a write operation is
> not good as it could potentially hang the bus.
>
> This happened on a PC I had years ago with a misbehaving
> video device driver for a PCI video card. The driver was
> not checking if the card's command register was empty before
> trying to write a new command. If the register was full
> then it used the NotReady signal to stall the sender.
> That had the effect of hanging the system bus until
> the current command finished, and glitched the whole system.
>
> So maybe rather than a write of the packet pointer,
> an exchange instruction could write the pointer and
> fetch the status. If the status was full, retry.
>
> Eric
>

The neatest way to get round concurrency problems is to have a
queue on the i/o / channel device. Each thread talking to the
channel builds a 1-n list of command blocks in memory, then
queues a pointer to the first entry for each thread. The i/o
device transparently dma's the command blocks and associated data
to i/o and signals the thread at transfer end, via messaging or
interrupts.

None of this is new, eg: PDP-11 and VAX disk and tape controllers used
this sort of scheme decades ago...

Regards,

Chris

Andy (Super) Glew

Mar 10, 2013, 12:22:53 PM3/10/13

to

On 3/6/2013 11:00 AM, EricP wrote:

> Andy (Super) Glew wrote:
>>
>> I'm going to slide past the question of whether a high speed device
>> should be "on the QPI" to access packets in memory (coherently).
>>
>> I'd like to discuss the tradeoff between "poking a (memory mapped)
>> reserved location" and creating a new instruction to do the same thing.
>> I'll start:

...

>> d) an instruction is - well, an instruction.
>>
>> e) I don't think that I would want to build a fully hardwired version
>> of such an instruction. That's excessively CISCy.
>>
>> f) I would prefer microcode or PALcode.
>>
>> g) such microcode or PALcode could actually implement the operation,
>> without invoking anything like the Infiniband (R)DMA engine. which
>> might be appropriate for low end implementations. on a higher end
>> implementation, switch to using the engine.
>
> I don't see the need for an instruction.
> The interface design does need to work reliably when writing
> multiple items to the control register in the presence of
> interrupts without requiring locks.

I may not have been clear:

One of my reasons for considering an instruction is if you want this
facility - basically, messaging - to be available, EVEN THOUGH THERE IS
NO (R)DMA ENGINE.

Basically, if there is an instruction,

a) on a really low end machine that has no (R)DMA engine (and all of the
associated hardware, like an IOMMU that the RDMA engine can use), have
microcode do the copy, possibly between address spaces, that implements
the functionality.

b) on a higher end machine, have the RDMA engine.

I.e. an instruction is an API. An interface. An interface owned by the
processor vendor.

By the way, on some flavors of really high end machine I suspect that
you might revert to microcode again. E.g. on flavors of the Intel i960,
the DMA engine was really microcode running in a separate hardware thread.
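
A sketch of the "instruction as an API" argument, with a C function standing in for the instruction. Everything here is hypothetical: msg_send() plays the role of the proposed instruction, machine describes one implementation, and fake_engine() is a placeholder for an (R)DMA engine. The caller sees the same interface whether the copy is done inline (the microcode case) or handed to the engine:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Signature of a hypothetical (R)DMA engine entry point. */
typedef bool (*engine_fn)(void *dst, const void *src, size_t len);

typedef struct {
    engine_fn engine;       /* NULL on a low end machine with no engine */
    unsigned  engine_uses;  /* counters, only so the paths can be observed */
    unsigned  copy_uses;
} machine;

/* Stands in for the proposed instruction: one interface, and the
   implementation decides how the bytes actually move. */
bool msg_send(machine *m, void *dst, const void *src, size_t len)
{
    assert(m != NULL && dst != NULL && src != NULL);
    if (m->engine != NULL) {
        m->engine_uses++;
        return m->engine(dst, src, len);   /* higher end: use the engine */
    }
    memcpy(dst, src, len);                 /* low end: "microcode" does the copy */
    m->copy_uses++;
    return true;
}

/* Placeholder engine for the demonstration; a real one would be
   asynchronous and would go through an IOMMU. */
bool fake_engine(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return true;
}
```

Software written against msg_send() runs unchanged on both machines, which is the sense in which the interface is owned by the processor vendor rather than by whoever built the engine.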

ChrisQ

Mar 10, 2013, 2:55:00 PM3/10/13

to

On 03/06/13 07:55, Stephen Fuld wrote:

>
> IIRC, (and it has been a long time), the IB engine operated
> asynchronously based on a user supplied queue of commands. If the
> hardware finished an operation, it looked to see if there was anything
> else on the queue, and if so, executed it. If the queue was empty, the
> IB hardware went to sleep. The "poke" was just to wake up the HW if it
> had gone to sleep. If it was already processing packets, the poke was
> essentially a NOP.
>

Am I missing something here? Seems to me that to keep it really simple, the
underlying queue hardware should wake up when the first data is stored to it,
then go back to sleep when the queue is empty. Seems a bit laboured to require
a poke, or whatever, to do this...

Regards,

Chris

Stephen Fuld

Mar 10, 2013, 3:59:51 PM3/10/13

to

The queue is in main memory. The poke is needed to tell the IB hardware
that the "first data" is ready. As you said, the IB hardware then steps
down the queue, only going to sleep when the queue is empty. But after
it goes to sleep, it needs something to tell it that there is a new
entry. The Poke does that.
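
The wake-up protocol as described can be modelled in a few lines of C. This is a single-threaded sketch under assumptions of my own: ib_engine, enqueue(), poke() and engine_run() are hypothetical names, the queue is a small ring in ordinary memory, and the race between the engine deciding to sleep and a new entry arriving is ignored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 8   /* hypothetical queue size */

typedef struct {
    int      work[QUEUE_LEN];  /* the command queue, in main memory */
    size_t   head, tail;       /* consumer and producer indices */
    bool     asleep;           /* engine state */
    unsigned wakeups;          /* pokes that actually woke the engine */
    int      last_done;        /* last entry the engine processed */
} ib_engine;

/* Host side: append an entry. The engine cannot see this happen,
   because the queue is just memory. */
bool enqueue(ib_engine *e, int item)
{
    assert(e != NULL);
    if (e->tail - e->head == QUEUE_LEN)
        return false;
    e->work[e->tail % QUEUE_LEN] = item;
    e->tail++;
    return true;
}

/* The poke: wakes a sleeping engine, otherwise effectively a NOP. */
void poke(ib_engine *e)
{
    assert(e != NULL);
    if (e->asleep) {
        e->asleep = false;
        e->wakeups++;
    }
}

/* Engine side: if awake, step down the queue until it is empty, then
   go to sleep. Returns the number of entries processed. */
size_t engine_run(ib_engine *e)
{
    assert(e != NULL);
    if (e->asleep)
        return 0;
    size_t n = 0;
    while (e->head != e->tail) {
        e->last_done = e->work[e->head % QUEUE_LEN];
        e->head++;
        n++;
    }
    e->asleep = true;
    return n;
}
```

An entry queued while the engine is asleep just sits there until someone pokes, which is the whole reason the poke exists.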

ChrisQ

Mar 10, 2013, 7:07:03 PM3/10/13

to

On 03/10/13 19:59, Stephen Fuld wrote:

>
> The queue is in main memory. The poke is needed to tell the IB hardware
> that the "first data" is ready. As you said, the IB hardware then steps
> down the queue, only going to sleep when the queue is empty. But after
> it goes to sleep, it needs something to tell it that there is a new
> entry. The Poke does that.
>

Thanks. Was assuming that any modern controller would have an input pointer
queue implemented as a h/w fifo. It's then easy to arrange that the interface
wakes up on first write to it.

I have seen this in the past in some controllers, but perhaps difficult to
generalise if there's no dma support. Where that is the case, transfers are
limited by fifo size, but even embedded controllers often have several dma
modes within the device...

Regards,

Chris

Stephen Fuld

Mar 11, 2013, 12:41:01 AM3/11/13

to

On 3/10/2013 4:07 PM, ChrisQ wrote:
> On 03/10/13 19:59, Stephen Fuld wrote:
>
>>
>> The queue is in main memory. The poke is needed to tell the IB hardware
>> that the "first data" is ready. As you said, the IB hardware then steps
>> down the queue, only going to sleep when the queue is empty. But after
>> it goes to sleep, it needs something to tell it that there is a new
>> entry. The Poke does that.
>>
>
> Thanks. Was assuming that any modern controller would have an input pointer
> queue implemented as a h/w fifo. It's then easy to arrange that the interface
> wakes up on first write to it.

Sure. But then in order to do the write to the FIFO, it has to be
mapped to some memory address, which is, of course, what we are trying
to eliminate. :-)

> I have seen this in the past in some controllers, but perhaps difficult to
> generalise if there's no dma support. Where that is the case, transfers are
> limited by fifo size, but even embedded controllers often have several dma
> modes within the device...

I'm not saying what you are talking about can't, or indeed isn't, done.
In fact, it is the most popular way of doing it. But it does suffer
from the problems of memory mapping that we are trying to avoid.

ChrisQ

Mar 11, 2013, 1:24:49 PM3/11/13

to

On 03/11/13 04:41, Stephen Fuld wrote:

>
> I'm not saying what you are talking about can't, or indeed isn't, done.
> In fact, it is the most popular way of doing it. But it does suffer from
> the problems of memory mapping that we are trying to avoid.
>

Even if the function is encapsulated into a separate controller, the device
on the local machine that handles the comms to it still has to talk to its
own hardware at some stage. I don't see any way around that, and why is
it so bad anyway? Other than some small embedded devices, all processors have
dma capability these days. It's not too difficult to design elegant and
solid data transfer schemes, especially if there is the type of hardware
support that makes the software design less troublesome. That is, someone has
thought about the overall system requirements. In the past, far too much
hardware design has been minimalist and assumed that the software will paper
over the cracks, but there's no excuse for that now.

Back to i/o, machines can either use memory mapping for i/o, or instructions
specifically to talk to devices, but you could argue that it's still memory
mapped, just has fewer instructions and a smaller address space :-)...

Regards,

Chris

Stephen Fuld

Mar 11, 2013, 2:39:07 PM3/11/13

to

On 3/11/2013 10:24 AM, ChrisQ wrote:
> On 03/11/13 04:41, Stephen Fuld wrote:
>
>>
>> I'm not saying what you are talking about can't, or indeed isn't, done.
>> In fact, it is the most popular way of doing it. But it does suffer from
>> the problems of memory mapping that we are trying to avoid.
>>
>
> Even if the function is encapsulated into a separate controller, the device
> on the local machine that handles the comms to it still has to talk to its
> own hardware at some stage.

Sure.

> I don't see any way around that and why is
> it so bad anyway ?.

There has to be communication. The issue here is whether that
communication takes place via mapping some of the peripheral's function
into the memory space of the processor or via an instruction in the
processor. The former is most common now. The latter is how it used to
be done. I am arguing for the advantages of the latter. As to why it
is better to have an instruction than use memory mapping, see the
upthread list that Andy created, and others have added to, comparing the
two methods.

> Other than some small embedded devices, all processors have
> dma capability these days. It's not too difficult to design elegant and
> solid data transfer schemes, especially if there is the type of hardware
> support that makes the software design less troublesome. That is, someone has
> thought about the overall system requirements. In the past, far too much
> hardware design has been minimalist and assumed that the software will paper
> over the cracks, but there's no excuse for that now.

Yes, but that isn't the issue. The issue is how that hardware's
function is activated.

> Back to i/o, machines can either use memory mapping for i/o, or instructions
> specifically to talk to devices, but you could argue that it's still memory
> mapped, just has fewer instructions and a smaller address space :-)...

Yea, sort of like TTAs, where you do an add by moving the operands to
the address of the adder. :-)

Andy (Super) Glew

Mar 11, 2013, 4:31:17 PM3/11/13

to

On 3/6/2013 1:15 PM, Stephen Fuld wrote:
> On 3/6/2013 11:23 AM, EricP wrote:
>> Stephen Fuld wrote:
>>>
>>> I agree with everything you say above. I would prefer an instruction.
>>> In the original planned implementation, a reserved location was used
>>> as the IB logic was physically in the north bridge chip so I guess a
>>> reserved location seemed the easiest way to do it. But now, an
>>> instruction seems the clear choice.
>>
>> As I point out in another reply, I don't see the need
>> so far for an instruction provided the Infiniband controller
>> designer considers reentrancy issues.
>
> There is no *need* for a new instruction. It is an alternative to a
> reserved memory location. I prefer the instruction but YMMV.

Hmmm... this may be the key characteristic.

Most instruction sets do not have reserved memory locations visible to
user code.

They may have some visible to supervisor code, but even this is less and
less common: more often there is a register that contains the special
address, like CR3 (the Page Table Base Register) on x86, or the register
that points to the interrupt vector table.

--

I conjecture that this may be a principle of good ISA design: avoiding
hardwired, reserved, memory locations.

(I think I have skirted around this in the past, but Stephen makes it
very explicit.)

Q: can anyone remember a successful ISA extension that hardwires,
reserves, memory addresses?

(Sure, mention the accumulator pages of certain old microprocessors -
6502? - but I am more interested in modern stuff.)

==

By the way, Ivan's point about using memory mapped permissions is also
quite reasonable. And not incompatible with the above. It is
straightforward to map any new instruction or feature to an address -
probably a page address - without memory behind it.

But you may need a lot of address space...

Andy (Super) Glew

Mar 11, 2013, 4:48:18 PM3/11/13

to

Guys:

Let's just stipulate that anything that needs multiple POKES to start
off an RDMA is ... well, broken as a candidate for something that you
want to make ubiquitous.

(Unless you have arranged to have completely private reserved locations
per process/thread...)

(Or unless you have transactional memory support.)

Let's also stipulate that you don't need to save/restore old POKE'd values.

This narrows us down to:

a) an instruction

versus

b) a memory mapped interface that you only need a single atomic write to
access. Probably a write of an address word sized quantity, pointing to
a command buffer that contains more info, a full command.

And I guess that this further restricts the ubiquity:

You cannot use such an extension until memory is initialized. Which can
occur quite late in some systems. Whereas an instruction mapped
interface might even be the thing that you use to bring an MPP up -
assuming that there is hardware that doesn't require memory.

ChrisQ

Mar 11, 2013, 5:49:37 PM3/11/13

to

On 03/11/13 18:39, Stephen Fuld wrote:

>
> There has to be communication. The issue here is whether that
> communication takes place via mapping some of the peripheral's function
> into the memory space of the processor or via an instruction in the
> processor. The former is most common now. The latter is how it used to
> be done. I am arguing for the advantages of the latter. As to why it is
> better to have an instruction than use memory mapping, see the upthread
> list that Andy created, and others have added to, comparing the two methods.
>

The problem with separate i/o space and instructions is that it's always
vendor specific, whereas memory mapped i/o is not precluded even where there
is support for such instructions. The memory mapped scheme is more of a
generic solution and makes it easier to transfer hardware designs between
architectures, even if the isa is different. It also has the advantage that
most of the machine instructions can be used, making it possible to write the
i/o code in a high level language. As for the past, even on machines that
supported separate i/o space, it was not uncommon ime to find all the i/o
hardware mapped into memory space, at least in the embedded world.
I'm also not so sure that it was common elsewhere. Pdp, vax and 68k all
used memory mapped i/o and only a few designs, Intel and z80, for example,
used the i/o page idea. I think the original argument was that it saved
expensive decoding h/w, but that's hardly relevant now.

I really don't see the benefit, but perhaps a good compromise would be to
have a shadow i/o address space equivalent to main memory, accessed by
the same hardware lines and instruction set, with just a single hardware
line to select memory / io mode? You then need instructions or a
mechanism to swap data between the spaces, but it's already starting to
look like a camel :-).

>
>
> Yea, sort of like TTAs, where you do an add by moving the operands to
> the address of the adder. :-)
>

TTA? Yes, or in some comms uarts, where the first byte written to the tx
data queue sets off the transmission...

Regards,

Chris

Ivan Godard

Mar 11, 2013, 6:02:21 PM3/11/13

to

Stretch and PDP-10 mapped the registers to addresses, but you wanted
recent. Well, if you are talking about user code then you are talking
about reserved virtual addresses. One obvious one is address 0x0 :-)

Ivan

Andy (Super) Glew

Mar 11, 2013, 6:22:55 PM3/11/13

to

If the IBM/360 was inspired by STRETCH, why did it not have memory
mapped registers?

The PDP10 was a fairly successful family. Although not so much as the
PDP11 or VAX. Why memory mapped registers in the one, but not the other?

Does memory mapping the registers make it harder to extend them,
increasing them from 16->32->64->... bits?

I don't understand your comment about 0x0, Ivan. Unless you mean to say
that it would be very risky to let any address close to zero have side
effects.

ChrisQ

Mar 11, 2013, 6:29:43 PM

to

On 03/06/13 21:30, Stephen Fuld wrote:
> On 3/6/2013 1:24 PM, Ivan Godard wrote:
>> On 3/6/2013 12:00 PM, Stephen Fuld wrote:
>
> snip
>
>>> Yes, and to eliminate the mess you described previously with
>>> virtualizing, protecting, etc. the memory location.
>>
>> But you have to have all that anyway for plain DRAM, so it's free for
>> I/O use too.
>
> I just refer you to Andy's earlier posts about the issues with memory
> mapped I/O.
>

I may be missing something here again, but seems to me that the mmu should
be programmable to the extent that all i/o device registers are locked in a
fixed area of address space, visible to all....

Regards,

Chris

Stephen Fuld

Mar 11, 2013, 6:50:12 PM

to

Look at Andy's post. I have pasted the relevant parts below: (forgive
my odd formatting)

>
> a) poking a reserved location necessarily requires uncached memory mapped I/O
>
> b) if you want to do this from user mode, in almost every process (i.e. if you want to support this sort of communication ubiquitously,

> as opposed to in certain high priority user domain processes), then you probably want to "virtualize" the reserved location. E.g. the

> OS may need to know enough about it to unmap it from certain processes - if, e.g. the number of hardware contexts that the actual device can

> handle is exceeded. This may not be necessary if you have an IOMMU, and the device is fully self-virtualizing. E.g. the hardware

> behind the memory location may need to understand process contexts.
>
> c) a reserved location probably needs at least a 4KiB page, either all to itself, or shared with other, similar, MMIO locations.

In addition, though it is probably not a problem in this case, the
processor, or at least the load/store unit, is slowed down to the speed
of whatever is memory mapped, i.e. no other loads or stores on this L/S
unit while this L/S is going on, which may be more lengthy than a DRAM
reference. This was a problem with parallel PCI with its bridges etc.

Stephen Fuld

Mar 11, 2013, 7:00:35 PM

to

On 3/11/2013 2:49 PM, ChrisQ wrote:
> On 03/11/13 18:39, Stephen Fuld wrote:
>
>>
>> There has to be communication. The issue here is whether that
>> communication takes place via mapping some of the peripheral's function
>> into the memory space of the processor or via an instruction in the
>> processor. The former is most common now. The latter is how it used to
>> be done. I am arguing for the advantages of the latter. As to why it is
>> better to have an instruction than use memory mapping, see the upthread
>> list that Andy created, and other have added to, comparing the two
>> methods.
>>
>
> The problem with separate i/o space and instructions is that it's always
> vendor specific, whereas memory mapped i/o is not precluded even where
> there
> is support for such instructions. The memory mapped scheme is more of a
> generic solution and makes it easier to transfer hardware designs between
> architectures, even if the isa is different.

Agreed. Using memory mapped I/O certainly makes it easier for the CPU
designer/company. Both Andy and I mentioned this already.

It also has the advantage that
> most of the machine instructions can be used, making it possible to
> write the
> i/o code in a high level language.

Remember, we are talking about creating a packet with I/O operation
description in memory. That can be totally user code. We are then
talking about a single instruction that may require privilege. That can
be done in various ways. Inconvenient, yes, but only slightly.

> As for the past, even on machines that
> supported separate i/o space, it was not uncommon ime to find all the i/o
> hardware mapped into memory space, at least in the embedded world.
> I'm also not so sure that it was common elsewhere. Pdp, vax and 68k all
> used memory mapped i/o and only a few designs, Intel and z80, for example,
> used the i/o page idea. I think the original argument was that it saved
> expensive decoding h/w, but that's hardly relevant now.

No one here is talking about a separate I/O space.

> I really don't see the benefit, but perhaps a good compromise would be to
> have an shadow i/o address space equivalent to main memory, accessed by
> the same hardware lines and instruction set, with just a single hardware
> line to select memory / io mode ?. You then need instructions or
> mechanism to swap data between the spaces, but it's already starting to
> look like a camel :-).

Yes. If you have the "single instruction" you mentioned, then you don't
need the separate I/O space. Just use the main memory space and make
the "separate instruction" do what I suggested. :-)

>> Yea, sort of like TTAs, where you do an add by moving the operands to
>> the address of the adder. :-)
>>
>
> TTA ?.

Architectures where the only instruction is a "move". Operations are
accomplished by moving the operands to reserved locations for things
like the adder, etc. Not popular, perhaps for obvious reasons. :-)
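
To make the "only instruction is a move" idea concrete, here is a toy sketch in Python. It is not modelled on any real TTA; the port names (add.a, add.trigger, add.result) and the register names are invented for illustration. Writing to the trigger port is what fires the add.

```python
# Toy transport-triggered machine: the only instruction is MOVE src -> dst.
# Port and register names are hypothetical, chosen for this sketch only.

class ToyTTA:
    def __init__(self):
        self.regs = {"r0": 0, "r1": 0, "r2": 0}
        self.add_operand = 0   # latched first operand of the adder
        self.add_result = 0    # output port of the adder

    def read(self, src):
        if isinstance(src, int):       # immediate value
            return src
        if src == "add.result":
            return self.add_result
        return self.regs[src]

    def move(self, src, dst):
        value = self.read(src)
        if dst == "add.a":
            self.add_operand = value                    # plain latch, no side effect
        elif dst == "add.trigger":
            self.add_result = self.add_operand + value  # this write fires the add
        else:
            self.regs[dst] = value


def run(program):
    machine = ToyTTA()
    for src, dst in program:
        machine.move(src, dst)
    return machine.regs


# r2 = 2 + 3, expressed purely as moves
program = [
    (2, "r0"),
    (3, "r1"),
    ("r0", "add.a"),
    ("r1", "add.trigger"),
    ("add.result", "r2"),
]
```

Calling run(program) leaves the sum in r2. Note that five moves were needed for one add, which hints at why code density is a concern.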

Ivan Godard

Mar 11, 2013, 7:04:20 PM

to

At a guess, because it was designed as a family and there would not have
been a consistent register set across the family.

> The PDP10 was a fairly successful family. Although not so much as the
> PDP11 or VAX. Why memory mapped registers in the one, but not the other?

STRETCH used the memory-mapped registers to unload results that otherwise
couldn't be reached. It was an accumulator machine, which caused issues
reaching the double accumulator and the quotient/remainder. Once you
have a GPR machine you don't need this.

I think the 10 used the memory underneath the registers for something;
you could reach it with suitable convolutions. I've used both 360 and
PDP-10, but as a user. Perhaps those nice folks over at
alt.folklore.computers could answer your questions :-)
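
For readers who have not met register-mapped addressing, a rough Python sketch of the idea: the low addresses resolve to the accumulators rather than to memory. The 16-accumulator layout follows the PDP-10 (addresses 0-17 octal); the "core" array hidden underneath and the class name are assumptions of this sketch, not a description of any particular model.

```python
NUM_ACS = 16  # PDP-10 accumulators answer to addresses 0..15 (0-17 octal)

class RegisterMappedMemory:
    """Toy address decoder: low addresses hit the register file."""

    def __init__(self, size=1024):
        self.acs = [0] * NUM_ACS
        # Main memory, including the words shadowed by the accumulators.
        # Whether and how those words were reachable is not modelled here.
        self.core = [0] * size

    def load(self, addr):
        return self.acs[addr] if addr < NUM_ACS else self.core[addr]

    def store(self, addr, value):
        if addr < NUM_ACS:
            self.acs[addr] = value
        else:
            self.core[addr] = value
```

A store to address 3 lands in accumulator 3 and leaves the memory word underneath untouched.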

> Does memory mapping the registers make it harder to extend them,
> increasing them from 16->32->64->... bits?

Only if you care about changing the mapping underneath the users.

> I don't understand your comment about 0x0, Ivan. Unless you mean to say
> that it would be very risky to let any address close to zero have side
> effects.

Pretty much any machine I've used in the last 30 years did in fact have
side effects for virtual address close to 0x0. You got a segv. :-)

Ivan

Ivan Godard

Mar 11, 2013, 7:29:33 PM

to

On 3/11/2013 4:00 PM, Stephen Fuld wrote:
> On 3/11/2013 2:49 PM, ChrisQ wrote:
>>

<snip>

>>> Yea, sort of like TTAs, where you do an add by moving the operands to
>>> the address of the adder. :-)
>>>
>>
>> TTA ?.

http://en.wikipedia.org/wiki/Transport_triggered_architecture

A very interesting idea that doesn't quite work IMO.

Ivan

Bill Findlay

Mar 11, 2013, 8:26:50 PM

to

On 11/03/2013 23:29, in article khlp9p$9f5$1...@dont-email.me, "Ivan Godard"

It worked very nicely on Turing's ACE design and its English Electric Deuce
derivative. Of course, that was then and this is now.

Sadly, the Wikipedia pages on these machines do not describe this aspect of
their architecture; but see, e.g.

<bitsavers.trailing-edge.com/pdf/englishElectric/deuce/Haley_DEUCE_Mar56.pdf>

--
Bill Findlay
with blueyonder.co.uk;
use surname & forename;

Ivan Godard

Mar 11, 2013, 8:43:46 PM

to

I hope you updated the Wikipedia entry?

Ivan

Jean-Marc Bourguet

Mar 12, 2013, 4:52:07 AM

to

"Andy (Super) Glew" <an...@SPAM.comp-arch.net> writes:

> Q: can anyone remember a successful ISA extension that hardwires, reserves,
> memory addresses.

The Intel 8051 is funny. It has an internal memory space of 256 bytes;
the lower 128 bytes are memory. The upper 128 bytes are memory or control
registers depending on the addressing mode (IIRC, the memory can be
accessed only with indirect access). There are four banks of registers
which are also memory mapped in the internal memory space.
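
A small Python model of that decoding, as I understand it. It assumes an 8052-style part with the full 256 bytes of internal RAM, and it uses the PSW at direct address 0xD0 with bits 4:3 selecting the register bank; the class and method names are made up for the sketch.

```python
PSW = 0xD0  # program status word SFR; bits 4:3 select the register bank

class Toy8051Internal:
    def __init__(self):
        self.ram = [0] * 256      # assumes 256 bytes of internal RAM (8052-style)
        self.sfr = {PSW: 0}       # special function registers, sparse

    # Direct addressing: 0x00-0x7F is RAM, 0x80-0xFF is the SFR space.
    def read_direct(self, addr):
        return self.ram[addr] if addr < 0x80 else self.sfr.get(addr, 0)

    def write_direct(self, addr, value):
        if addr < 0x80:
            self.ram[addr] = value & 0xFF
        else:
            self.sfr[addr] = value & 0xFF

    # Indirect addressing (@R0/@R1): always RAM, including the upper half.
    def read_indirect(self, addr):
        return self.ram[addr]

    def write_indirect(self, addr, value):
        self.ram[addr] = value & 0xFF

    # R0..R7 live in RAM at bank * 8 + n, so they are memory mapped too.
    def reg_addr(self, n):
        bank = (self.sfr[PSW] >> 3) & 0b11
        return bank * 8 + n

    def write_reg(self, n, value):
        self.ram[self.reg_addr(n)] = value & 0xFF

    def read_reg(self, n):
        return self.ram[self.reg_addr(n)]
```

The same numeric address 0x90 therefore names two different bytes, depending on whether the access is direct or indirect.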

Yours,

--
Jean-Marc

Anton Ertl

Mar 12, 2013, 8:35:35 AM

to

"Andy (Super) Glew" <an...@SPAM.comp-arch.net> writes:

>Q: can anyone remember a successful ISA extension that hardwires,
>reserves, memory addresses.
>
>(Sure, mention the accumulator pages of certain old microprocessors -
>6502? - but I am more interested in modern stuff.)

The zero page is not reserved in the 6502, it just can be used with
cheaper and additional addressing modes. Page 1 is also not reserved,
but the stack (indexed with an 8-bit stack pointer) resides there.

The vectors for reset, NMI and interrupt are hard-wired to high
addresses ($FFFE and such), but also not reserved.
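
As an illustration, a toy Python model of the two hard-wired conventions: the stack living in page 1 behind an 8-bit stack pointer, and the vectors at the top of the address space. The class is only a sketch; the vector addresses used are the standard 6502 ones ($FFFA NMI, $FFFC reset, $FFFE IRQ).

```python
STACK_PAGE = 0x0100
VECTORS = {"NMI": 0xFFFA, "RESET": 0xFFFC, "IRQ": 0xFFFE}

class Toy6502:
    def __init__(self):
        self.mem = bytearray(0x10000)   # flat 64 KiB, nothing reserved
        self.sp = 0xFF                  # 8-bit stack pointer

    def push(self, value):
        self.mem[STACK_PAGE | self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF  # wraps within page 1

    def pull(self):
        self.sp = (self.sp + 1) & 0xFF
        return self.mem[STACK_PAGE | self.sp]

    def vector(self, name):
        lo = VECTORS[name]              # little-endian 16-bit pointer
        return self.mem[lo] | (self.mem[lo + 1] << 8)
```

Nothing stops ordinary loads and stores from touching page 1 or the vector bytes, which is the sense in which they are hard-wired but not reserved.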

The 6510 (the C64's CPU) integrated 2 I/O registers on-chip at
addresses 0 and 1, but that's not substantially different from any
other memory-mapped I/O (which usually reserves that memory, too),
which we still have in modern machines.

I wonder why these registers were done on the CPU; my guess is that
was because they were used for controlling the MMU*, so maybe it would
not have been a good idea to route the accesses through the MMU first;
although, given the limits of the MMU, that should not have been a
problem. Or maybe it was just a matter of cost: an extra chip for
controlling the MMU would have cost extra, and the MMU, being a PAL
chip, could not store and hold the control signals for itself.

*) This was not an MMU in the usual sense, but it controlled the
banking of various memory areas (RAM, ROM blocks, I/O block).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Robert Wessel

Mar 12, 2013, 10:28:54 PM

to

On Tue, 12 Mar 2013 12:35:35 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>The 6510 (the C64's CPU) integrated 2 I/O registers on-chip at
>addresses 0 and 1, but that's not substantially different from any
>other memory-mapped I/O (which usually reserves that memory, too),
>which we still have in modern machines.
>
>I wonder why these registers were done on the CPU; my guess is that
>was because they were used for controlling the MMU*, so maybe it would
>not have been a good idea to route the accesses through the MMU first;
>although, given the limits of the MMU, that should not have been a
>problem. Or maybe it was just a matter of cost: an extra chip for
>controlling the MMU would have cost extra, and the MMU, being a PAL
>chip, could not store and hold the control signals for itself.

The 6510 was basically identical to the 6502, except it could
tri-state the bus, and it had a built-in 8-bit parallel I/O port. The
registers at $00 and $01 were just to control that port, and the
selected address was really just MOS's whim.

There's no reason Commodore couldn't have put an external 6520 (a
rather more capable PIO device) on the motherboard, and then added the
necessary decoding, and perhaps buffering to keep it ahead of the MMU,
(possibly introducing a performance problem with the longer data
path). But it's likely that the port was just convenient, or perhaps
the tri-stating was, and the port came with that.

Anton Ertl

Mar 13, 2013, 8:06:05 AM

to

Robert Wessel <robert...@yahoo.com> writes:
>The 6510 was basically identical to the 6502, except it could
>tri-state the bus, and it had a built-in 8-bit parallel I/O port. The
>registers at $00 and $01 were just to control that port, and the
>selected address was really just MOS's whim.

...

> But it's likely that the port was just convenient, or perhaps
>the tri-stating was, and the port came with that.

I think that tri-stating was necessary for DMA by the VIC-II chip.
Not sure how the VIC-20 did it, maybe it had a separate buffer chip
for tri-stating the bus? The 6510 port probably had nothing to do
with it; but if you already do a new chip and have some pins free, why
not put an I/O port on it.

Paul A. Clayton

Mar 13, 2013, 9:19:21 AM

to

On Mar 11, 4:31 pm, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> Q: can anyone remember a successful ISA extension that hardwires,
> reserves, memory addresses.

Only somewhat related (not extensions, not necessarily
successful, and generally address ranges):

MIPS segments the virtual address space (five segments
for 32-bit implementations: kernel (mapped | unmapped
(cached | uncached)), supervisor mapped, and user
mapped).
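
For reference, a Python sketch of the usual 32-bit MIPS layout. The segment names (kuseg, kseg0, kseg1, ksseg, kseg3) are the conventional ones; the function names are mine.

```python
# (low, high, name, description) for the five 32-bit MIPS segments
SEGMENTS = [
    (0x00000000, 0x7FFFFFFF, "kuseg", "user, mapped"),
    (0x80000000, 0x9FFFFFFF, "kseg0", "kernel, unmapped, cached"),
    (0xA0000000, 0xBFFFFFFF, "kseg1", "kernel, unmapped, uncached"),
    (0xC0000000, 0xDFFFFFFF, "ksseg", "supervisor, mapped"),
    (0xE0000000, 0xFFFFFFFF, "kseg3", "kernel, mapped"),
]

def classify(vaddr):
    """Return (segment name, description) for a 32-bit virtual address."""
    vaddr &= 0xFFFFFFFF
    for low, high, name, kind in SEGMENTS:
        if low <= vaddr <= high:
            return name, kind
    raise ValueError("unreachable: the segments cover the whole space")

def unmapped_physical(vaddr):
    """kseg0/kseg1 bypass the TLB: physical = virtual with the top 3 bits cleared."""
    name, _ = classify(vaddr)
    if name not in ("kseg0", "kseg1"):
        raise ValueError("address is translated by the TLB, not by a fixed rule")
    return vaddr & 0x1FFFFFFF
```

The reset vector 0xBFC00000 falls in kseg1, so boot code runs uncached and untranslated from physical 0x1FC00000.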

Lots of ISAs had a reserved address for interrupt
vectors and start-up code.

Fairchild's CLIPPER C100 reserved eight pages of its
supervisor address space to hardwired addresses (2
boot ROM pages, 2 I/O pages, and 4 memory pages [1
for interrupt/trap vectors, 1 private and copy-back,
and 2 noncacheable--note, page tables were in
noncacheable memory]--the boot ROM and I/O pages
actually shared the same physical address as the 2
lowest memory pages but were distinguished by
special signals that included cacheability information
used for coherence). The MMUs were at fixed addresses
(not too surprising as standard devices).

Itanium dedicates the upper half of the physical
address space to uncached (I/O) memory.

acd

Mar 14, 2013, 6:08:43 AM

to

Am Mittwoch, 6. März 2013 05:51:06 UTC+1 schrieb Andy (Super) Glew:

> a) poking a reserved location necessarily requires uncached memory
> mapped I/O
>
> b) if you want to do this from user mode, in almost every process (i.e.
> if you want to support this sort of communication ubiquitously, as
> opposed to in certain high priority user domain processes), then you
> probably want to "virtualize" the reserved location. E.g. the OS may
> need to know enough about it to unmap it from certain processes - if,
> e.g. the number of hardware contexts that the actual device can handle
> is exceeded. This may not be necessary if you have an IOMMU, and the
> device is fully self-virtualizing. E.g. the hardware behind the memory
> location may need to understand process contexts.
>

The IBM PERCS system uses a separate communication chip called the P7 hub.
It is part of the cache coherency domain.
The network interface accelerates many different protocols, but it is not true IB AFAIK.

There are two ways to send a message. For a short message, there is a co-processor instruction; this results in a single bus transfer, similar to a write, except that a whole cache line containing the packet data is transferred.
So this is the maximum speed you can get: one instruction in the processor, one bus transfer, and the data is sent out. The number of cycles in the network interface for those short messages is also very low.

The second way, as pointed out, is based on queues, and you need a single, non-cached, per-application allocatable doorbell register.
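
A rough Python sketch of the queue-plus-doorbell pattern in general, not of the P7 hub itself: descriptors are built with ordinary stores, and one write to the doorbell tells the device how far to read. ToyNic, SendQueue and their methods are invented names, and the ring has no overflow check.

```python
class ToyNic:
    """Stand-in for a device; write_doorbell mimics an uncached MMIO store."""

    def __init__(self, ring):
        self.ring = ring
        self.head = 0      # next descriptor the device will consume
        self.sent = []     # what went out on the (imaginary) wire

    def write_doorbell(self, tail):
        # The side effect of the store: drain descriptors up to the new tail.
        while self.head != tail:
            self.sent.append(self.ring[self.head])
            self.head = (self.head + 1) % len(self.ring)


class SendQueue:
    """Per-application queue; no overflow check, to keep the sketch short."""

    def __init__(self, slots=8):
        self.ring = [None] * slots
        self.tail = 0
        self.nic = ToyNic(self.ring)

    def post(self, payload):
        # Step 1: ordinary (cacheable) stores build the descriptor in memory.
        self.ring[self.tail] = payload
        self.tail = (self.tail + 1) % len(self.ring)

    def ring_doorbell(self):
        # Step 2: a single store to the doorbell publishes everything posted.
        self.nic.write_doorbell(self.tail)
```

Several post() calls can share one ring_doorbell(), which is how this scheme spreads the cost of the uncached access over multiple messages.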

> c) a reserved location probably needs at least a 4KiB page, either all
> to itself, or shared with other, similar, MMIO locations.
>
> d) an instruction is - well, an instruction.
>
> e) I don't think that I would want to build a fully hardwired version of
> such an instruction. That's excessively CISCy.
>

The coprocessor-call instructions existed already.

Andreas

timca...@aol.com

Mar 19, 2013, 2:27:25 PM

to an...@spam.comp-arch.net

On Monday, March 11, 2013 4:31:17 PM UTC-4, Andy (Super) Glew wrote:
> On 3/6/2013 1:15 PM, Stephen Fuld wrote:
> > On 3/6/2013 11:23 AM, EricP wrote:
>
> Q: can anyone remember a successful ISA extension that hardwires,
> reserves, memory addresses.

The Weitek numeric coprocessors used a trick with 32 bit I/O
instructions on the 386/486 to basically extend the ISA.

Any embedded product usually has some number of memory mapped registers that are hardwired (which is not really extending the ISA). I know, not what you are really talking about, but how different is it?

- Tim

John D. McCalpin

Mar 20, 2013, 2:50:53 PM

to SF...@alumni.cmu.edu.invalid

On Monday, March 11, 2013 1:39:07 PM UTC-5, Stephen Fuld wrote:
> There has to be communication. The issue here is whether that
> communication takes place via mapping some of the peripheral's function
> into the memory space of the processor or via an instruction in the

> processor. [...]

At the risk of wading in to a stale conversation....

I agree that there has to be "communication" -- I just wish I knew what that word meant.

It is fascinating to me that the term "communication" does not exist in the architectural specification documents defining the major computer architectures in use today.

Of course I am exaggerating slightly (as the greybeards here would expect). The term actually does occur a handful of times in the AMD64 and Intel64 architectural specifications, but only as an aside, never as an explicit architectural concept. E.g., the Intel SW developers guide mentions that a compare and swap operation can be used to implement a mutex, which clearly contains a kind of "communication".
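
To illustrate the compare-and-swap mutex mentioned above, a Python sketch. Python has no user-level CAS, so AtomicCell fakes one with an internal lock that merely stands in for hardware atomicity; all names here are invented for the example.

```python
import threading
import time

class AtomicCell:
    """Emulates one memory word with compare-and-swap.
    The internal lock only stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

    def store(self, value):
        with self._guard:
            self._value = value


class SpinMutex:
    def __init__(self):
        self._word = AtomicCell(0)      # 0 = free, 1 = held

    def acquire(self):
        while not self._word.compare_and_swap(0, 1):
            time.sleep(0)               # yield instead of burning the CPU

    def release(self):
        self._word.store(0)


def demo(threads=4, rounds=200):
    mutex = SpinMutex()
    total = [0]

    def worker():
        for _ in range(rounds):
            mutex.acquire()
            total[0] += 1               # protected read-modify-write
            mutex.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

The "communication" here is implicit: one thread learns that another has left the critical section only by observing a value in a shared word.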

Current architectures "allow" communication to be implemented indirectly using sequences of (partially) ordered loads and stores to a shared memory space. The indirect nature of the implementation is exacerbated by the dual addictions to transparent (uncontrollable) caches and the requirement that memory references be free of side effects.

Since "communication" is not an architectural feature, it is not possible to build optimized hardware to support it. Similarly, since the memory references that lead (indirectly) to "communication" are not distinguishable from "local" memory accesses, there is no way for the hardware to process those references using (different) protocol choices that are more appropriate/efficient.

So I will assert that no matter how many cores you put on a chip, there is not going to be a "multi-core revolution" until "communication" becomes a fundamental feature of the underlying architecture, and implementations are created for which the performance of "communication" is governed by direct physical limits rather than by the quirks of hack upon hack upon hack layered onto the uniprocessor flat-memory architecture that has changed little since the 1980's.

john (aka "Dr. Bandwidth")

Ivan Godard

Mar 20, 2013, 3:02:41 PM

to

Strongly agree.

Unfortunately, at least for i/o there's not much that the CPU designer
can do about it. As a practical commercial matter, a new CPU must be
able to support existing mass-market peripherals and protocols. All too
often those have "mapped into the memory space" and other indirection
baked in.

So the architect's choice is to supply a low-level interface and make
dealing with evolving devices and protocols somebody else's problem, or
add a nice clean operation to hide the device/protocol, which nobody
will use because it's simpler/cheaper to cut/paste the code that drivers
on other machines use.

We are driven by network effects to use the lowest common denominator.
Which is pretty low.

BTW, please set your newswriter to line wrap your postings. I had to do
it manually for you to reply.

Ivan

Stephen Fuld

Mar 20, 2013, 4:21:48 PM

to

On 3/20/2013 11:50 AM, John D. McCalpin wrote:

> On Monday, March 11, 2013 1:39:07 PM UTC-5, Stephen Fuld wrote:
>> There has to be communication. The issue here is whether that
>> communication takes place via mapping some of the peripheral's function
>> into the memory space of the processor or via an instruction in the
>> processor. [...]
>
> At the risk of wading in to a stale conversation....

No problem. It's good to have you back!

> I agree that there has to be "communication" -- I just wish I knew what that word meant.
>
> It is fascinating to me that the term "communication" does not exist in the architectural specification documents defining the major computer architectures in use today.
>
> Of course I am exaggerating slightly (as the greybeards here would expect). The term actually does occur a handful of times in the AMD64 and Intel64 architectural specifications,

Well, of course, both of these architectures have "punted" I/O (which
was the original subject of this thread) and not included any mention of
it, leaving it totally to the peripheral manufacturers.

> but only as an aside, never as an explicit architectural concept. E.g., the Intel SW developers guide mentions that a compare and swap operation can be used to implement
>
> a mutex, which clearly contains a kind of "communication".
>
> Current architectures "allow" communication to be implemented indirectly using sequences of (partially) ordered loads and stores to a shared memory space. The indirect
>
> nature of the implementation is exacerbated by the dual addictions to transparent (uncontrollable) caches and the requirement that memory references be free of side effects.
>
> Since "communication" is not an architectural feature, it is not possible to build optimized hardware to support it. Similarly, since the memory references that lead (indirectly)
>
> to "communication" are not distinguishable from "local" memory accesses, there is no way for the hardware to process those references using (different) protocol choices that
>
> are more appropriate/efficient.
>
> So I will assert that no matter how many cores you put on a chip, there is not going to be a "multi-core revolution" until "communication" becomes a fundamental feature of
>
> the underlying architecture, and implementations are created for which the performance of "communication" is governed by direct physical limits rather than by the quirks
>
> of hack upon hack upon hack layered onto the uniprocessor flat-memory architecture that has changed little since the 1980's.
>
> john (aka "Dr. Bandwidth")

Interesting point of view. I think I agree but I need to spend more
time thinking about it. While I certainly agree for I/O, what about
cache coherence traffic. Is that "communication" in the sense we are
talking about?

Stephen Fuld

Mar 20, 2013, 4:30:35 PM

to

On 3/20/2013 12:02 PM, Ivan Godard wrote:

snip

> Unfortunately, at least for i/o there's not much that the CPU designer
> can do about it. As a practical commercial matter, a new CPU must be
> able to support existing mass-market peripherals and protocols. All too
> often those have "mapped into the memory space" and other indirection
> baked in.

That is why I regard not supporting an I/O mechanism in the CPU but
relying on memory mapping as one of the major errors in computer
architecture, which we discussed here some time ago. BTW, if you regard
yourself as a "CPU designer", instead of a "Computing System designer",
you fall into that trap rather easily.

Since that low level of I/O is generally used primarily by the OS, it
should be possible to provide a "hardware adapter chip" that takes your
new I/O and interfaces it to the old style peripherals. The OS will
have to know about the new low level stuff, but the users shouldn't.
Yes, it will be slower, but that is more incentive to move to the newer,
better I/O interface.

> So the architect's choice is to supply a low-level interface and make
> dealing with evolving devices and protocols somebody else's problem, or
> add a nice clean operation to hide the device/protocol, which nobody
> will use because it's simpler/cheaper to cut/paste the code that drivers
> on other machines use.

I think what I discussed above might be a compromise.

> We are driven by network effects to use the lowest common denominator.
> Which is pretty low.

Agreed.

nm...@cam.ac.uk

Mar 20, 2013, 5:57:49 PM

to

In article <f9c65a17-57b7-49a1...@googlegroups.com>,
John D. McCalpin <mcca...@tacc.utexas.edu> wrote:

Greetings from a while back!

>The indirect nature of the implementation is exacerbated by the dual
>addictions to transparent (uncontrollable) caches and the requirement
>that memory references be free of side effects.

And, God help us all, the appalling inconsistencies in the C and
POSIX standards in this area.

>So I will assert that no matter how many cores you put on a chip, there
>is not going to be a "multi-core revolution" until "communication"
>becomes a fundamental feature of the underlying architecture, and
>implementations are created for which the performance of "communication"
>is governed by direct physical limits rather than by the quirks of hack
>upon hack upon hack layered onto the uniprocessor flat-memory architecture
>that has changed little since the 1980's.

I still think that this is a sociological/marketing and language
issue far more than a hardware one. Despite being the oldest of
extant languages, Fortran is in many ways the most advanced, and
is much less crippled by this than pretty well every other widely
used language. Which is not to say that it is immune.

Regards,
Nick Maclaren.

Ivan Godard

Mar 20, 2013, 6:46:19 PM

to

On 3/20/2013 1:30 PM, Stephen Fuld wrote:
> On 3/20/2013 12:02 PM, Ivan Godard wrote:
>
> snip
>
>> Unfortunately, at least for i/o there's not much that the CPU designer
>> can do about it. As a practical commercial matter, a new CPU must be
>> able to support existing mass-market peripherals and protocols. All too
>> often those have "mapped into the memory space" and other indirection
>> baked in.
>
> That is why I regard not supporting an I/O mechanism in the CPU but
> relying on memory mapping as one of the major errors in computer
> architecture, which we discussed here some time ago. BTW, if you regard
> yourself as a "CPU designer", instead of a "Computing System designer",
> you fall into that trap rather easily.

Would that I could!

While I might aspire to CSD, I cannot aspire to Computer Systems
Salesman. Yes, a sufficiently large preexisting entity can consider
replacing an ecosystem all the way to the consumer. Your argument is fair
if directed at Apple or IBM. But you have to be Apple or IBM first.

>
> Since that low level of I/O is generally used primarily by the OS, it
> should be possible to provide a "hardware adapter chip" that takes your
> new I/O and interfaces it to the old style peripherals. The OS will
> have to know about the new low level stuff, but the users shouldn't.
> Yes, it will be slower, but that is more incentive to move to the newer,
> better I/O interface.

Sorry guy, from the viewpoint of a chip architect the OS *is* a user. As
is the designer of a device or protocol. Why should they support an
unfamiliar chip with a non-standard (to them) interface?

Ivan

Stephen Fuld

Mar 20, 2013, 7:31:39 PM

to

On 3/20/2013 3:46 PM, Ivan Godard wrote:
> On 3/20/2013 1:30 PM, Stephen Fuld wrote:
>> On 3/20/2013 12:02 PM, Ivan Godard wrote:
>>
>> snip
>>
>>> Unfortunately, at least for i/o there's not much that the CPU designer
>>> can do about it. As a practical commercial matter, a new CPU must be
>>> able to support existing mass-market peripherals and protocols. All too
>>> often those have "mapped into the memory space" and other indirection
>>> baked in.
>>
>> That is why I regard not supporting an I/O mechanism in the CPU but
>> relying on memory mapping as one of the major errors in computer
>> architecture, which we discussed here some time ago. BTW, if you regard
>> yourself as a "CPU designer", instead of a "Computing System designer",
>> you fall into that trap rather easily.
>
> Would that I could!
>
> While I might aspire to CSD, I cannot aspire to Computer Systems
> Salesman. Yes, a sufficiently large preexisting entity can consider
> replacing an ecosystem all the way to the consumer.

Well, I wasn't suggesting going quite that far. You would still use
standard disks, power supplies, DRAMs, etc.

> Your argument is fair
> if directed at Apple or IBM. But you have to be Apple or IBM first.

Actually, interface standards such as PCI, USB, etc. came out of places
like Intel. But I take the point that you are not Intel. But see below.

> >
> > Since that low level of I/O is generally used primarily by the OS, it
> > should be possible to provide a "hardware adapter chip" that takes your
> > new I/O and interfaces it to the old style peripherals. The OS will
> > have to know about the new low level stuff, but the users shouldn't.
> > Yes, it will be slower, but that is more incentive to move to the newer,
> > better I/O interface.
>
> Sorry guy, from the viewpoint of a chip architect the OS *is* a user. As
> is the designer of a device or protocol. Why should they support an
> unfamiliar chip with a non-standard (to them) interface?

For the same reason they might want to support a chip with a totally
different ISA, and probably a different arrangement for interrupts,
different systems registers, etc. All are part of delivering what one
hopes would be a better experience to the ultimate end user.

Note that this is not a knock on your Mill. Just saying that I/O is, or
rather should be, as much a part of the system as many other things.

Ivan Godard

Mar 20, 2013, 8:27:47 PM

to

Patience, waiting is :-)

>> >
>> > Since that low level of I/O is generally used primarily by the OS, it
>> > should be possible to provide a "hardware adapter chip" that takes your
>> > new I/O and interfaces it to the old style peripherals. The OS will
>> > have to know about the new low level stuff, but the users shouldn't.
>> > Yes, it will be slower, but that is more incentive to move to the newer,
>> > better I/O interface.
>>
>> Sorry guy, from the viewpoint of a chip architect the OS *is* a user. As
>> is the designer of a device or protocol. Why should they support an
>> unfamiliar chip with a non-standard (to them) interface?
>
> For the same reason they might want to support a chip with a totally
> different ISA, and probably a different arrangement for interrupts,
> different systems registers, etc. All are part of delivering what one
> hopes would be a better experience to the ultimate end user.

The goal of such people is *not* to provide a better experience for the
ultimate end user. It is to provide a salable experience at highest profit.

Economic and technological oligarchs have no incentive to put money into
better. They get more return putting their money into *suppressing*
better. http://en.wikipedia.org/wiki/Disruptive_technology

We are quite certain that the Mill will face one or more Warren Zevon
moments. http://en.wikipedia.org/wiki/Lawyers,_Guns_and_Money

> Note that this is not a knock on your Mill. Just saying that I/O is, or
> rather should be, as much a part of the system as many other things.

As much as possible we model i/o as just another thread.

MitchAlsup

Mar 20, 2013, 9:12:51 PM

to

On Wednesday, March 20, 2013 2:02:41 PM UTC-5, Ivan Godard wrote:
> Unfortunately, at least for i/o there's not much that the CPU designer
> can do about it. As a practical commercial matter, a new CPU must be
> able to support existing mass-market peripherals and protocols. All too
> often those have "mapped into the memory space" and other indirection
> baked in.

Disagree! (in a bizarre way)!!

What one needs is an ISA compatible CPU that is small enough to put in the
peripheral chips and execute the I/O code in an In-Order processor placed
close to the peripheral being manipulated. You can have dozens of these
CPUlettes sitting around waiting for I/O work to arrive. The
GBOoO processor takes a fault on the first Poke, the task is transferred
to the CPUlette, and run locally at the device.

The great big number-crunchers have (essentially) no business doing pokes
at I/O control registers. The GBOoO processors might be several microseconds
from the I/O control register doing uncacheable stuff. The little bitty CPU
can be 100 nanoseconds from the device, perform the whole task, and then swap
the task back onto the run queue of the GBOoO processor when the I/O event
is done. {That is where the ISA compatible requirement comes from.}

What this means is that the LBIOP (Little Bitty In Order Processor) needs to
be implemented in Verilog (or similar) and compiled onto the rest of the
external chip.

Mitch

Ivan Godard

Mar 20, 2013, 9:34:16 PM

to

Well, yes, of course. Been done. http://en.wikipedia.org/wiki/CDC_6600.

But all this does is push the problem away from the GBOOO. The question
remains unanswered: should LBIOP use MMIO or an instruction to actually
goose the peripheral?

And if LBIOP and GBOOO are ISA compatible, then the solution you choose
for LBIOP will need to be in GBOOO too, even if it is not used in most
configurations.

Lastly, why should the communication between GBOOO and LBIOP be via
poke/fault? Shouldn't this be a more general HEYU operation, that can be
used not only to reach LBIOP but also the next GBOOO over in the rack?

That is, why model i/o as process migration when it could be modeled as
inter-process communication? The latter seems conceptually cleaner to me.
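
To make the contrast concrete, here is a minimal sketch of the inter-process-communication model, with Python threads standing in for the GBOOO and an I/O processor. The `heyu` function, the message fields, and the device log are illustrative names only and do not describe any real hardware mechanism.

```python
import queue
import threading


def heyu(mailbox, message):
    """Fire-and-forget send: enqueue the request and return immediately."""
    mailbox.put(message)


def io_processor(mailbox, device_log):
    """Stand-in for an IOP: drains requests and pokes the device locally."""
    while True:
        message = mailbox.get()
        if message is None:  # sentinel: shut the IOP down
            break
        # All device-register traffic would happen here, next to the device.
        device_log.append((message["op"], message["block"]))


mailbox = queue.Queue()
device_log = []
iop = threading.Thread(target=io_processor, args=(mailbox, device_log))
iop.start()

# The big core issues requests and carries on without waiting on the device.
heyu(mailbox, {"op": "read", "block": 7})
heyu(mailbox, {"op": "write", "block": 9})
heyu(mailbox, None)
iop.join()
print(device_log)
```

Nothing migrates here: the requesting thread keeps its own state, and only the request crosses to the I/O side.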

Ivan

MitchAlsup

Mar 21, 2013, 3:55:58 PM

to

On Wednesday, March 20, 2013 8:34:16 PM UTC-5, Ivan Godard wrote:
> Well, yes, of course. Been done. http://en.wikipedia.org/wiki/CDC_6600.

Excepting that these were not ISA compatible.

> But all this does is push the problem away from the GBOOO. The question
> remains unanswered: should LBIOP use MMIO or an instruction to actually
> goose the peripheral?

Sure, why not. It makes porting the code so much easier.

> And if LBIOP and GBOOO are ISA compatible, then the solution you choose
> for LBIOP will need to be in GBOOO too, even if it is not used in most
> configurations.

Which is why the thing IS ISA compatible.

> Lastly, why should the communication between GBOOO and LBIOP be via
> poke/fault? Shouldn't this be a more general HEYU operation, that can be
> used not only to reach LBIOP but also the next GBOOO over in the rack?

I'm assuming that the GBOoO has already transferred control to a thread
with adequate privileges, and that the same OS image can run either
configuration. I am not claiming that this is optimal, just better than
trying to sort out the memory ordering issues on the GBOoO.

> That is, why model i/o as process migration when it could be modeled as
> inter-process communication? The latter seems conceptually cleaner to me.

But the GBOoO I'm compatible with does not have this feature set.

Mitch

Ivan Godard

Mar 21, 2013, 5:27:32 PM

to

Ah! Then we are addressing different issues and talking across each
other. Not the first time :-)

Taking your use case, your LBIOP would need to support the full ISA just
in case the driver for some reason did a double precision float divide
or some random SIMD. But you really wouldn't want to waste the area
and complications on those pointless ops. So you'd want to take an
unimplemented-instruction trap instead and do them in software if ever
called for.

However, taking a trap in the middle of I/O code seems doubtful,
especially if the code had to be paged in. So the emulation library has
to be pinned, and even then there might be timing and/or interrupt issues.
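
A rough sketch of the trap-and-emulate arrangement described above, as a toy interpreter. The opcode names, the split between `HARDWARE_OPS` and `EMULATED_OPS`, and the trap counter are all assumptions made for illustration; no real ISA is being modeled.

```python
class UnimplementedInstruction(Exception):
    """Raised when the small core meets an opcode it has no hardware for."""


# Operations the small core is assumed to implement directly.
HARDWARE_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
}

# The pinned software emulation library for everything else.
EMULATED_OPS = {
    "fdiv": lambda a, b: a / b,
}


def execute(op, a, b):
    """Run one instruction in 'hardware', or trap if it is not implemented."""
    if op not in HARDWARE_OPS:
        raise UnimplementedInstruction(op)
    return HARDWARE_OPS[op](a, b)


def run(program):
    """Execute a list of (op, a, b) tuples, emulating on each trap."""
    results = []
    traps = 0
    for op, a, b in program:
        try:
            results.append(execute(op, a, b))
        except UnimplementedInstruction:
            traps += 1  # each trap adds handler latency to the I/O path
            results.append(EMULATED_OPS[op](a, b))
    return results, traps


print(run([("add", 1, 2), ("fdiv", 1.0, 4.0), ("sub", 5, 3)]))
```

The `traps` count is the quantity of interest: every increment is a detour through the handler in the middle of device code.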

There might be handoff issues too, depending on how the state gets from
GBOOO to LBIOP. If LBIOP is on GBOOO's side of the TLB then it can share
the TLB, but not if there are hundreds of LBIOPs. Likewise, does LBIOP
have a cache? If so, how is coherency maintained during the handoff?
What if the GBOOO started a cache-based transaction before it first
banged on the MMIO location? What if the GBOOO is holding OS locks - can
LBIOP unlock them, or must that be done after migration back to GBOOO?

For that matter, I see how you trigger migration from GBOOO to LBIOP,
but how does the hardware know when it's the right time to move back?

Then what about virtualization? Are the LBIOPs virtual? Hard to imagine
how they could be, but I may be missing something here.

Lastly, do you really think that a full process migration (both ways)
will be faster than simply letting GBOOO wait for the no-cache MMIO
hardware?

And BTW, if you controlled the GBOOO's feature set (i.e. my issue),
would you still use this approach, or something more along the lines of
an IOP and HEYU fire-and-forget messages between GBOOO and IOP?

Ivan

MitchAlsup

Mar 22, 2013, 2:04:04 AM

to

On Thursday, March 21, 2013 4:27:32 PM UTC-5, Ivan Godard wrote:
> Taking your use case, your LBIOP would need to support the full ISA just
> in case the driver for some reason did a double precision float divide
> or something random SIMD. But you really wouldn't want to waste the area
> and complications on those pointless ops.

You have better options: A) trap back to the GBOoO processor for this stuff
B) implement a full (albeit slow) FP unit in the LBIOP.

Blasphemy? Cell phone SOCs have ~64 SP FP units on the GPU right now.
Cell phones! with 2W thermal limits!

Secondly, you would NEVER do them in SW for a large variety of reasons.

> There might be handoff issues too, depending on how the state gets from
> GBOOO to LBIOP.

State gets from GBOoO to LBIOP in exactly the same way state gets from one
GBOoO to another GBOoO. It's called a context switch. And aside from some
affinity stuff, the OS is just scheduling another task switch.

> Lastly, do you really think that a full process migration (both ways)
> will be faster than simply letting GBOOO wait for the no-cache MMIO
> hardware?

Probably, especially if the LBIOP has useful sized caches. Remember it's
not just one MMIO that takes several microseconds, but a series of these
really slow things.
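
The trade-off can be put as a small cost model. This is only a sketch: the 100 ns local access comes from earlier in the thread, while the 2000 ns remote MMIO access (one reading of "several microseconds") and the 5000 ns one-way migration cost are assumed numbers, not measurements.

```python
def direct_mmio_ns(accesses, mmio_ns):
    """Total time if the big core performs every uncached access itself."""
    return accesses * mmio_ns


def migrated_ns(accesses, local_ns, migrate_ns):
    """Total time if the task migrates out, runs locally, and migrates back."""
    return 2 * migrate_ns + accesses * local_ns


def break_even_accesses(mmio_ns, local_ns, migrate_ns):
    """Smallest access count at which migration is strictly faster."""
    saving_per_access = mmio_ns - local_ns
    if saving_per_access <= 0:
        return None  # migration can never pay off
    return (2 * migrate_ns) // saving_per_access + 1


# Assumed figures: 2000 ns per remote MMIO, 100 ns local, 5000 ns per migration.
print(break_even_accesses(2000, 100, 5000))
```

With these assumed figures the crossover is at six device accesses; a slower migration or a faster MMIO path moves it out accordingly.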

After a few years, when the concept is fully integrated, GBOoO processors
will have worked out a more efficient approach to talking with the LBIOPs
that do the IO on their behalf.

Mitch

ken...@cix.compulink.co.uk

Mar 22, 2013, 6:35:01 AM

to

In article <kid65f$4ta$1...@dont-email.me>, SF...@alumni.cmu.edu.invalid

(Stephen Fuld) wrote:

> That is why I regard not supporting an I/O mechanism in the CPU but
> relying on memory mapping as one of the major errors in computer
> architecture,

Then again memory mapping may be the only way of handling some forms of
I/O. Certainly with 8 bit home computers and the early 16 bit ones
memory mapping was the only way to provide a video display. The IBM PC
has to retain memory mapped video to enable the use of early software.
Fast serial and parallel communications required maintaining a buffer in
RAM etc.

A lot depends on whether or not you are starting from a clean sheet and
what the I/O requirements are.

Ken Young

Stephen Fuld

Mar 22, 2013, 11:56:48 AM

to

On 3/22/2013 3:35 AM, ken...@cix.compulink.co.uk wrote:
> In article <kid65f$4ta$1...@dont-email.me>, SF...@alumni.cmu.edu.invalid
> (Stephen Fuld) wrote:
>
>> That is why I regard not supporting an I/O mechanism in the CPU but
>> relying on memory mapping as one of the major errors in computer
>> architecture,
>
> Then again memory mapping may be the only way of handling some forms of
> I/O. Certainly with 8 bit home computers and the early 16 bit ones
> memory mapping was the only way to provide a video display. The IBM PC
> has to retain memory mapped video to enable the use of early software.
> Fast serial and parallel communications required maintaining a buffer in
> RAM etc.

I agree that video (especially the frame buffer) should be excepted from
my complaint about MMIO. I was referring more to things like disk and
printers, etc. It's the peeking and poking of memory mapped device
registers that I was mostly referring to.

> A lot depends on whether or not you are starting from a clean sheet and
> what the I/O requirements are.

Sure.

John D. McCalpin

Mar 22, 2013, 12:54:15 PM

to

> - Stephen Fuld

Your question brings us back to the issue of definitions.

There is no doubt that cache coherence can be used to implement
"communication", but the implementations are painfully indirect.

Any definition of "communication" must include the propagation of
information from one task to another. Most low-level coherence
transactions consist of either moving copies of data (including
interventions) or invalidating lines in caches. Of all the coherence
transactions, only modified interventions directly implement the
propagation of information from one task to another. Writebacks
of dirty lines in shared memory or streaming stores to shared memory
provide *part* of a communication operation, requiring a read
of the updated memory location to complete the transfer of information.
In both cases one task writes to a shared memory location and the
other task reads from the shared memory location, but there are a
variety of (largely uncontrollable) sequences of low-level transactions
that may result.

Most definitions of "communication" require some type of synchronization,
and much (most?) "communication" requires multiple synchronization
operations. Synchronization typically involves "metadata", such as a
flag or version number indicating whether the shared data has been updated.
In shared memory systems, the consumer of shared data typically needs to
know if data has been updated before bothering to read it, and the producer
typically needs to know when the consumer has finished using a set of
shared memory locations so that they can be re-used by the producer.
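
A minimal sketch of that handshake, using one shared slot and two counters as the metadata: one says the data has been updated, the other says the slot is free for reuse. The `Mailbox` class and its field names are invented for illustration. Python threads stand in for processors, and the sketch leans on CPython executing these simple reads and writes in program order; real hardware would need explicit fences or atomics at each step.

```python
import threading
import time


class Mailbox:
    """One shared slot plus two counters acting as the metadata."""

    def __init__(self):
        self.data = None
        self.written = 0   # bumped only by the producer, after each write
        self.consumed = 0  # bumped only by the consumer, after each read


def producer(box, items):
    for item in items:
        while box.written != box.consumed:  # slot not yet released
            time.sleep(0)
        box.data = item
        box.written += 1  # publish: "the data has been updated"


def consumer(box, count, out):
    for _ in range(count):
        while box.written == box.consumed:  # nothing new yet
            time.sleep(0)
        out.append(box.data)
        box.consumed += 1  # release: "the slot may be reused"


def transfer(items):
    """Move items through the mailbox and return what the consumer saw."""
    box, out = Mailbox(), []
    threads = [
        threading.Thread(target=producer, args=(box, items)),
        threading.Thread(target=consumer, args=(box, len(items), out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(transfer(list(range(10))))
```

Each item costs two synchronization steps, one in each direction, which is the "multiple synchronization operations" point made above.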

This simple example includes many assumptions, including the idea that
communication is taking place through reads and writes to shared memory,
and that the (mutable) shared addresses will be reused. While these are
natural assumptions for cached shared-memory systems, I worry that these
details are obscuring the more fundamental attributes of communication.
Exposing these more fundamental attributes might suggest alternate families
of implementations.

As a final aside, I should note that the overwhelming majority of cache
coherence transactions on modern systems have nothing to do with
"communication" and so are optimized for cache access by a single thread.
Going further, I claim that the overwhelming majority of cache coherence
transactions are not even necessary.
(1) In single-threaded programs, for example, cache coherence provides
correctness if the O/S decides to move a thread to a different physical
processor, (which could also be handled by various types of cache
invalidation) and also provides correctness if the O/S operations on behalf
of a process are executed on a different core. Without process migration,
100% of the accesses to private memory are snooped unnecessarily, while
in the case of O/S-to-task communication, there is a clearly visible
"transaction point" at which an alternate communication mechanism could
be employed.
(2) Even in multi-threaded programs operating on a shared space, the majority
of memory accesses are typically to either "thread private" memory or to
addresses within the shared memory space that are only accessed by a single
thread. Again, in the absence of process migration, accesses to the thread
private memory don't need to be snooped, while O/S interactions have a clearly
defined transaction point. Accesses to the shared memory space are often
primarily local (e.g., domain decomposition) with the whole space being
treated as "shared" simply to ease programmability. In many other cases, most
of the cache coherence traffic could be eliminated by using a "write-through"
policy on the shared data, or by selectively writing back dirty shared cache
lines at the end of a parallel section.
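
As a toy illustration of the claim, the sketch below takes an access trace and counts how many accesses go to cache lines that only one thread ever touches. The trace, the line names, and the assumption that every access is a candidate for snooping are made up for the example; it is not a model of any real protocol.

```python
from collections import defaultdict


def unnecessary_snoop_fraction(trace):
    """Fraction of accesses to lines touched by exactly one thread.

    `trace` is a list of (thread_id, cache_line) pairs. Accesses to a line
    that only one thread ever uses could skip coherence snooping entirely.
    """
    sharers = defaultdict(set)
    for thread, line in trace:
        sharers[line].add(thread)
    private = sum(1 for _, line in trace if len(sharers[line]) == 1)
    return private / len(trace)


# Two threads, each with private lines, plus one genuinely shared line "s".
trace = [
    (0, "a"), (0, "b"), (1, "c"), (1, "d"),
    (0, "s"), (1, "s"), (0, "a"), (1, "c"),
]
print(unnecessary_snoop_fraction(trace))
```

In this invented trace six of the eight accesses are to single-thread lines, so three quarters of the snoops would accomplish nothing.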

John D. McCalpin

Mar 22, 2013, 1:59:48 PM

to nm...@cam.ac.uk

On Wednesday, March 20, 2013 4:57:49 PM UTC-5, nm...@---.ac.uk wrote:
> In article <---@googlegroups.com>,

>
> John D. McCalpin <---> wrote:
>
> >The indirect nature of the implementation is exacerbated by the dual
> >addictions to transparent (uncontrollable) caches and the requirement
> >that memory references be free of side effects.
>
> And, God help us all, the appalling inconsistencies in the C and
> POSIX standards in this area.

I was referring to side effects at a much lower level. As an example,
current systems do not allow a processor to load a cache line and copy
all the data in that cache line into registers in a single atomic
operation. The hardware does guarantee that the cache line is moved
around the system atomically, but does not provide any mechanism to
allow the processor to read all the data before another agent invalidates
the cache line (or intervenes it and modifies it). With a modern
processor executing 2 16-Byte loads per cycle, copying all the data to
registers would require only 2 cycles, but you still can't guarantee
that the line will not be invalidated by another processor.

As a consequence, you cannot implement something as basic as a hardware
FIFO to deliver cache lines to processors --- the hardware is allowed to
load the address at any time, discard it at any time, reload it at any time,
etc, so you can't combine the side effect of shifting the (internal)
FIFO pointer with the load of the data. So you have to jump through
horrible hoops to figure out how to design a working FIFO using shared
memory, and a remarkable fraction of the published algorithms in this area
were later shown to be unsafe (i.e., giving incorrect results in some
corner cases).
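
The problem with combining a side effect and a load can be shown with a small simulation. `PopOnReadFifo` is a hypothetical device register whose every read advances the FIFO pointer, and `load_with_replay` imitates hardware that is free to load an address, discard the result, and load it again.

```python
from collections import deque


class PopOnReadFifo:
    """Hypothetical device register: every read pops the head entry."""

    def __init__(self, items):
        self._entries = deque(items)

    def read(self):
        return self._entries.popleft() if self._entries else None


def load_once(fifo):
    """A load that touches the location exactly one time."""
    return fifo.read()


def load_with_replay(fifo):
    """A load the hardware issues, discards, and then reissues."""
    fifo.read()         # first access: result thrown away, entry lost
    return fifo.read()  # replayed access: this is what the program sees


exact = PopOnReadFifo(["a", "b", "c", "d"])
seen_exact = [load_once(exact) for _ in range(2)]

replayed = PopOnReadFifo(["a", "b", "c", "d"])
seen_replayed = [load_with_replay(replayed) for _ in range(2)]

print(seen_exact, seen_replayed)
```

With exactly-once loads the program sees the entries in order; with a single replay per load, every other entry silently disappears.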

(Aside: The Xeon Phi is an exception to *part* of the comments above,
since the SIMD width is a full cache line, making aligned loads atomic.
The Xeon Phi still provides no way to guarantee that a memory location is
accessed exactly one time for one load instruction, so side effects are still
not allowable.)

>
> >So I will assert that no matter how many cores you put on a chip, there
> >is not going to be a "multi-core revolution" until "communication"
> >becomes a fundamental feature of the underlying architecture, and
> >implementations are created for which the performance of "communication"
> >is governed by direct physical limits rather than by the quirks of hack
> >upon hack upon hack layered onto the uniprocessor flat-memory architecture
> >that has changed little since the 1980's.
>
> I still think that this is a sociological/marketing and language
> issue far more than a hardware one. Despite being the oldest of
> extant languages, Fortran is in many ways the most advanced, and
> is much less crippled by this than pretty well every other widely
> used language. Which is not to say that it is immune.

What I am talking about is very much a hardware problem, though not
one that most users are aware of. Because the hardware does not provide
any direct support of communication and synchronization, these must be
implemented indirectly, with horrible efficiency. The very best
software implementations of fixed-length concurrent non-blocking
FIFOs, for example, take on the order of 1.5 microseconds (>4000 processor
cycles) for each "enqueue" or "dequeue" operation on a modern 2-socket
server -- even with zero contention! The standard Linux software for
pipes and FIFOs is often an order of magnitude slower than this.

In contrast, a hardware solution could easily deliver this functionality
with under 20 ns latency for accesses from the same chip and under 100 ns
latency for accesses from the other chip, with full pipelining enabling
throughput limited by wire speeds.

Ivan Godard

Mar 22, 2013, 2:01:53 PM

to

I think you are on to something here. It has long bugged me how much
waste there is in cache coherence, both in the design/implementation and
the execution. The only reason we are coherent is because there are
customer check-boxes that in effect mandate it. At least we did get rid
of the false-sharing problem.

In the embedded world you'd eschew shared memory entirely, and coherence
along with it, and treat everything as an i/o problem - vastly cleaner
and cheaper.

However, there still remain those grotty check-boxes.

nm...@cam.ac.uk

Mar 22, 2013, 3:06:42 PM

to

In article <89a6d751-2bf4-4848...@googlegroups.com>,

John D. McCalpin <mcca...@tacc.utexas.edu> wrote:
>>
>> >The indirect nature of the implementation is exacerbated by the dual
>> >addictions to transparent (uncontrollable) caches and the requirement
>> >that memory references be free of side effects.
>>
>> And, God help us all, the appalling inconsistencies in the C and
>> POSIX standards in this area.
>
>I was referring to side effects at a much lower level. As an example,
>current systems do not allow a processor to load a cache line and copy
>all the data in that cache line into registers in a single atomic
>operation. The hardware does guarantee that the cache line is moved
>around the system atomically, but does not provide any mechanism to
>allow the processor to read all the data before another agent invalidates
>the cache line (or intervenes it and modifies it). With a modern
>processor executing 2 16-Byte loads per cycle, copying all the data to
>registers would require only 2 cycles, but you still can't guarantee
>that the line will not be invalidated by another processor.

Ah, right. Yes, that's a lower-level mess. However, part of the
reason that the mess is intractable is that the standards I refer
to require that individual locations are independent as the program
sees them, which includes when using paradigms that would conflict
with the above. Yes, both could be allowed, but it would require
restrictions on how such data could be accessed at the C and POSIX
level. I don't see that as a technical problem, so much as a
sociopolitical one :-(

>What I am talking about is very much a hardware problem, though not
>one that most users are aware of. Because the hardware does not provide
>any direct support of communication and synchronization, these must be
>implemented indirectly, with horrible efficiency. The very best
>software implementations of fixed-length concurrent non-blocking
>FIFOs, for example, take on the order of 1.5 microseconds (>4000 processor
>cycles) for each "enqueue" or "dequeue" operation on a modern 2-socket
>server -- even with zero contention! The standard Linux software for
>pipes and FIFOs is often an order of magnitude slower than this.
>
>In contrast, a hardware solution could easily deliver this functionality
>with under 20 ns latency for accesses from the same chip and under 100 ns
>latency for accesses from the other chip, with full pipelining enabling
>throughput limited by wire speeds.

Oh, THAT one. Yes, indeed. And, in turn, that means that small
granularity parallelism is infeasible, so we end up with the ghastly
explicit thread models that we have today. Fixing that one in
hardware would be a MASSIVE improvement, and one I have been trying
to get vendors interested in for a couple of decades now.

Regards,
Nick Maclaren.

Paul A. Clayton

Mar 22, 2013, 4:21:35 PM

to

On Mar 22, 12:54 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
wrote:
[snip]

> As a final aside, I should note that the overwhelming majority of cache
> coherence transactions on modern systems have nothing to do with
> "communication" and so are optimized for cache access by a single thread.
> Going further, I claim that the overwhelming majority of cache coherence
> transactions are not even necessary.

Some thread local coherence traffic could be avoided
by marking Page Table Entries with a node number
(possibly with a unary encoded node size sharing the
space with the node number to support core, cluster,
neighborhood, chip, package, etc. distinctions).
When a remote node accesses the page, a migration,
expansion (to a larger node that includes all
sharers), or other operation would be performed
(such would include TLB updating/invalidating).
(Unfortunately, such binds coherence behavior to the
granularity of pages. It might be desirable to use
large translation pages with smaller tracking [and
ideally permission] pages.)

Recent POWER implementations have tracked locality
of use in cache block state to avoid unnecessarily
snooping across package boundaries. Adding four
bits per block would allow individually identifying
16 cores (using exclusive/modified state to indicate
node size of one), 8 core pairs, etc. (or a more
complex encoding could be used).
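
One possible reading of the four-bit scheme, sketched under stated assumptions: the count of trailing one bits gives the group size (a unary size code sharing the field with the group number), the remaining high bits give the group index, and the single-core case is left to the exclusive/modified state. The `encode` and `decode` names and this particular bit layout are hypothetical and not taken from any POWER documentation.

```python
def encode(first_core, size):
    """Encode an aligned group of 2, 4, 8, or 16 cores into four bits."""
    if size not in (2, 4, 8, 16):
        raise ValueError("size must be 2, 4, 8, or 16")
    if not 0 <= first_core < 16 or first_core % size != 0:
        raise ValueError("group must be aligned and within 16 cores")
    k = size.bit_length() - 2  # trailing ones: 2 -> 0, 4 -> 1, 8 -> 2, 16 -> 3
    index = first_core // size
    return (index << (k + 1)) | ((1 << k) - 1)


def decode(field):
    """Return (first_core, size) for a four-bit field."""
    if field == 0b1111:
        raise ValueError("0b1111 is unused in this sketch")
    k = 0
    while field & (1 << k):  # count trailing one bits
        k += 1
    size = 1 << (k + 1)
    index = field >> (k + 1)
    return index * size, size


for first_core, size in [(14, 2), (4, 4), (8, 8), (0, 16)]:
    field = encode(first_core, size)
    print(f"{first_core:2d}+{size:2d} -> {field:04b} -> {decode(field)}")
```

This layout yields 8 pairs, 4 quads, 2 halves, and 1 whole-chip group, which accounts for 15 of the 16 code points.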

(Mitch Alsup has also "ranted" here about unnecessary
coherence.)

A last use indicator could also be used to
support simple producer-consumer communication
where the microarchitecture would track the
alternate location. (While directives have some
advantages, even a predictive mechanism could be
helpful.)

A flush-to-home operation might be slightly useful
(with node ownership and shared last level cache,
flush-to-home could be just flush to the nearest
level of cache shared by all users).

Last use might be flush-to-last-user. (In the
case of a consumer's final use, one would also
want to communicate that no value or all zeros
should be communicated. This could be done by
idiom recognition of DCBA/DCBZ followed
immediately by the DCBFTLU.) This would not be
as efficient as an explicit queuing/buffering,
but might be more acceptable since it is a more
incremental change.

By the way, the FIFO latency you indicated in a
later post is frighteningly high!

nm...@cam.ac.uk

Mar 22, 2013, 5:11:34 PM

to

In article <30f74e1d-0d4d-41c0...@h7g2000yqi.googlegroups.com>,

Paul A. Clayton <paaron...@gmail.com> wrote:
>On Mar 22, 12:54 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
>wrote:
>[snip]
>> As a final aside, I should note that the overwhelming majority of cache
>> coherence transactions on modern systems have nothing to do with
>> "communication" and so are optimized for cache access by a single thread.
>> Going further, I claim that the overwhelming majority of cache coherence
>> transactions are not even necessary.
>

>(Mitch Alsup has also "ranted" here about unnecessary
>coherence.)

The interesting thing here is that Mitch, John, and I have fairly
different viewpoints, but we all have the same opinion of most uses
of cache coherence. I don't think that even Andy (Glew) dissents,
nor do we dissent from his view that it should be done properly
when it is done.

I was speaking to an Intel person recently, and his comments on
MIC were very interesting. Apparently, one very sound approach
is to use the Hyperthreads for one core and AVX as pseudo-vectors
(as in the Hitachi SR2201) with OpenMP or similar, and link
multiple cores with MPI. The relevance here is that MPI doesn't
need cache coherence - or, indeed, shared memory! Nor do Fortran
coarrays and (if you really must) UPC.

This is remarkably similar to the GPU design.

Personally, I would like to see John's very lightweight synchronisation
and communication methods, which could be used in languages to allow
small granularity parallelism to be exploited. At present, it's a
complete loser :-( That would need coherence, but it would be very
local in time and execution contexts. Tera MTA like parallelism
is very nice, but I don't see it flying outside HPC.

Regards,
Nick Maclaren.

John D. McCalpin

Mar 22, 2013, 5:32:31 PM

to

On Friday, March 22, 2013 3:21:35 PM UTC-5, Paul A. Clayton wrote:
> On Mar 22, 12:54 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
>
>

> Recent POWER implementations have tracked locality
> of use in cache block state to avoid unnecessarily
> snooping across package boundaries. Adding four
> bits per block would allow individually identifying
> 16 cores (using exclusive/modified state to indicate
> node size of one), 8 core pairs, etc. (or a more
> complex encoding could be used).

POWER does very scary things....

I developed a "local cache block flush" instruction for
POWER5 (US Patent 7,194,587), to assist with controlling
cache space (not for improving communication), but at the
same time someone else on the team dropped the "invalid
victim select" preference. So I could invalidate lines
that I did not need, but they would stay in the cache
until they became LRU anyway. I did not understand why
anyone would want to do this until I saw how POWER6 uses
invalid entries in the cache to hold information about data
sharing patterns. I am sure it has gotten much more bizarre
since then....

Interestingly, Xeon Phi implements a pair of "local cache
block flush" instructions (CLFLUSH1 & CLFLUSH2) -- one to
flush data from the L1 and the other to flush data from the L2.
Like my patent, they have only local scope. On Xeon Phi they
are useful to control not only cache space, but also the timing
of writebacks, since the caches are (effectively) single-ported.
These don't make any difference for communication, since the
memory latency and cache-to-cache intervention latency are
effectively the same on this processor.

> A last use indicator could also be used to
> support simple producer-consumer communication
> where the microarchitecture would track the
> alternate location. (While directives have some
> advantages, even a predictive mechanism could be
> helpful.)

Hack upon hack upon hack? Why not design the
architecture to allow control over the things that
are important (costly), rather than hacking on
decisions that have not made sense for >20 years?

> A flush-to-home operation might be slightly useful
> (with node ownership and shared last level cache,
> flush-to-home could be just flush to the nearest
> level of cache shared by all users).

I developed a "push for sharing" instruction while
at AMD (US Patent 8,099,557), but I don't think it
was ever implemented.

> By the way, the FIFO latency you indicated in a
> later post is frighteningly high!

I only posted the lowest available numbers because
people often don't believe the actual measurements.

A quick check with lmbench3 shows that "pipe" latency
on my 2-socket Xeon E5 (Sandy Bridge) systems is about
4.7 microseconds (almost 15000 cycles).

On a 40-core (4 socket) Xeon E7 system, a single instance
of a non-blocking concurrent FIFO was reported (reference 1)
to go from ~1 microsecond (2000 cycle) overhead under no load
to ~50 microseconds (100,000 cycles) overhead under a relatively
heavy load (20 enqueuers and 20 dequeuers, each spinning
for ~2.3 microseconds between operations). I should note
that these are *overheads*, not just latencies, because
the processors are only doing real work for 2.3 microseconds
per enqueue or dequeue operation, while the other ~48
microseconds are spent spinning on trying to access the queue.

For a different kind of "communication":
On my Xeon Phi SE10P coprocessors, an OpenMP barrier on
244 threads costs about 22 microseconds (24,000 cycles, or
about 23 Million double-precision floating-point operations).
This is just the overhead of the final barrier -- the overhead
of the initial "PARALLEL FOR" is higher, but I need to run
some more experiments before I decide the best way to report
the combined overhead.
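
For readers checking the cycle counts, the conversions can be reproduced as below. The clock rates are assumptions chosen to match the quoted figures (3.1 GHz for the Xeon E5, 1.1 GHz for the Xeon Phi), and the peak rate assumes 61 cores each retiring 16 double-precision operations per cycle; none of these values is stated in the post itself.

```python
def to_cycles(latency_us, clock_ghz):
    """Convert a latency in microseconds to clock cycles."""
    return latency_us * clock_ghz * 1000.0


# Pipe latency on the 2-socket Xeon E5, assuming a 3.1 GHz clock.
pipe_cycles = to_cycles(4.7, 3.1)

# OpenMP barrier on the Xeon Phi, assuming a 1.1 GHz clock.
barrier_cycles = to_cycles(22.0, 1.1)

# Work forgone during one barrier at the assumed peak rate:
# 61 cores * 16 DP operations per cycle * 1.1e9 cycles per second.
peak_flops_per_second = 61 * 16 * 1.1e9
barrier_flops = peak_flops_per_second * 22e-6

print(round(pipe_cycles), round(barrier_cycles), round(barrier_flops / 1e6, 1))
```

Under those assumptions the results land at roughly 14,600 cycles, 24,200 cycles, and 23.6 million operations, in line with the rounded numbers quoted above.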

Reference: (1) Christoph Kirsch, Hannes Payer, Harald Röck, and Ana
Sokolova. Performance, scalability, and semantics of concurrent
FIFO queues. Algorithms and Architectures for Parallel Processing,
pages 273–287, 2012.

Paul A. Clayton

Mar 22, 2013, 7:53:49 PM

to

On Mar 22, 5:32 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
wrote:

> On Friday, March 22, 2013 3:21:35 PM UTC-5, Paul A. Clayton wrote:
>> On Mar 22, 12:54 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
>
>> Recent POWER implementations have tracked locality
>> of use in cache block state to avoid unnecessarily
>> snooping across package boundaries. Adding four
>> bits per block would allow individually identifying
>> 16 cores (using exclusive/modified state to indicate
>> node size of one), 8 core pairs, etc. (or a more
>> complex encoding could be used).
>
> POWER does very scary things....

I kind of wish that the L3 "slice" placement and
replication mechanism was described in a freely
available document. It seems likely that some
interesting cleverness was used.

> I developed a "local cache block flush" instruction for
> POWER5 (US Patent 7,194,587), to assist with controlling
> cache space (not for improving communication), but at the
> same time someone else on the team dropped the "invalid
> victim select" preference. So I could invalidate lines
> that I did not need, but they would stay in the cache
> until they became LRU anyway. I did not understand why
> anyone would want to do this until I saw how POWER6 uses
> invalid entries in the cache to hold information about data
> sharing patterns. I am sure it has gotten much more bizarre
> since then....

Interesting. That is somewhat similar to Andy
Glew's idea of using storage space provided by
data compression to store metadata.

With inclusive L2/L3 and writeback L1/L2, dirty
L1/L2 cache blocks could--in theory--have their
L2/L3 storage be used for other purposes (though
such storage would tend to be relatively temporary).

> Interestingly, Xeon Phi implements a pair of "local cache
> block flush" instructions (CLFLUSH1 & CLFLUSH2) -- one to
> flush data from the L1 and the other to flush data from the L2.
> Like my patent, they have only local scope. On Xeon Phi they
> are useful to control not only cache space, but also the timing
> of writebacks, since the caches are (effectively) single-ported.
> These don't make any difference for communication, since the
> memory latency and cache-to-cache intervention latency are
> effectively the same on this processor.

With current DRAMs, timing writebacks to memory could
improve performance and power efficiency. (Even with
a smaller DRAM "page" size and the ability to have
multiple rows open per bank, such might still be
beneficial. I tend to rant about DRAM design--even
in my ignorance.)

>> A last use indicator could also be used to
>> support simple producer-consumer communication
>> where the microarchitecture would track the
>> alternate location. (While directives have some
>> advantages, even a predictive mechanism could be
>> helpful.)
>
> Hack upon hack upon hack?

Is that hack cubed or (hack to the power of hack) to
the power of hack? :-\

As someone mentally inclined toward microoptimization
and unfamiliar with so many aspects of the design
and use of computer systems, hacks are about all I
can offer. :-[

> Why not design the
> architecture to allow control over the things that
> are important (costly), rather than hacking on
> decisions that have not made sense for >20 years?

But, but, but . . . that would be different,
possibly even non-compatible! :-)

>> A flush-to-home operation might be slightly useful
>> (with node ownership and shared last level cache,
>> flush-to-home could be just flush to the nearest
>> level of cache shared by all users).
>
> I developed a "push for sharing" instruction while
> at AMD (US Patent 8,099,557), but I don't think it
> was ever implemented.

I have not seen such mentioned for Intel's or AMD's
implementations, but I do not track even x86 ISA
extensions particularly closely.

>> By the way, the FIFO latency you indicated in a
>> later post is frighteningly high!
>
> I only posted the lowest available numbers because
> people often don't believe the actual measurements.

[snip more communication latency data]

It is particularly sad because pipeline-style
parallelism is one of the easier forms to
understand.

Stephen Fuld

Mar 26, 2013, 1:21:05 PM

to

On 3/20/2013 11:50 AM, John D. McCalpin wrote:

> On Monday, March 11, 2013 1:39:07 PM UTC-5, Stephen Fuld wrote:
>> There has to be communication. The issue here is whether that
>> communication takes place via mapping some of the peripheral's function
>> into the memory space of the processor or via an instruction in the
>> processor. [...]
>
> At the risk of wading in to a stale conversation....
>
> I agree that there has to be "communication" -- I just wish I knew what that word meant.
>
> It is fascinating to me that the term "communication" does not exist in the architectural specification documents defining the major computer architectures in use today.

True. But there have been at least two architectures where
communication was defined as a major point in their design (there may
have been others). I refer to the transputer and the Exixsi systems.

Of course, neither of these were commercial successes. This could be
for some combination of one or more of the following reasons:

1. The basic idea is flawed and communication just isn't that central.
2. The implementation was flawed. It was a good idea, just implemented
poorly.
3. Non-technical, e.g. market forces.
4. Perhaps other reasons.

It is hard to assign "blame", but what can we learn from these
architectures in terms of implementing communication and why it hasn't
been successful?

Ivan Godard

Mar 26, 2013, 2:13:57 PM

to

On 3/26/2013 10:21 AM, Stephen Fuld wrote:
> On 3/20/2013 11:50 AM, John D. McCalpin wrote:
>> On Monday, March 11, 2013 1:39:07 PM UTC-5, Stephen Fuld wrote:
>>> There has to be communication. The issue here is whether that
>>> communication takes place via mapping some of the peripheral's function
>>> into the memory space of the processor or via an instruction in the
>>> processor. [...]
>>
>> At the risk of wading in to a stale conversation....
>>
>> I agree that there has to be "communication" -- I just wish I knew
>> what that word meant.
>>
>> It is fascinating to me that the term "communication" does not exist
>> in the architectural specification documents defining the major
>> computer architectures in use today.
>
>
> True. But there have been at least two architectures where
> communication was defined as a major point in their design (there may
> have been others) I refer to the transputer and the Exixsi systems.

Exixsi?? New to me (and to Google). Spelling?

Stephen Fuld

Mar 26, 2013, 2:57:58 PM

to

Sorry, Elxsi.

See

http://en.wikipedia.org/wiki/Elxsi

From that article, one of its hardware features:

Microcoded message system to communicate among software processes and
with I/O controllers and CPU microcode

There is also a manual on Bitsavers.

Michael S

Mar 26, 2013, 5:29:20 PM

to

Is a single-producer, multiple-consumer FIFO implemented as a
fixed-length (probably power of 2) circular buffer without overflow
protection similarly bad?

John D. McCalpin

Mar 26, 2013, 7:29:34 PM

to SF...@alumni.cmu.edu.invalid

On Tuesday, March 26, 2013 12:21:05 PM UTC-5, Stephen Fuld wrote:
> On 3/20/2013 11:50 AM, John D. McCalpin wrote:
>
> > On Monday, March 11, 2013 1:39:07 PM UTC-5, Stephen Fuld wrote:
> >> There has to be communication. The issue here is whether that
> >> communication takes place via mapping some of the peripheral's function
> >> into the memory space of the processor or via an instruction in the
> >> processor. [...]
>
> > At the risk of wading in to a stale conversation....
> >
> > I agree that there has to be "communication" -- I just wish I knew
> > what that word meant.

> > [...]

> > It is fascinating to me that the term "communication" does not
> > exist in the architectural specification documents defining the
> > major computer architectures in use today.
>
> True. But there have been at least two architectures where
> communication was defined as a major point in their design (there may
> have been others) I refer to the transputer and the Exixsi systems.
>
> Of course, neither of these were commercial successes. This could be
> for some combination of one or more of the following reasons:
>
> 1. The basic idea is flawed and communication just isn't that central.
> 2. The implementation was flawed. It was a good idea, just implemented
> poorly.
> 3. Non-technical, e.g. market forces.
> 4. Perhaps other reasons.
>
> It is hard to assign "blame", but what can we learn from these
> architectures in terms of implementing communication and why it hasn't
> been successful?
>

> - Stephen Fuld

Both of these systems were developed during the 1980's, and came to
market just as the RISC systems that dominated the marketplace during
the 1990's were also being introduced (e.g., MIPS).

The Transputer was clearly an "interesting" approach, but the performance
of a single transputer was not competitive. Although relatively easy
to program, any solution that requires parallel programming to get the
same performance as a (comparably priced) solution that can be programmed
serially was not going to fare well in the market.

The Elxsi computer had performance, but the ECL implementation had to be
painfully expensive. Wikipedia says that each processor took up three
boards -- ouch! While this was cheaper than mainframes, it proved
vulnerable to undercutting by high-volume CMOS as LSI and VLSI
technologies were developed.

Both approaches separated communication from memory access (though
Elxsi later added shared memory), with logical communication channels
that looked pretty much like FIFOs.

The Elxsi system implemented its communication over a shared bus, so
the underlying hardware was clearly not scalable. It is not clear
whether the other aspects of the system had fundamental scaling limits --
the company was clearly interested in "mid-range" systems with up to
10 CPUs.

It seems to me that the time is right for renewed interest in
fundamentally parallel architectures.
(1) Technology scaling is mostly limited to giving us more
cores per die. More performance must (mostly) come from
using more cores, so parallel overheads are becoming more
important.
(2) Market forces have caused a stall in server prices at
$2000-$3500 per socket. This may create an opportunity
for disruptive innovation at lower price points.
(3) Power scaling suggests that we need to go to even larger
numbers of even slower processors to continue improving
energy efficiency. Unless parallelism is extraordinarily
efficient, this will be a showstopper.

John D. McCalpin

Mar 26, 2013, 7:43:50 PM

to

I have not looked at this case, but I imagine that dropping the
overflow protection would significantly reduce the breadth of
applicability?

The concurrent FIFO implementations handle the head and tail
separately (and with similar complexity), so limiting an
implementation to a single producer would only help on 1/2 of
the work. The multiple consumers would still have to dequeue
items atomically, which requires multiple instructions and many
coherence transactions.
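
To make the consumer-side cost concrete, here is a minimal C++ sketch of a
single-producer, multiple-consumer ring with a power-of-2 capacity. It is an
illustration under stated assumptions, not any of the measured
implementations: the name SpmcRing, the 64-byte alignment, and the choice of
64-bit payloads are all hypothetical. The sketch keeps the overflow check;
dropping it would remove the producer's read of head_, while every consumer
would still need the compare-and-swap on head_ shown in try_pop.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical single-producer / multiple-consumer bounded FIFO.
// head_ and tail_ are free-running 64-bit counters; the slot index is the
// counter masked by (capacity - 1), which is why a power of 2 is required.
template <std::size_t CapacityPow2>
class SpmcRing {
    static_assert(CapacityPow2 != 0 && (CapacityPow2 & (CapacityPow2 - 1)) == 0,
                  "capacity must be a power of two");
    static constexpr std::uint64_t kMask = CapacityPow2 - 1;

public:
    // Producer side: must be called from exactly one thread.
    bool try_push(std::uint64_t value) {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        assert(tail - head <= CapacityPow2);  // occupancy never exceeds capacity
        if (tail - head == CapacityPow2) {
            return false;  // full: this is the overflow protection
        }
        slots_[tail & kMask].store(value, std::memory_order_relaxed);
        tail_.store(tail + 1, std::memory_order_release);  // publish the item
        return true;
    }

    // Consumer side: may be called from any number of threads.
    std::optional<std::uint64_t> try_pop() {
        std::uint64_t head = head_.load(std::memory_order_acquire);
        for (;;) {
            const std::uint64_t tail = tail_.load(std::memory_order_acquire);
            if (head == tail) {
                return std::nullopt;  // empty
            }
            // Read the candidate item first, then try to claim it.
            const std::uint64_t value =
                slots_[head & kMask].load(std::memory_order_relaxed);
            if (head_.compare_exchange_weak(head, head + 1,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
                return value;  // this consumer owns item number `head`
            }
            // The claim failed (another consumer advanced head_, or the weak
            // CAS failed spuriously). `head` now holds the current value.
        }
    }

private:
    alignas(64) std::atomic<std::uint64_t> head_{0};  // shared by all consumers
    alignas(64) std::atomic<std::uint64_t> tail_{0};  // written by the producer
    std::array<std::atomic<std::uint64_t>, CapacityPow2> slots_{};
};
```

Even in this small sketch, every successful dequeue needs exclusive ownership
of the cache line holding head_, so that line has to migrate between the
consumers' caches no matter how simple the producer side is.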

The non-blocking concurrent FIFO implementations that I have studied
use sneaky tricks to ensure that a process can only fail to dequeue
(or enqueue) an item if another thread has succeeded. This is a good
property to have, but does not seem to be enough to ensure that these
implementations perform well under load -- they seem to slow down even
though it is guaranteed that one of the competing processes will succeed.
I suspect that the performance degradation is related to excessive cache
line "bouncing", but am not aware of any studies of this behavior at the
level of the coherence protocol using a realistic microarchitecture
simulation.

Stephen Fuld

Mar 26, 2013, 8:06:16 PM

to

Also, didn't the Transputer require programming in Occam, with no
support for the more popular programming languages? That would be another
hurdle for popular acceptance.

> The Elxsi computer had performance, but the ECL implementation had to be
> painfully expensive. Wikipedia says that each processor took up three
> boards -- ouch! While this was cheaper than mainframes, it proved
> vulnerable to undercutting by high-volume CMOS as LSI and VLSI
> technologies were developed.

But if they had been successful, it is easy to see them transforming the
system to CMOS, probably first as a set of chips per CPU, later as a
single chip per CPU. Their CPU didn't seem so complex that it couldn't
evolve to newer processes. After all, other mainframe architectures did.

> Both approaches separated communication from memory access (though
> Elxsi later added shared memory), with logical communication channels
> that looked pretty much like FIFOs.

Yes. Shared memory is certainly a very nice mechanism for some
applications, e.g. a cache for database pages. You don't want to be
moving large chunks of data through the message system for local accesses.

> The Elxsi system implemented its communication over a shared bus, so
> the underlying hardware was clearly not scalable.

Yes, but at the time, that was probably sufficient. I don't know of
anything that would prevent them adapting to some form of routed serial
links for increased scalability.

> It is not clear
> whether the other aspects of the system had fundamental scaling limits --
> the company was clearly interested in "mid-range" systems with up to
> 10 CPUs.

Yes.

> It seems to me that the time is right for renewed interest in
> fundamentally parallel architectures.
> (1) Technology scaling is mostly limited to giving us more
> cores per die. More performance must (mostly) come from
> using more cores, so parallel overheads are becoming more
> important.
> (2) Market forces have caused a stall in server prices at
> $2000-$3500 per socket. This may create an opportunity
> for disruptive innovation at lower price points.
> (3) Power scaling suggests that we need to go to even larger
> numbers of even slower processors to continue improving
> energy efficiency. Unless parallelism is extraordinarily
> efficient, this will be a showstopper.

So the question I have been driving at is, if someone were to implement
essentially the Elxsi message based system in a CPU implemented in a
modern CMOS process, with routed links between processors, hopefully
software transparent across multiple CPUs per chip, multiple chips per
board, and even multiple boards, would it be successful?

Ivan Godard

Mar 26, 2013, 8:33:03 PM

to

Not so, but the single-thread improvements are still subject to
parallelization for still higher gain, so your point is still valid.

> (2) Market forces have caused a stall in server prices at
> $2000-$3500 per socket. This may create an opportunity
> for disruptive innovation at lower price points.

We may hope :-)

> (3) Power scaling suggests that we need to go to even larger
> numbers of even slower processors to continue improving
> energy efficiency. Unless parallelism is extraordinarily
> efficient, this will be a showstopper.

Massive numbers of ants is a viable strategy only for some problems,
although a commercially large class. There remain many problems for
which there are no known parallel solutions, and reasons to believe none
exist. Perhaps more importantly, even the parallelizable programs have
portions that are serial, and Amdahl's Law says those serial parts will
become the bottleneck even with infinite numbers of cores.
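
For reference, the usual statement of Amdahl's Law, written as a short worked
equation. The symbols are assumptions introduced here: p is the fraction of
the run time that can be parallelized and N is the number of cores. The value
p = 0.95 is only an illustrative number.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% Illustrative numbers, assuming p = 0.95:
S(100) = \frac{1}{0.05 + 0.0095} \approx 16.8,
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{0.05} = 20
```

With 5% serial work, 100 cores already deliver most of the attainable
speedup, and no number of additional cores gets past 20x.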

Parallel architectures are interesting, agreed. There does seem to be a
bifurcation of designs though: grid machines (Illiac IV descendants like
GPUs) and connection machines (like the transputer). Reflects the basic
SIMD vs. MIMD divide.

Nick would say (I think) that nothing can be done until languages are
fixed. I'm not sure.

Ivan Godard

Mar 26, 2013, 8:35:27 PM

to

Doesn't an optimistic concurrency primitive in hardware solve all this?

Ivan

Ivan Godard

Mar 26, 2013, 8:37:10 PM

to

On 3/26/2013 5:06 PM, Stephen Fuld wrote:
<snip>

> So the question I have been driving at is, if someone were to implement
> essentially the Elxsi message based system in a CPU implemented in a
> modern CMOS process, with routed links between processors, hopefully
> software transparent across multiple CPUs per chip, multiple chips per
> board, and even multiple boards, would it be successful?

Sony Cell?

Paul A. Clayton

Mar 26, 2013, 10:01:25 PM

to

On Mar 26, 7:43 pm, "John D. McCalpin" <mccal...@tacc.utexas.edu>
wrote:
[snip]

> The non-blocking concurrent FIFO implementations that I have studied
> use sneaky tricks to ensure that a process can only fail to dequeue
> (or enqueue) an item if another thread has succeeded. This is a good
> property to have, but does not seem to be enough to ensure that these
> implementations perform well under load -- they seem to slow down even
> though it is guaranteed that one of the competing processes will succeed.
> I suspect that the performance degradation is related to excessive cache
> line "bouncing", but am not aware of any studies of this behavior at the
> level of the coherence protocol using a realistic microarchitecture
> simulation.

Mitch Alsup's version of ASF provided something
like a conflict number (unlike the ASF specification
that AMD eventually published), so a second attempt
could use an appropriate offset to index a queue.
Such might not solve the problem, but it seems that
it would help. (This was mentioned here on comp.arch
a while back.)

nm...@cam.ac.uk

Mar 27, 2013, 4:59:55 AM

to

In article <05d1e4b0-34e9-4790...@googlegroups.com>,

John D. McCalpin <mcca...@tacc.utexas.edu> wrote:
>

>Both of these systems were developed during the 1980's, and came to
>market just as the RISC systems that dominated the marketplace during
>the 1990's were also being introduced (e.g., MIPS).

And others.

>The Transputer was clearly an "interesting" approach, but the performance
>of a single transputer was not competitive. Although relatively easy
>to program, any solution that requires parallel programming to get the
>same performance as a (comparably priced) solution that can be programmed
>serially was not going to fare well in the market.

Yes. To some extent, that also killed FPS, which did well for a
while. That technology has resurfaced as GPUs, of course.

>The Elxsi computer had performance, but the ECL implementation had to be
>painfully expensive. Wikipedia says that each processor took up three
>boards -- ouch! While this was cheaper than mainframes, it proved
>vulnerable to undercutting by high-volume CMOS as LSI and VLSI
>technologies were developed.

And, later, that killed the Tera MTA. Only that was GaAs, if I recall.

>It seems to me that the time is right for renewed interest in
>fundamentally parallel architectures.
>(1) Technology scaling is mostly limited to giving us more
> cores per die. More performance must (mostly) come from
> using more cores, so parallel overheads are becoming more
> important.
>(2) Market forces have caused a stall in server prices at
> $2000-$3500 per socket. This may create an opportunity
> for disruptive innovation at lower price points.
>(3) Power scaling suggests that we need to go to even larger
> numbers of even slower processors to continue improving
> energy efficiency. Unless parallelism is extraordinarily
> efficient, this will be a showstopper.

As most people will know, I am another person who has been banging
on about this for some time, and use some of those points in my
courses.

So far, outside HPC, almost all 'parallelism' has been entirely
embarrassingly parallel tasks, and is usually just running
independent processes in parallel. Web servers and similar may
use separate threads, but those are pretty similar and almost
all of their shared data access is for read-only data. The
threads are provided with parameters to run and return results,
and not much else.

This hides the fact that the languages and parallel methodologies
they are using are completely broken, because the number of race
conditions is small and the timescales between unsynchronised
parallel access very long. But I still see a lot of failures
in GUIs and on the Web due to incorrect parallel methodology,
though few people recognise them as such.

In summary, I fully agree about the points, but would add the
requirement that it is also critical to move to fit-for-purpose
parallel methodologies. MPI is OK, but I agree with the people
who say that it is the assembler of parallel methodologies, and
it isn't really suitable for small granularity parallelism.

Regards,
Nick Maclaren.

Terje Mathisen

Mar 27, 2013, 5:10:05 AM

to

If you have a single producer, then it seems to me that this is also the
point where you handle scaling:

I.e. instead of having a single FIFO, you set up M FIFOs for the N
consumers, M<=N.

This increases producer complexity _very_ little, particularly when you
can use round-robin scheduling between the M FIFOs, while you can make
sure that N/M never becomes large enough to cause significant access
collisions between the consumers.
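
A minimal C++ sketch of that layout, under stated assumptions: the name
ShardedFifo, the static mapping of consumer i to lane i % M, and the
mutex-protected std::deque used for each lane are all hypothetical choices
made to keep the example short. Any per-lane FIFO, including a lock-free one,
could stand in for the deque.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical sketch: one producer spreads items round-robin over M lanes,
// and each of the N consumers is statically assigned to lane (id % M).
template <std::size_t M>
class ShardedFifo {
    static_assert(M > 0, "need at least one lane");

public:
    // Producer side (single thread): choose the next lane in round-robin order.
    void push(std::uint64_t value) {
        Lane& lane = lanes_[next_lane_];
        next_lane_ = (next_lane_ + 1) % M;
        std::lock_guard<std::mutex> guard(lane.lock);
        lane.items.push_back(value);
    }

    // Consumer side: contends only with the producer and with the other
    // consumers mapped to the same lane, roughly N/M of them.
    std::optional<std::uint64_t> pop(std::size_t consumer_id) {
        Lane& lane = lanes_[consumer_id % M];
        std::lock_guard<std::mutex> guard(lane.lock);
        if (lane.items.empty()) {
            return std::nullopt;
        }
        const std::uint64_t value = lane.items.front();
        lane.items.pop_front();
        return value;
    }

private:
    // Each lane sits on its own cache line so lanes do not falsely share.
    struct alignas(64) Lane {
        std::mutex lock;
        std::deque<std::uint64_t> items;
    };

    std::array<Lane, M> lanes_{};
    std::size_t next_lane_ = 0;  // touched only by the producer
};
```

The producer's extra work is one index increment per item. The cost of the
static mapping is that an item waits for the consumers of its own lane even
when consumers on other lanes are idle, and global FIFO order holds only
within a lane.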

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Brian Drummond

Mar 27, 2013, 5:42:18 AM

to

I'm not so sure about that - for 1985 its performance was pretty good. I
briefly studied it in 1986 or early 87 and on my (paper only) benchmarks
it looked several times faster than the locally available RISC
alternative (ARM-1). Granted, only a core subset of instructions were
single cycle, but that cycle was at 20MHz when ARM had 4 and 6 MHz and
seemed reluctant to promise 8 MHz...

However, it was also several times the price!

I regrettably lost my early ARM documentation, but perusing the VLSI
Technology data manual for the ARM-2 (c)1987 I see only 10 and 12 MHz.
That might have given the T414 a run for its money.

Transputer was quite advanced in some ways : can anyone recall another
contemporary micro with clock multiplier to divorce core speed from board
speed? The next one I remember (much later) was the 486DX2.

By 1990 I would agree, whatever advantage they did have evaporated fast,
and I never actually saw a T9000.

- Brian

nm...@cam.ac.uk

Mar 27, 2013, 5:52:45 AM

to

In article <kitekj$jmb$1...@dont-email.me>,

Ivan Godard <iv...@ootbcomp.com> wrote:
>>
>> It seems to me that the time is right for renewed interest in
>> fundamentally parallel architectures.
>> (1) Technology scaling is mostly limited to giving us more
>> cores per die. More performance must (mostly) come from
>> using more cores, so parallel overheads are becoming more
>> important.
>
>Not so, but the single-thread improvements are still subject to
>parallelization for still higher gain, so your point is still valid.

WHAT single-thread improvements? That's a serious question.
The problem about increasing table and cache sizes, reordering
of serial code etc. is that the gains are typically logarithmic
and we have already reached the point of poor returns.

When I write serial code, it runs negligibly faster on modern CPUs
than it did on the comparable ones of a decade earlier. Oh, yes,
I can run BIGGER problems, but that's the main difference.

>> (2) Market forces have caused a stall in server prices at
>> $2000-$3500 per socket. This may create an opportunity
>> for disruptive innovation at lower price points.
>
>We may hope :-)

Quite.

>> (3) Power scaling suggests that we need to go to even larger
>> numbers of even slower processors to continue improving
>> energy efficiency. Unless parallelism is extraordinarily
>> efficient, this will be a showstopper.
>
>Massive numbers of ants is a viable strategy only for some problems,
>although a commercially large class. There remain many problems for
>which there are no known parallel solutions, and reasons to believe none
>exist. Perhaps more importantly, even the parallelizable programs have
>portions that are serial, and Amdahl's Law says those serial parts will
>become the bottleneck even with infinite numbers of cores.

Grrk. Amdahl's law is only a rule of thumb, and doesn't match real
problems well - there is no hard boundary between serialisable and
parallelisable. You are perfectly correct, but the simple fact is
that faster serial CPUs are not obtainable using current technology
and there has been essentially damn-all improvement in a decade.
When faced with a choice between a hard and an insoluble problem,
engineers choose the former.

The benchmarketers get their massive improvements by using problems
that are limited by the sizes of the previous generation but not by
those of the current ones. That's polemic, not science.

>Nick would say (I think) that nothing can be done until languages are
>fixed. I'm not sure.

You are right. But please note that I am not saying that it is
just, or even primarily, languages. Methodologies are more important,
and hardware and operating system improvements are also critical.
If the improvements that John, I and others want were delivered,
they would open up new possibilities for parallel methodologies.
And that I regard as one key to the problem.

We have known ways of parallelising largely serial code that cannot
be delivered precisely because the current hardware and operating
systems provide such dire support for parallelisation. And I am
not talking new ones - some were explored in the 1960s and 1970s.

Regards,
Nick Maclaren.