
Infiniband - practicalities for small clusters


Paul

May 2, 2004, 10:43:34 AM
Since we're debating the rise (and fall?) of Infiniband in another thread, I
wonder how people actually decide whether to deploy IB and if so which IB
product/supplier to use in the lower end cluster market (accepting that
large parts of this market don't actually need infiniband, but those that do
may not be able to afford extensive evaluations.)

Say I was looking for low-cost hardware (say 50+ nodes) and spending a
third/half of the budget on the fabric. It means I probably don't have money
for extensive bakeoffs or investigating if performance drops as I get more
simultaneous connections.

I could pick Mellanox, Voltaire, Infinicon, Topspin and a few more esoteric
solutions with IB behind the scenes.

I could just listen to the 'We all use the same silicon' argument and
pick on price, but I don't think silicon is the issue (and anyway, maybe
they don't all use the same silicon).


There seems to be very little public material about who does what best, etc.
PR 'success stories' are all very well, but I prefer science. Places like
Ohio State et al. help in some respects (throughput, latency, etc.) but where
do I find out about 'up-time', who compares subnet management functions, and
who's looked at storage across an IB fabric?

Until there are public, credible results, I fear FUD-rakers can still
persist. IB makers haven't helped their cause by promising much and
delivering little. (Personally, I do think things are getting better now.)

What do people committing to IB with a fair bit less than a million dollars
to spend do to allay any fears?

In a year (my guess), OpenIB will make the 'world good' and everything will
work everywhere (I can hope! It certainly is a forward step for users IMHO).
Right now I struggle with knowing who is a marketing spinmeister, who has a
reliable, high-performance product and who has a promising but 'not quite
there yet' solution.

I've looked at who has won larger orders, as these people will probably have
done bake-offs etc. to get to their solutions, but any suggestions would be
appreciated.

I certainly don't believe the Myrinet people who seem to spread FUD every
time I mention IB in their presence.

Many of the larger blade/pizza box solution providers seem to have a roadmap
with availability just over the next hill, so I can't fall back on their
assurances that they'll do the right thing by me. If I want to do it now,
I've got to do the legwork, it seems.

Maybe the switch manufacturers should publish the technicalities behind some
of their success stories? Or do some 32-64 node tests for the lower-end
market to build some confidence.

I'd like to know that I'm not going to get issues running 12 hrs a day,
every day.
I'd like to confirm where performance is good and where it isn't. e.g. "Yes,
we have a zero copy implementation of SDP."... Time passes "Sorry you
didn't understand... we DO do zero copy in special case number 37 which
isn't actually documented yet, so we can put zero copy on the glossy
posters, but yes, sorry, you do have to put up with two memory copies in
your particular situation. Sadly your performance might suffer. We'll be
fixing it real soon now."

Advice appreciated.

Let's face it, the sweet spot for IB uptake is likely to be in smaller
clusters and in distributed storage. If I hear another 512-node success
story I'll just say 'so what'.


Personal views expressed.

Nick Maclaren

May 2, 2004, 4:00:30 PM
In article <40950902$0$25318$cc9e...@news-text.dial.pipex.com>,

Paul <paulnospamlette...@hotmail.com> wrote:
>Since we're debating the rise (and fall?) of Infiniband in another thread, I
>wonder how people actually decide whether to deploy IB and if so which IB
>product/supplier to use in the lower end cluster market (accepting that
>large parts of this market don't actually need infiniband, but those that do
>may not be able to afford extensive evaluations.)

At present, if you have the time to spend beta-testing the hardware
and alpha-testing the software. If your last condition applies, don't
even think about it.

>There seems to be very little public material about who does what, best etc.
>PR 'success stories' are all very well, but I prefer science. Places like
>Ohio State et al help in some respects (throughput, latency etc) but where
>do I find out about 'up-time', who compares subnet management functions,
>who's looked at storage across an IB fabric?

Because there is damn little in use in production systems so far, that's
why. See above ....

>In a year (my guess), OpenIB will make the 'world good' and everything will
>work everywhere (I can hope! It certainly is a forward step for users IMHO).
>Right now I struggle with knowing who is a marketing spinmeister, who has
>reliable, high performance product and who has a promising but 'not quite
>there yet' solution.

NO chance! At BEST, there will be a few of the most popular scenarios
that work 'everywhere' in a couple of years from now, but there isn't
a hope in hell of "everything will work everywhere" in under a decade.

Last week, I was at a presentation where one of the Big Vendors said
that their initial InfiniBand card and software would support about
a dozen commands (out of the hundred that the InfiniBand specification
describes). Look at SCSI - who supports its CPU<->CPU model? After
how long?

InfiniBand may be coming, but it assuredly won't be here, in toto, in
a year. That would imply a squadron of fairy godmothers arriving on
their flying pigs to wave wands and create the software support out of
string and sealing wax. Even assuming the lesser miracle that some
vendor has a card which supports all of InfiniBand.


Regards,
Nick Maclaren.

Billy

May 7, 2004, 2:44:10 AM
nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<c73k0u$ea7$1...@pegasus.csx.cam.ac.uk>...

> Last week, I was at a presentation where one of the Big Vendors said
> that their initial InfiniBand card and software would support about
> a dozen commands (out of the hundred that the InfiniBand specification
> describes). Look at SCSI - who supports its CPU<->CPU model? After
> how long?

> InfiniBand may be coming, but it assuredly won't be here, in toto, in
> a year. That would imply a squadron of fairy godmothers arriving on
> their flying pigs to wave wands and create the software support out
> string and sealing wax. Even assuming the lesser miracle that some
> vendor has a card which supports all of InfiniBand.

I'm not going to respond to most of what you said, but this assertion
that there are not fully functional IB adapters is complete nonsense.
Mellanox has been shipping their PCI-X HCA for nearly two years, and
it supports nearly the entire IB spec (they had a chip bug, so reliable
datagram transport doesn't work, but other than that I don't know of
any missing features).

In fact Mellanox already has their PCI Express 8X adapter working, which
should support the entire IB spec including the upcoming verbs extensions
(barring of course undiscovered chip bugs).

Just about all of the useful IB features are supported by the OpenIB
software already.

Nick Maclaren

May 7, 2004, 3:28:37 AM

In article <9436f2b3.04050...@posting.google.com>,

Yeah. When I was young and innocent, I might have believed that. It
also might even have been true, with the much simpler specifications
and higher standards of 30 years back. I haven't seen ONE single such
promise turn into actuality in a decade - so, while your claim might
be right, I expect to see such complete solutions delivered by flying
pig. Note that, in the above, I am talking primarily about the
(multiple) software layers.

The problems are not in the basic features, but in their interactions,
corner cases, obscure, rare and non-repeatable errors and so on.
Unless you have a mathematically precise specification (and InfiniBand
is not one such), claiming an absence of problems in those areas is
at best unreasonably optimistic. The debugging of such areas is done
by the customers stretching the limits of the software, which then
starts relying on the properties of the hardware (often where the
specification is unclear, ambiguous or silent). And that does not
happen until such hardware is in widespread use.

Furthermore, I don't think that you understand the software issues
if you think that the OpenIB layer can even POTENTIALLY tackle all
of the software problems. I have got utterly sick of the number of
vendors that claim that a component 'works' because it follows its
specification, despite the fact that using several of the features
that we need causes OTHER components (from the same vendor) to crash
or misbehave. And, of course, the OTHER component works because it
follows ITS specification, and you don't see the problems if you
aren't using the first component!


Regards,
Nick Maclaren.

Billy

May 7, 2004, 9:58:18 PM
nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<c7fdr5$ahj$1...@pegasus.csx.cam.ac.uk>...

> Yeah. When I was young and innocent, I might have believed that. It
> also might even have been true, with the much simpler specifications
> and higher standards of 30 years back. I haven't seen ONE single such
> promise turn into actuality in a decade - so, while your claim might
> be right, I expect to see such complete solutions delivered by flying
> pig. Note that, in the above, I am talking primarily about the
> (multiple) software layers.

I don't know why you single out InfiniBand for your cynicism. I mean,
of course you're right: all chips have errata, all software has bugs,
and all computers crash. Nevertheless, people seem to find technology
useful.

Tens of millions of dollars of IB gear has been sold and is being used
in production (not much in the grand scheme of things, but enough that
it's obviously more than a development lab curiosity). Of course the
technology is less mature than (say) ethernet, so there will be more bugs
and headaches, but lots of people are still finding enough value in IB to
use it. In the same way, people find ethernet useful even though their
applications (still to this day) expose bugs in the hardware or software.

Nick Maclaren

May 8, 2004, 4:57:14 AM
In article <9436f2b3.04050...@posting.google.com>,

Billy <bi...@mailinator.com> wrote:
>
>I don't know why you single out InfiniBand for your cynicism. I mean,
>of course you're right: all chips have errata, all software has bugs,
>and all computers crash. Nevertheless, people seem to find technology
>useful.

You haven't even looked at the InfiniBand specification, have you?

One of the main causes of the sort of errors I am talking about is
complication, and the severity is highly non-linear in that. At
2,500 pages of specification, InfiniBand is many times larger than
SCSI and Ethernet combined. It is probably half-a-dozen times more
difficult to make an InfiniBand card than either of those, and could
be dozens of times harder to support it in the operating system.

You might like to ponder why the specification was published in
October 2000, to a chorus of claims that initial cards would be
openly available by mid-2001 and that complete system support would
be available (to real customers, and NOT just developers) by mid-2002,
and things have slipped by 2 years so far ....

And I have explained ad nauseam why allowing arbitrary access to
DMA by applications brings in massive operating system problems that
have been all but forgotten about for 30 years. As some of us in
HPC can witness, that doesn't mean that the problems have gone away;
oh, no; it is merely that the use is currently limited to a few
areas where the problems are manageable. One of InfiniBand's MAIN
features is to tear open that cosy little arrangement ....


Regards,
Nick Maclaren.

Billy

May 8, 2004, 2:02:57 PM
nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<c7i7da$rpq$1...@pegasus.csx.cam.ac.uk>...

> You haven't even looked at the InfiniBand specification, have you?

I'm sure I've spent more time reading it than you have :)

> One of the main causes of the sort of errors I am talking about is
> complication, and the severity is highly non-linear in that. At
> 2,500 pages of specification, InfiniBand is many times larger than
> SCSI and Ethernet combined. It is probably half-a-dozen times more
> difficult to make an InfiniBand card than either of those, and could
> be dozens of times harder to support it in the operating system.

Sure, it's a long spec, but it is not quite as complicated as it seems.
For example, nearly everyone can just ignore the booting annex, the
stuff on IB-compliant managed chassis, IB form factor modules (including
page after page on LED blink codes), IB routers, etc.

If you compare the IB spec to the ethernet spec plus all the RFCs on
IP, TCP, SNMP, etc. I don't think the IB spec is many times larger.

You're absolutely right that IB is complicated and will take time to mature.
However, there is no need to wait 10 years for flying pigs. As I've said,
people are already doing useful work with IB. They may run into rough spots
and may not be able to use all the most esoteric features, but there's no
reason to wait for every detail to be perfect if it's useful to them now.

> You might like to ponder why the specification was published in
> October 2000, to a chorus of claims that initial cards would be
> openly available by mid-2001 and that complete system support would
> be available (to real customers, and NOT just developers) by mid-2002,
> and things have slipped by 2 years so far ....

Yes, the technology was overhyped. What technology was not overhyped in its
early days? Of course, we all remember that Intel tried to push IB as a PCI
replacement, which was all wrong. However, that doesn't change the fact that
IB has managed to evolve into quite a useful interconnect.

> And I have explained ad nauseam why allowing arbitrary access to
> DMA by applications brings in massive operating system problems that
> have been all but forgotten about for 30 years. As some of us in
> HPC can witness, that doesn't mean that the problems have gone away;
> oh, no; it is merely that the use is currently limited to a few
> areas where the problems are manageable. One of InfiniBand's MAIN
> features is to tear open that cosy little arrangement ....

Yes, I've read your gobbledygook, and I'm not going to debate that. It all
seems to be stating the obvious -- that IB won't work well for all apps on
all architectures -- as if it were a fundamental and important insight.

You can continue your pontificating on comp.arch. The large and growing
number of IB users will continue to do real work.

del cecchi

May 8, 2004, 5:19:48 PM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:c7i7da$rpq$1...@pegasus.csx.cam.ac.uk...

You ought to at least deduct the 800 pages of Volume 2 from your
complication page count. Ethernet throws that into another document or
3.

del cecchi


Nick Maclaren

May 9, 2004, 5:22:07 AM
In article <2g518pF...@uni-berlin.de>,

del cecchi <dcecchi...@att.net> wrote:
>
>You ought to at least deduct the 800 pages of Volume 2 from your
>complication page count. Ethernet throws that into another document or
>3.

Or include the corresponding Ethernet documents, yes. I was attempting
to do so. I haven't looked at Ethernet either closely or recently,
so it may have suffered from bloat. I should be surprised if it were
THAT much, though.


Regards,
Nick Maclaren.

Nick Maclaren

May 9, 2004, 6:12:39 AM
>nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<c7i7da$rpq$1...@pegasus.csx.cam.ac.uk>...
>> You haven't even looked at the InfiniBand specification, have you?
>
>I'm sure I've spent more time reading it than you have :)

In which case, please stop posting propaganda.

>Sure, it's a long spec, but it is not quite as complicated as it seems.
>For example, nearly everyone can just ignore the booting annex, the
>stuff on IB-compliant managed chassis, IB form factor modules (including
>page after page on LED blink codes), IB routers, etc.

And the same can be done for SCSI, which makes it really quite a
short specification. That is not the point - you were claiming
COMPLETE implementations.

>If you compare the IB spec to the ethernet spec plus all the RFCs on
>IP, TCP, SNMP, etc. I don't think the IB spec is many times larger.

That is an obvious straw man, on two counts. Firstly, RFCs are Requests
For Comment and are primarily working documents - only the accepted
'standards' should be counted. Secondly, they are at a higher level
and InfiniBand is intended to support them as well.

>Yes, the technology was overhyped. What technology was not overhyped in its
>early days? Of course, we all remember that Intel tried to push IB as a PCI
>replacement, which was all wrong. However, that doesn't change the fact that
>IB has managed to evolve into quite a useful interconnect.

Speaking as the technical lead of an organisation that does some quite
significant procurements, I am in a position to respond to that with
an educated Bronx cheer. You have your tenses wrong. Every InfiniBand-
favouring vendor sales team has started off peddling your wares, and
has backpedalled like fury when I have started demanding hard data on
deliverables. It seems that:

The hardware is a mixture of not yet available, beta test and first
production shipments. Most of the field engineers have seen some, but
only some have been trained for it, and none have worked on it for real
(in real customer sites, being used for real, natch).

The drivers and networking are mixtures of not yet available, alpha
test, beta test and "I need to check when the release date is". Some
are clearly "in use" but it is unclear how (or whether) they are
supported.

The higher levels are a mixture of planned for future releases,
not yet available, alpha test and "if you use the InfiniBand driver
from the off-the-shelf MPICH and TCP/IP source under Linux, it Will
All Work". I am not quite as green as I am cabbage-looking, thank
you very much.

The integration testing, validation and commitment (at all levels
of the software) had clearly not even been scheduled, let alone
started. The sales teams wriggled like eels when I tried to ask them
"Can you find out a combination of XXX (either MPI or TCP/IP), operating
system, drivers, InfiniBand cards and switches that you will support in
combination, and a schedule for when you will do it?"

Yes, I really do mean that they knew things were so far off being ready
that they couldn't even find out a SCHEDULE for the support of a complete
software stack for using InfiniBand!

>The large and growing number of IB users will continue to do real work.

Evidence, please? Not for the "growing", which I will accept - there
certainly couldn't be a shrinking number - but that there are any real
users doing real work, let alone your claim of a large number. I doubt
that there are even 10% of the number using the Itanium, and there may
well be fewer than 1%.

Please note that, as the person described above, I have asked several
large vendors, some small ones, and some of my technical contacts for
any reference sites. I have been pointed to several bleeding-edge
sites that are evaluating InfiniBand in collaboration with InfiniBand
developers, but that is about all. And not all of those were as far
along the line as was made out.


Regards,
Nick Maclaren.

Stephen Fuld

May 9, 2004, 11:05:59 AM

"Billy" <bi...@mailinator.com> wrote in message
news:9436f2b3.04050...@posting.google.com...

snip

> Of course, we all remember that Intel tried to push IB as a PCI
> replacement, which was all wrong.

What was wrong with that? Are you talking about some technical reasons why
that was a bad idea or more market/business issues like fighting installed
base, etc. From a purely technical perspective, ISTM to be a good idea. It
unifies device I/O and cluster interconnect into a single, low latency
interface. You wouldn't have had to do things like PCI Express switching
extensions as they are "built in". It eliminates the issues with loads to
"distant" cards stalling the processor, etc.

I know it isn't going to happen, for a variety of reasons, but why do you
think it was a bad idea?

--
- Stephen Fuld
e-mail address disguised to prevent spam


Rob Warnock

May 9, 2004, 9:24:24 PM
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
+---------------

| "Billy" <bi...@mailinator.com> wrote:
| > Of course, we all remember that Intel tried to push IB as a PCI
| > replacement, which was all wrong.
|
| What was wrong with that? Are you talking about some technical reasons why
| that was a bad idea or more market/business issues like fighting installed
| base, etc. From a purely technical perspective, ISTM to be a good idea. It
| unifies device I/O and cluster interconnect into a single, low latency
| interface. You wouldn't have had to do things like PCI Express switching
| extensions as they are "built in". It eliminates the issues with loads to
| "distant" cards stalling the processor, etc.
+---------------

The problem is... it didn't!! You see, the way most people were thinking
about "IB as a PCI replacement" was as a *transparent* replacement, the
way the early Mellanox chips worked, for example: You define some magic
PCI address space on the host that, when accessed, converts the PIOs to IB
sends/receives, which are then converted to PCI cycles at the target IB
device. This was supposed to allow *existing* drivers to run transparently
over such PCI-to-IB-to-PCI converters.

The problem was, of course, exactly what you mentioned: Existing drivers
do entirely too many PIO reads, and the CPU pipeline gets stalled *entirely*
too long.

You can only "eliminate the issues with loads to distant cards" if you
re-write the drivers to use native IB message-passing instead of PIOs,
but that invalidates the entire body of existing PCI (and ISA) device
drivers! (Oops!)


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Billy

May 9, 2004, 11:13:02 PM
"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message news:<rNrnc.36091$Ut1.1...@bgtnsc05-news.ops.worldnet.att.net>...

> "Billy" <bi...@mailinator.com> wrote in message
> news:9436f2b3.04050...@posting.google.com...
> > Of course, we all remember that Intel tried to push IB as a PCI
> > replacement, which was all wrong.

> What was wrong with that? Are you talking about some technical reasons why
> that was a bad idea or more market/business issues like fighting installed
> base, etc. From a purely technical perspective, ISTM to be a good idea. It
> unifies device I/O and cluster interconnect into a single, low latency
> interface. You wouldn't have had to do things like PCI Express switching
> extensions as they are "built in". It eliminates the issues with loads to
> "distant" cards stalling the processor, etc.

> I know it isn't going to happen, for a variety of reasons, but why do you
> think it was a bad idea?

Two (related) reasons, IMHO. First of all, it was far too radical a break
with PCI. Even at the hardware level, I think going from the PCI card
form factor to the IB module form factor would have been too much. But
forcing a rewrite of every BIOS, breaking every legacy OS and so on was too
much to hope for, even with Intel's might behind it. Compare to HT, which
has been a success, and PCI Express, which looks like it will succeed: the
physical layers were radically improved, but the software interface changed
only incrementally from the PCI model. And I'm not sure how unified the
host bus and cluster interconnect could be: where would my subnet manager
run, for example? Would I be dependent on some external service for my
system to even boot?

Second, I'm not convinced that the IB model really is an improvement over
PCI for a core system bus. Even with RDMA, I don't see a good replacement
for memory-mapped IO. Perhaps everything is moving towards command interfaces
and device-initiated DMA, but it still seems nice to be able to map a frame
buffer or even just a command register without having to discover a service,
create a queue pair, connect to the service, post a work request, wait for
a completion, etc.
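
To make the contrast concrete, here is a minimal user-space sketch (in C)
of the memory-mapped model: map a PCI device's BAR through Linux sysfs and
touch registers with ordinary loads and stores. The device path and the
register offsets are invented for illustration only.

/* Sketch only: map BAR0 of a hypothetical PCI device and poke registers. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0",
                  O_RDWR | O_SYNC);            /* hypothetical device      */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t status = regs[0];   /* a PIO read: one load, but the CPU stalls
                                    until the device on the far side answers */
    regs[1] = 0x1;               /* a PIO write: posted, so relatively cheap */
    printf("status = 0x%08x\n", (unsigned)status);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}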

Stephen Fuld

May 10, 2004, 12:34:40 AM

I agree that "fighting the installed base", especially of software was the
problem, but I disagree as to the extent of what gets "broken". First of
all, it wouldn't have replaced the in-box ATA style disk controller, so
systems could boot etc. without needing to deal with IB. Yes, there is a
provision to boot over IB (like being able to boot over an Ethernet
network), but that certainly could be phased in later if needed. So it
needn't break the BIOS. It would have needed a new driver to replace the
PCI driver that is there, but no one expected PCI to go away immediately.
Most PCs still have ISA slots for criminy sakes! Also, the idea was to
start with servers, where its features were more valuable (lots of I/O and
perhaps cluster interconnect) and only later, when the software was there,
move toward desktops, etc. But Intel's desktop division got cold feet and
pulled out.

> Compare to HT, which
> has been a success, and PCI Express, which looks like it will succeed: the
> physical layers were radically improved, but the software interface changed
> only incrementally from the PCI model.

Yes, both the blessing and the curse.

> And I'm not sure how unified the
> host bus and cluster interconnect could be: where would my subnet manager
> run, for example?

On one (or more likely two - for redundancy) of the participants in the
cluster.

> Would I be dependent on some external service for my
> system to even boot?

No. See above.

> Second, I'm not convinced that the IB model really is an improvement over
> PCI for a core system bus. Even with RDMA, I don't see a good replacement
> for memory-mapped IO. Perhaps everything is moving towards command interfaces
> and device-initiated DMA, but it still seems nice to be able to map a frame
> buffer or even just a command register without having to discover a service,
> create a queue pair, connect to the service, post a work request, wait for
> a completion, etc.

There is no doubt that memory mapped I/O is in some sense "easier". But
much work on mainframes has shown that a "channel" type interface can be
very successful and has lots of nice properties when it comes to scaling,
RAS, etc. And it does eliminate the CPU performance problem with "slow
loads" I mentioned above.

Nick Maclaren

May 10, 2004, 4:07:36 AM

In article <P_2dnamjkMF...@speakeasy.net>,

This is exactly the point with the claims that InfiniBand allows
zero-copying I/O by the use of RDMA. In order to do that without
causing chaos elsewhere, MASSIVE changes need to be made to the
operating system and applications. Whereupon, the proponents turn
round and say that it can be used transparently if the RDMA is only
to device-internal buffers for semi-dedicated use (e.g. MPI),
which is approximately true.

What cannot be done is to get the claimed benefits of InfiniBand
(low latency, minimal CPU overhead etc.) AND plug it in as just
another I/O device. And I wish people would stop claiming both
advantages more-or-less in the same breath.


Regards,
Nick Maclaren.

Stephen Fuld

May 10, 2004, 11:42:18 AM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:c7nd88$8ss$1...@pegasus.csx.cam.ac.uk...

You seem to have giant chips on your shoulders for some technologies. These
seem to be far in excess of what many people think to be reasonable. It
appears that they are caused by a reaction to the worst of the inevitable
hype that accompanies almost any new technology. You take the most
outrageous claims of the proponents and use them to justify the opposite
extreme. I think that most objective observers are neither as "glossy eyed"
over the technology as you seem to think they are, nor as "anti" it as you
seem to be. (Note that this applies to things like SMT as well as IB.)

I agree, as I think does just about everyone else, that to gain the full
advantages of zero copy I/O you need substantial changes to at least the
"middleware" that most applications in the commercial space use for I/O
(such as database managers, web servers, etc.). Few changes to the
applications using those systems seem necessary. I realize that probably
isn't true in your area of the world (HPC), but after all, that is a tiny
market compared to commercial servers. And yes, you need some changes to
the OS, but I wouldn't describe them as "massive" - YMMV. And you do
need some additional software for network management, etc., especially if you
want to run anything more than a tiny IB system (but many IB systems would
have been "tiny", that is, a server, a few disk racks and a couple of
network connections).

However, there are advantages to using IB even if you don't use RDMA to
achieve zero copy I/O (i.e. you still use a system owned cache). These
include lower latency, better scaling and connectivity for large number of
peripherals, etc.

> Whereupon, the proponents turn
> round and say that it can be used transparently if the RDMA is only
> to device-internal buffers for for semi-dedicated use (e.g. MPI),
> which is approximately true.

Yes, I just said that. :-) See below.

> What cannot be done is to get the claimed benefits of InfiniBand
> (low latency, minimal CPU overhead etc.) AND plug it in as just
> another I/O device. And I wish people would stop claiming both
> advantages more-or-less in the same breath.

There is lower latency even without the use of zero copy I/O. This comes
from the elimination of the slow "memory" operations, the ability to reduce
the number of interrupts, etc. I don't know of anyone who claims you can
gain zero copy I/O and be totally software compatible. That is a straw man.

Rupert Pigott

May 10, 2004, 11:43:27 AM
Nick Maclaren wrote:

> That is an obvious straw man, on two counts. Firstly, RFCs are Requests
> For Comment and are primarily working documents - only the accepted
> 'standards' should be counted. Secondly, they are at a higher level

That's as maybe, but they are *accepted* 'standards'.

Cheers,
Rupert

Nick Maclaren

May 10, 2004, 12:17:41 PM
In article <upNnc.39564$Ut1.1...@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>However, there are advantages to using IB even if you don't use RDMA to
>achieve zero copy I/O (i.e. you still use a system owned cache). These
>include lower latency, better scaling and connectivity for large number of
>peripherals, etc.

You have proof of all of that? Or even evidence? Do tell.

>There is lower latency even without the use of zero copy I/O. This comes
>from the elimination of the slow "memory" operations, the ability to reduce
>the number of interrupts, etc. I don't know of anyone why claims you can
>gain zero copy I/O and be totally software compatible. That is a straw man.

No, it's not a straw man - that's not what a straw man is. And,
let me repeat, evidence for your first claim is minimal - unless you
are using the straw man of Ethernet's latency and SCSI's scalability.
While it is POSSIBLE that InfiniBand will deliver the best of all
the alternatives, simultaneously, the proof of the pudding is in the
eating.

If you take a look back through this group, or get inflicted with some
InfiniBand presentations, you will see plenty of such claims.


Regards,
Nick Maclaren.

Nick Maclaren

May 10, 2004, 12:19:11 PM
In article <10842038...@teapot.planet.gong>,

Read my posting more carefully. I am referring to the accepted
standards among the RFCs, ignoring the superseded documents, RFCs
that are ignored and so on. The proportion of RFCs that are accepted
standards is quite small.


Regards,
Nick Maclaren.

Rupert Pigott

May 10, 2004, 2:12:55 PM

I think you are neglecting to take into account the zillions of lines
of code that are based on superseded RFCs and RFCs that never got a
nod from ISO et al. I would classify that as acceptance. After all,
it's the code that does the bit bashing, not the standards bodies.

That said I don't hold with hacking up standards to fit code after
the fact. :)

Cheers,
Rupert

Eric

May 10, 2004, 2:40:25 PM
Stephen Fuld wrote:
>
> <snip>

>
> There is lower latency even without the use of zero copy I/O. This comes
> from the elimination of the slow "memory" operations, the ability to reduce
> the number of interrupts, etc. I don't know of anyone who claims you can
> gain zero copy I/O and be totally software compatible. That is a straw man.

What are the slow "memory" operations you are referring to?
If this is a DMA, then there would be a few device control register
writes to set up the DMA transfer, right? Granted they are much
slower than L1 cache, but I don't see how they can be avoided.
It seems to me that the bigger overhead would be the OS call.
And (with appropriate hand waving) the OS could even pre-setup the
transfer by allocating and loading IO bus scatter-gather registers.
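
As a rough illustration of the "few register writes" described above, here
is a sketch in C against a hypothetical DMA engine. The register layout and
the names are invented, but the shape is typical: program the address and
length, then kick off the transfer.

/* Sketch only: classic register-programmed DMA setup, invented layout. */
#include <stdint.h>

struct dma_regs {                 /* memory-mapped device registers       */
    uint64_t src_addr;            /* bus/physical address of the buffer   */
    uint32_t length;              /* transfer length in bytes             */
    uint32_t control;             /* direction, interrupt enable, start   */
};

#define DMA_CTL_START   0x1u
#define DMA_CTL_IRQ_EN  0x2u

static void start_dma(volatile struct dma_regs *regs,
                      uint64_t buf_bus_addr, uint32_t len)
{
    /* Three MMIO writes: slower than cached stores, but all posted, so
       the CPU does not stall waiting for the device to respond.         */
    regs->src_addr = buf_bus_addr;
    regs->length   = len;
    regs->control  = DMA_CTL_START | DMA_CTL_IRQ_EN;
}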

If you can find a machine which has them. I saw a To Be Defined
reference to bus scatter-gather support in a preliminary document
for an Intel bus interface chip, but it disappeared in the final draft.
I guess someone at Intel likes programmed IO for some strange reason.

I have wondered for a long time why bus scatter-gather support wasn't
in the Intel PCI and ISA bus interface chips. It allows DMA to be
zero copy (to user space if desired), transparent to page boundaries,
and maps 24 bit ISA and 32 bit PCI into 36 bit system address space.
And WNT at least has had support for it from the very start.

So it _should_ be well known technology.

Eric

David Acker

May 10, 2004, 3:29:51 PM
I work at an IB company (InfiniCon Systems) and I disagree with you a
bit on your characterizations. As someone who has worked on IB since
before the 1.0 spec, I have some idea of what state the IB
software/hardware world is in. For what it's worth (I know I am a
bit biased) here are my thoughts. Sorry if it appears to be
advertising, but I take pride in the work we do.


> Speaking as the technical lead of an organisation that does some quite
> significant procurements, I am in a position to respond to that with
> a educated Bronx cheer. You have your tenses wrong. Every InifiniBand-
> favouring vendor sales team has started off peddling your wares, and
> has backpedalled like fury when I have started demanding hard data on
> deliverables. It seems that:
>
> The hardware is a mixture of not yet available, beta test and first
> production shipments. Most of the field engineers have seen some, but
> only some have been trained for it, and none have worked on it for real
> (in real customer sites, being used for real, natch).

I guess I am bit confused here. We have had software and hardware in
GA since 11/2002 when our Shared I/O platform GA'ed. This went
through a full system test cycle and had since been qualified by
outside OEMs. It is a shipping product used in production systems.
Since then many other ULPs and hardware platforms have been developed,
tested, GA'ed and are now shipping. For example, Fujitsu OEM'ed our
InfinIO 2000 and uses our InfinIO 3000. The InfinIO 3000 is the
switching fabric behind the 512 node cluster at Riken in Japan.
Fujitsu is using IB for commercial applications like banking.

IB hardware has been around for some time and has had lots of time to
stabilize. Agilent's and Mellanox's 8 port switching ASICs and
Mellanox's PCI-X ASIC have been out for a while.

>
> The drivers and networking are mixtures of not yet available, alpha
> test, beta test and "I need to check when the release date is". Some
> are clearly "in use" but it is unclear how (or whether) they are
> supported.
>

We are always adding new functionality to our software stack but the
base has been GA'ed and in production use for some time. Core
protocols like MPI, IPoIB, SDP, IB to Ethernet, and IB to Fibre
Channel have been GA'ed for some time as well. I think that we are
pretty clear about what we support and when we will support new
features.

> The higher levels are a mixture of planned for future releases,
> not yet available, alpha test and "if you use the InfiniBand driver
> from the off-the-shelf MPICH and TCP/IP source under Linux, it Will
> All Work". I am not quite as green as I am cabbage-looking, thank
> you very much.
>

I know for a fact that with each release we note all the various OS
and kernel versions, architectures, and distributions that we
support. By support we mean that we have tested and verified it. The
list is quite large. In almost all cases the feature set is the same
across these various environments. We provide everything you should
need, including an MPI, and support it.

> The integration testing, validation and commitment (at all levels
> of the software) had clearly not even been scheduled, let alone
> started. The sales teams wriggled like eels when I tried to ask them
> "Can you find out a combination of XXX (either MPI or TCP/IP), operating
> system, drivers, InfiniBand cards and switches that you will support in
> combination, and a schedule for when you will do it?"

We run a system test cycle with every release. We have test plans and
acceptance criteria for each release. We do not just throw this stuff
out the door; our customers would never accept that and neither do we.

>
> Yes, I really do mean that they knew things were so far off being ready
> that they couldn't even find out a SCHEDULE for the support of a complete
> software stack for using InfiniBand!
>

I don't know who you were talking to, but it wasn't us or our
partners. We have a complete software stack shipping today.

> >The large and growing number of IB users will continue to do real work.
>
> Evidence, please? Not for the "growing", which I will accept - there
> certainly couldn't be a shrinking number - but that there are any real
> users doing real work, let alone your claim of a large number. I doubt
> that there are even 10% of the number using the Itanium, and there may
> well be fewer than 1%.

We were probably the first IB company to test, qualify, and ship
products using Itanium. The percentage of users on Itanium vs x86 or
AMD64 is based on demand, not availability. In fact, Itanium was
supported by ICS in the Itanium 1 timeline, well before Opteron
came out.

>
> Please note that, as the person described above, I have asked several
> large vendors, some small ones, and some of my technical contacts for
> any reference sites. I have been pointed to several bleeding-edge
> sites that are evaluating InfiniBand in collaboration with InfiniBand
> developers, but that is about all. And not all of those were as far
> along the line as was made out.

I would say that the IB market is pretty well past the early adopter
phase. Most HPC folks understand IB and seem to be ready to use it
for bids. Database clusters are coming with Oracle's support for IB
in Oracle 10G.

Nick Maclaren

May 10, 2004, 4:01:38 PM
In article <f4346eb5.04051...@posting.google.com>,

David Acker <sold...@yahoo.com> wrote:
>I work at an IB company (InfiniCon Systems) and I disagree with you a
>bit on your characterizations. As someone who has worked on IB since
>before the 1.0 spec I have same idea of what state the IB
>software/hardware world is in. For what's its worth (I know I am a
>bit biased ) here are my thoughts. Sorry if it appears to be
>advertising, but I take pride in the work we do.

I will take a look at your Web page tomorrow; please send me any
technical information that you feel might help. My posting was
NOT based on an absence of investigation, but no mere mortal can
check up on everything. I can always be wrong :-)

>> The hardware is a mixture of not yet available, beta test and first
>> production shipments. Most of the field engineers have seen some, but
>> only some have been trained for it, and none have worked on it for real
>> (in real customer sites, being used for real, natch).
>
>I guess I am bit confused here. We have had software and hardware in
>GA since 11/2002 when our Shared I/O platform GA'ed. This went
>through a full system test cycle and had since been qualified by
>outside OEMs. It is a shipping product used in production systems.
>Since then many other ULPs and hardware platforms have been developed,
>tested, GA'ed and are now shipping. For example, Fujitsu OEM'ed our
>InfinIO 2000 and uses our InfinIO 3000. The InfinIO 3000 is the
>switching fabric behind the 512 node cluster at Riken in Japan.
>Fujitsu is using IB for commercial applications like banking.

Interesting. Perhaps I should have stated explicitly that my remarks
were (not surprisingly) UK oriented. Is Fujitsu REALLY using InfiniBand
in PRODUCTION for banking?

After 18 months, you could be at second production shipments, but some
vendors do most of their beta testing after "General Availability".
I don't know your product, so can't say.

>IB hardware has been around for some time and has had lots of time to
>stabalize. Agilent's and Mellanox's 8 port switching ASICs and
>Mellanox's PCI-X ASIC have been out for awhile.

I am aware of that. Stabilisation does not occur purely due to the
lapse of time.

>We are always adding new functionality to our software stack but the
>base has been GA'ed and in production use for some time. Core
>protocols like MPI, IPoIB, SDP, IB to Ethernet, and IB to Fibre
>Channel have been GA'ed for some time as well. I think that we are
>pretty clear about what we support and when we will support new
>features.

I was talking about the large vendors - sorry, I should have made that
clearer. Card vendors have had drivers for some time, as you say,
though sometimes only under a few releases of a few systems.

>I know for a fact that with each release we note all the various OS
>and kernel, versions, architectures, and distributions that we
>support. By support we mean that we have tested and verified it. The
>list is quite large. In almost all cases the feature set is the same
>accross these various environments. We provide everything you should
>including an MPI and support it.

Well, firstly, by "support", I mean a LOT more that just having "tested
and verified" it. Real problems are found in real use, and I was and
am asking about support for complete SYSTEMS.

Secondly, there is a lot more to such things than including an MPI.
Integration with the resource and process control mechanisms of the
system is also critical. Did I say critical? Yes, well, I meant it.
Far too many software components require a particular system
configuration, and won't work with others; it isn't uncommon to have
all pairs of components work together but no way to get them all to.

However, I shall look you up, and request that you send information.

>We run a system test cycle with every release. We have test plans and
>acceptance criteria for each release. We do not just throw this stuff
>out the door; our customers would never accept that and neither do we.

System is as system does. I think that you may be meaning something
rather more specific than I am. One of my recent headaches has been
getting a combination of operating system, MPI and compilers that meets
our requirements.

>I don't know who you were talking to, but it wasn't us or our
>partners. We have a complete software stack shipping today.

It wasn't. It was most of the major vendors.

>I would say that the IB market is pretty well past the early adopter
>phase. Most HPC folks understand IB and seem to be ready to use it
>for bids. Database clusters are coming with Oracle's support for IB
>in Oracle 10G.

I wouldn't, but I am happy to be corrected. However, I do require
some evidence of reference sites (of the sort that I can push fairly
hard on what they are using it for, and get appropriate responses)
before I will believe that anything is stabilising. I have been in
this game too long to trust demonstrations and light use as any
kind of indication of what happens under stress.


Regards,
Nick Maclaren.

Jan Vorbrüggen

May 11, 2004, 3:50:56 AM
> I have wondered for a long time why bus scatter-gather support wasn't
> in the Intel PCI and ISA bus interface chips. It allows DMA to be
> zero copy (to user space if desired), transparent to page boundaries,
> and maps 24 bit ISA and 32 bit PCI into 36 bit system address space.
> And WNT at least has had support for it from the very start.
>
> So it _should_ be well known technology.

I would presume WNT has had it because of its heritage (VMS), which also
means that it _is_ well known technology - about 25 years old now, at least.

Jan

Nick Maclaren

May 11, 2004, 4:04:23 AM

In article <2gbev0...@uni-berlin.de>,

Jan Vorbrüggen <jvorbrue...@mediasec.de> writes:
|>
|> I would presume WNT has had it because of its heritage (VMS), which also
|> means that it _is_ well known technology - about 25 years old now, at least.

40 years, at least. It was there in System/360.


Regards,
Nick Maclaren.

Nick Maclaren

May 11, 2004, 6:24:14 AM

In article <f4346eb5.04051...@posting.google.com>,
sold...@yahoo.com (David Acker) writes:
|>
|> I work at an IB company (InfiniCon Systems) ....

Hmm. I have taken a look at your Web page, and realise I found it
before and discarded it as targeted entirely at OEMs, and showing
no evidence of significant use in production environments. Upon a
more careful inspection, I cannot change my mind, based on what it
says.

|> > The hardware is a mixture of not yet available, beta test and first
|> > production shipments. Most of the field engineers have seen some, but
|> > only some have been trained for it, and none have worked on it for real
|> > (in real customer sites, being used for real, natch).
|>
|> I guess I am bit confused here. We have had software and hardware in
|> GA since 11/2002 when our Shared I/O platform GA'ed. This went
|> through a full system test cycle and had since been qualified by
|> outside OEMs. It is a shipping product used in production systems.
|> Since then many other ULPs and hardware platforms have been developed,
|> tested, GA'ed and are now shipping. For example, Fujitsu OEM'ed our
|> InfinIO 2000 and uses our InfinIO 3000. The InfinIO 3000 is the
|> switching fabric behind the 512 node cluster at Riken in Japan.
|> Fujitsu is using IB for commercial applications like banking.

Well, the RIKEN cluster was scheduled to come online only in March
of this year - I haven't been able to find whether it has done, but
there assuredly can't be much experience with it!

|> > The higher levels are a mixture of planned for future releases,
|> > not yet available, alpha test and "if you use the InfiniBand driver
|> > from the off-the-shelf MPICH and TCP/IP source under Linux, it Will
|> > All Work". I am not quite as green as I am cabbage-looking, thank
|> > you very much.
|>
|> I know for a fact that with each release we note all the various OS
|> and kernel, versions, architectures, and distributions that we
|> support. By support we mean that we have tested and verified it. The
|> list is quite large. In almost all cases the feature set is the same
|> accross these various environments. We provide everything you should
|> including an MPI and support it.

I can't find any of that on your Web page.

If it is the case, please could you ask someone on the sales side
to send me some hard information on what hardware, operating systems,
MPI etc. are supported and by whom? And I do mean HARD information.


Regards,
Nick Maclaren.

Eric

May 11, 2004, 10:02:01 AM

Yes. My point was that knowledge of the mechanism and software
support are not limiting factors. With bus scatter-gather mapping
hardware, zero copy DMA is possible. Without it, both WNT and
Linux must copy data into bounce buffers for the DMA.
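
For concreteness, here is a sketch in C of the two paths being contrasted,
with invented helper names (iommu_map_page below is a stand-in, not a real
API): with mapping hardware the OS just describes the user buffer page by
page; without it, the data takes an extra copy through a bounce buffer that
the device can address.

/* Sketch only: scatter-gather description vs. bounce-buffer copy. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

struct sg_entry { uint64_t bus_addr; uint32_t len; };

static uint64_t iommu_map_page(void *virt)
{
    /* Stand-in for the real mapping hardware/OS service; an identity
       "mapping" here just to keep the sketch self-contained.          */
    return (uint64_t)(uintptr_t)virt;
}

/* Zero-copy path: one scatter-gather element per physical page backing
   the virtually contiguous (but physically scattered) buffer.          */
static size_t build_sg_list(void *buf, size_t len, struct sg_entry *sg)
{
    uint8_t *p = buf;
    size_t n = 0;
    while (len > 0) {
        size_t chunk = PAGE_SIZE - ((uintptr_t)p & (PAGE_SIZE - 1));
        if (chunk > len)
            chunk = len;
        sg[n].bus_addr = iommu_map_page(p);
        sg[n].len = (uint32_t)chunk;
        n++;
        p += chunk;
        len -= chunk;
    }
    return n;
}

/* Bounce-buffer path: the extra copy into memory the device can reach. */
static void copy_to_bounce(void *bounce_low_mem, const void *buf, size_t len)
{
    memcpy(bounce_low_mem, buf, len);
}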

So, given how important this hardware is for good IO performance,
why is it still missing from all the chipsets?
(I haven't checked to see if scatter-gather support is specified in
the HyperTransport interface. Maybe AMD can take the lead here too.)

Eric

Nick Maclaren

May 11, 2004, 11:19:52 AM

In article <40A0DCD9...@sympaticoREMOVE.ca>,

Eric <eric_p...@sympaticoREMOVE.ca> writes:
|>
|> Yes. My point was that knowledge of the mechanism and software
|> support are not limiting factors. With bus scatter-gather mapping
|> hardware, zero copy DMA is possible. Without it, both WNT and
|> Linux must copy data into bounce buffers for the DMA.

Actually, that is not so. It is not needed for zero copying, but
the lack of it imposes certain restrictions on the operating system,
interfaces and its usage.


Regards,
Nick Maclaren.

Eric

May 11, 2004, 2:05:28 PM

Without the scatter-gather hardware the restrictions are:
- the buffer must lie within the range directly addressable by DMA
 from that I/O bus, which is limited to the number of address
 lines on that bus. For PCI that means <= 4GB, ISA <= 16 MB.
- the buffer memory address must not conflict with a device
 address on the I/O bus, from the point of view of the I/O device.
 This can happen if the cpu address space is larger than the
 I/O bus address space.
- the I/O buffer length must be <= 4K (I/O buffers must not cross a
 4K boundary because the pages allocated to a virtual buffer would
 probably not be contiguous).

That is sufficiently restrictive that, to my knowledge,
no one implements what you suggest.

Eric

Christoph Hellwig

May 11, 2004, 2:32:02 PM
On Tue, May 11, 2004 at 10:02:01AM -0400, Eric wrote:
> Yes. My point was that knowledge of the mechanism and software
> support are not limiting factors. With bus scatter-gather mapping
> hardware, zero copy DMA is possible. Without it, both WNT and
> Linux must copy data into bounce buffers for the DMA.
>
> So, given how important this hardware is for good IO performance,
> why is it still missing from all the chipsets?
> (I haven't checked to see if scatter-gather support is specified in
> the HyperTransport interface. Maybe AMD can take the lead here too.)

Well, basically all modern hardware except intel's chipsets has
scatter-gather mappings in the pci chipset (iommus in Linux
terminology), e.g. the SGI and HP IA64 systems support it while the
intel chipsets don't. AMD provides an iommu in the Opteron/Athlon64,
but it's a hack bolted onto the AGP GART on request of the Linux folks.

Andi Kleen should be able to tell more on it.

Christoph Hellwig

May 11, 2004, 2:33:38 PM
On Tue, May 11, 2004 at 02:05:28PM -0400, Eric wrote:
> Without the scatter-gather hardware the restictions are:
> - the buffer lie within the range directly addressable by DMA
> from that I/O bus, which is limited it to the number of address
> lines on that bus. For PCI that means <= 4GB, ISA <= 16 MB.

Most highend PCI cards actually support DAC addressing, aka 64bit
DMA, so there's no problem with that.

> - the I/O buffer length <= 4k (I/O buffers must not cross a
> 4K boundary because pages allocated to virtual buffer would
> probably not be contiguous).

I've heard windows actually has some scheme to hand out big
contiguous areas where possible, but I can't really say I understand
where/why/etc.

Linux does have checks to do bigger I/O requests in the case where the
pages are physically contiguous, although this happens rather
rarely in practice. What also happens is that HPC or database
users can allocate large pages (usually 4MB on x86) which allow
you to do much larger I/O lengths.
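
A minimal sketch in C of grabbing such a large page from user space via a
file on a hugetlbfs mount; the mount point is an assumption, and huge pages
must have been reserved beforehand (vm.nr_hugepages).

/* Sketch only: back a 4MB I/O buffer with one huge page via hugetlbfs. */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4 * 1024 * 1024;          /* one 4MB page on x86  */
    int fd = open("/mnt/huge/iobuf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* buf is now backed by a single physically contiguous huge page, so
       an I/O request against it needs only one scatter-gather element.  */

    munmap(buf, len);
    close(fd);
    unlink("/mnt/huge/iobuf");
    return 0;
}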

> That is sufficiently restrictive that, to my knowledge,
> no one implements what you suggest.

What, scatter-gather I/O without an iommu? Every OS running on PCs
that I know of does.

Nick Maclaren

May 11, 2004, 2:59:19 PM
In article <40A115E8...@sympaticoREMOVE.ca>,
Eric <eric_p...@sympaticoREMOVE.ca> wrote:

>Nick Maclaren wrote:
>>
>> |> Yes. My point was that knowledge of the mechanism and software
>> |> support are not limiting factors. With bus scatter-gather mapping
>> |> hardware, zero copy DMA is possible. Without it, both WNT and
>> |> Linux must copy data into bounce buffers for the DMA.
>>
>> Actually, that is not so. It is not needed for zero copying, but
>> the lack of it imposes certain restrictions on the operating system,
>> interfaces and its usage.
>
>Without the scatter-gather hardware the restictions are:
>- the buffer lie within the range directly addressable by DMA
> from that I/O bus, which is limited it to the number of address
> lines on that bus. For PCI that means <= 4GB, ISA <= 16 MB.

Tedious, but acceptable for PCI on many current systems. Also, it
is solved by a simple base offset register for the transfer, which
is VASTLY less hardware support than scatter/gather!

>- the buffer memory address must not conflict with a device
> address on the I/O bus, from the point of view of the I/O device.
> This can happen if the cpu address space is larger than the
> I/O bus address space.

Well, yes, but dealing with that is fairly standard operating system
technology, and has been since time immemorial.

>- the I/O buffer length <= 4k (I/O buffers must not cross a
> 4K boundary because pages allocated to virtual buffer would
> probably not be contiguous).

No, that is wrong. One of the restrictions I was referring to was
the need to use only contiguous memory for transfers. That is a
common and solved restriction for many HPC interconnects, and has
been since the early days of virtual memory (the 1960s).

>That is sufficiently restrictive that, to my knowledge,
>no one implements what you suggest.

Oh, yes, they do. Quite a few of the HPC interconnects have and
probably still do work that way. Hitachi's SR2201 RDMA did, for
example.


Regards,
Nick Maclaren.

Stephen Fuld

May 11, 2004, 3:11:52 PM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:c7o9v5$7jk$1...@pegasus.csx.cam.ac.uk...

> In article <upNnc.39564$Ut1.1...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
> >
> >However, there are advantages to using IB even if you don't use RDMA to
> >achieve zero copy I/O (i.e. you still use a system owned cache). These
> >include lower latency, better scaling and connectivity for large number of
> >peripherals, etc.
>
> You have proof of all of that? Or even evidence? Do tell.

I don't have "proof" in that I have been out of that world for some years.
However, I can talk about the comparisons of IB, or at least its predecessor
designs (NGIO and System I/O), compared with things like PCI-X. I cannot
talk about PCI Express for, AFAIK, the specs are not freely available (I am
not a member of the PCI SIG). PCI-X was limited to one slot at full speed
and two at reduced speed. That limited connectivity. That, coupled with the
large number of pins required for PCI-X compared with IB, limits the number
of interfaces you can easily fit on a motherboard or "blade card". With the
"external" switching capability of IB, you can get off the card with a few
wires to an external switch to a large number (dozens or even hundreds) of
peripheral devices. That is much harder, if not impossible with PCI-X. As
for the latency, I haven't followed the implementations, but the elimination
of the long latency loads from distant PCI cards, and the reduction in
interrupts inherent in the protocol should allow a competent implementation
to achieve this. In some lab tests, NGIO was achieving memory to memory
latencies of a few microseconds over distances of a few feet.

Yes, you can talk about what any particular implementation does, and it may
be poor. I wouldn't doubt you there. But the potential to achieve lower
latency and higher connectivity/scalability than PCI is clear from the specs
and was a primary driver of the design. I see no reason preventing it from
being accomplished by at least some vendors (assuming they haven't lost
interest).

> >There is lower latency even without the use of zero copy I/O. This comes
> >from the elimination of the slow "memory" operations, the ability to reduce
> >the number of interrupts, etc. I don't know of anyone who claims you can
> >gain zero copy I/O and be totally software compatible. That is a straw man.
>
> No, it's not a straw man - that's not what a straw man is. And,
> let me repeat, evidence for your first claim is minimal - unless you
> are using the straw man of Ethernet's latency and SCSI's scalability.

I was using the latency of the then competitors in the message passing area,
such as Myrinet and Giganet (?), not Ethernet. As for high scalability,
parallel SCSI isn't even in the game, but Fibre Channel is. There was a lot
of dispute about to what extent these new interconnects would compete with
FC for attachment of lots of disks over distances greater than a few meters.
Since IB never got the wide acceptance that some had hoped, this argument is
now moot, but clearly SCSI over IB to a disk controller would be lower
latency than FC with a PCI connection to the host.

> While it is POSSIBLE that InfiniBand will deliver the best of all
> the alternatives, simultaneously, the proof of the pudding is in the
> eating.

Of course. But due to marketing/installed base issues, etc. it could be a
great technical solution and not win. So you may not get to even try the
pudding or be exposed to the best pudding that could be made.

> If you take a look back through this group, or get inflicted with some
> InfiniBand presentations, you will see plenty of such claims.

I have been to the early IB stuff (while I was still involved) but nothing
lately. It certainly could have changed, and some people were "off the
wall" optimistic. But most of the people I talked to had a pretty godd idea
of the potential. Of course, as I said before YMMV. As for this group, I
have seen few really outrageous claims about IB of the type you sem to be
thinking of, but I might have missed them.

Stephen Fuld

unread,
May 11, 2004, 3:16:34 PM5/11/04
to

"Eric" <eric_p...@sympaticoREMOVE.ca> wrote in message
news:409FCC99...@sympaticoREMOVE.ca...

> Stephen Fuld wrote:
> >
> > <snip>
> >
> > There is lower latency even without the use of zero copy I/O. This comes
> > from the elimination of the slow "memory" operations, the ability to reduce
> > the number of interrupts, etc. I don't know of anyone who claims you can
> > gain zero copy I/O and be totally software compatible. That is a straw man.
>
> What are the slow "memory" operations you are referring to?

It is the loads from memory mapped control registers that actually
physically reside on the far side of PCI bridge chip (or equivalent). These
can take multi-microseconds each. IB eliminates them by doing everything
out of main memory.

> If this is a DMA, then there would be a few device control register
> writes to set up the DMA transfer, right? Granted they are much
> slower than L1 cache, but I don't see how they can be avoided.

You set up a data structure with all the control information in main memory
then execute one instruction that points the I/O hardware to that data
structure and says "go".

> It seems to me that the bigger overhead would be the OS call.
> And (with appropriate hand waving) the OS could even pre-setup the
> transfer by allocating and loading IO bus scatter-gather registers.

Elimination of the OS call is a substantial latency reducer, but that does
require software changes.

> If you can find a machine which has them. I saw a To Be Defined
> reference to bus scatter-gather support in a preliminary document
> for an Intel bus interface chip, but it disappeared in the final draft.
> I guess someone at Intel likes programmed IO for some strange reason.
>
> I have wondered for a long time why bus scatter-gather support wasn't
> in the Intel PCI and ISA bus interface chips. It allows DMA to be
> zero copy (to user space if desired), transparent to page boundaries,
> and maps 24 bit ISA and 32 bit PCI into 36 bit system address space.
> And WNT at least has had support for it from the very start.
>
> So it _should_ be well known technology.

Scatter gather has been done in hardware for over four decades. But I'm not
sure what you think that has to do with enabling zero copy I/O. You can do
I/O into the user space in one chunk (no SG). Having SG does make some
things simpler though.

Eric

unread,
May 11, 2004, 3:08:35 PM5/11/04
to

Sorry, I meant on PC's. I only looked at Intel chipset docs
because if Intel chipsets don't support it, it is unlikely
to be used by the OS even if present. As I have not heard of any
chipset manufacturer touting 'turbo advanced io performance'
I assumed they don't do it either. Am I wrong?

I'm not sure what you mean wrt 'bolted onto AGP'. Do you mean it
is only available for AGP and not general I/O?

Eric


Eric

unread,
May 11, 2004, 3:10:03 PM5/11/04
to
Christoph Hellwig wrote:
>
> On Tue, May 11, 2004 at 02:05:28PM -0400, Eric wrote:
> > Without the scatter-gather hardware the restrictions are:
> > - the buffer lie within the range directly addressable by DMA
> > from that I/O bus, which is limited by the number of address
> > lines on that bus. For PCI that means <= 4GB, ISA <= 16 MB.
>
> Most highend PCI cards actually support DAC addressing, aka 64bit
> DMA, so there's no problem with that.

This is a solution. However it forces everyone to go out
and buy new equipment, whereas IOMMU does not. As most people
have PCI/32 and are not likely to replace their cards,
the iommu solution is better because it always works.

>
> > - the I/O buffer length <= 4k (I/O buffers must not cross a
> > 4K boundary because pages allocated to virtual buffer would
> > probably not be contiguous).
>
> I've heard windows actually has some scheme to hand out big
> contiguous areas where possible, but I can't really say I understand
> where/why/etc..
>
> Linux does have checks to do bigger I/O requests in the case the
> pages are physically contiguous, although this happens rather
> rarely in practice. What also happens is that HPC or database
> users can allocate large pages (usually 4MB on x86) which allow
> you to do much larger I/O length.
>
> > That is sufficiently restrictive that, to my knowledge,
> > no one implements what you suggest.
>
> What, scatter-gather I/O without iommu? Every OS running on PCs
> that I know does..

How? The pages of a virtual buffer are scattered all over memory.
Without an iommu the OS has only two choices:
(a) copy to a contiguous bounce buffer
(b) do an I/O operation for each fragment.
The overhead of option (b) is higher due to the interrupt processing
and is still limited to 4 GB for PCI/32 so everyone does option (a).

Is there something else that takes place?

Eric

Joe Seigh

unread,
May 11, 2004, 3:27:49 PM5/11/04
to

Stephen Fuld wrote:
>
> "Eric" <eric_p...@sympaticoREMOVE.ca> wrote in message
> news:409FCC99...@sympaticoREMOVE.ca...
>

> > It seems to me that the bigger overhead would be the OS call.
> > And (with appropriate hand waving) the OS could even pre-setup the
> > transfer by allocating and loading IO bus scatter-gather registers.
>
> Elimination of the OS call is a substantial latency reducer, but that does
> require software changes.
>

You could eliminate the OS call now (with software changes of course).

Joe Seigh

Christoph Hellwig

unread,
May 11, 2004, 3:35:35 PM5/11/04
to
On Tue, May 11, 2004 at 03:10:03PM -0400, Eric wrote:
> This is a solution. However it forces everyone to go out
> and buy new equipment, whereas IOMMU does not. As most people
> have PCI/32 and are not likely to replace their cards,
> the iommu solution is better because it always works.

It's hard to find a GigE NIC or SCSI/FC HBA without DAC,
in fact DAC support is mandatory in PCI-X.

> > What, scatter-gather I/O without iommu? Every OS running on PCs
> > that I know does..
>
> How? The pages of a virtual buffer are scattered all over memory.
> Without an iommu the OS has only two choices:
> (a) copy to a continuous bounce buffer
> (b) do an I/O operation for each fragment.
> The overhead of option (b) is higher due to the interrupt processing
> and is still limited to 4 GB for PCI/32 so everyone does option (a).
>
> Is there something else that takes place?

PCI hardware tends to support multiple dma operations per logical
operation, aka scatter gather I/O. You hand the hardware a few
dozen (typical limits are 128 or 256) I/O fragments (4k on x86)
and it'll do the I/O on them.
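
Schematically something like this (a made-up descriptor format, just to
show the shape of it -- every real controller has its own layout):

    #include <stdint.h>
    #include <stddef.h>

    #define SG_MAX 128              /* typical controller limit: 128 or 256 entries */

    /* One element of a device scatter/gather list: bus address + length.
       The controller walks the array and performs one logical I/O across
       all of the physically discontiguous fragments. */
    struct sg_entry {
        uint64_t addr;              /* bus address of this fragment */
        uint32_t len;               /* fragment length, typically <= 4096 on x86 */
        uint32_t flags;             /* e.g. "last entry" marker */
    };

    /* Fill an SG list from an array of page fragments; returns the number
       of entries used, or 0 if the buffer needs more than SG_MAX pieces. */
    static size_t build_sg(struct sg_entry *sg, const uint64_t *frag_addr,
                           const uint32_t *frag_len, size_t nfrags)
    {
        if (nfrags == 0 || nfrags > SG_MAX)
            return 0;
        for (size_t i = 0; i < nfrags; i++) {
            sg[i].addr  = frag_addr[i];
            sg[i].len   = frag_len[i];
            sg[i].flags = (i == nfrags - 1) ? 1 : 0;   /* mark the last entry */
        }
        return nfrags;
    }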

Christoph Hellwig

unread,
May 11, 2004, 3:39:26 PM5/11/04
to
On Tue, May 11, 2004 at 03:08:35PM -0400, Eric wrote:
> Sorry, I meant on PC's. I only looked at Intel chipset docs
> because if Intel chipsets don't support it, it is unlikely
> to be used by the OS even if present. As I have not heard of any
> chipset manufacturer touting 'turbo advanced io performance'
> I assumed they don't do it either. Am I wrong?

Kinda. On the K8 it's not a chipset feature but a CPU feature, because
the HT and memory controllers are on the cpu (VIA chipsets manage to
mess it up in some way though, don't ask me how). And yes,
Linux uses it.

> I'm not sure what you mean wrt 'bolted onto AGP'. Do you mean it
> is only available for AGP and not general I/0?

It means all recent PC chipsets actually support a very limited
I/O MMU for AGP only, the so-called AGP GART. On request of the
Linux developers involved rather early in the K8 design process,
it was enhanced to also support remapping of PCI bus addresses.

Nick Maclaren

unread,
May 11, 2004, 4:30:45 PM5/11/04
to
In article <Yz9oc.78630$Xj6.1...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
>news:c7o9v5$7jk$1...@pegasus.csx.cam.ac.uk...
>> In article <upNnc.39564$Ut1.1...@bgtnsc05-news.ops.worldnet.att.net>,
>> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>> >
>> >However, there are advantages to using IB even if you don't use RDMA to
>> >achieve zero copy I/O (i.e. you still use a system owned cache). These
>> >include lower latency, better scaling and connectivity for large number of
>> >peripherals, etc.
>>
>> You have proof of all of that? Or even evidence? Do tell.
>
>I don't have "proof" in that I have been out of that world for some years.
>However, I can talk about the comparisons of IB, or at least its predecessor
>designs (NGIO and System I/O) compared with things like PCI-X. I cannot
>talk about PCI Express for AFAIK, the specs are not freely available (I am
>not a member of the PCI SIG). ...

>
>Yes, you can talk about what any particular implementation does, and it may
>be poor. I wouldn't doubt you there. But the potential to achieve lower
>latency and higher connectivity/scalability than PCI is clear from the specs
>and was a primary driver of the design. I see no reason preventing it from
>being accomplished by at least some vendors (assuming they haven't lost
>interest).

Right. We are reaching common ground :-)

I remain extremely interested in InfiniBand because of its potential
for the combination of high bandwidth, low latency and cheapness.
But I remain VERY suspicious of the jump from potential to actual
performance/reliability/whatever, ESPECIALLY when there is such a
massively complex specification that opens wormcans that I know (from
experience) caused hell 30 years back and were never entirely brought
under control.

A related example of that concerns Gigabit Ethernet. There was NO
chance that was going to deliver, based on the interfaces that were
'standard' for ordinary and Fast Ethernet. Well, the solution was
to drop them and go for non-standard DMA interfaces. Fine. But it
is then reasonable to ask whether similar things will happen in the
problem areas of InfiniBand (which are in very different areas).

The older I get, the less I believe that anything will work. And
the more reliable my predictions become :-(


Regards,
Nick Maclaren.

Greg Lindahl

unread,
May 11, 2004, 4:32:18 PM5/11/04
to
In article <40A124B3...@sympaticoREMOVE.ca>,
Eric <eric_p...@sympaticoREMOVE.ca> wrote:

>Sorry, I meant on PC's. I only looked at Intel chipset docs
>because if Intel chipsets don't support it, it is unlikely
>to be used by the OS even if present.

We were at least partly discussing Linux in this thread. Linux, unlike
Windows, runs on a lot of different hardware. The first couple of
64-bit Linux ports added that capability to the OS; it was a huge
deal, as it was long enough ago that 32-bit PCI cards were still very
common.

-- greg

Rob Warnock

unread,
May 11, 2004, 8:48:12 PM5/11/04
to
Christoph Hellwig <h...@engr.sgi.com> wrote:
+---------------

| Well, basically all modern hardware except Intel's chipsets has
| scatter-gather mappings in the pci chipset (iommus in Linux
| terminology), e.g. the SGI and HP IA64 systems support it while the
| intel chipsets don't.
+---------------

Well, to be more precise, SGI's Octane/Fuel/Origin/Altix systems
provide page-sized [and *16KB* pages, at that!] I/O MMUs for PCI
DMAs for 32-bit addressing only. For 64-bit PCI DMA addressing
the PCI bus addresses are interpreted as absolute memory addresses
[plus some upper flag bits], the driver must provide the virtual-
to-physical translation (using the provided kernel services), and
the DMA device itself must provide any scatter/gather functionality.

Note that setting up and tearing down the 32-bit IO MMU mappings
for every I/O operation is *very* CPU-intensive. Fortunately, almost
all modern PCI cards provide both 64-bit addressing and scatter/gather
descriptor rings...


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Rob Warnock

unread,
May 11, 2004, 8:57:27 PM5/11/04
to
Joe Seigh <jsei...@xemaps.com> wrote:
+---------------
| Stephen Fuld wrote:

| > "Eric" <eric_p...@sympaticoREMOVE.ca> wrote:
| > > It seems to me that the bigger overhead would be the OS call.
| > > And (with appropriate hand waving) the OS could even pre-setup the
| > > transfer by allocating and loading IO bus scatter-gather registers.
| >
| > Elimination of the OS call is a substantial latency reducer, but that
| > does require software changes.
|
| You could eliminate the OS call now (with software changes of course).
+---------------

However, *safely* eliminating the O/S call often requires changing
the *hardware*, since -- unless it was a design constraint given to
the hardware engineers at the very beginning -- it is almost always
the case that the available granularity of mapping hardware register space
into user mode does *not* match the partitioning of the hardware registers
into "safe" and "unsafe" ones. That is, a "Go!" register might be
in the same mapping granule as the register that sets the IO-to-host
address mapping, and exposing the former to the user (safe) will also
expose the *latter* to the user program (very unsafe!).

The better-designed IBA HAs have addressed this issue...
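
E.g. (a sketch only -- the device node, the offset, and the whole idea of
a per-process doorbell page here are assumptions for illustration, not any
particular vendor's interface):

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* A well-designed adapter puts each process's "Go" doorbell on its
       own page, so the kernel can map just that page into user space
       while the unsafe setup/mapping registers stay on pages that are
       never exposed.  /dev/hca0 and the offset are invented. */
    #define DOORBELL_PAGE_OFFSET  0x1000   /* hypothetical per-process doorbell page */

    static volatile uint64_t *map_doorbell(void)
    {
        int fd = open("/dev/hca0", O_RDWR);        /* hypothetical device node */
        if (fd < 0)
            return NULL;
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, DOORBELL_PAGE_OFFSET);  /* only the "safe" page is mapped */
        close(fd);
        return (p == MAP_FAILED) ? NULL : (volatile uint64_t *)p;
    }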

Stephen Fuld

unread,
May 12, 2004, 12:55:46 AM5/12/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:c7rd5l$4an$1...@pegasus.csx.cam.ac.uk...

snip

> Right. We are reaching common ground :-)
>
> I remain extremely interested in InfiniBand because of its potential
> for the combination of high bandwidth, low latency and cheapness.
> But I remain VERY suspicious of the jump from potential to actual
> performance/reliability/whatever,

I think we both would agree that if it occurs, it won't be a "jump", but
probably a crawl, and at best a slow walk. :-)

Christoph Hellwig

unread,
May 12, 2004, 2:03:59 AM5/12/04
to
On Tue, May 11, 2004 at 07:48:12PM -0500, Rob Warnock wrote:
> Well, to be more precise, SGI's Octane/Fuel/Origin/Altix systems
> provide page-sized [and *16KB* pages, at that!] I/O MMUs for PCI
> DMAs for 32-bit addressing only. For 64-bit PCI DMA addressing
> the PCI bus addresses are interpreted as absolute memory addresses
> [plus some upper flag bits], the driver must provide the virtual-
> to-physical translation (using the provided kernel services), and
> the DMA device itself must provide any scatter/gather functionality.

Still not the whole truth (and yes, I know the bridge/xbridge/pci
in detail because I'm hacking on the Linux port for the Origin in
my spare time): The bridge and xbridge ASICs actually have three
different DMA modes:

32bit direct mapped
64bit direct mapped
dma access through the iommu

64bit direct mapped dma is pretty much as you describe it, 32bit
direct mapped dma is very much like the 64bit direct mapped, except
well, it's just 32bits. iommu (pmu in SGI terminology) access allows
different I/O page sizes actually, although as you say it's usually 16k.

The PCI ASIC supports the three modes above for PCI and only 64bit direct
mappings for PCI-X.

Now the Linux driver model always lets the driver set up SG tables for
physically non-contiguous I/O and leaves it to the iommu driver to merge
them together, which the drivers for the sgi platforms don't do yet for
the 32bit non-direct mapped case, although I have some unfinished code
for that - but given that we're trying to use 64bit direct addressing
wherever possible on SGI systems it doesn't matter much anyway.

> Note that setting up and tearing down the 32-bit IO MMU mappings
> for every I/O operation is *very* CPU-intensive. Fortunately, almost
> all modern PCI cards provide both 64-bit addressing and scatter/gather
> descriptor rings...

I know on the SGI platforms 64bit direct addressing is usually
preferable, but Dave Miller tends to prefer using the iommu for
pci mapping in sparc, and given that he's a performance freak I'm
pretty sure he has some reasons for it.

Nick Maclaren

unread,
May 12, 2004, 3:27:28 AM5/12/04
to

In article <m7ioc.81517$Xj6.1...@bgtnsc04-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
|> "Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
|> news:c7rd5l$4an$1...@pegasus.csx.cam.ac.uk...
|>
|> > Right. We are reaching common ground :-)
|> >
|> > I remain extremely interested in InfiniBand because of its potential
|> > for the combination of high bandwidth, low latency and cheapness.
|> > But I remain VERY suspicious of the jump from potential to actual
|> > performance/reliability/whatever,
|>
|> I think we both would agree that if it occurs, it won't be a "jump", but
|> probably a crawl, and at best a slow walk. :-)

Yes, and it is a long and tortuous path. One of my main criteria
for telling a salesman from a salesdroid is that he understands my
question "Which aspects and uses are you going to support initially?"
The response "Everything from day one" doesn't inspire confidence.


Regards,
Nick Maclaren.

Message has been deleted

Nick Maclaren

unread,
May 12, 2004, 3:38:04 AM5/12/04
to

In article <ZrWdnQ_1zLb...@speakeasy.net>,

Maybe. I haven't studied the specification in enough detail to know
whether (a) it seems to be watertight and (b) what implementors are
likely to get wrong. I have been told by reliable people that MOST
of the HPC DMA interconnects are unsafe (sometimes very unsafe); the
reason that this rarely matters is that HPC users are rarely actively
hostile.

And there is a LOT more to safety than simply the separation of the
'privileged' versus 'unprivileged' operations, at least when systems
are shared between jobs and reliability is an issue. It is also
important to minimise denial of service, and that is a MUCH harder
task :-(

Note that, for reasons that we both know, I am not really classifying
the Origins (and probably not the Altix) as using that class of HPC
interconnect.


Regards,
Nick Maclaren.

Rob Warnock

unread,
May 12, 2004, 6:11:34 AM5/12/04
to
Christoph Hellwig <h...@engr.sgi.com> wrote:
+---------------
| Rob Warnock wrote:
| > Well, to be more precise, SGI's Octane/Fuel/Origin/Altix systems
| > provide page-sized [and *16KB* pages, at that!] I/O MMUs for PCI
| > DMAs for 32-bit addressing only. For 64-bit PCI DMA addressing
| > the PCI bus addreses are interpreted as absolute memory addresess
| > [plus some upper flag bits], the driver must provide the virtual-
| > to-physical translation (using the provided kernel services), and
| > the DMA device itself must provide any scatter/gather functionality.
|
| Still not the whole truth (and yes, I know the bridge/xbridge/pci
| in detail because I'm hacking on the Linux port for the Origin in
| my spare time): The bridge and xbridge ASICs actually have three
| different DMA modes:
|
| 32bit direct mapped
| 64bit direct mapped
| dma access through the iommu
+---------------

Yes, well, for simplicity in presenting my point I omitted the 32-bit
direct-mapped mode since -- with the exception of one notable @^%#$@^%$!#
brain-damaged ATM controller (I'm sure you know which one I mean) --
it's rarely ever used. The main problem (IIRC) is that there's only
one such mapping available per PCI *bus*, not per device. The other
problem is that the granularity of such a mapping is the *entire* address
range, that is, the mapping must be to a 4GiB boundary. Since mbufs
(to use one example) can be allocated in *any* NUMAlink node, using
32-bit direct-mapped mode for networking drivers practically guarantees
you'll have to use "bounce buffers" and bcopy-ing. (*Ugh!*)

+---------------


| Now the Linux driver model always lets the driver set up SG tables for
| physically non-contiguous I/O and leaves it to the iommu driver to merge
| them together, which the drivers for the sgi platforms don't do yet for
| the 32bit non-direct mapped case, although I have some unfinished code
| for that - but given that we're trying to use 64bit direct addressing
| wherever possible on SGI systems it doesn't matter much anyway.

+---------------

Plus, as I noted, setting up the I/O MMU was ex*PEN*sive in the
Origin 2000 Bridge in the usual config with external mapping SRAM
[for reasons I won't go into here]. The O3000 XBridge does not support
the external mapping SRAM, which is good in that the slow O2k workaround
code is not needed, but bad in that there are far fewer map table entries
available in the internal SRAM, which can definitely become an issue
with a PCI bus that has a number of networking devices on it.

+---------------


| > Note that setting up and tearing down the 32-bit IO MMU mappings
| > for every I/O operation is *very* CPU-intensive. Fortunately, almost
| > all modern PCI cards provide both 64-bit addressing and scatter/gather
| > descriptor rings...
|
| I know on the SGI platforms 64bit direct addressing is usually
| preferable, but Dave Miller tends to prefer using the iommu for
| pci mapping in sparc, and given that he's a performance freak
| I'm pretty sure he has some reasons for it.

+---------------

SPARC != NUMAlink. All I will say here is go look at the O3k XBridge
mapping code, and then ask yourself the relative cost of PIOs versus
DMAs on the two systems.

[We probably should drop the topic at this point. Feel free to contact
me privately if you have further questions.]

Joe Seigh

unread,
May 12, 2004, 7:29:54 AM5/12/04
to

Rob Warnock wrote:


>
> Joe Seigh <jsei...@xemaps.com> wrote:
> | You could eliminate the OS call now (with software changes of course).
> +---------------
>
> However, *safely* eliminating the O/S call often requires changing
> the *hardware*, since -- unless it was a design constraint given to
> the hardware engineers at the very beginning -- it is almost always
> > the case that the available granularity of mapping hardware register space
> into user mode does *not* match the partitioning of the hardware registers
> into "safe" and "unsafe" ones. That is, a "Go!" register might be
> in the same mapping granule as the register that sets the IO-to-host
> address mapping, and exposing the former to the user (safe) will also
> expose the *latter* to the user program (very unsafe!).
>
> The better-designed IBA HAs have addressed this issue...
>

I meant you don't need to do the actual NMI to talk to the kernel,
not eliminate the kernel. Anyway there's nothing new conceptwise
that you're discussing here. It's already been done for decades on
mainframes.

Joe Seigh

Christoph Hellwig

unread,
May 12, 2004, 7:50:34 AM5/12/04
to
On Wed, May 12, 2004 at 05:11:34AM -0500, Rob Warnock wrote:
> Yes, well, for simplicity in presenting my point I omitted the 32-bit
> direct-mapped mode since -- with the exception of one notable @^%#$@^%$!#
> brain-damaged ATM controller (I'm sure you know which one I mean) --
> it's rarely ever used.

Well, that shows your IRIX background :) In Linux it's not up to the
driver to choose the mapping method but rather up to the platform
specific I/O mapping code. This simplifies the driver greatly, especially
given the gazillions of different platforms Linux supports. It does
of course have downsides too as it doesn't allow fine-grained control
over the actual dma mappings, but in general Linux chooses
maintainability over squeezing out the last bit of performance and
relies on Moore's law for the remaining 0.001%.

> Plus, as I noted, setting up the I/O MMU was ex*PEN*sive in the
> Origin 2000 Bridge in the usual config with external mapping SRAM
> [for reasons I won't go into here].

I've actually chosen to deliberately ignore the external ATE SRAM for
the Linux port because of exactly that issue.

> The O3000 XBridge does not support
> the external mapping SRAM, which is good in that the slow O2k workaround
> code is not needed, but bad in that there are far fewer map table entries
> available in the internal SRAM, which can definitely become an issue
> with a PCI bus that has a number of networking devices on it.

Only if you don't use DAC mappings. And at least the tg3 that's shipped
with SGI Altixens uses DAC only.

> | I know on the SGI platforms 64bit direct addressing is usually
> | preferable, but Dave Miller tends to prefer using the iommu for
> | pci mapping in sparc, and given that he's a performance freak
> | I'm pretty sure he has some reasons for it.
> +---------------
>
> SPARC != NUMAlink. All I will say here is go look at the O3k XBridge
> mapping code, and then ask yourself the relative cost of PIOs versus
> DMAs on the two systems.

Linux runs on much more than just SGI hardware, so it's a good thing
drivers aren't hardcoded to use either method.

Stephen Fuld

unread,
May 12, 2004, 12:01:23 PM5/12/04
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:40A20BFA...@xemaps.com...

>
>
> Rob Warnock wrote:
> >
> > Joe Seigh <jsei...@xemaps.com> wrote:
> > | You could eliminate the OS call now (with software changes of course).
> > +---------------
> >
> > However, *safely* eliminating the O/S call often requires changing
> > the *hardware*, since -- unless it was a design constraint given to
> > the hardware engineers at the very beginning -- it is almost always
> > the case that the available granularity of mapping hardware register space
> > into user mode does *not* match the partitioning of the hardware registers
> > into "safe" and "unsafe" ones. That is, a "Go!" register might be
> > in the same mapping granule as the register that sets the IO-to-host
> > address mapping, and exposing the former to the user (safe) will also
> > expose the *latter* to the user program (very unsafe!).
> >
> > The better-designed IBA HAs have addressed this issue...
> >
>
> I meant you don't need to do the actual NMI to talk to the kernel,
> not eliminate the kernel.

I don't get what you are saying here. No one talked about NMI. The typical
kernel call is through some instruction such as Int (PCs) or SVC (IBM
mainframes) or ER (Univac descendant mainframes), not an NMI. On the
completion side, the external interrupt comes from the hardware and is
rarely, if ever, an NMI.

> Anyway there's nothing new conceptwise
> that you're discussing here. It's already been done for decades on
> mainframes.

Again, I am confused. On all mainframes I know of, (except for the
Burroughs descendants, which is a "whole nother thing") the actual I/O
instructions (e.g. SIOF) are privileged and can only be executed by the OS.
Similarly, on the completion side, the interrupt puts the system in kernel
or OS mode; the user only gets control after the OS has responded to the
interrupt, checked the status, etc. Thus the OS has to be involved in the
I/O operations in order to provide the safety required by most systems. IB
has a way of providing that safety without requiring OS intervention on
every I/O operation, and that results in reduced latency. This cannot be
done safely with most existing I/O architectures. And, it does require
changing the application software to make use of it, so it is not "software
transparent".

Nick Maclaren

unread,
May 12, 2004, 12:22:53 PM5/12/04
to

In article <nTroc.47103$Ut1.1...@bgtnsc05-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
|> "Joe Seigh" <jsei...@xemaps.com> wrote in message
|> news:40A20BFA...@xemaps.com...
|>
|> > Anyway there's nothing new conceptwise
|> > that you're discussing here. It's already been done for decades on
|> > mainframes.
|>
|> Again, I am confused. On all mainframes I know of, (except for the
|> Burroughs descendants, which is a "whole nother thing") the actual I/O
|> instructions (e.g. SIOF) are privileged and can only be executed by the OS.
|> Similarly, on the completion side, the interrupt puts the system in kernel
|> or OS mode; the user only gets control after the OS has responded to the
|> interrupt, checked the status, etc. Thus the OS has to be involved in the
|> I/O operations in order to provide the safety required by most systems. IB
|> has a way of providing that safety without requiring OS intervention on
|> every I/O operation, and that results in reduced latency. This cannot be
|> done safely with most existing I/O architectures. And, it does require
|> changing the application software to make use of it, so it is not "software
|> transparent".

There is actually one way to eliminate the kernel call with no change
needed to the application as such, which is to use the facility to
shift most of the management from the kernel to the language library.
Come back the System/360 Access Methods - and, in particular, chained
scheduling using PCIs - all is forgiven :-)

I don't think that was what he meant!

In any case, all that does is remove the need to redesign the
application as such - the library needs even more gutting than is
needed to provide direct support for DMA. And the kernel changes
are pretty pervasive, too. Conceptually, it isn't new, but IBM
dropped support for chained scheduling in MVT 21.7 (if I recall),
as they found it too fiendish to support. The I/O efficiency of
System/370 went downhill all the way from there ....

I think that could be done right, without the problems of MVT, but
it assuredly would not be easy, either to design or to implement.
And it would need some extra data structure hardware on at least
some architectures. At the VERY least, you need enough to be able
to maintain a consistent queue, with insertion and removal at each
end from multiple parallel threads running at different protection
levels, without introducing security exposures or requiring locking.

Not impossible, but DEFINITELY not trivial.
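
To give a flavour of the sort of primitive I mean, here is an untested
single-producer/single-consumer sketch; note that it works only because
each index has exactly one writer, and it says nothing about denial of
service or about mutually hostile producers:

    #include <stdint.h>

    #define RING_SIZE 256                    /* power of two */

    /* The producer (e.g. user code) writes only 'tail'; the consumer
       (e.g. the adapter or the kernel side) writes only 'head', so no
       lock is needed for this particular usage pattern. */
    struct ring {
        volatile uint32_t head;              /* next entry to consume */
        volatile uint32_t tail;              /* next entry to fill */
        uint64_t entry[RING_SIZE];           /* work descriptors */
    };

    static int ring_post(struct ring *r, uint64_t desc)
    {
        uint32_t t = r->tail;
        if (((t + 1) & (RING_SIZE - 1)) == r->head)
            return -1;                       /* full */
        r->entry[t] = desc;
        __sync_synchronize();                /* entry visible before the index update */
        r->tail = (t + 1) & (RING_SIZE - 1);
        return 0;
    }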


Regards,
Nick Maclaren.

Joe Seigh

unread,
May 12, 2004, 1:21:55 PM5/12/04
to

Nick Maclaren wrote:
[snip]


>
> I think that could be done right, without the problems of MVT, but
> it assuredly would not be easy, either to design or to implement.
> And it would need some extra data structure hardware on at least
> some architectures. At the VERY least, you need enough to be able
> to maintain a consistent queue, with insertion and removal at each
> end from multiple parallel threads running at different protection
> levels, without introducing security exposures or requiring locking.
>
> Not impossible, but DEFINITELY not trivial.
>

That's very close to what I was thinking. There are lock-free
queues and stack algorithms. There's also fast-pathed signaling
that will only invoke the kernel if a wait is actually required
(assuming you don't want to waste cpu cycles spin waiting).
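
Something like this (a rough sketch using the Linux futex call; the
completion-word convention here is an assumption, not any particular API):

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    /* Fast-pathed completion wait: poll the completion word a bounded
       number of times in user space, and only enter the kernel if the
       I/O really hasn't finished.  'done' is assumed to be set to 1 by
       the completion path, which would also issue a FUTEX_WAKE. */
    static void wait_for_completion(volatile int32_t *done)
    {
        for (int i = 0; i < 1000; i++)       /* cheap user-space fast path */
            if (*done)
                return;

        while (!*done)                       /* slow path: sleep in the kernel */
            syscall(SYS_futex, (int32_t *)done, FUTEX_WAIT, 0, NULL, NULL, 0);
    }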

Joe Seigh

Nick Maclaren

unread,
May 12, 2004, 1:48:37 PM5/12/04
to
In article <40A25E74...@xemaps.com>,
Joe Seigh <jsei...@xemaps.com> wrote:

You are, however, going quite a long way beyond what was standard
technology on traditional mainframes. One of the reasons that IBM
got rid of chained scheduling is that they really, really did NOT
want to have to support it in combination with virtual memory. In
fact, it was phased out before System/370 supported the modern form
of SMP - and, if I recall, before it supported the relevant parts of
the supervisor running on anything except CPU 0.

Don't underestimate the difficulty of getting this right.


Regards,
Nick Maclaren.

Message has been deleted

Joe Seigh

unread,
May 12, 2004, 2:15:41 PM5/12/04
to

I'm not. I used to be a kernel developer. Anyway I find
it rather strange to think it's magical if the code is
running on a special purpose processor from firmware rather
than on a general purpose processor. And the special purpose
processor doesn't even have a virtual memory unit. There
are some artificial and meaningless distinctions being
made here.

Joe Seigh

David C. DiNucci

unread,
May 12, 2004, 2:22:09 PM5/12/04
to

This discussion only seems to be strengthening my belief that there is
significant potential in the CDS (Cooperative Data Sharing) interface
(cds-bcr.sourceforge.net), especially as it might exploit Infiniband,
given sufficient expertise and resources applied to the problem. I do
not pretend to have IB expertise. CDS is based around shared consistent
queues with insertion and removal from parallel threads, but the
interface was designed more from a portability and usability perspective
than from a desire to support any particular communication fabric.

-Dave
--
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

Anne & Lynn Wheeler

unread,
May 12, 2004, 2:46:27 PM5/12/04
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> There is actually one way to eliminate the kernel call with no change
> needed to the application as such, which is to use the facility to
> shift most of the management from the kernel to the language library.
> Come back the System/360 Access Methods - and, in particular, chained
> scheduling using PCIs - all is forgiven :-)
>
> I don't think that was what he meant!
>
> In any case, all that does is remove the need to redesign the
> application as such - the library needs even more gutting than is
> needed to provide direct support for DMA. And the kernel changes
> are pretty pervasive, too. Conceptually, it isn't new, but IBM
> dropped support for chained scheduling in MVT 21.7 (if I recall),
> as they found it too fiendish to support. The I/O efficiency of
> System/370 went downhill all the way from there ....

virtual memory was on its way ... ludlow was possibly already doing
the mvt hack with ccwtrans from cp/67 on the 360/67 ... getting ready
for vs2/svs.

the os/360 standard was that the application (or application library)
built the real channel command programs (CCWs) and did an excp/svc0 to
the kernal. the kernel did a little sanity checking on the CCWs and
possibly prefixed it with set file mask CCW (aka wouldn't allow disk
commands to move the arm).

PCI was an I/O hardware interrupt sometimes used on long running
channel programs and would generate a hardware interrupt indicating
that the channel had executed a specific command in the sequence. this
could be reflected to the application in a PCI appendage. Frequently
the application PCI appendage would interpret various things and
dynamically modify the running channel program ... anticipating that
the PCI appendage got control and was able to change the channel
program ... before the channel execution got to the channel commands
that were being modified.

so we come to mvt->svs & virtual memory ... all the application-side
CCWs now had address with respect to virtual address space. the
EXCP/svc0 code now had to make a copy of the (virtual) ccws and
substitute real addresses for all the virtual addresses (as well as
pin/fix the affected pages until the i/o completed). In the PCI
appendage world .. the application code would now be modifying the
"virtual" ccws ... not the CCWs that were really executing.

So there were some operating system functions that ran in virtual
space that still needed to do these "on-the-fly" channel program
modifications ... like VTAM. Some of the solutions were run the
subsystem V=R (virtual equals real) ... so that the application
program CCWs could still be directly executed. Another in the CP/67 &
VM/370 world was to define a new virtual machine signal (diagnose
function) that signaled that there had been a modification to the
virtual CCWs and that the copied (real) CCWs had to be modified to
reflect the virtual channel program modifications.

a really big problem from the MVT real memory to SVS virtual memory
transition was the whole design point of the application space being
allowed to build the channel programs ... and that in the new virtual
memory environment ... channel programs still ran with "real memory"
addresses ... while standard processor instructions now could run with
virtual addresses. Under SVS, applications still continued to build
channel programs ... but they no longer could be the "real" channel
programs. The EXCP/SVC0 routine had to build a copy of the virtual
channel program commands, substituting real addresses for any virtual
addresses.

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/

Nick Maclaren

unread,
May 12, 2004, 2:50:02 PM5/12/04
to
In article <40A26B14...@xemaps.com>,

I am afraid that you are, largely because you WERE a kernel developer!
If you had tried to fix up the problems left to the application
because the supervisor/kernel people found them too hard, without
even being able to write privileged code, you would know what I mean.

This isn't helped by the problem being started by the hardware people
not providing the right primitives in the first place, but passing
problems on with interest is not solving them.

The reason that it is EASIER on a separate processor is that it means
that a team that has the power to provide the right primitives also
has the responsibility for doing the whole job. No more than that.
If the right primitives were provided, it could be done in normal
applications, too. They VERY rarely are, if ever.

On the other hand, the InfiniBand specification does the whole job
only at the level of actual access. There is a lot more to the
problem than that, as I have posted before.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
May 12, 2004, 2:57:08 PM5/12/04
to
In article <u7jvh8...@mail.comcast.net>,

Anne & Lynn Wheeler <ly...@garlic.com> wrote:
>
>virtual memory was on its way ... ludlow was possibly already doing
>the mvt hack with ccwtrans from cp/67 on the 360/67 ... getting ready
>for vs2/svs.

Could be. I had little contact with that.

>the os/360 standard was that the application (or application library)
>built the real channel command programs (CCWs) and did an excp/svc0 to
>the kernel. the kernel did a little sanity checking on the CCWs and
>possibly prefixed it with set file mask CCW (aka wouldn't allow disk
>commands to move the arm).

No, that could be done, even in unprivileged code, if the file was
volume allocated (but creating such a file was very privileged)!

>PCI was a I/O hardware interrupt sometimes used on long running
>channel programs and would generate a hardware interrupt indicating
>that the channel had executed a specific command in the sequence. this
>could be reflected to the application in a PCI appendage. Frequently
>the application PCI appendage would interpret various things and
>dynamically modify the running channel program ... anticipating that
>the PCI appendage got control and was able to change the channel
>program ... before the channel execution got to the channel commands
>that were being modified.

PCI in general was privileged. Chained scheduling was one of the few
uses that wasn't.

>so we come to mvt->svs & virtual memory ... all the application-side
>CCWs now had address with respect to virtual address space. the
>EXCP/svc0 code now had to make a copy of the (virtual) ccws and
>substitute real addresses for all the virtual addresses (as well as
>pin/fix the affected pages until the i/o completed). In the PCI
>appendage world .. the application code would now be modifying the
>"virtual" ccws ... not the CCWs that were really executing.

Quite :-) I never said that there weren't good reasons not to want
to support it under MVS ....

>So there were some operating system functions that ran in virtual
>space that still needed to do these "on-the-fly" channel program
>modifications ... like VTAM. Some of the solutions were run the
>subsystem V=R (virtual equals real) ... so that that the application
>program CCWs could still be directly executed. Another in the CP/67 &
>VM/370 world was to define a new virtual machine signal (diagnose
>function) that signaled that there had been a modification to the
>virtual CCWs and that the copied (real) CCWs had to be modified to
>reflect the virtual channel program modifications.

Oh, yes, but were there any unprivileged ones? I can't remember any.


Regards,
Nick Maclaren.

Eric

unread,
May 12, 2004, 3:51:40 PM5/12/04
to

While setting up the mappings may be expensive for some systems,
I don't see why it _needs_ to be so.
Is this just a particular implementation or is there
some broader problem you have in mind?

Eric


Eric

unread,
May 12, 2004, 3:54:56 PM5/12/04
to
Christoph Hellwig wrote:
>
> On Tue, May 11, 2004 at 03:10:03PM -0400, Eric wrote:
> > This is a solution. However it forces everyone to go out
> > and buy new equipment, whereas IOMMU does not. As most people
> > have PCI/32 and are not likely to replace their cards,
> > the iommu solution is better because it always works.
>
> It's hard to find a GigE NIC or SCSI/FC HBA without DAC,
> in fact DAC support is mandatory in PCI-X.

That is fine. However an OS is going to have to deal with
PCI-32 when it encounters it. Having the IOMMU allows that
to happen much more smoothly.

>
> > > What, scatter-gather I/O without iommu? Every OS running on PCs
> > > that I know does..
> >
> > How? The pages of a virtual buffer are scattered all over memory.
> > Without an iommu the OS has only two choices:
> > (a) copy to a contiguous bounce buffer
> > (b) do an I/O operation for each fragment.
> > The overhead of option (b) is higher due to the interrupt processing
> > and is still limited to 4 GB for PCI/32 so everyone does option (a).
> >
> > Is there something else that takes place?
>
> PCI hardware tends to support multiple dma operations per logical
> operation, aka scatter gather I/O. You hand the hardware a few
> dozen (typical limits are 128 or 256) I/O fragments (4k on x86)
> and it'll do the I/O on them.

Yes, that is _device_ sg while I was talking about _bus_ sg.
Device s-g is fine, if the device supports it, provided the io bus
address space >= memory space. If not, you are back to bounce
buffers which I was trying to avoid.

Eric


Eric

unread,
May 12, 2004, 3:43:00 PM5/12/04
to
Nick Maclaren wrote:
>
> In article <40A115E8...@sympaticoREMOVE.ca>,
> Eric <eric_p...@sympaticoREMOVE.ca> wrote:
> >Nick Maclaren wrote:
> >>
> >> |> Yes. My point was that knowledge of the mechanism and software
> >> |> support are not limiting factors. With bus scatter-gather mapping
> >> |> hardware, zero copy DMA is possible. Without it, both WNT and
> >> |> Linux must copy data into bounce buffers for the DMA.
> >>
> >> Actually, that is not so. It is not needed for zero copying, but
> >> the lack of it imposes certain restrictions on the operating system,
> >> interfaces and its usage.
> >
> >Without the scatter-gather hardware the restrictions are:
> >- the buffer lie within the range directly addressable by DMA
> > from that I/O bus, which is limited by the number of address
> > lines on that bus. For PCI that means <= 4GB, ISA <= 16 MB.
>
> Tedious, but acceptable for PCI on many current systems. Also, it
> is solved by a simple base offset register for the transfer, which
> is VASTLY less hardware support than scatter/gather!

I guess you missed my point, which was to avoid the
restrictions and use existing hardware and be efficient.

I also don't follow your design. You have multiple cards with
multiple devices which can have multiple transfers going at once.
Each buffer can be fragmented across N+1 page frames.
Seems to me that you'd need a bunch of mapping registers,
even for the restricted case.

The way I was thinking of it, there would be an s-g map
stored in main memory up to some maximum size (4K entries?).
The bus adaptor has a register that points to the base address
of the map table, a map limit register, and a 4 or 8 entry
fully assoc. translate cache. I don't think this would stress
technology limits too much.
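
In pseudo-C, roughly (all names and field layouts here are invented, just
to make the idea concrete):

    #include <stdint.h>

    /* Sketch of the in-memory bus scatter/gather map described above.
       Each entry maps one 4 KiB I/O-bus page to a system page frame. */
    struct iomap_entry {
        uint64_t phys_frame;   /* system physical page frame address */
        uint32_t valid;
        uint32_t flags;        /* e.g. read/write permission */
    };

    /* What the bus adapter would do on each inbound DMA address after
       missing in its small fully associative translate cache: index by
       the 4 KiB I/O page number into the map table whose base and limit
       live in the two adapter registers. */
    static int translate(const struct iomap_entry *map_base, uint32_t map_limit,
                         uint32_t bus_addr, uint64_t *sys_addr)
    {
        uint32_t index = bus_addr >> 12;        /* 4 KiB I/O pages */
        if (index >= map_limit || !map_base[index].valid)
            return -1;                          /* fault the transaction */
        *sys_addr = map_base[index].phys_frame | (bus_addr & 0xfff);
        return 0;
    }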

>
> >- the buffer memory address must not conflict with a device
> > address on the I/O bus, from the point of view of the I/O device.
> > This can happen if the cpu address space is larger than the
> > I/O bus address space.
>
> Well, yes, but dealing with that is fairly standard operating system
> technology, and has been since time immemorial.
>
> >- the I/O buffer length <= 4k (I/O buffers must not cross a
> > 4K boundary because pages allocated to virtual buffer would
> > probably not be contiguous).
>
> No, that is wrong. One of the restrictions I was referring to was
> the need to use only contiguous memory for transfers. That is a
> common and solved restriction for many HPC interconnects, and has
> been since the early days of virtual memory (the 1960s).
>
> >That is sufficiently restrictive that, to my knowledge,
> >no one implements what you suggest.
>
> Oh, yes, they do. Quite a few of the HPC interconnects have and
> probably still do work that way. Hitachi's SR2201 RDMA did, for
> example.
>
> Regards,
> Nick Maclaren.

I was looking for a general solution for PC devices. I don't see
anything in what you are saying that suggests that PCs
should NOT have bus s-g hardware just like other systems.

Eric


Eric

unread,
May 12, 2004, 4:01:00 PM5/12/04
to

I am aware that many mid/mainframe systems do have this hardware
but I didn't know whether Linux supported this or not. Now I do.
However I think that iommu support is just as valid for PCs also,
whatever the operating system.

Eric


Eric

unread,
May 12, 2004, 4:03:34 PM5/12/04
to
Stephen Fuld wrote:
>
> "Eric" <eric_p...@sympaticoREMOVE.ca> wrote in message
> news:409FCC99...@sympaticoREMOVE.ca...
> >
> > What are the slow "memory" operations you are referring to?
>
> It is the loads from memory mapped control registers that actually
> physically reside on the far side of PCI bridge chip (or equivalent). These
> can take multi-microseconds each. IB eliminates them by doing everything
> out of main memory.

I don't know the PCI details. Would this be 2 cycles (addr + data)
on a 33 MHz bus = 60 ns per uncontested bus access?
Is there more overhead?

>
> > If this is a DMA, then there would be a few device control register
> > writes to set up the DMA transfer, right? Granted they are much
> > slower than L1 cache, but I don't see how they can be avoided.
>
> You set up a data structure with all the control information in main memory
> then execute one instruction that points the I/O hardware to that data
> structure and says "go".

Ok, the I/O channel you mentioned. (Side thought: I was also
wondering if there was some way to simplify this down to an
async blob transfer. Something that retains mem mapped devices
but borrows from channels too. Different topic.)

>
> > It seems to me that the bigger overhead would be the OS call.
> > And (with appropriate hand waving) the OS could even pre-setup the
> > transfer by allocating and loading IO bus scatter-gather registers.
>
> Elimination of the OS call is a substantial latency reducer, but that does
> require software changes.
>

> > If you can find a machine which has them. I saw a To Be Defined
> > reference to bus scatter-gather support in a preliminary document
> > for an Intel bus interface chip, but it disappeared in the final draft.
> > I guess someone at Intel likes programmed IO for some strange reason.
> >
> > I have wondered for a long time why bus scatter-gather support wasn't
> > in the Intel PCI and ISA bus interface chips. It allows DMA to be
> > zero copy (to user space if desired), transparent to page boundaries,
> > and maps 24 bit ISA and 32 bit PCI into 36 bit system address space.
> > And WNT at least has had support for it from the very start.
> >
> > So it _should_ be well known technology.
>
> Scatter gather has been done in hardware for over four decades. But I'm not
> sure what you think that has to do with enabling zero copy I/O. You can do
> I/O into the user space in one chunk (no SG). Having SG does make some
> things simpler though.


>
> --
> - Stephen Fuld
> e-mail address disguised to prevent spam

It has to do with zero copy because it enables the general case of a
PC to do a DMA transfer of large (>4KB) fragmented buffer physically
scattered around a large (> 4GB) address space to a PCI-32 device.
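
For instance (an untested sketch; 'virt_to_bus' here is a stand-in for
whatever translation the OS or IOMMU layer provides, not a portable API):

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    /* Why a large user buffer turns into a fragment list: consecutive
       virtual pages usually map to scattered physical frames, so each
       page (or partial page) becomes its own DMA fragment. */
    typedef uint64_t (*virt_to_bus_fn)(const void *vaddr);   /* hypothetical */

    struct frag { uint64_t bus_addr; size_t len; };

    static size_t split_buffer(const void *buf, size_t len, virt_to_bus_fn virt_to_bus,
                               struct frag *out, size_t max_frags)
    {
        const char *p = buf;
        size_t n = 0;
        while (len && n < max_frags) {
            size_t in_page = PAGE_SIZE - ((uintptr_t)p & (PAGE_SIZE - 1));
            size_t chunk = (len < in_page) ? len : in_page;
            out[n].bus_addr = virt_to_bus(p);   /* may exceed 4 GB: needs DAC or an IOMMU */
            out[n].len = chunk;
            p += chunk;
            len -= chunk;
            n++;
        }
        return len ? 0 : n;                     /* 0 = didn't fit in max_frags */
    }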

Eric

Nick Maclaren

unread,
May 12, 2004, 4:25:01 PM5/12/04
to
In article <40A27E44...@sympaticoREMOVE.ca>,

Eric <eric_p...@sympaticoREMOVE.ca> wrote:
>
>I guess you missed my point, which was to avoid the
>restrictions and use existing hardware and be efficient.

Which is a combination that can't be done cheaply. My point is that
you can get 75% of the gain for 5% of the pain. And, yes, it really
is as little as 5%.

>I also don't follow your design. You have multiple cards with
>multiple devices which can have multiple transfers going at once.
>Each buffer can be fragmented across N+1 page frames.
>Seems to me that you'd need a bunch of mapping registers,
>even for the resticted case.

Why do you fragment the buffers? You don't HAVE to, you know.


Regards,
Nick Maclaren.

Anton Rang

unread,
May 12, 2004, 5:05:36 PM5/12/04
to
Andi Kleen <fre...@alancoxonachip.com> writes:

> Christoph Hellwig <h...@engr.sgi.com> writes:
>
> > I know on the SGI platforms 64bit direct addressing is usually
> > preferable, but Dave Miller tends to prefer using the iommu for
> > pci mapping in sparc, and given that he's a performance freak I'm
> > pretty sure he has some reasons for it.
>
> Some SCSI controllers run faster when you do the scatter gather in the
> chipset instead of the controller. I had some good results (~5-7% better
> performance) doing that on Opteron with an LSI MPT Fusion controller.
>
> I presume that's because the IOMMU runs with a faster frequency and is
> true hardware compared to the microcoded scatter gather engine in the
> slower clocked HBA.

That's probably part of it (though the newer Adaptec SCSI HBAs have
a scatter/gather engine with enough hardware so that the microcoded
bit never becomes a performance bottleneck as long as you've got 64
bytes or so in each s/g segment; others may as well but I don't know).

The bigger issue is probably PCI bus turnaround time and interaction
with buffers in the memory-to-PCI bridge. The turnaround time can be
substantial if the s/g entries are relatively small, but if they're at
least 4K or so, is negligible. However, the memory-to-PCI bridge
tends to be doing readahead/writebehind which can introduce quite a
bit of extra delay for writes to disk from memory. (Since PCI
requests aren't pipelined, when you hit the s/g boundary and start
requesting data at a new address, the bridge has to stop any prefetch
operation in progress, start a request at the new address, wait for
data to start coming in, and then wait for the SCSI controller to
retry the operation, assuming that memory is slow enough that it's
reasonable or required to issue a PCI retry.)

-- Anton

Rob Warnock

unread,
May 13, 2004, 2:08:19 AM5/13/04
to
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
+---------------

| I think that could be done right, without the problems of MVT, but
| it assuredly would not be easy, either to design or to implement.
| And it would need some extra data structure hardware on at least
| some architectures. At the VERY least, you need enough to be able
| to maintain a consistent queue, with insertion and removal at each
| end from multiple parallel threads running at different protection
| levels, without introducing security exposures or requiring locking.
+---------------

To bring this back to IBA... The better 4x IBA chips provide what
is effectively a queue per connection, thus the queues can live in
(pinned) user space...
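
In outline (a sketch with invented structure names; a real stack would
also have to register the queue memory with the adapter through the
kernel, which is what actually pins and translates it -- mlock() alone
just keeps it resident):

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    /* One work queue per connection, living in ordinary user memory. */
    struct work_queue {
        volatile uint32_t producer;       /* written by the application */
        volatile uint32_t consumer;       /* written back by the adapter */
        uint64_t          wqe[256];       /* work queue entries */
    };

    static struct work_queue *alloc_queue(void)
    {
        struct work_queue *q;
        if (posix_memalign((void **)&q, 4096, sizeof(*q)) != 0)
            return NULL;
        if (mlock(q, sizeof(*q)) != 0) {  /* keep it resident for the adapter's DMA */
            free(q);
            return NULL;
        }
        q->producer = q->consumer = 0;
        return q;
    }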

Nick Maclaren

unread,
May 13, 2004, 4:01:03 AM5/13/04
to

In article <hYednShGLYd...@speakeasy.net>,

rp...@rpw3.org (Rob Warnock) writes:
|> Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
|> +---------------
|> | I think that could be done right, without the problems of MVT, but
|> | it assuredly would not be easy, either to design or to implement.
|> | And it would need some extra data structure hardware on at least
|> | some architectures. At the VERY least, you need enough to be able
|> | to maintain a consistent queue, with insertion and removal at each
|> | end from multiple parallel threads running at different protection
|> | levels, without introducing security exposures or requiring locking.
|> +---------------
|>
|> To bring this back to IBA... The better 4x IBA chips provide what
|> is effectively a queue per connection, thus the queues can live in
|> (pinned) user space...

Which, in my terms, is providing the primitives needed to do the job.


Regards,
Nick Maclaren.

Jan Vorbrüggen

unread,
May 13, 2004, 4:40:17 AM5/13/04
to
> I was looking for a general solution for PC devices. I don't see
> anything in what you are saying that suggests that that PCs
> should NOT have bus s-g hardware just like other systems.

As I see it, you

- either put scatter-gather capability into the bus adapter, but then
you need to update the drivers (and OS I/O support utility routines)
to use it,
- or you supply an application the ability to allocate physically contiguous
regions, and allow it to use larger-than-page-size DMAs - which can
already be done in Linux because it has this support built in, you've
told us.

TANSTAAFL.

Jan

glen herrmannsfeldt

unread,
May 12, 2004, 5:25:38 PM5/12/04
to
Stephen Fuld wrote:

(snip)

> Again, I am confused. On all mainframes I know of, (except for the
> Burroughs descendants, which is a "whole nother thing") the actual I/O
> instructions (e.g. SIOF) are privileged and can only be executed by the OS.

Presumably not including virtual SIO's done under VM.

There is also EXCP, which gives user programs much of the
power of SIO, as long as they stay within appropriate bounds.

(snip)

-- glen

glen herrmannsfeldt

unread,
May 12, 2004, 5:59:15 PM5/12/04
to
Nick Maclaren wrote:

(snip)

> In any case, all that does is remove the need to redesign the
> application as such - the library needs even more gutting than is
> needed to provide direct support for DMA. And the kernel changes
> are pretty pervasive, too. Conceptually, it isn't new, but IBM
> dropped support for chained scheduling in MVT 21.7 (if I recall),
> as they found it too fiendish to support. The I/O efficiency of
> System/370 went downhill all the way from there ....

I am not so sure that I understand chained scheduling, but
S/370 added RPS (rotational position sensing) on DASD, and
I would think that would change the I/O scheduling system.

-- glen

glen herrmannsfeldt

unread,
May 12, 2004, 5:29:47 PM5/12/04
to
Nick Maclaren wrote:
(snip)

> You are, however, going quite a long way beyond what was standard
> technology on traditional mainframes. One of the reasons that IBM
> got rid of chained scheduling is that they really, really did NOT
> want to have to support it in combination with virtual memory. In
> fact, it was phased out before System/370 supported the modern form
> of SMP - and, if I recall, before it supported the relevant parts of
> the supervisor running on anything except CPU 0.

> Don't underestimate the difficulty of getting this right.

There was the multiprocessor 360/65 with appropriate
version of OS/360. It might be that only one processor
did all the I/O.

-- glen

glen herrmannsfeldt

unread,
May 12, 2004, 5:37:36 PM5/12/04
to
Anne & Lynn Wheeler wrote:
(snip)

> the os/360 standard was that the application (or application library)
> built the real channel command programs (CCWs) and did an excp/svc0 to
> the kernal. the kernel did a little sanity checking on the CCWs and
> possibly prefixed it with set file mask CCW (aka wouldn't allow disk
> commands to move the arm).

> PCI was a I/O hardware interrupt sometimes used on long running
> channel programs and would generate a hardware interrupt indicating
> that the channel had executed a specific command in the sequence. this
> could be reflected to the application in a PCI appendage. Frequently
> the application PCI appendage would interpret various things and
> dynamically modify the running channel program ... anticipating that
> the PCI appendage got control and was able to change the channel
> program ... before the channel execution got to the channel commands
> that were being modified.

Can you explain the VM command SET ISAM, and the self modifying
channel programs used by ISAM?

-- glen

Nick Maclaren

unread,
May 13, 2004, 6:17:35 AM5/13/04
to

In article <mDwoc.77369$0H1.7209471@attbi_s54>,
glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:

|> Stephen Fuld wrote:
|>
|> > Again, I am confused. On all mainframes I know of, (except for the
|> > Burroughs descendants, which is a "whole nother thing") the actual I/O
|> > instructions (e.g. SIOF) are privileged and can only be executed by the OS.
|>
|> Presumably not including virtual SIO's done under VM.

You mean one of the VM extension operations? I thought that most
were privileged, but used almost none of them.

|> There is also EXCP, which gives user programs much of the
|> power of SIO, as long as they stay within appropriate bounds.

Grrk. Not really. All it did was allow the user to bypass the
Access Methods. The restrictions on what could be done were
pretty comparable for it and BSAM/BDAM/BPAM, for example.

|> I am not so sure that I understand chained scheduling, but
|> S/370 added RPS (rotational position sensing) on DASD, and
|> I would think that would change the I/O scheduling system.

Not really. All it did was to permit disconnexion while the
disk rotated, thus allowing other, unrelated operations to go
ahead. It didn't make a lot of difference to a single CCW chain.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
May 13, 2004, 6:23:15 AM5/13/04
to

In article <fHwoc.78702$Ik.5794512@attbi_s53>,

glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
|>
|> There was the multiprocessor 360/65 with appropriate
|> version of OS/360. It might be that only one processor
|> did all the I/O.

That was correct, yes, and the situation lasted for a long
time after that. I can't remember when the supervisor started
to allow parallel I/O, but I am pretty certain that it was
around the time of the MVT/MVS changeover.

Also, until well into the life of MVS, genuine parallel
applications were almost unknown. There was therefore none of
the problem with one thread scheduling I/O into a shared data
area, and another thread accessing that area.


Regards,
Nick Maclaren.

Joe Seigh

unread,
May 13, 2004, 6:55:15 AM5/13/04
to

Everyone has been going on about this scatter gather for a while. I assume
this only works on real memory addresses rather than virtual ones. While
this may solve the problem of discontiguous pages it doesn't solve the problem
of using it safely from user space. The pages would have to be pinned to
prevent remapping of memory to that of other address spaces. Except that
you can't trust the user application to do the proper coordination, so you
have to do the actual i/o from the kernel and that would involve syscalls.

> TANSTAAFL.

Yep.

Joe Seigh

Jan Vorbrüggen

unread,
May 13, 2004, 7:22:31 AM5/13/04
to
>I assume this only works on real memory addresses rather than virtual ones.

Alternatively, you put a page table walker into your bus or device chip -
see the CI adapters for VAXclusters. It resulted in the adapter having
more compute power than its host in at least the case of the 730...

> While this may solve the problem of discontiguous pages it doesn't solve
> the problem of using it safely from user space. The pages would have to
> be pinned to prevent remapping of memory to that of other address spaces.

Yep. But that's a one-time setup cost, and can be done safely: the user
requests the kernel to prepare a certain buffer for direct I/O, and the
kernel not only sets up the scatter-gather map, it also pins the pages.
Real OSes have been doing this for decades. The only new thing (to a degree)
is to allow the user access to the device (hopefully in a safe way) such
that she can initiate data transfer directly from user mode.
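
As a rough sketch of what that setup looks like in the later OpenIB verbs
interface (an illustration only, not any particular 2004 vendor API; the
protection domain 'pd', queue pair 'qp', and the peer's 'remote_addr'/'rkey'
are assumed to exist already, and error handling is omitted):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Register a buffer so the adapter can DMA it directly: the kernel pins
     * the pages and builds the scatter-gather/translation entries once, and
     * the returned keys are what user-mode work requests use afterwards. */
    static void rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                                   uint64_t remote_addr, uint32_t rkey)
    {
        size_t len = 1 << 20;
        void *buf = malloc(len);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,          /* proof of the registration above */
        };

        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad;
        ibv_post_send(qp, &wr, &bad);    /* no kernel call on this fast path */
        /* completion would be reaped from the send CQ; omitted here */
    }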

Jan

Joe Seigh

unread,
May 13, 2004, 7:59:05 AM5/13/04
to

Jan Vorbrüggen wrote:
>
> >I assume this only works on real memory addresses rather than virtual ones.
>
> Alternatively, you put a page table walker into your bus or device chip -
> see the CI adapters for VAXclusters. It resulted in the adapter having
> more compute power than its host in at least the case of the 730...

There is some rather non-trivial coordination among general purpose
processors to maintain a consistent view of the virtual memory mapping
when that mapping gets changed.

>
> > While this may solve the problem of discontiguous pages it doesn't solve
> > the problem of using it safely from user space. The pages would have to
> > be pinned to prevent remapping of memory to that of other address spaces.
>
> Yep. But that's a one-time setup cost, and can be done safely: the user
> requests the kernel to prepare a certain buffer for direct I/O, and the
> kernel not only sets up the scatter-gather map, it also pins the pages.
> Real OSes have been doing this for decades. The only new thing (to a degree)
> is to allow the user access to the device (hopefully in a safe way) such
> that she can initiate data transfer directly from user mode.
>

You'd have to be able to coordinate multiple concurrently active scatter-gather
mappings in a multi-processor environment and ensure that concurrent user processes
each only access their proper scatter-gather mappings. Good luck.

Joe Seigh

Jan Vorbrüggen

unread,
May 13, 2004, 8:15:14 AM5/13/04
to
> There is some rather non-trivial coordination among general purpose
> processors to maintain a consistent view of the virtual memory mapping
> when that mapping gets changed.

Sure enough. IIRC, in the case of the CI adapters they were mappings of
kernel address space (that probably could double-map process address space
for doing direct I/O to user space). As each mapping is specific to only
one host processor, all its OS has to do is to track relevant changes of
kernel address space - and, again IIRC, those couldn't occur by design.

> You'd have to be able to coordinate multiple concurrently active scatter-gather
> mappings in a multi-processor environment and ensure that concurrent user processes
> each only access their proper scatter-gather mappings. Good luck.

Well, I'd not consider "swapping" scatter-gather mappings - either the
hardware supports one more, or the next comer is refused. With regard to
MP environments, just as the pages are pinned in memory, so is the process
on its processor (at least, that is the simplest solution).

Jan


Joe Seigh

unread,
May 13, 2004, 8:28:26 AM5/13/04
to

Jan Vorbrüggen wrote:

>
> > You'd have to be able to coordinate multiple concurrently active scatter-gather
> > mappings in a multi-processor environment and ensure that concurrent user processes
> > each only access their proper scatter-gather mappings. Good luck.
>
> Well, I'd not consider "swapping" scatter-gather mappings - either the
> hardware supports one more, or the next comer is refused. With regard to
> MP environments, just as the pages are pinned in memory, so is the process
> on its processor (at least, that is the simplest solution).
>

Well, one, the RDMA device would be shared by multiple processors. Two,
while you may be able to control process dispatching, the network side
activity is not under dispatcher control.

I suppose if you restricted it to a uniprocessor and a single user process,
with the goal of building a cluster out of those components specifically,
then it's probably doable.

Joe Seigh

Eric

unread,
May 13, 2004, 11:17:09 AM5/13/04
to
Jan Vorbrüggen wrote:
>
> > I was looking for a general solution for PC devices. I don't see
> > anything in what you are saying that suggests that PCs
> > should NOT have bus s-g hardware just like other systems.
>

I assume you are talking to me since you are quoting me.



> As I see it, you
>
> - either put scatter-gather capability into the bus adapter, but then
> you need to update the drivers (and OS I/O support utility routines)
> to use it,

This is how I would do it. The OS support is already present in both
WNT and Linux. There appears to be little difference: in WNT it is
part of the 'standard' driver architecture such that all DMA devices
would get bus s-g support if the hardware was present.
In Linux, I gather from an O'Reilly & Assoc. driver book on the
2.4 kernel, driver writers optionally add support by passing
a 'scatterlist' to the 'pci_map_sg' routine.
So it is just a rose by another name.
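
For concreteness, a minimal sketch of that driver-side flow on the 2.4/2.6-era
interface. Filling in the scatterlist entries is elided because the field
names changed between kernel versions, and hw_write_descriptor()/hw_start()
are hypothetical stand-ins for the device-specific part:

    #include <linux/pci.h>
    #include <linux/errno.h>

    /* Hypothetical device-specific helpers, assumed to exist elsewhere. */
    extern void hw_write_descriptor(struct pci_dev *pdev, int slot,
                                    dma_addr_t bus_addr, unsigned int len);
    extern void hw_start(struct pci_dev *pdev);

    static int start_dma(struct pci_dev *pdev, struct scatterlist *sg, int nents)
    {
        int i, mapped;

        /* Ask the PCI layer (and IOMMU, if present) for bus addresses. */
        mapped = pci_map_sg(pdev, sg, nents, PCI_DMA_TODEVICE);
        if (mapped == 0)
            return -EIO;

        /* Program one device descriptor per mapped segment. */
        for (i = 0; i < mapped; i++)
            hw_write_descriptor(pdev, i, sg_dma_address(&sg[i]),
                                sg_dma_len(&sg[i]));

        hw_start(pdev);                      /* the "go" register write */

        /* ... and on completion:
         * pci_unmap_sg(pdev, sg, nents, PCI_DMA_TODEVICE); */
        return 0;
    }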

> - or you supply an application the ability to allocate physically contiguous
> regions, and allow it to use larger-than-page-size DMAs - which can
> already be done in Linux because it has this support built in, you've
> told us.

I don't think I said this, though others did. Yes, one could, but it of
course has all sorts of nasty memory-management considerations too.

>
> TANSTAAFL.
>
> Jan

(TANSTAAFL = There Ain't No Such Thing As A Free Lunch)

I didn't suggest there was a free lunch. However it looks
like the tab was already picked up a long time ago.

At any rate, if one is discussing general zero-copy DMA on PCs,
getting (cache-coherent) bus scatter-gather into the chipsets
looks to me like an important forgotten prerequisite.

Eric

Anne & Lynn Wheeler

unread,
May 13, 2004, 11:49:16 AM5/13/04
to
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> Grrk. Not really. All it did was allow the user to bypass the
> Access Methods. The restrictions on what could be done were
> pretty comparable for it and BSAM/BDAM/BPAM, for example.

little explanation: (warning: long, wandering post)

EXCP/SVC0 was the (os/360 I/O) call from application space to the
supervisor/kernel ... whether it was actual user code or
access-method/library code. essentially the access methods were
library code that ran in user space, generated CCWs (with real
addresses) and performed a EXCP/SVC0. as a result, application
programs were pretty much free to also do anything that the
access-methods did. At this level, almost everything used a pointer
passing convention ("by reference" rather than "by value").

CCW -- channel command word; sequences of CCWs were channel programs
EXCP -- EXecute Channel Program

A typical disk channel program used to be:

seek BBCCHH position disk head
search match record information
tic *-8 branch back to search
read/write physical address

....

BBCCHH -- B: two bytes BIN, C: two bytes cylinder, H: two bytes head
BIN was typically zero, it was carry-over from 1960s 2321 datacell,
a device that was something like large washing machine with bins
positioned inside the cylinder. the bins rotated under the read/write
head. bins had long strips of magnetic material, which the read/write
head extracted and re-inserted from/to the bins.

search if criteria was successful, it skipped the next CCW, otherwise it fell
thru to the immediate next CCW. typical search was for "identifier
equal", looping, examining each record until it found match

Processing for disks in an EXCP/SVC0 environment would have the
supervisor generate a CCW sequence of

seek BBCCHH
set file mask
tic <address of user space ccws>

so rather than starting the channel program with the first user space
CCW, it positioned the head and then used the "set file mask" command
to limit what following commands were valid/invalid; i.e. read or
write allowed, head switching allowed, diagnostic commands allowed.

normally channel programs ran asynchronously to the processor and
generated a I/O interrupt when complete. It was possible to turn
on the PCI-flag (programmed controlled interrupt) in a CCW which
would queue a kind of soft I/O interrupt.

scatter/gather could be accomplished by having a whole sequence
of search/read/write commands chained together.

within a read/write command it was possible to do scatter/gather with
"data chaining" ... a sequence of two or more CCWs where it only took
the command from the first CCW and the remaining CCWs were only used
for address and length fields.

in the move from 360 to 370 there was starting to be a timing problem.
channel programs are defined as being exactly serialized; the channel
can't fetch the next CCW until the previous CCW has completed (aka no
prefetching). There were starting to be scatter/gather timing
problems, especially with operations that had previously been single
real read/write CCW that now had to be copied/translated into
scatter/gather sequence with non-contiguous virtual pages. The problem
was that in some cases, the channel wasn't able to fetch the next
(data chained) CCW and extract the address before the transferring
data had overrun what limited buffering and/or the disk head had
rotated past position. IDALs were introduced (indirect data address
list) ... a flag in the CCW changed the address field from pointing
directly to target real data address to pointing at a list of data
addresses. It preserved the requirement that CCWs could not be
prefetched ... but allowed the channel to prefetch IDALs.
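
To make the format concrete, here is a rough sketch of a format-0 CCW and its
flag bits as described above -- an illustration drawn from this description,
not a substitute for the Principles of Operation:

    #include <stdint.h>

    /* 8-byte format-0 CCW: command, 24-bit real data address, flags, count. */
    struct ccw0 {
        uint8_t  cmd;        /* read, write, seek, search, TIC, ...           */
        uint8_t  addr[3];    /* 24-bit data address (or IDAL address, if IDA) */
        uint8_t  flags;
        uint8_t  unused;
        uint16_t count;      /* byte count                                    */
    };

    #define CCW_CD   0x80    /* chain data: next CCW supplies address/count only */
    #define CCW_CC   0x40    /* chain command: continue with the next command    */
    #define CCW_SLI  0x20    /* suppress incorrect-length indication             */
    #define CCW_SKIP 0x10    /* skip data transfer                               */
    #define CCW_PCI  0x08    /* program-controlled interruption                  */
    #define CCW_IDA  0x04    /* address points at an indirect data address list  */

    /* Scatter/gather by data chaining: the command comes from the first CCW,
     * the second contributes only another address and length:
     *
     *     read   buf_a,len_a    flags=CD
     *     (data) buf_b,len_b    flags=0
     *
     * With IDA set instead, one CCW points at a list of indirect data address
     * words, which the channel is allowed to prefetch even though the CCWs
     * themselves must not be. */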

here is google HTML translation of Share PDF file giving intro to
channel programming:
http://216.239.57.104/search?q=cache:ilwHKHohAMUJ:www.share.org/proceedings/sh95/data/S2874A.PDF

it covers other features introduced over the years, things like
rotational position sensing and fixed-head architecture. it has some
sample disk programs ... including "EXCP to simulate BPAM".

note that later, IDALs solved another problem. The address field in
the CCW is 24bits; while IDAL entries were full 32bits. The 3033 was
a 24bit addressing machine but offered option for more than 16mbytes
of real memory. The page table entry format specified a 12bit page
number for 4k pages (giving 24bit addressing). However, there were two
stray, unused bits in the PTE which could be scavenged and
concatenated with the page number to address up to 64mbytes of real
pages. This allowed multiple 24-bit virtual address spaces ... that
could have pages resident in more than 16mbytes of real memory. IDALs
then provided the addressing for doing I/O into/out-of real memory
above the 16mbyte line. The actual CCWs and IDALs were limited to
being below the 16mbyte line ... but the target data addresses could
be above the 16mbyte line. This carried forward to 370-XA and the
introduction of 31-bit virtual addressing. It was now possible to have
2gbyte real memories and 2gbyte virtual memories ... but (the real)
CCWs and IDALs were still limited to being below the 16mbyte line.

All library code running in application space coupled with everything
oriented towards pointer passing somewhat gave rise to the later
access register architecture. For various reasons, there was motivation
to move various code out of application address space (in 24bit
virtual days ... it was beginning to consume large parts of the
address space; at some installations, available space to applications was
down to as little as 4mbytes out of 16mbytes; also raised was
integrity of library code being out of the user address
space). However, they didn't want to give up the pointer passing
convention and the efficiency of direct branch calls (w/o having to go
thru a kernel call). So new tables were built and control registers
were set up and a "program call" instruction invented. Program call
effectively emulated the old branch&link subroutine call ... but
specified an address in a different virtual address space ... under
control of a protected table. So now, library code is running in a
different address space and it is being passed pointers from the user
application address space. There now also have to be instructions
(cross-memory services) where the library code can differentiate
between instruction address arguments for the current virtual address
space and the calling application address space.

discussion of address types: absolute, real, virtual, primary virtual,
secondary virtual, AR-specified, home virtual, logical, instruction,
and effective:
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/3.2.1?SHELF=EZ2HW125&DT=19970613131822&CASE=

translation control (and various different virtual address spaces):
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/3.11.1?SHELF=EZ2HW125&DT=19970613131822

a little multiple virtual address space overview
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/3.8?SHELF=EZ2HW125&DT=19970613131822

changing address spaces
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/3.8.1?SHELF=EZ2HW125&DT=19970613131822

set address space control instruction
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/10.33?SHELF=EZ2HW125&DT=19970613131822

program call instruction
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/10.26?SHELF=EZ2HW125&DT=19970613131822

program return instruction
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/10.27?SHELF=EZ2HW125&DT=19970613131822

program transfer instruction
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9AR004/10.28?SHELF=EZ2HW125&DT=19970613131822

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/


Nick Maclaren

unread,
May 13, 2004, 12:06:54 PM5/13/04
to
In article <uzn8c7...@mail.comcast.net>,

Anne & Lynn Wheeler <ly...@garlic.com> wrote:
>nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>> Grrk. Not really. All it did was allow the user to bypass the
>> Access Methods. The restrictions on what could be done were
>> pretty comparable for it and BSAM/BDAM/BPAM, for example.
>
>little explanation: (warning: long, wandering post)
>
>EXCP/SVC0 was the (os/360 I/O) call from application space to the
>supervisor/kernel ... whether it was actual user code or
>access-method/library code. essentially the access methods were
>library code that ran in user space, generated CCWs (with real
>addresses) and performed a EXCP/SVC0. as a result, application
>programs were pretty much free to also do anything that the
>access-methods did. At this level, almost everything used a pointer
>passing convention ("by reference" rather than "by value").

Thanks for taking the trouble to explain this - I baulked!

I have one slight niggle.

The term EXCP was also a pseudo access method for DCBs set up
with MACRF=(E), and set bit X'80' in DCBMACRF. There were a few
differences that depended on that in the various appendages, such
a couple for BPAM, though I now forget the details. Anyway, an
unprivileged program could not emulate BPAM precisely using EXCP,
though a privileged one could.

And, similarly, an unprivileged program could do a few things using
MACRF=(E) that it could not do using any of the other access methods.
Again, I can't remember the details, though I think I still have
code that does that. But they were very much details, and nothing
fundamental.

That was why I wrote what I did :-)


Regards,
Nick Maclaren.

Del Cecchi

unread,
May 13, 2004, 12:19:37 PM5/13/04
to
Just came across this article, which seems on topic.
http://www.nwfusion.com/news/2004/0412specialfocus.html

I was going to edit the article and post the text. But that wasn't really
possible. It seems pretty objective.

del cecchi


Anne & Lynn Wheeler

unread,
May 13, 2004, 12:32:39 PM5/13/04
to
glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
> Can you explain the VM command SET ISAM, and the self modifying
> channel programs used by ISAM?

ISAM had long, wandering CCW sequences ... and all the structure
of everything was out on disk (more than you really want to know).

Modern days, you have lots of incore tables ... it tells you where you
want to go and a i/o command is generated to read/write that record.

in ckd "dasd", there was early convention of lots of the structure
being on disk (in part of really limited real storage). "search"
commands could search for id equal, unequal, greater than, less than,
etc. in vtoc/PDS convention ... you might know the "identifier" of the
record you want but only the general area that it might be located
in. A multi-track search is generated of "identifier-equal" and turned
loose to scan the whole cylinder for the specific record.

ISAM got even more complicated .... structure on disk could have
record identifiers of the actual record you were looking for .. so
you have these long wandering CCW sequences ... that search for a
record of a specific match (high, low, equal) and reads record
identifier for other reocrds which are then searched for (high, low,
equal).

so here is hypothetical channel program example:

seek BBCCHH
search condition (hi, low, equal) identifier
tic *-8
read BBCCHH-1
read identifier-1
seek BBCCHH-1
search (hi,low,equal) identifier-1
read BBCCHH-2
read identifier-2
seek BBCCHH-2
search (hi,low,equal) identifier-2
read/write data

so process is somewhat:
1) seek to known location
2) search for a known identifier record
3) read new location-1
4) read new identifier-1
5) seek to new location-1
6) search for new identifier-1 record
7) read new location-2
8) read new identifier-2
9) seek to new location-2
10) search for new identifier-2 record
11) read/write data

now all of the "read" operations for BBCCHHs and identifiers are
addresses in the virtual address space.

for CP67/VM370 ... the CCWs as well as the BBCCHH fields are copied to
real storage and the seek CCWs updated to reflect the copied BBCCHH
fields. The example channel program starts and eventually reads the
BBCCHH-1 into the real page containing the virtual page. The channel
program then gets to the seek CCW that references the BBCCHH-1 field.
However, the real seek is pointing to the copy of the contents of the
BBCCHH-1 location ... not the virtual address location where the new
BBCCHH-1 value has just been read.

So the SET ISAM option turns on additional scanning of CCW sequences
(during the copy/translation process) to specifically recognize case
of previous CCW in the chain reading a value needed as an argument by
a subsequent CCW. There is also some restrictions that these have to
be disks where there is a one-to-one mapping between virtual disk
location and real disk location ... and there are no integrity issues
with one virtual machine being able to seek into a non-authorized area
of the disk.

The example channel program could even have something at the end that
read a new value into the original BBCCHH field and a "tic"
instruction that branched back to the first CCW and looped the whole
process all over again.

think of it as sort of register dependency checking for instruction
out-of-order, prefetching

a couple months out of the university (summer of '70), i got selected
to do onsite support for a customer that was trying to bring up large ISAM
production under cp/67 ... and this was before any ISAM support
whatsoever. This was still in the days where corporations put their
datacenter on display behind glass on first floor of tall corporate
buildings. This place ran normal operation during the day and I got
the machine from midnight to 7am or so. You are sitting there doing
some all night debugging and the early birds are starting to walk by
on their way to work and stare at you.

then there is my soap-box that this is all from the days where there
was significant constraints on memory and significant extra I/O
capacity ... so there was a design point trade-off of I/O for memory
(significant I/O resources were consumed to save having tables of
stuff in memory). by the mid-70s, the technology had shifted to where
memory was starting to be more abundant and disk I/O was the
constrained resource.

random past posts related to IO/memory trade-off design points:
http://www.garlic.com/~lynn/93.html#31 Big I/O or Kicking the Mainframe out the Door
http://www.garlic.com/~lynn/95.html#10 Virtual Memory (A return to the past?)
http://www.garlic.com/~lynn/98.html#46 The god old days(???)
http://www.garlic.com/~lynn/99.html#4 IBM S/360
http://www.garlic.com/~lynn/2001l.html#40 MVS History (all parts)
http://www.garlic.com/~lynn/2001l.html#61 MVS History (all parts)
http://www.garlic.com/~lynn/2001m.html#23 Smallest Storage Capacity Hard Disk?
http://www.garlic.com/~lynn/2002.html#5 index searching
http://www.garlic.com/~lynn/2002b.html#11 Microcode? (& index searching)
http://www.garlic.com/~lynn/2002e.html#8 What are some impressive page rates?
http://www.garlic.com/~lynn/2002i.html#16 AS/400 and MVS - clarification please
http://www.garlic.com/~lynn/2003d.html#21 PDP10 and RISC
http://www.garlic.com/~lynn/2003f.html#50 Alpha performance, why?
http://www.garlic.com/~lynn/2003m.html#42 S/360 undocumented instructions?

random past CKD posts:
http://www.garlic.com/~lynn/94.html#35 mainframe CKD disks & PDS files (looong... warning)
http://www.garlic.com/~lynn/2000.html#86 Ux's good points.
http://www.garlic.com/~lynn/2000c.html#34 What level of computer is needed for a computer to Love?
http://www.garlic.com/~lynn/2000f.html#18 OT?
http://www.garlic.com/~lynn/2000f.html#19 OT?
http://www.garlic.com/~lynn/2000g.html#51 > 512 byte disk blocks (was: 4M pages are a bad idea)
http://www.garlic.com/~lynn/2000g.html#52 > 512 byte disk blocks (was: 4M pages are a bad idea)
http://www.garlic.com/~lynn/2001.html#12 Small IBM shops
http://www.garlic.com/~lynn/2001.html#22 Disk caching and file systems. Disk history...people forget
http://www.garlic.com/~lynn/2001.html#54 FBA History Question (was: RE: What's the meaning of track overfl ow?)
http://www.garlic.com/~lynn/2001.html#55 FBA History Question (was: RE: What's the meaning of track overfl ow?)
http://www.garlic.com/~lynn/2001c.html#17 database (or b-tree) page sizes
http://www.garlic.com/~lynn/2001d.html#64 VTOC/VTOC INDEX/VVDS and performance (expansion of VTOC position)
http://www.garlic.com/~lynn/2001f.html#21 Theo Alkema
http://www.garlic.com/~lynn/2001j.html#3 YKYGOW...
http://www.garlic.com/~lynn/2001l.html#40 MVS History (all parts)
http://www.garlic.com/~lynn/2002.html#5 index searching
http://www.garlic.com/~lynn/2002.html#6 index searching
http://www.garlic.com/~lynn/2002.html#10 index searching
http://www.garlic.com/~lynn/2002b.html#1 Microcode? (& index searching)
http://www.garlic.com/~lynn/2002d.html#19 Mainframers: Take back the light (spotlight, that is)
http://www.garlic.com/~lynn/2002e.html#46 What goes into a 3090?
http://www.garlic.com/~lynn/2002f.html#60 Mainframes and "mini-computers"
http://www.garlic.com/~lynn/2002g.html#32 Secure Device Drivers
http://www.garlic.com/~lynn/2002g.html#61 GE 625/635 Reference + Smart Hardware
http://www.garlic.com/~lynn/2002g.html#84 Questions on IBM Model 1630
http://www.garlic.com/~lynn/2002l.html#47 Do any architectures use instruction count instead of timer
http://www.garlic.com/~lynn/2002l.html#49 Do any architectures use instruction count instead of timer
http://www.garlic.com/~lynn/2002q.html#25 Beyond 8+3
http://www.garlic.com/~lynn/2003.html#15 vax6k.openecs.org rebirth
http://www.garlic.com/~lynn/2003b.html#7 Disk drives as commodities. Was Re: Yamhill
http://www.garlic.com/~lynn/2003b.html#22 360/370 disk drives
http://www.garlic.com/~lynn/2003b.html#44 filesystem structure, was tape format (long post)
http://www.garlic.com/~lynn/2003c.html#66 FBA suggestion was Re: "average" DASD Blocksize
http://www.garlic.com/~lynn/2003f.html#9 Alpha performance, why?
http://www.garlic.com/~lynn/2003o.html#47 Funny Micro$oft patent
http://www.garlic.com/~lynn/2003o.html#62 1teraflops cell processor possible?
http://www.garlic.com/~lynn/2003o.html#64 1teraflops cell processor possible?
http://www.garlic.com/~lynn/2004d.html#43 [OT] Microsoft aggressive search plans revealed
http://www.garlic.com/~lynn/2004d.html#63 System/360 40 years old today
http://www.garlic.com/~lynn/2004d.html#65 System/360 40 years old today

Stephen Fuld

unread,
May 13, 2004, 4:17:36 PM5/13/04
to

"glen herrmannsfeldt" <g...@ugcs.caltech.edu> wrote in message
news:fHwoc.78702$Ik.5794512@attbi_s53...

IBM was really behind in the area of true, shared memory (AKA tightly
coupled) multi-processors. (They were ahead in "loosely coupled" processor
complexes). The Univac 1108 had true shared memory multi-processing, with
each CPU capable of doing I/O back in the mid-late 1960s. IBM didn't really
address that problem until the "dyadic" processors (early 80s?).

Stephen Fuld

unread,
May 13, 2004, 4:27:44 PM5/13/04
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:40A2531D...@xemaps.com...
>
>
> Stephen Fuld wrote:
> >
> > "Joe Seigh" <jsei...@xemaps.com> wrote in message
> > news:40A20BFA...@xemaps.com...
> > >
> > >
> > > I meant you don't need to do the actual NMI to talk to the kernel,
> > > not eliminate the kernel.
> >
> > I don't get what you are saying here. No one talked about NMI. The typical
> > kernel call is through some instruction such as Int (PCs) or SVC (IBM
> > mainframes) or ER (Univac descendant mainframes), not an NMI. On the
> > completion side, the external interrupt comes from the hardware and is
> > rarely, if ever, an NMI.
>
> That's what I meant. You don't need to do a synchronous interrupt to
> talk to the kernel. Or asynchronous interrupt for that matter.

You need something that does pretty much what an interrupt does, however,
i.e. change the context (saving some user registers and loading some OS
context registers, changing the addressing mode/processor privilege, etc.)
and then undoing that on the return. That is what those instructions do and
that is the overhead that can be avoided.


> >
> > > Anyway there's nothing new conceptwise
> > > that you're discussing here. It's already been done for decades on
> > > mainframes.


> >
> > Again, I am confused. On all mainframes I know of, (except for the
> > Burroughs descendants, which is a "whole nother thing") the actual I/O
> > instructions (e.g. SIOF) are privileged and can only be executed by the OS.

> > Similarly, on the completion side, the interrupt puts the system in kernel
> > or OS mode; the user only gets control after the OS has responded to the
> > interrupt, checked the status, etc. Thus the OS has to be involved in the
> > I/O operations in order to provide the safety required by most systems. IB
> > has a way of providing that safety without requiring OS intervention on
> > every I/O operation and that results in reduced latency. This cannot be
> > done safely with most existing I/O architectures. And, it does require
> > changing the application software to make use of, so it is not "software
> > transparent".
>
> I was waiting for Lynn Wheeler to come in at this point and give the lecture.
> Basically IBM channel programs could be modified on the fly under certain
> conditions. This allowed you to "queue" i/o requests without having to
> invoke the kernel every time by modifying the channel program. Not 100%
> like RDMA but RDMA isn't 100% new either.

Unfortunately, I know more than I want to about modifying channel programs
on the fly, but the timing is really tricky and is pretty ugly on channel
utilization. Things like IB adapters are quite different from that
particular "technique".

Stephen Fuld

unread,
May 13, 2004, 4:37:52 PM5/13/04
to

"Eric" <eric_p...@sympaticoREMOVE.ca> wrote in message
news:40A28316...@sympaticoREMOVE.ca...
> Stephen Fuld wrote:
> >
> > "Eric" <eric_p...@sympaticoREMOVE.ca> wrote in message
> > news:409FCC99...@sympaticoREMOVE.ca...
> > >
> > > What are the slow "memory" operations you are referring to?
> >
> > It is the loads from memory mapped control registers that actually
> > physically reside on the far side of PCI bridge chip (or equivalent). These
> > can take multi-microseconds each. IB eliminates them by doing everything
> > out of main memory.
>
> I don't know the PCI details. Would this be 2 cycles (addr + data)
> on a 33 mhz bus = 60 ns per uncontested bus access?
> Is there more overhead?

Could easily be. Imagine that the device you want is not directly on the
PCI bus that is immediately accessible from the processor. That is, assume
that it is on the far side of a PCI bridge chip. Now you have to add the
latency of that bridge chip. Then look at replacing a multi-drop bus
(original PCI) with one that only supports one drop (PCI-X or a serial point
to point technology like PCI Express). That results in more use of
bridges/switches so that a device register you want may be on the far side
of several bridges/switches. That is where the increased latency comes
from.

> > > If this is a DMA, then there would be a few device control register
> > > writes to set up the DMA transfer, right? Granted they are much
> > > slower than L1 cache, but I don't see how they can be avoided.
> >
> > You set up a data structure with all the control information in main memory
> > then execute one instruction that points the I/O hardware to that data
> > structure and says "go".
>
> Ok, the I/O channel you mentioned.

Precisely.
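
In other words, something like the following device-agnostic sketch (the
hardware and field layout are entirely hypothetical; the point is that the
descriptor is built with ordinary cached stores and only a single uncached
store crosses the bridge):

    #include <stdint.h>

    /* Hypothetical descriptor, built in ordinary main memory. */
    struct io_descriptor {
        uint64_t buffer_bus_addr;   /* where the data lives (bus address)      */
        uint32_t length;
        uint32_t opcode;            /* e.g. 1 = read, 2 = write                */
        uint64_t next;              /* bus address of next descriptor, 0 = end */
    };

    /* 'doorbell' is a memory-mapped device register, assumed already mapped.
     * A real driver would issue a write barrier before this store so the
     * device never sees a half-written descriptor. */
    static void kick_device(volatile uint32_t *doorbell, uint32_t desc_bus_addr)
    {
        *doorbell = desc_bus_addr;  /* the one "go" access crossing the bridge */
    }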

Anne & Lynn Wheeler

unread,
May 13, 2004, 5:31:06 PM5/13/04
to
"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
> IBM was really behind in the area of true, shared memory (AKA tightly
> coupled) multi-processors. (They were ahead in "loosely coupled" processor
> complexes). The Univac 1108 had true shared memory multi-processing, with
> each CPU capable of doing I/O back in the mid-late 1960s. IBM didn't really
> address that problem until the "dyadic" processors (early 80s?).

you are possibly thinking of two different things. the standard 360
multiprocessors had shared memory but non-shared I/O channels; they
achieved "shared devices" using the same technology they used for
loosely-coupled (i.e. non-shared memory); aka device controllers that
could attach to multiple channels. A typical tightly-coupled, shared
memory configuration tended to have controllers configured so that
devices appeared at the same channel address on the different
processors.

there was a big distinction/deal made about multiprocessors that they
could be divided and run as totally independent uniprocessors.

the exception to non-shared channels was 360/67 multiprocessor (and
some of the special stuff for FAA air traffic control system). The
standard 360/67 multiprocessor had a channel controller and other RAS
features ... which allowed configuring memory boxes and channels
... as shared or non-shared; aka a 360/67 multiprocessor could be
configured so that all processors had access to all channels.

370s continued the standard 360 multiprocessor convention of being
able to partition the hardware and run as independent uniprocessors as
well as non-shared channels ... that typically had controllers
configured so that devices appeared at the same i/o addresses on the
different processors. Later on, 370 introduced a "cheaper"
multiprocessor called an "Attached" processor ... it was a second
(shared-memory) processor that didn't have any channels at all.

3081 introduced the dyadic ... it was a two-processor shared memory
box that couldn't be partitioned to operate as two independent
uniprocessors and the channels could be configured as either shared or
non-shared (something that hadn't been seen since the 360/67). The
3081 was never intended to be made available as a uniprocessor. It was
however, possible to combine two 3081s into a four-processor 3084 (and
3084 could be partitioned to operate as two 3081s). Somewhere along
the way ... I believe primarily for the "TPF" market ... a less
expensive, "faster", single processor was made available called a 3083
(part of the issue was it couldn't be a whole lot less expensive than
the 3081 ... since the 3081 didn't have a lot of fully replicated
infrastructure ... so going to a single processor 3083 was still a lot
more than half a two processor 3081).

The two-processor cache-machine 370s ... and carried into the 3081,
ran the caches ten percent slower in multiprocessor mode than in
uniprocessor mode. This was to accommodate the cross-cache chatter
having to do with keeping a strongly consistent memory model. While
the 3083 uniprocessor couldn't actually cut the 3081 hardware (&
costs) in half ... it could run the cache nearly 15 percent faster
(than the 3081 caches).

Note TPF was the follow-on to the airline control program operating
system ... originally developed for airline res. systems ... but by
the time of the 3081 it was also being used in a number of high
transaction financial network applications. While TPF had support for
loosely-coupled (non-shared memory multiprocessing ... or clustering),
at the time, it didn't yet have support for tightly-coupled,
shared-memory multiprocessing ... and many customers were running
processors at saturation during peak loads ... and they could use all
the processing cycles that they could get ahold of.

Some number of the TPF customers would run VM supporting 3081
multiprocessing and run two copies of TPF in different virtual
machines (each getting a 3081 processor) and coordinate their activity
with the loosely-coupled protocol support (shared i/o devices and
various message passing).

somewhat aside/drift, charlie's work at the science center
http://www.garlic.com/~lynn/subtopic.html#545tech
on fine-grain (kernel) locking for the cp/67 kernel running on 360/67
multiprocessing resulted in the compare&swap instruction (CAS are
charlie's initials, the selection of name compare&swap was so the
mnemonic would match charlie's initials) ... which first appeared in
370s over thirty years ago ... random smp posts:
http://www.garlic.com/~lynn/subtopic.html#smp
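
for readers who haven't met it, the compare-and-swap idiom looks like this
(shown with C11 atomics rather than 370 assembler; just an illustration of
the primitive, not the original cp/67 code):

    #include <stdatomic.h>

    static atomic_int shared_count;

    /* Read the old value, compute the new one, and store it only if nobody
     * else updated the word in between; otherwise retry with the value the
     * failed compare-and-swap handed back. */
    void increment(void)
    {
        int old = atomic_load(&shared_count);
        while (!atomic_compare_exchange_weak(&shared_count, &old, old + 1))
            ;   /* 'old' was refreshed by the failed exchange; try again */
    }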

numerous past postings regarding tpf, acp, &/or 3083:
http://www.garlic.com/~lynn/96.html#29 Mainframes & Unix
http://www.garlic.com/~lynn/99.html#100 Why won't the AS/400 die? Or, It's 1999 why do I have to learn how to use
http://www.garlic.com/~lynn/99.html#103 IBM 9020 computers used by FAA (was Re: EPO stories (was: HELP IT'S HOT!!!!!))
http://www.garlic.com/~lynn/99.html#136a checks (was S/390 on PowerPC?)
http://www.garlic.com/~lynn/99.html#152 Uptime (was Re: Q: S/390 on PowerPC?)
http://www.garlic.com/~lynn/2000.html#31 Computer of the century
http://www.garlic.com/~lynn/2000b.html#20 How many Megaflops and when?
http://www.garlic.com/~lynn/2000b.html#61 VM (not VMS or Virtual Machine, the IBM sort)
http://www.garlic.com/~lynn/2000b.html#65 oddly portable machines
http://www.garlic.com/~lynn/2000c.html#60 Disincentives for MVS & future of MVS systems programmers
http://www.garlic.com/~lynn/2000d.html#9 4341 was "Is a VAX a mainframe?"
http://www.garlic.com/~lynn/2000e.html#21 Competitors to SABRE? Big Iron
http://www.garlic.com/~lynn/2000e.html#22 Is a VAX a mainframe?
http://www.garlic.com/~lynn/2000f.html#20 Competitors to SABRE?
http://www.garlic.com/~lynn/2000f.html#69 TSS ancient history, was X86 ultimate CISC? designs)
http://www.garlic.com/~lynn/2001b.html#37 John Mashey's greatest hits
http://www.garlic.com/~lynn/2001c.html#13 LINUS for S/390
http://www.garlic.com/~lynn/2001e.html#2 Block oriented I/O over IP
http://www.garlic.com/~lynn/2001g.html#35 Did AT&T offer Unix to Digital Equipment in the 70s?
http://www.garlic.com/~lynn/2001g.html#45 Did AT&T offer Unix to Digital Equipment in the 70s?
http://www.garlic.com/~lynn/2001g.html#46 The Alpha/IA64 Hybrid
http://www.garlic.com/~lynn/2001g.html#47 The Alpha/IA64 Hybrid
http://www.garlic.com/~lynn/2001g.html#49 Did AT&T offer Unix to Digital Equipment in the 70s?
http://www.garlic.com/~lynn/2001j.html#17 I hate Compaq
http://www.garlic.com/~lynn/2001n.html#0 TSS/360
http://www.garlic.com/~lynn/2002c.html#9 IBM Doesn't Make Small MP's Anymore


http://www.garlic.com/~lynn/2002f.html#60 Mainframes and "mini-computers"

http://www.garlic.com/~lynn/2002g.html#2 Computers in Science Fiction
http://www.garlic.com/~lynn/2002g.html#3 Why are Mainframe Computers really still in use at all?
http://www.garlic.com/~lynn/2002h.html#43 IBM doing anything for 50th Anniv?
http://www.garlic.com/~lynn/2002i.html#63 Hercules and System/390 - do we need it?
http://www.garlic.com/~lynn/2002i.html#83 HONE
http://www.garlic.com/~lynn/2002m.html#67 Tweaking old computers?
http://www.garlic.com/~lynn/2002n.html#29 why does wait state exist?
http://www.garlic.com/~lynn/2002o.html#28 TPF
http://www.garlic.com/~lynn/2002p.html#58 AMP vs SMP
http://www.garlic.com/~lynn/2003.html#48 InfiniBand Group Sharply, Evenly Divided
http://www.garlic.com/~lynn/2003g.html#30 One Processor is bad?
http://www.garlic.com/~lynn/2003g.html#32 One Processor is bad?
http://www.garlic.com/~lynn/2003g.html#37 Lisp Machines
http://www.garlic.com/~lynn/2003j.html#2 Fix the shuttle or fly it unmanned
http://www.garlic.com/~lynn/2003n.html#47 What makes a mainframe a mainframe?
http://www.garlic.com/~lynn/2003p.html#45 Saturation Design Point
http://www.garlic.com/~lynn/2004.html#7 Dyadic
http://www.garlic.com/~lynn/2004.html#24 40th anniversary of IBM System/360 on 7 Apr 2004
http://www.garlic.com/~lynn/2004.html#49 Mainframe not a good architecture for interactive workloads
http://www.garlic.com/~lynn/2004.html#50 Mainframe not a good architecture for interactive workloads
http://www.garlic.com/~lynn/2004b.html#6 Mainframe not a good architecture for interactive workloads
http://www.garlic.com/~lynn/2004b.html#7 Mainframe not a good architecture for interactive workloads
http://www.garlic.com/~lynn/2004c.html#35 Computer-oriented license plates

Anton Rang

unread,
May 13, 2004, 6:49:04 PM5/13/04
to
Eric <eric_p...@sympaticoREMOVE.ca> writes:
> Stephen Fuld wrote:
> >
> > "Eric" <eric_p...@sympaticoREMOVE.ca> wrote in message
> > news:409FCC99...@sympaticoREMOVE.ca...
> > >
> > > What are the slow "memory" operations you are referring to?
> >
> > It is the loads from memory mapped control registers that actually
> > physically reside on the far side of PCI bridge chip (or equivalent). These
> > can take multi-microseconds each. IB eliminates them by doing everything
> > out of main memory.
>
> I don't know the PCI details. Would this be 2 cycles (addr + data)
> on a 33 mhz bus = 60 ns per uncontested bus access?
> Is there more overhead?

Yes, there's more overhead.

If your processor bus isn't running at 33 MHz ;-), you need to add in
the time to synchronize to the beginning of a PCI cycle.

The actual PCI transaction includes "turnaround cycles" which are used
when the direction of a bus line changes. A read transaction (as is
used for reading the control register, above) is then:

1: transaction start
2: ADDRESS
3: turnaround
4: DATA
5: idle (doesn't count against this transaction)

So you've got a minimum latency of 4 PCI cycles, with throughput of 5
cycles. (And an average of around 0.5 PCI cycles for the initial
synchronization, so figure 4.5 PCI cycles.) That's before figuring in
the processor bus itself, or any delay in the bridge.
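
In round numbers: at 33 MHz a PCI cycle is about 30 ns, so those 4.5 cycles
alone come to roughly 135 ns before any bridge or host-bus time is added.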

This assumes that the target can respond immediately from its control
registers. It's not uncommon to see an extra cycle or so of delay.

Note that writes are cheaper; they don't require a turnaround cycle,
and from a processor's point of view they can be acknowledged by the
processor-to-PCI bridge rather than waiting for the PCI bus. (The
disadvantage of doing this is that a failed write can't be
pinpointed.) Lots of drivers only write to their devices because of
this....

Incidentally, you keep mentioning
> [ ... ] a PCI-32 device.

I assume that you mean a 32-bit PCI device which does not support DAC,
as opposed to a 32-bit PCI device which *does* support DAC?

-- Anton

Paul Repacholi

unread,
May 13, 2004, 4:49:28 PM5/13/04
to
Eric <eric_p...@sympaticoREMOVE.ca> writes:

> I was looking for a general solution for PC devices. I don't see
> anything in what you are saying that suggests that PCs should
> NOT have bus s-g hardware just like other systems.

The two general methods I've come across are:

Double map the IO buffer into a contiguous block of kernel address
space and do the IO from there.

Have TLBs in the adapter and do IO with the user page tables. The
adapter can walk the tables and reload the TLBs as needed, but cannot
access outside what the page tables map to without outside
assistance. This is pretty much what the DEC CI controllers do.

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.

Jan Vorbrüggen

unread,
May 14, 2004, 3:22:28 AM5/14/04
to
> This is how I would do it. The OS support is already present in both
> WNT and Linux.

But many - most? - of the drivers aren't supplied with the OS, they are
supplied with the hardware and are written by the manufacturers of said
hardware. Are they willing and even capable of adapting their drivers to
new hardware that supports scatter-gather? Colour me sceptical.

[physically contiguous buffers]


> I don't think I said this, though others did. Yes one could but it,
> of course, has all sorts of nasty mem management considerations too.

Sure, that was the case when the systems that required it (e.g., the
MicroVAX I) were the smallest at the time, when even the largest would
be called memory-poor compared to any system of today. Given today's
memory sizes, I'd say it would be only a minor hassle, transparently
handled by the OS for all drivers.

Jan

Nick Maclaren

unread,
May 14, 2004, 3:39:21 AM5/14/04
to

In article <AJQoc.1232$hH.2...@bgtnsc04-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
|>
|> IBM was really behind in the area of true, shared memory (AKA tightly
|> coupled) multi-processors. (They were ahead in "loosely coupled" processor
|> complexes). The Univac 1108 had true shared memory multi-processing, with
|> each CPU capable of doing I/O back in the mid-late 1960s. IBM didn't really
|> address that problem until the "dyadic" processors (early 80s?).

About then, yes.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
May 14, 2004, 3:42:17 AM5/14/04
to

In article <4TQoc.52394$Ut1.1...@bgtnsc05-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
|>
|> > That's what I meant. You don't need to do a synchronous interrupt to
|> > talk to the kernel. Or asynchronous interrupt for that matter.
|>
|> You need something that does pretty much what an interrupt does however.
|> i.e. change the context (saving some user registers and loading some OS
|> context registers, changing the addressing mode/processor privilege, etc.)
|> then undoing that on the return. That is what those instructions do and
|> that is the overhead that can be avoided.

Only when you are using a synchronous (single-CPU) design. On a
genuinely asynchronous threaded design, you can simply queue an
action for a kernel thread to execute. But I know of no current
general purpose system that works that way. I shall try to see
if I can catch Marc Tremblay on that ....


Regards,
Nick Maclaren.

Joe Seigh

unread,
May 14, 2004, 5:45:05 AM5/14/04
to

Nick Maclaren wrote:
>
>
> Only when you are using a synchronous (single-CPU) design. On a
> genuinely asynchronous threaded design, you can simply queue an
> action for a kernel thread to execute. But I know of no current
> general purpose system that works that way. I shall try to see
> if I can catch Marc Tremblay on that ....
>

I always thought it rather strange that, since both the POSIX asynchronous
i/o and Windows IOCP APIs would allow such an implementation in both
directions, user to kernel and kernel to user, nobody appears to
have done such an implementation. Especially since cutting out the
overhead of the context switching gives you rather dramatic performance
improvements.
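
For reference, the POSIX side of that interface is small. A minimal sketch
(link with -lrt on Linux; the file name is just a placeholder):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        struct aiocb cb;
        int fd = open("/tmp/example", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* Do other work here; the read proceeds asynchronously. Nothing in
         * this API dictates how the kernel is told about the request, which
         * is why a queue-based, switch-free implementation would fit it. */
        while (aio_error(&cb) == EINPROGRESS)
            ;   /* a real program would block in aio_suspend() instead */

        printf("read %zd bytes\n", (ssize_t)aio_return(&cb));
        close(fd);
        return 0;
    }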

Joe Seigh

Eric

unread,
May 14, 2004, 10:04:21 AM5/14/04
to

Without a bridge, we are still in the 150 ns range, about the
same as main memory. The potential for a bridge does make this
delay open ended though.

>
> Note that writes are cheaper; they don't require a turnaround cycle,
> and from a processor's point of view they can be acknowledged by the
> processor-to-PCI bridge rather than waiting for the PCI bus. (The
> disadvantage of doing this is that a failed write can't be
> pinpointed.) Lots of drivers only write to their devices because of
> this....

I heard something on this some time ago. It seemed that certain
graphics drivers avoided reading their command fifo status but just
blindly wrote to it. If the fifo was full, it caused a bus wait and
stalled the whole system, but improved *their* product performance.

>
> Incidentally, you keep mentioning
> > [ ... ] a PCI-32 device.
>
> I assume that you mean a 32-bit PCI device which does not support DAC,
> as opposed to a 32-bit PCI device which *does* support DAC?

Yeah, I was trying to anticipate all the marketing depts.

>
> -- Anton

